Change image dimensions requirement for DiT models #742
Currently, the CLI requires the image dimensions to be a multiple of 64. This is because the UNet architecture of Stable Diffusion expects the latent image dimensions to be divisible by 8 (8 latent pixels correspond to 64 image pixels), but the supported DiT models like SD3 and Flux don't share this requirement.
With Flux and SD3.x, the image dimensions still need to be divisible by 8 for the VAE, but the transformer can then run without crashing on latent images of arbitrary size. However, I've noticed that the images look very broken when the latent dimensions are odd (and odd dimensions would also break VAE tiling), so I added a requirement that the image dimensions be a multiple of 16. (This looks like a positional-encoding issue that might be fixable?)
Since there is no way to know the model architecture from the CLI, I had to move the check into the generate_image function in stable-diffusion.cpp. The downside is that the model has to be loaded (which can take some time) when the sd_ctx is created, before we can verify that the image dimensions are valid.
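The per-architecture check described above could look roughly like this sketch (the enum, function name, and error message are hypothetical, not the actual PR code): DiT models (SD3.x, Flux) only need dimensions divisible by 16, while UNet models keep the original multiple-of-64 requirement.

```cpp
#include <cstdio>

// Hypothetical architecture tag; the real code would derive this from
// the loaded model rather than from the CLI.
enum class DiffusionArch { UNET, DIT };

// Validate requested image dimensions against the architecture's
// divisibility requirement: 16 for DiT models, 64 for UNet models.
bool check_image_dims(DiffusionArch arch, int width, int height) {
    const int multiple = (arch == DiffusionArch::DIT) ? 16 : 64;
    if (width % multiple != 0 || height % multiple != 0) {
        std::fprintf(stderr,
                     "error: width and height must be multiples of %d\n",
                     multiple);
        return false;
    }
    return true;
}
```

Because this runs inside generate_image, it can only report the error after model loading, which is exactly the downside mentioned above.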