Change image dimensions requirement for DiT models #742

Merged
merged 1 commit into from
Jul 28, 2025

Conversation

stduhpf
Contributor

@stduhpf stduhpf commented Jul 26, 2025

Currently, the CLI requires the image dimensions to be a multiple of 64. This is because the UNet architecture of Stable Diffusion expects the latent image to be divisible by 8 in each dimension (8 latent pixels correspond to 64 image pixels), but the supported DiT models like SD3 and Flux don't have the same requirement.

With Flux and SD3.x, the image dimensions need to be divisible by 8 for the VAE, but the transformer can then run without crashing on latent images of arbitrary size. However, I've noticed that the images look very broken if the latent dimensions are odd (which would also break VAE tiling), so I added a requirement for the dimensions to be a multiple of 16. (It looks like a positional encoding issue that could be fixable?)

As there is no way to know the architecture of the model from the CLI, I had to move the check into the generate_image function in stable-diffusion.cpp. The downside of doing so is that the model has to be loaded (which can take some time) when the sd_ctx is created, before we can check whether the image dimensions are valid.
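For illustration, an architecture-dependent check along these lines could look like the sketch below. This is not the code merged in this PR: `ModelArch`, `required_multiple` and `check_dimensions` are hypothetical names, and the real project identifies the model architecture in its own way.

```cpp
#include <string>

// Hypothetical architecture tag, used only for this sketch.
enum class ModelArch { UNet, DiT };

// Required multiple (in pixels) for width and height:
// 64 for UNet models, 16 for DiT models (8 px per latent, even latent size).
inline int required_multiple(ModelArch arch) {
    return arch == ModelArch::DiT ? 16 : 64;
}

// Returns an empty string when the size is acceptable,
// otherwise a short error message.
inline std::string check_dimensions(ModelArch arch, int width, int height) {
    if (width <= 0 || height <= 0) {
        return "width and height must be positive";
    }
    const int m = required_multiple(arch);
    if (width % m != 0 || height % m != 0) {
        return "width and height must be multiples of " + std::to_string(m);
    }
    return "";
}
```

With this sketch, 1008x720 would be accepted for a DiT model but rejected for a UNet model, since 1008 is a multiple of 16 but not of 64.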

@wbruna
Contributor

wbruna commented Jul 26, 2025

Since 64 is a multiple of 16, perhaps it'd be better to keep the test for 16 at the command line validation, and only check for 64 for non-DiT models?

Or maybe rounding instead of validating could be a bit more user-friendly.

@stduhpf
Contributor Author

stduhpf commented Jul 26, 2025

> Since 64 is a multiple of 16, perhaps it'd be better to keep the test for 16 at the command line validation, and only check for 64 for non-DiT models?

Maybe adding the test for 16 in the CLI is better, but I think it should be kept in the generate function anyway, as sd.cpp is supposed to be usable as a library and is not just a CLI program.

> Or maybe rounding instead of validating could be a bit more user-friendly.

I thought about something like that, but on the other hand it messes with the user's input, which could cause some problems. Plus there's always the question: how should we round? Up, down, or to the nearest multiple?
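To make the three options concrete, here is a minimal sketch of rounding a positive dimension to a multiple. `RoundMode` and `round_to_multiple` are hypothetical names for this example only, and breaking ties upward in `Nearest` mode is an arbitrary choice.

```cpp
// Hypothetical rounding modes, used only for this sketch.
enum class RoundMode { Down, Up, Nearest };

// Rounds a positive value to a multiple of `multiple`.
// Note: Down can return 0 when value < multiple, so a caller
// would still need to reject or clamp that case.
inline int round_to_multiple(int value, int multiple, RoundMode mode) {
    const int down = (value / multiple) * multiple;
    const int up = (value % multiple == 0) ? down : down + multiple;
    switch (mode) {
        case RoundMode::Down:
            return down;
        case RoundMode::Up:
            return up;
        case RoundMode::Nearest:
            // Ties go up (arbitrary choice for this sketch).
            return (value - down < up - value) ? down : up;
    }
    return down;
}
```

For a requested width of 1000 and a multiple of 16, rounding down gives 992 and rounding up gives 1008; 1000 sits exactly halfway, so the "nearest" result depends entirely on the tie-breaking rule.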

On a related note, if we're rounding anyway, we might as well add padding around input images in img2img mode to match the rounding.
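As a rough sketch of that padding idea, assuming the dimensions are rounded up and the input image is centered in the larger canvas (both are assumptions, and `Padding`, `round_up` and `padding_for` are hypothetical names):

```cpp
// Padding, in pixels, to add on each side of the input image.
struct Padding {
    int left;
    int right;
    int top;
    int bottom;
};

// Rounds a positive value up to the next multiple of `multiple`.
inline int round_up(int value, int multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}

// Computes the padding needed to center the image in a canvas whose
// dimensions are rounded up to a multiple of `multiple`. When the extra
// space is odd, the additional pixel goes to the right/bottom side.
inline Padding padding_for(int width, int height, int multiple) {
    const int pad_w = round_up(width, multiple) - width;
    const int pad_h = round_up(height, multiple) - height;
    return Padding{pad_w / 2, pad_w - pad_w / 2, pad_h / 2, pad_h - pad_h / 2};
}
```

For a 1000x720 input and a multiple of 16, this yields 4 pixels of padding on the left and right and none on the top and bottom.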

@wbruna
Contributor

wbruna commented Jul 26, 2025

> Maybe adding the test for 16 in the cli is better, but I think it should be kept in the generate function anyways, as sd.cpp is supposed to be able to be used as a library and is not just a CLI program.

True.

> How should we round? Up, down or nearest?

Or round to the nearest aspect ratio, like I've implemented for Koboldcpp:
https://github.com/LostRuins/koboldcpp/blob/concedo/otherarch/sdcpp/sdtype_adapter.cpp#L352

> On a related note if we're rounding anyways, we might as well add padding around input images in img2img mode to match the rounding.

It would make sense, yeah.

I also miss an option to preserve the input image aspect ratio while resizing (but cropping instead of padding). A way to adjust the image placement for outpainting would be very useful, too... Far too complex for this same PR, of course :-)

@leejet leejet merged commit 59080d3 into leejet:master Jul 28, 2025
9 checks passed
@leejet
Owner

leejet commented Jul 28, 2025

Thank you for your contribution.
