
FluxPipeline silently rounds the generated image shape #9904

Open
albertochimentiinbibo opened this issue Nov 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@albertochimentiinbibo

albertochimentiinbibo commented Nov 11, 2024

Describe the bug

When asking the FluxPipeline class to generate an image with shape (1920, 1080), the output image shape is silently rounded to (1920, 1072), which appears to be the nearest multiple of 16 rather than 8.
Since the FluxPipeline class accepts input sizes divisible by 8, I would expect them to remain consistent throughout the generation process.

A quick look at the code suggests that in the FluxPipeline._unpack_latents method, the height and width are floor-divided (//) by the vae_scale_factor, which is 16.

I would love to understand why the scale factor is set as follows:
https://github.com/huggingface/diffusers/blob/89e4d6219805975bd7d253a267e1951badc9f1c0/src/diffusers/pipelines/flux/pipeline_flux.py#L197C9-L199C10
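The rounding I observed can be reproduced with simple arithmetic. This is only a sketch of the behavior, not the actual diffusers code; `rounded_dim` is a hypothetical helper:

```python
def rounded_dim(dim: int, scale_factor: int = 16) -> int:
    # Hypothetical helper: floor-divide by the effective scale factor,
    # then scale back up, as _unpack_latents appears to do.
    return (dim // scale_factor) * scale_factor

print(rounded_dim(1080))  # 1072 -- rounded down to a multiple of 16
print(rounded_dim(1920))  # 1920 -- already a multiple of 16, unchanged
```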

Reproduction

Here is minimal code to reproduce the bug; feel free to change the number of inference steps, as it should not affect the outcome of the test.

from diffusers.pipelines import FluxPipeline
import torch

bf_repo = "black-forest-labs/FLUX.1-dev"

prompt = "Astronaut drinking coffee on the moon."
shape = (1920, 1080)

pipe = FluxPipeline.from_pretrained(bf_repo, torch_dtype=torch.bfloat16)

pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt,
    height=shape[1],
    width=shape[0],
    num_inference_steps=28,
    generator=torch.Generator('cpu').manual_seed(123)
    ).images[0]

print(f"Prompted shape: {shape}")
print(f"Generated shape: {image.size}")
image.show()

Logs

Prompted shape: (1920, 1080)
Generated shape: (1920, 1072)

System Info

  • 🤗 Diffusers version: 0.31.0
  • Platform: Windows-10-10.0.22631-SP0
  • Running on Google Colab?: No
  • Python version: 3.10.6
  • PyTorch version (GPU?): 2.4.1+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.26.2
  • Transformers version: 4.44.2
  • Accelerate version: 0.34.2
  • PEFT version: not installed
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.5
  • xFormers version: not installed
  • Accelerator: NVIDIA GeForce RTX 4070 Laptop GPU, 8188 MiB
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sayakpaul @DN6

@albertochimentiinbibo albertochimentiinbibo added the bug Something isn't working label Nov 11, 2024
@albertochimentiinbibo
Author

For the record, on the main branch (commit dac623b) the same script provided above errors with the following traceback:

Traceback (most recent call last):
  File "C:\dev\github-issues\diffusers\test.py", line 13, in <module>
    image = pipe(
  File "...\torch\2.4.1+cu124\python\3.10.6\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "...\diffusers\pipelines\flux\pipeline_flux.py", line 684, in __call__
    latents, latent_image_ids = self.prepare_latents(
  File "...\diffusers\pipelines\flux\pipeline_flux.py", line 520, in prepare_latents
    latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)
  File "...\diffusers\pipelines\flux\pipeline_flux.py", line 444, in _pack_latents
    latents = latents.view(batch_size, num_channels_latents, height // 2, 2, width // 2, 2)
RuntimeError: shape '[1, 16, 67, 2, 120, 2]' is invalid for input of size 518400

This happens for all sizes that are not multiples of 16.
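For reference, the size mismatch in the traceback can be reproduced with plain arithmetic. This is a sketch under the assumption that the VAE compresses spatially by 8 and that _pack_latents views the latent as 2x2 patches:

```python
# Why the view in _pack_latents fails for a 1920x1080 request:
# the VAE compresses by 8, so the latent is 135 x 240, and packing
# into 2x2 patches requires both latent dims to be even (i.e. pixel
# dims divisible by 16).
batch, channels = 1, 16
lat_h, lat_w = 1080 // 8, 1920 // 8            # 135, 240 -- lat_h is odd

numel = batch * channels * lat_h * lat_w        # actual element count
packed = batch * channels * (lat_h // 2) * 2 * (lat_w // 2) * 2  # what view() needs

print(numel, packed, numel == packed)  # 518400 514560 False
```

The 518400 here matches the "input of size 518400" in the RuntimeError, while the requested view shape [1, 16, 67, 2, 120, 2] only accounts for 514560 elements.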

Checking the recent PRs, I found this one, which seems to change what I was pointing out in the comment above.

@sayakpaul
Member

Cc: @yiyixuxu @DN6

@DN6
Collaborator

DN6 commented Nov 13, 2024

Hi @albertochimentiinbibo, thanks for catching this. You're right: the FluxPipeline is meant to work with image sizes that are multiples of 16.

The vae_scale_factor reflects how much the VAE spatially compresses an image into a latent. In Flux's case it is 8 (a 1024x1024 image is turned into a 128x128 latent). PR #9711 made the usage of the scale factor clearer, but I think we missed that in the previous version the division by 16 was intended to account for the latent height and width needing to be divisible by 2, since the packing step breaks the latent up into 2x2 patches.

I'll open a PR with a fix. We'll also raise a warning that the image will be resized to a compatible height and width.
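A minimal sketch of what such a fix might look like; `adjust_dims` is a hypothetical helper, not the actual PR:

```python
import warnings

def adjust_dims(height: int, width: int, vae_scale_factor: int = 8) -> tuple[int, int]:
    # Requested pixel dims must be multiples of 2 * vae_scale_factor
    # (= 16 for Flux) because the packing step uses 2x2 latent patches.
    multiple = 2 * vae_scale_factor
    new_h = height // multiple * multiple
    new_w = width // multiple * multiple
    if (new_h, new_w) != (height, width):
        warnings.warn(
            f"Height/width {height}x{width} not divisible by {multiple}; "
            f"resizing to {new_h}x{new_w}."
        )
    return new_h, new_w

print(adjust_dims(1080, 1920))  # (1072, 1920), with a warning
```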

@albertochimentiinbibo
Author

Thank you for the feedback @DN6, I'll be waiting for the fix!
