[CI] Update Bagel Pixels #4081

Merged
Gaohan123 merged 7 commits into vllm-project:main from alex-jw-brooks:enable_bagel_test
Jun 4, 2026

Conversation

@alex-jw-brooks

@alex-jw-brooks alex-jw-brooks commented Jun 2, 2026

Collaborator

Purpose

Fix #3977

Re-enables the Bagel tests that were failing in CI due to incorrect handling in batched CFG. The Lance PR fixed the correctness of the CFG output, but it also introduced two changes that alter the output, so we need to update the reference pixels.

  1. Initialization changes, i.e., added _regen_init_noise_on_device to the pipeline. This is the main reason the output changes a lot.

  2. Correction in number of timesteps

        timesteps = torch.linspace(1, 0, num_timesteps, device=x_t.device)

was changed to add one more timestep

        timesteps = torch.linspace(1, 0, num_timesteps + 1, device=x_t.device)
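
To make the effect of the extra point concrete, here is a minimal sketch in plain Python. The `linspace` helper is a stand-in for `torch.linspace`, and `denoising_steps` is a hypothetical name; it assumes the sampling loop drops the trailing endpoint, as the Bagel snippet quoted later in this thread does with `timesteps[:-1]`.

```python
def linspace(start: float, stop: float, num: int) -> list[float]:
    """Pure-Python stand-in for torch.linspace(start, stop, num), for num >= 2."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def denoising_steps(num_timesteps: int, plus_one: bool) -> list[float]:
    """Timesteps that are actually stepped through once the final endpoint is dropped."""
    num_points = num_timesteps + 1 if plus_one else num_timesteps
    return linspace(1.0, 0.0, num_points)[:-1]


old_steps = denoising_steps(50, plus_one=False)  # before the fix
new_steps = denoising_steps(50, plus_one=True)   # after the fix

# The old schedule takes one step fewer than requested.
print(len(old_steps), len(new_steps))
```

With the old call, asking for N timesteps yields N - 1 denoising steps; with the `+ 1`, it yields exactly N, and every intermediate timestep value shifts slightly as a result.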

As a result, the reference image on CUDA seems to have changed from the left one to the right one:
[Images: bagel_ref_text2img_main (left), bagel_ref_text2img_enable_bagel_test (right)]

For img2img, it's less dramatic looking, but there are changes as well. You can run the first commit in this PR (which reverts the fixes from the Lance PR) to see the tests pass with the old values as a confidence check.

@Gaohan123 @lishunyang12 @zhangj1an can you please take a look? I will open a separate PR to add the batched CFG path back, but I think it's better to do in separate PRs since the current behavior is actually correct, and generate_image is pretty messy

Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
@alex-jw-brooks alex-jw-brooks requested a review from yenuo26 as a code owner June 2, 2026 20:35
@chatgpt-codex-connector


Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@alex-jw-brooks alex-jw-brooks changed the title from "[CR] Update Bagel Pixels" to "[CI] Update Bagel Pixels" Jun 2, 2026
]

if current_omni_platform.is_rocm():
    REFERENCE_PIXELS = [

Collaborator Author

Not sure why rocm pixels are redefined here, but since they were exactly the same, I deleted them. Unfortunately don't have a rocm device to test the new values on 😅

@hsliuustc0106

Collaborator

cc @princepride @natureofnature

@Gaohan123 Gaohan123 added this to the v0.22.0 milestone Jun 3, 2026
@Gaohan123 Gaohan123 added the "high priority" label (high priority issue, needs to be done asap) Jun 3, 2026
@princepride

Collaborator

Interesting, have you compared the result with the original code?

@alex-jw-brooks

alex-jw-brooks commented Jun 3, 2026

Collaborator Author

@princepride yup! The PR is split into two commits so it's easier to compare, since I wanted to make sure the timestep fix and the initialization change were the only reasons the pixels changed. The first one (e840a24c4df9f57fbeeb6a73d7c6b895f0e23d1a) reverts the Lance fixes to show that the tests pass with the old values. To verify, I ran the shared memory connector tests.

# on e840a24c4df9f57fbeeb6a73d7c6b895f0e23d1a
pytest tests/distributed/omni_connectors/test_bagel_shared_memory_connector.py -v -s --run-level advanced_model -s

========================== 2 passed, 21 warnings in 226.36s (0:03:46) ==========================

The second commit brings the Lance fixes back and updates the pixel values for text2img/img2img to match what the tests currently produce, so the tests pass on CUDA with the new values.

@Gaohan123

Collaborator

> @princepride yup! The PR is split into two commits so it's easier to compare since I wanted to make sure the timestep fix and initialization were the only reasons the pixels changed. The first one (e840a24c4df9f57fbeeb6a73d7c6b895f0e23d1a) reverts the Lance fixes to show that things will pass with the old values. To verify, I had run the shared memory connector tests.
>
> # on e840a24c4df9f57fbeeb6a73d7c6b895f0e23d1a
> pytest tests/distributed/omni_connectors/test_bagel_shared_memory_connector.py -v -s --run-level advanced_model -s
>
> ========================== 2 passed, 21 warnings in 226.36s (0:03:46) ==========================
>
> The second commit brings the Lance fixes back and updates the pixel values for tti/i2i to match what the tests currently produce, so passes on cuda with new values

May I ask what the Lance fixes are?

@princepride princepride added and removed the "ready" label (label to trigger buildkite CI) Jun 3, 2026
@princepride

Collaborator

Apologies for the delayed response as I've been quite busy lately.😂 I took a closer look, and it seems a previous code change caused the pixel values to change. Why are we modifying the pixel values in this PR instead of just reverting that previous change directly? I checked the original code in the bagel repository, and the timesteps calculation is exactly the same as before.

@zhangj1an

Contributor

I think previously bagel was using batched CFG, then the lance PR switched bagel to use sequential CFG (because the lance model re-used bagel as part of its model structure). Alex will bring back batched CFG in #4098.

I will finish reviewing #4098 and this PR tomorrow, and also check whether it is possible to avoid changing the reference image pixels. In my attempt, I generated a cat image similar to the first image in the PR description, as shown below, so maybe it is possible (I'm not sure yet).

[Images: text2img_cute_cat_BEFORE_fix (main branch), text2img_cute_cat_AFTER_fix (my personal)]

@alex-jw-brooks

alex-jw-brooks commented Jun 3, 2026

Collaborator Author

Hi @zhangj1an, thanks! That is actually a stale output. On the current main, you should get a good output since it's calling CFG sequentially, it's just not the same one. Here is a repro script:

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(
        model="ByteDance-Seed/BAGEL-7B-MoT",
        enforce_eager=True,
    )

    formatted_prompt = {
        "prompt": "<|im_start|>A cute cat<|im_end|>",
        "modalities": ["image"],
    }

    omni_outputs = list(omni.generate(prompts=[formatted_prompt], sampling_params_list=omni.default_sampling_params_list))
    omni_outputs[1].images[0].save("output_main.png")

You should get something like this:
output_main

The main reason for the large change in output is that the Lance PR added this change, so it's now regenerating the packed_init_noises on the device. If you comment out the call to this, you'll get something very similar to the expected result.

output_e840a24c4df9f57fbeeb6a73d7c6b895f0e23d1a_no_regen

The result may be off by a little though, because the Lance PR also fixed an off-by-one error in the timesteps. I.e., from the original Bagel code here:

        timesteps = torch.linspace(1, 0, num_timesteps, device=x_t.device)
        timesteps = timestep_shift * timesteps / (1 + (timestep_shift - 1) * timesteps)
        dts =  timesteps[:-1] - timesteps[1:]
        timesteps = timesteps[:-1] # will have num_timesteps - 1  elements

Lance has identical timestep creation with the +1 fix here, so Bagel picked up the same fix while Lance was being ported to Omni, which also shifts the values a bit.
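
As a rough illustration of how the shifted schedule behaves with and without the extra point, the following plain-Python sketch re-implements the quoted snippet without torch. The helper name `shifted_schedule` and the `shift = 3.0` value are assumptions chosen for illustration; they are not taken from the Bagel or Lance code.

```python
def shifted_schedule(num_points: int, timestep_shift: float) -> tuple[list[float], list[float]]:
    """Return (timesteps, dts) following the quoted snippet, in plain Python."""
    # Equivalent of torch.linspace(1, 0, num_points).
    ts = [1.0 - i / (num_points - 1) for i in range(num_points)]
    # Same shift formula as in the quoted code.
    ts = [timestep_shift * t / (1 + (timestep_shift - 1) * t) for t in ts]
    # Step sizes between consecutive timesteps.
    dts = [a - b for a, b in zip(ts[:-1], ts[1:])]
    # The final timestep (0.0) is dropped, so len(timesteps) == len(dts).
    return ts[:-1], dts


num_timesteps = 50
shift = 3.0  # assumed value, for illustration only

old_ts, old_dts = shifted_schedule(num_timesteps, shift)      # original: N points
new_ts, new_dts = shifted_schedule(num_timesteps + 1, shift)  # with the +1 fix

print(len(old_dts), len(new_dts))
print(old_dts[0], new_dts[0])
```

Both schedules still integrate from t = 1 to t = 0 (the step sizes sum to 1 up to floating-point error), but the fixed version spreads that over one more, slightly smaller, step, which is why the pixel values move a little even when the initial noise is identical.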

I assume the changed values will not hold on ROCm though, since the noise initialization is now device-specific. So I guess either:

  • We revert the call to _regen_init_noise_on_device for now so that the noise matches, and make a small adjustment to the pixel values if needed (may fail due to extra timestep, but will be on the edge of the tolerance)
    or
  • We use the new device specific values, and only run it on CUDA for now. I can also see if I can find an AMD GPU to test with to get ground truth values
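
For context on what "on the edge of the tolerance" means in practice, here is a minimal sketch of a tolerance-based pixel comparison. The helper names, the RGB tuple layout, the sample values, and the tolerance of 2 are all hypothetical; the actual check in the test suite may be structured differently.

```python
Pixel = tuple[int, int, int]


def max_channel_diff(actual: list[Pixel], reference: list[Pixel]) -> int:
    """Largest absolute per-channel difference across the sampled pixels."""
    return max(
        (abs(a - r) for pa, pr in zip(actual, reference) for a, r in zip(pa, pr)),
        default=0,
    )


def pixels_match(actual: list[Pixel], reference: list[Pixel], tolerance: int) -> bool:
    """True if every channel of every sampled pixel is within `tolerance` of the reference."""
    if len(actual) != len(reference):
        return False
    return max_channel_diff(actual, reference) <= tolerance


# Made-up values, for illustration only.
reference = [(120, 64, 32), (200, 180, 90)]
slightly_off = [(121, 63, 33), (199, 181, 90)]     # e.g. a small schedule change
different_noise = [(30, 200, 150), (10, 40, 220)]  # e.g. a different initial latent

print(pixels_match(slightly_off, reference, tolerance=2))
print(pixels_match(different_noise, reference, tolerance=2))
```

Under this kind of check, a small numerical drift can stay within tolerance, while a different initial latent produces a different image altogether and cannot be absorbed by loosening the threshold.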

@Gaohan123 @princepride any preference?

Signed-off-by: Alex Brooks <albrooks@redhat.com>
@princepride

Collaborator
image

This image should be the expected output.

Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
@alex-jw-brooks

Collaborator Author

@princepride we can match the current image, but we need to disable _regen_init_noise_on_device in Bagel to make sure the latents are the same, which feels a bit strange to me since it's just a different latent rather than a bug. Although then we can run it on CUDA and AMD at least 🤞

For now, I've commented out the latent regeneration and set num_inference_steps to 14 to account for the off-by-one fix from the Lance PR, so it should pass now. FYI @lishunyang12
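
The reasoning behind lowering the step count can be sketched as follows. This assumes the test previously requested 15 steps, which the thread does not state explicitly; `old_points` and `new_points` are hypothetical helpers standing in for the two `torch.linspace` calls.

```python
def old_points(num_timesteps: int) -> list[float]:
    """Pre-fix grid: equivalent of torch.linspace(1, 0, num_timesteps)."""
    n = num_timesteps
    return [1.0 - i / (n - 1) for i in range(n)]


def new_points(num_timesteps: int) -> list[float]:
    """Post-fix grid: equivalent of torch.linspace(1, 0, num_timesteps + 1)."""
    n = num_timesteps + 1
    return [1.0 - i / (n - 1) for i in range(n)]


# Assumed previous setting of 15 vs. the new setting of 14.
before = old_points(15)
after = new_points(14)

# Same grid, hence the same 14 denoising steps once the endpoint is dropped.
print(before == after, len(after) - 1)
```

In other words, requesting one fewer step after the fix reproduces the timestep grid the old code used, so with the same seeded noise the original reference pixels can remain valid.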

@zhangj1an

Contributor

LGTM, good to merge:

  • The +1 timestep fix from the Lance PR is correct and still there (bagel_transformer.py:1700, linspace(1, 0, num_timesteps + 1)).
  • Bagel now uses the original seeded CPU/fp32 noise, which stays the same as before and works on both CUDA and AMD/ROCm. Lance uses its own _regen_init_noise_on_device to sample init noise on-device (CUDA + bf16), which matches its upstream Lance repo.

@Gaohan123 Gaohan123 added the "ready" label (label to trigger buildkite CI) Jun 4, 2026
@hsliuustc0106 hsliuustc0106 added the "merge-test" label (label to trigger buildkite merge test CI) and removed the "ready" label (label to trigger buildkite CI) Jun 4, 2026
@Gaohan123 Gaohan123 enabled auto-merge (squash) June 4, 2026 09:56
@Gaohan123 Gaohan123 merged commit e10aca3 into vllm-project:main Jun 4, 2026
6 checks passed
86MaxCao pushed a commit to 86MaxCao/vllm-omni that referenced this pull request Jun 4, 2026
Signed-off-by: Alex Brooks <albrooks@redhat.com>

Labels

high priority (high priority issue, needs to be done asap), merge-test (label to trigger buildkite merge test CI)

Development

Successfully merging this pull request may close these issues.

[Bug]: Test failures due to pixel value mismatch in Bagel connectors

5 participants