[CI] Update Bagel Pixels#4081
Conversation
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
| ] | ||
|
|
||
| if current_omni_platform.is_rocm(): | ||
| REFERENCE_PIXELS = [ |
There was a problem hiding this comment.
Not sure why rocm pixels are redefined here, but since they were exactly the same, I deleted them. Unfortunately don't have a rocm device to test the new values on 😅
|
Interesting, have you compare the result with the original code? |
|
@princepride yup! The PR is split into two commits so it's easier to compare since I wanted to make sure the timestep fix and initialization were the only reasons the pixels changed. The first one ( # on e840a24c4df9f57fbeeb6a73d7c6b895f0e23d1a
pytest tests/distributed/omni_connectors/test_bagel_shared_memory_connector.py -v -s --run-level advanced_model -s
The second commit brings the Lance fixes back and updates the pixel values for tti/i2i to match what the tests currently produce, so passes on cuda with new values |
May I ask what is the lance fixes? |
|
Apologies for the delayed response as I've been quite busy lately.😂 I took a closer look, and it seems a previous code change caused the pixel values to change. Why are we modifying the pixel values in this PR instead of just reverting that previous change directly? I checked the original code in the bagel repository, and the timesteps calculation is exactly the same as before. |
|
I think previously bagel was using batched CFG, then the lance PR switched bagel to use sequential CFG (because the lance model re-used bagel as part of its model structure). Alex will bring back batched CFG in #4098. I will finish review #4098 and this PR tomorrow, also check whether it is possible to not change reference image pixels. in my attempt, i generated a cat image similar to the 1st image in the PR description as shown below, so maybe it is possible (im not sure yet).
|
|
Hi @zhangj1an, thanks! That is actually a stale output. On the current main, you should get a good output since it's calling CFG sequentially, it's just not the same one. Here is a repro script: from vllm_omni.entrypoints.omni import Omni
if __name__ == "__main__":
omni = Omni(
model="ByteDance-Seed/BAGEL-7B-MoT",
enforce_eager=True,
)
formatted_prompt = {
"prompt": f"<|im_start|>A cute cat<|im_end|>",
"modalities": ["image"],
}
omni_outputs = list(omni.generate(prompts=[formatted_prompt], sampling_params_list=omni.default_sampling_params_list))
omni_outputs[1].images[0].save("output_main.png")You should get something like this: The main reason for the large change in output is that the Lance PR added this change, so it's now regenerating the
The result may be off by a little though, because the Lance PR also fixed an off by one error in the timesteps. I.e., from the original Bagel code here: timesteps = torch.linspace(1, 0, num_timesteps, device=x_t.device)
timesteps = timestep_shift * timesteps / (1 + (timestep_shift - 1) * timesteps)
dts = timesteps[:-1] - timesteps[1:]
timesteps = timesteps[:-1] # will have num_timesteps - 1 elementsLance has an identical timestep creation with the +1 fix here, so this was fixed while Bagel was fixed while porting Lance to Omni, which also shifts the values a bit. I assume the values changed will be bad for ROCm though since its now device specific noise initialization. So I guess either:
@Gaohan123 @princepride any preference? |
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Signed-off-by: Alex Brooks <albrooks@redhat.com>
|
@princepride we can match the current image, but we need to disable For now, I've commented the latent regeneration out and set |
|
LGTM, is good to merge,
|
Signed-off-by: Alex Brooks <albrooks@redhat.com>





Purpose
Fix #3977
Re-enables the Bagel tests that were failing in the CI due to incorrect handling in batched CFG; the PR for Lance fixed the correctness of the output for CFG, but it added two changes that change the output, so we need to update the reference pixels.
Initialization changes, i.e.,added
_regen_init_noise_on_deviceto the pipeline. This is the main the reason the output changes a lot.Correction in number of timesteps
was changed to add one more timestep
As a result, the reference image on CUDA seems to have changed from the left one to the right one:

For the img2img, its less dramatic looking, but there are changes as well. You can run the first commit in this PR (which reverted the fixes in Lance) to see the tests pass with the old values as a confidence check.
@Gaohan123 @lishunyang12 @zhangj1an can you please take a look? I will open a separate PR to add the batched CFG path back, but I think it's better to do in separate PRs since the current behavior is actually correct, and
generate_imageis pretty messy