[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034)#3077
[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034)#3077ptarasiewiczNV wants to merge 8 commits into
Conversation
PR vllm-project#2320 (`7e28eda9`) dropped `max_tokens: 1281` from the GLM-Image stage config and moved the compute into `serving_chat._apply_request_overrides`, but gated it on `height is not None and width is not None`. For the recipe's bare-curl request (no `extra_body.height` / `extra_body.width`) the gate skipped the compute; `SamplingParams.max_tokens` then fell through to vLLM's `max_model_len - seq_len` (~131k) and the AR stage's generation budget no longer matched the VQ token layout the parser expects, leaving the pre-refactor path latently broken since vllm-project#2320 and surfacing as the IndexError the deploy-yaml edit in vllm-project#3034 was working around. Fix: when the user didn't pass h/w, fall back to the diffusion stage's default h/w (GLM-Image stage-1 yaml already declares `height: 1024, width: 1024`), rather than hardcoding a second size default in serving_chat or re-adding the yaml entry. This makes the compute effectively unconditional for AR + image-diffusion pipelines that declare a target size in their sampling params; LLM-only and audio pipelines have neither height nor width in any stage's params and continue to skip the block — no architecture gate needed. Also fix a related bug: `getattr(explicit_fields, "max_tokens", None)` was reading an attribute off a `set[str]` (Pydantic's `model_fields_set`), so it always returned `None` and silently overwrote user-provided `max_tokens`. Replaced with a proper set membership check. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
…or (vllm-project#3034) vllm-omni issue vllm-project#3034: `zai-org/GLM-Image` served via `vllm serve --omni` returns noisy / washed-out images for the minimal curl from the recipe: {"messages":[{"role":"user","content":"A beautiful landscape painting"}]} Root cause: - `OmniOpenAIServingChat` only attached `mm_processor_kwargs` to the tprompt when the request explicitly supplied `extra_body.height` / `extra_body.width`. For the bare-curl request the field was omitted entirely. - `OmniInputPreprocessor._process_text` checked `elif mm_processor_kwargs:` (truthiness). With the field omitted the default `{}` was falsy, so the preprocessor fell back to plain `_tokenize_prompt`, skipping the multimodal processor path. - That path is where GLM-Image's HF processor emits its image-generation scaffold `<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>`. Without the scaffold the AR stage never entered image-generation mode and collapsed to a handful of repeated VQ codes (unique=15 across 1281 positions, no terminal EOS), which the DiT denoised into a uniform / near-white image (mean=249, std=15). Fix (minimal, two one-file changes): - `serving_chat`: always attach `mm_processor_kwargs` (possibly empty) for image-modality requests, so the preprocessor sees it. - `OmniInputPreprocessor._process_text`: switch from truthiness to presence — `"mm_processor_kwargs" in parsed_content`. An explicitly-attached empty dict is now a valid "route through the multimodal processor" signal, matching callers who want the HF processor's defaults to apply. After the fix the AR produces 139 unique tokens with a terminal EOS and the image is a coherent landscape (mean=117, std=71, full 0-255 range). Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Comments should explain the invariant, not where to read about it; the PR body / commit log is the right place for issue links. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Cosmetic: restore the two-line `ref_image_count = len(reference_images)` / `is_img2img = ref_image_count > 0` shape from the pre-vllm-project#2320 code to keep the diff against main smaller and match the surrounding style. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25c4e15e19
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Match the upstream pre-vllm-project#2320 intent: the AR `max_tokens` is a function of the target h/w (small-preview + large-target + EOS); a user-supplied `max_tokens` can only mismatch the VQ token layout the parser expects. Explicit `"max_tokens": null` on the request also lands here, and the field-copy loop drops None values, so presence- based gating would leave `params.max_tokens` unset. Restoring the simple "always compute" shape avoids both edge cases. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
hsliuustc0106
left a comment
There was a problem hiding this comment.
BLOCKING:
-
Test Coverage — This is a bugfix PR but lacks automated regression tests. The manual curl test evidence is thorough, but we need at least unit tests to prevent future regressions:
- Test for with explicitly-empty to verify presence-based routing (not truthiness)
- Test for to verify the height/width fallback logic when user doesn't specify them
- Test for to verify max_tokens is always computed for AR + image-diffusion pipelines
The changes are well-documented and the manual testing evidence is convincing, but automated tests are required for maintainability.
…d max_tokens compute
Adds three regression tests for the GLM-Image noise fix so the
invariants this PR introduces can't silently regress:
- `tests/inputs/test_preprocess.py::TestProcessTextMmProcessorKwargsRouting`
- `test_empty_mm_processor_kwargs_routes_to_multimodal`: an explicit
empty `mm_processor_kwargs` dict on the prompt routes through
`_process_multimodal` (presence, not truthiness).
- `test_missing_mm_processor_kwargs_routes_to_tokenize`: absence of
the key still routes through plain `_tokenize_prompt` (control).
- `tests/entrypoints/openai_api/test_serving_chat_sampling_params.py::TestApplyRequestOverridesGLMImage`
- `test_falls_back_to_diffusion_stage_defaults_when_no_extra_body`:
no `extra_body` → `_apply_request_overrides` pulls h/w from any
stage's default sampling params (the GLM-Image stage-1 pattern)
and computes `max_tokens=1281` for t2i 1024x1024.
- `test_explicit_null_max_tokens_still_computes`: sending
`"max_tokens": null` (Pydantic includes it in `model_fields_set`)
must not suppress the compute. Guards the explicit-null edge case
where `max_tokens` would otherwise fall through to
`max_model_len - seq_len` and reintroduce the original IndexError.
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Thanks, @hsliuustc0106, for the review. I have added four regression tests covering each of the three points:
|
Purpose
Fix vllm-omni#3034:
zai-org/GLM-Imageserved viavllm serve --omnireturns a noisy / near-uniform white image for the minimal curl from the recipe:Two independent regressions contribute, both fixed here.
1. Route t2i requests through the multimodal processor.
OmniOpenAIServingChatonly attachedmm_processor_kwargsto the tprompt when the user suppliedextra_body.height/extra_body.width.OmniInputPreprocessor._process_textthen gated its multimodal branch onelif mm_processor_kwargs:(truthiness), so when the field was omitted the default{}was falsy and routing fell back to plain_tokenize_prompt. That bypassed GLM-Image's HF processor and the image-generation scaffold<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>it emits — the AR never entered image-gen mode and collapsed to a handful of repeated VQ codes (unique=15/1281, no terminal EOS), which the DiT denoised into uniform white. Nowserving_chatalways attachesmm_processor_kwargs(possibly empty) for image-modality requests, and_process_textswitches from truthiness to presence ("mm_processor_kwargs" in parsed_content) so an explicitly-empty dict correctly routes through the multimodal processor.2. Make the AR
max_tokenscompute cover the default target size.PR #2320 dropped
max_tokens: 1281from the GLM-Image stage config and moved the compute into_apply_request_overrides, but gated it onheight is not None and width is not None. For the bare-curl request (noextra_body) the gate skipped the compute andmax_tokensfell through tomax_model_len - seq_len(~131k), producing an upstream IndexError. Now when the user didn't pass h/w we fall back to any stage's default h/w (GLM-Image stage-1 yaml declaresheight: 1024, width: 1024) so the compute fires for the bare-curl too. The implicit gate becomes "a stage declares h/w in its sampling params" — LLM-only / audio pipelines skip, no architecture check needed. Also fixes a latentgetattr(explicit_fields, "max_tokens", None)bug —explicit_fieldsis aset, so thegetattralways returnedNoneand silently overwrote user-providedmax_tokens.Test Plan
Reproduce the bug's minimal curl and compare the returned PNG against the same run on
main:Also tested with an
extra_body-specified override and with a second prompt ("A red apple on a wooden table").Test Result
Same prompt and seed=42 before/after:
Before: 1024×1024 PNG that is uniform near-white, no scene content.
After: coherent landscape (mountain + lake + wildflowers).
Second prompt sanity-check (
"A red apple on a wooden table") also renders a coherent image.Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.