[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034) #1
Closed
ptarasiewiczNV wants to merge 4 commits into main from
Conversation
Force-pushed from fd0c3f3 to 64ebeb6
PR vllm-project#2320 (`7e28eda9`) dropped `max_tokens: 1281` from the GLM-Image stage config and moved the compute into `serving_chat._apply_request_overrides`, but gated it on `height is not None and width is not None`. For the recipe's bare-curl request (no `extra_body.height` / `extra_body.width`) the gate skipped the compute; `SamplingParams.max_tokens` then fell through to vLLM's `max_model_len - seq_len` (~131k), so the AR stage's generation budget no longer matched the VQ token layout the parser expects. The path has been latently broken since vllm-project#2320 and surfaces as the IndexError that the deploy-yaml edit in vllm-project#3034 was working around.

Fix: when the user didn't pass h/w, fall back to the diffusion stage's default h/w (the GLM-Image stage-1 yaml already declares `height: 1024, width: 1024`) rather than hardcoding a second size default in serving_chat or re-adding the yaml entry. This makes the compute effectively unconditional for AR + image-diffusion pipelines that declare a target size in their sampling params; LLM-only and audio pipelines have neither height nor width in any stage's params and continue to skip the block, so no architecture gate is needed.

Also fix a related bug: `getattr(explicit_fields, "max_tokens", None)` was reading an attribute off a `set[str]` (Pydantic's `model_fields_set`), so it always returned `None` and silently overwrote user-provided `max_tokens`. Replaced with a proper set membership check.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
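A minimal sketch of the fallback described above, with assumed stage/config shapes (the real vllm-omni stage objects and `_apply_request_overrides` signature may differ):

```python
# Illustrative only: stage sampling params are modeled as plain dicts here.
from typing import Optional


def resolve_target_size(
    req_height: Optional[int],
    req_width: Optional[int],
    stage_sampling_params: list[dict],
) -> tuple[Optional[int], Optional[int]]:
    """Prefer the request's height/width; otherwise fall back to the first
    stage that declares a target size (GLM-Image stage-1 declares 1024x1024)."""
    if req_height is not None and req_width is not None:
        return req_height, req_width
    for params in stage_sampling_params:
        h, w = params.get("height"), params.get("width")
        if h is not None and w is not None:
            return h, w
    # LLM-only / audio pipelines declare neither, so the caller skips the
    # max_tokens compute entirely -- no architecture gate needed.
    return None, None


# Bare-curl request against the GLM-Image pipeline: falls back to 1024x1024,
# so the AR max_tokens compute still fires.
assert resolve_target_size(None, None, [{"height": 1024, "width": 1024}]) == (1024, 1024)
```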
Force-pushed from 64ebeb6 to f6d452d
…or (vllm-project#3034)

vllm-omni issue vllm-project#3034: `zai-org/GLM-Image` served via `vllm serve --omni` returns noisy / washed-out images for the minimal curl from the recipe:

{"messages":[{"role":"user","content":"A beautiful landscape painting"}]}

Root cause:

- `OmniOpenAIServingChat` only attached `mm_processor_kwargs` to the prompt when the request explicitly supplied `extra_body.height` / `extra_body.width`. For the bare-curl request the field was omitted entirely.
- `OmniInputPreprocessor._process_text` checked `elif mm_processor_kwargs:` (truthiness). With the field omitted the default `{}` was falsy, so the preprocessor fell back to plain `_tokenize_prompt`, skipping the multimodal processor path.
- That path is where GLM-Image's HF processor emits its image-generation scaffold `<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>`. Without the scaffold the AR stage never entered image-generation mode and collapsed to a handful of repeated VQ codes (unique=15 across 1281 positions, no terminal EOS), which the DiT denoised into a uniform / near-white image (mean=249, std=15).

Fix (minimal, two one-file changes):

- `serving_chat`: always attach `mm_processor_kwargs` (possibly empty) for image-modality requests, so the preprocessor sees it.
- `OmniInputPreprocessor._process_text`: switch from truthiness to presence, i.e. `"mm_processor_kwargs" in parsed_content`. An explicitly-attached empty dict is now a valid "route through the multimodal processor" signal, matching callers who want the HF processor's defaults to apply.

After the fix the AR produces 139 unique tokens with a terminal EOS and the image is a coherent landscape (mean=117, std=71, full 0-255 range).

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
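A self-contained sketch of the truthiness-vs-presence change; `route_prompt` and its return strings are stand-ins for the real `_process_text` routing, not the actual implementation:

```python
def route_prompt(parsed_content: dict) -> str:
    """Decide which processing path a parsed text prompt takes."""
    # Presence, not truthiness: an explicitly attached empty dict is a valid
    # "route through the multimodal processor with HF defaults" signal.
    if "mm_processor_kwargs" in parsed_content:
        kwargs = parsed_content["mm_processor_kwargs"] or {}
        return f"multimodal processor path, kwargs={kwargs}"
    # Key absent entirely: plain tokenization (the pre-fix fallback that
    # skipped GLM-Image's image-generation scaffold).
    return "plain _tokenize_prompt path"


# The bare-curl t2i request now carries an (empty) mm_processor_kwargs and is
# routed through the multimodal processor instead of plain tokenization.
assert route_prompt({"prompt": "A beautiful landscape painting",
                     "mm_processor_kwargs": {}}).startswith("multimodal")
assert route_prompt({"prompt": "plain LLM request"}) == "plain _tokenize_prompt path"
```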
Force-pushed from f6d452d to 5e32dc0
Comments should explain the invariant, not where to read about it; the PR body / commit log is the right place for issue links.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Cosmetic: restore the two-line `ref_image_count = len(reference_images)` / `is_img2img = ref_image_count > 0` shape from the pre-vllm-project#2320 code to keep the diff against main smaller and match the surrounding style.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Summary
Fixes GLM-Image's noisy / washed-out t2i output reported in vllm-omni#3034. The recipe's minimal curl was producing a near-uniform white image (mean=249, std=15) instead of a coherent landscape.
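For reference, a hypothetical Python equivalent of that minimal curl; the host, port, and endpoint path are assumptions (standard OpenAI-compatible serving defaults), only the JSON body comes from the recipe:

```python
import requests

# Same body as the recipe's bare curl: no extra_body.height / extra_body.width.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed host/port/path
    json={"messages": [{"role": "user", "content": "A beautiful landscape painting"}]},
    timeout=600,
)
resp.raise_for_status()
print(resp.json())  # pre-fix: near-uniform white image (mean=249, std=15)
```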
Two independent fixes in `vllm_omni/entrypoints/openai/serving_chat.py` and `vllm_omni/inputs/preprocess.py`:

1. Route t2i requests through the multimodal processor.
`OmniOpenAIServingChat` only attached `mm_processor_kwargs` to the prompt when the user supplied `extra_body.height`/`width`. `OmniInputPreprocessor._process_text` then gated its multimodal branch on `elif mm_processor_kwargs:` (truthiness), so when the field was omitted the default `{}` was falsy and routing fell back to plain `_tokenize_prompt`. That bypassed GLM-Image's HF processor and the image-generation scaffold `<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>` it emits, so the AR never entered image-gen mode and collapsed to a handful of repeated VQ codes (unique=15/1281, no terminal EOS), which the DiT denoised into uniform white. Now `serving_chat` always attaches `mm_processor_kwargs` (possibly empty) for image-modality requests, and `_process_text` switches from truthiness to presence (`"mm_processor_kwargs" in parsed_content`) so an explicitly-empty dict correctly routes through the multimodal processor.
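A sketch of the serving-side half of this change, with an assumed prompt-dict layout and request fields (not the exact `OmniOpenAIServingChat` code):

```python
from typing import Optional


def attach_mm_processor_kwargs(
    prompt: dict,
    height: Optional[int] = None,
    width: Optional[int] = None,
) -> dict:
    """For image-modality requests, always attach the key; it may be empty."""
    kwargs: dict = {}
    if height is not None and width is not None:
        kwargs["height"], kwargs["width"] = height, width
    # Presence of the key, not its contents, is what the fixed _process_text
    # uses as the "route through the multimodal processor" signal.
    prompt["mm_processor_kwargs"] = kwargs
    return prompt


# Bare-curl request: no height/width supplied, but the key is still attached.
assert attach_mm_processor_kwargs(
    {"prompt": "A beautiful landscape painting"})["mm_processor_kwargs"] == {}
```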
2. Make the AR `max_tokens` compute cover the default target size.

PR vllm-project#2320 dropped `max_tokens: 1281` from the GLM-Image stage config and moved the compute into `_apply_request_overrides`, but gated it on `height is not None and width is not None`. For the bare-curl request (no `extra_body`) the gate skipped the compute and `max_tokens` fell through to `max_model_len - seq_len` (~131k), which produced the upstream IndexError that the original yaml edit was working around. Now when the user didn't pass h/w we fall back to any stage's default h/w (GLM-Image stage-1 yaml declares `height: 1024, width: 1024`), so the compute fires for the bare-curl too. The implicit gate becomes "a stage declares h/w in its sampling params"; LLM-only / audio pipelines skip it, so no architecture check is needed. Also fixes a latent `getattr(explicit_fields, "max_tokens", None)` bug: `explicit_fields` is a `set`, so the getattr always returned `None` and silently overwrote user-provided `max_tokens`.
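A self-contained reproduction of the getattr-on-a-set bug and the membership-check fix; `DemoRequest` and the computed value are stand-ins, not the actual request class or compute:

```python
from typing import Optional

from pydantic import BaseModel


class DemoRequest(BaseModel):
    max_tokens: Optional[int] = None


req = DemoRequest(max_tokens=256)
explicit_fields = req.model_fields_set  # a set[str], here {"max_tokens"}

# Buggy pattern: a set has no "max_tokens" attribute, so this is always None
# and the user's explicit value gets silently overwritten by the computed one.
assert getattr(explicit_fields, "max_tokens", None) is None

# Fixed pattern: set membership keeps an explicitly provided max_tokens.
computed_max_tokens = 1281  # stand-in for the size-derived value
max_tokens = req.max_tokens if "max_tokens" in explicit_fields else computed_max_tokens
assert max_tokens == 256
```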
Before / after (same prompt, same seed=42)

Second prompt sanity-check (`"A red apple on a wooden table"`) also renders a coherent image (mean=117, std=82).

Test plan

- Second prompt (`"A red apple on a wooden table"`): coherent image.
- `pre-commit run --all-files`.

🤖 Generated with Claude Code