
[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034) #1

Closed
ptarasiewiczNV wants to merge 4 commits into main from fix/glm-image-noise-3034

Conversation

@ptarasiewiczNV (Owner) commented on Apr 23, 2026

Summary

Fixes GLM-Image's noisy / washed-out t2i output reported in vllm-omni#3034. The recipe's minimal curl was producing a near-uniform white image (mean=249, std=15) instead of a coherent landscape.

Two independent fixes in vllm_omni/entrypoints/openai/serving_chat.py and vllm_omni/inputs/preprocess.py:

1. Route t2i requests through the multimodal processor.
`OmniOpenAIServingChat` only attached `mm_processor_kwargs` to the tprompt when the user supplied `extra_body.height`/`width`, and `OmniInputPreprocessor._process_text` gated its multimodal branch on truthiness (`elif mm_processor_kwargs:`). When the field was omitted, the default `{}` was falsy and routing fell back to plain `_tokenize_prompt`. That bypassed GLM-Image's HF processor and the image-generation scaffold it emits (`<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>`), so the AR stage never entered image-generation mode and collapsed to a handful of repeated VQ codes (unique=15/1281, no terminal EOS), which the DiT denoised into uniform white. Now `serving_chat` always attaches `mm_processor_kwargs` (possibly empty) for image-modality requests, and `_process_text` switches from truthiness to presence (`"mm_processor_kwargs" in parsed_content`), so an explicitly empty dict correctly routes through the multimodal processor (sketched after this list).

2. Make the AR max_tokens compute cover the default target size.
PR vllm-project#2320 dropped `max_tokens: 1281` from the GLM-Image stage config and moved the compute into `_apply_request_overrides`, but gated it on `height is not None and width is not None`. For the bare-curl request (no `extra_body`) the gate skipped the compute, so `max_tokens` fell through to `max_model_len - seq_len` (~131k), producing the upstream IndexError the original yaml edit was working around. Now, when the user doesn't pass h/w, we fall back to any stage's default h/w (GLM-Image's stage-1 yaml declares `height: 1024, width: 1024`), so the compute fires for the bare curl too. The implicit gate becomes "a stage declares h/w in its sampling params"; LLM-only and audio pipelines declare neither, so they skip the block with no architecture check needed. This also fixes a latent `getattr(explicit_fields, "max_tokens", None)` bug: `explicit_fields` is a set, so the getattr always returned `None` and silently overwrote user-provided `max_tokens` (also sketched below).
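
For concreteness, here is a minimal sketch of both changes. Signatures are simplified and surrounding logic is elided; `_process_multimodal` and `_compute_ar_max_tokens` are hypothetical stand-ins for the existing routing and token-budget code, so only the gating logic mirrors the actual diff:

```python
# Sketch only -- simplified from the descriptions above; real signatures differ.

def _process_text(self, parsed_content: dict):
    # Before: `elif mm_processor_kwargs:` (truthiness) -- an explicitly-attached
    # empty dict is falsy, so bare t2i requests skipped the HF processor.
    # After: gate on key presence instead.
    if "mm_processor_kwargs" in parsed_content:
        return self._process_multimodal(parsed_content)  # hypothetical helper name
    return self._tokenize_prompt(parsed_content)

def _apply_request_overrides(self, request, stages):
    height, width = request.height, request.width
    if height is None or width is None:
        # Fall back to the first stage that declares a default target size in
        # its sampling params (GLM-Image stage-1 yaml: height: 1024, width: 1024).
        # LLM-only / audio stages declare neither, so they skip the compute.
        for stage in stages:
            params = getattr(stage, "sampling_params", None) or {}
            if params.get("height") is not None and params.get("width") is not None:
                height = height if height is not None else params["height"]
                width = width if width is not None else params["width"]
                break
    if height is not None and width is not None:
        # explicit_fields is a set[str] (Pydantic's model_fields_set), so set
        # membership -- not getattr -- tells us whether the user set max_tokens.
        if "max_tokens" not in request.explicit_fields:
            request.max_tokens = self._compute_ar_max_tokens(height, width)  # hypothetical
```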

Before / after (same prompt, same seed=42)

|        | mean | std | min | max | unique AR codes | EOS emitted |
| ------ | ---- | --- | --- | --- | --------------- | ----------- |
| Before | 249  | 15  | 135 | 255 | 15 / 1281       | no          |
| After  | 117  | 71  | 0   | 255 | 139 / 1281      | yes         |

Second prompt sanity-check ("A red apple on a wooden table") also renders a coherent image (mean=117, std=82).

Test plan

- Reproduce the issue's bare curl: returns a coherent landscape (1024×1024, mean=117, std=71).
- Second prompt ("A red apple on a wooden table"): coherent image.
- i2i smoke test with an input image: no regression.
- `pre-commit run --all-files` passes.
- Existing GLM-Image unit tests still pass.
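
For convenience, the bare-curl repro as an equivalent Python snippet (the endpoint and port are assumptions about a default local `vllm serve --omni` deployment):

```python
# Repro sketch -- adjust the URL to your deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "A beautiful landscape painting"}]},
    timeout=600,
)
resp.raise_for_status()
print(resp.json())  # before the fix: near-uniform white image; after: coherent landscape
```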

🤖 Generated with Claude Code

@ptarasiewiczNV force-pushed the fix/glm-image-noise-3034 branch 2 times, most recently from fd0c3f3 to 64ebeb6 on April 23, 2026 at 17:37
PR vllm-project#2320 (`7e28eda9`) dropped `max_tokens: 1281` from the GLM-Image
stage config and moved the compute into
`serving_chat._apply_request_overrides`, but gated it on
`height is not None and width is not None`. For the recipe's bare-curl
request (no `extra_body.height` / `extra_body.width`) the gate skipped
the compute; `SamplingParams.max_tokens` then fell through to vLLM's
`max_model_len - seq_len` (~131k) and the AR stage's generation
budget no longer matched the VQ token layout the parser expects,
leaving the pre-refactor path latently broken since vllm-project#2320 and
surfacing as the IndexError the deploy-yaml edit in vllm-project#3034 was
working around.

Fix: when the user didn't pass h/w, fall back to the diffusion stage's
default h/w (GLM-Image stage-1 yaml already declares
`height: 1024, width: 1024`), rather than hardcoding a second size
default in serving_chat or re-adding the yaml entry. This makes the
compute effectively unconditional for AR + image-diffusion pipelines
that declare a target size in their sampling params; LLM-only and
audio pipelines have neither height nor width in any stage's params
and continue to skip the block — no architecture gate needed.

Also fix a related bug: `getattr(explicit_fields, "max_tokens", None)`
was reading an attribute off a `set[str]` (Pydantic's
`model_fields_set`), so it always returned `None` and silently
overwrote user-provided `max_tokens`. Replaced with a proper set
membership check.
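
To see why the old check could never fire (plain Python semantics,
no project code):

    explicit_fields = {"messages", "max_tokens"}  # model_fields_set is a set[str]
    getattr(explicit_fields, "max_tokens", None)  # always None: sets have no such attribute
    "max_tokens" in explicit_fields               # True -- the membership check the fix uses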

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
@ptarasiewiczNV force-pushed the fix/glm-image-noise-3034 branch from 64ebeb6 to f6d452d on April 23, 2026 at 17:55
…or (vllm-project#3034)

vllm-omni issue vllm-project#3034: `zai-org/GLM-Image` served via
`vllm serve --omni` returns noisy / washed-out images for the minimal
curl from the recipe:

    {"messages":[{"role":"user","content":"A beautiful landscape painting"}]}

Root cause:

- `OmniOpenAIServingChat` only attached `mm_processor_kwargs` to the
  tprompt when the request explicitly supplied
  `extra_body.height` / `extra_body.width`. For the bare-curl request
  the field was omitted entirely.
- `OmniInputPreprocessor._process_text` checked
  `elif mm_processor_kwargs:` (truthiness). With the field omitted the
  default `{}` was falsy, so the preprocessor fell back to plain
  `_tokenize_prompt`, skipping the multimodal processor path.
- That path is where GLM-Image's HF processor emits its
  image-generation scaffold
  `<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>`. Without
  the scaffold the AR stage never entered image-generation mode and
  collapsed to a handful of repeated VQ codes (unique=15 across 1281
  positions, no terminal EOS), which the DiT denoised into a uniform
  / near-white image (mean=249, std=15).

Fix (minimal, two one-file changes):

- `serving_chat`: always attach `mm_processor_kwargs` (possibly empty)
  for image-modality requests, so the preprocessor sees it.
- `OmniInputPreprocessor._process_text`: switch from truthiness to
  presence — `"mm_processor_kwargs" in parsed_content`. An
  explicitly-attached empty dict is now a valid "route through the
  multimodal processor" signal, matching callers who want the HF
  processor's defaults to apply.

After the fix the AR produces 139 unique tokens with a terminal EOS
and the image is a coherent landscape (mean=117, std=71, full
0-255 range).

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
@ptarasiewiczNV force-pushed the fix/glm-image-noise-3034 branch from f6d452d to 5e32dc0 on April 23, 2026 at 18:08
Comments should explain the invariant, not where to read about it;
the PR body / commit log is the right place for issue links.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Cosmetic: restore the two-line `ref_image_count = len(reference_images)`
/ `is_img2img = ref_image_count > 0` shape from the pre-vllm-project#2320 code to
keep the diff against main smaller and match the surrounding style.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>