[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034) by ptarasiewiczNV · Pull Request #3077 · vllm-project/vllm-omni

ptarasiewiczNV · 2026-04-23T18:17:21Z

Purpose

Fix vllm-omni#3034: zai-org/GLM-Image served via vllm serve --omni returns a noisy / near-uniform white image for the minimal curl from the recipe:

curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"A beautiful landscape painting"}]}'

Two independent regressions contribute, both fixed here.

1. Route t2i requests through the multimodal processor.
OmniOpenAIServingChat only attached mm_processor_kwargs to the tprompt when the user supplied extra_body.height / extra_body.width. OmniInputPreprocessor._process_text then gated its multimodal branch on elif mm_processor_kwargs: (truthiness), so when the field was omitted the default {} was falsy and routing fell back to plain _tokenize_prompt. That bypassed GLM-Image's HF processor and the image-generation scaffold <|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|> it emits — the AR never entered image-gen mode and collapsed to a handful of repeated VQ codes (unique=15/1281, no terminal EOS), which the DiT denoised into uniform white. Now serving_chat always attaches mm_processor_kwargs (possibly empty) for image-modality requests, and _process_text switches from truthiness to presence ("mm_processor_kwargs" in parsed_content) so an explicitly-empty dict correctly routes through the multimodal processor.

2. Make the AR max_tokens compute cover the default target size.
PR #2320 dropped max_tokens: 1281 from the GLM-Image stage config and moved the compute into _apply_request_overrides, but gated it on height is not None and width is not None. For the bare-curl request (no extra_body) the gate skipped the compute and max_tokens fell through to max_model_len - seq_len (~131k), producing an upstream IndexError. Now when the user didn't pass h/w we fall back to any stage's default h/w (GLM-Image stage-1 yaml declares height: 1024, width: 1024) so the compute fires for the bare-curl too. The implicit gate becomes "a stage declares h/w in its sampling params" — LLM-only / audio pipelines skip, no architecture check needed. Also fixes a latent getattr(explicit_fields, "max_tokens", None) bug — explicit_fields is a set, so the getattr always returned None and silently overwrote user-provided max_tokens.

Test Plan

Reproduce the bug's minimal curl and compare the returned PNG against the same run on main:

# Launch (2× 48 GB GPUs; A6000 tested)
vllm serve zai-org/GLM-Image --omni --port 8091

# In another shell
curl -sS -X POST http://localhost:8091/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"A beautiful landscape painting"}]}' \
  -o /tmp/resp.json

python - <<'PY'
import base64, io, json, numpy as np
from PIL import Image
j = json.load(open("/tmp/resp.json"))
url = j["choices"][0]["message"]["content"][0]["image_url"]["url"]
im = Image.open(io.BytesIO(base64.b64decode(url.split(",", 1)[1]))).convert("RGB")
a = np.asarray(im)
print(f"{im.size} mean={a.mean():.2f} std={a.std():.2f} min={a.min()} max={a.max()}")
im.save("/tmp/out.png")
PY

Also tested with an extra_body-specified override and with a second prompt ("A red apple on a wooden table").

Test Result

Same prompt and seed=42 before/after:

	mean	std	min	max	unique AR codes	EOS emitted
Before	249	15	135	255	15 / 1281	❌
After	117	71	0	255	139 / 1281	✅

Before: 1024×1024 PNG that is uniform near-white, no scene content.
After: coherent landscape (mountain + lake + wildflowers).

Second prompt sanity-check ("A red apple on a wooden table") also renders a coherent image.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

PR vllm-project#2320 (`7e28eda9`) dropped `max_tokens: 1281` from the GLM-Image stage config and moved the compute into `serving_chat._apply_request_overrides`, but gated it on `height is not None and width is not None`. For the recipe's bare-curl request (no `extra_body.height` / `extra_body.width`) the gate skipped the compute; `SamplingParams.max_tokens` then fell through to vLLM's `max_model_len - seq_len` (~131k) and the AR stage's generation budget no longer matched the VQ token layout the parser expects, leaving the pre-refactor path latently broken since vllm-project#2320 and surfacing as the IndexError the deploy-yaml edit in vllm-project#3034 was working around. Fix: when the user didn't pass h/w, fall back to the diffusion stage's default h/w (GLM-Image stage-1 yaml already declares `height: 1024, width: 1024`), rather than hardcoding a second size default in serving_chat or re-adding the yaml entry. This makes the compute effectively unconditional for AR + image-diffusion pipelines that declare a target size in their sampling params; LLM-only and audio pipelines have neither height nor width in any stage's params and continue to skip the block — no architecture gate needed. Also fix a related bug: `getattr(explicit_fields, "max_tokens", None)` was reading an attribute off a `set[str]` (Pydantic's `model_fields_set`), so it always returned `None` and silently overwrote user-provided `max_tokens`. Replaced with a proper set membership check. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

…or (vllm-project#3034) vllm-omni issue vllm-project#3034: `zai-org/GLM-Image` served via `vllm serve --omni` returns noisy / washed-out images for the minimal curl from the recipe: {"messages":[{"role":"user","content":"A beautiful landscape painting"}]} Root cause: - `OmniOpenAIServingChat` only attached `mm_processor_kwargs` to the tprompt when the request explicitly supplied `extra_body.height` / `extra_body.width`. For the bare-curl request the field was omitted entirely. - `OmniInputPreprocessor._process_text` checked `elif mm_processor_kwargs:` (truthiness). With the field omitted the default `{}` was falsy, so the preprocessor fell back to plain `_tokenize_prompt`, skipping the multimodal processor path. - That path is where GLM-Image's HF processor emits its image-generation scaffold `<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>`. Without the scaffold the AR stage never entered image-generation mode and collapsed to a handful of repeated VQ codes (unique=15 across 1281 positions, no terminal EOS), which the DiT denoised into a uniform / near-white image (mean=249, std=15). Fix (minimal, two one-file changes): - `serving_chat`: always attach `mm_processor_kwargs` (possibly empty) for image-modality requests, so the preprocessor sees it. - `OmniInputPreprocessor._process_text`: switch from truthiness to presence — `"mm_processor_kwargs" in parsed_content`. An explicitly-attached empty dict is now a valid "route through the multimodal processor" signal, matching callers who want the HF processor's defaults to apply. After the fix the AR produces 139 unique tokens with a terminal EOS and the image is a coherent landscape (mean=117, std=71, full 0-255 range). Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

Comments should explain the invariant, not where to read about it; the PR body / commit log is the right place for issue links. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

Cosmetic: restore the two-line `ref_image_count = len(reference_images)` / `is_img2img = ref_image_count > 0` shape from the pre-vllm-project#2320 code to keep the diff against main smaller and match the surrounding style. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25c4e15e19

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Match the upstream pre-vllm-project#2320 intent: the AR `max_tokens` is a function of the target h/w (small-preview + large-target + EOS); a user-supplied `max_tokens` can only mismatch the VQ token layout the parser expects. Explicit `"max_tokens": null` on the request also lands here, and the field-copy loop drops None values, so presence- based gating would leave `params.max_tokens` unset. Restoring the simple "always compute" shape avoids both edge cases. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

hsliuustc0106

BLOCKING:

Test Coverage — This is a bugfix PR but lacks automated regression tests. The manual curl test evidence is thorough, but we need at least unit tests to prevent future regressions:
1. Test for with explicitly-empty to verify presence-based routing (not truthiness)
2. Test for to verify the height/width fallback logic when user doesn't specify them
3. Test for to verify max_tokens is always computed for AR + image-diffusion pipelines

The changes are well-documented and the manual testing evidence is convincing, but automated tests are required for maintainability.

…d max_tokens compute Adds three regression tests for the GLM-Image noise fix so the invariants this PR introduces can't silently regress: - `tests/inputs/test_preprocess.py::TestProcessTextMmProcessorKwargsRouting` - `test_empty_mm_processor_kwargs_routes_to_multimodal`: an explicit empty `mm_processor_kwargs` dict on the prompt routes through `_process_multimodal` (presence, not truthiness). - `test_missing_mm_processor_kwargs_routes_to_tokenize`: absence of the key still routes through plain `_tokenize_prompt` (control). - `tests/entrypoints/openai_api/test_serving_chat_sampling_params.py::TestApplyRequestOverridesGLMImage` - `test_falls_back_to_diffusion_stage_defaults_when_no_extra_body`: no `extra_body` → `_apply_request_overrides` pulls h/w from any stage's default sampling params (the GLM-Image stage-1 pattern) and computes `max_tokens=1281` for t2i 1024x1024. - `test_explicit_null_max_tokens_still_computes`: sending `"max_tokens": null` (Pydantic includes it in `model_fields_set`) must not suppress the compute. Guards the explicit-null edge case where `max_tokens` would otherwise fall through to `max_model_len - seq_len` and reintroduce the original IndexError. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

ptarasiewiczNV · 2026-04-24T09:47:55Z

BLOCKING:

Test Coverage — This is a bugfix PR but lacks automated regression tests. The manual curl test evidence is thorough, but we need at least unit tests to prevent future regressions:

Test for with explicitly-empty to verify presence-based routing (not truthiness)

Test for to verify the height/width fallback logic when user doesn't specify them

Test for to verify max_tokens is always computed for AR + image-diffusion pipelines

The changes are well-documented and the manual testing evidence is convincing, but automated tests are required for maintainability.

Thanks, @hsliuustc0106, for the review. I have added four regression tests covering each of the three points:

Presence-based routing (empty mm_processor_kwargs) — tests/inputs/test_preprocess.py::TestProcessTextMmProcessorKwargsRouting:
- test_empty_mm_processor_kwargs_routes_to_multimodal — explicit {"mm_processor_kwargs": {}} routes through _process_multimodal, not _tokenize_prompt.
- test_missing_mm_processor_kwargs_routes_to_tokenize — control: missing key still falls through to plain tokenize.
h/w fallback when user doesn't specify them — tests/entrypoints/openai_api/test_serving_chat_sampling_params.py::TestApplyRequestOverridesGLMImage::test_falls_back_to_diffusion_stage_defaults_when_no_extra_body — with no extra_body and stage-1 params carrying height=1024, width=1024,
_apply_request_overrides pulls h/w from the diffusion stage defaults; max_tokens==1281 and extra_args["target_h/w"]==1024.
max_tokens always computed for AR + image-diffusion — covered by the existing test_dynamic_max_tokens_overrides_user_value (user-supplied value is overridden by compute) plus a new test_explicit_null_max_tokens_still_computes which guards the case where max_tokens is set to null.

ptarasiewiczNV · 2026-04-27T14:20:06Z

Closing for rebased: #3189 after #3084 fixed one of the two addressed issues.

ptarasiewiczNV added 4 commits April 23, 2026 19:54

[Misc] GLM-Image: drop issue-tracker references from code comments

cdc9694

Comments should explain the invariant, not where to read about it; the PR body / commit log is the right place for issue links. Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

ptarasiewiczNV requested a review from hsliuustc0106 as a code owner April 23, 2026 18:17

Merge branch 'main' into fix/glm-image-noise-3034

b2204c0

chatgpt-codex-connector Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread vllm_omni/entrypoints/openai/serving_chat.py Outdated

hsliuustc0106 reviewed Apr 23, 2026

View reviewed changes

This comment was marked as duplicate.

Sign in to view

Merge branch 'main' into fix/glm-image-noise-3034

38a978c

ptarasiewiczNV requested a review from hsliuustc0106 April 24, 2026 09:48

ptarasiewiczNV closed this Apr 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034)#3077

[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034)#3077
ptarasiewiczNV wants to merge 8 commits into
vllm-project:mainfrom
ptarasiewiczNV:fix/glm-image-noise-3034

ptarasiewiczNV commented Apr 23, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

hsliuustc0106 left a comment

Uh oh!

This comment was marked as duplicate.

Uh oh!

ptarasiewiczNV commented Apr 24, 2026

Uh oh!

ptarasiewiczNV commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ptarasiewiczNV commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as duplicate.

Uh oh!

ptarasiewiczNV commented Apr 24, 2026

Uh oh!

ptarasiewiczNV commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ptarasiewiczNV commented Apr 23, 2026 •

edited

Loading