
[Bugfix] GLM-Image: fix noisy / washed-out t2i output (#3034) #3077

Closed
ptarasiewiczNV wants to merge 8 commits into vllm-project:main from ptarasiewiczNV:fix/glm-image-noise-3034

Conversation

@ptarasiewiczNV
Contributor

@ptarasiewiczNV ptarasiewiczNV commented Apr 23, 2026

Purpose

Fix vllm-omni#3034: zai-org/GLM-Image served via vllm serve --omni returns a noisy / near-uniform white image for the minimal curl from the recipe:

curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"A beautiful landscape painting"}]}'

Two independent regressions contribute, both fixed here.

1. Route t2i requests through the multimodal processor.
OmniOpenAIServingChat only attached mm_processor_kwargs to the tprompt when the user supplied extra_body.height / extra_body.width. OmniInputPreprocessor._process_text then gated its multimodal branch on elif mm_processor_kwargs: (truthiness), so when the field was omitted the default {} was falsy and routing fell back to plain _tokenize_prompt. That bypassed GLM-Image's HF processor and the image-generation scaffold <|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|> it emits — the AR never entered image-gen mode and collapsed to a handful of repeated VQ codes (unique=15/1281, no terminal EOS), which the DiT denoised into uniform white. Now serving_chat always attaches mm_processor_kwargs (possibly empty) for image-modality requests, and _process_text switches from truthiness to presence ("mm_processor_kwargs" in parsed_content) so an explicitly-empty dict correctly routes through the multimodal processor.
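
The difference between the two checks can be sketched in isolation. This is a minimal illustration, not the actual `_process_text` body: `route_by_truthiness` and `route_by_presence` are hypothetical names, the real method has other branches, and the returned strings only stand in for the two code paths named above.

```python
from typing import Any


def route_by_truthiness(parsed_content: dict[str, Any]) -> str:
    """Old behaviour (sketch): an empty dict is falsy, so it is treated as absent."""
    mm_processor_kwargs = parsed_content.get("mm_processor_kwargs", {})
    if mm_processor_kwargs:
        return "multimodal"
    return "tokenize"


def route_by_presence(parsed_content: dict[str, Any]) -> str:
    """New behaviour (sketch): the key being present is the routing signal."""
    if "mm_processor_kwargs" in parsed_content:
        return "multimodal"
    return "tokenize"


# An explicitly attached empty dict, as serving_chat now sends for image requests.
prompt_with_empty_kwargs = {"prompt": "A beautiful landscape painting",
                            "mm_processor_kwargs": {}}
# A plain text prompt with no multimodal signal at all.
plain_prompt = {"prompt": "A beautiful landscape painting"}

print(route_by_truthiness(prompt_with_empty_kwargs))
print(route_by_presence(prompt_with_empty_kwargs))
print(route_by_presence(plain_prompt))
```

Only the presence check distinguishes "caller attached an empty dict" from "caller attached nothing", which is the distinction the fix relies on.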

2. Make the AR max_tokens compute cover the default target size.
PR #2320 dropped max_tokens: 1281 from the GLM-Image stage config and moved the compute into _apply_request_overrides, but gated it on height is not None and width is not None. For the bare-curl request (no extra_body) the gate skipped the compute and max_tokens fell through to max_model_len - seq_len (~131k), producing an upstream IndexError. Now when the user didn't pass h/w we fall back to any stage's default h/w (GLM-Image stage-1 yaml declares height: 1024, width: 1024) so the compute fires for the bare-curl too. The implicit gate becomes "a stage declares h/w in its sampling params" — LLM-only / audio pipelines skip, no architecture check needed. Also fixes a latent getattr(explicit_fields, "max_tokens", None) bug — explicit_fields is a set, so the getattr always returned None and silently overwrote user-provided max_tokens.
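
The fallback described above might look roughly like the following. This is a sketch under stated assumptions: `resolve_target_size` is a hypothetical helper, and each stage's default sampling params are modelled as a plain dict rather than the real stage config objects.

```python
from typing import Any, Optional


def resolve_target_size(
    height: Optional[int],
    width: Optional[int],
    stage_defaults: list[dict[str, Any]],
) -> Optional[tuple[int, int]]:
    """Return (height, width), preferring request values over stage defaults.

    Returns None when neither the request nor any stage declares a size,
    which is the case for LLM-only or audio pipelines.
    """
    if height is None or width is None:
        for params in stage_defaults:
            # Use the first stage that declares both dimensions.
            if params.get("height") is not None and params.get("width") is not None:
                height = height if height is not None else params["height"]
                width = width if width is not None else params["width"]
                break
    if height is None or width is None:
        return None
    return height, width


# Hypothetical two-stage pipeline: AR stage without a size, diffusion stage with one.
glm_image_like = [{"temperature": 0.7}, {"height": 1024, "width": 1024}]
llm_only = [{"temperature": 0.7}]

print(resolve_target_size(None, None, glm_image_like))
print(resolve_target_size(512, 768, glm_image_like))
print(resolve_target_size(None, None, llm_only))
```

A `None` result means the `max_tokens` compute is skipped, so no explicit architecture check is needed.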

Test Plan

Reproduce the bug's minimal curl and compare the returned PNG against the same run on main:

# Launch (2× 48 GB GPUs; A6000 tested)
vllm serve zai-org/GLM-Image --omni --port 8091

# In another shell
curl -sS -X POST http://localhost:8091/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"A beautiful landscape painting"}]}' \
  -o /tmp/resp.json

python - <<'PY'
import base64, io, json, numpy as np
from PIL import Image
j = json.load(open("/tmp/resp.json"))
url = j["choices"][0]["message"]["content"][0]["image_url"]["url"]
im = Image.open(io.BytesIO(base64.b64decode(url.split(",", 1)[1]))).convert("RGB")
a = np.asarray(im)
print(f"{im.size} mean={a.mean():.2f} std={a.std():.2f} min={a.min()} max={a.max()}")
im.save("/tmp/out.png")
PY

Also tested with an extra_body-specified override and with a second prompt ("A red apple on a wooden table").

Test Result

Same prompt and seed=42 before/after:

| | mean | std | min | max | unique AR codes | EOS emitted |
|---|---|---|---|---|---|---|
| Before | 249 | 15 | 135 | 255 | 15 / 1281 | no |
| After | 117 | 71 | 0 | 255 | 139 / 1281 | yes |

Before: 1024×1024 PNG that is uniform near-white, no scene content.
After: coherent landscape (mountain + lake + wildflowers).

Second prompt sanity-check ("A red apple on a wooden table") also renders a coherent image.



PR vllm-project#2320 (`7e28eda9`) dropped `max_tokens: 1281` from the GLM-Image
stage config and moved the compute into
`serving_chat._apply_request_overrides`, but gated it on
`height is not None and width is not None`. For the recipe's bare-curl
request (no `extra_body.height` / `extra_body.width`) the gate skipped
the compute; `SamplingParams.max_tokens` then fell through to vLLM's
`max_model_len - seq_len` (~131k) and the AR stage's generation
budget no longer matched the VQ token layout the parser expects,
leaving the pre-refactor path latently broken since vllm-project#2320 and
surfacing as the IndexError the deploy-yaml edit in vllm-project#3034 was
working around.

Fix: when the user didn't pass h/w, fall back to the diffusion stage's
default h/w (GLM-Image stage-1 yaml already declares
`height: 1024, width: 1024`), rather than hardcoding a second size
default in serving_chat or re-adding the yaml entry. This makes the
compute effectively unconditional for AR + image-diffusion pipelines
that declare a target size in their sampling params; LLM-only and
audio pipelines have neither height nor width in any stage's params
and continue to skip the block — no architecture gate needed.

Also fix a related bug: `getattr(explicit_fields, "max_tokens", None)`
was reading an attribute off a `set[str]` (Pydantic's
`model_fields_set`), so it always returned `None` and silently
overwrote user-provided `max_tokens`. Replaced with a proper set
membership check.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
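
The `getattr` bug is easy to reproduce without any vLLM code. In this sketch a plain `set[str]` stands in for Pydantic's `model_fields_set`, and the two helper names are hypothetical.

```python
def user_set_max_tokens_buggy(explicit_fields: set[str]) -> bool:
    # A set has no attribute called "max_tokens", so getattr always
    # returns the default (None), whatever the set contains.
    return getattr(explicit_fields, "max_tokens", None) is not None


def user_set_max_tokens_fixed(explicit_fields: set[str]) -> bool:
    # Set membership is the correct way to ask "did the user send this field?"
    return "max_tokens" in explicit_fields


explicit_fields = {"messages", "max_tokens"}
print(user_set_max_tokens_buggy(explicit_fields))
print(user_set_max_tokens_fixed(explicit_fields))
```

The buggy check reports "not set" for every request, which is why the user-provided value was silently overwritten.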
…or (vllm-project#3034)

vllm-omni issue vllm-project#3034: `zai-org/GLM-Image` served via
`vllm serve --omni` returns noisy / washed-out images for the minimal
curl from the recipe:

    {"messages":[{"role":"user","content":"A beautiful landscape painting"}]}

Root cause:

- `OmniOpenAIServingChat` only attached `mm_processor_kwargs` to the
  tprompt when the request explicitly supplied
  `extra_body.height` / `extra_body.width`. For the bare-curl request
  the field was omitted entirely.
- `OmniInputPreprocessor._process_text` checked
  `elif mm_processor_kwargs:` (truthiness). With the field omitted the
  default `{}` was falsy, so the preprocessor fell back to plain
  `_tokenize_prompt`, skipping the multimodal processor path.
- That path is where GLM-Image's HF processor emits its
  image-generation scaffold
  `<|image|>PROMPT<sop>H W<eop><sop>h w<eop><|dit_token_N|>`. Without
  the scaffold the AR stage never entered image-generation mode and
  collapsed to a handful of repeated VQ codes (unique=15 across 1281
  positions, no terminal EOS), which the DiT denoised into a uniform
  / near-white image (mean=249, std=15).

Fix (minimal, two one-file changes):

- `serving_chat`: always attach `mm_processor_kwargs` (possibly empty)
  for image-modality requests, so the preprocessor sees it.
- `OmniInputPreprocessor._process_text`: switch from truthiness to
  presence — `"mm_processor_kwargs" in parsed_content`. An
  explicitly-attached empty dict is now a valid "route through the
  multimodal processor" signal, matching callers who want the HF
  processor's defaults to apply.

After the fix the AR produces 139 unique tokens with a terminal EOS
and the image is a coherent landscape (mean=117, std=71, full
0-255 range).

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Comments should explain the invariant, not where to read about it;
the PR body / commit log is the right place for issue links.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Cosmetic: restore the two-line `ref_image_count = len(reference_images)`
/ `is_img2img = ref_image_count > 0` shape from the pre-vllm-project#2320 code to
keep the diff against main smaller and match the surrounding style.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25c4e15e19


Comment thread vllm_omni/entrypoints/openai/serving_chat.py Outdated
Match the upstream pre-vllm-project#2320 intent: the AR `max_tokens` is a function
of the target h/w (small-preview + large-target + EOS); a
user-supplied `max_tokens` can only mismatch the VQ token layout the
parser expects. Explicit `"max_tokens": null` on the request also
lands here, and the field-copy loop drops None values, so presence-
based gating would leave `params.max_tokens` unset. Restoring the
simple "always compute" shape avoids both edge cases.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
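
One decomposition consistent with the `max_tokens: 1281` value for a 1024x1024 target is 256 small-preview tokens + 1024 large-target tokens + 1 EOS. The sketch below assumes one VQ token per 32x32 pixel patch and a half-resolution preview. Both are assumptions chosen because they reproduce 1281, not values stated in this PR, and `compute_ar_max_tokens` is a hypothetical name.

```python
def compute_ar_max_tokens(height: int, width: int, factor: int = 32) -> int:
    """Sketch of the small-preview + large-target + EOS token budget.

    `factor` (pixels per VQ token along each axis) and the half-resolution
    preview are assumptions, not values taken from the PR.
    """
    large_target = (height // factor) * (width // factor)
    small_preview = (height // 2 // factor) * (width // 2 // factor)
    eos = 1
    return small_preview + large_target + eos


print(compute_ar_max_tokens(1024, 1024))  # 256 + 1024 + 1 = 1281
```

Because the budget is a pure function of the target size, a user-supplied `max_tokens` can only disagree with the layout the parser expects, which is the argument for always computing it.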
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


BLOCKING:

  • Test Coverage — This is a bugfix PR but lacks automated regression tests. The manual curl test evidence is thorough, but we need at least unit tests to prevent future regressions:

    1. Test with an explicitly empty mm_processor_kwargs to verify presence-based routing (not truthiness)
    2. Test to verify the height/width fallback logic when the user doesn't specify them
    3. Test to verify max_tokens is always computed for AR + image-diffusion pipelines

The changes are well-documented and the manual testing evidence is convincing, but automated tests are required for maintainability.

hsliuustc0106

This comment was marked as duplicate.

…d max_tokens compute

Adds four regression tests for the GLM-Image noise fix so the
invariants this PR introduces can't silently regress:

- `tests/inputs/test_preprocess.py::TestProcessTextMmProcessorKwargsRouting`
  - `test_empty_mm_processor_kwargs_routes_to_multimodal`: an explicit
    empty `mm_processor_kwargs` dict on the prompt routes through
    `_process_multimodal` (presence, not truthiness).
  - `test_missing_mm_processor_kwargs_routes_to_tokenize`: absence of
    the key still routes through plain `_tokenize_prompt` (control).

- `tests/entrypoints/openai_api/test_serving_chat_sampling_params.py::TestApplyRequestOverridesGLMImage`
  - `test_falls_back_to_diffusion_stage_defaults_when_no_extra_body`:
    no `extra_body` → `_apply_request_overrides` pulls h/w from any
    stage's default sampling params (the GLM-Image stage-1 pattern)
    and computes `max_tokens=1281` for t2i 1024x1024.
  - `test_explicit_null_max_tokens_still_computes`: sending
    `"max_tokens": null` (Pydantic includes it in `model_fields_set`)
    must not suppress the compute. Guards the explicit-null edge case
    where `max_tokens` would otherwise fall through to
    `max_model_len - seq_len` and reintroduce the original IndexError.

Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
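
The explicit-null edge case guarded by the last test can be sketched without Pydantic. Here a plain dict stands in for the request, its keys stand in for `model_fields_set`, and all function names are hypothetical.

```python
from typing import Any, Optional


def copy_fields(request: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for the field-copy loop, which drops None values.
    return {key: value for key, value in request.items() if value is not None}


def max_tokens_presence_gated(request: dict[str, Any], computed: int) -> Optional[int]:
    params = copy_fields(request)
    explicit_fields = set(request)  # an explicit null still counts as "set"
    if "max_tokens" not in explicit_fields:
        params["max_tokens"] = computed
    return params.get("max_tokens")


def max_tokens_always_computed(request: dict[str, Any], computed: int) -> Optional[int]:
    params = copy_fields(request)
    params["max_tokens"] = computed  # a user value never survives, by design
    return params.get("max_tokens")


null_request = {"messages": ["..."], "max_tokens": None}
print(max_tokens_presence_gated(null_request, 1281))   # None: left unset
print(max_tokens_always_computed(null_request, 1281))  # 1281
```

With presence-based gating the null request ends up with no `max_tokens` at all, which is the state that lets the value fall through to `max_model_len - seq_len`.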
@ptarasiewiczNV
Contributor Author

BLOCKING:

  • Test Coverage — This is a bugfix PR but lacks automated regression tests. The manual curl test evidence is thorough, but we need at least unit tests to prevent future regressions:

    1. Test with an explicitly empty mm_processor_kwargs to verify presence-based routing (not truthiness)
    2. Test to verify the height/width fallback logic when the user doesn't specify them
    3. Test to verify max_tokens is always computed for AR + image-diffusion pipelines

The changes are well-documented and the manual testing evidence is convincing, but automated tests are required for maintainability.

Thanks, @hsliuustc0106, for the review. I have added four regression tests covering each of the three points:

  • Presence-based routing (empty mm_processor_kwargs): tests/inputs/test_preprocess.py::TestProcessTextMmProcessorKwargsRouting

    • test_empty_mm_processor_kwargs_routes_to_multimodal: explicit {"mm_processor_kwargs": {}} routes through _process_multimodal, not _tokenize_prompt.
    • test_missing_mm_processor_kwargs_routes_to_tokenize: control case, a missing key still falls through to plain tokenize.

  • h/w fallback when the user doesn't specify them: tests/entrypoints/openai_api/test_serving_chat_sampling_params.py::TestApplyRequestOverridesGLMImage::test_falls_back_to_diffusion_stage_defaults_when_no_extra_body. With no extra_body and stage-1 params carrying height=1024, width=1024, _apply_request_overrides pulls h/w from the diffusion stage defaults; max_tokens==1281 and extra_args["target_h/w"]==1024.

  • max_tokens always computed for AR + image-diffusion: covered by the existing test_dynamic_max_tokens_overrides_user_value (the user-supplied value is overridden by the compute) plus a new test_explicit_null_max_tokens_still_computes, which guards the case where max_tokens is set to null.

@ptarasiewiczNV
Contributor Author

Closing in favor of the rebased PR #3189, since #3084 already fixed one of the two issues this PR addressed.


Development

Successfully merging this pull request may close these issues.

[Bug]: GLM-Image produces noisy/incorrect images on vllm-omni 0.18.0

2 participants