
Optimize GLM-Image AR token upsampling and add profiling/tests #2888

Open

zeel2104 wants to merge 1 commit into vllm-project:main from zeel2104:feat/glm-image-ar-bridge-profile

Conversation


@zeel2104 zeel2104 commented Apr 17, 2026

Purpose

Optimize the GLM-Image AR-to-diffusion token upsampling path for issue #2834.

This PR replaces the GLM token-grid upsampling, previously implemented as a float cast plus F.interpolate(mode="nearest"), with an integer repeat_interleave in both:

  • the AR model helper path
  • the AR -> Diffusion stage input processor path

The goal is to reduce avoidable cast/interpolate overhead in the AR bridge while preserving identical token layout. This PR also expands unit coverage for GLM stage-input processing edge cases.

Test Plan

Added focused unit coverage in:

tests/model_executor/stage_input_processors/test_glm_image_stage_input_processors.py

Validated the changed GLM stage-input logic via standalone pytest runs in a local environment, since the full repo-native pytest suite was blocked by local vllm installation/runtime issues.

Additional local microbenchmarking was performed to compare the previous F.interpolate(..., mode="nearest") implementation against the new integer repeat_interleave implementation.

Test Result

Focused unit validation:

6 passed in 21.96s

Covered cases:

  • nearest-neighbor token upsample layout
  • t2i prior-token construction
  • serialized prior_token_image_ids normalization
  • pure i2i large-token path with EOS trimming
  • fallback read from CompletionOutput.multimodal_output
  • truncated AR output with grid down-adjustment

Local microbenchmark:

16x16: old=0.0112 ms  new=0.0120 ms  speedup=0.94x
32x32: old=0.0206 ms  new=0.0122 ms  speedup=1.69x
64x64: old=0.0251 ms  new=0.0218 ms  speedup=1.15x

Summary:

  • The new integer upsampling path is neutral-to-faster depending on token-grid size, with the strongest gain at 32x32 (~1.69x faster).
  • I do not yet have a reliable full GLM-Image e2e speedup measurement from a complete target runtime, so this PR only claims the local microbenchmark improvement above.

@zeel2104 zeel2104 requested a review from hsliuustc0106 as a code owner April 17, 2026 16:09
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

@lishunyang12
Collaborator

Thanks for your contribution :) Please fix the DCO.

@zeel2104 zeel2104 force-pushed the feat/glm-image-ar-bridge-profile branch from 07881d1 to 338dfd3 on April 17, 2026 16:18
@hsliuustc0106
Collaborator

BLOCKER scan:

  • Correctness: PASS
  • Reliability/Safety: PASS
  • Breaking Changes: PASS
  • Test Coverage: PASS (added 6 comprehensive unit tests)
  • Documentation: ISSUE - profiling feature undocumented
  • Security: PASS

BLOCKING ISSUES:

  1. Documentation - The environment variable for profiling is not documented. Please add documentation for this feature in the user guide or README.

VERDICT: REQUEST_CHANGES


Suggestion: The test plan mentions standalone benchmark files (/tmp/test_glm_stage_standalone.py, /tmp/bench_glm_stage.py) that are not part of this PR. Consider adding these as permanent benchmark tests or remove them from the PR description to avoid confusion.

@zeel2104 zeel2104 force-pushed the feat/glm-image-ar-bridge-profile branch from 338dfd3 to 221a9de on April 18, 2026 13:22
@zeel2104
Author

Thanks for the review.

Addressed the requested changes:

  • Added documentation for VLLM_OMNI_PROFILE_GLM_IMAGE in the GLM-Image user guide pages for both online serving and offline inference.
  • Updated the PR description to clarify that the /tmp/... benchmark/pytest scripts were local validation helpers and are not part of this PR.

I also kept the PR test/result section focused on the actual in-repo unit test coverage plus the local benchmark results for the changed path.

from vllm_omni.model_executor.models.output_templates import OmniOutput

logger = init_logger(__name__)
_PROFILE_GLM_IMAGE = os.environ.get("VLLM_OMNI_PROFILE_GLM_IMAGE", "").lower() in {"1", "true", "yes", "on"}
Collaborator

Why are we adding profiling code here? It should be removed.


# Upsample from 32x to 16x
prior_token_ids = _upsample_token_ids(prior_token_ids_d32, actual_h, actual_w)
_log_profile_timing(
Collaborator

remove it please


diffusion_inputs.append(diffusion_input)

_log_profile_timing(
Collaborator

Please remove all of the _log_profile_timing calls.

@hsliuustc0106
Collaborator

Can you provide the e2e speedup?

@zeel2104 zeel2104 force-pushed the feat/glm-image-ar-bridge-profile branch from 221a9de to 46c28b5 on April 18, 2026 14:50
@zeel2104
Author

@hsliuustc0106 Thanks, addressed.

  • Removed all profiling/logging additions from glm_image_ar.py and stage_input_processors/glm_image.py.
  • Removed the related doc updates as well so the PR stays focused on the token upsampling optimization + tests.

For performance data: I only have local microbenchmark results for the changed upsampling path, not a reliable full GLM-Image end-to-end measurement from a working target runtime. I checked whether I could run e2e locally, but my WSL environment has no visible CUDA GPU (torch.cuda.is_available() == False, device_count == 0), and my native Windows environment does not have a working full vllm runtime for this GLM-Image path, so I do not have a trustworthy e2e speedup number to report from this setup.

The PR description has been updated accordingly and does not claim a verified e2e speedup.

Signed-off-by: Zeel <desaizeel2128@gmail.com>
@zeel2104 zeel2104 force-pushed the feat/glm-image-ar-bridge-profile branch from 46c28b5 to 54896c9 on April 18, 2026 14:57
@hsliuustc0106
Collaborator

Thanks, I'll ask someone else to test it.

