Optimize GLM-Image AR token upsampling and add profiling/tests#2888
zeel2104 wants to merge 1 commit into
Conversation
Thanks for your contribution:) Please fix DCO. |
Force-pushed 07881d1 to 338dfd3
BLOCKER scan:
BLOCKING ISSUES:
VERDICT: REQUEST_CHANGES. Suggestion: The test plan mentions standalone benchmark files (/tmp/test_glm_stage_standalone.py, /tmp/bench_glm_stage.py) that are not part of this PR. Consider adding these as permanent benchmark tests, or remove them from the PR description to avoid confusion.
Force-pushed 338dfd3 to 221a9de
Thanks for the review. Addressed the requested changes:
I also kept the PR test/result section focused on the actual in-repo unit test coverage plus the local benchmark results for the changed path.
from vllm_omni.model_executor.models.output_templates import OmniOutput

logger = init_logger(__name__)
_PROFILE_GLM_IMAGE = os.environ.get("VLLM_OMNI_PROFILE_GLM_IMAGE", "").lower() in {"1", "true", "yes", "on"}
Why are we adding profiling code here? It should be removed.
# Upsample from 32x to 16x
prior_token_ids = _upsample_token_ids(prior_token_ids_d32, actual_h, actual_w)
_log_profile_timing(
diffusion_inputs.append(diffusion_input)

_log_profile_timing(
Remove all of the `_log_profile_timing` calls.

Can you provide the e2e speedup?
Force-pushed 221a9de to 46c28b5
@hsliuustc0106 Thanks, addressed.
For performance data: I only have local microbenchmark results for the changed upsampling path, not a reliable full GLM-Image end-to-end measurement on a working target runtime environment. I checked whether I could run e2e locally, but my current WSL environment does not have a visible CUDA GPU. The PR description has been updated accordingly and does not claim a verified e2e speedup.
Signed-off-by: Zeel <desaizeel2128@gmail.com>
Force-pushed 46c28b5 to 54896c9
Thanks, I'll ask someone else to test it.
Purpose
Optimize the GLM-Image AR-to-diffusion token upsampling path for issue #2834.
This PR replaces GLM token-grid upsampling from float + `F.interpolate(mode="nearest")` with integer `repeat_interleave` in both affected code paths. The goal is to reduce avoidable cast/interpolate overhead in the AR bridge while preserving an identical token layout. This PR also expands unit coverage for GLM stage-input processing edge cases.
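The equivalence this relies on can be shown with a small standalone sketch. This is not the PR's code: the helper names `upsample_interpolate` and `upsample_repeat` are hypothetical, but the sketch demonstrates why a pure-integer `repeat_interleave` reproduces the nearest-neighbor interpolation layout exactly for a 2x token-grid upsample, without the float round-trip.

```python
import torch
import torch.nn.functional as F

def upsample_interpolate(token_ids: torch.Tensor, h: int, w: int, factor: int = 2) -> torch.Tensor:
    # Old-style path: cast integer token ids to float, nearest-interpolate, cast back.
    grid = token_ids.view(1, 1, h, w).float()
    up = F.interpolate(grid, scale_factor=factor, mode="nearest")
    return up.long().view(-1)

def upsample_repeat(token_ids: torch.Tensor, h: int, w: int, factor: int = 2) -> torch.Tensor:
    # New-style path: repeat each token along both grid axes in integer space.
    grid = token_ids.view(h, w)
    up = grid.repeat_interleave(factor, dim=0).repeat_interleave(factor, dim=1)
    return up.reshape(-1)

ids = torch.arange(16)  # a toy 4x4 token grid
a = upsample_interpolate(ids, 4, 4)
b = upsample_repeat(ids, 4, 4)
assert torch.equal(a, b)  # identical flattened token layout
```

Nearest-neighbor upsampling by an integer factor maps output cell `(i, j)` to input cell `(i // factor, j // factor)`, which is exactly what the two chained `repeat_interleave` calls produce, so the token layout is unchanged while the float cast and `F.interpolate` dispatch are avoided.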
Test Plan
Added focused unit coverage in:
Validated the changed GLM stage-input logic with standalone pytest execution in a local environment, because full repo-native `pytest` was blocked by local `vllm` installation/runtime issues. Additional local microbenchmarking compared the previous `F.interpolate(..., mode="nearest")` implementation against the new integer `repeat_interleave` implementation.

Test Result
Focused unit validation:
Covered cases:

- `prior_token_image_ids` normalization
- `CompletionOutput.multimodal_output`

Local microbenchmark:
Summary: `32x32` grid upsampling is ~1.69x faster.
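For illustration only, a microbenchmark of the kind described above can be sketched as follows. This is not the PR's benchmark script; the function names, grid shape, and iteration count are assumptions, and absolute timings depend entirely on hardware and PyTorch build.

```python
import time
import torch
import torch.nn.functional as F

def upsample_interpolate(grid: torch.Tensor) -> torch.Tensor:
    # Old path: float cast + nearest interpolate + cast back to integer ids.
    return F.interpolate(grid.float(), scale_factor=2, mode="nearest").long()

def upsample_repeat(grid: torch.Tensor) -> torch.Tensor:
    # New path: integer repeat_interleave along both spatial dims.
    return grid.repeat_interleave(2, dim=-2).repeat_interleave(2, dim=-1)

grid = torch.randint(0, 16384, (1, 1, 32, 32))  # assumed 32x32 token grid

def bench(fn, iters: int = 1000) -> float:
    fn(grid)  # warmup
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(grid)
    return time.perf_counter() - t0

t_old = bench(upsample_interpolate)
t_new = bench(upsample_repeat)
print(f"interpolate: {t_old:.4f}s  repeat_interleave: {t_new:.4f}s")
```

A correctness check (`torch.equal` on the two outputs) alongside the timing loop is what makes this kind of microbenchmark trustworthy: any speedup claim is only meaningful if the fast path produces a bit-identical token grid.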