[Test] Add L4 diffusion feature test for GLM-Image by herotai214 · Pull Request #3451 · vllm-project/vllm-omni

herotai214 · 2026-05-08T10:51:11Z

Purpose

This PR adds an L4 test for GLM-Image that covers the baseline and two of the supported features, Tensor-Parallel and HSDP, in both the t2i and i2i cases.
It follows up on #2167. That PR covers only the i2i case, does not effectively test the features, and the code has changed considerably since, so I decided to raise this separate PR.

Tests are picked up by nightly: pytest -sv test_glm_image_expansion.py -m "full_model" --run-level "full_model".
The tests are intended to be picked up by the L4 nightly diffusion pipeline described in RFC #1832.

Test Plan

Model

zai-org/GLM-Image

Changes:

.buildkite/test-nightly.yml to trigger test
tests/e2e/online_serving/test_glm_image_expansion.py (the new test file)

Tests added:

After referring to [RFC]: Continuous Diffusion Model Acceleration Support #1217:

A normal baseline execution
Tensor-Parallel=2
HSDP=2

The 3 configurations above each run in both the t2i and i2i cases -> 6 tests in total.
All tests use 2 cards.
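
The 3 x 2 layout described above can be sketched as a small ID builder. This is only an illustration: `CONFIGS`, `MODES`, and `build_case_ids` are hypothetical names, and the real test file may organize its parameters differently. The ID strings themselves are taken from the `--collect-only` output in this PR.

```python
import itertools

# Hypothetical names: the real test file may structure its parameters differently.
CONFIGS = ["baseline", "tensor_parallel_2", "hsdp_2"]
MODES = ["t2i", "i2i"]


def build_case_ids(configs=CONFIGS, modes=MODES):
    """Return one test ID per (mode, config) pair, grouped by mode."""
    return [
        f"{config}_{mode}"
        for mode, config in itertools.product(modes, configs)
    ]


CASE_IDS = build_case_ids()
print(CASE_IDS)
```

A list like this could be passed as the `ids=` argument of `pytest.mark.parametrize`, so that each generated case shows up under a readable name in the test report.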

Remark:

By default, a baseline GLM-Image instance runs Stage-0 AR on device 0 and Stage-1 DiT on device 1 -> Total 2 cards.
For TP and HSDP, only the Stage-1 DiT part is parallelized.
- To occupy as few cards as possible, one part of Stage-1 is also placed on device 0 -> Total 2 cards.
The CFG parallel test is skipped because it is not working properly at the moment; see [Bug]: GLM-Image CFG Parallel is not working #3382.
This test only works correctly once the CLI bugfix in [Bugfix] Fix for multi-stage CLI overrides, which cause features like HSDP not working for multi-stage models #3384 is merged.
(latest update: 8 May, 2026)
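
The device placement in this remark can be restated as plain data to confirm that every configuration fits on 2 cards. The dictionary layout and the names `PLACEMENTS` and `cards_used` are assumptions made for illustration; they are not the project's stage-override format.

```python
# Illustrative only: these dictionaries are not the project's configuration
# format; they just restate the device placement described in the remark.
PLACEMENTS = {
    "baseline": {"stage0_ar": [0], "stage1_dit": [1]},
    "tensor_parallel_2": {"stage0_ar": [0], "stage1_dit": [0, 1]},
    "hsdp_2": {"stage0_ar": [0], "stage1_dit": [0, 1]},
}


def cards_used(placement):
    """Count the distinct devices used across all stages of one configuration."""
    return len({dev for devices in placement.values() for dev in devices})


CARDS = {name: cards_used(p) for name, p in PLACEMENTS.items()}
print(CARDS)
```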

Test Result

pytest -sv tests/e2e/online_serving/test_glm_image_expansion.py -m "full_model" --run-level "full_model"
(With #3384 fix!)

================= 6 passed, 22 warnings in 1418.49s (0:23:38) ==================

pytest -sv tests/e2e/online_serving/test_glm_image_expansion.py -m "full_model" --collect-only

<Dir vllm-omni>
  <Package tests>
    <Package e2e>
      <Package online_serving>
        <Module test_glm_image_expansion.py>
          Comprehensive tests of diffusion features that are available in online serving mode
          and are supported by the following models:
          - zai-org/GLM-Image: (supports t2i & i2i)
          Coverage:
              For both t2i & i2i cases:
              - Tensor-Parallel
              - HSDP
          
          assert_diffusion_response validates successful generation and the expected
          1024x1024 resolution.
          <Function test_glm_image[baseline_t2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[tensor_parallel_2_t2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[hsdp_2_t2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[baseline_i2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[tensor_parallel_2_i2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[hsdp_2_i2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
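
As a rough illustration of the kind of resolution check the module docstring mentions, the sketch below reads the width and height from a PNG header using only the standard library. It is not the implementation of `assert_diffusion_response`; it assumes the returned image is PNG-encoded, which this PR does not state, and `png_size` and `make_png_header` are hypothetical helpers.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes):
    """Read (width, height) from the IHDR chunk of a PNG byte string."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are two big-endian 32-bit integers right after "IHDR".
    return struct.unpack(">II", data[16:24])


def make_png_header(width: int, height: int) -> bytes:
    """Build only the signature and IHDR chunk, enough for png_size to read."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    chunk = b"IHDR" + ihdr
    crc = struct.pack(">I", zlib.crc32(chunk))
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + chunk + crc


# Hypothetical check mirroring the expected 1024x1024 resolution.
assert png_size(make_png_header(1024, 1024)) == (1024, 1024)
```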

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a80dae5abc

herotai214 · 2026-05-08T10:54:15Z

@yenuo26 PTAL 🙏 I'm not sure whether I edited .buildkite/test-nightly.yml correctly.

hsliuustc0106 · 2026-05-08T12:08:57Z

                      path: /mnt/hf-cache
                      type: DirectoryOrCreate

+      - label: ":full_moon: Diffusion X2I(&A&T) · GLM-Image Function Test with H100"


Is this going to be an independent test pipeline?

hsliuustc0106 · 2026-05-08T13:03:15Z

@JaredforReal

hsliuustc0106

LGTM — clean test PR with good use of shared helpers and parametrization. No substantive issues.

hsliuustc0106 · 2026-05-08T21:34:50Z

COMMENT

This PR adds comprehensive L4 diffusion feature tests for GLM-Image, covering baseline, Tensor-Parallel, and HSDP configurations in both T2I and I2I modes.

Notes:

Good test structure - 6 tests total (3 configurations × 2 modes) with proper parametrization.
The Buildkite step is well-configured:
- Appropriate timeout (120 minutes for 6 tests averaging ~3.9 minutes each)
- Correct resource allocation (2 H100 GPUs)
- Proper markers (full_model, diffusion, H100)
Stage overrides are correctly configured:
- TP: Stage-0 stays on device 0, Stage-1 uses devices 0,1
- HSDP: Only Stage-1 is sharded with devices 0,1 (runtime avoids sharding Stage-0)
- Baseline: Stage-0 on device 0, Stage-1 on device 1
Test evidence is solid:
- All 6 tests passed in 1418.49s (~23:38 total)
- Tests validate both successful generation and expected 1024x1024 resolution
The PR correctly notes dependency on PR [Bugfix] Fix for multi-stage CLI overrides, which cause features like HSDP not working for multi-stage models #3384 (multi-stage CLI overrides bugfix).

No blocking issues found. This is a valuable addition to the test suite that will help catch regressions in GLM-Image's parallel features.

yenuo26

LGTM

hsliuustc0106 · 2026-05-16T13:04:44Z

Hi @herotai214, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 7 days. 🕐

Could you please provide an update?

If you're still working on it, that's great — just let us know.
If you're blocked on something, feel free to ask for help.
If this PR is no longer being pursued, please consider closing it so we can keep the review queue manageable.

Thanks for your contribution! 🙏

herotai214 · 2026-05-28T15:02:36Z

Thanks for the reminder. I was working on Hunyuan for a while, and I was aware that the solution to the multi-stage CLI issue seemed uncertain, so I put this PR on hold.

I now see that #3483 solves the CLI issue and was merged today.

I'll verify as soon as possible whether it helps this L4 test (i.e. supports TP and HSDP properly through the CLI). Otherwise, I may need to push a revised version that removes the features not supported through the CLI.

Signed-off-by: herotai214 <herotai214@gmail.com>

herotai214 · 2026-06-01T06:34:13Z

Updated the script. I verified that #3483 works and that the CLI can be used here to apply TP, SP, Cache-DiT, etc.

# with relevant logs like:
# Cache-DiT
INFO 06-01 03:27:48 [cache_dit_backend.py:1818] Cache-dit enabled successfully on GlmImagePipeline

# TP2
(APIServer pid=352086) INFO 06-01 03:30:39 [stage_utils.py:138] Stage 1 logical-to-physical device mapping: 0->4, 1->5

# HSDP2
INFO 06-01 03:36:03 [hsdp.py:128] HSDP Inference: replicate_size=1, shard_size=2, world_size=2, rank=0, fs_world_size=2, fs_rank=0

# SP2
INFO 06-01 03:20:14 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=2, ulysses=2, ring=1, use_ulysses_low=True).
INFO 06-01 03:20:14 [parallel_state.py:630] SP group details for rank 1: sp_group=[0, 1], ulysses_group=[0, 1], ring_group=[1]
INFO 06-01 03:20:14 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=2, ulysses=2, ring=1, use_ulysses_low=True).
INFO 06-01 03:20:14 [parallel_state.py:630] SP group details for rank 0: sp_group=[0, 1], ulysses_group=[0, 1], ring_group=[0]

...

================= 10 passed, 23 warnings in 1226.11s (0:20:26) =================
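
One possible way to confirm from logs like these that a feature was actually applied is to strip the terminal color codes and parse the line of interest. The sketch below does this for the TP2 device-mapping line only; the helper names are hypothetical, the regular expression is derived solely from the single sample line in this comment, and the log format may differ between versions.

```python
import re

# Matches ANSI color sequences such as ESC[32m or ESC[0;36m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Pattern inferred from the single sample line in this PR; an assumption.
DEVICE_MAPPING = re.compile(
    r"Stage (\d+) logical-to-physical device mapping: (.+)$"
)


def strip_ansi(line: str) -> str:
    """Remove ANSI color sequences from a log line."""
    return ANSI_ESCAPE.sub("", line)


def parse_device_mapping(line: str):
    """Return (stage, {logical: physical}), or None if the line does not match."""
    match = DEVICE_MAPPING.search(strip_ansi(line).strip())
    if match is None:
        return None
    pairs = (item.split("->") for item in match.group(2).split(","))
    return int(match.group(1)), {int(a): int(b) for a, b in pairs}


SAMPLE = (
    "\x1b[0;36m(APIServer pid=352086)\x1b[0;0m INFO 06-01 03:30:39 "
    "[stage_utils.py:138] Stage 1 logical-to-physical device mapping: 0->4, 1->5"
)
RESULT = parse_device_mapping(SAMPLE)
print(RESULT)
```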

However, after rebasing and upgrading to vllm 0.22.0 this morning via the commit [Rebase] Rebase to vllm releases/v0.22.0 (#3891), I can no longer use the CLI to apply any feature here.

I will double-check and raise an issue very soon.

herotai214 requested review from congw729 and yenuo26 as code owners May 8, 2026 10:51

chatgpt-codex-connector Bot reviewed May 8, 2026

Comment thread tests/e2e/online_serving/test_glm_image_expansion.py Outdated

hsliuustc0106 reviewed May 8, 2026

hsliuustc0106 approved these changes May 8, 2026

yenuo26 approved these changes May 9, 2026

[Test] Add L4 diffusion feature tests for GLM-Image

5b8d60c

Signed-off-by: herotai214 <herotai214@gmail.com>

herotai214 force-pushed the glm_test branch from 6a989cb to 5b8d60c Compare June 1, 2026 06:18

chickeyton mentioned this pull request Jun 1, 2026

[CI] GLM-Image testcases #2167

Closed

5 tasks

herotai214 mentioned this pull request Jun 1, 2026

[RFC]: CI optimization and L4 model tests supplement JiusiServe/vllm-omni#177

Open

13 tasks

herotai214 wants to merge 1 commit into vllm-project:main from herotai214:glm_test

3 participants
