[Test] Add L4 diffusion feature test for GLM-Image#3451

Open
herotai214 wants to merge 1 commit into vllm-project:main from herotai214:glm_test

@herotai214 herotai214 commented May 8, 2026

Purpose

This PR adds an L4 test for GLM-Image that covers the baseline and two of the supported features, Tensor-Parallel and HSDP, in both the t2i and i2i cases.
It builds on #2167. That PR covers only the i2i case, does not effectively exercise the features, and the code has changed considerably since, so I decided to raise this separate PR.

The tests are intended to be picked up by the L4 nightly diffusion pipeline described in RFC #1832, using: pytest -sv test_glm_image_expansion.py -m "full_model" --run-level "full_model".

Test Plan

Model

  • zai-org/GLM-Image

Changes:

  • .buildkite/test-nightly.yml to trigger test
  • tests/e2e/online_serving/test_glm_image_expansion.py (the new test file)

Tests added:

Chosen with reference to [RFC]: Continuous Diffusion Model Acceleration Support #1217:

  • A normal baseline execution
  • Tensor-Parallel=2
  • HSDP=2

Each of the 3 configurations above runs in both the t2i and i2i cases, giving 6 tests in total.
All tests use 2 cards.
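
The 3 x 2 case matrix above can be built in a few lines, roughly as sketched below. This is an illustration only and not the code in test_glm_image_expansion.py: the names CONFIGS, MODES, CASES, and CASE_IDS are hypothetical, and only the resulting test IDs are taken from the collected output in the Test Result section.

```python
from itertools import product

# Hypothetical placeholders; the real test defines its own server arguments.
CONFIGS = ["baseline", "tensor_parallel_2", "hsdp_2"]
MODES = ["t2i", "i2i"]

# Modes vary slowest, so all t2i cases come first and then all i2i cases,
# which matches the order of the collected test IDs.
CASES = [(config, mode) for mode, config in product(MODES, CONFIGS)]
CASE_IDS = [f"{config}_{mode}" for config, mode in CASES]

if __name__ == "__main__":
    for case_id in CASE_IDS:
        print(case_id)
```

A list like this could then be passed to `pytest.mark.parametrize("config,mode", CASES, ids=CASE_IDS)` so that each combination is reported as its own test.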

Remark:

Test Result

pytest -sv tests/e2e/online_serving/test_glm_image_expansion.py -m "full_model" --run-level "full_model"
(With #3384 fix!)

================= 6 passed, 22 warnings in 1418.49s (0:23:38) ==================
pytest -sv tests/e2e/online_serving/test_glm_image_expansion.py -m "full_model" --collect-only
<Dir vllm-omni>
  <Package tests>
    <Package e2e>
      <Package online_serving>
        <Module test_glm_image_expansion.py>
          Comprehensive tests of diffusion features that are available in online serving mode
          and are supported by the following models:
          - zai-org/GLM-Image: (supports t2i & i2i)
          Coverage:
              For both t2i & i2i cases:
              - Tensor-Parallel
              - HSDP
          
          assert_diffusion_response validates successful generation and the expected
          1024x1024 resolution.
          <Function test_glm_image[baseline_t2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[tensor_parallel_2_t2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[hsdp_2_t2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[baseline_i2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[tensor_parallel_2_i2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
          <Function test_glm_image[hsdp_2_i2i]>
            Test GLM-Image in both T2I and I2I modes across all parallel configurations.
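
The resolution check mentioned in the module docstring could look roughly like the sketch below. It is not the real assert_diffusion_response helper, whose implementation is not shown in this thread. The names png_size and assert_resolution are hypothetical, and the sketch assumes the generated image is available as PNG bytes.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are big-endian 32-bit integers at bytes 16..23.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def assert_resolution(data: bytes, expected: tuple[int, int] = (1024, 1024)) -> None:
    """Fail if the PNG does not have the expected (width, height)."""
    size = png_size(data)
    assert size == expected, f"expected {expected}, got {size}"
```

Reading the header directly keeps the check free of image-library dependencies; a real helper may well decode the full image instead.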
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a80dae5abc

Comment thread tests/e2e/online_serving/test_glm_image_expansion.py Outdated
@herotai214
Contributor Author

@yenuo26 PTAL 🙏 I'm not sure whether I edited .buildkite/test-nightly.yml correctly.

path: /mnt/hf-cache
type: DirectoryOrCreate

- label: ":full_moon: Diffusion X2I(&A&T) · GLM-Image Function Test with H100"
Collaborator

Is this going to be an independent test pipeline?

@hsliuustc0106
Collaborator

@JaredforReal

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

LGTM — clean test PR with good use of shared helpers and parametrization. No substantive issues.

@hsliuustc0106
Collaborator

COMMENT

This PR adds comprehensive L4 diffusion feature tests for GLM-Image, covering baseline, Tensor-Parallel, and HSDP configurations in both T2I and I2I modes.

Notes:

  1. Good test structure - 6 tests total (3 configurations × 2 modes) with proper parametrization.

  2. The Buildkite step is well-configured:

    • Appropriate timeout (120 minutes for 6 tests averaging ~3.9 minutes each)
    • Correct resource allocation (2 H100 GPUs)
    • Proper markers (full_model, diffusion, H100)
  3. Stage overrides are correctly configured:

    • TP: Stage-0 stays on device 0, Stage-1 uses devices 0,1
    • HSDP: Only Stage-1 is sharded with devices 0,1 (runtime avoids sharding Stage-0)
    • Baseline: Stage-0 on device 0, Stage-1 on device 1
  4. Test evidence is solid:

    • All 6 tests passed in 1418.49s (~23:38 total)
    • Tests validate both successful generation and expected 1024x1024 resolution
  5. The PR correctly notes dependency on PR [Bugfix] Fix for multi-stage CLI overrides, which cause features like HSDP not working for multi-stage models #3384 (multi-stage CLI overrides bugfix).

No blocking issues found. This is a valuable addition to the test suite that will help catch regressions in GLM-Image's parallel features.
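
The stage-to-device layout described in note 3 can be written down as plain data, as in the sketch below. The dictionary shape and the names STAGE_DEVICES and devices_needed are hypothetical and do not reflect the real stage-override schema. Only the device assignments come from the notes above, and the Stage-0 placement for HSDP is left out because it is not stated.

```python
# Configuration -> {stage index: physical device indices}, taken from note 3.
STAGE_DEVICES = {
    "baseline": {0: [0], 1: [1]},
    "tensor_parallel_2": {0: [0], 1: [0, 1]},
    # Only Stage-1 is sharded; the Stage-0 placement is not stated in the notes.
    "hsdp_2": {1: [0, 1]},
}


def devices_needed(layout: dict[int, list[int]]) -> int:
    """Count the distinct devices used across all stages of one configuration."""
    return len({device for devices in layout.values() for device in devices})


if __name__ == "__main__":
    for name, layout in STAGE_DEVICES.items():
        print(name, devices_needed(layout))
```

Each configuration touches two distinct devices, which is consistent with the statement that all tests use 2 cards.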

Collaborator

@yenuo26 yenuo26 left a comment

LGTM

@hsliuustc0106
Collaborator

Hi @herotai214, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 7 days. 🕐

Could you please provide an update?

  • If you're still working on it, that's great — just let us know.
  • If you're blocked on something, feel free to ask for help.
  • If this PR is no longer being pursued, please consider closing it so we can keep the review queue manageable.

Thanks for your contribution! 🙏

@herotai214
Contributor Author

Thanks for the reminder. I was working on Hunyuan for a while, and I was aware that the fix for the multi-stage CLI issue was still uncertain, so I put this PR on hold.

I now see that #3483 solves the CLI issue and was merged today.

I'll verify as soon as possible whether it helps this L4 test (supporting TP and HSDP properly through the CLI); otherwise I may need to push a revised version that removes the features not supported through the CLI.

Signed-off-by: herotai214 <herotai214@gmail.com>
@herotai214
Contributor Author

Updated the script. Verified that #3483 works and that the CLI can be used here to apply TP, SP, Cache-DiT, etc.

# with relevant logs like:
# Cache-DiT
INFO 06-01 03:27:48 [cache_dit_backend.py:1818] Cache-dit enabled successfully on GlmImagePipeline

# TP2
(APIServer pid=352086) INFO 06-01 03:30:39 [stage_utils.py:138] Stage 1 logical-to-physical device mapping: 0->4, 1->5

# HSDP2
INFO 06-01 03:36:03 [hsdp.py:128] HSDP Inference: replicate_size=1, shard_size=2, world_size=2, rank=0, fs_world_size=2, fs_rank=0

# SP2
INFO 06-01 03:20:14 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=2, ulysses=2, ring=1, use_ulysses_low=True).
INFO 06-01 03:20:14 [parallel_state.py:630] SP group details for rank 1: sp_group=[0, 1], ulysses_group=[0, 1], ring_group=[1]
INFO 06-01 03:20:14 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=2, ulysses=2, ring=1, use_ulysses_low=True).
INFO 06-01 03:20:14 [parallel_state.py:630] SP group details for rank 0: sp_group=[0, 1], ulysses_group=[0, 1], ring_group=[0]

...

================= 10 passed, 23 warnings in 1226.11s (0:20:26) =================

But after rebasing and upgrading to vllm 0.22.0 this morning with the commit [Rebase] Rebase to vllm releases/v0.22.0 (#3891), I can no longer use the CLI to apply any feature here.

I will double-check and raise an issue very soon.
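
A check like the one below could confirm from a server log which features were actually applied, which may help when narrowing down the regression. It is a sketch only: FEATURE_PATTERNS and detect_features are hypothetical names, and the patterns are derived solely from the Cache-DiT, HSDP, and SP log lines quoted in this comment, so they may not match other versions.

```python
import re

# Strips terminal color codes, if present, before matching.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

# Patterns based only on the log lines quoted in this comment.
FEATURE_PATTERNS = {
    "cache_dit": re.compile(r"Cache-dit enabled successfully"),
    "hsdp": re.compile(r"HSDP Inference: .*shard_size=\d+"),
    "sp": re.compile(r"Building SP subgroups .*sp_size=\d+"),
}


def detect_features(log_text: str) -> set[str]:
    """Return the names of the features whose log marker appears in log_text."""
    clean = ANSI_ESCAPE.sub("", log_text)
    return {
        name
        for name, pattern in FEATURE_PATTERNS.items()
        if pattern.search(clean)
    }
```

An empty result after a run that requested a feature through the CLI would suggest the override was not applied.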
