
Nextstep online e2e #2107

Merged
hsliuustc0106 merged 18 commits into vllm-project:main from Joshna-Medisetty:nextstep-online-e2e-xpu
Apr 17, 2026

Conversation

Contributor

@Joshna-Medisetty Joshna-Medisetty commented Mar 23, 2026

Purpose

Add an online serving E2E test for NextStep-1.1 text-to-image using OmniServer and OpenAIClientHandler.send_diffusion_request, consistent with other online diffusion tests in the repo. Supports broader L4 diffusion online coverage under RFC #1832.

What’s included

  • Model: NextStep-1.1 (stepfun-ai/NextStep-1.1).
  • Server config: TP=2, NextStep11Pipeline (single lightweight case; no Cache-DiT / Ulysses / FP8 stack).
  • Pytest marks: advanced_model, diffusion, L4, distributed_cuda(2).
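As a rough sketch of what the test's server setup looks like (the flag names, the `omni-serve` binary name, and the helper below are illustrative assumptions, not the repo's exact API):

```python
# Hypothetical sketch of the NextStep-1.1 online E2E server setup.
# Flag names and the helper are assumptions, not vllm-omni's real CLI.

MODEL_ID = "stepfun-ai/NextStep-1.1"

# Server config from the PR: TP=2, single lightweight pipeline case
# (no Cache-DiT / Ulysses / FP8 stack).
SERVER_ARGS = [
    "--model", MODEL_ID,
    "--tensor-parallel-size", "2",
]

def build_server_cmd(binary: str = "omni-serve") -> list[str]:
    """Assemble the (hypothetical) launch command for the diffusion server."""
    return [binary, *SERVER_ARGS]

print(build_server_cmd())
```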

Test plan

  • Ran the test successfully on Intel XPU using the same server flags (local validation; the committed test stays CUDA L4–marked for CI).
  • Expect 2× L4 to be sufficient in CI.

Test Results (Intel XPU / 2× L4)

| Config | Resolution | Steps | Latency | Peak memory reserved | Peak memory allocated |
|--------|------------|-------|---------|----------------------|-----------------------|
| TP=2   | 256x256    | 2     | 25.7 s  | ~20.2 GB             | ~16.1 GB              |
| TP=2   | 256x256    | 20    | 57.3 s  | ~20.2 GB             | ~16.1 GB              |
| TP=2   | 512x512    | 15    | 191.5 s | ~20.2 GB             | ~16.6 GB              |
| TP=2   | 512x512    | 20    | 230.0 s | ~20.2 GB             | ~16.6 GB              |
| TP=2   | 512x512    | 28    | 286.0 s | ~20.2 GB             | ~16.6 GB              |
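Latency grows roughly linearly in step count; a quick back-of-the-envelope check on the numbers above:

```python
# Per-step latency derived from the measurements in the table above.
measurements = {  # (resolution, steps) -> latency in seconds
    (256, 2): 25.7,
    (256, 20): 57.3,
    (512, 15): 191.5,
    (512, 20): 230.0,
    (512, 28): 286.0,
}

def per_step_seconds(res: int, steps_a: int, steps_b: int) -> float:
    """Incremental seconds per denoising step between two runs."""
    delta_t = measurements[(res, steps_b)] - measurements[(res, steps_a)]
    return delta_t / (steps_b - steps_a)

print(round(per_step_seconds(256, 2, 20), 2))   # ~1.76 s/step at 256x256
print(round(per_step_seconds(512, 15, 28), 2))  # ~7.27 s/step at 512x512
```

The 512×512 20→28-step delta gives (286.0 − 230.0) / 8 = 7.0 s/step, within a few percent of the 15→28 figure, consistent with a fixed startup cost plus near-linear per-step time.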

Sample Output

256×256 @ 2 steps
nextstep_256_2

256×256 @ 20 steps
nextstep_256_20

512×512 @ 15 steps
nextstep_512_15

512×512 @ 20 steps
nextstep_512_20

512×512 @ 28 steps
nextstep_512_28

@Joshna-Medisetty
Contributor Author

@xuechendi @fhfuih


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b65437cd72

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +34 to +35
"--vae-use-slicing",
"--vae-use-tiling",

P2: Exercise an actual VAE optimization path for NextStep

For stepfun-ai/NextStep-1.1, these flags do not change the server behavior today: vllm_omni/diffusion/models/nextstep_1_1/modeling_flux_vae.py implements AutoencoderKL.decode() as a plain decoder call and the class never defines use_slicing, while vllm_omni/diffusion/registry.py only applies vae_use_slicing when that attribute exists. In other words, this parametrization cannot catch regressions in VAE slicing or tiling for NextStep and only ends up validating TP=2, which gives the nightly L4 expansion suite false coverage for the advertised VAE paths.
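The gating the review describes can be sketched like this (class and function names are simplified stand-ins, not vllm_omni's actual code):

```python
# Simplified stand-ins illustrating why --vae-use-slicing is a no-op for
# NextStep: the registry only applies the flag when the VAE exposes the
# attribute, and NextStep's flux VAE does not define it.

class NextStepVAE:
    """Stand-in for NextStep's AutoencoderKL: plain decode, no use_slicing."""
    def decode(self, latents):
        return f"decoded({latents})"

class SlicingCapableVAE(NextStepVAE):
    """A VAE that does expose the slicing toggle."""
    use_slicing = False

def apply_vae_flags(vae, vae_use_slicing=False):
    """Mimics the registry: flip the toggle only if the attribute exists."""
    if vae_use_slicing and hasattr(vae, "use_slicing"):
        vae.use_slicing = True
    return vae

print(hasattr(apply_vae_flags(NextStepVAE(), True), "use_slicing"))  # False: flag silently ignored
print(apply_vae_flags(SlicingCapableVAE(), True).use_slicing)        # True: flag takes effect
```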


@Joshna-Medisetty Joshna-Medisetty changed the title from Nextstep online e2e xpu to Nextstep online e2e Mar 23, 2026
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Test Evidence Request

Thank you for adding NextStep-1.1 online serving coverage! Before merging, could you please provide some baseline test evidence to help us understand the model's characteristics and validate the CI configuration?

Requested Evidence

1. Latency

  • Time to generate a single image with the test config (TP=2, 256x256, 2 steps)
  • If available, also include latency for a more realistic config (e.g., 512x512, 20-30 steps)

2. Accuracy / Correctness

  • Sample output image(s) from a test run, OR
  • Validation that send_diffusion_request returns a valid image with expected dimensions

3. Memory

  • Peak VRAM usage with TP=2 on L4 (or your XPU equivalent)
  • This helps validate that 2× L4 is sufficient (per your PR description) vs. needing TP=4 on 4× L4
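One lightweight way to collect the requested peak-memory numbers is to grep them out of the worker logs; a minimal sketch (the exact log-line format here is an assumption and varies by backend):

```python
import re

# Hypothetical worker log excerpt; real formats differ per backend.
log = """\
[worker-0] peak memory reserved: 20.2 GiB, peak memory allocated: 16.6 GiB
[worker-1] peak memory reserved: 20.1 GiB, peak memory allocated: 16.5 GiB
"""

pattern = re.compile(
    r"peak memory reserved: ([\d.]+) GiB, peak memory allocated: ([\d.]+) GiB"
)
peaks = [(float(r), float(a)) for r, a in pattern.findall(log)]

# Report the worst-case device, which is what matters for fitting on an L4.
worst_reserved = max(r for r, _ in peaks)
worst_allocated = max(a for _, a in peaks)
print(worst_reserved, worst_allocated)  # 20.2 16.6
```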

How to Provide

You can add this directly in the PR description or as a comment. Example format:

### Test Results (Intel XPU / 2× L4)

| Config | Resolution | Steps | Latency | Peak Memory |
|--------|------------|-------|---------|-------------|
| TP=2   | 256x256    | 2     | X.XX s  | XX GB       |
| TP=2   | 512x512    | 28    | XX.XX s | XX GB       |

Sample output: [attach image or describe validation]

This information helps us:

  • Validate the CI hardware allocation is appropriate
  • Establish a baseline for future regression detection
  • Document expected behavior for this model

Note: The docs/readthedocs.org failure appears transient (this PR doesn't modify documentation). It should pass on re-run.

Collaborator

@congw729 congw729 left a comment


LGTM

from tests.utils import hardware_marks

# Same host class as FLUX.2-klein expansion (g6.12xlarge / gpu_4_queue); TP=2 needs 2 devices.
TWO_CARD_L4_MARKS = hardware_marks(res={"cuda": "L4"}, num_cards=2)
Contributor


test only on CUDA?

Contributor Author

@Joshna-Medisetty Joshna-Medisetty Mar 24, 2026


I tested it locally on XPU but removed it from the PR to stay consistent with the existing diffusion E2E tests, which currently target CUDA L4. Would you prefer that I also include XPU coverage, or should we keep it aligned with CUDA-only tests for now?

@Joshna-Medisetty
Contributor Author

@hsliuustc0106 Thanks for the checklist. I’ve updated the PR description with test results and sample outputs.

For memory, we see ~20.2 GB peak reserved per GPU (from the PyTorch worker log on Intel XPU, TP=2) and ~16.1–16.6 GB allocated depending on resolution; the reserved vs. allocated distinction is called out in the description.
With ~20 GB per device, 2× L4 (24 GB each) should be enough for this setup without TP=4, though CUDA/L4 numbers may differ somewhat from these XPU measurements.
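The headroom arithmetic behind that claim (L4 capacity and the measured peak are taken from this thread; treating the XPU peak as a proxy for CUDA is an assumption):

```python
# Rough headroom check: does peak reserved memory fit on a 24 GB L4?
l4_capacity_gb = 24.0      # per-device memory on an NVIDIA L4
peak_reserved_gb = 20.2    # measured on Intel XPU at TP=2 (CUDA proxy)

headroom_gb = l4_capacity_gb - peak_reserved_gb
print(f"{headroom_gb:.1f} GB headroom per device")  # 3.8 GB
assert headroom_gb > 0, "would need TP=4 across 4x L4 instead"
```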

I also changed the E2E config to 512×512 @ 20 steps so the test matches a more realistic setting.

@Gaohan123 Gaohan123 added the nightly-test (trigger buildkite nightly test CI) label and removed the ready (trigger buildkite CI) label Mar 25, 2026
@Gaohan123
Collaborator

Please fix the CI error. Thanks

@Joshna-Medisetty Joshna-Medisetty force-pushed the nextstep-online-e2e-xpu branch 3 times, most recently from 6ad2084 to db5ad57 Compare March 26, 2026 22:35
Signed-off-by: Joshna Medisetty <joshna.medisetty@intel.com>
@Joshna-Medisetty Joshna-Medisetty force-pushed the nextstep-online-e2e-xpu branch from 8f5962d to 26c69f4 Compare March 27, 2026 06:36
@Gaohan123
Collaborator

Are you ready? If not, I can remove the label so we can save some resources.

@Gaohan123 Gaohan123 removed the nightly-test (trigger buildkite nightly test CI) label Mar 27, 2026
@Joshna-Medisetty
Contributor Author

@Gaohan123 PR is ready and passes tests on L4.
I’m currently seeing a failure in the Qwen image diffusion perf test, which does not include the NextStep test introduced in this PR, so it appears unrelated to these changes.

@Joshna-Medisetty
Contributor Author

Hi @Gaohan123, the PR is ready for merge. Could you please add the label? Thank you.

@congw729 congw729 added the nightly-test (trigger buildkite nightly test CI) label Apr 1, 2026
@congw729
Collaborator

congw729 commented Apr 1, 2026

> Hi @Gaohan123, PR is ready for merge. Could please add the label. Thank you.

Hi, I just added the nightly label for L4-level tests to double-check.

Signed-off-by: Joshna-Medisetty <joshna.medisetty@intel.com>
Signed-off-by: Joshna-Medisetty <joshna.medisetty@intel.com>
@Joshna-Medisetty
Contributor Author

The failing tests are unrelated. It may need a force merge.

@fhfuih
Contributor

fhfuih commented Apr 2, 2026

Seems like #2435. It's OK to not merge main if there is no conflict.

Thanks for your patience and continuous tracking of this PR. It has really been a while. But since you happen to have done another merge, let's wait one more time (after that issue is resolved -> you update the branch) 🙏

@yenuo26 yenuo26 added the ready (trigger buildkite CI) label and removed the nightly-test (trigger buildkite nightly test CI) label Apr 14, 2026
@hsliuustc0106 hsliuustc0106 merged commit b5ddff7 into vllm-project:main Apr 17, 2026
8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
Signed-off-by: Joshna Medisetty <joshna.medisetty@intel.com>
Signed-off-by: Joshna-Medisetty <joshna.medisetty@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
Signed-off-by: Joshna Medisetty <joshna.medisetty@intel.com>
Signed-off-by: Joshna-Medisetty <joshna.medisetty@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Labels

ready label to trigger buildkite CI


7 participants