Nextstep online e2e#2107
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b65437cd72
```
"--vae-use-slicing",
"--vae-use-tiling",
```
Exercise an actual VAE optimization path for NextStep
For stepfun-ai/NextStep-1.1, these flags do not change server behavior today: vllm_omni/diffusion/models/nextstep_1_1/modeling_flux_vae.py implements AutoencoderKL.decode() as a plain decoder call, the class never defines use_slicing, and vllm_omni/diffusion/registry.py only applies vae_use_slicing when that attribute exists. In other words, this parametrization cannot catch regressions in VAE slicing or tiling for NextStep; it only ends up validating TP=2, which gives the nightly L4 expansion suite false coverage for the advertised VAE paths.
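The failure mode described above can be sketched with a minimal, dependency-free example of the attribute-existence guard (the class and function names below are illustrative stand-ins, not the actual vllm_omni code):

```python
class PlainVAE:
    """Stand-in for a VAE whose decode() is a plain decoder call
    (as described for modeling_flux_vae.py); it exposes no slicing hook."""

    def decode(self, latents):
        return latents


class SlicingVAE(PlainVAE):
    """Stand-in for a VAE that actually supports slicing."""

    def __init__(self):
        self.use_slicing = False

    def enable_slicing(self):
        self.use_slicing = True


def apply_vae_use_slicing(vae) -> bool:
    # Mirrors the guard described for registry.py: the option only takes
    # effect when the VAE object exposes the corresponding hook, so for a
    # plain VAE the flag is a silent no-op.
    if hasattr(vae, "enable_slicing"):
        vae.enable_slicing()
        return True
    return False


assert apply_vae_use_slicing(PlainVAE()) is False   # flag silently ignored
assert apply_vae_use_slicing(SlicingVAE()) is True  # hook actually invoked
```

Because the flag is dropped silently for a plain VAE, the parametrized test still passes, which is exactly why it cannot detect regressions in the slicing/tiling paths.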
hsliuustc0106 left a comment
Test Evidence Request
Thank you for adding NextStep-1.1 online serving coverage! Before merging, could you please provide some baseline test evidence to help us understand the model's characteristics and validate the CI configuration?
Requested Evidence
1. Latency
- Time to generate a single image with the test config (TP=2, 256x256, 2 steps)
- If available, also include latency for a more realistic config (e.g., 512x512, 20-30 steps)
2. Accuracy / Correctness
- Sample output image(s) from a test run, OR
- Validation that `send_diffusion_request` returns a valid image with expected dimensions
3. Memory
- Peak VRAM usage with TP=2 on L4 (or your XPU equivalent)
- This helps validate that 2× L4 is sufficient (per your PR description) vs. needing TP=4 on 4× L4
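For item 2, the dimension check can be done without any imaging library by reading the PNG header directly. A self-contained sketch (the byte offsets follow the PNG spec; `make_png` just builds a tiny in-memory PNG so the check can be demonstrated):

```python
import struct
import zlib


def png_dimensions(data: bytes):
    # PNG layout: 8-byte signature, then the IHDR chunk:
    # length(4) + type "IHDR"(4) + width(4) + height(4) + ...
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG"
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def make_png(width: int, height: int) -> bytes:
    """Build a minimal all-black RGB PNG in memory (for demonstration only)."""
    def chunk(typ: bytes, payload: bytes) -> bytes:
        return (struct.pack(">I", len(payload)) + typ + payload
                + struct.pack(">I", zlib.crc32(typ + payload)))

    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline: 1 filter byte + 3 bytes (RGB) per pixel.
    raw = b"".join(b"\x00" + b"\x00\x00\x00" * width for _ in range(height))
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))


assert png_dimensions(make_png(256, 256)) == (256, 256)
```

In the actual test, `png_dimensions` would be applied to the image bytes returned by `send_diffusion_request`, asserting they match the requested resolution.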
How to Provide
You can add this directly in the PR description or as a comment. Example format:
### Test Results (Intel XPU / 2× L4)
| Config | Resolution | Steps | Latency | Peak Memory |
|--------|------------|-------|---------|-------------|
| TP=2 | 256x256 | 2 | X.XX s | XX GB |
| TP=2 | 512x512 | 28 | XX.XX s | XX GB |
Sample output: [attach image or describe validation]
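The latency column above can be collected with a simple wall-clock wrapper. A minimal stand-alone sketch; `send_diffusion_request` is stubbed here, while the real call comes from `OpenAIClientHandler` against the running `OmniServer`:

```python
import time


def send_diffusion_request(prompt, width, height, steps):
    # Stub standing in for OpenAIClientHandler.send_diffusion_request;
    # the real call returns an image payload from the server.
    time.sleep(0.01)
    return {"width": width, "height": height}


def timed_request(**kwargs):
    # perf_counter is monotonic and suitable for measuring elapsed time.
    start = time.perf_counter()
    image = send_diffusion_request(**kwargs)
    latency = time.perf_counter() - start
    return image, latency


image, latency = timed_request(prompt="a red apple", width=256, height=256, steps=2)
assert image["width"] == 256 and latency > 0
print(f"latency: {latency:.2f} s")
```

Running this once per config (TP=2, 256x256 @ 2 steps; 512x512 @ 28 steps) would fill in the Latency column of the table.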
This information helps us:
- Validate the CI hardware allocation is appropriate
- Establish a baseline for future regression detection
- Document expected behavior for this model
Note: The docs/readthedocs.org failure appears transient (this PR doesn't modify documentation). It should pass on re-run.
```python
from tests.utils import hardware_marks


# Same host class as FLUX.2-klein expansion (g6.12xlarge / gpu_4_queue); TP=2 needs 2 devices.
TWO_CARD_L4_MARKS = hardware_marks(res={"cuda": "L4"}, num_cards=2)
```
I tested it locally on XPU but removed it from the PR to stay consistent with the existing diffusion E2E tests, which currently target CUDA L4. Would you prefer that I also include XPU coverage, or should we keep it aligned with the CUDA-only tests for now?
@hsliuustc0106 Thanks for the checklist. I've updated the PR description with test results and sample outputs. For memory, we see about ~20.2 GB peak reserved per GPU (PyTorch worker log on Intel XPU, TP=2) and about ~16–16.6 GB allocated, depending on resolution; reserved vs. allocated is called out in the description. I also changed the E2E config to 512×512 @ 20 steps so the test matches a more realistic setting.

Please fix the CI error. Thanks.
Force-pushed 6ad2084 to db5ad57.
Force-pushed 8f5962d to 26c69f4.
Are you ready? If not, I can remove the label so we can save some resources.
Signed-off-by: Joshna-Medisetty <joshna.medisetty@intel.com>
@Gaohan123 PR is ready and passes tests on L4.

Hi @Gaohan123, the PR is ready for merge. Could you please add the label? Thank you.

Hi, I just added the nightly label for L4-level tests to double-check.
The failing tests are unrelated. We may need a force merge.

Seems like #2435. It's OK not to merge main if there is no conflict. Thanks for your patience and continuous tracking of this PR; it has really been a while. But since you happen to have done another merge, let's wait one more time (after that issue is resolved -> you update the branch) 🙏
Signed-off-by: Joshna Medisetty <joshna.medisetty@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
Add an online serving E2E test for NextStep-1.1 text-to-image using `OmniServer` and `OpenAIClientHandler.send_diffusion_request`, consistent with the other online diffusion tests in the repo. Supports broader L4 diffusion online coverage under RFC #1832.

What's included
- Model: `stepfun-ai/NextStep-1.1`
- Pipeline: `NextStep11Pipeline` (single lightweight case; no Cache-DiT / Ulysses / FP8 stack)
- Markers: `advanced_model`, `diffusion`, `L4`, `distributed_cuda(2)`

Test plan
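The request the test sends can be sketched as a plain payload builder, using the updated config discussed in the review thread (512×512 @ 20 steps). The parameter names below follow common diffusion-API conventions and are illustrative, not the exact client signature:

```python
def build_request(model="stepfun-ai/NextStep-1.1", width=512, height=512, steps=20):
    """Illustrative payload for an online text-to-image request; the real
    test passes equivalent arguments through send_diffusion_request."""
    return {
        "model": model,
        "width": width,
        "height": height,
        "num_inference_steps": steps,
    }


req = build_request()
assert req["width"] == 512 and req["num_inference_steps"] == 20
```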
Test Results (Intel XPU / 2× L4)
Sample Output
256×256 @ 2 steps: [attached image]
256×256 @ 20 steps: [attached image]
512×512 @ 15 steps: [attached image]
512×512 @ 20 steps: [attached image]
512×512 @ 28 steps: [attached image]