
[CI] Add mandatory H100 TP=2 smoke test #36157

Closed
stecasta wants to merge 1 commit into vllm-project:main from stecasta:add-h100-tp2-smoke-test

Conversation


stecasta (Contributor) commented Mar 5, 2026

Summary

Add a mandatory test that starts a vLLM server with TP=2 on H100 using default settings and verifies it can serve requests across 3 models (dense BF16, dense FP8, MoE BF16).

Context

No mandatory test currently starts a vLLM server on Hopper/Blackwell with TP>1, DP=1, and default config. On this hardware, vLLM auto-enables optimizations like fuse_allreduce_rms and cudagraphs. Existing mandatory tests either run on L4 (where these are disabled), use DP>1, or set CUDAGraphMode.NONE.

This means regressions in the default TP serving path on Hopper/Blackwell can go undetected until users hit them. For example, #34109 introduced a cudagraph capture hang (#35772) that passed all mandatory CI. The issue surfaced in an optional nightly LM eval test but took several days before it was noticed and reported.

Models

| Model | Arch | Quant |
|---|---|---|
| meta-llama/Llama-3.2-1B-Instruct | Dense | BF16 |
| RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 | Dense | FP8 |
| microsoft/Phi-mini-MoE-instruct | MoE | BF16 |

I chose these 3 models to cover dense, FP8, and MoE architectures with minimal overhead (~8 min total on H100). Open to suggestions for additional models, quantizations, or platforms.

Test plan

  • Verify test passes on H100 with 2 GPUs
  • Confirm runtime fits within 15 min timeout
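
The test plan above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual test file (which presumably uses vLLM's own pytest helpers and model registry); the port, timeout, and prompt values are assumptions:

```python
# Hypothetical sketch of the TP=2 smoke test: launch `vllm serve` with
# default settings (only TP set), wait for /health, then issue one
# completion request per model.
import json
import subprocess
import time
import urllib.request

MODELS = [
    "meta-llama/Llama-3.2-1B-Instruct",         # dense BF16
    "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",  # dense FP8
    "microsoft/Phi-mini-MoE-instruct",          # MoE BF16
]

PORT = 8000  # assumed port


def server_cmd(model: str) -> list[str]:
    # Only TP is set explicitly, so the H100 auto-enabled optimizations
    # (cudagraphs, fused allreduce+RMSNorm) stay on, which is the point
    # of the test.
    return ["vllm", "serve", model,
            "--tensor-parallel-size", "2", "--port", str(PORT)]


def wait_for_health(timeout: float = 600.0) -> None:
    # Poll the server's /health endpoint until it responds or we time out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"http://localhost:{PORT}/health") as r:
                if r.status == 200:
                    return
        except OSError:
            time.sleep(5)
    raise TimeoutError("server did not become healthy")


def smoke_request(model: str) -> str:
    # One short completion request is enough to exercise the captured
    # cudagraphs and the TP allreduce path.
    body = json.dumps({"model": model, "prompt": "Hello",
                       "max_tokens": 8}).encode()
    req = urllib.request.Request(
        f"http://localhost:{PORT}/v1/completions",
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)["choices"][0]["text"]


def run_smoke_test() -> None:
    for model in MODELS:
        proc = subprocess.Popen(server_cmd(model))
        try:
            wait_for_health()
            assert smoke_request(model)  # any non-empty completion passes
        finally:
            proc.terminate()
            proc.wait()
```

Serializing the three models (rather than parametrizing them across parallel jobs) keeps the test on a single 2-GPU runner, which matches the ~8 min total runtime estimate above.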

Add a smoke test that starts a vLLM server with TP=2 on H100 using
default settings and verifies it can serve requests across dense BF16,
dense FP8, and MoE BF16 models.

Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>

stecasta commented Mar 5, 2026

@ProExpertProg @zou3519 @benchislett could you review when you get a chance?

gemini-code-assist (bot) left a comment

Code Review

This pull request adds a mandatory smoke test for Tensor Parallelism (TP=2) on H100 GPUs to catch regressions in the default serving path. The changes include a new Buildkite CI step and the corresponding pytest test file. The test covers dense, FP8, and MoE models. My review found one critical issue in the new test file where one of the selected models for testing is not registered, which will cause the test to fail.

Comment on lines +15 to +19
```python
MODELS = [
    "meta-llama/Llama-3.2-1B-Instruct",
    "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
    "microsoft/Phi-mini-MoE-instruct",
]
```
critical

The model microsoft/Phi-mini-MoE-instruct is not present in tests/models/registry.py. This will cause HF_EXAMPLE_MODELS.find_hf_info(model_id) on line 24 to raise a ValueError, failing the test for this model.

To fix this, you can either add the model to the registry or replace it with an existing small MoE model. For example, you could use TitanML/tiny-mixtral, which is already in the registry.

Suggested change

```diff
 MODELS = [
     "meta-llama/Llama-3.2-1B-Instruct",
     "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
-    "microsoft/Phi-mini-MoE-instruct",
+    "TitanML/tiny-mixtral",
 ]
```

robertgshaw2-redhat (Collaborator) commented:

The current H100 tests are optional due to capacity constraints; they run nightly or on a conditional basis.

Closing this. Please feel free to join #ci-notifications, where we monitor and triage nightly status.


xinli-sw commented Mar 5, 2026

@robertgshaw2-redhat @stecasta do you think it makes more sense to target B200 instead?


stecasta commented Mar 6, 2026

From the test's perspective it makes sense to target B200. The goal is to avoid a broken main that slows down development. Ideally we would cover both Hopper and Blackwell, but we can start with whichever is less scarce.



3 participants