[CI] Add mandatory H100 TP=2 smoke test #36157
stecasta wants to merge 1 commit into vllm-project:main
Conversation
Add a smoke test that starts a vLLM server with TP=2 on H100 using default settings and verifies it can serve requests across dense BF16, dense FP8, and MoE BF16 models. Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
@ProExpertProg @zou3519 @benchislett could you review when you get a chance?
Code Review
This pull request adds a mandatory smoke test for Tensor Parallelism (TP=2) on H100 GPUs to catch regressions in the default serving path. The changes include a new Buildkite CI step and the corresponding pytest test file. The test covers dense, FP8, and MoE models. My review found one critical issue in the new test file where one of the selected models for testing is not registered, which will cause the test to fail.
```python
MODELS = [
    "meta-llama/Llama-3.2-1B-Instruct",
    "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
    "microsoft/Phi-mini-MoE-instruct",
]
```
The model `microsoft/Phi-mini-MoE-instruct` is not present in `tests/models/registry.py`. This will cause `HF_EXAMPLE_MODELS.find_hf_info(model_id)` on line 24 to raise a `ValueError`, failing the test for this model.
To fix this, you can either add the model to the registry or replace it with an existing small MoE model. For example, you could use `TitanML/tiny-mixtral`, which is already in the registry.
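The failure mode can be illustrated with a toy registry lookup. This is a minimal sketch: the function name mirrors the `find_hf_info` helper referenced above, but the registry contents and implementation here are illustrative, not vLLM's actual code:

```python
# Illustrative stand-in for the model registry in tests/models/registry.py.
# The real registry maps model IDs to richer metadata; a dict suffices here.
REGISTRY = {
    "meta-llama/Llama-3.2-1B-Instruct": {"arch": "LlamaForCausalLM"},
    "TitanML/tiny-mixtral": {"arch": "MixtralForCausalLM"},
}

def find_hf_info(model_id: str) -> dict:
    """Mimic HF_EXAMPLE_MODELS.find_hf_info: raise for unregistered models."""
    if model_id not in REGISTRY:
        raise ValueError(f"{model_id} is not registered")
    return REGISTRY[model_id]

# A registered model resolves normally...
print(find_hf_info("TitanML/tiny-mixtral"))
# ...while an unregistered one raises ValueError, which would fail
# the parametrized test case before the server even starts.
try:
    find_hf_info("microsoft/Phi-mini-MoE-instruct")
except ValueError as exc:
    print(f"ValueError: {exc}")
```

This is why swapping in a model that is already registered (or registering the new one) is the fix.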
```diff
 MODELS = [
     "meta-llama/Llama-3.2-1B-Instruct",
     "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
-    "microsoft/Phi-mini-MoE-instruct",
+    "TitanML/tiny-mixtral",
 ]
```
The current H100 tests are optional due to capacity constraints; they run every night or on a conditional basis. Closing this. Please feel free to join #ci-notifications, where we monitor and triage nightly status.
@robertgshaw2-redhat @stecasta do you think it makes more sense to target B200 instead?
From the test perspective it makes sense to target B200. The idea is to avoid having a broken main that slows down development. Ideally we would target both Hopper and Blackwell, but we can start with the less scarce of the two.
Summary
Add a mandatory test that starts a vLLM server with TP=2 on H100 using default settings and verifies it can serve requests across 3 models (dense BF16, dense FP8, MoE BF16).
Context
No mandatory test currently starts a vLLM server on Hopper/Blackwell with TP>1, DP=1, and default config. On this hardware, vLLM auto-enables optimizations like `fuse_allreduce_rms` and CUDA graphs. Existing mandatory tests either run on L4 (where these are disabled), use DP>1, or set `CUDAGraphMode.NONE`. This means regressions in the default TP serving path on Hopper/Blackwell can go undetected until users hit them. For example, #34109 introduced a cudagraph capture hang (#35772) that passed all mandatory CI. The issue surfaced in an optional nightly LM eval test but took several days before it was noticed and reported.
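For reference, the default serving path this test exercises can be sketched from the command line. This is an illustrative sequence, not the CI step itself; it assumes a machine with two H100 GPUs, the `vllm` CLI installed, and one of the models from the PR's list:

```shell
# Start a vLLM server with TP=2 and otherwise default settings, so the
# Hopper-specific defaults (fuse_allreduce_rms, CUDA graphs) stay enabled.
vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --tensor-parallel-size 2 &

# Once the server is up, verify it can serve a request through the
# OpenAI-compatible completions endpoint on the default port.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.2-1B-Instruct",
          "prompt": "Hello,",
          "max_tokens": 8
        }'
```

The smoke test automates essentially this loop across the three models, failing fast if server startup hangs (e.g. during cudagraph capture) or the request errors.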
Models
I chose these 3 models to cover dense, FP8, and MoE architectures with minimal overhead (~8 min total on H100). Open to suggestions for additional models, quantizations, or platforms.
Test plan