[CI] Migrate remaining B200 jobs to b200-k8s with test fixes#42387
Conversation
Migrate the 3 B200 jobs that had pre-existing test failures: 1. Spec Decode Eagle: skip test_eagle_correctness_light on Blackwell — DeepSeek head_dim=192 is not supported on SM100/SM110 2. Spec Decode Speculators+MTP: skip deepseek MTP variant on Blackwell — CUDA graph compilation hangs indefinitely 3. LM Eval Small Models: split models-blackwell.txt into single-GPU and multi-GPU configs. The 3 EP/TP2 models (Qwen3-Next-80B, Qwen3-Next-FP8, Nemotron-120B) now run in a separate step with num_devices: 2. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: khluu <khluu000@gmail.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
There was a problem hiding this comment.
Code Review
This pull request updates Buildkite CI configurations to utilize b200-k8s devices and introduces a new evaluation job for large models on B200 with multi-device support. It also moves several large models to a dedicated configuration file and adds skip markers for DeepSeek-related spec decode tests on Blackwell platforms due to hardware-specific limitations. Feedback was provided regarding incomplete source_file_dependencies for the new CI job, which may prevent the tests from triggering on relevant model code changes.
| source_file_dependencies: | ||
| - csrc/ | ||
| - vllm/model_executor/layers/quantization |
There was a problem hiding this comment.
The source_file_dependencies for the new lm-eval-large-models-b200-ep job are incomplete. This job is specifically added to test large models like Qwen3-Next, but it currently lacks dependencies on the model implementation files. This means changes to the model logic will not trigger this CI job, potentially allowing regressions to go unnoticed. It is recommended to include the relevant model files and the configuration file itself, similar to the configuration of the lm-eval-qwen3-5-models-b200 job.
source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
- vllm/model_executor/models/qwen3_next.py
- vllm/model_executor/models/qwen3_next_mtp.py
- vllm/model_executor/layers/fla/ops/
- tests/evals/gsm8k/configs/models-blackwell-ep.txt| 0.0, | ||
| marks=pytest.mark.skipif( | ||
| current_platform.is_device_capability_family(100), | ||
| reason="DeepSeek MTP CUDA graph compilation hangs on Blackwell", |
There was a problem hiding this comment.
this test still failed when I ran it recently :(
|
+1 For this, it also fix the CI failure LM Eval Small Models (B200) |
The actual failure is a flashinfer TRTLLM MoE routing check (top_k must be less than total experts in selected groups), not a CUDA graph compilation hang. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: khluu <khluu000@gmail.com>
…oject#42387) Signed-off-by: khluu <khluu000@gmail.com>
…oject#42387) Signed-off-by: khluu <khluu000@gmail.com>
…oject#42387) Signed-off-by: khluu <khluu000@gmail.com>
…oject#42387) Signed-off-by: khluu <khluu000@gmail.com>
…oject#42387) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Summary
device: b200jobs tob200-k8s, with fixes for pre-existing test failures discovered during validation in [CI] Migrate more B200 jobs to b200-k8s queue #42356spec-decode-eagle-nightly-b200): Skiptest_eagle_correctness_lighton Blackwell — DeepSeekhead_dim=(192,192)is not supported on SM100/SM110 with FLASH_ATTNspec-decode-speculators-mtp-nightly-b200): SkipdeepseekMTP variant on Blackwell — CUDA graph compilation hangs indefinitely during model initlm-eval-small-models-b200): Splitmodels-blackwell.txt— the 3 EP/TP2 models (Qwen3-Next-80B, Qwen3-Next-FP8, Nemotron-120B) were crashing because they require 2 GPUs but the step only had 1. Moved them to a newmodels-blackwell-ep.txtwith a dedicated 2-GPU step.After this PR + #42356, all B200 jobs are on the
b200-k8squeue.Test plan
🤖 Generated with Claude Code