
[CI] Migrate remaining B200 jobs to b200-k8s with test fixes #42387

Merged
vllm-bot merged 2 commits into main from ci/b200-k8s-remaining-fixes on May 12, 2026

Conversation

@khluu (Member) commented May 12, 2026

Summary

  • Migrates the last 3 device: b200 jobs to b200-k8s, with fixes for pre-existing test failures discovered during validation in [CI] Migrate more B200 jobs to b200-k8s queue #42356
  • Spec Decode Eagle (spec-decode-eagle-nightly-b200): Skip test_eagle_correctness_light on Blackwell — DeepSeek head_dim=(192,192) is not supported on SM100/SM110 with FLASH_ATTN
  • Spec Decode Speculators+MTP (spec-decode-speculators-mtp-nightly-b200): Skip deepseek MTP variant on Blackwell — CUDA graph compilation hangs indefinitely during model init
  • LM Eval Small Models (lm-eval-small-models-b200): Split models-blackwell.txt — the 3 EP/TP2 models (Qwen3-Next-80B, Qwen3-Next-FP8, Nemotron-120B) were crashing because they require 2 GPUs but the step only had 1. Moved them to a new models-blackwell-ep.txt with a dedicated 2-GPU step.

After this PR + #42356, all B200 jobs are on the b200-k8s queue.
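The LM Eval split described above can be sketched as a simple partition by GPU requirement. This is a minimal illustration, not the PR's actual change: the `EvalConfig` type, the `gpus_required` field, and the config names are all hypothetical, and the real `models-blackwell.txt` and `models-blackwell-ep.txt` files may be organized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    name: str           # hypothetical config file name, not taken from the repo
    gpus_required: int  # hypothetical field: GPUs the model needs (e.g. EP/TP2 -> 2)


def split_by_gpu_count(configs, step_gpus=1):
    """Partition configs into those that fit the step and those that do not."""
    fits = [c.name for c in configs if c.gpus_required <= step_gpus]
    overflow = [c.name for c in configs if c.gpus_required > step_gpus]
    return fits, overflow


# Illustrative entries only; the real list files hold different names.
configs = [
    EvalConfig("small-model-a.yaml", 1),
    EvalConfig("qwen3-next-80b.yaml", 2),
    EvalConfig("small-model-b.yaml", 1),
    EvalConfig("nemotron-120b.yaml", 2),
]

single_gpu, multi_gpu = split_by_gpu_count(configs, step_gpus=1)
print("stays in the 1-GPU list:", single_gpu)
print("moves to the 2-GPU list:", multi_gpu)
```

Keeping the multi-GPU entries in their own list lets the 1-GPU step run unchanged while the new step requests two devices.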

Test plan

  • Pre-existing failures identified via build #65711
  • Trigger build with NOAUTO=1, unblock the 3 affected B200 jobs to validate fixes

🤖 Generated with Claude Code

Migrate the 3 B200 jobs that had pre-existing test failures:

1. Spec Decode Eagle: skip test_eagle_correctness_light on Blackwell
   — DeepSeek head_dim=192 is not supported on SM100/SM110

2. Spec Decode Speculators+MTP: skip deepseek MTP variant on Blackwell
   — CUDA graph compilation hangs indefinitely

3. LM Eval Small Models: split models-blackwell.txt into single-GPU
   and multi-GPU configs. The 3 EP/TP2 models (Qwen3-Next-80B,
   Qwen3-Next-FP8, Nemotron-120B) now run in a separate step with
   num_devices: 2.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: khluu <khluu000@gmail.com>


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates Buildkite CI configurations to utilize b200-k8s devices and introduces a new evaluation job for large models on B200 with multi-device support. It also moves several large models to a dedicated configuration file and adds skip markers for DeepSeek-related spec decode tests on Blackwell platforms due to hardware-specific limitations. Feedback was provided regarding incomplete source_file_dependencies for the new CI job, which may prevent the tests from triggering on relevant model code changes.

Comment on lines +57 to +59
source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
Contributor


high

The source_file_dependencies for the new lm-eval-large-models-b200-ep job are incomplete. This job is specifically added to test large models like Qwen3-Next, but it currently lacks dependencies on the model implementation files. This means changes to the model logic will not trigger this CI job, potentially allowing regressions to go unnoticed. It is recommended to include the relevant model files and the configuration file itself, similar to the configuration of the lm-eval-qwen3-5-models-b200 job.

  source_file_dependencies:
  - csrc/
  - vllm/model_executor/layers/quantization
  - vllm/model_executor/models/qwen3_next.py
  - vllm/model_executor/models/qwen3_next_mtp.py
  - vllm/model_executor/layers/fla/ops/
  - tests/evals/gsm8k/configs/models-blackwell-ep.txt
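To illustrate the concern, here is a rough sketch of how a dependency list can gate a job. It assumes the CI runs a job when any changed file path starts with one of the listed entries. That prefix-matching rule and the `job_is_triggered` helper are assumptions for illustration, not the pipeline's confirmed behavior.

```python
def job_is_triggered(changed_files, source_file_dependencies):
    """Return True if any changed path falls under a listed dependency prefix.

    Prefix matching is an assumed rule for this sketch.
    """
    return any(
        path.startswith(dep)
        for path in changed_files
        for dep in source_file_dependencies
    )


# Dependencies as quoted from the diff under review.
original_deps = ["csrc/", "vllm/model_executor/layers/quantization"]

# Dependencies with the additions suggested in this comment.
suggested_deps = original_deps + [
    "vllm/model_executor/models/qwen3_next.py",
    "vllm/model_executor/models/qwen3_next_mtp.py",
    "vllm/model_executor/layers/fla/ops/",
    "tests/evals/gsm8k/configs/models-blackwell-ep.txt",
]

model_change = ["vllm/model_executor/models/qwen3_next.py"]
print(job_is_triggered(model_change, original_deps))   # False
print(job_is_triggered(model_change, suggested_deps))  # True
```

Under that assumption, a change touching only the Qwen3-Next model file would skip the job with the original list and run it with the extended one.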

    0.0,
    marks=pytest.mark.skipif(
        current_platform.is_device_capability_family(100),
        reason="DeepSeek MTP CUDA graph compilation hangs on Blackwell",
Member Author


this test still failed when I ran it recently :(

@haosdent
Contributor

+1 for this. It also fixes the LM Eval Small Models (B200) CI failure
I saw before.

The actual failure is a flashinfer TRTLLM MoE routing check
(top_k must be less than total experts in selected groups),
not a CUDA graph compilation hang.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: khluu <khluu000@gmail.com>
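
The constraint quoted in the commit message can be sketched as follows. This assumes grouped top-k routing in which the experts are divided evenly into `n_group` groups and `topk_group` of those groups are selected. The function and parameter names are hypothetical and do not reproduce FlashInfer's API, and the example values are illustrative, not the DeepSeek MTP configuration.

```python
def check_grouped_topk(top_k, num_experts, n_group, topk_group):
    """Validate a grouped-routing setup (hypothetical helper).

    Returns the number of experts available in the selected groups, or raises
    ValueError if top_k is not strictly less than that number.
    """
    if num_experts % n_group != 0:
        raise ValueError("num_experts must be divisible by n_group")
    experts_in_selected_groups = topk_group * (num_experts // n_group)
    if not top_k < experts_in_selected_groups:
        raise ValueError(
            "top_k must be less than total experts in selected groups "
            f"(top_k={top_k}, available={experts_in_selected_groups})"
        )
    return experts_in_selected_groups


# 256 experts in 8 groups of 32; selecting 4 groups exposes 128 experts.
print(check_grouped_topk(top_k=8, num_experts=256, n_group=8, topk_group=4))

# 64 experts in 8 groups of 8; selecting 1 group exposes only 8 experts,
# so top_k=8 is rejected.
try:
    check_grouped_topk(top_k=8, num_experts=64, n_group=8, topk_group=1)
except ValueError as err:
    print("rejected:", err)
```

A check of this kind fails immediately with an error rather than stalling, which fits the commit's correction that the failure was not a hang.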
@vllm-bot vllm-bot merged commit 1ff9d33 into main May 12, 2026
8 of 9 checks passed
@vllm-bot vllm-bot deleted the ci/b200-k8s-remaining-fixes branch May 12, 2026 09:00
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…oject#42387)

Signed-off-by: khluu <khluu000@gmail.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>