[CI] Migrate remaining B200 jobs to b200-k8s with test fixes by khluu · Pull Request #42387 · vllm-project/vllm

khluu · 2026-05-12T07:37:22Z

Summary

Migrates the last 3 device: b200 jobs to b200-k8s, with fixes for pre-existing test failures discovered during validation in [CI] Migrate more B200 jobs to b200-k8s queue #42356
Spec Decode Eagle (spec-decode-eagle-nightly-b200): Skip test_eagle_correctness_light on Blackwell — DeepSeek head_dim=(192,192) is not supported on SM100/SM110 with FLASH_ATTN
Spec Decode Speculators+MTP (spec-decode-speculators-mtp-nightly-b200): Skip deepseek MTP variant on Blackwell — CUDA graph compilation hangs indefinitely during model init
LM Eval Small Models (lm-eval-small-models-b200): Split models-blackwell.txt — the 3 EP/TP2 models (Qwen3-Next-80B, Qwen3-Next-FP8, Nemotron-120B) were crashing because they require 2 GPUs but the step only had 1. Moved them to a new models-blackwell-ep.txt with a dedicated 2-GPU step.

After this PR + #42356, all B200 jobs are on the b200-k8s queue.
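The LM Eval Small Models split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual change: the `required_gpus` mapping, the `split_by_gpu_count` helper, and the `small-model-a` / `small-model-b` entries are hypothetical, and the real list files may use a different format. Only the three EP/TP2 model names and the two file names come from the summary.

```python
# Hypothetical GPU requirements. Only the three 2-GPU model names come from
# the PR summary; the single-GPU entries are placeholders.
required_gpus = {
    "small-model-a": 1,
    "small-model-b": 1,
    "Qwen3-Next-80B": 2,
    "Qwen3-Next-FP8": 2,
    "Nemotron-120B": 2,
}


def split_by_gpu_count(requirements, gpus_per_step=1):
    """Partition model names into those that fit the step and those that do not."""
    fits, needs_more = [], []
    for name, gpus in requirements.items():
        (fits if gpus <= gpus_per_step else needs_more).append(name)
    return fits, needs_more


single_gpu, multi_gpu = split_by_gpu_count(required_gpus)

# Assumed mapping of the two partitions onto the list files named in the PR.
file_contents = {
    "models-blackwell.txt": single_gpu,
    "models-blackwell-ep.txt": multi_gpu,
}

for filename, models in file_contents.items():
    print(filename, "->", ", ".join(models))
```

Keeping the multi-GPU models in their own list lets the 1-GPU step stay unchanged while a separate step requests 2 devices.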

Test plan

Pre-existing failures identified via build #65711
Trigger build with NOAUTO=1, unblock the 3 affected B200 jobs to validate fixes

🤖 Generated with Claude Code

Migrate the 3 B200 jobs that had pre-existing test failures:

1. Spec Decode Eagle: skip test_eagle_correctness_light on Blackwell; DeepSeek head_dim=192 is not supported on SM100/SM110
2. Spec Decode Speculators+MTP: skip deepseek MTP variant on Blackwell; CUDA graph compilation hangs indefinitely
3. LM Eval Small Models: split models-blackwell.txt into single-GPU and multi-GPU configs. The 3 EP/TP2 models (Qwen3-Next-80B, Qwen3-Next-FP8, Nemotron-120B) now run in a separate step with num_devices: 2.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: khluu <khluu000@gmail.com>

gemini-code-assist

Code Review

This pull request updates Buildkite CI configurations to utilize b200-k8s devices and introduces a new evaluation job for large models on B200 with multi-device support. It also moves several large models to a dedicated configuration file and adds skip markers for DeepSeek-related spec decode tests on Blackwell platforms due to hardware-specific limitations. Feedback was provided regarding incomplete source_file_dependencies for the new CI job, which may prevent the tests from triggering on relevant model code changes.

gemini-code-assist · 2026-05-12T07:40:14Z

+  source_file_dependencies:
+  - csrc/
+  - vllm/model_executor/layers/quantization


The source_file_dependencies for the new lm-eval-large-models-b200-ep job are incomplete. This job is specifically added to test large models like Qwen3-Next, but it currently lacks dependencies on the model implementation files. This means changes to the model logic will not trigger this CI job, potentially allowing regressions to go unnoticed. It is recommended to include the relevant model files and the configuration file itself, similar to the configuration of the lm-eval-qwen3-5-models-b200 job.

source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
- vllm/model_executor/models/qwen3_next.py
- vllm/model_executor/models/qwen3_next_mtp.py
- vllm/model_executor/layers/fla/ops/
- tests/evals/gsm8k/configs/models-blackwell-ep.txt
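To make the reviewer's point concrete, here is a small sketch of how a dependency list could gate a job. It assumes the CI treats each `source_file_dependencies` entry as a path prefix of changed files; that matching rule and the `job_is_triggered` helper are assumptions for illustration, not the pipeline's confirmed behavior.

```python
def job_is_triggered(changed_files, source_file_dependencies):
    """Return True if any changed file falls under a listed dependency path.

    Assumption: dependencies are matched as plain path prefixes.
    """
    return any(
        path.startswith(dep)
        for path in changed_files
        for dep in source_file_dependencies
    )


# Dependencies as originally listed for the new job in the diff.
original_deps = ["csrc/", "vllm/model_executor/layers/quantization"]

# Dependencies with the reviewer's suggested additions.
suggested_deps = original_deps + [
    "vllm/model_executor/models/qwen3_next.py",
    "vllm/model_executor/models/qwen3_next_mtp.py",
    "vllm/model_executor/layers/fla/ops/",
    "tests/evals/gsm8k/configs/models-blackwell-ep.txt",
]

changed = ["vllm/model_executor/models/qwen3_next.py"]
print(job_is_triggered(changed, original_deps))   # False
print(job_is_triggered(changed, suggested_deps))  # True
```

Under this assumption, a change to the Qwen3-Next model file would skip the job with the original list and run it with the suggested one.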

khluu · 2026-05-12T08:22:28Z

+            0.0,
+            marks=pytest.mark.skipif(
+                current_platform.is_device_capability_family(100),
+                reason="DeepSeek MTP CUDA graph compilation hangs on Blackwell",


this test still failed when I ran it recently :(
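For readers unfamiliar with the pattern in the diff above, the following is a hedged sketch of skipping a single parametrized case with `pytest.param(..., marks=pytest.mark.skipif(...))`. The `is_blackwell` function is a hypothetical stand-in for `current_platform.is_device_capability_family(100)`, and the variant names, the `0.5` value, the `test_variant` body, and the reason string are illustrative only (the reason wording is mine, based on the routing-check failure reported later in this PR).

```python
import pytest


def is_blackwell() -> bool:
    """Hypothetical stand-in for current_platform.is_device_capability_family(100)."""
    return True  # pretend this sketch runs on a Blackwell GPU


# One parametrized case carrying its own skip condition. The 0.0 mirrors the
# value visible in the diff; the variant name is a placeholder.
deepseek_mtp_case = pytest.param(
    "deepseek-mtp-variant",
    0.0,
    marks=pytest.mark.skipif(
        is_blackwell(),
        reason="DeepSeek MTP fails a flashinfer TRTLLM MoE routing check on Blackwell",
    ),
    id="deepseek-mtp",
)


@pytest.mark.parametrize(
    "variant, threshold",
    [
        pytest.param("other-variant", 0.5, id="other"),  # unaffected case still runs
        deepseek_mtp_case,
    ],
)
def test_variant(variant, threshold):
    # Placeholder body; the real test compares speculative decoding outputs.
    assert threshold >= 0.0
```

Attaching the mark to one `pytest.param` skips only that case, so the remaining variants keep running on the same hardware.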

haosdent · 2026-05-12T08:29:26Z

+1 for this. It also fixes the LM Eval Small Models (B200) CI failure I saw before.

The actual failure is a flashinfer TRTLLM MoE routing check (top_k must be less than total experts in selected groups), not a CUDA graph compilation hang.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: khluu <khluu000@gmail.com>

khluu requested review from Harry-Chen, mgoin and vadiklyutiy as code owners May 12, 2026 07:37

khluu mentioned this pull request May 12, 2026

[CI] Migrate more B200 jobs to b200-k8s queue #42356

Merged

2 tasks

mergify Bot added the ci/build and v1 labels May 12, 2026

haosdent mentioned this pull request May 12, 2026

[CI] Split B200 LM Eval Small Models suite by GPU count #41320

Closed

khluu mentioned this pull request May 12, 2026

Add nightly b200 test for spec decode eagle correctness #38577

Merged

vllm-bot merged commit 1ff9d33 into main May 12, 2026
8 of 9 checks passed

vllm-bot deleted the ci/b200-k8s-remaining-fixes branch May 12, 2026 09:00
