Add nightly b200 test for spec decode eagle correctness by puririshi98 · Pull Request #38577 · vllm-project/vllm

puririshi98 · 2026-03-30T19:41:15Z

Signed-off-by: Rishi Puri <riship@nvidia.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request adds a new Buildkite test step for 'Spec Decode Eagle Nightly B200' to execute correctness tests on B200 hardware using nightly PyTorch builds. A review comment pointed out that the pytest command uses an incorrect path, missing the 'tests/' prefix required to correctly locate the test suite.

Signed-off-by: Rishi Puri <riship@nvidia.com>

benchislett · 2026-04-07T17:11:29Z

+    - vllm/v1/worker/gpu/spec_decode/
+    - tests/v1/e2e/spec_decode/
+  commands:
+    - pytest -v -s v1/e2e/spec_decode -k "eagle_correctness"


This nightly should cover more than just the eagle correctness. Ideally we'd check at least MTP, maybe also draft model.

ProExpertProg

LGTM

Signed-off-by: Rishi Puri <riship@nvidia.com>

mgoin · 2026-04-07T18:17:37Z

I kicked off manual jobs for the new entries here
https://buildkite.com/vllm/ci/builds/60198/steps/canvas?sid=019d690f-0c0b-412e-ac98-f99b09bde554&tab=output
https://buildkite.com/vllm/ci/builds/60198/steps/canvas?sid=019d690f-0c0f-4379-863f-c444bb7502a1&tab=output
https://buildkite.com/vllm/ci/builds/60198/steps/canvas?sid=019d690f-0c13-40e3-8af7-43874a3a85db&tab=output

mgoin · 2026-04-07T18:55:03Z

Why did you rebase :( It canceled the tests

…-project#38577)" This reverts commit adaabb8. Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

…)" (#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>

…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

Re-applies the 3 optional nightly B200 buildkite steps originally added in vllm-project#38577 and reverted in vllm-project#39512. The revert was due to the Blackwell specdec correctness regression; the preceding commit in this PR fixes the underlying bug. Addresses Matthew Bonanni's review ask to re-enable the previously failing tests and confirm they pass CI. Co-authored-by: Rishi Puri <riship@nvidia.com> Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>

…m-project#38577)" (vllm-project#39512) This reverts commit af661a1. Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>

…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

khluu · 2026-05-12T08:56:40Z

Hey @puririshi98 — heads up: while migrating B200 jobs from the old b200 queue to b200-k8s (Kubernetes), we found that test_mtp_correctness[deepseek] now fails on Blackwell with:

RuntimeError: Check failed: args->top_k < (args->topk_group * args->num_experts / args->n_group) (4 vs. 4)
: top_k must be less than total number of experts in selected groups

This comes from flashinfer's TRTLLM fused MoE kernel (trtllm_fused_moe_kernel_launcher.cu:991). The kernel uses a strict < check, while vllm's own check in csrc/moe/grouped_topk_kernels.cu:1023 uses <= for the same constraint.
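The difference between the two checks can be sketched in a few lines of Python. This is only an illustration of the boundary case reported in the error (4 vs. 4), not the kernels' actual code: the function names are hypothetical, and the `topk_group`, `num_experts`, and `n_group` values are assumptions picked so that the right-hand side evaluates to 4.

```python
def experts_in_selected_groups(topk_group: int, num_experts: int, n_group: int) -> int:
    # Mirrors the integer expression topk_group * num_experts / n_group
    # that appears in the error message.
    return (topk_group * num_experts) // n_group


def strict_check(top_k: int, topk_group: int, num_experts: int, n_group: int) -> bool:
    # Strict form, as quoted in the flashinfer error: top_k < limit.
    return top_k < experts_in_selected_groups(topk_group, num_experts, n_group)


def inclusive_check(top_k: int, topk_group: int, num_experts: int, n_group: int) -> bool:
    # Inclusive form, as described for the vllm-side check: top_k <= limit.
    return top_k <= experts_in_selected_groups(topk_group, num_experts, n_group)


# Hypothetical configuration, chosen only so that both sides equal 4.
cfg = dict(top_k=4, topk_group=1, num_experts=8, n_group=2)

print(strict_check(**cfg))     # False: 4 < 4 does not hold
print(inclusive_check(**cfg))  # True: 4 <= 4 holds
```

A configuration that sits exactly on the boundary therefore passes the inclusive check but is rejected by the strict one, which matches the failure described above.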

This test was previously being skipped because it's decorated with @single_gpu_only and the old b200 runners had 2 GPUs. Now that b200-k8s runners have 1 GPU, the test actually runs and hits this failure.
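The skip behaviour described above can be sketched as follows. This is a hypothetical stand-in, not the real `@single_gpu_only` helper: it assumes the decorator skips whenever the visible GPU count is not exactly one, and it takes the GPU count from a caller-supplied function so the example runs without any GPU.

```python
import functools
import unittest


def single_gpu_only_sketch(gpu_count_fn):
    """Hypothetical stand-in for @single_gpu_only (assumed semantics)."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            # Assumption: skip unless exactly one GPU is visible.
            if gpu_count_fn() != 1:
                raise unittest.SkipTest("requires exactly one visible GPU")
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


def outcome(test_fn) -> str:
    # Report whether the test body ran or was skipped.
    try:
        test_fn()
        return "ran"
    except unittest.SkipTest:
        return "skipped"


@single_gpu_only_sketch(lambda: 2)  # old b200 runner: 2 GPUs
def test_on_old_runner():
    pass


@single_gpu_only_sketch(lambda: 1)  # b200-k8s runner: 1 GPU
def test_on_k8s_runner():
    pass


print(outcome(test_on_old_runner))  # skipped
print(outcome(test_on_k8s_runner))  # ran
```

Under these assumptions, a test that was silently skipped on the 2-GPU runners starts executing once the runner exposes a single GPU, so a pre-existing failure surfaces only after the migration.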

For now we've added a skipif on Blackwell in #42387 so the migration can proceed.

Build with the failure: https://buildkite.com/vllm/ci/builds/65711#019e1b4e-7f4c-41c5-8f0e-82cbde49317a

khluu · 2026-05-12T09:00:18Z

Also, test_eagle_correctness_light[FLASH_ATTN-deepseek_eagle] has the same story — it was previously skipped on B200 because @single_gpu_only + 2-GPU runners meant it never ran. Now with 1-GPU b200-k8s runners it actually executes and fails with:

AssertionError: (head_dim, head_dim_v)=(192, 192) is not supported on SM100/SM110.
head_dim and head_dim_v must be between 8 and 128 and divisible by 8, or (192, 128) for DeepSeek.
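The constraint stated in the assertion message can be written out as a small predicate. This is a sketch that restates the message only; the function name is hypothetical and it is not the attention backend's actual validation code.

```python
def head_dims_supported_sketch(head_dim: int, head_dim_v: int) -> bool:
    """Restates the rule from the error message (hypothetical helper)."""
    def in_range(d: int) -> bool:
        # Between 8 and 128 inclusive, and divisible by 8.
        return 8 <= d <= 128 and d % 8 == 0

    # Special case called out in the message for DeepSeek.
    if (head_dim, head_dim_v) == (192, 128):
        return True
    return in_range(head_dim) and in_range(head_dim_v)


print(head_dims_supported_sketch(192, 192))  # False: the failing combination
print(head_dims_supported_sketch(192, 128))  # True: the DeepSeek exception
print(head_dims_supported_sketch(128, 128))  # True: within the general range
```

The failing test uses (192, 192), which falls outside both the general range and the single (192, 128) exception.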

Added a skipif on Blackwell for this test as well in #42387.

Build with the failure: https://buildkite.com/vllm/ci/builds/65711#019e19b3-b994-44e0-a1cc-ce8614caa13a

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Add nightly b200 test for spec decode eagle correctness

3c8a59d

Signed-off-by: Rishi Puri <riship@nvidia.com>

claude Bot reviewed Mar 30, 2026

mergify Bot added the ci/build label Mar 30, 2026

gemini-code-assist Bot reviewed Mar 30, 2026

Comment thread .buildkite/test_areas/spec_decode.yaml

puririshi98 added 7 commits March 31, 2026 10:13

Merge branch 'main' into patch-8

e7491ba

Merge branch 'main' into patch-8

a75cfd9

Merge branch 'main' into patch-8

5d7eb1e

Merge branch 'main' into patch-8

e3b6c8d

Merge branch 'main' into patch-8

558630c

Merge branch 'main' into patch-8

b58fc66

Merge branch 'main' into patch-8

49f0ee4

benchislett reviewed Apr 6, 2026

Comment thread .buildkite/test_areas/spec_decode.yaml Outdated

puririshi98 and others added 3 commits April 6, 2026 13:17

Update spec_decode.yaml

b26e597

Signed-off-by: Rishi Puri <riship@nvidia.com>

Merge branch 'main' into patch-8

962ee48

Merge branch 'main' into patch-8

f866bd1

benchislett added the "verified" label (Run pre-commit for new contributors without triggering other tests) Apr 7, 2026

puririshi98 added 2 commits April 6, 2026 20:45

Merge branch 'main' into patch-8

7041559

Merge branch 'main' into patch-8

2bbb1c9

benchislett reviewed Apr 7, 2026

Comment thread .buildkite/test_areas/spec_decode.yaml

robertgshaw2-redhat approved these changes Apr 7, 2026

robertgshaw2-redhat enabled auto-merge (squash) April 7, 2026 17:43

github-actions Bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Apr 7, 2026

ProExpertProg approved these changes Apr 7, 2026

Update spec_decode.yaml

fffad1e

Signed-off-by: Rishi Puri <riship@nvidia.com>

auto-merge was automatically disabled April 7, 2026 17:46
Head branch was pushed to by a user without write access

benchislett enabled auto-merge (squash) April 7, 2026 17:53

Merge branch 'main' into patch-8

2623ae4

Merge branch 'main' into patch-8

d7d9ffe

puririshi98 mentioned this pull request Apr 9, 2026

[Bug]: spec decode tests fail on nightly b200 job #39441

Open

1 task

benchislett merged commit adaabb8 into vllm-project:main Apr 9, 2026
16 checks passed

benchislett added a commit to CentML/vllm that referenced this pull request Apr 10, 2026

Revert "Add nightly b200 test for spec decode eagle correctness (vllm…

65d4da1

…-project#38577)" This reverts commit adaabb8. Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

ZhanqiuHu mentioned this pull request Apr 10, 2026

[CI] test_no_sync_with_spec_decode[eagle3-llama]: unexpected GPU-CPU sync #39537

Closed

3 tasks

benchislett added a commit that referenced this pull request Apr 11, 2026

Revert "Add nightly b200 test for spec decode eagle correctness (#38577…

af661a1

…)" (#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026

Add nightly b200 test for spec decode eagle correctness (vllm-project…

64604e8

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>

wojciech-wais pushed a commit to wojciech-wais/vllm that referenced this pull request Apr 13, 2026

Revert "Add nightly b200 test for spec decode eagle correctness (vllm…

b6382fb

…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026

Add nightly b200 test for spec decode eagle correctness (vllm-project…

157c6e9

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>

whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026

Revert "Add nightly b200 test for spec decode eagle correctness (vllm…

4d0c057

…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Add nightly b200 test for spec decode eagle correctness (vllm-project…

fcb4093

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Revert "Add nightly b200 test for spec decode eagle correctness (vllm…

378becd

…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

Add nightly b200 test for spec decode eagle correctness (vllm-project…

b480bb2

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

Add nightly b200 test for spec decode eagle correctness (vllm-project…

e3ab1bd

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

Add nightly b200 test for spec decode eagle correctness (vllm-project…

0e99d1b

…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

Revert "Add nightly b200 test for spec decode eagle correctness (vllm…

d81eb8c

…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

Add nightly b200 test for spec decode eagle correctness #38577
benchislett merged 16 commits into vllm-project:main from puririshi98:patch-8

6 participants
