Add nightly b200 test for spec decode eagle correctness #38577
Signed-off-by: Rishi Puri <riship@nvidia.com>
Code Review
This pull request adds a new Buildkite test step for 'Spec Decode Eagle Nightly B200' to execute correctness tests on B200 hardware using nightly PyTorch builds. A review comment pointed out that the pytest command uses an incorrect path, missing the 'tests/' prefix required to correctly locate the test suite.
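The path issue raised in the review can be illustrated with a small sketch. It assumes the layout named in the diff (`tests/v1/e2e/spec_decode/` under the repository root) and shows that a relative pytest target only resolves when it matches the directory the command runs from. The helper `resolve_test_target` and the throwaway directory tree are hypothetical and are not part of vLLM or its CI configuration. The working directory of the actual Buildkite step is not shown in this conversation.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional


def resolve_test_target(target: str, working_dir: Path) -> Optional[Path]:
    """Return the target path if it exists as a directory under working_dir."""
    candidate = working_dir / target
    return candidate if candidate.is_dir() else None


def demo() -> Dict[str, bool]:
    # Build a throwaway tree mimicking the layout named in the diff:
    # <repo_root>/tests/v1/e2e/spec_decode/
    with tempfile.TemporaryDirectory() as tmp:
        repo_root = Path(tmp)
        (repo_root / "tests" / "v1" / "e2e" / "spec_decode").mkdir(parents=True)
        tests_dir = repo_root / "tests"
        return {
            # Path as written in the PR, run from the repository root.
            "root_without_prefix": resolve_test_target(
                "v1/e2e/spec_decode", repo_root) is not None,
            # Path with the 'tests/' prefix, run from the repository root.
            "root_with_prefix": resolve_test_target(
                "tests/v1/e2e/spec_decode", repo_root) is not None,
            # Path as written in the PR, run from inside tests/.
            "tests_dir_without_prefix": resolve_test_target(
                "v1/e2e/spec_decode", tests_dir) is not None,
        }


RESULTS = demo()
```

Under these assumptions, the unprefixed path resolves only when the command runs from inside `tests/`, so whether the prefix is needed depends on the step's working directory.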
- vllm/v1/worker/gpu/spec_decode/
- tests/v1/e2e/spec_decode/
commands:
- pytest -v -s v1/e2e/spec_decode -k "eagle_correctness"
This nightly should cover more than just the eagle correctness. Ideally we'd check at least MTP, maybe also draft model.
Head branch was pushed to by a user without write access
Why did you rebase :( It canceled the tests
…-project#38577)" This reverts commit adaabb8. Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…)" (#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com>
…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Re-applies the 3 optional nightly B200 buildkite steps originally added in vllm-project#38577 and reverted in vllm-project#39512. The revert was due to the Blackwell specdec correctness regression; the preceding commit in this PR fixes the underlying bug. Addresses Matthew Bonanni's review ask to re-enable the previously failing tests and confirm they pass CI. Co-authored-by: Rishi Puri <riship@nvidia.com> Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
…m-project#38577)" (vllm-project#39512) This reverts commit af661a1. Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Hey @puririshi98, heads up: while migrating B200 jobs from the old […] This comes from flashinfer's TRTLLM fused MoE kernel ([…] This test was previously being skipped because it's decorated with […] For now we've added a […]
Build with the failure: https://buildkite.com/vllm/ci/builds/65711#019e1b4e-7f4c-41c5-8f0e-82cbde49317a
Also, […] Added a […]
Build with the failure: https://buildkite.com/vllm/ci/builds/65711#019e19b3-b994-44e0-a1cc-ce8614caa13a
…#38577) Signed-off-by: Rishi Puri <riship@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…-project#38577)" (vllm-project#39512) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
cc @benchislett