Fix spec NaN/OOB detection: skip during CUDA graph capture, sync check otherwise #20092
alisonshao wants to merge 3 commits into main
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses a critical nightly test failure on B200 GPUs by disabling the NaN and out-of-bounds detection mechanisms that were causing CUDA runtime crashes.
Code Review
This pull request addresses a test crash on B200 hardware by removing environment variable overrides for NaN and out-of-bounds detection. The change is correct and directly resolves the issue described, which is caused by torch._assert_async leading to CUDA aborts. I have one suggestion to add a comment to the code to improve long-term maintainability by explaining why these specific checks are disabled.
```diff
-        ), envs.SGLANG_SPEC_OOB_DETECTION.override(
-            True
-        ):
+        with envs.SGLANG_ENABLE_SPEC_V2.override(True):
```
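The test uses SGLang's `envs.X.override(...)` context managers to toggle feature flags. A minimal sketch of what such an override helper might look like (the function below is illustrative, not SGLang's actual implementation) is a context manager that sets an environment variable and restores the previous value on exit:

```python
import contextlib
import os


@contextlib.contextmanager
def override(name: str, value: str):
    """Hypothetical minimal version of an env-var override helper:
    temporarily set `name` to `value`, restoring the old state on exit."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            # Variable did not exist before: remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = old
```

Restoring in a `finally` block ensures the flag is reset even if the test body raises, which is why stacking several overrides in one `with` statement is safe.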
For long-term maintainability, it's good practice to add a comment explaining why SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION are disabled for this test. While the PR description covers this, a code comment ensures future developers understand the context without needing to find this specific PR.
```python
# NOTE: SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION are disabled
# as they cause unrecoverable CUDA aborts on B200 hardware.
with envs.SGLANG_ENABLE_SPEC_V2.override(True):
```

Force-pushed from 953997b to e2608f5.
…k otherwise

The previous `torch._assert_async` approach caused unrecoverable CUDA aborts when NaN was detected at runtime, leading to coredumps and cascading DP process crashes (B200 nightly failing since 3/6). This fix:

- Skips detection during CUDA graph capture (dummy data makes NaN detection meaningless; a synchronizing `.item()` call is illegal during capture)
- Uses sync checks with `RuntimeError` during actual execution, giving clear, debuggable errors instead of CUDA coredumps
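The commit's skip-during-capture plus sync-check scheme could look roughly like the sketch below (function name and error message are illustrative, not the actual SGLang code; `torch.cuda.is_current_stream_capturing()` is the standard PyTorch API for detecting graph capture):

```python
import torch


def maybe_detect_nan(logits: torch.Tensor) -> None:
    """Illustrative sketch of the sync NaN check described above."""
    # 1. Skip during CUDA graph capture: the inputs there are dummy
    #    data, and a synchronizing .item() call is illegal mid-capture.
    if torch.cuda.is_available() and torch.cuda.is_current_stream_capturing():
        return
    # 2. Outside capture, do a synchronous check. .item() forces a
    #    device sync, but on failure it yields a clean RuntimeError
    #    instead of the unrecoverable CUDA abort that
    #    torch._assert_async produced.
    if torch.isnan(logits).any().item():
        raise RuntimeError("NaN detected in spec-decode logits")
```

The trade-off is one device sync per check in exchange for a debuggable Python exception rather than a coredump that takes down the whole DP group.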
Force-pushed from aec79a8 to a952564.
Hello, please check this: flashinfer-ai/flashinfer#2708. I was hunting down NaN issues in GLM5 on 8x RTX 6000 PRO, and the problem was a race condition / missing sync when PDL is enabled. I believe this could also be the problem on B200; it's worth checking.
Use cuDNN FP4 GEMM backend instead of CUTLASS on B200 to work around PDL race condition in flashinfer CUTLASS kernels (flashinfer#2708).
Change maybe_detect_nan to warn + replace NaN with nan_to_num_() instead of raising RuntimeError. NaN in draft model logits (e.g. from flashinfer CUTLASS FP4 GEMM bugs on Blackwell) doesn't affect correctness since bad draft tokens get rejected by the verifier. Also revert the flashinfer_cudnn workaround in the B200 test since the NaN occurs in the EAGLE draft path regardless of FP4 backend.
Summary
- Change `maybe_detect_nan`/`maybe_detect_oob` to skip during CUDA graph capture

Problem: The B200 nightly EAGLE DP attention test crashes due to NaN in draft model logits, caused by an upstream flashinfer CUTLASS FP4 GEMM PDL race condition (flashinfer#2708). PR #19899 added NaN detection using `torch._assert_async`, which causes an unrecoverable CUDA abort on NaN.

Fix: Change `maybe_detect_nan` to warn + replace (like the sampler's `--enable-nan-detection`):

- NaN values are replaced in place with `nan_to_num_()` (safe defaults)
- Detection is skipped during CUDA graph capture (a synchronizing `.item()` call is illegal during capture)

Why not the cuDNN workaround? `--fp4-gemm-backend flashinfer_cudnn` propagates to the draft model, but the NaN still occurs in the EAGLE draft path, suggesting additional NaN sources beyond the FP4 GEMM kernel.

Related: #20043, flashinfer#2708, flashinfer#2716
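The final warn-and-replace behavior described in the summary might be sketched as follows (an illustrative approximation, not the exact SGLang code; `nan_to_num_` and `is_current_stream_capturing` are real PyTorch APIs):

```python
import warnings

import torch


def maybe_detect_nan(logits: torch.Tensor) -> torch.Tensor:
    """Sketch of the warn-and-replace NaN handling described above."""
    # Skip during CUDA graph capture, where a sync .item() is illegal.
    if torch.cuda.is_available() and torch.cuda.is_current_stream_capturing():
        return logits
    nan_mask = torch.isnan(logits)
    if nan_mask.any().item():
        # Warn instead of raising: bad draft tokens produced from these
        # logits are simply rejected by the verifier, so correctness is
        # preserved.
        warnings.warn(
            f"NaN in draft logits: replacing {int(nan_mask.sum().item())} value(s)"
        )
        # In-place replacement with safe defaults (NaN -> 0.0).
        logits.nan_to_num_()
    return logits
```

This mirrors the sampler's `--enable-nan-detection` behavior: serving continues with sanitized logits, while the warning leaves a trail for diagnosing the upstream kernel bug.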
Test plan