Fix spec NaN/OOB detection: skip during CUDA graph capture, sync check otherwise#20092

Closed
alisonshao wants to merge 3 commits into main from fix/eagle-dp-attn-large-crash

Conversation

Collaborator

@alisonshao alisonshao commented Mar 7, 2026

Summary

  • Handle NaN gracefully in speculative decoding: warn + replace instead of crash
  • Fix maybe_detect_nan / maybe_detect_oob to skip during CUDA graph capture

Problem: The B200 nightly EAGLE DP attention test crashes due to NaN in the draft model logits, caused by an upstream flashinfer CUTLASS FP4 GEMM PDL race condition (flashinfer#2708). PR #19899 added NaN detection using torch._assert_async, which causes an unrecoverable CUDA abort on NaN.

Fix: Change maybe_detect_nan to warn + replace (like the sampler's --enable-nan-detection):

  • Log a warning when NaN is detected (keeps intentional detection for debugging)
  • Replace NaN values in-place with nan_to_num_() (safe defaults)
  • Skip during CUDA graph capture (.item() is illegal during capture)
  • NaN in draft logits doesn't affect correctness — bad draft tokens get rejected by the verifier
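The warn-and-replace behavior described above can be sketched roughly as follows. This is a minimal illustration assuming a PyTorch-style helper; the function name and structure here are hypothetical, not the actual sglang implementation:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def maybe_detect_nan(logits: torch.Tensor) -> None:
    """Warn and scrub NaN in draft logits instead of aborting (sketch)."""
    # Skip during CUDA graph capture: the tensors hold dummy data there,
    # and the synchronizing .item() call below is illegal while capturing.
    if torch.cuda.is_available() and torch.cuda.is_current_stream_capturing():
        return
    # Sync check: .item() forces a device-to-host copy outside capture.
    if torch.isnan(logits).any().item():
        logger.warning(
            "NaN detected in draft model logits; replacing in place. "
            "Bad draft tokens will be rejected by the verifier."
        )
        # Replace NaN (and +/-inf) with finite defaults, in place.
        logits.nan_to_num_()
```

Because the replacement happens in place, downstream sampling sees finite values while the warning preserves the debugging signal.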

Why not the cuDNN workaround? --fp4-gemm-backend flashinfer_cudnn propagates to the draft model, but the NaN still occurs in the EAGLE draft path, suggesting additional NaN sources beyond the FP4 GEMM kernel.

Related: #20043, flashinfer#2708, flashinfer#2716

Test plan

  • B200 nightly EAGLE DP attention test passes (NaN is logged but handled gracefully)
  • Detection skips during CUDA graph capture
  • NaN detection still active (warnings logged for debugging)

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical nightly test failure on B200 GPUs by disabling specific NaN and Out-Of-Bounds detection mechanisms that were causing CUDA runtime crashes. This change ensures the stability and reliability of the test_eagle_infer_beta_dp_attention_large test, allowing for proper validation of the EAGLE DP attention model without encountering these specific assertion-related failures.

Highlights

  • Test Fix: Removed SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION environment variable overrides from the test_eagle_infer_beta_dp_attention_large test setup.
  • Root Cause Identified: Determined that torch._assert_async used by these overrides caused unrecoverable CUDA aborts and test crashes on B200 GPUs when NaN values were detected in NVFP4 draft logits.


Changelog
  • test/registered/spec/eagle/test_eagle_infer_beta_dp_attention_large.py
    • Removed the explicit overriding of SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION environment variables to True within the setUpClass method.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a test crash on B200 hardware by removing environment variable overrides for NaN and out-of-bounds detection. The change is correct and directly resolves the issue described, which is caused by torch._assert_async leading to CUDA aborts. I have one suggestion to add a comment to the code to improve long-term maintainability by explaining why these specific checks are disabled.

), envs.SGLANG_SPEC_OOB_DETECTION.override(
True
):
with envs.SGLANG_ENABLE_SPEC_V2.override(True):


medium

For long-term maintainability, it's good practice to add a comment explaining why SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION are disabled for this test. While the PR description covers this, a code comment ensures future developers understand the context without needing to find this specific PR.

        # NOTE: SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION are disabled
        # as they cause unrecoverable CUDA aborts on B200 hardware.
        with envs.SGLANG_ENABLE_SPEC_V2.override(True):

@alisonshao
Collaborator Author

@alisonshao alisonshao force-pushed the fix/eagle-dp-attn-large-crash branch from 953997b to e2608f5 Compare March 7, 2026 23:14
@alisonshao alisonshao changed the title Fix EAGLE DP attention large test crash on B200 nightly Fix spec NaN/OOB detection to use sync checks instead of torch._assert_async Mar 7, 2026
@alisonshao alisonshao changed the title Fix spec NaN/OOB detection to use sync checks instead of torch._assert_async Fix spec NaN/OOB detection: skip during CUDA graph capture, sync check otherwise Mar 7, 2026
Fix spec NaN/OOB detection: skip during CUDA graph capture, sync check otherwise

The previous torch._assert_async approach caused unrecoverable CUDA
aborts when NaN was detected at runtime, leading to coredumps and
cascading DP process crashes (B200 nightly failing since 3/6).

This fix:
- Skips detection during CUDA graph capture (dummy data makes NaN
  detection meaningless; sync .item() is illegal during capture)
- Uses sync checks with RuntimeError during actual execution, giving
  clear debuggable errors instead of CUDA coredumps
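The two points above can be sketched as follows. This is an illustrative reconstruction of this intermediate approach under the stated assumptions, not the actual committed code; check_nan_sync is a hypothetical name:

```python
import torch


def check_nan_sync(logits: torch.Tensor) -> None:
    """Sync NaN check that raises a recoverable Python error (sketch)."""
    # During CUDA graph capture the tensors contain dummy data and a
    # synchronizing .item() call is illegal, so skip the check entirely.
    if torch.cuda.is_available() and torch.cuda.is_current_stream_capturing():
        return
    if torch.isnan(logits).any().item():
        # Unlike torch._assert_async, a RuntimeError surfaces a clear
        # Python traceback instead of aborting the CUDA context.
        raise RuntimeError("NaN detected in speculative decoding logits")
```

The later commits in this PR replace the RuntimeError with the warn-and-replace behavior, but the capture-skip guard is the same.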
@alisonshao alisonshao force-pushed the fix/eagle-dp-attn-large-crash branch from aec79a8 to a952564 Compare March 7, 2026 23:32
@alisonshao
Collaborator Author

@voipmonitor
Contributor

Note: This PR fixes the detection mechanism. The underlying NaN in NVFP4 draft logits on B200 still needs investigation separately.

Hello, please check this: flashinfer-ai/flashinfer#2708

I was hunting down NaN issues in GLM5 on 8x RTX 6000 PRO, and the problem was a race condition / missing sync when PDL is enabled. I believe this could also be a problem for B200; worth checking.

Alison Shao added 2 commits March 7, 2026 16:47
Use cuDNN FP4 GEMM backend instead of CUTLASS on B200 to work around
PDL race condition in flashinfer CUTLASS kernels (flashinfer#2708).
Change maybe_detect_nan to warn + replace NaN with nan_to_num_()
instead of raising RuntimeError. NaN in draft model logits (e.g.
from flashinfer CUTLASS FP4 GEMM bugs on Blackwell) doesn't affect
correctness since bad draft tokens get rejected by the verifier.

Also revert the flashinfer_cudnn workaround in the B200 test since
the NaN occurs in the EAGLE draft path regardless of FP4 backend.
@hnyls2002 hnyls2002 closed this Mar 21, 2026