Fix spec NaN/OOB detection: skip during CUDA graph capture, sync check otherwise #20092
alisonshao wants to merge 3 commits into main
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses a critical nightly test failure on B200 GPUs by disabling the NaN and out-of-bounds detection mechanisms that were causing CUDA runtime crashes.
Code Review
This pull request addresses a test crash on B200 hardware by removing environment variable overrides for NaN and out-of-bounds detection. The change is correct and directly resolves the issue described, which is caused by torch._assert_async leading to CUDA aborts. I have one suggestion to add a comment to the code to improve long-term maintainability by explaining why these specific checks are disabled.
```diff
-        ), envs.SGLANG_SPEC_OOB_DETECTION.override(
-            True
-        ):
+        with envs.SGLANG_ENABLE_SPEC_V2.override(True):
```
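The test uses SGLang's `envs.X.override(...)` context managers to toggle feature flags. A minimal sketch of what such an override helper might look like (the function below is illustrative, not SGLang's actual implementation) is a context manager that sets an environment variable and restores the previous value on exit:

```python
import contextlib
import os


@contextlib.contextmanager
def override(name: str, value: str):
    """Hypothetical minimal version of an env-var override helper:
    temporarily set `name` to `value`, restoring the old state on exit."""
    old = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if old is None:
            # Variable did not exist before: remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = old
```

Restoring in a `finally` block ensures the flag is reset even if the test body raises, which is why stacking several overrides in one `with` statement is safe.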
For long-term maintainability, it's good practice to add a comment explaining why SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION are disabled for this test. While the PR description covers this, a code comment ensures future developers understand the context without needing to find this specific PR.
```python
# NOTE: SGLANG_SPEC_NAN_DETECTION and SGLANG_SPEC_OOB_DETECTION are disabled
# as they cause unrecoverable CUDA aborts on B200 hardware.
with envs.SGLANG_ENABLE_SPEC_V2.override(True):
```

Force-pushed from 953997b to e2608f5.
…k otherwise

The previous `torch._assert_async` approach caused unrecoverable CUDA aborts when NaN was detected at runtime, leading to coredumps and cascading DP process crashes (B200 nightly failing since 3/6). This fix:

- Skips detection during CUDA graph capture (dummy data makes NaN detection meaningless; a synchronizing `.item()` call is illegal during capture)
- Uses sync checks with `RuntimeError` during actual execution, giving clear, debuggable errors instead of CUDA coredumps
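The commit's skip-during-capture plus sync-check scheme could look roughly like the sketch below (function name and error message are illustrative, not the actual SGLang code; `torch.cuda.is_current_stream_capturing()` is the standard PyTorch API for detecting graph capture):

```python
import torch


def maybe_detect_nan(logits: torch.Tensor) -> None:
    """Illustrative sketch of the sync NaN check described above."""
    # 1. Skip during CUDA graph capture: the inputs there are dummy
    #    data, and a synchronizing .item() call is illegal mid-capture.
    if torch.cuda.is_available() and torch.cuda.is_current_stream_capturing():
        return
    # 2. Outside capture, do a synchronous check. .item() forces a
    #    device sync, but on failure it yields a clean RuntimeError
    #    instead of the unrecoverable CUDA abort that
    #    torch._assert_async produced.
    if torch.isnan(logits).any().item():
        raise RuntimeError("NaN detected in spec-decode logits")
```

The trade-off is one device sync per check in exchange for a debuggable Python exception rather than a coredump that takes down the whole DP group.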
Force-pushed from aec79a8 to a952564.
Hello, please check this: flashinfer-ai/flashinfer#2708. I was hunting down NaN issues in GLM5 on 8x RTX 6000 PRO, and the problem was a race condition / missing sync when PDL is enabled. I believe this could also be the problem on B200; it's worth checking.
Use cuDNN FP4 GEMM backend instead of CUTLASS on B200 to work around PDL race condition in flashinfer CUTLASS kernels (flashinfer#2708).
Change maybe_detect_nan to warn + replace NaN with nan_to_num_() instead of raising RuntimeError. NaN in draft model logits (e.g. from flashinfer CUTLASS FP4 GEMM bugs on Blackwell) doesn't affect correctness since bad draft tokens get rejected by the verifier. Also revert the flashinfer_cudnn workaround in the B200 test since the NaN occurs in the EAGLE draft path regardless of FP4 backend.
Summary
- Change `maybe_detect_nan`/`maybe_detect_oob` to skip during CUDA graph capture

Problem: The B200 nightly EAGLE DP attention test crashes due to NaN in draft model logits, caused by an upstream flashinfer CUTLASS FP4 GEMM PDL race condition (flashinfer#2708). PR #19899 added NaN detection using `torch._assert_async`, which causes an unrecoverable CUDA abort on NaN.

Fix: Change `maybe_detect_nan` to warn + replace (like the sampler's `--enable-nan-detection`):

- NaN values are replaced in place with `nan_to_num_()` (safe defaults)
- Detection is skipped during CUDA graph capture (a synchronizing `.item()` call is illegal during capture)

Why not the cuDNN workaround? `--fp4-gemm-backend flashinfer_cudnn` propagates to the draft model, but the NaN still occurs in the EAGLE draft path, suggesting additional NaN sources beyond the FP4 GEMM kernel.

Related: #20043, flashinfer#2708, flashinfer#2716
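The final warn-and-replace behavior described in the summary might be sketched as follows (an illustrative approximation, not the exact SGLang code; `nan_to_num_` and `is_current_stream_capturing` are real PyTorch APIs):

```python
import warnings

import torch


def maybe_detect_nan(logits: torch.Tensor) -> torch.Tensor:
    """Sketch of the warn-and-replace NaN handling described above."""
    # Skip during CUDA graph capture, where a sync .item() is illegal.
    if torch.cuda.is_available() and torch.cuda.is_current_stream_capturing():
        return logits
    nan_mask = torch.isnan(logits)
    if nan_mask.any().item():
        # Warn instead of raising: bad draft tokens produced from these
        # logits are simply rejected by the verifier, so correctness is
        # preserved.
        warnings.warn(
            f"NaN in draft logits: replacing {int(nan_mask.sum().item())} value(s)"
        )
        # In-place replacement with safe defaults (NaN -> 0.0).
        logits.nan_to_num_()
    return logits
```

This mirrors the sampler's `--enable-nan-detection` behavior: serving continues with sanitized logits, while the warning leaves a trail for diagnosing the upstream kernel bug.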
Test plan