Fix flaky Qwen3-Next KL divergence tests by reverting mamba slot release#18910
Kangyan-Zhou merged 1 commit into main
Conversation
Summary of Changes

Hello @alisonshao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves persistent test failures by increasing the KL divergence threshold for specific MTP topk tests. The adjustment is necessary because the speculative decoding configuration used in these tests exhibits higher variance, causing the tests to fail against the previously stricter threshold. The change aims to stabilize CI without compromising the integrity of the tests, reflecting a more realistic tolerance for the given decoding strategy.
Activity
/rerun-stage stage-c-test-4-gpu-h100

✅ Triggered
Code Review
This pull request adjusts the KL divergence threshold for Qwen3 Next MTP top-k tests from 0.008 to 0.02 to address flakiness in CI. The change is justified by the higher variance inherent in multi-step speculative decoding with multiple candidates. My feedback suggests documenting the observed outliers in the code comments for better context and highlights the importance of monitoring these numerical shifts to ensure they do not mask regressions.
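For context on what this threshold gates, the comparison these tests run can be sketched as follows. This is an illustrative reconstruction, not the repository's test code: the `mean_kl` helper, the toy distributions, and the per-position dict layout are assumptions; only the 0.02 threshold comes from the diff.

```python
import math

def mean_kl(baseline_logprobs, test_logprobs):
    """Arithmetic mean of per-position KL(P_baseline || P_test).

    Each element maps token id -> log probability at one decode position.
    """
    kls = []
    for base, test in zip(baseline_logprobs, test_logprobs):
        # KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))
        kls.append(sum(math.exp(lp) * (lp - test[tok]) for tok, lp in base.items()))
    return sum(kls) / len(kls)

# Toy two-token vocabulary at two decode positions.
base = [{0: math.log(0.9), 1: math.log(0.1)}] * 2
shifted = [{0: math.log(0.85), 1: math.log(0.15)}] * 2

assert mean_kl(base, base) == 0.0      # identical distributions: zero divergence
assert mean_kl(base, shifted) < 0.02   # a small shift stays under the relaxed threshold
```

A check like this passes or fails on the mean, which is exactly why a handful of outlier samples can flip the result even when typical positions agree closely.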
```diff
 }
-# MTP has higher KL divergence threshold
+# MTP topk has higher KL divergence threshold due to speculative decoding variance
```
It is helpful to document the specific reason for the threshold increase, such as the observed outliers mentioned in the PR description, to assist future maintenance.
```diff
-# MTP topk has higher KL divergence threshold due to speculative decoding variance
+# MTP topk has higher KL divergence threshold due to speculative decoding variance (outliers ~0.1 observed)
```
```diff
 # MTP topk has higher KL divergence threshold due to speculative decoding variance
 ACC_THRESHOLDS_MTP = {
-    QWEN3_NEXT_MODEL: {"kl_div": 0.008, "gsm8k": 0.93},
+    QWEN3_NEXT_MODEL: {"kl_div": 0.02, "gsm8k": 0.93},
```
Increasing the threshold to 0.02 (8x the baseline) effectively addresses CI flakiness but reduces the sensitivity of the test to subtle numerical regressions. Given that outliers of ~0.1 KL were observed, it is worth verifying if these occur on specific tokens (e.g., rare tokens or long sequences) to ensure the variance is purely numerical and not a logic issue in the tree-based logprob matching.
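A quick way to follow up on this suggestion is to localize which positions contribute the outliers, rather than inspecting only the mean. This is a hypothetical diagnostic sketch; the `per_token_kl` values are invented, and only the ~0.1 outlier scale comes from the discussion:

```python
# Per-position KL values from one flaky run (invented numbers; the
# ~0.1+ outliers mirror the scale reported in this PR discussion).
per_token_kl = [0.001, 0.002, 0.15, 0.001, 0.003, 0.27, 0.002]

# Positions whose KL is an order of magnitude above the rest.
outliers = [(i, kl) for i, kl in enumerate(per_token_kl) if kl > 0.1]

# Clustered outlier positions would hint at a logic issue (e.g. in
# tree-based logprob matching); scattered ones suggest numerical noise.
assert outliers == [(2, 0.15), (5, 0.27)]
```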
Force-pushed 5750e72 to 9ad3a2b
Force-pushed a441639 to 1d3a1ec
The mamba slot release on scheduling failure introduced non-deterministic KL divergence in Qwen3-Next tests. When scheduling fails after `init_next_round_input` has already performed COW allocation, freeing the mamba pool slot causes states to be reconstructed from the radix cache on rescheduling. This reconstruction from the last tracking point introduces numerical differences for some samples, manifesting as outlier KL divergence values (0.12-0.27) that push the arithmetic mean above the test threshold.

CI data confirms this: before the slot release change, Qwen3-Next KL tests had a ~90% pass rate (18/20 runs). After it, the pass rate dropped to ~35% (7/20 runs). The same commit produced both PASS and FAIL on the same day, confirming the test became non-deterministic.

This reverts the mamba slot release logic while preserving the rest of the scheduling code.
Force-pushed 52b883f to e0e0301
I think that could be regarded as deterministic...
Follow-up: the original issue discussed here is now addressed in #19024, which fixes the memory checker directly (without releasing slots or affecting scheduling behavior).
Summary

- When scheduling fails after `init_next_round_input` has already performed COW allocation, freeing the mamba pool slot causes states to be reconstructed from the radix cache on rescheduling, introducing numerical differences that manifest as outlier KL values

Root Cause Analysis

CI data confirms the correlation:

- The same commit `fd5a45d5c` on Feb 15 produced both PASS and FAIL, confirming the test became non-deterministic

The mamba state reconstruction from the radix cache (at the last tracking point) introduces slight numerical differences for some samples, causing outlier KL divergence values (0.12-0.27) that pull the arithmetic mean above the test threshold.
Test plan

- `test_qwen3_next_models_mtp.py`, `test_qwen3_next_models.py`