Fix flaky Qwen3-Next KL divergence tests by reverting mamba slot release#18910
Kangyan-Zhou merged 1 commit into main
Conversation
Summary of Changes

Hello @alisonshao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves persistent test failures by increasing the KL divergence threshold for specific MTP topk tests. The adjustment is necessary because the speculative decoding configuration used in these tests exhibits higher variance, causing the tests to fail against the previously stricter threshold. The change aims to stabilize CI without compromising the integrity of the tests, reflecting a more realistic tolerance for the given decoding strategy.
Activity
/rerun-stage stage-c-test-4-gpu-h100

✅ Triggered
Code Review
This pull request adjusts the KL divergence threshold for Qwen3 Next MTP top-k tests from 0.008 to 0.02 to address flakiness in CI. The change is justified by the higher variance inherent in multi-step speculative decoding with multiple candidates. My feedback suggests documenting the observed outliers in the code comments for better context and highlights the importance of monitoring these numerical shifts to ensure they do not mask regressions.
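For context on what this threshold gates, the comparison these tests run can be sketched as follows. This is an illustrative reconstruction, not the repository's test code: the `mean_kl` helper, the toy distributions, and the per-position dict layout are assumptions; only the 0.02 threshold comes from the diff.

```python
import math

def mean_kl(baseline_logprobs, test_logprobs):
    """Arithmetic mean of per-position KL(P_baseline || P_test).

    Each element maps token id -> log probability at one decode position.
    """
    kls = []
    for base, test in zip(baseline_logprobs, test_logprobs):
        # KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))
        kls.append(sum(math.exp(lp) * (lp - test[tok]) for tok, lp in base.items()))
    return sum(kls) / len(kls)

# Toy two-token vocabulary at two decode positions.
base = [{0: math.log(0.9), 1: math.log(0.1)}] * 2
shifted = [{0: math.log(0.85), 1: math.log(0.15)}] * 2

assert mean_kl(base, base) == 0.0      # identical distributions: zero divergence
assert mean_kl(base, shifted) < 0.02   # a small shift stays under the relaxed threshold
```

A check like this passes or fails on the mean, which is exactly why a handful of outlier samples can flip the result even when typical positions agree closely.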
```diff
 }
-# MTP has higher KL divergence threshold
+# MTP topk has higher KL divergence threshold due to speculative decoding variance
```
It is helpful to document the specific reason for the threshold increase, such as the observed outliers mentioned in the PR description, to assist future maintenance.
```diff
-# MTP topk has higher KL divergence threshold due to speculative decoding variance
+# MTP topk has higher KL divergence threshold due to speculative decoding variance (outliers ~0.1 observed)
```
```diff
 # MTP topk has higher KL divergence threshold due to speculative decoding variance
 ACC_THRESHOLDS_MTP = {
-    QWEN3_NEXT_MODEL: {"kl_div": 0.008, "gsm8k": 0.93},
+    QWEN3_NEXT_MODEL: {"kl_div": 0.02, "gsm8k": 0.93},
```
Increasing the threshold to 0.02 (8x the baseline) effectively addresses CI flakiness but reduces the sensitivity of the test to subtle numerical regressions. Given that outliers of ~0.1 KL were observed, it is worth verifying if these occur on specific tokens (e.g., rare tokens or long sequences) to ensure the variance is purely numerical and not a logic issue in the tree-based logprob matching.
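A quick way to follow up on this suggestion is to localize which positions contribute the outliers, rather than inspecting only the mean. This is a hypothetical diagnostic sketch; the `per_token_kl` values are invented, and only the ~0.1 outlier scale comes from the discussion:

```python
# Per-position KL values from one flaky run (invented numbers; the
# ~0.1+ outliers mirror the scale reported in this PR discussion).
per_token_kl = [0.001, 0.002, 0.15, 0.001, 0.003, 0.27, 0.002]

# Positions whose KL is an order of magnitude above the rest.
outliers = [(i, kl) for i, kl in enumerate(per_token_kl) if kl > 0.1]

# Clustered outlier positions would hint at a logic issue (e.g. in
# tree-based logprob matching); scattered ones suggest numerical noise.
assert outliers == [(2, 0.15), (5, 0.27)]
```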
Force-pushed 5750e72 to 9ad3a2b
Force-pushed a441639 to 1d3a1ec
The mamba slot release on scheduling failure introduced non-deterministic KL divergence in Qwen3-Next tests. When scheduling fails after `init_next_round_input` has already performed COW allocation, freeing the mamba pool slot causes states to be reconstructed from the radix cache on rescheduling. This reconstruction from the last tracking point introduces numerical differences for some samples, manifesting as outlier KL divergence values (0.12-0.27) that push the arithmetic mean above the test threshold.

CI data confirms this: before the slot release change, Qwen3-Next KL tests had a ~90% pass rate (18/20 runs). After it, the pass rate dropped to ~35% (7/20 runs). The same commit produced both PASS and FAIL on the same day, confirming the test became non-deterministic.

This reverts the mamba slot release logic while preserving the rest of the scheduling code.
Force-pushed 52b883f to e0e0301
I think that could be regarded as deterministic...
Follow-up: the original issue discussed here is now addressed in #19024, which fixes the memory checker directly (without releasing slots or affecting scheduling behavior).
Summary

- When scheduling fails after `init_next_round_input` has already performed COW allocation, freeing the mamba pool slot causes states to be reconstructed from the radix cache on rescheduling, introducing numerical differences that manifest as outlier KL values

Root Cause Analysis

CI data confirms the correlation:

- The same commit `fd5a45d5c` on Feb 15 produced both PASS and FAIL, confirming the test became non-deterministic

The mamba state reconstruction from the radix cache (at the last tracking point) introduces slight numerical differences for some samples, causing outlier KL divergence values (0.12-0.27) that pull the arithmetic mean above the test threshold.
Test plan

- `test_qwen3_next_models_mtp.py`, `test_qwen3_next_models.py`