[BugFix] fix endless-generation bug when reasoner grammar & spec decoding are enabled #12008
Conversation
Summary of Changes

This pull request resolves a significant bug that caused models to enter an infinite loop of empty-string generation when using reasoner grammar alongside speculative decoding. The fix refines how grammar states are managed and copied, particularly during vocabulary-mask generation, and introduces more precise control over when tokens are masked. These changes ensure that the grammar's state is correctly preserved and that masking logic is applied only when appropriate, leading to more stable and predictable model output.
Code Review
This pull request effectively addresses a critical bug causing an infinite loop when reasoner grammar and speculative decoding are enabled simultaneously. The core of the fix involves introducing accept_token_length() to prevent erroneous rollbacks on the grammar, and ensuring that a copy of the grammar object is used during speculative tree traversal to preserve its state. The changes to ReasonerGrammarObject to correctly manage its state during the reasoning phase are also well-implemented. Overall, the logic is sound and the fix appears robust. I have a couple of minor suggestions to improve code clarity and consistency.
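The review above describes two ideas: an accept_token_length() check to avoid rolling the grammar back past what it has actually committed, and traversing the speculative draft tree on a copy of the grammar so the original state survives rejected draft tokens. A minimal sketch of the second idea, assuming toy stand-in names (GrammarState, verify_draft, is_valid are hypothetical and not SGLang's actual API):

```python
from copy import deepcopy

class GrammarState:
    """Toy grammar matcher that tracks the tokens it has accepted.
    A hypothetical stand-in for the real grammar backend object."""
    def __init__(self):
        self.accepted = []

    def accept_token(self, tok):
        self.accepted.append(tok)

    def accept_token_length(self):
        # Number of tokens this grammar state has committed to; the real
        # fix uses a check like this to avoid over-eager rollbacks.
        return len(self.accepted)

def verify_draft(grammar, draft_tokens, is_valid):
    """Walk the speculative draft on a *copy* of the grammar, so the
    original object is untouched if some draft tokens are rejected."""
    scratch = deepcopy(grammar)          # traversal mutates the copy only
    accepted = []
    for tok in draft_tokens:
        if not is_valid(scratch, tok):   # e.g. a vocab-mask check
            break
        scratch.accept_token(tok)
        accepted.append(tok)
    # Commit only the verified prefix to the real grammar.
    for tok in accepted:
        grammar.accept_token(tok)
    return accepted

g = GrammarState()
out = verify_draft(g, [1, 2, 99, 3], lambda s, t: t < 10)
print(out)                      # [1, 2]
print(g.accept_token_length())  # 2
```

Without the copy, the rejected branch would have already mutated the shared grammar, and a subsequent rollback could leave it in a state inconsistent with the tokens that were actually kept.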
Will this PR be merged? I've used this PR for quite a long time to prevent the hang issues.
@hnyls2002 @merrymercy @Ying1123 @kssteven418 Hi, could you help review this PR?
@justadogistaken could you resolve the conflicts, tysm!
@LorrinWWW The problem seems to be solved in the latest version, so I will close the PR. But I found another problem; could you help review it? ref #14480



Motivation
Related to #9187 and #11550.
When reasoner grammar and spec decoding are enabled at the same time, the model generates <think> first, and then nothing is generated until max_tokens is reached. I found that the problem comes from allocate_token_bitmask.fill_(0) and the reasoner grammar.

Modifications
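The failure mode described in the Motivation can be sketched as follows. If the vocab bitmask is zero-filled while the grammar is still in the reasoning (<think>) phase and nothing ever sets the allowed bits, every token stays masked and sampling produces nothing, step after step, until max_tokens. This is a minimal sketch with hypothetical names (VOCAB, sample, in_reasoning_phase); the real code operates on a torch bitmask rather than Python lists:

```python
VOCAB = 8

def sample(logits, mask):
    """Pick the highest-logit token whose mask bit is 1;
    None if everything is masked."""
    allowed = [i for i in range(VOCAB) if mask[i]]
    if not allowed:
        return None  # nothing can be sampled -> empty output every step
    return max(allowed, key=lambda i: logits[i])

# Buggy path: the bitmask is zero-filled, but during the reasoning phase
# the grammar never writes any allowed bits, so every token stays masked.
buggy_mask = [0] * VOCAB
print(sample([0.1] * VOCAB, buggy_mask))   # None

# Fixed behavior: while in the reasoning phase, skip grammar masking
# entirely (equivalently, leave all tokens allowed).
in_reasoning_phase = True
mask = [1] * VOCAB if in_reasoning_phase else [0] * VOCAB
print(sample([0.0, 0.5, 0.2, 0.9, 0.1, 0.3, 0.0, 0.4], mask))  # 3
```

The point is that the masking logic must only apply once the grammar is actually constraining output; a zero-filled default is only safe if something is guaranteed to fill the allowed bits before sampling.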
Accuracy Tests
Benchmarking and Profiling
Checklist