Conversation
Code Review
This pull request introduces a robust, grammar-based approach using Lark to fix several long-standing issues with Mistral tool calling and reasoning parsing. The changes are comprehensive, covering both streaming and non-streaming modes, various tool choice options, and different tokenizer versions. The introduction of MistralGrammarFactory and the associated Jinja templates for grammar generation is a solid design choice for managing the complexity. The test suite has been significantly expanded, which is excellent for ensuring the correctness of these critical changes. I've found one critical issue in the streaming logic that could lead to corrupted output, for which I've provided a specific comment and suggestion.
```python
if self.bot_token not in delta_text:
    return DeltaMessage(content=delta_text)
```
This condition can lead to corrupted output during streaming. If the bot_token (e.g., `[TOOL_CALLS]`) is split across multiple streaming deltas, delta_text may not contain the full token. In this scenario, the current logic returns the partial token string as content, which is incorrect.
The correct behavior is to wait for more tokens by returning None, especially since the calling function has already confirmed that the bot_token exists in the cumulative current_text. This ensures that partial tool call markers are not leaked as content.
```python
return None
```
Special tokens are not split across deltas, so I don't think this is true.
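For illustration, the behavior the review suggests could be sketched as follows. This is a simplified stand-in, not the actual vLLM parser API; the function name and return shape are hypothetical:

```python
def handle_delta(bot_token: str, delta_text: str, current_text: str):
    """Sketch: avoid leaking a partial bot token as streamed content.

    The caller is assumed to have already seen bot_token somewhere in
    the cumulative current_text; if the current delta holds only a
    fragment of it, return None to wait for more tokens.
    """
    if bot_token not in delta_text:
        if bot_token in current_text:
            # Token split across deltas: buffer instead of emitting.
            return None
        return {"content": delta_text}
    # Full token present in this delta: emit any content before it.
    prefix = delta_text.split(bot_token, 1)[0]
    return {"content": prefix} if prefix else None
```

Whether this guard is needed depends on whether the tokenizer can ever emit a special token across two deltas, which is exactly the point under discussion in this thread.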
Signed-off-by: juliendenize <julien.denize@mistral.ai>
Hi @juliendenize, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: juliendenize <julien.denize@mistral.ai>
Force-pushed from 35dc6d4 to 5fee4b0.
```python
prompt_tokens = mistral_tokenizer.tokenizer.encode(
    "Just a regular prompt.", bos=False, eos=False
)
assert parser.is_reasoning_end(prompt_tokens) is False
```
```diff
- assert parser.is_reasoning_end(prompt_tokens) is False
+ assert not parser.is_reasoning_end(prompt_tokens)
```
```python
    + mistral_tokenizer.tokenizer.encode(prompt_text_after, bos=False, eos=False)
)

assert parser.is_reasoning_end(prompt_tokens) is True
```
```diff
- assert parser.is_reasoning_end(prompt_tokens) is True
+ assert parser.is_reasoning_end(prompt_tokens)
```
```python
    self.tool_parser, MistralToolParser
)
if _is_mistral_tool_parser and self.reasoning_parser_cls is not None:
    MistralToolParser._has_reasoning_parser = True
```
```python
MistralToolParser._has_reasoning_parser = True
```
Why are we setting a bool on the MistralToolParser here? Could we just do this directly in the `get_tool_parser` function?
```python
)
harmony_tools_streamed[i] |= tools_streamed_flag
elif use_mistral_grammar and reasoning_parser:
    assert tool_parser is not None
```
Can we move all this new code into a chat_completion/mistral_grammar.py file?
```python
    and tokenizer.version >= 15
    and request.reasoning_effort in (None, "none")
)
if not skip_reasoning:
```
Think this is a lot of extra code here.
Couldn't we just set `self.reasoning_parser = None` if `skip_reasoning` is False much earlier above, so that the code here could stay exactly the same?
```python
    request, tokenizer, self.tool_parser
)
tool_call_items: list[ToolCall]
if use_mistral_grammar:
```
Can we maybe also import this code from another function?
Also, this one is about the Mistral API, not related to the grammar, no?
```python
use_mistral_grammar = (
    isinstance(request, ChatCompletionRequest)
    and tokenizer is not None
    and is_mistral_lark_grammar_active(
        request,
        tokenizer,
        tool_parser_cls,  # type: ignore[arg-type]
    )
```
```diff
- use_mistral_grammar = (
-     isinstance(request, ChatCompletionRequest)
-     and tokenizer is not None
-     and is_mistral_lark_grammar_active(
-         request,
-         tokenizer,
-         tool_parser_cls,  # type: ignore[arg-type]
-     )
+ use_mistral_grammar = isinstance(tokenizer, MistralTokenizer) and tokenizer.is_grammar_active(request)
```
Can't we inject more logic through the MistralTokenizer here?
```
{% endif -%}
think: <THINK> content </THINK>
content: (/(.|\n)+/)+
SAFE_WS: /[ \t\r\n]+/"""
```
How likely are these grammars to change again? Should we think about a mechanism to inject grammars via file loading?
Maybe it's OK to leave as-is for a v1, but I'm not sure this is future-proof.
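If file-based injection were wanted later, one minimal shape could look like the sketch below. The helper name and override directory are hypothetical, not part of this PR:

```python
from pathlib import Path

def load_lark_grammar(name: str, builtin: str,
                      override_dir: str = "grammars") -> str:
    """Return a user-supplied <name>.lark file if one exists in
    override_dir, otherwise fall back to the built-in grammar string
    shipped with the package."""
    override = Path(override_dir) / f"{name}.lark"
    if override.is_file():
        return override.read_text(encoding="utf-8")
    return builtin
```

The Jinja templating could still run on top of the loaded text, so the built-in templates would just become the default entries in the lookup.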
```python
return (
    tool_parser_cls is not None
    and issubclass(tool_parser_cls, MistralToolParser)
    and is_mistral_tokenizer(tokenizer)
```
Is it possible to have `issubclass(tool_parser_cls, MistralToolParser)` but not `is_mistral_tokenizer(tokenizer)`?
```python
if not args:
    args = {"type": "object", "properties": {}, "additionalProperties": False}
```
```diff
- if not args:
-     args = {"type": "object", "properties": {}, "additionalProperties": False}
+ args = args or {"type": "object", "properties": {}, "additionalProperties": False}
```
```python
args = tool.function.parameters if _is_strict_tool(tool) else {"type": "object"}
# Handle empty parameters case
if not args:
    args = {"type": "object", "properties": {}, "additionalProperties": False}
```
How can this happen? Why do we have `{"type": "object"}` and not `{"type": "object", "properties": {}, "additionalProperties": False}`?
Maybe we can clean this up and have only two cases? I think, for us at least, `parameters` has to be a valid JSON schema.
```python
self._tokenizer = tokenizer
self._tokenizer_version = tokenizer.version


def get_lark_from_jinja(
```
Should we maybe cache this function?
`Template(jinja_template).render(...)` looks a bit expensive, and there don't seem to be that many possible function arg options.
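A cheap way to do that, assuming all render inputs are hashable, would be `functools.lru_cache`. The function name and parameter set below are illustrative, not the actual vLLM signature:

```python
from functools import lru_cache

from jinja2 import Template

@lru_cache(maxsize=128)
def render_lark_grammar(jinja_template: str, tokenizer_version: int,
                        parallel_tool_calls: bool) -> str:
    """Render the Lark grammar once per (template, version, flag)
    combination; repeated requests with the same options hit the cache
    instead of re-parsing and re-rendering the Jinja template."""
    return Template(jinja_template).render(
        tokenizer_version=tokenizer_version,
        parallel_tool_calls=parallel_tool_calls,
    )
```

Caching the compiled `Template` object alone would also work if the rendered output must stay per-request, but with so few argument combinations caching the final string seems simplest.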
```python
)
self.grammar_factory = (
    MistralGrammarFactory(tokenizer)  # type: ignore[arg-type]
    if is_mistral_tokenizer(self.model_tokenizer)
```
How can this not be a Mistral tokenizer? The MistralToolParser should only be compatible with MistralTokenizer, no?
```python
    parallel_tool_calls: bool,
) -> str: ...


def get_args_json(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:
```
```diff
- def get_args_json(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:
+ def get_tool_parameters_schema(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:
```
Or something like this?
```python
regex: str | None = None
choice: list[str] | None = None
grammar: str | None = None
lark: str | None = None
```
Is Lark compatible with tokenizers other than the Mistral ones?
patrickvonplaten left a comment
Super cool feature addition!!!

Two general points:

1. I think we should try, to the best of our ability, not to modify "general" files such as chat_completion/serving.py too much, and instead inject all the logic via the `MistralTokenizer` and/or `MistralToolParser`.
2. Think it could be better to add a couple of `assert ..., "not supported"` statements for use cases that are too exotic and probably low usage (such as guidance + spm tokenizer). By doing this I think we should be able to clean up some of the more complex code (like overwriting `__call__` of the tokenizer, etc.).
Can you add the command for how you're running the vLLM server to the PR description? It seems that this PR only works on the llguidance backend: it explicitly throws on the xgrammar backend, but is unhandled for outlines / lm-format-enforcer. Is this expected? I'm curious to know whether you've explored structural tag as an alternative to Lark; structural tag has more backend coverage and is simpler, if it can satisfy what the Mistral format needs.
```python
if not request.tools:
    # Sanitize tool_choice.
    request.tool_choice = "none"
```
sfeng33 left a comment
Nice work on the correctness fixes! One note on the performance side:

The Lark grammar is applied for all `tool_choice` modes, including `"none"` and `"auto"`. This means every Mistral tool-calling request pays guided decoding overhead (grammar compilation per request plus bitmask computation per token), even when the grammar is essentially unconstrained (e.g. `tool_choice="none"` produces `body: content`, which matches nearly everything).

For `"none"` specifically, the only real effect is preventing the `[TOOL_CALLS]` special token from being emitted, which could be achieved by just masking that single token ID without involving the grammar engine.
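That single-token mask could be sketched as a per-step logits hook. This is a plain-Python stand-in for illustration, not the actual vLLM logits-processor interface:

```python
def make_ban_token_processor(banned_token_id: int):
    """Build a per-step logits hook that forbids one token id by
    pinning its logit to -inf, so it can never be sampled. No grammar
    compilation or per-token bitmask is needed."""
    def processor(past_token_ids: list[int],
                  logits: list[float]) -> list[float]:
        masked = list(logits)  # copy: don't mutate the caller's buffer
        masked[banned_token_id] = float("-inf")
        return masked
    return processor
```

For `tool_choice="none"` the banned id would be the `[TOOL_CALLS]` special token's id; all other logits pass through untouched, so decoding cost stays essentially unchanged.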
```python
seen_special_tokens: set[str] = set()
for i in range(self._tokenizer.n_words):
    # Convert square brackets to angle brackets for special tokens,
    # since llg only recognizes the latter.
```
I think llg now recognizes token ids as well, but maybe keeping it this way is more modular here?
Signed-off-by: juliendenize <julien.denize@mistral.ai>
I pushed a commit to enforce
Signed-off-by: juliendenize <julien.denize@mistral.ai>
This pull request has merge conflicts that must be resolved before it can be merged.
```python
)


BASE_LARK_GRAMMAR = r"""start: body
```
grammar definition
When Mistral Lark grammar constrains tool call generation (PR vllm-project#37081), the streaming tool parser's `streamed_args_for_tool` list is never populated — the grammar path handles token generation differently from the incremental `[TOOL_CALLS]` parsing path. At `finish_reason=tool_calls`, the code accesses `streamed_args_for_tool[index]` unconditionally, causing:

```
IndexError: list index out of range
```

The grammar path already streams tool call arguments correctly via delta chunks — the remaining-args diff at finish time is unnecessary.

Fix: bounds-check `streamed_args_for_tool` and skip the remaining-args computation when the list is empty (the grammar/structured output path).

Tested: short and long streaming tool calls produce valid JSON.
Affects: any model using Mistral Lark grammar tool parsing with streaming.
Discovered: DGX Spark with Mistral Small 4 + Open WebUI, March 2026.
Purpose
Fix Mistral tool calling and reasoning parsing by introducing Lark grammar-based structured output for Mistral models. This ensures correct tool call parsing for both streaming and non-streaming modes across all `tool_choice` values (`auto`, `required`, named, `none`) and all Mistral tokenizer versions.

Problems on `main`:

- `[TOOL_CALLS]` tokens leak into content instead of being parsed as tool calls.
- `tool_choice="required"` and named tool choice produce unparseable arguments (e.g., `[TOOL_CALLS]get_current_weather{...}` embedded in the arguments string).
- `tool_choice="none"` leaks raw `[TOOL_CALLS]` special tokens into user-visible content.
- Reasoning (`[THINK]...[/THINK]`) is not separated from content for pre-v15 models and not properly controlled via `reasoning_effort` for v15+ models.

This PR fixes these by introducing a `MistralGrammarFactory` that generates Lark grammars from Jinja templates, selecting the appropriate grammar variant. This involves adding Lark grammar and Mistral tokenizer support to llguidance.

Test Plan
Unit tests (in this repo)
End-to-end tests (external repo)
Scripts and results are available at: https://github.com/juliendenize/vllm-test-tool-and-reasoning-parsing
Post-v15 tokenizer: 48 tests (8 scenarios × 2 stream modes × 3 `reasoning_effort` values)
Pre-v15 tokenizer: 16 tests (8 scenarios × 2 stream modes)
Test Result
End-to-end results comparison