[Mistral Grammar] Fix tool and reasoning parsing#39217
vllm-bot merged 7 commits into vllm-project:main
Conversation
```python
@model_validator(mode="before")
@classmethod
def set_include_reasoning_for_none_effort(cls, data: Any) -> Any:
    if data.get("reasoning_effort") == "none":
        data["include_reasoning"] = False
    return data
```
This was introduced by #36238, but it was a bad idea: sometimes the model might still want to reason, and suppressing that forces it out of distribution (OOD).
Code Review
This pull request introduces comprehensive support for Mistral tool parsing, including grammar-based tool call enforcement and integrated reasoning extraction for streaming responses. It updates the MistralToolParser to handle streaming reasoning and tool calls, adds necessary protocol updates for grammar-based parsing, and expands the test suite to cover various tool-use scenarios. I have identified a critical issue regarding the mutation of global state in MistralToolParser within the OpenAIServingChat class, which poses thread-safety risks.
```python
_is_mistral_tool_parser = self.tool_parser is not None and issubclass(
    self.tool_parser, MistralToolParser
)
if _is_mistral_tool_parser and self.reasoning_parser_cls is not None:
    MistralToolParser.model_can_reason = True
```
Setting a class attribute MistralToolParser.model_can_reason = True in the __init__ method of OpenAIServingChat is a global state mutation that will affect all instances of MistralToolParser across the entire application. This is a thread-safety issue and can lead to unpredictable behavior if multiple models with different reasoning capabilities are served simultaneously. This should be handled via instance-level configuration or a more robust mechanism.
Yeah, indeed this is not clean, but it was discussed in the previous PR. I don't know how else we should do this 😄
I may not have seen all the previous discussion - did you consider just looking at the reasoning_effort in the request? That's what gates the actual prompt to enable reasoning outputs, right? Or do you need model_can_reason to be true any time a reasoning parser is set, even if the requests are using reasoning_effort=None?
Actually, using reasoning_effort does not help because it:
- was introduced only recently, so older models won't work with it
- is not always followed by the model: even with reasoning_effort set to "high" or "none", the model sometimes ignores the instruction. While this behavior is undesired and could be prevented by the grammar, we found that forcing the model to do something it didn't "want" is usually unstable and can end up in an infinite loop
That's reasonable. To avoid mutating global state, can we just set this on the instance of the tool parser as opposed to mutating the class globally? I think it's just changing this line to self.tool_parser.model_can_reason = True instead of changing it on the class itself, and moving the definition of that model_can_reason field into the constructor of the Mistral tool parser?
That makes it set per instance of tool parser as opposed to globally.
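The difference being discussed can be sketched in isolation (the class below is illustrative, not vLLM's actual MistralToolParser): mutating a class attribute leaks to every instance, while setting it on one instance does not.

```python
# Illustrative sketch of class-attribute vs instance-attribute mutation.
class ToolParser:
    # Class-level default shared by every instance.
    model_can_reason = False

a = ToolParser()
b = ToolParser()

# Mutating the class attribute affects *all* instances (the problem):
ToolParser.model_can_reason = True
assert a.model_can_reason and b.model_can_reason

# Reset, then set it on a single instance instead (the proposed fix):
ToolParser.model_can_reason = False
a.model_can_reason = True   # shadows the class attribute for `a` only
assert a.model_can_reason
assert not b.model_can_reason  # other instances keep the class default
```

With the instance-level assignment, two models served in the same process can have different reasoning capabilities without stepping on each other.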
Hi @juliendenize, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
@sfeng33 How do you feel about all the Mistral-specific conditionals wired into the 3 serving.py variants here? I don't love it, but I also recognize that we do this for other models, so it likely points at some missing abstractions on our end. I'm inclined to focus on ensuring this doesn't negatively impact non-Mistral paths and properly wires up the new Mistral grammar bits, and then consider tackling the larger changes to add abstractions and clean these things up after the fact. But I'd like a second opinion here, since merging this as-is signs us (and others in this area) up to clean these bits up later as we work on the Parser abstractions.

I ran the added tests (both unit and integration) and they all pass. I also pointed out an issue to @juliendenize elsewhere: we've regressed a bit on our stripping of extra tool call args, which causes the mistral_common library to throw an error. Here's a stack from that when I was running BFCL multi_turn against these changes, though the regression was actually introduced in #38150. We'll need to fix this in a fast-follow, or wrap it into this PR as well, given how often tools in the wild pass in extra args that mistral_common doesn't handle here.
@bbrowning I think I addressed the tool issue; would it be possible for you to rerun the tests and confirm? I abstracted the tool adaptation for mistral-common a bit to limit code duplication.
There's a problem where the newly added code to clean Mistral tool calls modifies the dict while iterating over it, resulting in stack traces like this:
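The underlying failure mode can be reproduced in isolation (the field names below are hypothetical, not from the PR): CPython raises RuntimeError when a dict changes size during iteration, and iterating over a snapshot of the keys avoids it.

```python
# Minimal reproduction of the bug: deleting keys from a dict while
# iterating over it raises RuntimeError in CPython.
tool_args = {"unknown_field": 1, "location": "Paris", "unit": "celsius"}
known_fields = {"location", "unit"}

def strip_unknown_buggy(args: dict) -> None:
    for key in args:  # iterating over the live dict
        if key not in known_fields:
            del args[key]  # dict changed size during iteration -> RuntimeError

def strip_unknown_fixed(args: dict) -> None:
    for key in list(args):  # iterate over a snapshot of the keys
        if key not in known_fields:
            del args[key]

try:
    strip_unknown_buggy(dict(tool_args))
except RuntimeError:
    pass  # the crash seen in the reported stack traces

strip_unknown_fixed(tool_args)
assert tool_args == {"location": "Paris", "unit": "celsius"}
```

The fix in the workaround diff amounts to the `list(args)` snapshot shown here.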
I worked around this locally for now with this diff, although it highlights that we don't have anything testing this logic to confirm it actually works. With that diff, it seems to properly strip the unknown fields without erroring when I send chat completion requests from something like BFCL multi_turn. I haven't done any kind of before/after comparison, and I'm not entirely sure how, as a user, to verify the new tool/reasoning parsing is working vs the old path. But, I do see non-streaming tool requests sent to
@bbrowning Thanks, I applied the patch. Regarding testing requests, I have testing scripts on my gh here: Basically, I tested various requests on main and on this branch to see which tests fail when we want to enforce a tool call with a correct name, with or without reasoning, with JSON structured output, etc. The only "failing" test on this branch is that when the user specifically instructs the model to return JSON, the grammar (without tool_choice = "none") allows a non-JSON output that the model usually chooses. On main, however, a lot of tests fail. One example: when reasoning is not performed but a reasoning_parser is defined, vLLM does not correctly parse tool calls.
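The kinds of requests described in those tests can be sketched as a payload builder (the model name and tool schema below are placeholders, not from the PR's scripts):

```python
# Hypothetical sketch of the request shapes exercised in the tests above:
# varying tool_choice and reasoning, with a placeholder weather tool.
def make_chat_request(tool_choice, with_reasoning: bool) -> dict:
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    payload = {
        "model": "mistralai/placeholder-model",  # placeholder
        "messages": [{"role": "user", "content": "Weather in Paris?"}],
        "tools": tools,
        # "auto", "required", "none", or a named-tool dict
        "tool_choice": tool_choice,
    }
    if not with_reasoning:
        payload["reasoning_effort"] = "none"
    return payload

# Enforcing a tool call with a specific (correct) name:
named = make_chat_request(
    {"type": "function", "function": {"name": "get_weather"}},
    with_reasoning=True,
)
assert named["tool_choice"]["function"]["name"] == "get_weather"
```

Sweeping tool_choice over its modes, with reasoning on and off, covers the matrix of cases the comment describes.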
This pull request has merge conflicts that must be resolved before it can be merged.
@juliendenize It could be something wrong with my setup, but I used https://github.com/eugr/spark-vllm-docker to build for the GB10 from main with your PR applied. The first query sent to https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-NVFP4 resulted in a vLLM error dump. Full error posted on https://forums.developer.nvidia.com/t/running-mistral-small-4-119b-nvfp4-on-nvidia-dgx-spark-gb10/363863/53?u=drew22 (version 0.19.1rc1.dev243+g995e9a209.d20260413).
@androiddrew I've seen that same error. For now, this model does not work on any hardware that requires Triton attention, which includes the DGX Spark. I brought it up to the team in vLLM Slack previously (as I always test on Spark as well), but I'm not sure there's an upstream issue tracking it; that's the exact error I also get when running this model on any hardware that requires Triton and cannot use another attention backend. It's worth filing one, as the practical implication is that until the Triton attention backend is fixed to work with this model, only architectures that can use Flash Attention will work (some versions of Ampere, Hopper, and Blackwell GPUs, I believe). But that's a separate issue from the tool and reasoning parsing being fixed here.
I was under the impression that https://github.com/eugr/spark-vllm-docker patches for MLA_ATTENTION. I had this model working on my Spark back on March 19th: it generates text, and the only problem I had was that opencode was seeing TOOL_CALLS leak into the context. That was using version 0.17.2rc1.dev7+g9c7cab5eb.d20260317, though, and my own patch to This was my first attempt at using current main with this PR applied via
@bbrowning Thanks for the heads up. This, plus a small change to @juliendenize's vllm/tokenizers/mistral.py, appears to have resolved my issue. I can now successfully load Mistral 4 and make tool calls in opencode on my DGX Spark.
Force-pushed 5d8a9b0 to 85e0ee5
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: juliendenize <julien.denize@mistral.ai>
Force-pushed 3d34a0e to 69466bf
The tests pass, though in terms of code design we might eventually want to delegate the whole _parse_tool_calls_from_content into the tool parser so we don't need to complicate the code with additional cases.
I'll merge this once the other reviewers also approve.
@DarkLight1337 Indeed, there are code design issues there. I tried to minimize the additional cases, but it would definitely be best to have a unified Mistral parser and to make sure we only call known methods from the unified parser class. However, as things are AFAIK still moving inside vLLM regarding this parser, I thought it might be best to wait for it to be stable / fully added. I would gladly help clean this up in the future.
sfeng33 left a comment:
Thank you for the work!
Purpose
When Mistral models are served with `--tool-call-parser mistral` and a `mistral-common`-compatible tokenizer (tekken/v11+), #38150 introduced grammar-based tool-call enforcement: `adjust_request` injects a Lark grammar from mistral-common's grammar factory into `structured_outputs`, constraining model output to valid Mistral tool-call formatting at the decoding level.

However, that PR only handled grammar injection. The serving layer still fell through to generic vLLM tool-call parsing paths that don't understand Mistral's grammar-constrained format. This broke tool-call extraction for all tool_choice modes (auto, required, named, none), both streaming and non-streaming.
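The grammar-injection flow described above can be sketched as follows. This is a simplified, hypothetical stand-in for vLLM's actual `adjust_request` hook and mistral-common's grammar factory; the function, the `structured_outputs` dict shape, and the toy grammar string are illustrative only.

```python
# Hypothetical sketch of grammar injection into a request's
# structured-outputs parameters (not the real vLLM hook).
def adjust_request(request: dict, lark_grammar: str) -> dict:
    # Per this PR, even tool_choice="none" routes through the hook so the
    # grammar can suppress special-token leakage like [TOOL_CALLS].
    if request.get("tools"):
        request.setdefault("structured_outputs", {})["grammar"] = lark_grammar
    return request

# Toy Lark grammar standing in for mistral-common's generated one.
toy_grammar = 'start: "[TOOL_CALLS]" /.*/'

req = {"tools": [{"type": "function"}], "tool_choice": "auto"}
adjusted = adjust_request(req, toy_grammar)
assert adjusted["structured_outputs"]["grammar"] == toy_grammar

# A request without tools is left unconstrained.
plain = adjust_request({"messages": []}, toy_grammar)
assert "structured_outputs" not in plain
```

The key point is that the constraint is applied at decoding time, so the serving layer must then parse output that is guaranteed to match the grammar rather than free-form text.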
This PR makes the serving layer recognize grammar-constrained Mistral output and route it through MistralToolParser for correct parsing, rather than falling through to the generic paths. It also ensures `tool_choice="none"` still calls adjust_request on grammar-capable tokenizers so the grammar factory can suppress special-token leakage (e.g., [TOOL_CALLS] appearing in plain text output).

This is the second PR, improving on the first attempt in #37081.
Test Plan
The branch adds:
Test Result
The tests pass.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.