
[Mistral Grammar] Fix tool and reasoning parsing #39217

Merged
vllm-bot merged 7 commits into vllm-project:main from juliendenize:fix_parsing_on_top_of_grammar
Apr 16, 2026

Conversation

@juliendenize
Contributor

@juliendenize juliendenize commented Apr 7, 2026

Purpose

When Mistral models are served with --tool-call-parser mistral and a mistral-common compatible tokenizer (tekken/v11+), #38150 introduced grammar-based tool-call enforcement: adjust_request injects a Lark grammar from mistral-common's grammar factory into structured_outputs, constraining model output to valid Mistral tool-call formatting at the decoding level.

However, that PR only handled grammar injection. The serving layer still fell through to generic vLLM tool-call parsing paths that don't understand Mistral's grammar-constrained format. This broke tool-call extraction for all tool_choice modes (auto, required, named, none), both streaming and non-streaming.

This PR makes the serving layer recognize grammar-constrained Mistral output and route it through MistralToolParser for correct parsing, rather than falling through to the generic paths. It also ensures tool_choice="none" still calls adjust_request on grammar-capable tokenizers so the grammar factory can suppress special-token leakage (e.g., [TOOL_CALLS] appearing in plain text output).

This is the second PR improving on the first attempt in #37081.
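For illustration, the injection-plus-routing flow described above can be sketched as follows. `Request`, `adjust_request`, and the grammar factory here are simplified, hypothetical stand-ins for vLLM's and mistral-common's actual APIs, not their real signatures:

```python
# Hypothetical sketch of the grammar-injection flow described in the PR
# purpose. All names below are illustrative stand-ins, not vLLM's API.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Request:
    tools: list[dict] = field(default_factory=list)
    tool_choice: str = "auto"
    structured_outputs: dict[str, Any] = field(default_factory=dict)


def adjust_request(request: Request, grammar_factory: Callable) -> Request:
    # Inject a Lark grammar constraining decoding to Mistral's tool-call
    # format. Note tool_choice="none" still goes through this path, so
    # the grammar can suppress special-token leakage like [TOOL_CALLS].
    request.structured_outputs["grammar"] = grammar_factory(request.tools)
    return request


req = adjust_request(
    Request(tools=[{"name": "get_weather"}]),
    lambda tools: f"start: tool_calls  // grammar over {len(tools)} tool(s)",
)
assert "grammar" in req.structured_outputs
```

The serving layer then needs to recognize that the output was produced under this grammar and hand it to `MistralToolParser` rather than the generic parsing paths.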

Test Plan

The branch adds:

  • Unit tests in tests/tool_parsers/test_mistral_tool_parser.py
  • E2E tests in tests/tool_use/mistral/test_mistral_tool_calls.py
# Unit tests
pytest tests/tool_parsers/test_mistral_tool_parser.py -v
# E2E tests
pytest tests/tool_use/mistral/test_mistral_tool_calls.py -v

Test Result

The tests pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Comment on lines -784 to -789
@model_validator(mode="before")
@classmethod
def set_include_reasoning_for_none_effort(cls, data: Any) -> Any:
    if data.get("reasoning_effort") == "none":
        data["include_reasoning"] = False
    return data
Contributor Author

This was introduced by #36238, but it was a bad idea: sometimes the model wants to try to reason, and suppressing that forces its output to be out-of-distribution (OOD).

Contributor

Agree thanks!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces comprehensive support for Mistral tool parsing, including grammar-based tool call enforcement and integrated reasoning extraction for streaming responses. It updates the MistralToolParser to handle streaming reasoning and tool calls, adds necessary protocol updates for grammar-based parsing, and expands the test suite to cover various tool-use scenarios. I have identified a critical issue regarding the mutation of global state in MistralToolParser within the OpenAIServingChat class, which poses thread-safety risks.

Comment on lines +140 to +144
_is_mistral_tool_parser = self.tool_parser is not None and issubclass(
    self.tool_parser, MistralToolParser
)
if _is_mistral_tool_parser and self.reasoning_parser_cls is not None:
    MistralToolParser.model_can_reason = True
Contributor

critical

Setting a class attribute MistralToolParser.model_can_reason = True in the __init__ method of OpenAIServingChat is a global state mutation that will affect all instances of MistralToolParser across the entire application. This is a thread-safety issue and can lead to unpredictable behavior if multiple models with different reasoning capabilities are served simultaneously. This should be handled via instance-level configuration or a more robust mechanism.

Contributor Author

Yeah, indeed this is not clean, but it was discussed in the previous PR. I don't know how else we should do this 😄

Collaborator

I may not have seen all the previous discussion - did you consider just looking at the reasoning_effort in the request? That's what gates the actual prompt to enable reasoning outputs, right? Or do you need model_can_reason to be true any time a reasoning parser is set, even if the requests are using reasoning_effort=None?

Contributor Author

Actually, using reasoning_effort does not help, because it:

  • was only recently introduced, so previous models won't work with it
  • is not always followed: even with reasoning_effort set to "high" or "none", the model sometimes ignores the instruction. While that behavior is undesired and could be prevented by the grammar, we found that enforcing it is usually unstable, since the model is forced into doing something it didn't "want", which can end up in an infinite loop

Collaborator

That's reasonable. To avoid mutating global state, can we just set this on the instance of the tool parser, as opposed to mutating the class attribute? I think it's just changing this line to self.tool_parser.model_can_reason = True and moving the definition of the model_can_reason field into the constructor of the Mistral tool parser.

That makes it set per instance of the tool parser, as opposed to globally.
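The difference between the two approaches can be shown in isolation. `Parser` below is a minimal stand-in for MistralToolParser, not the real class:

```python
# Minimal illustration of the thread-safety concern discussed above:
# assigning to a class attribute changes state for every instance,
# while assigning on one instance leaves the others untouched.
class Parser:
    model_can_reason = False  # class-level default


a, b = Parser(), Parser()

Parser.model_can_reason = True  # global mutation: every instance affected
assert a.model_can_reason and b.model_can_reason

Parser.model_can_reason = False
a.model_can_reason = True  # instance-level: only `a` is affected
assert a.model_can_reason and not b.model_can_reason
```

With two models served simultaneously, the class-level mutation would leak one model's reasoning capability into the other's parser; the instance-level assignment does not.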

@mergify
Contributor

mergify bot commented Apr 7, 2026

Hi @juliendenize, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@bbrowning
Collaborator

@sfeng33 How do you feel about all the Mistral-specific conditionals wired into the 3 serving.py variants here? I don't love it, but I also recognize that we do this for other models, so it likely represents some missing abstractions on our end. I'm inclined to focus on ensuring this doesn't negatively impact non-Mistral paths and properly wires up the new Mistral grammar bits, and then consider tackling the larger changes to add abstractions and clean these things up after the fact. But I'd like a second opinion here, since merging this as-is would be signing us, and others in this area, up to clean these bits up later as we work on the Parser abstractions.

I ran the added tests (both unit and integration) and they all pass.

I also pointed out an issue to @juliendenize elsewhere: we've regressed a bit on our stripping of extra tool-call args, which causes the mistral_common library to throw an error. Here's a stack trace from when I was running BFCL multi_turn against these changes, though the regression was actually in #38150. We'll need to either fix this in a fast-follow or wrap it into this PR as well, given how often in-the-wild tools pass extra args that mistral_common doesn't handle here.

(APIServer pid=158664)   File "/home/bbrowning/src/vllm/vllm/entrypoints/openai/chat_completion/serving.py", line 239, in create_chat_completion
(APIServer pid=158664)     result = await self.render_chat_request(request)
(APIServer pid=158664)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=158664)   File "/home/bbrowning/src/vllm/vllm/entrypoints/openai/chat_completion/serving.py", line 211, in render_chat_request
(APIServer pid=158664)     return await self.openai_serving_render.render_chat(request)
(APIServer pid=158664)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=158664)   File "/home/bbrowning/src/vllm/vllm/entrypoints/serve/render/serving.py", line 246, in render_chat
(APIServer pid=158664)     conversation, engine_inputs = await self.preprocess_chat(
(APIServer pid=158664)                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=158664)   File "/home/bbrowning/src/vllm/vllm/entrypoints/serve/render/serving.py", line 577, in preprocess_chat
(APIServer pid=158664)     request = tool_parser(tokenizer, request.tools).adjust_request(
(APIServer pid=158664)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=158664)   File "/home/bbrowning/src/vllm/vllm/tool_parsers/mistral_tool_parser.py", line 247, in adjust_request
(APIServer pid=158664)     MistralTool.from_openai(openai_tool=tool.model_dump())
(APIServer pid=158664)   File "/home/bbrowning/src/vllm/.venv/lib64/python3.12/site-packages/mistral_common/protocol/instruct/tool_calls.py", line 145, in from_openai
(APIServer pid=158664)     return cls.model_validate(openai_tool)
(APIServer pid=158664)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=158664)   File "/home/bbrowning/src/vllm/.venv/lib64/python3.12/site-packages/pydantic/main.py", line 716, in model_validate
(APIServer pid=158664)     return cls.__pydantic_validator__.validate_python(
(APIServer pid=158664)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=158664) pydantic_core._pydantic_core.ValidationError: 1 validation error for Tool
(APIServer pid=158664) function.response
(APIServer pid=158664)   Extra inputs are not permitted [type=extra_forbidden, input_value={'type': 'dict', 'propert...': {'type': 'string'}}}}, input_type=dict]
(APIServer pid=158664)     For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
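The validation error above occurs because mistral_common's Tool model forbids extra fields, so unknown keys like function.response must be stripped before validation. A hedged sketch of that stripping, using an illustrative allow-list rather than mistral_common's real one:

```python
# Sketch of stripping unknown keys from a tool's function definition
# before validation. ALLOWED_KEYS is illustrative, not the actual
# allow-list used by mistral_common or vLLM.
ALLOWED_KEYS = {"name", "description", "parameters"}


def strip_unknown_fields(function: dict) -> dict:
    """Return a copy of the function dict with disallowed keys dropped."""
    return {k: v for k, v in function.items() if k in ALLOWED_KEYS}


tool_fn = {
    "name": "get_weather",
    "parameters": {"type": "object"},
    "response": {"type": "dict"},  # extra field, as BFCL-style tools send
}
cleaned = strip_unknown_fields(tool_fn)
assert "response" not in cleaned
```

Building a new dict (rather than popping in place) also sidesteps the mutation-during-iteration problem discussed further down in this thread.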

@mergify mergify bot added the mistral Related to Mistral models label Apr 10, 2026
@juliendenize juliendenize mentioned this pull request Apr 10, 2026
@juliendenize
Contributor Author

@bbrowning I think I addressed the tool issue; would it be possible for you to rerun the tests and confirm? I abstracted the tool adaptation for mistral-common a bit to limit code duplication.

@bbrowning
Collaborator

There's a problem where the newly added code to clean Mistral tool calls modifies the dict while iterating over it, resulting in stack traces like this:

(APIServer pid=1417371)   File "/home/bbrowning/src/vllm/vllm/entrypoints/openai/chat_completion/serving.py", line 211, in render_chat_request
(APIServer pid=1417371)     return await self.openai_serving_render.render_chat(request)
(APIServer pid=1417371)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1417371)   File "/home/bbrowning/src/vllm/vllm/entrypoints/serve/render/serving.py", line 246, in render_chat
(APIServer pid=1417371)     conversation, engine_inputs = await self.preprocess_chat(
(APIServer pid=1417371)                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1417371)   File "/home/bbrowning/src/vllm/vllm/entrypoints/serve/render/serving.py", line 538, in preprocess_chat
(APIServer pid=1417371)     (conversation,), (engine_input,) = await renderer.render_chat_async(
(APIServer pid=1417371)                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1417371)   File "/home/bbrowning/src/vllm/vllm/renderers/base.py", line 986, in render_chat_async
(APIServer pid=1417371)     for conv, prompt in await asyncio.gather(*rendered):
(APIServer pid=1417371)                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1417371)   File "/home/bbrowning/src/vllm/vllm/renderers/mistral.py", line 104, in render_messages_async
(APIServer pid=1417371)     prompt_raw = await self._apply_chat_template_async(
(APIServer pid=1417371)                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1417371)   File "/usr/lib64/python3.12/concurrent/futures/thread.py", line 59, in run
(APIServer pid=1417371)     result = self.fn(*self.args, **self.kwargs)
(APIServer pid=1417371)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1417371)   File "/home/bbrowning/src/vllm/vllm/renderers/mistral.py", line 47, in safe_apply_chat_template
(APIServer pid=1417371)     raise ValueError(str(e)) from e
(APIServer pid=1417371) ValueError: dictionary changed size during iteration
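The failure above is the standard CPython restriction that a dict may not change size while being iterated. A minimal reproduction of both the bug and the fix (with illustrative keys, not the real tool schema):

```python
# Reproduce "dictionary changed size during iteration" and show the fix:
# snapshot the keys with list() before mutating the dict.
allowed = {"name"}

d = {"name": "f", "response": {}, "strict": True}
err = None
try:
    for key in d:  # BUG: popping while iterating the live dict view
        if key not in allowed:
            d.pop(key)
except RuntimeError as e:
    err = e  # "dictionary changed size during iteration"
assert err is not None

d = {"name": "f", "response": {}, "strict": True}
for key in list(d.keys()):  # FIX: iterate over a snapshot of the keys
    if key not in allowed:
        d.pop(key)
assert d == {"name": "f"}
```

Note that Python surfaces this as a RuntimeError; the ValueError in the trace above is the renderer re-wrapping it.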

@bbrowning
Collaborator

bbrowning commented Apr 13, 2026

I worked around this locally for now with this diff, although it highlights we don't have anything testing this logic to confirm it actually works.

diff --git a/vllm/tokenizers/mistral.py b/vllm/tokenizers/mistral.py
index ba0152e62..3a79fbb1a 100644
--- a/vllm/tokenizers/mistral.py
+++ b/vllm/tokenizers/mistral.py
@@ -57,7 +57,8 @@ logger = init_logger(__name__)
 def _pop_unallowed_keys_and_warn(
     dictionary: dict[str, Any], allowed_keys: set[str], err_dict_name: str
 ):
-    for key in dictionary:
+    dict_keys = list(dictionary.keys())
+    for key in dict_keys:
         if key not in allowed_keys:
             dictionary.pop(key)
             logger.warning_once(

With that diff, it seems to properly strip the unknown fields and not error when I send in chat completion requests from something like BFCL multi_turn. I haven't done any kind of before/after comparison, and I'm not entirely sure how to verify, as a user, that the new tool/reasoning parsing is working versus the old path. But I do see non-streaming tool requests sent to mistralai/Mistral-Small-4-119B-2603 with reasoning_effort set working.

@juliendenize
Contributor Author

@bbrowning Thanks, I applied the patch.

Regarding testing requests, I have testing scripts on my GitHub here:
https://github.com/juliendenize/vllm-test-tool-and-reasoning-parsing
with associated results (post-v15 are up to date).

Basically, I tested various requests on main and on this branch to see which tests fail when we want to enforce a tool call with a correct name, with or without reasoning, with JSON structured output, etc.

The only "failing" test on this branch is that when the user specifically instructs the model to return JSON, the grammar (without tool_choice="none") allows a non-JSON output, which the model usually chooses.

On main, however, a lot of tests fail. One example: when reasoning is not performed but a reasoning_parser is defined, vLLM does not correctly parse tool calls.

@mergify
Contributor

mergify bot commented Apr 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juliendenize.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 14, 2026
@androiddrew

androiddrew commented Apr 14, 2026

@juliendenize It could be something wrong with my setup, but I used https://github.com/eugr/spark-vllm-docker to build for the GB10 from main with your PR applied. The first query sent to https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-NVFP4 resulted in a vLLM error dump.

Full error posted on https://forums.developer.nvidia.com/t/running-mistral-small-4-119b-nvfp4-on-nvidia-dgx-spark-gb10/363863/53?u=drew22

version 0.19.1rc1.dev243+g995e9a209.d20260413
model mistralai/Mistral-Small-4-119B-2603-NVFP4

@bbrowning
Collaborator

@androiddrew I've seen that same error. For now, this model does not work on any hardware that requires Triton attention, which includes the DGX Spark. I brought it up to the team in vLLM Slack previously (as I always test on Spark as well), but I'm not sure if there's an upstream issue here tracking it. That's the exact error I also get when running this model on any hardware that requires triton and cannot use another attention backend.

It's worth filing one, as the practical implication is that until the Triton attention backend is fixed to work with this model, only architectures that can use Flash Attention will work: some versions of Ampere, Hopper, and Blackwell GPUs, I believe?

But, that's also a separate issue than the tool and reasoning parsing being fixed here.

@androiddrew

androiddrew commented Apr 14, 2026

> @androiddrew I've seen that same error. For now, this model does not work on any hardware that requires Triton attention, which includes the DGX Spark. [quoting @bbrowning's comment above]

I was under the impression that https://github.com/eugr/spark-vllm-docker patches for MLA_ATTENTION.

(EngineCore pid=234) INFO 04-14 15:07:55 [cuda.py:317] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].

I had this model working on my Spark back on March 19th. It generated text; the only problem I was having was that opencode saw TOOL_CALLS leak into the context. That was with version 0.17.2rc1.dev7+g9c7cab5eb.d20260317, though, and my own patch to tokenizer/mistral.py to fix the mistral-common version issue.

This was my first attempt at using current main with this pr applied via build-and-copy.sh --apply-vllm-pr 39217 from the https://github.com/eugr/spark-vllm-docker project.

@androiddrew

androiddrew commented Apr 15, 2026

@bbrowning Thanks for the heads up

diff --git a/vllm/v1/attention/ops/triton_decode_attention.py b/vllm/v1/attention/ops/triton_decode_attention.py
index 8118db0da..347dfcc07 100644
--- a/vllm/v1/attention/ops/triton_decode_attention.py
+++ b/vllm/v1/attention/ops/triton_decode_attention.py
@@ -467,7 +467,14 @@ def _decode_grouped_att_m_fwd(
     if is_hip_ and Lk >= 576:
         BLOCK = 16
 
-    if Lk == 576:
+    if is_mla and Lk > Lv:
+        # MLA: KV cache stores [c_kv || k_pe] concatenated.
+        # Split into nope (BLOCK_DMODEL = kv_lora_rank) and rope (BLOCK_DPE)
+        # so the kernel loads them separately and v = trans(k_nope) matches
+        # the accumulator dimension (BLOCK_DV).
+        BLOCK_DMODEL = triton.next_power_of_2(Lv)
+        BLOCK_DPE = triton.next_power_of_2(Lk - Lv)
+    elif Lk == 576:
         BLOCK_DMODEL = 512
         BLOCK_DPE = 64
     elif Lk == 288:

This, plus a small change to @juliendenize's vllm/tokenizers/mistral.py, appears to have resolved my issue. I can now successfully load Mistral 4 and make tool calls in Opencode on my DGX Spark.

diff --git a/vllm/tokenizers/mistral.py b/vllm/tokenizers/mistral.py
index 3a79fbb1a..7667903e4 100644
--- a/vllm/tokenizers/mistral.py
+++ b/vllm/tokenizers/mistral.py
@@ -447,7 +447,9 @@ class MistralTokenizer(TokenizerLike):
         # NOTE: This is for backward compatibility.
         # Transformers should be passed arguments it knows.
         if self.version >= 15:
-            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")
+            reasoning_effort = kwargs.get("reasoning_effort")
+            if reasoning_effort is not None:
+                version_kwargs["reasoning_effort"] = reasoning_effort
 
         messages, tools = _prepare_apply_chat_template_tools_and_messages(
             messages, tools, continue_final_message, add_generation_prompt
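For illustration, the behavioral difference the second diff targets: forwarding reasoning_effort=None is not the same as omitting the kwarg, since downstream code may treat an explicit None as a real value. build_version_kwargs below is a hypothetical stand-in for the tokenizer code, not the actual vLLM function:

```python
# Hypothetical sketch of the None-filtering pattern from the diff above:
# only forward a kwarg when it was actually set, so downstream code sees
# "absent" rather than an explicit None.
def build_version_kwargs(**kwargs):
    version_kwargs = {}
    reasoning_effort = kwargs.get("reasoning_effort")
    if reasoning_effort is not None:  # omit the key rather than forward None
        version_kwargs["reasoning_effort"] = reasoning_effort
    return version_kwargs


assert build_version_kwargs() == {}
assert build_version_kwargs(reasoning_effort="high") == {
    "reasoning_effort": "high"
}
```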

@juliendenize juliendenize force-pushed the fix_parsing_on_top_of_grammar branch from 5d8a9b0 to 85e0ee5 on April 15, 2026 07:23
@mergify mergify bot removed the needs-rebase label Apr 15, 2026
@mergify
Contributor

mergify bot commented Apr 15, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juliendenize.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 15, 2026
Signed-off-by: juliendenize <julien.denize@mistral.ai>
@juliendenize juliendenize force-pushed the fix_parsing_on_top_of_grammar branch from 3d34a0e to 69466bf on April 15, 2026 09:12
@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 15, 2026
@mergify mergify bot removed the needs-rebase label Apr 15, 2026
Member

@DarkLight1337 DarkLight1337 left a comment

The tests pass, though in terms of code design we might want to eventually delegate the whole of _parse_tool_calls_from_content into the tool parser so we don't need to complicate the code with additional cases.

I'll merge this once the other reviewers also approve.

@juliendenize
Contributor Author

@DarkLight1337 Indeed, there are code design issues there. I tried to minimize additional cases, but it would definitely be best to have a unified Mistral parser and to make sure known methods from the unified parser class are called. However, as things are AFAIK still moving inside vLLM regarding this parser, I thought it might be best to wait for it to be stable / fully added. I would gladly help clean this up in the future.

Collaborator

@sfeng33 sfeng33 left a comment

Thank you for the work!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 16, 2026 04:04
@vllm-bot vllm-bot merged commit c0722f2 into vllm-project:main Apr 16, 2026
52 of 53 checks passed
askliar pushed a commit to askliar/vllm that referenced this pull request Apr 16, 2026
Signed-off-by: juliendenize <julien.denize@mistral.ai>

Labels

frontend · mistral (Related to Mistral models) · ready (ONLY add when PR is ready to merge/full CI is needed) · tool-calling

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants