
Add Mistral Guidance#37081

Open
juliendenize wants to merge 6 commits into vllm-project:main from juliendenize:fix_mistral_parsing

Conversation

@juliendenize
Contributor

@juliendenize juliendenize commented Mar 14, 2026

Purpose

Fix Mistral tool calling and reasoning parsing by introducing Lark grammar-based structured output for Mistral models. This ensures correct tool call parsing for both streaming and non-streaming modes across all tool_choice values (auto, required, named, none) and all Mistral tokenizer versions.

Problems on main:

  • Streaming tool calls are broken for post-v15 Mistral tokenizers: [TOOL_CALLS] tokens leak into content instead of being parsed as tool calls.
  • tool_choice="required" and named tool choice produce unparseable arguments (e.g., [TOOL_CALLS]get_current_weather{...} embedded in the arguments string).
  • tool_choice="none" leaks raw [TOOL_CALLS] special tokens into user-visible content.
  • Reasoning content ([THINK]...[/THINK]) is not separated from content for pre-v15 models and not properly controlled via reasoning_effort for v15+ models.

This PR fixes these issues by:

  • Introducing MistralGrammarFactory, which generates Lark grammars from Jinja templates and selects the appropriate grammar variant. This involves adding Lark grammar support for the Mistral tokenizer to llguidance.
  • Updating the OpenAI chat serving layer to use the Mistral grammar path for both streaming and non-streaming responses.
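To make the grammar-generation idea concrete, here is a minimal, hypothetical sketch of how a factory might build a Lark grammar string per `tool_choice` mode. The rule names, escaping, and structure are illustrative only — the actual MistralGrammarFactory renders Jinja templates and is considerably richer:

```python
def build_tool_call_grammar(tool_names: list[str], tool_choice: str) -> str:
    """Illustrative only: build a Lark grammar string for tool calls."""
    if tool_choice == "none":
        # Plain content only: the [TOOL_CALLS] marker is never admitted.
        return "start: CONTENT\nCONTENT: /(.|\\n)+/"
    names = " | ".join(f'"{name}"' for name in tool_names)
    lines = [
        # "required" must start with a tool call; "auto" may emit content first.
        'start: "[TOOL_CALLS]" call' if tool_choice == "required"
        else 'start: CONTENT? "[TOOL_CALLS]" call',
        'call: "[{\\"name\\": \\"" NAME "\\", \\"arguments\\": " ARGS "}]"',
        f"NAME: {names}",
        "ARGS: /\\{(.|\\n)*\\}/",
        "CONTENT: /(.|\\n)+/",
    ]
    return "\n".join(lines)

grammar = build_tool_call_grammar(["get_current_weather"], "required")
```

The constrained decoder then only admits token sequences matching the grammar, which is what makes tool-call output parseable across tokenizer versions.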

Test Plan

Unit tests (in this repo)

# Mistral tool parser tests (grammar factory, lark converter, adjust_request, streaming)
pytest tests/tool_parsers/test_mistral_tool_parser.py -v -s

# Mistral reasoning parser tests (is_reasoning_end with prefilled prompts)
pytest tests/reasoning/test_mistral_reasoning_parser.py -v -s

# Guidance backend lark grammar test
pytest tests/v1/structured_output/test_backend_guidance.py -v -s -k test_backend_guidance_lark_grammar

End-to-end tests (external repo)

Scripts and results are available at: https://github.com/juliendenize/vllm-test-tool-and-reasoning-parsing

Post-v15 tokenizer: 48 tests (8 scenarios × 2 stream modes × 3 reasoning_effort values)

python test_vllm_tools_post_v15.py --base-url http://localhost:8000/v1

Pre-v15 tokenizer: 16 tests (8 scenarios × 2 stream modes)

python test_vllm_tools_pre_v15.py --base-url http://localhost:8000/v1

Test Result

End-to-end results comparison

Model / Tokenizer Branch Passed Failed Total
Post-v15 This PR 48 0 48
Post-v15 main 26 22 48
Pre-v15 This PR 16 0 16
Pre-v15 main 12 4 16

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust, grammar-based approach using Lark to fix several long-standing issues with Mistral tool calling and reasoning parsing. The changes are comprehensive, covering both streaming and non-streaming modes, various tool choice options, and different tokenizer versions. The introduction of MistralGrammarFactory and the associated Jinja templates for grammar generation is a solid design choice for managing the complexity. The test suite has been significantly expanded, which is excellent for ensuring the correctness of these critical changes. I've found one critical issue in the streaming logic that could lead to corrupted output, for which I've provided a specific comment and suggestion.

Comment on lines +585 to +586
if self.bot_token not in delta_text:
    return DeltaMessage(content=delta_text)
Contributor


critical

This condition can lead to corrupted output during streaming. If the bot_token (e.g., "[TOOL_CALLS]") is split across multiple streaming deltas, delta_text may not contain the full token. In this scenario, the current logic incorrectly returns a partial token string as content.

The correct behavior should be to wait for more tokens by returning None, especially since the calling function has already confirmed that the bot_token exists in the cumulative current_text. This ensures that partial tool call markers are not leaked as content.

                return None
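The buffering behavior the bot suggests can be sketched as follows. This is a simplified, standalone illustration (not the parser's actual code; the real method operates on token IDs and richer state), and note that the author disputes below whether special tokens can actually split:

```python
from typing import Optional

BOT_TOKEN = "[TOOL_CALLS]"  # Mistral's tool-call marker

def handle_delta(current_text: str, delta_text: str) -> Optional[str]:
    # No marker anywhere yet: the delta is plain content, safe to stream.
    if BOT_TOKEN not in current_text:
        return delta_text
    # Marker present in the cumulative text but not in this delta: if the
    # token could split across deltas, buffer (return None) rather than
    # leak a fragment like "[TOOL" into user-visible content.
    if BOT_TOKEN not in delta_text:
        return None
    # Full marker in this delta: stream only the content before it.
    return delta_text.split(BOT_TOKEN, 1)[0] or None
```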

Contributor Author

Special tokens are not split across deltas, so I don't think this is true.

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify bot commented Mar 14, 2026

Hi @juliendenize, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: juliendenize <julien.denize@mistral.ai>

Signed-off-by: juliendenize <julien.denize@mistral.ai>

Signed-off-by: juliendenize <julien.denize@mistral.ai>
prompt_tokens = mistral_tokenizer.tokenizer.encode(
    "Just a regular prompt.", bos=False, eos=False
)
assert parser.is_reasoning_end(prompt_tokens) is False
Collaborator

Suggested change
assert parser.is_reasoning_end(prompt_tokens) is False
assert not parser.is_reasoning_end(prompt_tokens)

+ mistral_tokenizer.tokenizer.encode(prompt_text_after, bos=False, eos=False)
)

assert parser.is_reasoning_end(prompt_tokens) is True
Collaborator

Suggested change
assert parser.is_reasoning_end(prompt_tokens) is True
assert parser.is_reasoning_end(prompt_tokens)

self.tool_parser, MistralToolParser
)
if _is_mistral_tool_parser and self.reasoning_parser_cls is not None:
MistralToolParser._has_reasoning_parser = True
Collaborator

Suggested change
MistralToolParser._has_reasoning_parser = True

Why are we setting a bool on MistralToolParser here? Could we just do this directly in the get_tool_parser function?

)
harmony_tools_streamed[i] |= tools_streamed_flag
elif use_mistral_grammar and reasoning_parser:
assert tool_parser is not None
Collaborator

Can we move all this new code into a chat_completion/mistral_grammar.py file?

and tokenizer.version >= 15
and request.reasoning_effort in (None, "none")
)
if not skip_reasoning:
Collaborator
The reason will be displayed to describe this comment to others. Learn more.

Think this is a lot of extra code here.

Couldn't we just set:

self.reasoning_parser = None

if skip_reasoning is False much earlier above so that the code should stay exactly the same here?

request, tokenizer, self.tool_parser
)
tool_call_items: list[ToolCall]
if use_mistral_grammar:
Collaborator
The reason will be displayed to describe this comment to others. Learn more.

can we maybe also import this code from another function?

Contributor

Also, this one is about the Mistral API, not related to the grammar, no?

Comment on lines +1122 to 1129
use_mistral_grammar = (
isinstance(request, ChatCompletionRequest)
and tokenizer is not None
and is_mistral_lark_grammar_active(
request,
tokenizer,
tool_parser_cls, # type: ignore[arg-type]
)
Collaborator

Suggested change
use_mistral_grammar = (
isinstance(request, ChatCompletionRequest)
and tokenizer is not None
and is_mistral_lark_grammar_active(
request,
tokenizer,
tool_parser_cls, # type: ignore[arg-type]
)
use_mistral_grammar = isinstance(tokenizer, MistralTokenizer) and tokenizer.is_grammar_active(request)

can't we inject more logic through the MistralTokenizer here?

{% endif -%}
think: <THINK> content </THINK>
content: (/(.|\n)+/)+
SAFE_WS: /[ \t\r\n]+/"""
Collaborator

How likely are these grammars changing again? Should we think about a mechanism to inject grammars via file loading?

Maybe ok to leave as is for a v1 but not sure that this will be future proof?
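One possible shape for the file-based injection suggested here, as a hedged sketch — the env-var name and fallback behavior are invented for illustration and are not part of this PR:

```python
import os
from pathlib import Path

# Hypothetical override mechanism: let an operator point at a Lark grammar
# file on disk; fall back to the built-in template output otherwise.
# MISTRAL_LARK_GRAMMAR_PATH is an invented env var, not a real vLLM flag.
def load_grammar(default_grammar: str) -> str:
    override = os.environ.get("MISTRAL_LARK_GRAMMAR_PATH")
    if override and Path(override).is_file():
        return Path(override).read_text()
    return default_grammar
```

This would decouple grammar updates from vLLM releases if the grammars do keep changing.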

return (
tool_parser_cls is not None
and issubclass(tool_parser_cls, MistralToolParser)
and is_mistral_tokenizer(tokenizer)
Contributor

is it possible to have issubclass(tool_parser_cls, MistralToolParser) but not is_mistral_tokenizer(tokenizer)?

Comment on lines +155 to +156
if not args:
args = {"type": "object", "properties": {}, "additionalProperties": False}
Collaborator

Suggested change
if not args:
args = {"type": "object", "properties": {}, "additionalProperties": False}
args = args or {"type": "object", "properties": {}, "additionalProperties": False}

args = tool.function.parameters if _is_strict_tool(tool) else {"type": "object"}
# Handle empty parameters case
if not args:
args = {"type": "object", "properties": {}, "additionalProperties": False}
Contributor

@joa-stdn joa-stdn Mar 15, 2026

How can this happen? Why do we have {"type": "object"} and not {"type": "object", "properties": {}, "additionalProperties": False}?
Maybe we can clean up and have only two cases? I think, for us at least, parameters has to be a valid JSON schema.

self._tokenizer = tokenizer
self._tokenizer_version = tokenizer.version

def get_lark_from_jinja(
Collaborator

should we maybe cache this function?

Template(jinja_template).render(...)

looks a bit expensive and there don't seem to be that many possible function arg options
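The caching suggested here could look roughly like the following sketch. It assumes the grammar depends only on inputs that can be reduced to hashable keys (e.g. a tuple of tool signatures); the stand-in body replaces the actual `Template(jinja_template).render(...)` call:

```python
from functools import lru_cache

# Sketch only: cache rendered grammars keyed by hashable inputs. The real
# get_lark_from_jinja takes richer objects, so they would first need
# reducing to hashable keys such as (tool_name, parameters_json) pairs.
@lru_cache(maxsize=128)
def get_lark_cached(template_name: str, tool_signatures: tuple[str, ...]) -> str:
    # Stand-in for the expensive Template(jinja_template).render(...) call.
    return f"start: body  // {template_name}, {len(tool_signatures)} tools"

a = get_lark_cached("tools.jinja", ("get_current_weather",))
b = get_lark_cached("tools.jinja", ("get_current_weather",))
```

Since there are only a handful of grammar variants per served model, repeated requests with the same tool set would hit the cache instead of re-rendering.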

)
self.grammar_factory = (
MistralGrammarFactory(tokenizer) # type: ignore[arg-type]
if is_mistral_tokenizer(self.model_tokenizer)
Collaborator

How can this not be a mistral tokenizer?! The MistralToolParser should only be compatible with MistralTokenizer no?

parallel_tool_calls: bool,
) -> str: ...

def get_args_json(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:
Contributor

Suggested change
def get_args_json(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:
def get_tool_parameters_schema(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:

or something like this?

regex: str | None = None
choice: list[str] | None = None
grammar: str | None = None
lark: str | None = None
Collaborator

is lark compatible with tokenizers other than the mistral ones?

Collaborator

@patrickvonplaten patrickvonplaten left a comment

Super cool feature addition!!!

Two general points:

  • 1.) I think we should try, to the best of our ability, not to modify "general" files such as chat_completion/serving.py too much, and instead inject all the logic via the MistralTokenizer and/or MistralToolParser.
  • 2.) Think it could be better to add a couple of assert ..., "not supported" statements for use cases that are too exotic and probably low-usage (such as guidance + spm tokenizer). By doing this I think we should be able to clean up some of the more complex code (like overwriting __call__ of the tokenizer, etc.).

@sfeng33
Contributor

sfeng33 commented Mar 15, 2026

Can you add the cmd on how you're running the vllm server to the PR description?

It seems that this PR only works on the llguidance backend: it explicitly throws on the xgrammar backend but is unhandled for outlines / lm-format-enforcer. Is this expected?

I'm curious to know whether you've explored structural tags as an alternative to Lark; structural tags have more backend coverage and are simpler, if they can satisfy what the Mistral format needs.


if not request.tools:
# Sanitize tool_choice.
request.tool_choice = "none"
Contributor

Are we sure about this?

Contributor

@sfeng33 sfeng33 left a comment

Nice work on the correctness fixes! One note on the performance side:

The Lark grammar is applied for all tool_choice modes, including "none" and "auto". This means every Mistral tool-calling request pays guided decoding overhead (grammar compilation per request + bitmask computation per token), even when the grammar is essentially unconstrained (e.g. tool_choice="none" produces body: content which matches nearly everything).

For "none" specifically, the only real effect is preventing the <TOOL_CALLS> special token from being emitted — which could be achieved by just masking that single token ID without involving the grammar engine.
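The single-token masking alternative described here could be sketched as a logits processor. This is an illustration only — vLLM's actual logits-processor interfaces differ, and the token ID below is an invented placeholder, not the real vocabulary entry:

```python
import math

TOOL_CALLS_TOKEN_ID = 42  # placeholder; the real id comes from the tokenizer

def mask_tool_calls(logits: list[float]) -> list[float]:
    # Ban only the [TOOL_CALLS] token; every other logit is untouched, so
    # there is no per-request grammar compilation and no per-token bitmask.
    masked = list(logits)
    masked[TOOL_CALLS_TOKEN_ID] = -math.inf
    return masked
```

Applied per decoding step, this gives the same user-visible effect for tool_choice="none" at near-zero cost.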

seen_special_tokens: set[str] = set()
for i in range(self._tokenizer.n_words):
# Convert square brackets to angle brackets for special tokens,
# since llg only recognizes the latter.
Contributor

I think llg now recognizes token ids as well, but maybe keeping it this way is more modular here?

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@juliendenize
Contributor Author

I pushed a commit to enforce tool_choice="auto" for now, and we inject the grammar. This will be used for a special Docker image, as this PR entangles too many changes and requires deeper design thought.
It would be better to break it into multiple smaller PRs that target specific parts (grammar, parser, ...).

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juliendenize.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 17, 2026
)


BASE_LARK_GRAMMAR = r"""start: body
Collaborator

grammar definition

@mergify mergify bot added the tool-calling label Mar 20, 2026
TrevorS added a commit to TrevorS/vllm that referenced this pull request Mar 25, 2026
When Mistral Lark grammar constrains tool call generation (PR vllm-project#37081),
the streaming tool parser's streamed_args_for_tool list is never
populated — the grammar path handles token generation differently from
the incremental [TOOL_CALLS] parsing path.

At finish_reason=tool_calls, the code accesses
streamed_args_for_tool[index] unconditionally, causing:

  IndexError: list index out of range

The grammar path already streams tool call arguments correctly via delta
chunks — the remaining-args diff at finish time is unnecessary.

Fix: bounds-check streamed_args_for_tool and skip the remaining-args
computation when the list is empty (grammar/structured output path).

Tested: short and long streaming tool calls produce valid JSON.
Affects: Any model using Mistral Lark grammar tool parsing with streaming.
Discovered: DGX Spark with Mistral Small 4 + Open WebUI, March 2026.
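The fix described in this commit message amounts to a guard like the following. This is a simplified, self-contained sketch; the real change lives in the streaming tool parser and its helper names may differ:

```python
from typing import Optional

def remaining_args(streamed_args_for_tool: list[str], index: int,
                   full_args: str) -> Optional[str]:
    # Grammar/structured-output path never populates the list: skip the
    # remaining-args diff instead of raising IndexError at finish time.
    if index >= len(streamed_args_for_tool):
        return None
    already_streamed = streamed_args_for_tool[index]
    # Return only the suffix not yet streamed, or None if nothing remains.
    return full_args[len(already_streamed):] or None
```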

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants