
Add Mistral Guidance#37081

Open
juliendenize wants to merge 6 commits into vllm-project:main from juliendenize:fix_mistral_parsing

Conversation

@juliendenize
Contributor

@juliendenize juliendenize commented Mar 14, 2026

Purpose

Fix Mistral tool calling and reasoning parsing by introducing Lark grammar-based structured output for Mistral models. This ensures correct tool call parsing for both streaming and non-streaming modes across all tool_choice values (auto, required, named, none) and all Mistral tokenizer versions.

Problems on main:

  • Streaming tool calls are broken for post-v15 Mistral tokenizers: [TOOL_CALLS] tokens leak into content instead of being parsed as tool calls.
  • tool_choice="required" and named tool choice produce unparseable arguments (e.g., [TOOL_CALLS]get_current_weather{...} embedded in the arguments string).
  • tool_choice="none" leaks raw [TOOL_CALLS] special tokens into user-visible content.
  • Reasoning content ([THINK]...[/THINK]) is not separated from content for pre-v15 models and not properly controlled via reasoning_effort for v15+ models.

This PR fixes these issues by:

  • Introducing MistralGrammarFactory, which generates Lark grammars from Jinja templates and selects the appropriate grammar variant. This involves adding Lark grammar support for the Mistral tokenizer to llguidance.
  • Updating the OpenAI chat serving layer to use the Mistral grammar path for both streaming and non-streaming responses.
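To make the grammar-generation idea concrete, here is a minimal, hypothetical sketch of how a factory might build a Lark grammar string per `tool_choice` mode. The rule names, escaping, and structure are illustrative only — the actual MistralGrammarFactory renders Jinja templates and is considerably richer:

```python
def build_tool_call_grammar(tool_names: list[str], tool_choice: str) -> str:
    """Illustrative only: build a Lark grammar string for tool calls."""
    if tool_choice == "none":
        # Plain content only: the [TOOL_CALLS] marker is never admitted.
        return "start: CONTENT\nCONTENT: /(.|\\n)+/"
    names = " | ".join(f'"{name}"' for name in tool_names)
    lines = [
        # "required" must start with a tool call; "auto" may emit content first.
        'start: "[TOOL_CALLS]" call' if tool_choice == "required"
        else 'start: CONTENT? "[TOOL_CALLS]" call',
        'call: "[{\\"name\\": \\"" NAME "\\", \\"arguments\\": " ARGS "}]"',
        f"NAME: {names}",
        "ARGS: /\\{(.|\\n)*\\}/",
        "CONTENT: /(.|\\n)+/",
    ]
    return "\n".join(lines)

grammar = build_tool_call_grammar(["get_current_weather"], "required")
```

The constrained decoder then only admits token sequences matching the grammar, which is what makes tool-call output parseable across tokenizer versions.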

Test Plan

Unit tests (in this repo)

# Mistral tool parser tests (grammar factory, lark converter, adjust_request, streaming)
pytest tests/tool_parsers/test_mistral_tool_parser.py -v -s

# Mistral reasoning parser tests (is_reasoning_end with prefilled prompts)
pytest tests/reasoning/test_mistral_reasoning_parser.py -v -s

# Guidance backend lark grammar test
pytest tests/v1/structured_output/test_backend_guidance.py -v -s -k test_backend_guidance_lark_grammar

End-to-end tests (external repo)

Scripts and results are available at: https://github.com/juliendenize/vllm-test-tool-and-reasoning-parsing

Post-v15 tokenizer: 48 tests (8 scenarios × 2 stream modes × 3 reasoning_effort values)

python test_vllm_tools_post_v15.py --base-url http://localhost:8000/v1

Pre-v15 tokenizer: 16 tests (8 scenarios × 2 stream modes)

python test_vllm_tools_pre_v15.py --base-url http://localhost:8000/v1

Test Result

End-to-end results comparison

Model / Tokenizer Branch Passed Failed Total
Post-v15 This PR 48 0 48
Post-v15 main 26 22 48
Pre-v15 This PR 16 0 16
Pre-v15 main 12 4 16

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust, grammar-based approach using Lark to fix several long-standing issues with Mistral tool calling and reasoning parsing. The changes are comprehensive, covering both streaming and non-streaming modes, various tool choice options, and different tokenizer versions. The introduction of MistralGrammarFactory and the associated Jinja templates for grammar generation is a solid design choice for managing the complexity. The test suite has been significantly expanded, which is excellent for ensuring the correctness of these critical changes. I've found one critical issue in the streaming logic that could lead to corrupted output, for which I've provided a specific comment and suggestion.

Comment on lines +585 to +586
if self.bot_token not in delta_text:
    return DeltaMessage(content=delta_text)
Contributor


critical

This condition can lead to corrupted output during streaming. If the bot_token (e.g., "[TOOL_CALLS]") is split across multiple streaming deltas, delta_text may not contain the full token. In this scenario, the current logic incorrectly returns a partial token string as content.

The correct behavior should be to wait for more tokens by returning None, especially since the calling function has already confirmed that the bot_token exists in the cumulative current_text. This ensures that partial tool call markers are not leaked as content.

                return None
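The buffering behavior the bot suggests can be sketched as follows. This is a simplified, standalone illustration (not the parser's actual code; the real method operates on token IDs and richer state), and note that the author disputes below whether special tokens can actually split:

```python
from typing import Optional

BOT_TOKEN = "[TOOL_CALLS]"  # Mistral's tool-call marker

def handle_delta(current_text: str, delta_text: str) -> Optional[str]:
    # No marker anywhere yet: the delta is plain content, safe to stream.
    if BOT_TOKEN not in current_text:
        return delta_text
    # Marker present in the cumulative text but not in this delta: if the
    # token could split across deltas, buffer (return None) rather than
    # leak a fragment like "[TOOL" into user-visible content.
    if BOT_TOKEN not in delta_text:
        return None
    # Full marker in this delta: stream only the content before it.
    return delta_text.split(BOT_TOKEN, 1)[0] or None
```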

Contributor Author

Special tokens are not split across deltas, so I don't think this is true.

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify bot commented Mar 14, 2026

Hi @juliendenize, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: juliendenize <julien.denize@mistral.ai>

Signed-off-by: juliendenize <julien.denize@mistral.ai>

Signed-off-by: juliendenize <julien.denize@mistral.ai>
prompt_tokens = mistral_tokenizer.tokenizer.encode(
    "Just a regular prompt.", bos=False, eos=False
)
assert parser.is_reasoning_end(prompt_tokens) is False
Collaborator

Suggested change
assert parser.is_reasoning_end(prompt_tokens) is False
assert not parser.is_reasoning_end(prompt_tokens)

+ mistral_tokenizer.tokenizer.encode(prompt_text_after, bos=False, eos=False)
)

assert parser.is_reasoning_end(prompt_tokens) is True
Collaborator

Suggested change
assert parser.is_reasoning_end(prompt_tokens) is True
assert parser.is_reasoning_end(prompt_tokens)

self.tool_parser, MistralToolParser
)
if _is_mistral_tool_parser and self.reasoning_parser_cls is not None:
MistralToolParser._has_reasoning_parser = True
Collaborator

Suggested change
MistralToolParser._has_reasoning_parser = True

Why are we setting a bool on MistralToolParser here? Could we just do this directly in the get_tool_parser function?

)
harmony_tools_streamed[i] |= tools_streamed_flag
elif use_mistral_grammar and reasoning_parser:
assert tool_parser is not None
Collaborator

Can we move all this new code into a chat_completion/mistral_grammar.py file?

and tokenizer.version >= 15
and request.reasoning_effort in (None, "none")
)
if not skip_reasoning:
Collaborator
The reason will be displayed to describe this comment to others. Learn more.

Think this is a lot of extra code here.

Couldn't we just set:

self.reasoning_parser = None

if skip_reasoning is False much earlier above so that the code should stay exactly the same here?

request, tokenizer, self.tool_parser
)
tool_call_items: list[ToolCall]
if use_mistral_grammar:
Collaborator
The reason will be displayed to describe this comment to others. Learn more.

can we maybe also import this code from another function?

Contributor

Also, this one is about the Mistral API, not related to the grammar, no?

Comment on lines +1122 to 1129
use_mistral_grammar = (
isinstance(request, ChatCompletionRequest)
and tokenizer is not None
and is_mistral_lark_grammar_active(
request,
tokenizer,
tool_parser_cls, # type: ignore[arg-type]
)
Collaborator

Suggested change
use_mistral_grammar = (
isinstance(request, ChatCompletionRequest)
and tokenizer is not None
and is_mistral_lark_grammar_active(
request,
tokenizer,
tool_parser_cls, # type: ignore[arg-type]
)
use_mistral_grammar = isinstance(tokenizer, MistralTokenizer) and tokenizer.is_grammar_active(request)

can't we inject more logic through the MistralTokenizer here?

{% endif -%}
think: <THINK> content </THINK>
content: (/(.|\n)+/)+
SAFE_WS: /[ \t\r\n]+/"""
Collaborator

How likely are these grammars changing again? Should we think about a mechanism to inject grammars via file loading?

Maybe ok to leave as is for a v1 but not sure that this will be future proof?
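One possible shape for the file-based injection suggested here, as a hedged sketch — the env-var name and fallback behavior are invented for illustration and are not part of this PR:

```python
import os
from pathlib import Path

# Hypothetical override mechanism: let an operator point at a Lark grammar
# file on disk; fall back to the built-in template output otherwise.
# MISTRAL_LARK_GRAMMAR_PATH is an invented env var, not a real vLLM flag.
def load_grammar(default_grammar: str) -> str:
    override = os.environ.get("MISTRAL_LARK_GRAMMAR_PATH")
    if override and Path(override).is_file():
        return Path(override).read_text()
    return default_grammar
```

This would decouple grammar updates from vLLM releases if the grammars do keep changing.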

return (
tool_parser_cls is not None
and issubclass(tool_parser_cls, MistralToolParser)
and is_mistral_tokenizer(tokenizer)
Contributor

is it possible to have issubclass(tool_parser_cls, MistralToolParser) but not is_mistral_tokenizer(tokenizer)?

Comment on lines +155 to +156
if not args:
args = {"type": "object", "properties": {}, "additionalProperties": False}
Collaborator

Suggested change
if not args:
args = {"type": "object", "properties": {}, "additionalProperties": False}
args = args or {"type": "object", "properties": {}, "additionalProperties": False}

args = tool.function.parameters if _is_strict_tool(tool) else {"type": "object"}
# Handle empty parameters case
if not args:
args = {"type": "object", "properties": {}, "additionalProperties": False}
Contributor

@joa-stdn joa-stdn Mar 15, 2026

How can this happen? Why do we have {"type": "object"} and not {"type": "object", "properties": {}, "additionalProperties": False}?
Maybe we can clean up and have only two cases? I think, for us at least, parameters has to be a valid JSON schema.

self._tokenizer = tokenizer
self._tokenizer_version = tokenizer.version

def get_lark_from_jinja(
Collaborator

should we maybe cache this function?

Template(jinja_template).render(...)

looks a bit expensive and there don't seem to be that many possible function arg options
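The caching suggested here could look roughly like the following sketch. It assumes the grammar depends only on inputs that can be reduced to hashable keys (e.g. a tuple of tool signatures); the stand-in body replaces the actual `Template(jinja_template).render(...)` call:

```python
from functools import lru_cache

# Sketch only: cache rendered grammars keyed by hashable inputs. The real
# get_lark_from_jinja takes richer objects, so they would first need
# reducing to hashable keys such as (tool_name, parameters_json) pairs.
@lru_cache(maxsize=128)
def get_lark_cached(template_name: str, tool_signatures: tuple[str, ...]) -> str:
    # Stand-in for the expensive Template(jinja_template).render(...) call.
    return f"start: body  // {template_name}, {len(tool_signatures)} tools"

a = get_lark_cached("tools.jinja", ("get_current_weather",))
b = get_lark_cached("tools.jinja", ("get_current_weather",))
```

Since there are only a handful of grammar variants per served model, repeated requests with the same tool set would hit the cache instead of re-rendering.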

)
self.grammar_factory = (
MistralGrammarFactory(tokenizer) # type: ignore[arg-type]
if is_mistral_tokenizer(self.model_tokenizer)
Collaborator

How can this not be a mistral tokenizer?! The MistralToolParser should only be compatible with MistralTokenizer no?

parallel_tool_calls: bool,
) -> str: ...

def get_args_json(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:
Contributor

Suggested change
def get_args_json(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:
def get_tool_parameters_schema(self, tool: ChatCompletionToolsParam) -> dict[str, Any]:

or something like this?

regex: str | None = None
choice: list[str] | None = None
grammar: str | None = None
lark: str | None = None
Collaborator

is lark compatible with tokenizers other than the mistral ones?

Collaborator

@patrickvonplaten patrickvonplaten left a comment

Super cool feature addition!!!

Two general points:

  • 1.) I think we should try, to the best of our ability, not to modify "general" files such as chat_completion/serving.py too much, and instead inject all the logic via the MistralTokenizer and/or MistralToolParser.
  • 2.) Think it could be better to add a couple of assert ..., "not supported" statements for use cases that are too exotic and probably low-usage (such as guidance + spm tokenizer). By doing this I think we should be able to clean up some of the more complex code (like overwriting __call__ of the tokenizer, etc.).

@sfeng33
Contributor

sfeng33 commented Mar 15, 2026

Can you add the cmd on how you're running the vllm server to the PR description?

It seems that this PR only works on the llguidance backend: it explicitly throws on the xgrammar backend but is unhandled for outlines / lm-format-enforcer. Is this expected?

I'm curious to know whether you've explored structural tags as an alternative to Lark; structural tags have more backend coverage and are simpler, if they can satisfy what the Mistral format needs.


if not request.tools:
# Sanitize tool_choice.
request.tool_choice = "none"
Contributor

Are we sure about this?

Contributor

@sfeng33 sfeng33 left a comment

Nice work on the correctness fixes! One note on the performance side:

The Lark grammar is applied for all tool_choice modes, including "none" and "auto". This means every Mistral tool-calling request pays guided decoding overhead (grammar compilation per request + bitmask computation per token), even when the grammar is essentially unconstrained (e.g. tool_choice="none" produces body: content which matches nearly everything).

For "none" specifically, the only real effect is preventing the <TOOL_CALLS> special token from being emitted — which could be achieved by just masking that single token ID without involving the grammar engine.
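The single-token masking alternative described here could be sketched as a logits processor. This is an illustration only — vLLM's actual logits-processor interfaces differ, and the token ID below is an invented placeholder, not the real vocabulary entry:

```python
import math

TOOL_CALLS_TOKEN_ID = 42  # placeholder; the real id comes from the tokenizer

def mask_tool_calls(logits: list[float]) -> list[float]:
    # Ban only the [TOOL_CALLS] token; every other logit is untouched, so
    # there is no per-request grammar compilation and no per-token bitmask.
    masked = list(logits)
    masked[TOOL_CALLS_TOKEN_ID] = -math.inf
    return masked
```

Applied per decoding step, this gives the same user-visible effect for tool_choice="none" at near-zero cost.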

seen_special_tokens: set[str] = set()
for i in range(self._tokenizer.n_words):
# Convert square brackets to angle brackets for special tokens,
# since llg only recognizes the latter.
Contributor

I think llg now recognizes token ids as well, but maybe keeping it this way is more modular here?

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@juliendenize
Contributor Author

I pushed a commit to enforce tool_choice="auto" for now, and we inject the grammar. This will be used for a special Docker image, as this PR entangles too many changes and requires deeper design thought.
It would be better to break it into multiple smaller PRs that target specific parts (grammar, parser, ...).

Signed-off-by: juliendenize <julien.denize@mistral.ai>
@mergify

mergify bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @juliendenize.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 17, 2026
)


BASE_LARK_GRAMMAR = r"""start: body
Collaborator

grammar definition

@mergify mergify bot added the tool-calling label Mar 20, 2026
TrevorS added a commit to TrevorS/vllm that referenced this pull request Mar 25, 2026
When Mistral Lark grammar constrains tool call generation (PR vllm-project#37081),
the streaming tool parser's streamed_args_for_tool list is never
populated — the grammar path handles token generation differently from
the incremental [TOOL_CALLS] parsing path.

At finish_reason=tool_calls, the code accesses
streamed_args_for_tool[index] unconditionally, causing:

  IndexError: list index out of range

The grammar path already streams tool call arguments correctly via delta
chunks — the remaining-args diff at finish time is unnecessary.

Fix: bounds-check streamed_args_for_tool and skip the remaining-args
computation when the list is empty (grammar/structured output path).

Tested: short and long streaming tool calls produce valid JSON.
Affects: Any model using Mistral Lark grammar tool parsing with streaming.
Discovered: DGX Spark with Mistral Small 4 + Open WebUI, March 2026.
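The fix described in this commit message amounts to a guard like the following. This is a simplified, self-contained sketch; the real change lives in the streaming tool parser and its helper names may differ:

```python
from typing import Optional

def remaining_args(streamed_args_for_tool: list[str], index: int,
                   full_args: str) -> Optional[str]:
    # Grammar/structured-output path never populates the list: skip the
    # remaining-args diff instead of raising IndexError at finish time.
    if index >= len(streamed_args_for_tool):
        return None
    already_streamed = streamed_args_for_tool[index]
    # Return only the suffix not yet streamed, or None if nothing remains.
    return full_args[len(already_streamed):] or None
```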

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants