Mistral tool parser#30063
graelo wants to merge 8 commits into vllm-project:main from graelo:mistral-tool-parser
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# Regex for non-streaming parsing: name{args}
self.fn_name_regex = re.compile(
    r"([a-zA-Z0-9_-]+)(\{[\s\S]*?\}+)", re.DOTALL
)
```
Handle [ARGS] token in non-streaming parser
The non-streaming extractor never matches the documented v11+ format ([TOOL_CALLS]name[ARGS]{…}) because the regex only accepts name{…} with no [ARGS] marker. When the model outputs the real v11 layout, fn_name_regex.findall returns no matches, so extract_tool_calls returns tools_called=True with an empty tool_calls list, silently dropping every tool invocation. This breaks non-streaming tool calling for all v11+ Mistral models.
Hmm, not really on point. The [ARGS] token exists in the token stream (v11+) but gets decoded as empty string in the text output - it's a control token, not text. So when matching on strings, this regex is appropriate (everything would fall apart in tests otherwise).
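A minimal sketch of this behavior; the decoded string and tool name below are made up for illustration. Because `[ARGS]` decodes to an empty string, the decoded text is just `name{...}`, which the pattern matches directly, nested braces included:

```python
import re

# Same pattern as the parser: a function name immediately followed by a
# JSON object. The [ARGS] control token decodes to "" in text output, so
# it never appears in the string being matched.
fn_name_regex = re.compile(r"([a-zA-Z0-9_-]+)(\{[\s\S]*?\}+)", re.DOTALL)

decoded = 'get_weather{"city": "Paris", "units": {"temp": "C"}}'
matches = fn_name_regex.findall(decoded)
# [('get_weather', '{"city": "Paris", "units": {"temp": "C"}}')]
```

The lazy `[\s\S]*?` combined with the greedy `\}+` lets the match extend through nested closing braces, which is why the one-level-nested object above is captured whole.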
Code Review
This pull request is a great simplification and improvement for Mistral tool parsing, correctly focusing on v11+ models and leveraging token-based parsing for robustness. The code is much cleaner and the tests are well-structured. However, I found a significant issue in the streaming implementation where content preceding a tool call within the same data chunk is dropped. I've provided a fix for this and also a correction for the corresponding test which was masking this bug. Once these are addressed, this will be a solid contribution.
```python
current_tool_call.function.name = self.current_tool_name
self.current_tool_name = None
if self.streaming_state == StreamingState.PARSING_NAME_COMPLETED:
    self.streaming_state = StreamingState.WAITING_FOR_TOOL_KEY
if self.streaming_state in [
    StreamingState.PARSING_ARGUMENTS,
    StreamingState.PARSING_ARGUMENTS_COMPLETED,
]:
    if self.streaming_state == StreamingState.PARSING_ARGUMENTS_COMPLETED:
        self.streaming_state = StreamingState.WAITING_FOR_TOOL_KEY
    # the delta_to_be_parsed is part of arguments.
    current_tool_call_modified = True
    if current_tool_call.function.arguments is None:
        current_tool_call.function.arguments = delta_to_be_parsed
    else:
        current_tool_call.function.arguments += delta_to_be_parsed
    if streaming_state_before_parse != StreamingState.PARSING_ARGUMENTS:
        # It's the first chunk of args; lstrip it.
        current_tool_call.function.arguments = (
            current_tool_call.function.arguments.lstrip()
        )

# ...

if current_tool_call_modified:
    if self.current_tool_mistral_id is not None:
        current_tool_call.id = self.current_tool_mistral_id
        self.current_tool_mistral_id = None
    delta_tool_calls.append(current_tool_call)

# ...

# HACK: serving_chat.py inspects the internal state of tool parsers
# when determining its final streaming delta, automatically
# adding autocompleted JSON.
# These two lines avoid that nonsense while ensuring finish_reason
# is set to tool_calls when at least one tool is called.
if delta_tool_calls and not self.prev_tool_call_arr:
    self.prev_tool_call_arr = [{"arguments": {}}]

# ...

if content or len(delta_tool_calls) > 0:
    delta_message = DeltaMessage()
    if content:
        delta_message.content = content
    if len(delta_tool_calls) > 0:
        delta_message.tool_calls = delta_tool_calls
    return delta_message
else:
    if self.streaming_state == StreamingState.ALL_TOOLS_COMPLETE:
        return DeltaMessage()
    else:
        return None

# ...

def _split_delta(
    self,
    delta_text: str,
    stop_after_quotes: int = -1,
    stop_after_opening_curly_braces: int = -1,
    stop_after_closing_curly_braces: int = -1,
    stop_after_closing_brackets: int = -1,
    stop_after_colon: int = -1,
    stop_after_comma: int = -1,
) -> tuple[str, str]:
    delta_to_be_parsed = ""
    for i, c in enumerate(delta_text):
        if c in ['"', "'"]:
            delta_to_be_parsed += c
            stop_after_quotes -= 1
            if stop_after_quotes == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == "{":
            delta_to_be_parsed += c
            stop_after_opening_curly_braces -= 1
            if stop_after_opening_curly_braces == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == "}":
            delta_to_be_parsed += c
            stop_after_closing_curly_braces -= 1
            if stop_after_closing_curly_braces == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == "]":
            delta_to_be_parsed += c
            stop_after_closing_brackets -= 1
            if stop_after_closing_brackets == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == ":":
            delta_to_be_parsed += c
            stop_after_colon -= 1
            if stop_after_colon == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == ",":
            delta_to_be_parsed += c
            stop_after_comma -= 1
            if stop_after_comma == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        else:
            delta_to_be_parsed += c

    return (delta_to_be_parsed, "")

# ...

# Build response
if delta_tool_calls:
    return DeltaMessage(tool_calls=delta_tool_calls)

return None
```
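The splitting logic in `_split_delta` scans character by character and cuts the delta right after the n-th occurrence of a delimiter. The same idea can be sketched standalone; `split_after` is a hypothetical simplified helper handling a single delimiter:

```python
def split_after(text: str, target: str, count: int) -> tuple[str, str]:
    """Return (head, tail) where head ends right after the count-th
    occurrence of `target`; if fewer occurrences exist, tail is empty."""
    taken = ""
    for i, c in enumerate(text):
        taken += c
        if c == target:
            count -= 1
            if count == 0:
                return taken, text[i + 1 :]
    return taken, ""

head, tail = split_after('"name": "add", "args"', ":", 1)
# head == '"name":', tail == ' "add", "args"'
```

The real method tracks several delimiter counters at once so the caller can stop at whichever structural boundary (quote, brace, bracket, colon, comma) arrives first.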
Content that appears before the first [TOOL_CALLS] token is dropped during streaming if it's in the same chunk as the token. The streaming state machine currently doesn't handle content when in the CONTENT state, causing this data loss. The fix involves accumulating content in the CONTENT state and including it in the returned DeltaMessage.
```python
def _stream_tool_calls(
    self, delta_token_ids: Sequence[int]
) -> DeltaMessage | None:
    """
    Stream tool calls using token-based parsing.

    Detects [TOOL_CALLS] and [ARGS] tokens to identify tool call boundaries,
    then streams function names and arguments as they arrive.
    """
    from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy

    delta_tool_calls: list[DeltaToolCall] = []
    content_delta = ""

    for token_id in delta_token_ids:
        if token_id == self.bot_token_id:
            # Starting a new tool call
            self._current_tool_index += 1
            self._current_tool_id = MistralToolCall.generate_random_id()
            self._current_tool_name = ""
            self._current_tool_args = ""
            self._brace_depth = 0
            self._streaming_state = StreamingState.PARSING_TOOL_NAME
            # Set flag for finish_reason detection
            if not self.prev_tool_call_arr:
                self.prev_tool_call_arr = [{"arguments": {}}]
            # Initialize streamed_args_for_tool for this tool index
            while len(self.streamed_args_for_tool) <= self._current_tool_index:
                self.streamed_args_for_tool.append("")
        elif token_id == self._args_token_id:
            # Transition from name to arguments
            if self._streaming_state == StreamingState.PARSING_TOOL_NAME:
                # Emit the complete function name
                delta_tool_calls.append(
                    DeltaToolCall(
                        index=self._current_tool_index,
                        type="function",
                        id=self._current_tool_id,
                        function=DeltaFunctionCall(
                            name=self._current_tool_name.strip()
                        ).model_dump(exclude_none=True),
                    )
                )
                self._streaming_state = StreamingState.PARSING_TOOL_ARGS
        elif token_id == self._call_id_token_id:
            # Skip call ID tokens (they come between name and [ARGS]);
            # we generate our own IDs.
            pass
        elif self._streaming_state == StreamingState.CONTENT:
            # Before any tool call, accumulate as content.
            token_str = self._mistral_base_tokenizer.decode(
                [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
            )
            content_delta += token_str
        elif self._streaming_state == StreamingState.PARSING_TOOL_NAME:
            # Accumulate name tokens
            token_str = self._mistral_base_tokenizer.decode(
                [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
            )
            self._current_tool_name += token_str
        elif self._streaming_state == StreamingState.PARSING_TOOL_ARGS:
            # Stream argument tokens
            token_str = self._mistral_base_tokenizer.decode(
                [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
            )
            # Track brace depth for nested JSON
            for char in token_str:
                if char == "{":
                    self._brace_depth += 1
                elif char == "}":
                    self._brace_depth -= 1
            self._current_tool_args += token_str
            # Update streamed_args_for_tool for vLLM's finish handling
            if self._current_tool_index < len(self.streamed_args_for_tool):
                self.streamed_args_for_tool[self._current_tool_index] = (
                    self._current_tool_args
                )
            # Emit arguments delta
            delta_tool_calls.append(
                DeltaToolCall(
                    index=self._current_tool_index,
                    function=DeltaFunctionCall(
                        arguments=token_str
                    ).model_dump(exclude_none=True),
                )
            )

    # Build response
    if delta_tool_calls or content_delta:
        return DeltaMessage(
            content=content_delta if content_delta else None,
            tool_calls=delta_tool_calls,
        )
    return None
```
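The brace-depth bookkeeping used while parsing arguments can be isolated into a small sketch (the function name and the example stream are illustrative): depth rises on `{`, falls on `}`, and the argument object is complete when depth returns to zero.

```python
def brace_complete(fragment_stream):
    """For each incoming text fragment, yield True once the braces seen so
    far balance back to zero, i.e. the JSON argument object is complete."""
    depth = 0
    started = False
    for chunk in fragment_stream:
        for ch in chunk:
            if ch == "{":
                depth += 1
                started = True
            elif ch == "}":
                depth -= 1
        yield started and depth == 0

results = list(brace_complete(['{"a": {"b"', ': 1}', "}"]))
# [False, False, True]
```

Note this simple counter would miscount braces appearing inside string values; the parser can afford that because it only needs a heuristic for when to stop treating tokens as arguments.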
True, I'll provide a fix soon! Edge case, but a good one.
I believe that upcoming commit 420e6fd67 fixes this. I tested it with mistralai/ministral-3-14b-instruct-2512 and the following prompt and answer
Prompt:
Say "Sure, let me calculate that for you" and then use the add tool to compute 3+5
Streaming response (content + tool call):
Sure, let me calculate that for you
followed by tool call:
{"name": "add", "arguments": "{\"a\": 3, \"b\": 5}"}
```python
# Verify each tool call
for i, expected in enumerate(expected_tool_calls):
    actual = tool_call_data[i]
    assert actual["name"] == expected.function.name, (
        f"Expected name '{expected.function.name}', got '{actual['name']}'"
    )
    assert actual["arguments"] == expected.function.arguments, (
        f"Expected args '{expected.function.arguments}', got '{actual['arguments']}'"
    )
    assert actual["id"] is not None and len(actual["id"]) == 9
```
This test is missing an assertion for expected_content. This masks a bug in the streaming implementation where content before a tool call is dropped. The test should be updated to verify that the content field of the delta_message matches the expected_content.
Suggested change (adds the final content assertion):

```python
# Verify each tool call
for i, expected in enumerate(expected_tool_calls):
    actual = tool_call_data[i]
    assert actual["name"] == expected.function.name, (
        f"Expected name '{expected.function.name}', got '{actual['name']}'"
    )
    assert actual["arguments"] == expected.function.arguments, (
        f"Expected args '{expected.function.arguments}', got '{actual['arguments']}'"
    )
    assert actual["id"] is not None and len(actual["id"]) == 9
assert (delta_message.content or "") == expected_content
```
Indeed, I'll fix this soon, thanks.
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which runs only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Sorry for the force-pushes, I'm not used to the sign-off process.
Signed-off-by: graelo <graelo@graelo.cc>

…e chunk

When content tokens and [TOOL_CALLS] arrive in the same streaming chunk, the content was being dropped. Now we extract content tokens before [TOOL_CALLS] and include them in the response. Also adds a test assertion to verify content is preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: graelo <graelo@graelo.cc>
cc @aarnphm
This removes JSON parsing of the response to avoid failing on invalid JSON responses: when the model emits unescaped \n chars inside JSON strings. This aligns with the deepseek parser by not validating the JSON and letting the client decide what to do.

Signed-off-by: graelo <graelo@graelo.cc>
In this commit, I removed the JSON parsing before emitting the response to the client. This sometimes breaks, as @Erwandsn found in #19425 (comment) and #19425 (comment).
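For context on the failure mode being avoided: a literal newline inside a JSON string is rejected by strict parsers. This is not what the PR does (it forwards the string unparsed), but a sketch shows both the failure and one lenient alternative that Python's stdlib offers:

```python
import json

# A model emitting a literal newline inside a JSON string breaks strict parsing.
raw = '{"note": "line1\nline2"}'

failed_strict = False
try:
    json.loads(raw)  # raises: invalid control character in string
except json.JSONDecodeError:
    failed_strict = True

# strict=False tells the stdlib decoder to accept raw control characters
# inside strings instead of raising.
parsed = json.loads(raw, strict=False)
# parsed == {"note": "line1\nline2"}
```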
aarnphm left a comment:
This is technically a breaking change.
Given https://docs.vllm.ai/en/latest/contributing/deprecation_policy I would like to see a proposal to keep the previous implementation and deprecate it.
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Mistral tool call parser for v11+ models.
```
This means all deployments of older models won't be able to use this anymore.
I'm ok with this, but we might want a notice somewhere that if older models are to be used, vLLM should be pinned to vllm==0.12.0.
Thanks @aarnphm! Yes, I called for discussion on this topic in my intro text.
In the first commit of this PR the pre-v11 models are still supported, but since that code is more complex and these models are relatively weak at tool calling, I suggest this simplification.
Thanks for the idea of the deprecation, that's smoother/smarter.
I'll soon port the 1st commit plus the improvements since then. Would you prefer a separate PR, or should I bring these changes into the current one?
We can bring this into this PR. But if it works better for you, I'm also good with separating it out into a different PR.
@graelo I don't understand your latest changes: the JSON is still not parsed by the model, but now the endpoint returns an HTTP 400. Maybe we should add something to be able to parse the JSON even if the model didn't escape \n correctly?

Edit: I've run JSON semantic validation on the tool calls produced by the model and they all pass the validation.
Hi @Erwandsn. Regarding my last commit: it simply drops the mechanism that accumulates the …

But in your case, this probably makes serialization fail later in the chain, before sending to the client. I'm guessing this based on your feedback that the endpoint returns a 400 (I would have expected a 500) and crashes. Sorry. In the next commit I add a fix for …
Something seems strange: in the first version of your code the json.loads fails and the API returns the raw response.

I'll prepare a completion JSON body to reproduce and forward it here ASAP.
I think #30332 fixed the issue. Closing.
Purpose
Rewrite the Mistral tool parser to use token-based parsing for v11+ models. Special tokens like `[TOOL_CALLS]` and `[ARGS]` have dedicated token IDs that are atomic: they can't be split across streaming chunks, which eliminates edge cases plaguing text-based parsing.

Drops pre-v11 support (Mistral-7B-Instruct-v0.1/v0.2/v0.3). These older models have weak tool calling and the text-based parsing is fragile. Users should upgrade to v11+ models.
Related to #19425 which tries to fix streaming issues while keeping pre-v11 support. This PR takes the opposite approach: drop pre-v11 and sidestep the parsing rabbit hole.
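The atomicity argument can be made concrete with a sketch; the token IDs below are made up, real IDs come from the Mistral tokenizer:

```python
# Hypothetical control-token IDs for illustration only.
TOOL_CALLS_ID = 5
ARGS_ID = 6

def find_tool_boundaries(token_ids):
    """Token-level detection: a control token is a single integer, so it can
    never be split across streaming chunks the way the text "[TOOL_CALLS]"
    could arrive as "[TOOL_" in one chunk and "CALLS]" in the next."""
    return [i for i, t in enumerate(token_ids) if t in (TOOL_CALLS_ID, ARGS_ID)]

find_tool_boundaries([10, 11, 5, 20, 6, 30])  # [2, 4]
```

Text-based parsers need buffering and partial-match heuristics to survive a marker straddling two chunks; an integer comparison needs neither.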
Important
In the first commit of this PR, both pre-v11 and v11+ are supported and tested in the same manner as the 2nd commit (and yes I also tested them with tool calling on a GPU using curl and opencode). However, the tool calling capabilities of these older models are not super duper crazy, and thus I simplified the code in the 2nd commit.
This is of course open to discussion.
Test Plan
Server test with Ministral-3-14B-Instruct-2512:
Test Result
Local pytest: 17 passed, 1 xfailed (BPE tokenization edge case that doesn't occur in real inference)
Server integration on Ministral-3-14B-Instruct-2512:
Additional note
I already shared the tool parser code and how I tested it on GPU: #19425 (comment)