Mistral tool parser #30063

Closed
graelo wants to merge 8 commits into vllm-project:main from graelo:mistral-tool-parser

Conversation


@graelo graelo commented Dec 4, 2025

Purpose

Rewrite Mistral tool parser to use token-based parsing for v11+ models. Special tokens like [TOOL_CALLS] and [ARGS] have dedicated token IDs that are atomic - they can't be split across streaming chunks, which eliminates edge cases plaguing text-based parsing.

Drops pre-v11 support (Mistral-7B-Instruct-v0.1/v0.2/v0.3). These older models have weak tool calling and the text-based parsing is fragile. Users should upgrade to v11+ models.

Related to #19425 which tries to fix streaming issues while keeping pre-v11 support. This PR takes the opposite approach: drop pre-v11 and sidestep the parsing rabbit hole.

Important

In the first commit of this PR, both pre-v11 and v11+ are supported and tested in the same manner as the 2nd commit (and yes I also tested them with tool calling on a GPU using curl and opencode). However, the tool calling capabilities of these older models are not super duper crazy, and thus I simplified the code in the 2nd commit.

This is of course open to discussion.

Test Plan

pytest tests/tool_use/test_mistral_tool_parser.py -v

Server test with Ministral-3-14B-Instruct-2512:

vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
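For a manual check, a tool-calling request along these lines can be POSTed to the served endpoint (the `add` tool schema here is a hypothetical example, not taken from the PR; this snippet only builds the payload):

```python
import json

payload = {
    "model": "mistralai/Ministral-3-14B-Instruct-2512",
    "messages": [
        {"role": "user", "content": "Use the add tool to compute 3+5"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "add",
                "description": "Add two integers",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "a": {"type": "integer"},
                        "b": {"type": "integer"},
                    },
                    "required": ["a", "b"],
                },
            },
        }
    ],
    "tool_choice": "auto",
    "stream": True,
}

# Send as the body of /v1/chat/completions, e.g. with:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @payload.json
print(json.dumps(payload))
```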

Test Result

Local pytest: 17 passed, 1 xfailed (BPE tokenization edge case that doesn't occur in real inference)

Server integration on Ministral-3-14B-Instruct-2512:

  • Single tool streaming/non-streaming: pass
  • Multiple tools streaming/non-streaming: pass

Additional note

I already shared the tool parser code and how I tested it on GPU: #19425 (comment)


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +133 to +136
    # Regex for non-streaming parsing: name{args}
    self.fn_name_regex = re.compile(
        r"([a-zA-Z0-9_-]+)(\{[\s\S]*?\}+)", re.DOTALL
    )


P1: Handle [ARGS] token in non-streaming parser

The non-streaming extractor never matches the documented v11+ format ([TOOL_CALLS]name[ARGS]{…}) because the regex only accepts name{…} with no [ARGS] marker. When the model outputs the real v11 layout, fn_name_regex.findall returns no matches, so extract_tool_calls returns tools_called=True with an empty tool_calls list, silently dropping every tool invocation. This breaks non-streaming tool calling for all v11+ Mistral models.


Author

Hmm, not really on point. The [ARGS] token exists in the token stream (v11+) but gets decoded as an empty string in the text output: it's a control token, not text. So when matching on strings, this regex is appropriate (otherwise everything would fall apart in the tests).
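For illustration, here is that regex applied to decoded v11+ output, in which the [TOOL_CALLS] and [ARGS] control tokens have already vanished (the example strings are made up):

```python
import re

# Same pattern as in the diff: matches `name{json args}` pairs.
fn_name_regex = re.compile(r"([a-zA-Z0-9_-]+)(\{[\s\S]*?\}+)", re.DOTALL)

# Decoded text for two back-to-back tool calls; the control tokens
# decode to "" because they are token IDs, not text.
decoded = 'add{"a": 3, "b": 5}multiply{"x": 2, "y": 4}'
print(fn_name_regex.findall(decoded))
# → [('add', '{"a": 3, "b": 5}'), ('multiply', '{"x": 2, "y": 4}')]
```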

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great simplification and improvement for Mistral tool parsing, correctly focusing on v11+ models and leveraging token-based parsing for robustness. The code is much cleaner and the tests are well-structured. However, I found a significant issue in the streaming implementation where content preceding a tool call within the same data chunk is dropped. I've provided a fix for this and also a correction for the corresponding test which was masking this bug. Once these are addressed, this will be a solid contribution.

Comment on lines +234 to +333
    current_tool_call.function.name = self.current_tool_name
    self.current_tool_name = None
    if self.streaming_state == StreamingState.PARSING_NAME_COMPLETED:
        self.streaming_state = StreamingState.WAITING_FOR_TOOL_KEY
    if self.streaming_state in [
        StreamingState.PARSING_ARGUMENTS,
        StreamingState.PARSING_ARGUMENTS_COMPLETED,
    ]:
        if self.streaming_state == StreamingState.PARSING_ARGUMENTS_COMPLETED:
            self.streaming_state = StreamingState.WAITING_FOR_TOOL_KEY
        # the delta_to_be_parsed is part of arguments
        current_tool_call_modified = True
        if current_tool_call.function.arguments is None:
            current_tool_call.function.arguments = delta_to_be_parsed
        else:
            current_tool_call.function.arguments += delta_to_be_parsed
        if streaming_state_before_parse != StreamingState.PARSING_ARGUMENTS:
            # It's the first chunk of args; lstrip it.
            current_tool_call.function.arguments = (
                current_tool_call.function.arguments.lstrip()
            )

    if current_tool_call_modified:
        if self.current_tool_mistral_id is not None:
            current_tool_call.id = self.current_tool_mistral_id
            self.current_tool_mistral_id = None
        delta_tool_calls.append(current_tool_call)

    # HACK: serving_chat.py inspects the internal state of tool parsers
    # when determining its final streaming delta, automatically adding
    # autocompleted JSON. These two lines avoid that while ensuring
    # finish_reason is set to tool_calls when at least one tool is called.
    if delta_tool_calls and not self.prev_tool_call_arr:
        self.prev_tool_call_arr = [{"arguments": {}}]

    if content or len(delta_tool_calls) > 0:
        delta_message = DeltaMessage()
        if content:
            delta_message.content = content
        if len(delta_tool_calls) > 0:
            delta_message.tool_calls = delta_tool_calls
        return delta_message
    else:
        if self.streaming_state == StreamingState.ALL_TOOLS_COMPLETE:
            return DeltaMessage()
        else:
            return None

    def _split_delta(
        self,
        delta_text: str,
        stop_after_quotes: int = -1,
        stop_after_opening_curly_braces: int = -1,
        stop_after_closing_curly_braces: int = -1,
        stop_after_closing_brackets: int = -1,
        stop_after_colon: int = -1,
        stop_after_comma: int = -1,
    ) -> tuple[str, str]:
        delta_to_be_parsed = ""
        for i, c in enumerate(delta_text):
            if c in ['"', "'"]:
                delta_to_be_parsed += c
                stop_after_quotes -= 1
                if stop_after_quotes == 0:
                    return (delta_to_be_parsed, delta_text[i + 1 :])
            elif c == "{":
                delta_to_be_parsed += c
                stop_after_opening_curly_braces -= 1
                if stop_after_opening_curly_braces == 0:
                    return (delta_to_be_parsed, delta_text[i + 1 :])
            elif c == "}":
                delta_to_be_parsed += c
                stop_after_closing_curly_braces -= 1
                if stop_after_closing_curly_braces == 0:
                    return (delta_to_be_parsed, delta_text[i + 1 :])
            elif c == "]":
                delta_to_be_parsed += c
                stop_after_closing_brackets -= 1
                if stop_after_closing_brackets == 0:
                    return (delta_to_be_parsed, delta_text[i + 1 :])
            elif c == ":":
                delta_to_be_parsed += c
                stop_after_colon -= 1
                if stop_after_colon == 0:
                    return (delta_to_be_parsed, delta_text[i + 1 :])
            elif c == ",":
                delta_to_be_parsed += c
                stop_after_comma -= 1
                if stop_after_comma == 0:
                    return (delta_to_be_parsed, delta_text[i + 1 :])
            else:
                delta_to_be_parsed += c

        return (delta_to_be_parsed, "")

    # Build response
    if delta_tool_calls:
        return DeltaMessage(tool_calls=delta_tool_calls)

    return None
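As an aside, the `_split_delta` helper in the hunk above can be exercised standalone; the sketch below keeps the same logic with `self` dropped and the unused stop conditions omitted (it is an illustration, not the PR's test code):

```python
def split_delta(
    delta_text: str, stop_after_closing_curly_braces: int = -1
) -> tuple[str, str]:
    """Split delta_text just after the nth closing curly brace.

    A negative count (the default) means "never stop", so the whole
    text is returned as the parsed half.
    """
    parsed = ""
    for i, c in enumerate(delta_text):
        parsed += c
        if c == "}":
            stop_after_closing_curly_braces -= 1
            if stop_after_closing_curly_braces == 0:
                return (parsed, delta_text[i + 1 :])
    return (parsed, "")

print(split_delta('{"a": 1}, "next', stop_after_closing_curly_braces=1))
# → ('{"a": 1}', ', "next')
```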
Contributor

high

Content that appears before the first [TOOL_CALLS] token is dropped during streaming if it's in the same chunk as the token. The streaming state machine currently doesn't handle content when in the CONTENT state, causing this data loss. The fix involves accumulating content in the CONTENT state and including it in the returned DeltaMessage.

    def _stream_tool_calls(
        self, delta_token_ids: Sequence[int]
    ) -> DeltaMessage | None:
        """
        Stream tool calls using token-based parsing.

        Detects [TOOL_CALLS] and [ARGS] tokens to identify tool call boundaries,
        then streams function names and arguments as they arrive.
        """
        from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy

        delta_tool_calls: list[DeltaToolCall] = []
        content_delta = ""

        for token_id in delta_token_ids:
            if token_id == self.bot_token_id:
                # Starting a new tool call
                self._current_tool_index += 1
                self._current_tool_id = MistralToolCall.generate_random_id()
                self._current_tool_name = ""
                self._current_tool_args = ""
                self._brace_depth = 0
                self._streaming_state = StreamingState.PARSING_TOOL_NAME

                # Set flag for finish_reason detection
                if not self.prev_tool_call_arr:
                    self.prev_tool_call_arr = [{"arguments": {}}]

                # Initialize streamed_args_for_tool for this tool index
                while len(self.streamed_args_for_tool) <= self._current_tool_index:
                    self.streamed_args_for_tool.append("")

            elif token_id == self._args_token_id:
                # Transition from name to arguments
                if self._streaming_state == StreamingState.PARSING_TOOL_NAME:
                    # Emit the complete function name
                    delta_tool_calls.append(
                        DeltaToolCall(
                            index=self._current_tool_index,
                            type="function",
                            id=self._current_tool_id,
                            function=DeltaFunctionCall(
                                name=self._current_tool_name.strip()
                            ).model_dump(exclude_none=True),
                        )
                    )
                    self._streaming_state = StreamingState.PARSING_TOOL_ARGS

            elif token_id == self._call_id_token_id:
                # Skip call ID tokens (they come between name and [ARGS])
                # We generate our own IDs
                pass

            elif self._streaming_state == StreamingState.CONTENT:
                # Before any tool call, accumulate as content.
                token_str = self._mistral_base_tokenizer.decode(
                    [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
                )
                content_delta += token_str

            elif self._streaming_state == StreamingState.PARSING_TOOL_NAME:
                # Accumulate name tokens
                token_str = self._mistral_base_tokenizer.decode(
                    [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
                )
                self._current_tool_name += token_str

            elif self._streaming_state == StreamingState.PARSING_TOOL_ARGS:
                # Stream argument tokens
                token_str = self._mistral_base_tokenizer.decode(
                    [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
                )

                # Track brace depth for nested JSON
                for char in token_str:
                    if char == "{":
                        self._brace_depth += 1
                    elif char == "}":
                        self._brace_depth -= 1

                self._current_tool_args += token_str

                # Update streamed_args_for_tool for vLLM's finish handling
                if self._current_tool_index < len(self.streamed_args_for_tool):
                    self.streamed_args_for_tool[self._current_tool_index] = (
                        self._current_tool_args
                    )

                # Emit arguments delta
                delta_tool_calls.append(
                    DeltaToolCall(
                        index=self._current_tool_index,
                        function=DeltaFunctionCall(
                            arguments=token_str
                        ).model_dump(exclude_none=True),
                    )
                )

        # Build response
        if delta_tool_calls or content_delta:
            return DeltaMessage(
                content=content_delta if content_delta else None,
                tool_calls=delta_tool_calls)

        return None

Author

True, I'll provide a fix soon! Edge case, but a good one.

@graelo graelo Dec 4, 2025

I believe that upcoming commit 420e6fd67 fixes this. I tested it with mistralai/ministral-3-14b-instruct-2512 and the following prompt and answer:

Prompt:
Say "Sure, let me calculate that for you" and then use the add tool to compute 3+5

Streaming response (content + tool call):
Sure, let me calculate that for you
followed by tool call:
{"name": "add", "arguments": "{\"a\": 3, \"b\": 5}"}

Comment on lines +602 to +611
    # Verify each tool call
    for i, expected in enumerate(expected_tool_calls):
        actual = tool_call_data[i]
        assert actual["name"] == expected.function.name, (
            f"Expected name '{expected.function.name}', got '{actual['name']}'"
        )
        assert actual["arguments"] == expected.function.arguments, (
            f"Expected args '{expected.function.arguments}', got '{actual['arguments']}'"
        )
        assert actual["id"] is not None and len(actual["id"]) == 9
Contributor

high

This test is missing an assertion for expected_content. This masks a bug in the streaming implementation where content before a tool call is dropped. The test should be updated to verify that the content field of the delta_message matches the expected_content.

Suggested change
    # Verify each tool call
    for i, expected in enumerate(expected_tool_calls):
        actual = tool_call_data[i]
        assert actual["name"] == expected.function.name, (
            f"Expected name '{expected.function.name}', got '{actual['name']}'"
        )
        assert actual["arguments"] == expected.function.arguments, (
            f"Expected args '{expected.function.arguments}', got '{actual['arguments']}'"
        )
        assert actual["id"] is not None and len(actual["id"]) == 9
    assert (delta_message.content or "") == expected_content

Author

Indeed, I'll fix this soon, thanks.


github-actions bot commented Dec 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀


graelo commented Dec 4, 2025

Sorry for the force-pushes, I'm not used to the sign-off process.

graelo and others added 6 commits December 4, 2025 22:53
Signed-off-by: graelo <graelo@graelo.cc>
…e chunk

When content tokens and [TOOL_CALLS] arrive in the same streaming chunk,
the content was being dropped. Now we extract content tokens before
[TOOL_CALLS] and include them in the response.

Also adds test assertion to verify content is preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: graelo <graelo@graelo.cc>
Signed-off-by: graelo <graelo@graelo.cc>
Signed-off-by: graelo <graelo@graelo.cc>
@LucasWilkinson
Collaborator

cc @aarnphm

This removes JSON parsing of the response to avoid failing on invalid
JSON responses: when the model emits unescaped \n chars inside JSON
strings.

This aligns with the deepseek parser by not validating the JSON and
letting the client decide what to do.

Signed-off-by: graelo <graelo@graelo.cc>

graelo commented Dec 6, 2025

In this commit, I removed the JSON parsing before emitting the response to the client. This sometimes breaks, as @Erwandsn found in #19425 (comment) and #19425 (comment)

Commit message:

This removes JSON parsing of the response to avoid failing on invalid
json responses: when the model emits unescaped \n chars inside JSON
strings.

This aligns with the deepseek parser by not validating the JSON and
letting the client decide what to do.

@aarnphm aarnphm (Collaborator) left a comment

This is technically a breaking change.

Given https://docs.vllm.ai/en/latest/contributing/deprecation_policy, I would like to see a proposal to keep the previous implementation and deprecate it.

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Mistral tool call parser for v11+ models.
Collaborator

This means all deployments of older models won't be able to use this anymore.

I'm OK with this, but we might want a notice somewhere that if older models are to be used, they should be locked to vllm==0.12.0.

Author

Thanks @aarnphm! Yes, I called for discussion on this topic in my intro text.

In the first commit of this PR, pre-v11 models are still supported, but since that code is more complex and these models are relatively weak at tool calling, I suggested this simplification.

Thanks for the idea of the deprecation, that's smoother/smarter.

I'll soon port the 1st commit plus the improvements since then. Would you prefer a separate PR, or that I bring these changes into the current one?

Collaborator

We can bring this into this PR. But if it works better for you, I'm also good with a separate PR.


Erwandsn commented Dec 6, 2025

@graelo I don't understand your latest changes: the JSON is still not parsed by the model, but now the endpoint returns an HTTP 400. Maybe we should add something to be able to parse the JSON even if the model didn't escape \n correctly?

Edit: I ran JSON semantic validation on the tool calls produced by the model and they all pass.

Signed-off-by: graelo <graelo@graelo.cc>

graelo commented Dec 6, 2025

Hi @Erwandsn. Regarding my last commit: it simply drops the mechanism that accumulates the json.loads-ed args generated by the model, and instead emits the ToolCall with the matched string without further interpretation. This is the same as https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/tool_parsers/deepseekv3_tool_parser.py#L101-L103

But in your case, this probably makes serialization fail later in the chain, before sending to the client. I'm guessing this from your feedback that the endpoint returns a 400 (I would have expected a 500) and crashes. Sorry.

In the next commit I add a fix for \n, with a couple of tests: one for the case where the model generates correctly escaped \n, and one for the opposite. In both cases, the output correctly escapes \n. You'll see \t and \r are treated the same, just to be on the safe side.

  • I currently don't know how to make the model generate invalid JSON strings like yours. Could you help with this? I'm currently happy with both ministral-3-14b and devstral-small-2507.
  • I'd expect some discussion on the topic of "should vLLM validate/fix the JSON emitted by the model". It's not for me to decide.
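The kind of fix described here could look roughly like the sketch below (illustrative names, not the PR's actual helper; it assumes raw control characters only appear inside JSON string values, since escaping them between tokens would corrupt otherwise-legal whitespace):

```python
import json

# Map raw control characters to their JSON escape sequences. Already-escaped
# sequences like backslash + "n" are two ordinary characters, so they pass
# through untouched -- no double escaping.
_CONTROL_ESCAPES = {"\n": "\\n", "\t": "\\t", "\r": "\\r"}

def escape_raw_controls(raw_args: str) -> str:
    """Re-escape raw control characters the model left unescaped."""
    return "".join(_CONTROL_ESCAPES.get(ch, ch) for ch in raw_args)

raw = '{"text": "line1\nline2"}'  # model emitted an unescaped newline
# json.loads(raw) would raise: raw control chars are invalid inside strings.
fixed = escape_raw_controls(raw)
print(repr(json.loads(fixed)["text"]))  # → 'line1\nline2'
```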


Erwandsn commented Dec 6, 2025

Something seems strange: in the first version of your code, json.loads fails and the API returns the raw response.
Here is an example of a response producing the error: unparsed_raw_response.txt. I checked it in a validator and the JSON is valid, so json.loads should work here.

  1. In my case I'm on Ministral-3-8b-instruct.

  2. As the JSON produced by the model is correct, I think this is not necessary.

I'll prepare a completion JSON body to reproduce and forward it here ASAP.


graelo commented Dec 9, 2025

I think #30332 fixed the issue. Closing.

@graelo graelo closed this Dec 9, 2025
@graelo graelo deleted the mistral-tool-parser branch December 9, 2025 15:56