Mistral tool parser#30063
graelo wants to merge 8 commits into vllm-project:main from graelo:mistral-tool-parser
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
# Regex for non-streaming parsing: name{args}
self.fn_name_regex = re.compile(
    r"([a-zA-Z0-9_-]+)(\{[\s\S]*?\}+)", re.DOTALL
)
```
Handle [ARGS] token in non-streaming parser
The non-streaming extractor never matches the documented v11+ format ([TOOL_CALLS]name[ARGS]{…}) because the regex only accepts name{…} with no [ARGS] marker. When the model outputs the real v11 layout, fn_name_regex.findall returns no matches, so extract_tool_calls returns tools_called=True with an empty tool_calls list, silently dropping every tool invocation. This breaks non-streaming tool calling for all v11+ Mistral models.
Hmm, not really on point. The [ARGS] token exists in the token stream (v11+) but gets decoded as empty string in the text output - it's a control token, not text. So when matching on strings, this regex is appropriate (everything would fall apart in tests otherwise).
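A minimal sketch of this behavior; the decoded string and tool name below are made up for illustration. Because `[ARGS]` decodes to an empty string, the decoded text is just `name{...}`, which the pattern matches directly, nested braces included:

```python
import re

# Same pattern as the parser: a function name immediately followed by a
# JSON object. The [ARGS] control token decodes to "" in text output, so
# it never appears in the string being matched.
fn_name_regex = re.compile(r"([a-zA-Z0-9_-]+)(\{[\s\S]*?\}+)", re.DOTALL)

decoded = 'get_weather{"city": "Paris", "units": {"temp": "C"}}'
matches = fn_name_regex.findall(decoded)
# [('get_weather', '{"city": "Paris", "units": {"temp": "C"}}')]
```

The lazy `[\s\S]*?` combined with the greedy `\}+` lets the match extend through nested closing braces, which is why the one-level-nested object above is captured whole.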
Code Review
This pull request is a great simplification and improvement for Mistral tool parsing, correctly focusing on v11+ models and leveraging token-based parsing for robustness. The code is much cleaner and the tests are well-structured. However, I found a significant issue in the streaming implementation where content preceding a tool call within the same data chunk is dropped. I've provided a fix for this and also a correction for the corresponding test which was masking this bug. Once these are addressed, this will be a solid contribution.
```python
current_tool_call.function.name = self.current_tool_name
self.current_tool_name = None
if self.streaming_state == StreamingState.PARSING_NAME_COMPLETED:
    self.streaming_state = StreamingState.WAITING_FOR_TOOL_KEY
if self.streaming_state in [
    StreamingState.PARSING_ARGUMENTS,
    StreamingState.PARSING_ARGUMENTS_COMPLETED,
]:
    if self.streaming_state == StreamingState.PARSING_ARGUMENTS_COMPLETED:
        self.streaming_state = StreamingState.WAITING_FOR_TOOL_KEY
    # the delta_to_be_parsed is part of arguments.
    current_tool_call_modified = True
    if current_tool_call.function.arguments is None:
        current_tool_call.function.arguments = delta_to_be_parsed
    else:
        current_tool_call.function.arguments += delta_to_be_parsed
    if streaming_state_before_parse != StreamingState.PARSING_ARGUMENTS:
        # It's the first chunk of args; lstrip it.
        current_tool_call.function.arguments = (
            current_tool_call.function.arguments.lstrip()
        )

# ...

if current_tool_call_modified:
    if self.current_tool_mistral_id is not None:
        current_tool_call.id = self.current_tool_mistral_id
        self.current_tool_mistral_id = None
    delta_tool_calls.append(current_tool_call)

# ...

# HACK: serving_chat.py inspects the internal state of tool parsers
# when determining its final streaming delta, automatically
# adding autocompleted JSON.
# These two lines avoid that nonsense while ensuring finish_reason
# is set to tool_calls when at least one tool is called.
if delta_tool_calls and not self.prev_tool_call_arr:
    self.prev_tool_call_arr = [{"arguments": {}}]

# ...

if content or len(delta_tool_calls) > 0:
    delta_message = DeltaMessage()
    if content:
        delta_message.content = content
    if len(delta_tool_calls) > 0:
        delta_message.tool_calls = delta_tool_calls
    return delta_message
else:
    if self.streaming_state == StreamingState.ALL_TOOLS_COMPLETE:
        return DeltaMessage()
    else:
        return None

# ...

def _split_delta(
    self,
    delta_text: str,
    stop_after_quotes: int = -1,
    stop_after_opening_curly_braces: int = -1,
    stop_after_closing_curly_braces: int = -1,
    stop_after_closing_brackets: int = -1,
    stop_after_colon: int = -1,
    stop_after_comma: int = -1,
) -> tuple[str, str]:
    delta_to_be_parsed = ""
    for i, c in enumerate(delta_text):
        if c in ['"', "'"]:
            delta_to_be_parsed += c
            stop_after_quotes -= 1
            if stop_after_quotes == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == "{":
            delta_to_be_parsed += c
            stop_after_opening_curly_braces -= 1
            if stop_after_opening_curly_braces == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == "}":
            delta_to_be_parsed += c
            stop_after_closing_curly_braces -= 1
            if stop_after_closing_curly_braces == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == "]":
            delta_to_be_parsed += c
            stop_after_closing_brackets -= 1
            if stop_after_closing_brackets == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == ":":
            delta_to_be_parsed += c
            stop_after_colon -= 1
            if stop_after_colon == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        elif c == ",":
            delta_to_be_parsed += c
            stop_after_comma -= 1
            if stop_after_comma == 0:
                return (delta_to_be_parsed, delta_text[i + 1 :])
        else:
            delta_to_be_parsed += c

    return (delta_to_be_parsed, "")

# ...

# Build response
if delta_tool_calls:
    return DeltaMessage(tool_calls=delta_tool_calls)

return None
```
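The splitting logic in `_split_delta` scans character by character and cuts the delta right after the n-th occurrence of a delimiter. The same idea can be sketched standalone; `split_after` is a hypothetical simplified helper handling a single delimiter:

```python
def split_after(text: str, target: str, count: int) -> tuple[str, str]:
    """Return (head, tail) where head ends right after the count-th
    occurrence of `target`; if fewer occurrences exist, tail is empty."""
    taken = ""
    for i, c in enumerate(text):
        taken += c
        if c == target:
            count -= 1
            if count == 0:
                return taken, text[i + 1 :]
    return taken, ""

head, tail = split_after('"name": "add", "args"', ":", 1)
# head == '"name":', tail == ' "add", "args"'
```

The real method tracks several delimiter counters at once so the caller can stop at whichever structural boundary (quote, brace, bracket, colon, comma) arrives first.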
Content that appears before the first [TOOL_CALLS] token is dropped during streaming if it's in the same chunk as the token. The streaming state machine currently doesn't handle content when in the CONTENT state, causing this data loss. The fix involves accumulating content in the CONTENT state and including it in the returned DeltaMessage.
```python
def _stream_tool_calls(
    self, delta_token_ids: Sequence[int]
) -> DeltaMessage | None:
    """
    Stream tool calls using token-based parsing.

    Detects [TOOL_CALLS] and [ARGS] tokens to identify tool call boundaries,
    then streams function names and arguments as they arrive.
    """
    from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy

    delta_tool_calls: list[DeltaToolCall] = []
    content_delta = ""

    for token_id in delta_token_ids:
        if token_id == self.bot_token_id:
            # Starting a new tool call
            self._current_tool_index += 1
            self._current_tool_id = MistralToolCall.generate_random_id()
            self._current_tool_name = ""
            self._current_tool_args = ""
            self._brace_depth = 0
            self._streaming_state = StreamingState.PARSING_TOOL_NAME
            # Set flag for finish_reason detection
            if not self.prev_tool_call_arr:
                self.prev_tool_call_arr = [{"arguments": {}}]
            # Initialize streamed_args_for_tool for this tool index
            while len(self.streamed_args_for_tool) <= self._current_tool_index:
                self.streamed_args_for_tool.append("")
        elif token_id == self._args_token_id:
            # Transition from name to arguments
            if self._streaming_state == StreamingState.PARSING_TOOL_NAME:
                # Emit the complete function name
                delta_tool_calls.append(
                    DeltaToolCall(
                        index=self._current_tool_index,
                        type="function",
                        id=self._current_tool_id,
                        function=DeltaFunctionCall(
                            name=self._current_tool_name.strip()
                        ).model_dump(exclude_none=True),
                    )
                )
                self._streaming_state = StreamingState.PARSING_TOOL_ARGS
        elif token_id == self._call_id_token_id:
            # Skip call ID tokens (they come between name and [ARGS]);
            # we generate our own IDs.
            pass
        elif self._streaming_state == StreamingState.CONTENT:
            # Before any tool call, accumulate as content.
            token_str = self._mistral_base_tokenizer.decode(
                [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
            )
            content_delta += token_str
        elif self._streaming_state == StreamingState.PARSING_TOOL_NAME:
            # Accumulate name tokens
            token_str = self._mistral_base_tokenizer.decode(
                [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
            )
            self._current_tool_name += token_str
        elif self._streaming_state == StreamingState.PARSING_TOOL_ARGS:
            # Stream argument tokens
            token_str = self._mistral_base_tokenizer.decode(
                [token_id], special_token_policy=SpecialTokenPolicy.IGNORE
            )
            # Track brace depth for nested JSON
            for char in token_str:
                if char == "{":
                    self._brace_depth += 1
                elif char == "}":
                    self._brace_depth -= 1
            self._current_tool_args += token_str
            # Update streamed_args_for_tool for vLLM's finish handling
            if self._current_tool_index < len(self.streamed_args_for_tool):
                self.streamed_args_for_tool[self._current_tool_index] = (
                    self._current_tool_args
                )
            # Emit arguments delta
            delta_tool_calls.append(
                DeltaToolCall(
                    index=self._current_tool_index,
                    function=DeltaFunctionCall(
                        arguments=token_str
                    ).model_dump(exclude_none=True),
                )
            )

    # Build response
    if delta_tool_calls or content_delta:
        return DeltaMessage(
            content=content_delta if content_delta else None,
            tool_calls=delta_tool_calls,
        )
    return None
```
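The brace-depth bookkeeping used while parsing arguments can be isolated into a small sketch (the function name and the example stream are illustrative): depth rises on `{`, falls on `}`, and the argument object is complete when depth returns to zero.

```python
def brace_complete(fragment_stream):
    """For each incoming text fragment, yield True once the braces seen so
    far balance back to zero, i.e. the JSON argument object is complete."""
    depth = 0
    started = False
    for chunk in fragment_stream:
        for ch in chunk:
            if ch == "{":
                depth += 1
                started = True
            elif ch == "}":
                depth -= 1
        yield started and depth == 0

results = list(brace_complete(['{"a": {"b"', ': 1}', "}"]))
# [False, False, True]
```

Note this simple counter would miscount braces appearing inside string values; the parser can afford that because it only needs a heuristic for when to stop treating tokens as arguments.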
True, I'll provide a fix soon! Edge case, but a good one.
I believe that upcoming commit 420e6fd67 fixes this. I tested it with mistralai/ministral-3-14b-instruct-2512 and the following prompt and answer
Prompt:
Say "Sure, let me calculate that for you" and then use the add tool to compute 3+5
Streaming response (content + tool call):
Sure, let me calculate that for you
followed by tool call:
{"name": "add", "arguments": "{\"a\": 3, \"b\": 5}"}
```python
# Verify each tool call
for i, expected in enumerate(expected_tool_calls):
    actual = tool_call_data[i]
    assert actual["name"] == expected.function.name, (
        f"Expected name '{expected.function.name}', got '{actual['name']}'"
    )
    assert actual["arguments"] == expected.function.arguments, (
        f"Expected args '{expected.function.arguments}', got '{actual['arguments']}'"
    )
    assert actual["id"] is not None and len(actual["id"]) == 9
```
This test is missing an assertion for expected_content. This masks a bug in the streaming implementation where content before a tool call is dropped. The test should be updated to verify that the content field of the delta_message matches the expected_content.
Suggested change (adds the final content assertion):

```python
# Verify each tool call
for i, expected in enumerate(expected_tool_calls):
    actual = tool_call_data[i]
    assert actual["name"] == expected.function.name, (
        f"Expected name '{expected.function.name}', got '{actual['name']}'"
    )
    assert actual["arguments"] == expected.function.arguments, (
        f"Expected args '{expected.function.arguments}', got '{actual['arguments']}'"
    )
    assert actual["id"] is not None and len(actual["id"]) == 9
assert (delta_message.content or "") == expected_content
```
Indeed, I'll fix this soon, thanks.
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which runs only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Sorry for the force-pushes, I'm not used to the sign-off process.
Signed-off-by: graelo <graelo@graelo.cc>

…e chunk

When content tokens and [TOOL_CALLS] arrive in the same streaming chunk, the content was being dropped. Now we extract content tokens before [TOOL_CALLS] and include them in the response. Also adds a test assertion to verify content is preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: graelo <graelo@graelo.cc>
cc @aarnphm
This removes JSON parsing of the response to avoid failing on invalid JSON responses: when the model emits unescaped \n chars inside JSON strings. This aligns with the deepseek parser by not validating the JSON and letting the client decide what to do.

Signed-off-by: graelo <graelo@graelo.cc>
In this commit, I removed the JSON parsing before emitting the response to the client. This sometimes breaks, as @Erwandsn found in #19425 (comment) and #19425 (comment).
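For context on the failure mode being avoided: a literal newline inside a JSON string is rejected by strict parsers. This is not what the PR does (it forwards the string unparsed), but a sketch shows both the failure and one lenient alternative that Python's stdlib offers:

```python
import json

# A model emitting a literal newline inside a JSON string breaks strict parsing.
raw = '{"note": "line1\nline2"}'

failed_strict = False
try:
    json.loads(raw)  # raises: invalid control character in string
except json.JSONDecodeError:
    failed_strict = True

# strict=False tells the stdlib decoder to accept raw control characters
# inside strings instead of raising.
parsed = json.loads(raw, strict=False)
# parsed == {"note": "line1\nline2"}
```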
aarnphm left a comment:
This is technically a breaking change.
Given https://docs.vllm.ai/en/latest/contributing/deprecation_policy I would like to see a proposal to keep the previous implementation and deprecate it.
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Mistral tool call parser for v11+ models.
```
This means all deployments of older models won't be able to use this anymore.
I'm ok with this, but we might want a notice somewhere that if older models are to be used, vLLM should be pinned to vllm==0.12.0.
Thanks @aarnphm! Yes, I called for discussion on this topic in my intro text.
In the first commit of this PR the pre-v11 models are still supported, but since that code is more complex and these models are relatively weak at tool calling, I suggest this simplification.
Thanks for the idea of the deprecation, that's smoother/smarter.
I'll soon port the 1st commit plus the improvements since then. Would you prefer a separate PR, or should I bring these changes into the current one?
We can bring this into this PR. But if it works better for you, I'm also good with separating it out into a different PR.
@graelo I don't understand your latest changes: the JSON is still not parsed by the model, but now the endpoint returns an HTTP 400. Maybe we should add something to be able to parse the JSON even if the model didn't escape \n correctly?

Edit: I've run JSON semantic validation on the tool calls produced by the model and they all pass the validation.
Hi @Erwandsn. Regarding my last commit: it simply drops the mechanism that accumulates the …

But in your case, this probably makes serialization fail later in the chain, before sending to the client. I'm guessing this based on your feedback that the endpoint returns a 400 (I would have expected a 500) and crashes. Sorry. In the next commit I add a fix for …
Something seems strange: in the first version of your code the json.loads fails and the API returns the raw response.

I'll prepare a completion JSON body to reproduce and forward it here ASAP.
I think #30332 fixed the issue. Closing.
Purpose
Rewrite the Mistral tool parser to use token-based parsing for v11+ models. Special tokens like `[TOOL_CALLS]` and `[ARGS]` have dedicated token IDs that are atomic: they can't be split across streaming chunks, which eliminates edge cases plaguing text-based parsing.

Drops pre-v11 support (Mistral-7B-Instruct-v0.1/v0.2/v0.3). These older models have weak tool calling and the text-based parsing is fragile. Users should upgrade to v11+ models.
Related to #19425 which tries to fix streaming issues while keeping pre-v11 support. This PR takes the opposite approach: drop pre-v11 and sidestep the parsing rabbit hole.
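The atomicity argument can be made concrete with a sketch; the token IDs below are made up, real IDs come from the Mistral tokenizer:

```python
# Hypothetical control-token IDs for illustration only.
TOOL_CALLS_ID = 5
ARGS_ID = 6

def find_tool_boundaries(token_ids):
    """Token-level detection: a control token is a single integer, so it can
    never be split across streaming chunks the way the text "[TOOL_CALLS]"
    could arrive as "[TOOL_" in one chunk and "CALLS]" in the next."""
    return [i for i, t in enumerate(token_ids) if t in (TOOL_CALLS_ID, ARGS_ID)]

find_tool_boundaries([10, 11, 5, 20, 6, 30])  # [2, 4]
```

Text-based parsers need buffering and partial-match heuristics to survive a marker straddling two chunks; an integer comparison needs neither.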
Important
In the first commit of this PR, both pre-v11 and v11+ are supported and tested in the same manner as the 2nd commit (and yes I also tested them with tool calling on a GPU using curl and opencode). However, the tool calling capabilities of these older models are not super duper crazy, and thus I simplified the code in the 2nd commit.
This is of course open to discussion.
Test Plan
Server test with Ministral-3-14B-Instruct-2512:
Test Result
Local pytest: 17 passed, 1 xfailed (BPE tokenization edge case that doesn't occur in real inference)
Server integration on Ministral-3-14B-Instruct-2512:
Additional note
I already shared the tool parser code and how I tested it on GPU: #19425 (comment)