feat: Add streaming support for Mistral v11 tool format #20503
sjuxax wants to merge 12 commits into vllm-project:main from
Conversation
Co-authored-by: avigny <47987522+avigny@users.noreply.github.com> Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat> Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat> Signed-off-by: Jeff Cook <jeff@jeffcook.io>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @sjuxax, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive streaming support for Mistral's v11 tool calling format by completely overhauling the existing tool parsing mechanism. The changes focus on robustly handling incremental output, dynamically adapting to the new format, and improving overall parsing efficiency, ensuring seamless tool integration during model generation.
Highlights
- **Streaming Tool Call Support:** I have significantly improved the Mistral tool calling parser to add robust streaming support for the new Mistral v11 tool format. This includes handling incremental parsing of tool names and arguments.
- **Refactored Parsing Logic:** The core streaming parsing logic within `MistralToolParser` has been completely re-architected. This involves introducing a new state machine (a `StreamingState` enum) and dedicated state variables for more precise and efficient parsing of tool calls as they stream in.
- **Dynamic Format Detection:** The parser now dynamically detects whether the incoming tool call format is the traditional JSON array or the newer Mistral v11 format (e.g., `ToolName{arguments}`), adapting its parsing strategy accordingly.
- **Performance Optimizations:** I've implemented optimized regex patterns and caching mechanisms for JSON parsing to enhance the performance of tool call extraction during streaming.
- **Comprehensive Testing:** A new, extensive test suite has been added to validate both non-streaming and streaming tool call extraction for various scenarios, including single and multiple tool calls, different argument types, and the new v11 format.
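For intuition, the dynamic format detection described in the highlights might look roughly like this hypothetical sketch; the helper name and regex here are illustrative assumptions, not the PR's actual code:

```python
import re

# Hypothetical: a v11 tool call looks like `ToolName{...json...}`,
# while the legacy format is a JSON array starting with `[`.
V11_TOOL_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*\{")

def detect_tool_format(raw: str) -> str:
    """Guess whether `raw` is a legacy JSON array of tool calls
    or the Mistral v11 `ToolName{arguments}` form."""
    stripped = raw.lstrip()
    if stripped.startswith("["):
        return "json_array"
    if V11_TOOL_RE.match(stripped):
        return "v11"
    return "unknown"

print(detect_tool_format('[{"name": "get_weather", "arguments": {}}]'))  # json_array
print(detect_tool_format('get_weather{"location": "Paris"}'))            # v11
```

A real parser also has to make this decision incrementally, before the full string has arrived, which is what motivates the state machine described above.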
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request introduces streaming support for the Mistral v11 tool format. While the refactoring to a state machine is a positive step, there are critical issues related to correctness and robustness, particularly in the v11 parsing logic and test coverage. Addressing these is essential for the stability of the new feature.
@sjuxax hey! I've tested your solution here and it seems to be working, nice job! For non-streaming, I think we can fix it by replacing lines 510:535 with this: I've also commented this on #19425, but am mentioning it here too!
```python
# Core streaming state
self.raw_tool_calls: str = ""
self.streaming_state: StreamingState = StreamingState.WAITING_FOR_TOOL_START
```
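The `StreamingState` initialization shown in the diff above suggests an enum-based state machine; here is a hypothetical sketch of what that could look like. Only `WAITING_FOR_TOOL_START` appears in the diff, so the other state names are illustrative assumptions, not the PR's actual states:

```python
from enum import Enum, auto

class StreamingState(Enum):
    # Only this first state is visible in the diff; the rest are assumed.
    WAITING_FOR_TOOL_START = auto()
    PARSING_NAME = auto()
    PARSING_ARGUMENTS = auto()
    TOOL_COMPLETE = auto()

class ParserSketch:
    def __init__(self) -> None:
        # Mirrors the initialization shown in the diff.
        self.raw_tool_calls: str = ""
        self.streaming_state: StreamingState = StreamingState.WAITING_FOR_TOOL_START

p = ParserSketch()
print(p.streaming_state.name)  # WAITING_FOR_TOOL_START
```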
Can you fix these ruff errors?
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>
…sed tools Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
…ex and JSON decoding Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
… JSON corruption Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
… and using offset-based parsing Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
…mpatibility Co-authored-by: aider (anthropic/claude-sonnet-4-20250514) <aider@aider.chat>
Co-authored-by: aider (claude-opus-4-20250514) <aider@aider.chat>
Addressed Gemini's comments with Sonnet/Opus. I've been using these changes on my Mistral3.1-rebase branch with success for the last week or so. @avigny, I will take a look at your tests and probably pull them in in place of the Opus-autobuilt ones tomorrow. @PedroMiolaSilva, thanks for persistently posting that snippet. I'll test and pull it in tomorrow too.
Here is a snippet which reproduces some errors with Mistral Small 3.2 at commit b521f50:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

is_stream = False  # <--- try also with True

out = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "Where is colder tomorrow San Francisco or New York?"}],
    tools=tools,
    stream=is_stream,
    temperature=0,
)

if is_stream:
    for chunk in out:
        print(chunk.choices[0])
else:
    print(out.choices[0])
```

when I run it with
hey @sjuxax! I've followed your instructions from Hugging Face and tried to run v0.9.2 but with your modified files. I've tried with and without the chat template, and the same error shows up. In general, it works almost always, but I've found an example that breaks something in serving_chat.py, and I'm trying to figure out why. This is the request example (this is a tool from the HubSpot MCP): And then the server goes: Again, most tools work perfectly, so I don't know why this happens with this particular one. Any thoughts? Am I doing something wrong?
re-using @sjuxax `StreamingState` from vllm-project#20503 Co-authored-by: Jeff Cook <jeff@jeffcook.io> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Here is where serving_chat.py inspects the internal state of tool parsers:

- vllm/vllm/entrypoints/openai/serving_chat.py, lines 841 to 845 in ef4d46c
- vllm/vllm/entrypoints/openai/serving_chat.py, line 864 in ef4d46c

The hack done for pythonic_tool_parser or llama4_pythonic_tool_parser should probably be done for the Mistral tool parser too:

- vllm/vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py, lines 167 to 173 in ef4d46c
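To illustrate the pattern this comment describes: the serving layer reads state attributes off the tool parser after streaming finishes, so a parser that tracks state differently internally must still populate those fields. The sketch below is hypothetical; the attribute names `prev_tool_call_arr` and `streamed_args_for_tool` are my assumption of what the linked lines inspect, and the class and method are not vLLM's actual code:

```python
class ToolParserSketch:
    def __init__(self) -> None:
        # State that the serving layer is assumed to inspect after streaming.
        self.prev_tool_call_arr: list[dict] = []
        self.streamed_args_for_tool: list[str] = []

    def finalize(self, name: str, args_json: str) -> None:
        # The "hack": populate the fields the serving layer expects,
        # even if this parser tracks its own state some other way.
        self.prev_tool_call_arr = [{"name": name, "arguments": args_json}]
        self.streamed_args_for_tool = [args_json]

parser = ToolParserSketch()
parser.finalize("get_weather", '{"location": "Paris"}')
print(parser.prev_tool_call_arr[0]["name"])  # get_weather
```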
Hello, what's the status of this work?
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
Hi @sjuxax, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits,
This pull request has merge conflicts that must be resolved before it can be merged.
Hello, any update? Mistral 3
Thank you for the effort on improving Mistral tool call streaming! This work has been superseded by PRs #19425 and #30332, which were merged and address the Mistral tool parser v11 streaming support. The referenced issue #20028 is also closed. We are closing this PR accordingly. If you believe there are additional improvements not covered by those merged PRs, please feel free to open a new PR. Thanks for the contribution!
Follow-up to #19425
Fixes #20028
Purpose
Based on avigny's work in #19425, we substantially improve the Mistral tool calling parser to handle the tool call format in MistralTokenizer v11.
Test Plan
Used avigny's test suite attached to #19425; it passes.
Test Result
Tested streaming against https://huggingface.co/jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym; it works. I didn't test non-streaming or other checkpoints, so I'm not sure whether they work yet.
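The streaming behavior tested above boils down to consuming deltas of the `ToolName{arguments}` form incrementally. As a toy illustration only (not vLLM's implementation; this sketch naively ignores braces inside JSON strings), brace-depth tracking over streamed chunks could look like:

```python
import json

# Hypothetical incremental parser for the v11 `ToolName{arguments}` form.
class V11StreamSketch:
    def __init__(self) -> None:
        self.buffer = ""
        self.name = None
        self.depth = 0

    def feed(self, delta: str):
        """Consume one streamed delta; return (name, args) when complete."""
        for ch in delta:
            self.buffer += ch
            if self.name is None:
                if ch == "{":
                    # Everything before the first brace is the tool name.
                    self.name = self.buffer[:-1]
                    self.depth = 1
                continue
            if ch == "{":
                self.depth += 1
            elif ch == "}":
                self.depth -= 1
                if self.depth == 0:
                    # Arguments object closed: parse the JSON tail.
                    args = json.loads(self.buffer[len(self.name):])
                    return self.name, args
        return None

parser = V11StreamSketch()
result = None
for chunk in ['get_wea', 'ther{"loc', 'ation": "Paris"}']:
    result = parser.feed(chunk) or result
print(result)  # ('get_weather', {'location': 'Paris'})
```

A production parser additionally has to handle braces inside string literals, multiple sequential tool calls, and partial argument emission, which is where the state machine in this PR comes in.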