fix: tool_choice="required" falls back to tool_parser for non-JSON formats (streaming + non-streaming) #35936
Conversation
Code Review
This pull request effectively addresses a bug where tool_choice="required" would fail for non-JSON tool call formats. The fix, which involves adding a try...except block to fall back to the configured tool_parser, is logical and mirrors the behavior of tool_choice="auto". This is a good, low-risk solution. I have one suggestion to make the exception handling more specific and robust.
```
        for tool_call in tool_calls
    ]
)
except Exception:
```
Using `except Exception:` is too broad and can mask unexpected bugs — an unrelated `TypeError` or `AttributeError` raised inside the block would be silently swallowed. It's a best practice to catch more specific exceptions. In this case, `TypeAdapter(...).validate_json()` raises `pydantic.ValidationError` on failure. Catching this specific exception makes the code safer and the intent clearer. You will need to add `from pydantic import ValidationError` at the top of the file.
```diff
-except Exception:
+except ValidationError:
```
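The reviewer's suggestion can be sketched in a few lines. This sketch uses the stdlib `json.JSONDecodeError` as a stand-in for pydantic's `ValidationError`, since the point here is the narrow catch, not the library:

```python
import json

def parse_tool_calls(content: str) -> list:
    """Sketch of narrow exception handling: catch only the parser's
    specific failure, not bare Exception. json.JSONDecodeError stands
    in for pydantic's ValidationError; the real code would do
    `from pydantic import ValidationError` and catch that instead."""
    try:
        calls = json.loads(content)
    except json.JSONDecodeError:
        # Narrow catch: only the expected parse failure is handled here,
        # so unrelated bugs in the surrounding code still surface.
        return []
    return calls if isinstance(calls, list) else []

parse_tool_calls('[{"name": "get_weather", "arguments": "{}"}]')  # parses
parse_tool_calls('<tool_call>{"name": "get_weather"}</tool_call>')  # not JSON
```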
Hi @voipmonitor, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed 2817ac6 to ac8975e
Cherry-pick upstream fixes for GB10 Spark (SM121):

- PR vllm-project#35568: Recognize SM121 as SM120 family for Marlin/CUTLASS FP8 kernels (generate_kernels.py, ops.cu, scaled_mm*.cuh, marlin_utils.py)
- PR vllm-project#35675: Fix Qwen3.5 MTP fc layer weight shape mismatch with NVFP4 by using ReplicatedLinear with quant_config=None
- PR vllm-project#35833: FP8 KV cache for Triton MLA decode on Blackwell — adds on-the-fly FP8 dequantization in Triton kernels
- PR vllm-project#35936: tool_choice="required" falls back to tool_parser for non-JSON (XML) tool calls from Qwen3 models

Local patches:

- Patch FlashInfer TRTLLM JIT to compile for SM12x (supported_major_versions=[10] → [10, 12])
- Skip VLLM_TEST_FORCE_FP8_MARLIN for NVFP4 MoE (not SM121-ready)
Force-pushed 4fb8b2c to 50662ad
fix: tool_choice="required" falls back to tool_parser for non-JSON formats

When tool_choice="required", the non-streaming path in _parse_tool_calls_from_content() unconditionally calls TypeAdapter(list[FunctionDefinition]).validate_json(content), which fails with a 400 error when the model produces non-JSON tool calls (e.g. XML format from Qwen3).

This patch wraps the JSON validation in a try/except and falls back to the configured tool_parser (e.g. qwen3_coder) when JSON parsing fails. This is consistent with the tool_choice="auto" branch, which already uses the tool parser. Models that produce JSON tool calls are unaffected — the try block succeeds and behavior is identical to before.

Signed-off-by: voipmonitor <festr@voipmonitor.org>
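The fallback this commit message describes can be sketched as follows. The names `parse_required_tool_calls` and `toy_xml_tool_parser` are illustrative stand-ins written for this sketch, not vLLM's actual API:

```python
import json
import re

def toy_xml_tool_parser(text: str) -> list:
    """Toy stand-in for a configured tool parser such as qwen3_coder:
    pulls JSON bodies out of <tool_call>...</tool_call> tags."""
    return [json.loads(body) for body in
            re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.S)]

def parse_required_tool_calls(content: str, tool_parser=None) -> list:
    """Hypothetical sketch of the non-streaming fallback described above."""
    try:
        # Original path: the whole response must be a JSON list of calls.
        calls = json.loads(content)
        if not isinstance(calls, list):
            raise ValueError("expected a JSON list of tool calls")
        return calls
    except (json.JSONDecodeError, ValueError):
        # Fallback: route non-JSON output (e.g. Qwen3 XML) through the
        # configured tool parser, mirroring the tool_choice="auto" branch.
        if tool_parser is not None:
            return tool_parser(content)
        raise

json_out = parse_required_tool_calls('[{"name": "get_time"}]')
xml_out = parse_required_tool_calls(
    '<tool_call>{"name": "get_time"}</tool_call>', toy_xml_tool_parser)
```

JSON responses never reach the fallback, which matches the claim that JSON-producing models are unaffected.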
The non-streaming fix (engine/serving.py) handles the case where the full response contains non-JSON tool calls. However, the streaming code path in chat_completion/serving.py has an analogous bug: the "required" branch calls extract_tool_call_required_streaming(), which uses partial_json_loads() — a JSON-only parser that cannot handle XML tool calls (e.g. from Qwen3 with the qwen3_coder parser).

This patch makes the streaming "required" branch use the configured tool_parser.extract_tool_calls_streaming() when available, mirroring the existing "auto" + reasoning branch logic. This includes resetting previous_text and previous_token_ids after reasoning ends, so the tool_parser sees only post-reasoning content.

Falls back to the original JSON-only extract_tool_call_required_streaming() when no tool_parser is configured, preserving backward compatibility.

Signed-off-by: voipmonitor <festr@voipmonitor.org>
Three fixes for the streaming `tool_choice="required"` code path:

1. Initialize tool_parsers for "required" (not just "auto"): `tool_parsers` were only created when `tool_choice_auto` was True, so the configured tool_parser was always None for "required".
2. Use separate `if` instead of `if/else` for reasoning + tool parsing: with MTP/speculative decoding, `</think>` and tool call tokens can arrive in the same chunk. The `if/else` structure prevented the tool parser from running on the same iteration where reasoning ended.
3. Dual parser: try tool_parser (XML) first, JSON fallback second: with MTP, models may non-deterministically produce JSON or XML tool calls. The tool_parser handles XML, while the JSON-only extract_tool_call_required_streaming() handles JSON. Both are tried.

Also adds unit tests for the non-streaming tool_parser fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: voipmonitor <festr@voipmonitor.org>
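A minimal sketch of the dual-parser idea (fix 3), assuming the XML format looks like Qwen3's `<tool_call>` tags; the function name and shape are illustrative, not vLLM's actual streaming API:

```python
import json
import re

def extract_required_tool_calls(buffer: str):
    """Illustrative dual-parser sketch: try the XML-style tool parser
    first, then the JSON path, since with MTP the same model may emit
    either format. Returns None while the buffer is still incomplete,
    standing in for "keep accumulating chunks"."""
    # 1. XML-style parser (what e.g. qwen3_coder would recognize).
    bodies = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", buffer, re.S)
    if bodies:
        return [json.loads(b) for b in bodies]
    # 2. JSON fallback, standing in for the JSON-only
    #    extract_tool_call_required_streaming() path.
    try:
        calls = json.loads(buffer)
        return calls if isinstance(calls, list) else None
    except json.JSONDecodeError:
        return None  # partial chunk: keep buffering
```

In the real streaming loop the buffer grows one delta at a time; the ordering (XML first, JSON second) is what lets one code path serve both output formats.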
With tool_choice="required", the tool_parser may emit partial text
as content before detecting the tool call pattern (e.g. qwen3_coder
looks for XML "<tool_call>" but sees JSON "[{\"name\":" first).
This caused content to leak alongside tool_calls in the SSE stream,
so clients received both delta.content and delta.tool_calls for the
same response.
Since tool_choice="required" mandates that the response is always
a tool call (never text content), suppress any content in
delta_message after tool parsing. The "auto" branch does not have
this issue because the tool_parser has full control over what goes
to content vs tool_calls from the start.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: voipmonitor <festr@voipmonitor.org>
Content suppression was too aggressive — when the model doesn't comply with tool_choice="required" and generates text instead of tool calls, the text answer is lost and the client gets an empty response. Clients should handle the content leak themselves by prioritizing tool_calls over content when both are present.

Signed-off-by: Martin Festr <festr@voipmonitor.org>
Signed-off-by: voipmonitor <festr@voipmonitor.org>
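The recommended client behavior can be sketched as a small merge step over the received SSE deltas. This is a hypothetical client-side helper, not part of vLLM:

```python
def finalize_stream(deltas: list) -> dict:
    """Client-side sketch: when a tool_choice="required" stream carries
    both delta.content and delta.tool_calls, treat tool_calls as
    authoritative; keep the text only when no tool call ever arrived
    (i.e. the model did not comply with "required")."""
    content = "".join(d.get("content") or "" for d in deltas)
    tool_calls = [tc for d in deltas for tc in (d.get("tool_calls") or [])]
    if tool_calls:
        # Any accumulated content is leaked parser buffer text: drop it.
        return {"tool_calls": tool_calls, "content": None}
    return {"tool_calls": None, "content": content}

mixed = [{"content": '[{"na'}, {"tool_calls": [{"name": "get_time"}]}]
text_only = [{"content": "I can"}, {"content": "not call tools."}]
```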
Force-pushed e57e4f9 to eed0ae7
PR vllm-project#35675 equivalent (MTP fc layer fix) — updated qwen3_5_mtp.py:
- Switched import from ColumnParallelLinear to ReplicatedLinear
- Changed FC construction from self.fc = ColumnParallelLinear(...) to self.fc = ReplicatedLinear(...)
- Removed TP-only args (gather_output, return_bias)
- Set quant_config=None for this layer
- Updated call site to unpack tuple: hidden_states, _ = self.fc(hidden_states)

PR vllm-project#35936 equivalent (tool_choice="required" fallback) — updated engine/serving.py:
- Replaced JSON parse suppress-block at elif request.tool_choice == "required":
- New flow: first try TypeAdapter(...).validate_json(content); on ValidationError or JSON decode error, fall back to the configured tool parser when available; convert parsed tool calls into FunctionCall(...) entries
- Removed now-unused contextlib import

Signed-off-by: ec-jt <james.trappett@elementalcompute.com>
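A toy illustration of the call-site change in the MTP fix: vLLM linear layers return an `(output, bias)` tuple, so switching `self.fc` to ReplicatedLinear means the result must be unpacked. `StubReplicatedLinear` below is a stand-in written for this sketch, not vLLM's class:

```python
class StubReplicatedLinear:
    """Toy stand-in (not vLLM's class) for ReplicatedLinear, only to
    illustrate the call-site change: the layer returns an
    (output, bias) tuple, so self.fc(...) must now be unpacked."""

    def __init__(self, input_size: int, output_size: int, quant_config=None):
        self.input_size = input_size
        self.output_size = output_size
        # quant_config=None mirrors the patch: this layer stays unquantized.
        self.quant_config = quant_config

    def __call__(self, x):
        # A real layer applies its weight matrix; shape logic is elided,
        # this stub just truncates to output_size to keep it runnable.
        return x[: self.output_size], None  # (output, optional bias)

fc = StubReplicatedLinear(8, 4, quant_config=None)
hidden_states = list(range(8))
hidden_states, _ = fc(hidden_states)  # unpack the tuple, as in the patch
```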
Summary

`tool_choice="required"` fails when the model produces non-JSON tool calls (e.g. Qwen3 XML format), even though a compatible `--tool-call-parser` (e.g. `qwen3_coder`) is configured. This affects both the non-streaming and streaming code paths.

Root cause
Both `tool_choice="required"` code paths bypass the configured `tool_parser` and use JSON-only parsing, while the `tool_choice="auto"` branches correctly use the tool parser.

Non-streaming path (`engine/serving.py`)

In `_parse_tool_calls_from_content()`, the code uses pydantic `TypeAdapter` JSON validation only. When the model produces XML-formatted tool calls (e.g. `<tool_call>{"name": ...}</tool_call>`), parsing fails with `ValidationError`. Note: #36841 recently changed this block to silently suppress `ValidationError` (fixing the crash), but this means non-JSON tool calls are now silently dropped instead of being routed to the tool parser.

Streaming path (`chat_completion/serving.py`)

Three issues in the streaming `required` branch:

1. `tool_parsers` not initialized — the `tool_parsers` array is only created when `tool_choice_auto` is true. When `tool_choice="required"`, it stays empty, so even if a parser is configured, it is never used.
2. `if/else` prevents tool parsing when reasoning ends — with MTP (multi-token prediction), the model can send the `</think>` reasoning-end token and the beginning of tool call content in the same chunk. The original `if/else` structure (reasoning processing vs. tool call processing) meant that the iteration that detected reasoning-end would skip tool call extraction entirely.
3. No tool_parser fallback for non-JSON formats — the `required` branch only has `extract_tool_call_required_streaming()`, which expects JSON. Models using XML-style tool calls (like Qwen3 with the `qwen3_coder` parser) fail silently.

Fix
Non-streaming (`engine/serving.py`)

Replace the `contextlib.suppress(ValidationError)` from #36841 with a `try/except` that preserves the crash-safety (`content = content or ""`) while adding a fallback: when JSON parsing fails with `ValidationError` or `JSONDecodeError`, fall back to the configured `tool_parser.extract_tool_calls()` (when `enable_auto_tools` is true and a parser is available). This way both the crash fix from #36841 and the non-JSON fallback coexist.

Streaming (`chat_completion/serving.py`)

1. Initialize `tool_parsers` for `required` too — create parser instances when `tool_choice` is `"auto"` OR `"required"`.
2. Separate `if` statements — change the `if reasoning / else tool_parsing` into two independent `if` blocks so both can execute in the same iteration when reasoning ends.
3. Dual parser approach — when a `tool_parser` is available, try it first (for XML/custom formats). If it has not detected tool calls yet, also try the JSON `extract_tool_call_required_streaming()` as a fallback. This handles the non-deterministic output from MTP, where the same model may produce JSON in one request and XML in another.

Note: content alongside tool_calls
When using the tool_parser fallback for `required`, the parser may emit partial text as `delta.content` before it detects the tool call pattern (e.g., `qwen3_coder` looks for XML `<tool_call>` tags but the model outputs JSON `[{"name":...`). This means some SSE chunks may contain both `content` and `tool_calls`. Clients consuming `tool_choice="required"` responses should prioritize `tool_calls` over `content` when both are present in the stream — the content in that case is leaked parser buffer text, not a real text response.

Discussion point: Dual parser approach
Fix #3 is somewhat opinionated. We use it because with MTP (speculative decoding), the same model (Qwen3) non-deterministically produces either JSON or XML tool calls across different requests. The dual approach handles both formats reliably. However, if the vLLM team prefers a different strategy (e.g., always routing through the tool_parser for `required`), we are happy to adjust.

Testing
- `tests/tool_use/test_tool_choice_required_fallback.py` (new unit tests)
- With `--tool-call-parser qwen3_coder --enable-auto-tool-choice`: `tool_choice="required"` requests succeed

Files changed
- `vllm/entrypoints/openai/engine/serving.py` — Non-streaming fallback (replaces the silent suppress from [Bugfix] Fix crash when tool_choice=required exceeds max_tokens #36841 with try/except + tool_parser fallback)
- `vllm/entrypoints/openai/chat_completion/serving.py` — Streaming fixes (3 changes)
- `tests/tool_use/test_tool_choice_required_fallback.py` — Unit tests (new)