[Bugfix][ToolParser] Fix Qwen3 XML and Coder streaming tool call parser regressions#40861

Open

ExtReMLapin wants to merge 25 commits into

vllm-project:mainfrom

ExtReMLapin:qwen3_combined_fixes

Contributor

ExtReMLapin commented Apr 25, 2026 •

edited

Loading

To be used with #40783

Purpose

Fix several streaming regressions in both the Qwen3CoderToolParser and
Qwen3XMLToolParser that caused dropped parameters, duplicated content,
or incorrect type conversion in tool call responses.

Qwen3Coder (streaming)

Fix split <tool_call> tag detection: when the tag was fragmented across
two deltas (e.g. <tool_ then call>), it was not detected and the tool
call was silently dropped.
Fix dropped parameters when the tool call header (<tool_call><function=name>)
arrived in delta 1 and the parameters + </function> arrived in delta 2.
Fix last content message not being flushed to the client after all tool calls
completed.
Fix structural delimiter disambiguation: </tool_call>, </function> and
</parameter> appearing as literal text inside a parameter value (e.g.
documentation, Python code) were incorrectly treated as closing delimiters,
truncating or corrupting parameter values.

Qwen3XML (streaming)

Fix delayed text emission between consecutive tool calls.
Fix anyOf schema type detection: nullable schemas
({"anyOf": [{"type": "string"}, {"type": "null"}]}) were classified as
"object" (triggering json.loads) instead of resolving to the first
non-null type ("string"), causing type conversion errors.
Fix double-close fallback when </parameter> appeared inside a parameter
value.

Both parsers

Fix speculative decoding: when two or more complete tool calls were delivered
in a single delta burst, only the first was emitted; subsequent ones were
silently dropped.

Refactor / tests

Extract _advance_to_next_tool() helper in Qwen3CoderToolParser to
deduplicate identical state-advance logic that was copy-pasted between the
normal delta path and the speculative-decoding recursion path.
Factor all regression tests shared between the XML and Coder parsers into
tests/tool_parsers/test_qwen3_xml_coder_shared.py, parametrized over both
parser classes.

Not a duplicate of any open PR: existing Qwen3 tool parser PRs address
non-streaming (batch) parsing only. This PR focuses exclusively on the
streaming path and speculative decoding edge cases.

Test Plan

python -m pytest \
  tests/tool_parsers/test_qwen3coder_tool_parser.py \
  tests/tool_parsers/test_qwen3xml_tool_parser.py \
  tests/tool_parsers/test_qwen3_xml_coder_shared.py \
  -v

Test Result

249 passed, 16 warnings in 108.68s
All 249 tests pass. No regressions detected in the existing test suite.

CNE Pierre FICHEPOIL and others added 15 commits

April 24, 2026 09:42


          fix split tag detection in tool parser : qwen3_coder (streaming mode)

7fc99ed

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          Update vllm/tool_parsers/qwen3coder_tool_parser.py

f1785b3

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>


          Fix delayed text emission between tool calls in Qwen3XML

68694d7

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          Update vllm/tool_parsers/qwen3coder_tool_parser.py

fcd8783

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>


          applied gemini suggestion

f4ee86c

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          gemini is right, don't get tokenizer using transformers

fe4d251

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          gemini is right, ensure last content message is flushed to client aft…

3ffd769

…er tool calls

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          fixed tests

721b10e

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          Fixed edge case streamed tool call started in delta1 (tool call start…

77d9e95

… + function name only) + delta2 (params + tool call end) was dropping params

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          fixed and re-enabled broken tests

a4ef7d0

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          Fix StreamingXMLToolCallParser: anyOf type detection and double-close…

74783cd

… fallback

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          Merge XML and Coder Qwen3 fix branches

f12be42

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          fix Qwen3 XML and Coder tool parser regressions on merged branches

c164555

Combined fixes for the XML and Coder tool parsers that surfaced once
the two PR branches were merged together.

Qwen3XML parser:
* Reorder _convert_param_value: check string type BEFORE the "null"
  shortcut so a string param with literal value "null" stays "null"
  instead of becoming JSON null. Fix logger.warning argument count.
* _convert_for_json_streaming: emit "null" (not "") when converted_value
  is None so nullable integer/object params serialize correctly.
* _get_param_type: anyOf returns the first non-null type instead of
  falling back to "string" for nullable integer/boolean schemas.
* _preprocess_xml_chunk: defer streaming for boolean params (avoids
  emitting "false" on the first 't' of "true") and for all container
  types regardless of single-quote hint.
* _end_element deferred path: try json.loads BEFORE ast.literal_eval so
  arrays/objects containing JSON true/false/null parse natively;
  double-decode strings to recover from buggy json.dumps(str(dict))
  templates.
* Add structural-aware helpers: _is_structural_tag_position,
  _get_valid_param_names, _is_structural_closing_tag (with partial-tag
  prefix safety), _chunk_has_structural_function_end,
  _chunk_has_structural_tool_call_end.
* _preprocess_xml_chunk: when SAX state is inside a parameter value,
  escape <tool_call>/<function=> always, and <parameter=NAME>/closing
  tags only when they are not structural delimiters.
* _process_complete_xml_elements: defer </parameter> when streaming
  with empty lookahead (more tokens may still arrive).
* parse_single_streaming_chunks: fallback close uses
  _chunk_has_structural_*_end instead of plain "in xml_chunk" so a
  literal </function> in a parameter value doesn't trigger a double
  close.
* extract_tool_calls_streaming: enable _streaming_mode=True on first
  delta.

Qwen3Coder parser:
* Reorder _convert_param_value the same way (string-first, then null).
* anyOf picks the first non-null type instead of treating it as
  "object".
* Container handling: try json.loads then double-decode via
  ast.literal_eval to recover from buggy json.dumps(str(dict)) outputs.
* Add structural-aware helpers: _next_structural_param_start,
  _find_true_function_end, _find_true_tool_call_end,
  _find_true_param_end (with require_lookahead for streaming).
* _parse_xml_function_call: top-level params are NOT filtered by schema
  (callers may rename fields) but nested boundaries inside a value ARE,
  so literal <parameter=...> lines in file content don't terminate the
  param early.
* _get_function_calls: structural-aware (</tool_call> must be followed
  by another <tool_call> or EOS; same for </function>).
* Streaming param_starts uses the helpers; </function> close check
  uses _find_true_function_end so a literal </function> in a value
  doesn't prematurely emit "}".
* tool_start_positions skips past each </tool_call> of completed calls
  so a literal <tool_call> inside a parameter value of a closed call
  doesn't spawn a phantom new tool call.
* Multi-tool-call delta (speculative decoding): when one tool call
  closes and another full <tool_call>...</tool_call> remains in
  current_text, advance manually and re-enter with a sentinel
  previous_text so reset_streaming_state isn't triggered (which would
  loop forever).

These fix the agentic-streaming bug where Qwen3.5 would freeze
mid-tool-call when a parameter value contained <tool_call>,
</parameter>, <parameter=NAME>, or </function> as literal text (e.g.
writing a Jinja2 template, a heredoc, or any file describing the
tool-call format), as well as several value-conversion bugs (string
"null" -> JSON null, anyOf nullable -> wrong type, double-encoded
objects -> string).

Add 16 regression tests in test_qwen3xml_tool_parser.py, 10 in
test_qwen3coder_tool_parser.py, and a new test_qwen36_bugs.py
covering bugs that span both parsers (XML array with JSON true/false,
Coder multi-tool-call in one streaming delta).

98 tests pass across the three test files.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          test: factor shared Qwen3 XML/Coder regression tests into one file

Both the XML and Coder tool parsers were tested against nearly
identical regression scenarios in their respective files (string
"null" preservation, anyOf nullable schemas, double-encoded objects,
content with literal XML structural tags, content with param-like
lines, etc.).  Split the shared expectations into a single file with
a parametrized parser fixture so that:

* the same intent is tested against BOTH parsers automatically;
* divergent behaviour is caught immediately instead of drifting;
* parser-specific quirks (XML SAX double-close brace, char-by-char
  boolean streaming, Coder speculative-decoding chunk loss, etc.)
  stay in their parser-specific test file.

New: tests/tool_parsers/test_qwen3_xml_coder_shared.py exposes a
``parser_cls`` fixture parametrized over Qwen3XMLToolParser and
Qwen3CoderToolParser.  Each shared test runs twice and prints
``[xml]``/``[coder]`` in the test id.

Removed duplicates from:
* tests/tool_parsers/test_qwen3xml_tool_parser.py: anyOf object
  param (streaming + non-streaming), string null preservation, anyOf
  integer/null type detection, content with structural tags
  (streaming + non-streaming), content with param-like lines
  (streaming + non-streaming), double-encoded object (streaming +
  non-streaming).
* tests/tool_parsers/test_qwen3coder_tool_parser.py: anyOf parameter
  not double encoded, string null preservation, anyOf string/null
  numeric value, content with XML structural tags (streaming +
  non-streaming), content with param-like lines (streaming +
  non-streaming), double-encoded object (streaming + non-streaming),
  content param with tool_call tag (streaming + non-streaming —
  redundant with content_with_xml_structural_tags).

Removed: tests/tool_parsers/test_qwen36_bugs.py.  Its two scenarios
(XML array containing JSON ``true``, Coder two complete tool calls
in a single streaming delta) are now in the shared file as
``test_array_with_json_bool`` and
``test_two_tool_calls_in_one_streaming_chunk``, both running against
both parsers.

Net effect: 209 -> 183 tests, 0 failures, identical coverage.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>


          test: maximize shared coverage between Qwen3 XML and Coder parsers

7036e5f

Move all generic regression tests (basic extraction, type conversion,
streaming variants, robustness) from the Coder-specific file into the
shared parametrized file so each test runs against both parsers.  Only
behaviour that genuinely differs between the two parsers stays
parser-specific:

- Coder-only: ``streaming_split_tag`` (relies on ``is_tool_call_started``)
  and ``streaming_various_chunk_sizes`` (XML SAX cannot tolerate
  single-character chunks).
- XML-only: ``streaming_missing_opening_tool_call_tag`` (Coder does not
  recover from a missing ``<tool_call>`` opener in streaming mode).

Two assertions were relaxed in the shared file to accept both legitimate
behaviours: content between parallel tool calls (``None`` vs ``"\\n"``)
and the streaming header arguments value (``""`` vs ``"{"``).

Test count rises from 99 to 138 (+39 from cross-parser parametrization)
while ``test_qwen3coder_tool_parser.py`` shrinks from 1260 to 162 lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

This was referenced Apr 25, 2026

[Bugfix] Qwen3 XML parser: interleaved text emission and streaming ID management #40787

Closed

[Bugfix] Robust Qwen3 Coder streaming: fragmented tags, speculative decoding fixes, and content tracking #40785

Closed

mergify Bot added qwen tool-calling bug labels

github-project-automation Bot added this to Tool Calling

ExtReMLapin marked this pull request as ready for review

April 25, 2026 05:02

ExtReMLapin requested review from aarnphm, bbrowning, chaunceyjiang and sfeng33 as code owners

April 25, 2026 05:02

claude Bot reviewed

View reviewed changes

claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist Bot reviewed

View reviewed changes

Contributor

gemini-code-assist Bot left a comment

Code Review

This pull request significantly improves the robustness of the Qwen3 XML and Coder tool parsers, particularly for streaming scenarios involving speculative decoding and complex parameter types. Key changes include structural-aware parsing to correctly handle XML tags appearing as literal text within parameter values, improved handling of nullable/anyOf schemas, and fixes for streaming bugs where partial tokens or multi-tool bursts could lead to data loss or incorrect type conversion. I have reviewed the implementation and identified a potential issue where content from recursive tool call processing might be lost; please apply the suggested fix to ensure all model output is correctly concatenated.

vllm/tool_parsers/qwen3coder_tool_parser.py Outdated

Contributor Author

ExtReMLapin commented Apr 25, 2026

Manual testing was done with some bleeding corner cases like trying to write special tokens inside a tool call (asking a tool to write python code containing inside strings special tokens into a file, whole thing streamed)

bfroemel commented Apr 25, 2026 •

edited

Loading

apologies if this is the wrong place to ask, but as you must be deeply familiar with these parsers: is there a technical reason why we (still) have two qwen3 tool parsers? does Qwen3Coder offer anything over Qwen3XML? I know that Qwen has Qwen3Coder in their Qwen3.5 documentation/releases, but there is also this: #25028 (comment) (+ follow up comment; comments from PR with tool parser contribution from the Qwen team)

chaunceyjiang assigned sfeng33

chaunceyjiang reviewed

View reviewed changes

Collaborator

chaunceyjiang left a comment

Thanks, I have a question. What issues does the current tool parser have?

I noticed you’ve made quite a lot of changes to these two tool parsers. This might take quite some time to review.

Contributor Author

ExtReMLapin commented Apr 29, 2026 •

edited

Loading

TL;DR

Every single added test is its own bug report. There is no added test that was passing on main. You can reproduce that locally:

git fetch origin pull/40861/head:pr-40861 && git checkout pr-40861

# 1) Run all qwen3 tests with this PR's code → everything passes
pytest tests/tool_parsers/test_qwen3coder_tool_parser.py \
       tests/tool_parsers/test_qwen3xml_tool_parser.py \
       tests/tool_parsers/test_qwen3_xml_coder_shared.py
# → 168 passed

# 2) Now restore ONLY the parsers from main (keep the new tests)
git checkout main -- vllm/tool_parsers/qwen3coder_tool_parser.py \
                     vllm/tool_parsers/qwen3xml_tool_parser.py
pytest tests/tool_parsers/test_qwen3coder_tool_parser.py \
       tests/tool_parsers/test_qwen3xml_tool_parser.py \
       tests/tool_parsers/test_qwen3_xml_coder_shared.py
# → 66 failed, 102 passed

# 3) Restore the parsers
git checkout pr-40861 -- vllm/tool_parsers/qwen3coder_tool_parser.py \
                         vllm/tool_parsers/qwen3xml_tool_parser.py

The 102 still-passing tests are tests that already existed on main. The 66 failing ones are exactly the new tests added in this PR — each one is a real symptom I hit in production with Qwen3.5 / Qwen3.6.

What kind of bugs (so you can review by category, not by line count)

The 66 failures fall into ~6 independent categories. Each category is self-contained, and you can read / merge them independently if you prefer to split the PR :

MTP / fragmented tokens bugs — at temp 1.5 the <tool_call> special token is sometimes split into multiple tokens (~1/75 calls), and entire before <tool_1> between <tool_2> after </tool_call> sequences arrive in a single delta (or just ... omited) . Multiple bugs : early-advance returning None and dropping the delta, recursion not advancing _sent_content_idx past the last </tool_call>, content fragments between two tools silently dropped because the outer merger guarded with not result.content.
Tests :

test_extract_tool_calls_streaming_speculative_decode_loss
test_two_tool_calls_in_one_streaming_chunk
test_streaming_two_tool_calls_plus_trailing_text_one_delta
test_streaming_trailing_text_with_final_close_in_same_delta,
test_streaming_content_before_and_between_two_tool_calls_one_delta.

Literal XML tags inside parameter values — a write_file tool whose content documents the tool-call format itself was making current_text.find("</tool_call>") land inside an earlier tool's content and silently drop every subsequent emission. New _structural_tool_call_end_positions helper only accepts </tool_call> if it's preceded by </function> after optional whitespace, or followed structurally by another opener / EOS. Same fix extended to </function>, </parameter>, <tool_call> opener detection.
Tests :

test_content_with_xml_structural_tags_*
test_content_with_param_like_lines_*,
test_content_with_real_param_name_literal_*,
test_content_with_full_nested_tool_call_*,
test_two_tools_second_with_out_of_schema_nested_literal_*,
test_streaming_*_literal_close_tag_in_value.

Qwen3.5 | string chat-template rendering — the official chat template renders nullable args via | string, so a previous turn's null value becomes the literal "None" in the prompt. Models trained on this template generate "None" verbatim. _convert_param_value now accepts "None" alongside "null" for nullable params, and the XML streaming path defers numeric (int/float) conversion the same way booleans were already deferred — otherwise the diff-based char emission produces "Non" then "l" against the new "null" output, yielding the cumulative invalid JSON "Nonl".
Tests :

test_python_none_value_for_nullable_int,
test_qwen3xml_streaming_python_none_int_char_by_char,
test_anyof_string_null_*,
test_anyof_integer_null_parses_as_int,
test_string_null_value_preserved.

anyOf / nullable schema handling — both parsers were double-encoding object params resolved via anyOf (previous PR added partial support but missed the streaming path and nested objects).
Tests :

test_anyof_object_param_not_double_encoded_*
test_double_encoded_object_param_*,
test_array_with_json_bool.

Free-text streaming around tool calls — content between two tool calls was being delayed and emitted after the last tool call, content after the last tool call was buffered indefinitely and lost on EOS.
Tests :

test_qwen3xml_streaming_text_after_tool_call,
test_qwen3xml_async_streaming_free_text,
TestQwen3xmlToolParser::test_surrounding_text[True],
test_streaming_trailing_text_*,
test_inline_empty_tool_call_preserves_content_before_real_call.

Streaming chunking robustness — split tags across delta boundaries (a < arriving in delta N and tool_call> in delta N+1), various chunk sizes (1 char, 2 char, … full token), bool true getting flipped to false because of partial-string→JSON-literal flip, last char of a string-typed null value being dropped.
Tests :

test_extract_tool_calls_streaming_split_tag,
test_streaming_char_by_char_literal_balises_in_value,
test_extract_tool_calls_streaming_various_chunk_sizes,
test_xml_streaming_boolean_true_not_false,
test_xml_streaming_string_null_last_char_not_dropped,
test_qwen36_xml_streaming_double_close_brace,
test_xml_streaming_parallel_tool_calls_preformed_chunks,
test_xml_streaming_missing_opening_tool_call_tag.

How I found them

I run a lot of agentic stuff with Qwen3.5 27B / Qwen3.6 in MTP + streaming, (temp 1.5 & advised temps) . The MTP-related ones were caught at runtime over days of intensive usage.
For the | string / template-rendering bugs, I literally asked Qwen3.5 to read its own chat template and predict where the parsers would break. It nailed several of them in one shot.
The "literal balise in parameter value" bugs surfaced when I asked the model to write a Python tool whose content was itself a tool-call snippet (write_file with code that documents the tool-call format).
Asking Qwen 3.5 to review it's own chat template ... was a messy ride to be frank.

AI assistance : To be explicit : Claude Opus wrote most of the test scaffolding and several of the fixes under my supervision ; I read every changed line, ran them against my real Qwen3 traffic, and I'm the one defending the change end-to-end.

Contributor Author

ExtReMLapin commented Apr 29, 2026

What really worries me as of today is the whole parsing ecosystem (not an issue restricted to vLLM)

IMO each message should be tokenized on it's own, isolated for the others, then reasoning then tool calls, instead of losing messages isolations by applying the chat template.

Not sure if there is a dedicated place to discuss this.

Collaborator

bbrowning commented Apr 29, 2026

What really worries me as of today is the whole parsing ecosystem (not an issue restricted to vLLM)

IMO each message should be tokenized on it's own, isolated for the others, then reasoning then tool calls, instead of losing messages isolations by applying the chat template.

Not sure if there is a dedicated place to discuss this.

I'm not following what you mean here. We aren't parsing for reasoning and tools on incoming messages. We only apply the chat template on incoming messages. We do not apply a chat template to the model's generated outputs. We do parse for reasoning and tool content in the model's generated outputs.

Contributor Author

ExtReMLapin commented Apr 29, 2026 •

edited

Loading

What really worries me as of today is the whole parsing ecosystem (not an issue restricted to vLLM)
IMO each message should be tokenized on it's own, isolated for the others, then reasoning then tool calls, instead of losing messages isolations by applying the chat template.
Not sure if there is a dedicated place to discuss this.

I'm not following what you mean here. We aren't parsing for reasoning and tools on incoming messages. We only apply the chat template on incoming messages. We do not apply a chat template to the model's generated outputs. We do parse for reasoning and tool content in the model's generated outputs.

You're right that vLLM doesn't apply the chat template to the model's output directly, my point is it can fail on the next conversation turn :

Model generates raw tokens
vLLM parses them (reasoning + tool calls) into structured fields and ships that to the client
Client appends the structured message to history and sends the whole conversation back
vLLM re-applies the chat template, re-tokenizing everything together

Step 4 is where a bad parse in step 2 breaks the whole thing.
If during generation the model emits a role token inside content/reasoning (ex : while introspecting its own chat template), that token either gets misparsed on the way out, or survives into the structured message and then breaks boundaries when the template is reapplied next turn.

My point is that parsing/templating at the conversation level allows for bad parsing propagation, we destroy the initially correctly "checkpointed" parsed messages. While parsing at message level isolates the parsing issues.

Again I'm not sure it's the right place to discuss this, I'm fine with it but I fear maintainers might not appreciate that.

Edit : please pardon my previous phrasing, I'm out of bandwidth with everything at the office, I'm working overtime to get things together and it's a mess

bfroemel commented Apr 29, 2026 •

edited

Loading

If during generation the model emits a role token inside content/reasoning (ex : while introspecting its own chat template), that token either gets misparsed on the way out, or survives into the structured message and then breaks boundaries when the template is reapplied next turn.

I also noticed that behavior and my "work-a-round" was to ask the model to not use its actual special tokens, but placeholders, e.g., [think] instead of <think>, to not confuse parsers. Another more stable solution could be to filter all model input (while the prompt is rendered) and properly escape any special tokens. before sending back model output to the client the potentially generated escaped special tokens could be automatically unescaped.

Anyway, I think you only really hit this issue, if you work on chat templates and model output parsers.

Contributor Author

ExtReMLapin commented Apr 29, 2026

I do agree, I also do agree it can be an edge case with this particuliar scenario.

If tomorrow one model family takes 90% of the market share because they're simply better than the rest of the model, you can't rely on the others less intelligent models to fix an issue with this model family.

Self compiling compiler, self fixing LLM.

ToastyTheBot commented Apr 29, 2026

I've applied both #40783 and #40861, backported to v0.20.0 with help from Claude, but my qwen3.6 AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP model still frequently fails at tool calling. Am I missing any other patches?

Contributor Author

ExtReMLapin commented Apr 29, 2026

I've applied both #40783 and #40861, backported to v0.20.0 with help from Claude, but my qwen3.6 AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP model still frequently fails at tool calling. Am I missing any other patches?

Do you feel like those PRs introduced less stability or just it didn't fix the issues you had ?

ToastyTheBot commented Apr 29, 2026

I've applied both #40783 and #40861, backported to v0.20.0 with help from Claude, but my qwen3.6 AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP model still frequently fails at tool calling. Am I missing any other patches?

Do you feel like those PRs introduced less stability or just it didn't fix the issues you had ?

I don't believe it has introduced more instabilities, but I don't think it has fixed the model stopping issue. Can you confirm the two PRs fix model stopping issues for you, and if so, what arguments and model are you using?

Contributor Author

ExtReMLapin commented Apr 29, 2026 •

edited

Loading

I feel like it reduced issues on 3.6 27B, qwen3_coder (not much diff with xml).

But it CLEARLY fixed issues with model chat template introspection, which is a very specific use case, I give it to you.

Tensor parallel 2, FP8 official model, with preserve_thinking to true. 200k context

This was referenced May 1, 2026

[Bugfix] Fix Qwen3Coder prev_tool_call_arr double-emission on parse failure #41466

Draft

[Bugfix] Detect MTP truncation at reasoning-to-tool-call boundary #41467

Draft

fix(spec decode): suppress EOS at draft positions in rejection sampler #41493

Draft


          Merge branch 'main' into qwen3_combined_fixes

Conflict resolutions:

- vllm/parser/abstract_parser.py: kept the reasoning_from_transition
  restoration block adjacent to the history_tool_call_cnt counter
  added by main; the two blocks are independent (no shared state).

- vllm/tool_parsers/qwen3coder_tool_parser.py: merged the new
  structural_tag_registry imports with the existing partial_tag_overlap
  import; preserved the speculative-decoding recursion and trailing
  free-text emission logic from HEAD and appended the get_structural_tag
  method introduced by main right after extract_tool_calls_streaming.

- tests/tool_parsers/test_qwen3coder_tool_parser.py: dropped the test
  bodies that were re-introduced by the merge but had already been
  factored into tests/tool_parsers/test_qwen3_xml_coder_shared.py
  during the qwen3_combined_fixes refactor.  Cleaned the matching
  unused imports.

Cross-parser coverage:

- Added get_structural_tag to Qwen3XMLToolParser using the same
  qwen_3_5 model registration as the Coder parser, so the XML parser
  also exposes a valid StructuralTag.
- Moved the three structural_tag tests added in 844df54 (and the
  _as_chat_completion_tools helper) into test_qwen3_xml_coder_shared.py
  so they run against both Qwen3XMLToolParser and Qwen3CoderToolParser
  via the parser_cls fixture.

Note: Qwen3CoderToolParser.supports_required_and_named is set to False
by main; the same flag was intentionally left at its True default on
Qwen3XMLToolParser pending a separate decision.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

Contributor Author

ExtReMLapin commented May 11, 2026

merged with claude code

@Seven-Streams in #40894 you added some tests.

I moved those tests in a shared qwen3 xml-coder file to ensure they cover both parsers.


          Merge branch 'main' into qwen3_combined_fixes

ada7c2c

bfroemel commented May 11, 2026

As a harness user I noticed with this PR and #40783 there are much less parsing issues, while using chat completion API, streaming with Qwen3.6-27B-FP8. I had runs with 100s of tool calls and they all seemed fine. Without the PRs there are:

sometimes tool call blocks in the model reasoning
model reasoning is cut-off/truncated
failing tool calls/json parsing issues
premature agent turn stops (probably because cut-off/truncated model output in the message history slowly stacks up until the model gets confused and generates output without content or tool calls)

Even with these PRs I sometimes see reasoning output that appears to be cut-off (e.g., the last sentence ends without '.\n'). I think this is a parser/streaming issue, because with non-streaming I haven't observed the model generating sentences without a proper end. (Also, while this isn't relevant in most cases, special tokens still can confuse the parsers, e.g., if I let the model review these PRs, e.g., reasoning can bleed into content.)

@chaunceyjiang @sfeng33
-> Imo for agentic harness users (that depend on long running turns with correct reasoning and tool calls parsing) the PRs have value and it would be great if maintainers/code owners/reviewers could offer some guidance how to move forward. Thanks!

/cc: @qmx @hickeyma @hmellor @stakeswky @ywang96 (apologies for spamming all recent code contributors of the qwen parsers)

ExtReMLapin requested a review from chaunceyjiang

May 11, 2026 18:44

dabhimanyu commented May 15, 2026

Because of this tool calling issue it has been a mess to use Qwen 3.6 35B A3B for Agentic tasks. It works fine as long as I'm asking it do to targetted fixes, provide full content from the files since it's not able to use file system and father than calling it as a tool call I call it directly using curl command.

I'm using it on a single RTX 5090 in NVFP4 quant. Serving wrapper: Not sure if this is the right place to ask this question here but what's the current status of this fix?

I discovered this solution on hugging face recently froggeric/Qwen-Fixed-Chat-Templates but even after applying this I was still facing tool call issues. I still need to check If I had done something incorrectly or if my agent messed it up (Was using Gemma 4 31B to fix this 💀) but it was late and I was feeling sleepy so still have to check the git logs more throughtlly and figure out if my agent messed up something or is this a vllm and/or Qwen 3.6 family of models issue. Because this doesn't happen with Gemma models.

FYI I'm getting this error in Opencode claude code as well as Droid all three are giving same error. It works if you simply invoke the local model via direct Curl command using the VLLm API directly (not sure what that curl thingii is called, just a random multiphase turbulence researcher here 💀). With concurrcy 2 it rips through at over 200 TPS output tokens at a context of about 90k.

Just want to know how to get this damm tool call to work in Qwen 3.6 35B-A3B and Qwen 27B, and if possible Qwen 3V models too. Any help to put me in the right direction would be deeply appreciated. Forgive me in case I barged in wrong place.

My VLM serving wrapper.

This still needs more work because I can squeeze out more context window from Gemma and Qwen 3.6 family since attention mechanism in these models is very different from Qwen 3V. And the below script treats attention mechanism of all models like Qwen 3V to be on the safe side since I didn't undstood that previously. So I've got a beta wrapper in progress which incorporates that but that's still in testing phase. Currently the below wrapper is the workhorse for the everyday text and vision related grunt work to save tokens from my Codex plus plan💀.

# ~/.config/zsh/local_llm_yolo.zsh
# Local LLM YOLO — Shell profile for vLLM model management
# Source: source ~/vllm_serving_scripts/v2/lib/local_llm_yolo.zsh
# RTX 5090 32GB | 8 NVFP4 models | Concurrency 1/2/3 only | 1 model at a time

# ═══════════════════════════════════════════════════════════════════════
# MODEL REGISTRY
# ═══════════════════════════════════════════════════════════════════════

declare -A VLLM_MODELS
VLLM_MODELS=(
  M1 "Firworks-Qwen3-VL-32B-Thinking"
  M2 "LilaRest-gemma-4-31B-it-turbo"
  M3 "nvidia-Gemma-4-26B-A4B"
  M4 "OptimizeLLM-Qwen3-VL-30B-A3B"
  M5 "RedHatAI-gemma-4-31B-it"
  M6 "RedHatAI-Qwen3.6-35B-A3B"
  M7 "sakamakismile-Qwen3.6-27B-Text-MTP"
  M8 "unsloth-Qwen3.6-27B"
)

declare -A VLLM_HF_IDS
VLLM_HF_IDS=(
  M1 "Firworks/Qwen3-VL-32B-Thinking-NVFP4"
  M2 "LilaRest/gemma-4-31B-it-NVFP4-turbo"
  M3 "nvidia/Gemma-4-26B-A4B-NVFP4"
  M4 "OptimizeLLM/Qwen3-VL-30B-A3B-Thinking-NVFP4"
  M5 "RedHatAI/gemma-4-31B-it-NVFP4"
  M6 "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  M7 "sakamakismile/Qwen3.6-27B-Text-MTP"
  M8 "unsloth/Qwen3.6-27B-NVFP4"
)

declare -A VLLM_QUANT
VLLM_QUANT=(
  M1 "compressed-tensors"
  M2 "modelopt"
  M3 "modelopt"
  M4 "compressed-tensors"
  M5 "compressed-tensors"
  M6 "compressed-tensors"
  M7 "modelopt"
  M8 "compressed-tensors"
)

declare -A VLLM_PARSER_REASON
VLLM_PARSER_REASON=(
  M1 "qwen3"
  M2 "gemma4"
  M3 "gemma4"
  M4 "qwen3"
  M5 "gemma4"
  M6 "qwen3"
  M7 "qwen3"
  M8 "qwen3"
)

declare -A VLLM_PARSER_TOOL
VLLM_PARSER_TOOL=(
  M1 "hermes"
  M2 "gemma4"
  M3 "gemma4"
  M4 "hermes"
  M5 "gemma4"
  M6 "hermes"
  M7 "hermes"
  M8 "hermes"
)

declare -A VLLM_VISION
VLLM_VISION=(
  M1 "YES"
  M2 "NO"
  M3 "YES"
  M4 "YES"
  M5 "YES"
  M6 "YES"
  M7 "NO"
  M8 "YES"
)

declare -A VLLM_MOE
VLLM_MOE=(
  M1 "NO"
  M2 "NO"
  M3 "YES"
  M4 "YES"
  M5 "NO"
  M6 "YES"
  M7 "NO"
  M8 "NO"
)

declare -A VLLM_VIDEO
VLLM_VIDEO=(
  M1 "NO" M2 "NO" M3 "NO" M4 "NO"
  M5 "NO" M6 "NO" M7 "NO" M8 "NO"
)

declare -A VLLM_AUDIO
VLLM_AUDIO=(
  M1 "NO" M2 "NO" M3 "NO" M4 "NO"
  M5 "NO" M6 "NO" M7 "NO" M8 "NO"
)

declare -A VLLM_DEFAULT_CONCURRENCY
VLLM_DEFAULT_CONCURRENCY=(
  M1 "1" M2 "2" M3 "2" M4 "1"
  M5 "1" M6 "1" M7 "2" M8 "1"
)

declare -A VLLM_CONCURRENCY_RANGE
VLLM_CONCURRENCY_RANGE=(
  M1 "1-2"  M2 "1-3"  M3 "2-3" M4 "1"
  M5 "1-2"  M6 "1-2"  M7 "2-3" M8 "1-2"
)

declare -A VLLM_LOCAL_PATHS
VLLM_LOCAL_PATHS=(
  M1 "/home/abhimanyu/local_llm_models/Firworks-Qwen3-VL-32B-Thinking-nvfp4"
  M2 "/home/abhimanyu/local_llm_models/LilaRest-gemma-4-31B-it-NVFP4-turbo"
  M3 "/home/abhimanyu/local_llm_models/nvidia-Gemma-4-26B-A4B-NVFP4"
  M4 "/home/abhimanyu/local_llm_models/OptimizeLLM-Qwen3-VL-30B-A3B-Thinking-NVFP4"
  M5 "/home/abhimanyu/local_llm_models/RedHatAI-gemma-4-31B-it-NVFP4"
  M6 "/home/abhimanyu/local_llm_models/RedHatAI-Qwen3.6-35B-A3B-NVFP4"
  M7 "/home/abhimanyu/local_llm_models/sakamakismile-Qwen3.6-27B-Text-NVFP4-MTP"
  M8 "/home/abhimanyu/local_llm_models/unsloth-Qwen3.6-27B-NVFP4"
)

# ═══════════════════════════════════════════════════════════════════════
# ENDPOINT
# ═══════════════════════════════════════════════════════════════════════

export LOCAL_LLM_BASE_URL="http://127.0.0.1:8000/v1"
export LOCAL_LLM_API_KEY="dummy"

# ═══════════════════════════════════════════════════════════════════════
# CONCURRENCY GUARD
# ═══════════════════════════════════════════════════════════════════════

_vllm_check_concurrency() {
  local c=$1
  if [[ "$c" != "1" && "$c" != "2" && "$c" != "3" ]]; then
    echo "ERROR: Concurrency must be 1, 2, or 3. Got: $c" >&2
    echo "ATTEMPTED_OP: vllm serve with --max-num-seqs $c" >&2
    echo "REASON: Only concurrency 1/2/3 allowed on RTX 5090 32GB" >&2
    echo "ACTION_REQUIRED: Use concurrency 1 (Deep Thinker), 2 (Balanced), or 3 (Max throughput)" >&2
    return 1
  fi
  return 0
}

# ═══════════════════════════════════════════════════════════════════════
# CORE FUNCTIONS
# ═══════════════════════════════════════════════════════════════════════

vllm_kill() {
  echo "Killing existing vLLM processes..."
  pkill -f "vllm serve" 2>/dev/null
  sleep 2
  if pgrep -f "vllm serve" >/dev/null 2>&1; then
    echo "WARNING: vLLM still running, force killing..."
    pkill -9 -f "vllm serve" 2>/dev/null
    sleep 1
  fi
  # Kill orphaned vLLM GPU processes (parent APIServer dead)
  # EngineCore children use multiprocessing.spawn — their /proc/cmdline may not
  # contain "vllm", so we also check the nvidia-smi process name (VLLM::EngineCore)
  # and the vllm venv python path as fallback signals.
  local orphan_pids=""
  local gpu_pids
  local gpu_info
  gpu_info=$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader 2>/dev/null)
  gpu_pids=$(echo "$gpu_info" | awk -F', ' '{print $1}' | tr -d ' ')
  for pid in ${(f)gpu_pids}; do
    [[ -z "$pid" ]] && continue
    local is_vllm=false
    # Check 1: nvidia-smi process name contains VLLM (case-insensitive)
    if echo "$gpu_info" | grep -qi "VLLM" 2>/dev/null; then
      local pname=$(echo "$gpu_info" | grep "^ *${pid}," | awk -F', ' '{print $2}')
      if [[ "$pname" == *"VLLM"* ]]; then
        is_vllm=true
      fi
    fi
    # Check 2: /proc/cmdline contains "vllm"
    if ! $is_vllm && [[ -r /proc/$pid/cmdline ]] && grep -ql "vllm" /proc/$pid/cmdline 2>/dev/null; then
      is_vllm=true
    fi
    # Check 3: process running from .vllm_venv python (vLLM's venv)
    if ! $is_vllm && [[ -r /proc/$pid/cmdline ]] && grep -ql ".vllm_venv" /proc/$pid/cmdline 2>/dev/null; then
      is_vllm=true
    fi
    if $is_vllm; then
      orphan_pids="$orphan_pids $pid"
    fi
  done
  if [[ -n "$orphan_pids" ]]; then
    echo "Cleaning up orphaned vLLM GPU processes:$orphan_pids"
    echo "$orphan_pids" | tr ' ' '\n' | grep -v '^$' | xargs -r kill -9 2>/dev/null
    sleep 1
  fi
  echo "vLLM stopped."
}

vllm_status() {
  if pgrep -f "vllm serve" >/dev/null 2>&1; then
    echo "vLLM is RUNNING"
    echo "PID: $(pgrep -f 'vllm serve' | head -1)"
    echo "Model: ${VLLM_MODELS[$VLLM_CURRENT_MODEL]:-unknown} ($VLLM_CURRENT_MODEL)"
    echo "Modalities: ${VLLM_CURRENT_MODALITIES:-?}"
    echo "Concurrency: ${VLLM_CURRENT_CONCURRENCY:-?}"
    echo "Max Context: ${VLLM_CURRENT_MAX_LEN:-?} tokens"
    local health=$(curl -s http://127.0.0.1:8000/health 2>/dev/null)
    if [[ "$health" == *"ok"* ]] || [[ "$health" == *"200"* ]]; then
      echo "Health: OK"
    else
      echo "Health: WAITING (model loading...)"
    fi
    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader 2>/dev/null
  else
    echo "vLLM is STOPPED"
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader 2>/dev/null
  fi
}

vllm_wait_health() {
  local timeout=${1:-300}
  local elapsed=0
  echo "Waiting for vLLM health (timeout: ${timeout}s)..."
  while [[ $elapsed -lt $timeout ]]; do
    local health=$(curl -s http://127.0.0.1:8000/health 2>/dev/null)
    if [[ "$health" == *"ok"* ]] || [[ "$health" == *"200"* ]]; then
      echo "vLLM is healthy after ${elapsed}s"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
    echo -ne "  Waiting... ${elapsed}s\r"
  done
  echo ""
  echo "ERROR: vLLM did not become healthy within ${timeout}s" >&2
  return 1
}

# ═══════════════════════════════════════════════════════════════════════
# vllm_swap — Main entry point
# Usage: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]
# ═══════════════════════════════════════════════════════════════════════

vllm_swap() {
  local model_id=""
  local concurrency=""
  local modality=""

  for arg in "$@"; do
    case "$arg" in
      M[1-8])
        model_id="$arg"
        ;;
      1|2|3)
        concurrency="$arg"
        ;;
      --text|--vision|--video|--audio)
        modality="${arg#--}"
        ;;
      *)
        echo "ERROR: Unknown argument: $arg" >&2
        echo "USAGE: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]" >&2
        return 1
        ;;
    esac
  done

  if [[ -z "$model_id" ]] || [[ -z "${VLLM_HF_IDS[$model_id]}" ]]; then
    echo "ERROR: Invalid or missing model ID. Use M1-M8." >&2
    echo "Available: M1 M2 M3 M4 M5 M6 M7 M8" >&2
    echo "USAGE: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]" >&2
    return 1
  fi

  if [[ -z "$concurrency" ]]; then
    concurrency="${VLLM_DEFAULT_CONCURRENCY[$model_id]}"
  fi

  _vllm_check_concurrency "$concurrency" || return 1

  local range="${VLLM_CONCURRENCY_RANGE[$model_id]}"
  local range_lo="${range%-*}"
  local range_hi="${range#*-}"
  if [[ "$concurrency" -lt "$range_lo" || "$concurrency" -gt "$range_hi" ]]; then
    echo "WARNING: Concurrency $concurrency outside recommended range [$range] for $model_id"
  fi

  if [[ -z "$modality" ]]; then
    modality="text"
  fi

  local use_vision="no"
  local use_video="no"
  local use_audio="no"
  local modalities_str="text"

  case "$modality" in
    vision)
      if [[ "${VLLM_VISION[$model_id]}" != "YES" ]]; then
        echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support vision."
        echo "  Falling back to text-only."
        local vl_models=()
        for m in M1 M2 M3 M4 M5 M6 M7 M8; do
          [[ "${VLLM_VISION[$m]}" == "YES" ]] && vl_models+=("$m (${VLLM_MODELS[$m]})")
        done
        echo "  Vision-capable models: ${vl_models[*]}"
        modality="text"
      else
        use_vision="yes"
        modalities_str="text+vision"
      fi
      ;;
    video)
      if [[ "${VLLM_VIDEO[$model_id]}" != "YES" ]]; then
        echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support video."
        echo "  Falling back to text-only."
        local vid_models=()
        for m in M1 M2 M3 M4 M5 M6 M7 M8; do
          [[ "${VLLM_VIDEO[$m]}" == "YES" ]] && vid_models+=("$m (${VLLM_MODELS[$m]})")
        done
        echo "  Video-capable models: ${vid_models[*]}"
        modality="text"
      else
        use_video="yes"
        modalities_str="text+video"
      fi
      ;;
    audio)
      if [[ "${VLLM_AUDIO[$model_id]}" != "YES" ]]; then
        echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support audio."
        echo "  Falling back to text-only."
        local aud_models=()
        for m in M1 M2 M3 M4 M5 M6 M7 M8; do
          [[ "${VLLM_AUDIO[$m]}" == "YES" ]] && aud_models+=("$m (${VLLM_MODELS[$m]})")
        done
        echo "  Audio-capable models: ${aud_models[*]}"
        modality="text"
      else
        use_audio="yes"
        modalities_str="text+audio"
      fi
      ;;
    text)
      ;;
  esac

  echo "Swapping to ${VLLM_MODELS[$model_id]} | C=$concurrency | $modalities_str"

  # Vision concurrency guard: C=3 not supported for vision
  if [[ "$use_vision" == "yes" || "$use_video" == "yes" ]]; then
    if [[ "$concurrency" -eq 3 ]]; then
      echo "WARNING: Vision mode does not support C=3 on RTX 5090. Overriding to C=2."
      concurrency=2
    fi
  fi

  vllm_kill

  # Wait for GPU memory to actually free after kill
  local gpu_wait=0
  while [[ $gpu_wait -lt 30 ]]; do
    local used
    used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -d ' ')
    if [[ -n "$used" && "$used" -lt 500 ]]; then
      break
    fi
    sleep 2
    gpu_wait=$((gpu_wait + 2))
  done
  if [[ $gpu_wait -ge 30 ]]; then
    echo "WARNING: GPU memory may not be fully freed (${used}MB still used)" >&2
  fi

  if ! source ~/.vllm_venv/bin/activate 2>/dev/null; then
    echo "ERROR: Failed to activate venv at ~/.vllm_venv/" >&2
    echo "ATTEMPTED_OP: source venv for vllm serve" >&2
    echo "ACTION_REQUIRED: Run setup_python_venv.sh or verify ~/.vllm_venv/bin/activate exists" >&2
    return 1
  fi

  # Validate local model path
  local model_path_check="${VLLM_LOCAL_PATHS[$model_id]}"
  if [[ -n "$model_path_check" && -d "$model_path_check" ]]; then
    local sf_files=("${model_path_check}"/*.safetensors(N))
    if [[ ${#sf_files} -eq 0 ]]; then
      echo "WARNING: No .safetensors files found in $model_path_check" >&2
      echo "  Model files may be incomplete. Proceeding with local path anyway." >&2
    fi
  elif [[ -n "$model_path_check" ]]; then
    echo "WARNING: Local path does not exist: $model_path_check" >&2
    echo "  Will fall back to HuggingFace download." >&2
  fi

  local hf_id="${VLLM_HF_IDS[$model_id]}"
  local quant="${VLLM_QUANT[$model_id]}"
  local r_parser="${VLLM_PARSER_REASON[$model_id]}"
  local t_parser="${VLLM_PARSER_TOOL[$model_id]}"
  local is_moe="${VLLM_MOE[$model_id]}"

  local model_path="${VLLM_LOCAL_PATHS[$model_id]}"
  local cmd
  if [[ -d "$model_path" ]]; then
    cmd="vllm serve $model_path"
  else
    echo "WARNING: Local path not found: $model_path" >&2
    echo "  Falling back to HuggingFace ID: $hf_id" >&2
    cmd="vllm serve $hf_id"
  fi
  cmd+=" --host 127.0.0.1"
  cmd+=" --port 8000"
  cmd+=" --tensor-parallel-size 1"
  cmd+=" --max-num-seqs $concurrency"
  cmd+=" --gpu-memory-utilization 0.94"
  cmd+=" --quantization $quant"
  cmd+=" --kv-cache-dtype fp8"
  cmd+=" --trust-remote-code"
  cmd+=" --no-calculate-kv-scales"
  cmd+=" --reasoning-parser $r_parser"
  cmd+=" --served-model-name local-llm"
  cmd+=" --enable-auto-tool-choice"
  cmd+=" --tool-call-parser $t_parser"

  local max_len
  if [[ "$use_vision" == "yes" || "$use_video" == "yes" ]]; then
    case "$concurrency" in
      1) max_len="90000" ;;
      2) max_len="65536" ;;
    esac
    case "$concurrency" in
      1) cmd+=" --limit-mm-per-prompt '{\"image\": 3, \"video\": 0}'" ;;
      2) cmd+=" --limit-mm-per-prompt '{\"image\": 1, \"video\": 0}'" ;;
    esac
  else
    case "$concurrency" in
      1) max_len="131072" ;;
      2) max_len="81920" ;;
      3) max_len="40960" ;;
    esac
    cmd+=" --limit-mm-per-prompt '{\"image\": 0}'"
  fi
  cmd+=" --max-model-len $max_len"

  if [[ $max_len -ge 131072 ]]; then
    cmd+=" --async-scheduling"
    cmd+=" --no-enable-prefix-caching"
    cmd+=" --max-num-batched-tokens 640"
  else
    cmd+=" --enable-prefix-caching"
    cmd+=" --max-num-batched-tokens 8192"
  fi

  if [[ "$is_moe" == "YES" ]]; then
    cmd+=" --enable-expert-parallel"
  fi

  if [[ "$model_id" == "M7" ]]; then
    cmd+=" --speculative-config '{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":3}'"
    cmd+=" --no-scheduler-reserve-full-isl"
  fi

  echo ""
  echo "+-- vLLM Serve Command -------------------------------------------------+"
  echo "$cmd" | sed 's/ --/ \\\n  --/g'
  echo "+-----------------------------------------------------------------------+"
  echo ""
  eval "nohup $cmd > /tmp/vllm_startup.log 2>&1 &"
  local pid=$!
  disown
  echo "vLLM PID: $pid"
  echo "Startup log: /tmp/vllm_startup.log"

  local health_timeout="${VLLM_HEALTH_TIMEOUT:-300}"
  if vllm_wait_health "$health_timeout"; then
    echo ""
    echo "========================================"
    echo " vLLM READY"
    echo "========================================"
  else
    echo ""
    echo "WARNING: vLLM not healthy after ${health_timeout}s" >&2
    echo "  The model may still be loading (FlashInfer kernel compilation)." >&2
    echo "  PID: $pid — check progress: tail -f /tmp/vllm_startup.log" >&2
    echo "  Run 'vllm_wait_health' or 'vllm_status' to check again." >&2
  fi

  export VLLM_CURRENT_MODEL="$model_id"
  export VLLM_CURRENT_MODALITIES="$modalities_str"
  export VLLM_CURRENT_CONCURRENCY="$concurrency"
  export VLLM_CURRENT_HF_ID="$hf_id"
  export VLLM_CURRENT_MAX_LEN="$max_len"
  export LOCAL_LLM_CURRENT_MODEL="local-llm"

  echo ""
  echo "========================================"
  echo " ACTIVE MODALITIES: ✓ $modalities_str"
  echo " CONTEXT: $max_len tokens | CONCURRENCY: $concurrency"
  echo "========================================"
  echo "Model:    ${VLLM_MODELS[$model_id]} ($model_id)"
  echo "API:      $LOCAL_LLM_BASE_URL"
  echo "Aliases:  local-llm, $hf_id"
  echo "========================================"
}

# ═══════════════════════════════════════════════════════════════════════
# CONVENIENCE FUNCTIONS — Per-model shortcuts
# Usage: vllm_M6 [concurrency] [--text|--vision|--video|--audio]
# ═══════════════════════════════════════════════════════════════════════

vllm_M1() { vllm_swap M1 "$@" }
vllm_M2() { vllm_swap M2 "$@" }
vllm_M3() { vllm_swap M3 "$@" }
vllm_M4() { vllm_swap M4 "$@" }
vllm_M5() { vllm_swap M5 "$@" }
vllm_M6() { vllm_swap M6 "$@" }
vllm_M7() { vllm_swap M7 "$@" }
vllm_M8() { vllm_swap M8 "$@" }

# ═══════════════════════════════════════════════════════════════════════
# MODE SHORTCUTS — Preset serving configs
# ═══════════════════════════════════════════════════════════════════════

alias vllm_agentic='vllm_swap M6 2'
alias vllm_agentic_fast='vllm_swap M3 3'

alias vllm_writer='vllm_swap M5 1'
alias vllm_writer_moe='vllm_swap M6 1'

alias vllm_deep='vllm_swap M6 1'
alias vllm_deep_gemma='vllm_swap M5 1'

alias vllm_vision='vllm_swap M4 1 --vision'
alias vllm_vision_gemma='vllm_swap M5 1 --vision'

alias vllm_mtp='vllm_swap M7'

# ═══════════════════════════════════════════════════════════════════════
# QUICK API TEST
# ═══════════════════════════════════════════════════════════════════════

vllm_test() {
  local prompt="${1:-Hello, respond with 'OK' and nothing else.}"
  echo "Testing: $LOCAL_LLM_BASE_URL"
  curl -s "$LOCAL_LLM_BASE_URL/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $LOCAL_LLM_API_KEY" \
    -d "$(jq -n --arg p "$prompt" '{
      model: "local-llm",
      messages: [{role: "user", content: $p}],
      max_tokens: 100,
      temperature: 0.1
    }')" | jq -r '.choices[0].message.content // .error // "FAILED"'
}

vllm_models() {
  curl -s "$LOCAL_LLM_BASE_URL/models" | jq -r '.data[].id // "No models found"'
}

# ═══════════════════════════════════════════════════════════════════════
# HELP — only shown when explicitly called, NOT on source
# ═══════════════════════════════════════════════════════════════════════

vllm_help() {
  cat <<'HELPEOF'
+======================================================================+
|            local_llm_yolo — Command Reference                        |
+======================================================================+
|                                                                      |
|  QUICK-START COMMAND MATRIX (Text-Only)                              |
|  Bold = default. Dashes = outside recommended range.                 |
|  Context sizes: C=1 → 131K, C=2 → 82K, C=3 → 41K tokens            |
|                                                                      |
|  Model                             C=1          C=2          C=3     |
|  ─────────────────────────────────────────────────────────────────── |
|  M1 Firworks-Qwen3-VL-32B         vllm_M1      vllm_M1 2      —     |
|  M2 LilaRest-gemma-4-31B-turbo    vllm_M2 1    vllm_M2      vllm_M2 3|
|  M3 nvidia-Gemma-4-26B-A4B           —         vllm_M3      vllm_M3 3|
|  M4 OptimizeLLM-Qwen3-VL-30B-A3B  vllm_M4         —            —     |
|  M5 RedHatAI-gemma-4-31B-it       vllm_M5      vllm_M5 2      —     |
|  M6 RedHatAI-Qwen3.6-35B-A3B      vllm_M6      vllm_M6 2      —     |
|  M7 sakamakismile-Qwen3.6-27B-MTP    —         vllm_M7      vllm_M7 3|
|  M8 unsloth-Qwen3.6-27B           vllm_M8      vllm_M8 2      —     |
|                                                                      |
|  VISION MODE: Append --vision to any command.                        |
|  Context sizes: C=1 → 90K, C=2 → 66K (C=3 → override to C=2)         |
|  Image limits: C=1 → 3 images, C=2 → 1 image (C=3 unsupported)       |
|  --limit-mm-per-prompt format: '{"image": N, "video": 0}'            |
|  Vision-capable: M1, M3, M4, M5, M6, M8                              |
|  Text-only (no vision): M2, M7                                       |
|                                                                      |
|  EXAMPLES (both forms are equivalent):                               |
|    vllm_swap M6 2   ==   vllm_M6 2        (C=2 text)                |
|    vllm_swap M1 --vision == vllm_M1 --vision (C=1 vision)           |
|    vllm_swap M7 3   ==   vllm_M7 3        (C=3 text, MTP)           |
|                                                                      |
+======================================================================+
| Model Registry:                                                      |
|   M1  Firworks-Qwen3-VL-32B-Thinking      21.9GB  VL MoE-          |
|   M2  LilaRest-gemma-4-31B-it-turbo       15.3GB     MoE-          |
|   M3  nvidia-Gemma-4-26B-A4B              18.8GB  VL MoE+          |
|   M4  OptimizeLLM-Qwen3-VL-30B-A3B        19.2GB  VL MoE+          |
|   M5  RedHatAI-gemma-4-31B-it             23.3GB  VL MoE-          |
|   M6  RedHatAI-Qwen3.6-35B-A3B            25.1GB  VL MoE+          |
|   M7  sakamakismile-Qwen3.6-27B-Text-MTP  19.7GB     MoE-  (MTP)   |
|   M8  unsloth-Qwen3.6-27B                 ~19GB   VL MoE-          |
|                                                                      |
| Core:                                                                |
|   vllm_swap <M1..M8> [1|2|3] [--text|--vision]                     |
|       Swap model. Args in any order.                                 |
|       Default concurrency per model. Tools always enabled.           |
|       --text    Text-only (larger context, no vision encoder VRAM)   |
|       --vision  Enable vision (smaller context for VRAM overhead)    |
|   vllm_status                              Check status              |
|   vllm_kill                                Kill vLLM                 |
|   vllm_test [prompt]                       Test API                  |
|   vllm_wait_health [timeout]               Wait healthy              |
|   vllm_models                              List models               |
|   vllm_help                                This help                 |
|                                                                      |
| Quick Swap:  vllm_M{1..8} [conc] [--text|--vision]                 |
|                                                                      |
| Modes:                                                               |
|   vllm_agentic       -> M6 C=2          (agentic work)              |
|   vllm_agentic_fast  -> M3 C=3          (max throughput)            |
|   vllm_writer        -> M5 C=1          (JFM writing)               |
|   vllm_writer_moe    -> M6 C=1          (MoE reasoning)             |
|   vllm_deep          -> M6 C=1          (max context)               |
|   vllm_deep_gemma    -> M5 C=1          (Gemma deep)                |
|   vllm_vision        -> M4 C=1 --vision (primary vision)            |
|   vllm_vision_gemma  -> M5 C=1 --vision (Gemma vision)             |
|   vllm_mtp           -> M7              (MTP speculative)           |
|                                                                      |
| Claude Code:                                                         |
|   local-yolo  -> Local vLLM, no permission prompts                  |
|   cc-local    -> Local vLLM, with permission prompts                |
+======================================================================+
HELPEOF
}

alexbi29 mentioned this pull request

[Bugfix] Qwen3Coder streaming: emit args when whole tool body lands in one delta #43074

Open

Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ExtReMLapin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label


          Merge branch 'main' into qwen3_combined_fixes

Conflict resolutions (qwen3coder_tool_parser.py + its test file):

- _convert_param_value: kept this branch's detailed type-coercion logic
  (nullable string/None handling, container double-decode for buggy
  templates) instead of main's refactor to utils.coerce_to_schema_type /
  extract_types_from_schema (vllm-project#38973). Restored `import ast` that vllm-project#38973
  had removed. Kept main's vllm-project#42292 change
  (supports_required_and_named = not VLLM_ENFORCE_STRICT_TOOL_CALLING).

- Tests: kept this branch's rewritten Coder test file. Re-integrated the
  two anyOf tests vllm-project#38973 added: the comprehensive non-streaming + streaming
  cases stay Coder-specific (they assert {"type": ["integer","null"]} -> int
  and whitespace-stripped values, which only hold for the Coder parser);
  the genuinely cross-parser anyOf[array,null] -> list case was added to the
  shared test file, parametrized over both XML and Coder parsers.

All 190 qwen3 tool-parser tests pass; ruff check/format unchanged vs the
pre-merge branch head.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mergify Bot removed the needs-rebase label


          test: move anyOf type-conversion tests into shared qwen3 suite

e831309

Relocate the anyOf / nullable type-resolution tests (originally added by
vllm-project#38973 to the Coder-only file) into the shared XML/Coder suite,
parametrized over both parsers, so the coverage applies to both.

To make the JSON-Schema list-form type {"type": ["integer", "null"]}
resolve consistently across parsers, teach the XML parser's
_get_param_type to pick the first non-null entry of a list-form type
(it already did this for anyOf). Both parsers now coerce it to int.

Ruff: replace try/except/pass with contextlib.suppress in both parsers
and run ruff format on the touched qwen3 files.

Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Contributor

mergify Bot commented May 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ExtReMLapin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label

Contributor Author

ExtReMLapin commented Jun 4, 2026

What do you think would be better if I close this PR to make multiple one ?

One pr for the tests (because it's also refactoring the tests to move them into a new file which runs tests in both qwen3_coder and qwen3_xml) ?

And one PR for each fix ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

claude[bot] claude[bot] left review comments

aarnphm Awaiting requested review from aarnphm aarnphm is a code owner

sfeng33 Awaiting requested review from sfeng33 sfeng33 is a code owner

bbrowning Awaiting requested review from bbrowning bbrowning is a code owner

noooop Awaiting requested review from noooop

tjtanaa Awaiting requested review from tjtanaa

sighingnow Awaiting requested review from sighingnow

vadiklyutiy Awaiting requested review from vadiklyutiy

mgoin Awaiting requested review from mgoin

robertgshaw2-redhat Awaiting requested review from robertgshaw2-redhat

yewentao256 Awaiting requested review from yewentao256

pavanimajety Awaiting requested review from pavanimajety

DarkLight1337 Awaiting requested review from DarkLight1337

ywang96 Awaiting requested review from ywang96

NickLucche Awaiting requested review from NickLucche

tlrmchlsmth Awaiting requested review from tlrmchlsmth

WoosukKwon Awaiting requested review from WoosukKwon

njhill Awaiting requested review from njhill

benchislett Awaiting requested review from benchislett

luccafong Awaiting requested review from luccafong

MatthewBonanni Awaiting requested review from MatthewBonanni

alexm-redhat Awaiting requested review from alexm-redhat

heheda12345 Awaiting requested review from heheda12345

ApostaC Awaiting requested review from ApostaC

orozery Awaiting requested review from orozery

LucasWilkinson Awaiting requested review from LucasWilkinson

russellb Awaiting requested review from russellb

youkaichao Awaiting requested review from youkaichao

houseroad Awaiting requested review from houseroad

hmellor Awaiting requested review from hmellor

ProExpertProg Awaiting requested review from ProExpertProg

22quinn Awaiting requested review from 22quinn

tdoublep Awaiting requested review from tdoublep

tomeras91 Awaiting requested review from tomeras91

chaunceyjiang Awaiting requested review from chaunceyjiang chaunceyjiang is a code owner

+1 more reviewer

gemini-code-assist[bot] gemini-code-assist[bot] left review comments

At least 1 approving review is required to merge this pull request.

Labels

bug ci/build cpu deepseek documentation frontend mistral multi-modality needs-rebase new-model nvidia performance qwen rocm speculative-decoding structured-output tool-calling v1