
[Bugfix][ToolParser] Fix Qwen3 XML and Coder streaming tool call parser regressions #40861

Open
ExtReMLapin wants to merge 25 commits into vllm-project:main from ExtReMLapin:qwen3_combined_fixes

Conversation

@ExtReMLapin
Contributor

@ExtReMLapin ExtReMLapin commented Apr 25, 2026

To be used with #40783

Purpose

Fix several streaming regressions in both the Qwen3CoderToolParser and
Qwen3XMLToolParser that caused dropped parameters, duplicated content,
or incorrect type conversion in tool call responses.

Qwen3Coder (streaming)

  • Fix split <tool_call> tag detection: when the tag was fragmented across
    two deltas (e.g. <tool_ then call>), it was not detected and the tool
    call was silently dropped.
  • Fix dropped parameters when the tool call header (<tool_call><function=name>)
    arrived in delta 1 and the parameters + </function> arrived in delta 2.
  • Fix last content message not being flushed to the client after all tool calls
    completed.
  • Fix structural delimiter disambiguation: </tool_call>, </function> and
    </parameter> appearing as literal text inside a parameter value (e.g.
    documentation, Python code) were incorrectly treated as closing delimiters,
    truncating or corrupting parameter values.
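
A minimal sketch of one way to handle the split-tag case above, assuming a parser that buffers incoming text and holds back any suffix that could still grow into <tool_call>. The helper names (partial_tag_suffix_len, split_safe_content) are hypothetical and are not taken from the PR's code.

```python
TOOL_CALL_OPEN = "<tool_call>"


def partial_tag_suffix_len(text: str, tag: str = TOOL_CALL_OPEN) -> int:
    """Length of the longest proper prefix of `tag` that `text` ends with."""
    for size in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:size]):
            return size
    return 0


def split_safe_content(buffer: str) -> tuple[str, str]:
    """Split `buffer` into (text safe to emit now, tail to hold back)."""
    held = partial_tag_suffix_len(buffer)
    cut = len(buffer) - held
    return buffer[:cut], buffer[cut:]


# Simulate the opening tag arriving as "<tool_" followed by "call>".
buffer = ""
emitted = []
tag_seen = False
for delta in ["Let me check. <tool_", "call>"]:
    buffer += delta
    if TOOL_CALL_OPEN in buffer:
        # The full tag is now visible: emit what precedes it and switch state.
        before, _, _ = buffer.partition(TOOL_CALL_OPEN)
        emitted.append(before)
        tag_seen = True
        break
    # No full tag yet: emit only the part that cannot belong to a tag.
    safe, buffer = split_safe_content(buffer)
    emitted.append(safe)
```

Holding back only the ambiguous suffix keeps ordinary text flowing to the client while still letting the second delta complete the tag.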

Qwen3XML (streaming)

  • Fix delayed text emission between consecutive tool calls.
  • Fix anyOf schema type detection: nullable schemas
    ({"anyOf": [{"type": "string"}, {"type": "null"}]}) were classified as
    "object" (triggering json.loads) instead of resolving to the first
    non-null type ("string"), causing type conversion errors.
  • Fix double-close fallback when </parameter> appeared inside a parameter
    value.
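
The anyOf fix described above can be sketched as follows. This is an illustration under stated assumptions, not the parsers' actual _get_param_type: the function name resolve_param_type and the "string" fallback for schemas with no usable type are hypothetical.

```python
def resolve_param_type(schema: dict) -> str:
    """Return the declared JSON Schema type, unwrapping a nullable anyOf."""
    declared = schema.get("type")
    if isinstance(declared, str):
        return declared
    # Nullable schemas: pick the first branch whose type is not "null".
    for option in schema.get("anyOf", []):
        option_type = option.get("type") if isinstance(option, dict) else None
        if isinstance(option_type, str) and option_type != "null":
            return option_type
    # Assumed fallback when nothing usable is declared.
    return "string"


nullable_string = {"anyOf": [{"type": "string"}, {"type": "null"}]}
resolved = resolve_param_type(nullable_string)
```

With this resolution a nullable string is treated as a string, so the raw value is passed through instead of being handed to json.loads.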

Both parsers

  • Fix speculative decoding: when two or more complete tool calls were delivered
    in a single delta burst, only the first was emitted; subsequent ones were
    silently dropped.
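
To illustrate the burst case, here is a rough sketch that drains every complete tool call visible in the accumulated text rather than stopping at the first one. The function drain_complete_tool_calls and the consumed offset are hypothetical, and the plain regex deliberately ignores the literal-tag problem that the structural-aware helpers in this PR address.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def drain_complete_tool_calls(
    current_text: str, consumed: int
) -> tuple[list[str], int]:
    """Return the bodies of all complete tool calls after `consumed`,
    plus the new offset just past the last closing tag."""
    bodies = []
    for match in TOOL_CALL_RE.finditer(current_text, consumed):
        bodies.append(match.group(1).strip())
        consumed = match.end()
    return bodies, consumed


# Two complete tool calls delivered in a single delta.
burst = (
    "<tool_call><function=a></function></tool_call>\n"
    "<tool_call><function=b></function></tool_call>"
)
calls, offset = drain_complete_tool_calls(burst, 0)
```

Tracking the offset means a later delta only looks at text that has not yet produced a tool call, so nothing is emitted twice.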

Refactor / tests

  • Extract _advance_to_next_tool() helper in Qwen3CoderToolParser to
    deduplicate identical state-advance logic that was copy-pasted between the
    normal delta path and the speculative-decoding recursion path.
  • Factor all regression tests shared between the XML and Coder parsers into
    tests/tool_parsers/test_qwen3_xml_coder_shared.py, parametrized over both
    parser classes.

Not a duplicate of any open PR: existing Qwen3 tool parser PRs address
non-streaming (batch) parsing only. This PR focuses exclusively on the
streaming path and speculative decoding edge cases.

Test Plan

python -m pytest \
  tests/tool_parsers/test_qwen3coder_tool_parser.py \
  tests/tool_parsers/test_qwen3xml_tool_parser.py \
  tests/tool_parsers/test_qwen3_xml_coder_shared.py \
  -v

Test Result

249 passed, 16 warnings in 108.68s
All 249 tests pass. No regressions detected in the existing test suite.

CNE Pierre FICHEPOIL and others added 15 commits April 24, 2026 09:42
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
…er tool calls

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
… + function name only) + delta2 (params + tool call end) was dropping params

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
… fallback

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Combined fixes for the XML and Coder tool parsers that surfaced once
the two PR branches were merged together.

Qwen3XML parser:
* Reorder _convert_param_value: check string type BEFORE the "null"
  shortcut so a string param with literal value "null" stays "null"
  instead of becoming JSON null. Fix logger.warning argument count.
* _convert_for_json_streaming: emit "null" (not "") when converted_value
  is None so nullable integer/object params serialize correctly.
* _get_param_type: anyOf returns the first non-null type instead of
  falling back to "string" for nullable integer/boolean schemas.
* _preprocess_xml_chunk: defer streaming for boolean params (avoids
  emitting "false" on the first 't' of "true") and for all container
  types regardless of single-quote hint.
* _end_element deferred path: try json.loads BEFORE ast.literal_eval so
  arrays/objects containing JSON true/false/null parse natively;
  double-decode strings to recover from buggy json.dumps(str(dict))
  templates.
* Add structural-aware helpers: _is_structural_tag_position,
  _get_valid_param_names, _is_structural_closing_tag (with partial-tag
  prefix safety), _chunk_has_structural_function_end,
  _chunk_has_structural_tool_call_end.
* _preprocess_xml_chunk: when SAX state is inside a parameter value,
  escape <tool_call>/<function=> always, and <parameter=NAME>/closing
  tags only when they are not structural delimiters.
* _process_complete_xml_elements: defer </parameter> when streaming
  with empty lookahead (more tokens may still arrive).
* parse_single_streaming_chunks: fallback close uses
  _chunk_has_structural_*_end instead of plain "in xml_chunk" so a
  literal </function> in a parameter value doesn't trigger a double
  close.
* extract_tool_calls_streaming: enable _streaming_mode=True on first
  delta.
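
The ordering changes listed above (string check before the "null" shortcut, json.loads before ast.literal_eval, and a second decode for double-encoded containers) can be sketched roughly as below. This is a simplified illustration, not the PR's _convert_param_value: the function signature and the set of handled types are assumptions.

```python
import ast
import json


def _parse_container(text: str):
    """Prefer JSON (handles true/false/null); fall back to Python literals."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return ast.literal_eval(text)


def convert_param_value(raw: str, param_type: str):
    # 1) String params are returned verbatim, so "null" stays the text "null".
    if param_type == "string":
        return raw
    # 2) Only non-string params map the literal null to JSON null.
    if raw.strip().lower() == "null":
        return None
    if param_type in ("object", "array"):
        value = _parse_container(raw)
        # 3) A string result means the container was encoded twice,
        #    e.g. json.dumps(str(some_dict)); decode once more.
        if isinstance(value, str):
            value = _parse_container(value)
        return value
    if param_type == "integer":
        return int(raw)
    if param_type == "boolean":
        return raw.strip().lower() == "true"
    return raw


double_encoded = json.dumps(str({"a": 1}))
recovered = convert_param_value(double_encoded, "object")
```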

Qwen3Coder parser:
* Reorder _convert_param_value the same way (string-first, then null).
* anyOf picks the first non-null type instead of treating it as
  "object".
* Container handling: try json.loads then double-decode via
  ast.literal_eval to recover from buggy json.dumps(str(dict)) outputs.
* Add structural-aware helpers: _next_structural_param_start,
  _find_true_function_end, _find_true_tool_call_end,
  _find_true_param_end (with require_lookahead for streaming).
* _parse_xml_function_call: top-level params are NOT filtered by schema
  (callers may rename fields) but nested boundaries inside a value ARE,
  so literal <parameter=...> lines in file content don't terminate the
  param early.
* _get_function_calls: structural-aware (</tool_call> must be followed
  by another <tool_call> or EOS; same for </function>).
* Streaming param_starts uses the helpers; </function> close check
  uses _find_true_function_end so a literal </function> in a value
  doesn't prematurely emit "}".
* tool_start_positions skips past each </tool_call> of completed calls
  so a literal <tool_call> inside a parameter value of a closed call
  doesn't spawn a phantom new tool call.
* Multi-tool-call delta (speculative decoding): when one tool call
  closes and another full <tool_call>...</tool_call> remains in
  current_text, advance manually and re-enter with a sentinel
  previous_text so reset_streaming_state isn't triggered (which would
  loop forever).
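
A small sketch of the structural rule stated above for </tool_call> (accepted only when followed by another <tool_call> or the end of the text). The function name find_true_tool_call_end mirrors the helper named in this commit message, but the body is an assumption written for illustration and applies to complete text only; in streaming, the end of the buffer is not necessarily EOS, which is presumably why the real helpers take a lookahead option.

```python
import re

CLOSE_TAG = "</tool_call>"
# After the closing tag: optional whitespace, then an opener or end of text.
_STRUCTURAL_FOLLOW = re.compile(r"\s*(<tool_call>|\Z)")


def find_true_tool_call_end(text: str, start: int = 0) -> int:
    """Index of the first structural </tool_call> at or after `start`, or -1."""
    pos = text.find(CLOSE_TAG, start)
    while pos != -1:
        after = pos + len(CLOSE_TAG)
        if _STRUCTURAL_FOLLOW.match(text, after):
            return pos
        # Literal occurrence inside a value: keep scanning.
        pos = text.find(CLOSE_TAG, after)
    return -1


sample = (
    "<tool_call><function=write_file><parameter=content>\n"
    "close with </tool_call> like this\n"
    "</parameter></function></tool_call>"
)
end = find_true_tool_call_end(sample)
```

Here the first </tool_call> sits inside the parameter value and is skipped; only the final one is reported as the end of the call.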

These fix the agentic-streaming bug where Qwen3.5 would freeze
mid-tool-call when a parameter value contained <tool_call>,
</parameter>, <parameter=NAME>, or </function> as literal text (e.g.
writing a Jinja2 template, a heredoc, or any file describing the
tool-call format), as well as several value-conversion bugs (string
"null" -> JSON null, anyOf nullable -> wrong type, double-encoded
objects -> string).

Add 16 regression tests in test_qwen3xml_tool_parser.py, 10 in
test_qwen3coder_tool_parser.py, and a new test_qwen36_bugs.py
covering bugs that span both parsers (XML array with JSON true/false,
Coder multi-tool-call in one streaming delta).

98 tests pass across the three test files.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Both the XML and Coder tool parsers were tested against nearly
identical regression scenarios in their respective files (string
"null" preservation, anyOf nullable schemas, double-encoded objects,
content with literal XML structural tags, content with param-like
lines, etc.).  Split the shared expectations into a single file with
a parametrized parser fixture so that:

* the same intent is tested against BOTH parsers automatically;
* divergent behaviour is caught immediately instead of drifting;
* parser-specific quirks (XML SAX double-close brace, char-by-char
  boolean streaming, Coder speculative-decoding chunk loss, etc.)
  stay in their parser-specific test file.

New: tests/tool_parsers/test_qwen3_xml_coder_shared.py exposes a
``parser_cls`` fixture parametrized over Qwen3XMLToolParser and
Qwen3CoderToolParser.  Each shared test runs twice and prints
``[xml]``/``[coder]`` in the test id.

Removed duplicates from:
* tests/tool_parsers/test_qwen3xml_tool_parser.py: anyOf object
  param (streaming + non-streaming), string null preservation, anyOf
  integer/null type detection, content with structural tags
  (streaming + non-streaming), content with param-like lines
  (streaming + non-streaming), double-encoded object (streaming +
  non-streaming).
* tests/tool_parsers/test_qwen3coder_tool_parser.py: anyOf parameter
  not double encoded, string null preservation, anyOf string/null
  numeric value, content with XML structural tags (streaming +
  non-streaming), content with param-like lines (streaming +
  non-streaming), double-encoded object (streaming + non-streaming),
  content param with tool_call tag (streaming + non-streaming —
  redundant with content_with_xml_structural_tags).

Removed: tests/tool_parsers/test_qwen36_bugs.py.  Its two scenarios
(XML array containing JSON ``true``, Coder two complete tool calls
in a single streaming delta) are now in the shared file as
``test_array_with_json_bool`` and
``test_two_tool_calls_in_one_streaming_chunk``, both running against
both parsers.

Net effect: 209 -> 183 tests, 0 failures, identical coverage.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Move all generic regression tests (basic extraction, type conversion,
streaming variants, robustness) from the Coder-specific file into the
shared parametrized file so each test runs against both parsers.  Only
behaviour that genuinely differs between the two parsers stays
parser-specific:

- Coder-only: ``streaming_split_tag`` (relies on ``is_tool_call_started``)
  and ``streaming_various_chunk_sizes`` (XML SAX cannot tolerate
  single-character chunks).
- XML-only: ``streaming_missing_opening_tool_call_tag`` (Coder does not
  recover from a missing ``<tool_call>`` opener in streaming mode).

Two assertions were relaxed in the shared file to accept both legitimate
behaviours: content between parallel tool calls (``None`` vs ``"\\n"``)
and the streaming header arguments value (``""`` vs ``"{"``).

Test count rises from 99 to 138 (+39 from cross-parser parametrization)
while ``test_qwen3coder_tool_parser.py`` shrinks from 1260 to 162 lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request significantly improves the robustness of the Qwen3 XML and Coder tool parsers, particularly for streaming scenarios involving speculative decoding and complex parameter types. Key changes include structural-aware parsing to correctly handle XML tags appearing as literal text within parameter values, improved handling of nullable/anyOf schemas, and fixes for streaming bugs where partial tokens or multi-tool bursts could lead to data loss or incorrect type conversion. I have reviewed the implementation and identified a potential issue where content from recursive tool call processing might be lost; please apply the suggested fix to ensure all model output is correctly concatenated.

Comment thread vllm/tool_parsers/qwen3coder_tool_parser.py Outdated
@ExtReMLapin
Contributor Author

Manual testing was done with some bleeding-edge corner cases, such as writing special tokens inside a tool call (asking a tool to write Python code whose strings contain special tokens into a file, with the whole thing streamed).

@bfroemel

bfroemel commented Apr 25, 2026

apologies if this is the wrong place to ask, but as you must be deeply familiar with these parsers: is there a technical reason why we (still) have two qwen3 tool parsers? does Qwen3Coder offer anything over Qwen3XML? I know that Qwen has Qwen3Coder in their Qwen3.5 documentation/releases, but there is also this: #25028 (comment) (+ follow up comment; comments from PR with tool parser contribution from the Qwen team)

Collaborator

@chaunceyjiang chaunceyjiang left a comment

Thanks, I have a question. What issues does the current tool parser have?

I noticed you’ve made quite a lot of changes to these two tool parsers. This might take quite some time to review.

/cc @sfeng33

@ExtReMLapin
Contributor Author

ExtReMLapin commented Apr 29, 2026

@chaunceyjiang

TL;DR

Every single added test is its own bug report. There is no added test that was passing on main. You can reproduce that locally:

git fetch origin pull/40861/head:pr-40861 && git checkout pr-40861

# 1) Run all qwen3 tests with this PR's code → everything passes
pytest tests/tool_parsers/test_qwen3coder_tool_parser.py \
       tests/tool_parsers/test_qwen3xml_tool_parser.py \
       tests/tool_parsers/test_qwen3_xml_coder_shared.py
# → 168 passed

# 2) Now restore ONLY the parsers from main (keep the new tests)
git checkout main -- vllm/tool_parsers/qwen3coder_tool_parser.py \
                     vllm/tool_parsers/qwen3xml_tool_parser.py
pytest tests/tool_parsers/test_qwen3coder_tool_parser.py \
       tests/tool_parsers/test_qwen3xml_tool_parser.py \
       tests/tool_parsers/test_qwen3_xml_coder_shared.py
# → 66 failed, 102 passed

# 3) Restore the parsers
git checkout pr-40861 -- vllm/tool_parsers/qwen3coder_tool_parser.py \
                         vllm/tool_parsers/qwen3xml_tool_parser.py

The 102 still-passing tests are tests that already existed on main. The 66 failing ones are exactly the new tests added in this PR — each one is a real symptom I hit in production with Qwen3.5 / Qwen3.6.

What kind of bugs (so you can review by category, not by line count)

The 66 failures fall into ~6 independent categories. Each category is self-contained, and you can read / merge them independently if you prefer to split the PR:

  1. MTP / fragmented-token bugs: at temp 1.5 the <tool_call> special token is sometimes split into multiple tokens (~1/75 calls), and entire "before <tool_1> between <tool_2> after </tool_call>" sequences arrive in a single delta (or just ... omitted). Multiple bugs: early-advance returning None and dropping the delta, recursion not advancing _sent_content_idx past the last </tool_call>, and content fragments between two tools silently dropped because the outer merger guarded with not result.content.
    Tests:
  • test_extract_tool_calls_streaming_speculative_decode_loss
  • test_two_tool_calls_in_one_streaming_chunk
  • test_streaming_two_tool_calls_plus_trailing_text_one_delta
  • test_streaming_trailing_text_with_final_close_in_same_delta
  • test_streaming_content_before_and_between_two_tool_calls_one_delta
  2. Literal XML tags inside parameter values: a write_file tool whose content documents the tool-call format itself was making current_text.find("</tool_call>") land inside an earlier tool's content and silently drop every subsequent emission. The new _structural_tool_call_end_positions helper only accepts </tool_call> if it is preceded by </function> after optional whitespace, or followed structurally by another opener / EOS. The same fix is extended to </function>, </parameter>, and <tool_call> opener detection.
    Tests:
  • test_content_with_xml_structural_tags_*
  • test_content_with_param_like_lines_*
  • test_content_with_real_param_name_literal_*
  • test_content_with_full_nested_tool_call_*
  • test_two_tools_second_with_out_of_schema_nested_literal_*
  • test_streaming_*_literal_close_tag_in_value
  3. Qwen3.5 | string chat-template rendering: the official chat template renders nullable args via | string, so a previous turn's null value becomes the literal "None" in the prompt. Models trained on this template generate "None" verbatim. _convert_param_value now accepts "None" alongside "null" for nullable params, and the XML streaming path defers numeric (int/float) conversion the same way booleans were already deferred; otherwise the diff-based char emission produces "Non" then "l" against the new "null" output, yielding the cumulative invalid JSON "Nonl".
    Tests:
  • test_python_none_value_for_nullable_int
  • test_qwen3xml_streaming_python_none_int_char_by_char
  • test_anyof_string_null_*
  • test_anyof_integer_null_parses_as_int
  • test_string_null_value_preserved
  4. anyOf / nullable schema handling: both parsers were double-encoding object params resolved via anyOf (a previous PR added partial support but missed the streaming path and nested objects).
    Tests:
  • test_anyof_object_param_not_double_encoded_*
  • test_double_encoded_object_param_*
  • test_array_with_json_bool
  5. Free-text streaming around tool calls: content between two tool calls was being delayed and emitted after the last tool call, and content after the last tool call was buffered indefinitely and lost on EOS.
    Tests:
  • test_qwen3xml_streaming_text_after_tool_call
  • test_qwen3xml_async_streaming_free_text
  • TestQwen3xmlToolParser::test_surrounding_text[True]
  • test_streaming_trailing_text_*
  • test_inline_empty_tool_call_preserves_content_before_real_call
  6. Streaming chunking robustness: split tags across delta boundaries (a < arriving in delta N and tool_call> in delta N+1), various chunk sizes (1 char, 2 char, … full token), bool true getting flipped to false because of the partial-string→JSON-literal flip, and the last char of a string-typed null value being dropped.
    Tests:
  • test_extract_tool_calls_streaming_split_tag
  • test_streaming_char_by_char_literal_balises_in_value
  • test_extract_tool_calls_streaming_various_chunk_sizes
  • test_xml_streaming_boolean_true_not_false
  • test_xml_streaming_string_null_last_char_not_dropped
  • test_qwen36_xml_streaming_double_close_brace
  • test_xml_streaming_parallel_tool_calls_preformed_chunks
  • test_xml_streaming_missing_opening_tool_call_tag
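
The "Nonl" symptom from the | string category can be reproduced with a toy model of length-based diff emission. This is a simplified sketch, not the XML parser's code: render, stream_naive and stream_deferred are hypothetical names, and the real parser tracks far more state.

```python
def render(raw: str) -> str:
    """Toy JSON rendering of a (possibly partial) nullable-int value."""
    return "null" if raw in ("None", "null") else raw


def stream_naive(chunks) -> str:
    """Emit, on every chunk, whatever the rendering has beyond what was sent."""
    raw, sent = "", ""
    for chunk in chunks:
        raw += chunk
        rendered = render(raw)
        # Length-based diff: assumes the rendering only ever grows.
        sent += rendered[len(sent):]
    return sent


def stream_deferred(chunks) -> str:
    """Hold the value back until it is complete, then render it once."""
    return render("".join(chunks))


naive_output = stream_naive(list("None"))
deferred_output = stream_deferred(list("None"))
```

In the naive path "N", "o", "n" have already been sent when the full value flips the rendering to "null", so only the trailing "l" is appended; deferring conversion avoids sending anything that might later be rewritten.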

How I found them

  • I run a lot of agentic workloads with Qwen3.5 27B / Qwen3.6 in MTP + streaming (temp 1.5 and the advised temps). The MTP-related ones were caught at runtime over days of intensive usage.
  • For the | string / template-rendering bugs, I literally asked Qwen3.5 to read its own chat template and predict where the parsers would break. It nailed several of them in one shot.
  • The "literal tag in parameter value" bugs surfaced when I asked the model to write a Python tool whose content was itself a tool-call snippet (write_file with code that documents the tool-call format).
  • Asking Qwen 3.5 to review its own chat template ... was a messy ride, to be frank.

AI assistance: to be explicit, Claude Opus wrote most of the test scaffolding and several of the fixes under my supervision; I read every changed line, ran them against my real Qwen3 traffic, and I'm the one defending the change end-to-end.

@ExtReMLapin
Contributor Author

What really worries me as of today is the whole parsing ecosystem (not an issue restricted to vLLM)

IMO each message should be tokenized on its own, isolated from the others, then reasoning, then tool calls, instead of losing message isolation by applying the chat template.

Not sure if there is a dedicated place to discuss this.

@bbrowning
Collaborator

What really worries me as of today is the whole parsing ecosystem (not an issue restricted to vLLM)

IMO each message should be tokenized on its own, isolated from the others, then reasoning, then tool calls, instead of losing message isolation by applying the chat template.

Not sure if there is a dedicated place to discuss this.

I'm not following what you mean here. We aren't parsing for reasoning and tools on incoming messages. We only apply the chat template on incoming messages. We do not apply a chat template to the model's generated outputs. We do parse for reasoning and tool content in the model's generated outputs.

@ExtReMLapin
Contributor Author

ExtReMLapin commented Apr 29, 2026

What really worries me as of today is the whole parsing ecosystem (not an issue restricted to vLLM)
IMO each message should be tokenized on its own, isolated from the others, then reasoning, then tool calls, instead of losing message isolation by applying the chat template.
Not sure if there is a dedicated place to discuss this.

I'm not following what you mean here. We aren't parsing for reasoning and tools on incoming messages. We only apply the chat template on incoming messages. We do not apply a chat template to the model's generated outputs. We do parse for reasoning and tool content in the model's generated outputs.

You're right that vLLM doesn't apply the chat template to the model's output directly. My point is that it can fail on the next conversation turn:

  1. Model generates raw tokens
  2. vLLM parses them (reasoning + tool calls) into structured fields and ships that to the client
  3. Client appends the structured message to history and sends the whole conversation back
  4. vLLM re-applies the chat template, re-tokenizing everything together

Step 4 is where a bad parse in step 2 breaks the whole thing.
If during generation the model emits a role token inside content/reasoning (e.g. while introspecting its own chat template), that token either gets misparsed on the way out, or survives into the structured message and then breaks boundaries when the template is reapplied next turn.

My point is that parsing/templating at the conversation level lets a bad parse propagate: we destroy the messages that were initially parsed and "checkpointed" correctly. Parsing at the message level isolates the parsing issues.

Again I'm not sure it's the right place to discuss this, I'm fine with it but I fear maintainers might not appreciate that.

Edit : please pardon my previous phrasing, I'm out of bandwidth with everything at the office, I'm working overtime to get things together and it's a mess

@bfroemel

bfroemel commented Apr 29, 2026

If during generation the model emits a role token inside content/reasoning (e.g. while introspecting its own chat template), that token either gets misparsed on the way out, or survives into the structured message and then breaks boundaries when the template is reapplied next turn.

I also noticed that behavior, and my workaround was to ask the model not to use its actual special tokens but placeholders, e.g. [think] instead of <think>, so as not to confuse the parsers. Another, more stable solution could be to filter all model input (while the prompt is rendered) and properly escape any special tokens. Before sending model output back to the client, any escaped special tokens it generated could be automatically unescaped.
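
A rough sketch of the escape/unescape idea, purely for illustration. Everything here is an assumption: the token list, the choice of a zero-width space as the escape marker, and the function names. It only shows that the round trip is lossless for text that does not already contain the marker sequence; whether a given tokenizer then treats the escaped form as plain text is not something this sketch verifies.

```python
# Hypothetical token list; a real deployment would read it from the tokenizer.
SPECIAL_TOKENS = ("<think>", "</think>", "<tool_call>", "</tool_call>")
# Hypothetical escape marker: a zero-width space after the leading "<".
_MARK = "\u200b"


def _escaped(token: str) -> str:
    return token[0] + _MARK + token[1:]


def escape_special_tokens(text: str) -> str:
    """Rewrite special tokens in prompt text so they no longer match verbatim."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, _escaped(token))
    return text


def unescape_special_tokens(text: str) -> str:
    """Restore the original spelling before returning output to the client."""
    for token in SPECIAL_TOKENS:
        text = text.replace(_escaped(token), token)
    return text


message = "The template wraps reasoning in <think> and ends calls with </tool_call>."
escaped = escape_special_tokens(message)
restored = unescape_special_tokens(escaped)
```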

Anyway, I think you only really hit this issue, if you work on chat templates and model output parsers.

@ExtReMLapin
Contributor Author

I do agree, and I also agree it can be an edge case in this particular scenario.

If tomorrow one model family takes 90% of the market share because it is simply better than the rest, you can't rely on the other, less intelligent models to fix an issue with this model family.

Self compiling compiler, self fixing LLM.

@ToastyTheBot

I've applied both #40783 and #40861, backported to v0.20.0 with help from Claude, but my qwen3.6 AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP model still frequently fails at tool calling. Am I missing any other patches?

@ExtReMLapin
Contributor Author

I've applied both #40783 and #40861, backported to v0.20.0 with help from Claude, but my qwen3.6 AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP model still frequently fails at tool calling. Am I missing any other patches?

Do you feel like those PRs made things less stable, or did they just not fix the issues you had?

@ToastyTheBot

I've applied both #40783 and #40861, backported to v0.20.0 with help from Claude, but my qwen3.6 AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP model still frequently fails at tool calling. Am I missing any other patches?

Do you feel like those PRs made things less stable, or did they just not fix the issues you had?

I don't believe it has introduced more instabilities, but I don't think it has fixed the model stopping issue. Can you confirm the two PRs fix model stopping issues for you, and if so, what arguments and model are you using?

@ExtReMLapin
Contributor Author

ExtReMLapin commented Apr 29, 2026

I feel like it reduced issues on 3.6 27B, qwen3_coder (not much diff with xml).

But it CLEARLY fixed issues with model chat template introspection, which is a very specific use case, I'll give you that.

Tensor parallel 2, FP8 official model, with preserve_thinking to true. 200k context

Conflict resolutions:

- vllm/parser/abstract_parser.py: kept the reasoning_from_transition
  restoration block adjacent to the history_tool_call_cnt counter
  added by main; the two blocks are independent (no shared state).

- vllm/tool_parsers/qwen3coder_tool_parser.py: merged the new
  structural_tag_registry imports with the existing partial_tag_overlap
  import; preserved the speculative-decoding recursion and trailing
  free-text emission logic from HEAD and appended the get_structural_tag
  method introduced by main right after extract_tool_calls_streaming.

- tests/tool_parsers/test_qwen3coder_tool_parser.py: dropped the test
  bodies that were re-introduced by the merge but had already been
  factored into tests/tool_parsers/test_qwen3_xml_coder_shared.py
  during the qwen3_combined_fixes refactor.  Cleaned the matching
  unused imports.

Cross-parser coverage:

- Added get_structural_tag to Qwen3XMLToolParser using the same
  qwen_3_5 model registration as the Coder parser, so the XML parser
  also exposes a valid StructuralTag.
- Moved the three structural_tag tests added in 844df54 (and the
  _as_chat_completion_tools helper) into test_qwen3_xml_coder_shared.py
  so they run against both Qwen3XMLToolParser and Qwen3CoderToolParser
  via the parser_cls fixture.

Note: Qwen3CoderToolParser.supports_required_and_named is set to False
by main; the same flag was intentionally left at its True default on
Qwen3XMLToolParser pending a separate decision.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
@ExtReMLapin
Contributor Author

merged with claude code

@Seven-Streams in #40894 you added some tests.

I moved those tests into a shared qwen3 xml-coder file to ensure they cover both parsers.

@bfroemel

As a harness user, I noticed that with this PR and #40783 there are far fewer parsing issues when using the chat completion API with streaming and Qwen3.6-27B-FP8. I had runs with 100s of tool calls and they all seemed fine. Without the PRs there are:

  • sometimes tool call blocks in the model reasoning
  • model reasoning is cut-off/truncated
  • failing tool calls/json parsing issues
  • premature agent turn stops (probably because cut-off/truncated model output in the message history slowly stacks up until the model gets confused and generates output without content or tool calls)

Even with these PRs I sometimes see reasoning output that appears to be cut-off (e.g., the last sentence ends without '.\n'). I think this is a parser/streaming issue, because with non-streaming I haven't observed the model generating sentences without a proper end. (Also, while this isn't relevant in most cases, special tokens still can confuse the parsers, e.g., if I let the model review these PRs, e.g., reasoning can bleed into content.)

@chaunceyjiang @sfeng33
-> Imo for agentic harness users (that depend on long running turns with correct reasoning and tool calls parsing) the PRs have value and it would be great if maintainers/code owners/reviewers could offer some guidance how to move forward. Thanks!

/cc: @qmx @hickeyma @hmellor @stakeswky @ywang96 (apologies for spamming all recent code contributors of the qwen parsers)

@ExtReMLapin ExtReMLapin requested a review from chaunceyjiang May 11, 2026 18:44
@dabhimanyu

Because of this tool-calling issue, it has been a mess to use Qwen 3.6 35B A3B for agentic tasks. It works fine as long as I ask it to do targeted fixes and provide the full content of the files (since it's not able to use the file system), and rather than calling it through tool calls I call it directly with a curl command.

I'm using it on a single RTX 5090 in NVFP4 quant (serving wrapper below). Not sure if this is the right place to ask, but what's the current status of this fix?

I recently discovered this solution on Hugging Face, froggeric/Qwen-Fixed-Chat-Templates, but even after applying it I was still facing tool call issues. I still need to check whether I did something incorrectly or whether my agent messed it up (I was using Gemma 4 31B to fix this 💀), but it was late and I was feeling sleepy, so I still have to go through the git logs more thoroughly and figure out whether my agent messed something up or whether this is a vLLM and/or Qwen 3.6 model family issue, because this doesn't happen with Gemma models.

FYI, I'm getting this error in Opencode, Claude Code, and Droid; all three give the same error. It works if you simply invoke the local model with a direct curl command against the vLLM API (not sure what that curl thingy is called, just a random multiphase turbulence researcher here 💀). With concurrency 2 it rips through at over 200 TPS output tokens at a context of about 90k.

I just want to know how to get this damn tool call to work in Qwen 3.6 35B-A3B and Qwen 27B, and if possible Qwen3-VL models too. Any help pointing me in the right direction would be deeply appreciated. Forgive me if I barged into the wrong place.

My vLLM serving wrapper.

This still needs more work, because I can squeeze more context window out of the Gemma and Qwen 3.6 families, since the attention mechanism in these models is very different from Qwen3-VL. The script below treats the attention mechanism of all models like Qwen3-VL to be on the safe side, since I didn't understand that previously. I've got a beta wrapper in progress that incorporates that, but it's still in the testing phase. Currently the wrapper below is the workhorse for everyday text- and vision-related grunt work, to save tokens from my Codex plus plan💀.

# ~/.config/zsh/local_llm_yolo.zsh
# Local LLM YOLO — Shell profile for vLLM model management
# Source: source ~/vllm_serving_scripts/v2/lib/local_llm_yolo.zsh
# RTX 5090 32GB | 8 NVFP4 models | Concurrency 1/2/3 only | 1 model at a time

# ═══════════════════════════════════════════════════════════════════════
# MODEL REGISTRY
# ═══════════════════════════════════════════════════════════════════════

declare -A VLLM_MODELS
VLLM_MODELS=(
  M1 "Firworks-Qwen3-VL-32B-Thinking"
  M2 "LilaRest-gemma-4-31B-it-turbo"
  M3 "nvidia-Gemma-4-26B-A4B"
  M4 "OptimizeLLM-Qwen3-VL-30B-A3B"
  M5 "RedHatAI-gemma-4-31B-it"
  M6 "RedHatAI-Qwen3.6-35B-A3B"
  M7 "sakamakismile-Qwen3.6-27B-Text-MTP"
  M8 "unsloth-Qwen3.6-27B"
)

declare -A VLLM_HF_IDS
VLLM_HF_IDS=(
  M1 "Firworks/Qwen3-VL-32B-Thinking-NVFP4"
  M2 "LilaRest/gemma-4-31B-it-NVFP4-turbo"
  M3 "nvidia/Gemma-4-26B-A4B-NVFP4"
  M4 "OptimizeLLM/Qwen3-VL-30B-A3B-Thinking-NVFP4"
  M5 "RedHatAI/gemma-4-31B-it-NVFP4"
  M6 "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  M7 "sakamakismile/Qwen3.6-27B-Text-MTP"
  M8 "unsloth/Qwen3.6-27B-NVFP4"
)

declare -A VLLM_QUANT
VLLM_QUANT=(
  M1 "compressed-tensors"
  M2 "modelopt"
  M3 "modelopt"
  M4 "compressed-tensors"
  M5 "compressed-tensors"
  M6 "compressed-tensors"
  M7 "modelopt"
  M8 "compressed-tensors"
)

declare -A VLLM_PARSER_REASON
VLLM_PARSER_REASON=(
  M1 "qwen3"
  M2 "gemma4"
  M3 "gemma4"
  M4 "qwen3"
  M5 "gemma4"
  M6 "qwen3"
  M7 "qwen3"
  M8 "qwen3"
)

declare -A VLLM_PARSER_TOOL
VLLM_PARSER_TOOL=(
  M1 "hermes"
  M2 "gemma4"
  M3 "gemma4"
  M4 "hermes"
  M5 "gemma4"
  M6 "hermes"
  M7 "hermes"
  M8 "hermes"
)

declare -A VLLM_VISION
VLLM_VISION=(
  M1 "YES"
  M2 "NO"
  M3 "YES"
  M4 "YES"
  M5 "YES"
  M6 "YES"
  M7 "NO"
  M8 "YES"
)

declare -A VLLM_MOE
VLLM_MOE=(
  M1 "NO"
  M2 "NO"
  M3 "YES"
  M4 "YES"
  M5 "NO"
  M6 "YES"
  M7 "NO"
  M8 "NO"
)

declare -A VLLM_VIDEO
VLLM_VIDEO=(
  M1 "NO" M2 "NO" M3 "NO" M4 "NO"
  M5 "NO" M6 "NO" M7 "NO" M8 "NO"
)

declare -A VLLM_AUDIO
VLLM_AUDIO=(
  M1 "NO" M2 "NO" M3 "NO" M4 "NO"
  M5 "NO" M6 "NO" M7 "NO" M8 "NO"
)

declare -A VLLM_DEFAULT_CONCURRENCY
VLLM_DEFAULT_CONCURRENCY=(
  M1 "1" M2 "2" M3 "2" M4 "1"
  M5 "1" M6 "1" M7 "2" M8 "1"
)

declare -A VLLM_CONCURRENCY_RANGE
VLLM_CONCURRENCY_RANGE=(
  M1 "1-2"  M2 "1-3"  M3 "2-3" M4 "1"
  M5 "1-2"  M6 "1-2"  M7 "2-3" M8 "1-2"
)

declare -A VLLM_LOCAL_PATHS
VLLM_LOCAL_PATHS=(
  M1 "/home/abhimanyu/local_llm_models/Firworks-Qwen3-VL-32B-Thinking-nvfp4"
  M2 "/home/abhimanyu/local_llm_models/LilaRest-gemma-4-31B-it-NVFP4-turbo"
  M3 "/home/abhimanyu/local_llm_models/nvidia-Gemma-4-26B-A4B-NVFP4"
  M4 "/home/abhimanyu/local_llm_models/OptimizeLLM-Qwen3-VL-30B-A3B-Thinking-NVFP4"
  M5 "/home/abhimanyu/local_llm_models/RedHatAI-gemma-4-31B-it-NVFP4"
  M6 "/home/abhimanyu/local_llm_models/RedHatAI-Qwen3.6-35B-A3B-NVFP4"
  M7 "/home/abhimanyu/local_llm_models/sakamakismile-Qwen3.6-27B-Text-NVFP4-MTP"
  M8 "/home/abhimanyu/local_llm_models/unsloth-Qwen3.6-27B-NVFP4"
)

# ═══════════════════════════════════════════════════════════════════════
# ENDPOINT
# ═══════════════════════════════════════════════════════════════════════

export LOCAL_LLM_BASE_URL="http://127.0.0.1:8000/v1"
export LOCAL_LLM_API_KEY="dummy"

# ═══════════════════════════════════════════════════════════════════════
# CONCURRENCY GUARD
# ═══════════════════════════════════════════════════════════════════════

_vllm_check_concurrency() {
  local c=$1
  if [[ "$c" != "1" && "$c" != "2" && "$c" != "3" ]]; then
    echo "ERROR: Concurrency must be 1, 2, or 3. Got: $c" >&2
    echo "ATTEMPTED_OP: vllm serve with --max-num-seqs $c" >&2
    echo "REASON: Only concurrency 1/2/3 allowed on RTX 5090 32GB" >&2
    echo "ACTION_REQUIRED: Use concurrency 1 (Deep Thinker), 2 (Balanced), or 3 (Max throughput)" >&2
    return 1
  fi
  return 0
}

# ═══════════════════════════════════════════════════════════════════════
# CORE FUNCTIONS
# ═══════════════════════════════════════════════════════════════════════

vllm_kill() {
  echo "Killing existing vLLM processes..."
  pkill -f "vllm serve" 2>/dev/null
  sleep 2
  if pgrep -f "vllm serve" >/dev/null 2>&1; then
    echo "WARNING: vLLM still running, force killing..."
    pkill -9 -f "vllm serve" 2>/dev/null
    sleep 1
  fi
  # Kill orphaned vLLM GPU processes (parent APIServer dead)
  # EngineCore children use multiprocessing.spawn — their /proc/cmdline may not
  # contain "vllm", so we also check the nvidia-smi process name (VLLM::EngineCore)
  # and the vllm venv python path as fallback signals.
  local orphan_pids=""
  local gpu_pids
  local gpu_info
  gpu_info=$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader 2>/dev/null)
  gpu_pids=$(echo "$gpu_info" | awk -F', ' '{print $1}' | tr -d ' ')
  for pid in ${(f)gpu_pids}; do
    [[ -z "$pid" ]] && continue
    local is_vllm=false
    # Check 1: nvidia-smi process name for this PID contains "vllm" (case-insensitive)
    local pname=$(echo "$gpu_info" | awk -F', ' -v p="$pid" '$1 == p {print $2}')
    if [[ "${pname:l}" == *vllm* ]]; then
      is_vllm=true
    fi
    # Check 2: /proc/cmdline contains "vllm"
    if ! $is_vllm && [[ -r /proc/$pid/cmdline ]] && grep -ql "vllm" /proc/$pid/cmdline 2>/dev/null; then
      is_vllm=true
    fi
    # Check 3: process running from .vllm_venv python (vLLM's venv)
    if ! $is_vllm && [[ -r /proc/$pid/cmdline ]] && grep -ql ".vllm_venv" /proc/$pid/cmdline 2>/dev/null; then
      is_vllm=true
    fi
    if $is_vllm; then
      orphan_pids="$orphan_pids $pid"
    fi
  done
  if [[ -n "$orphan_pids" ]]; then
    echo "Cleaning up orphaned vLLM GPU processes:$orphan_pids"
    echo "$orphan_pids" | tr ' ' '\n' | grep -v '^$' | xargs -r kill -9 2>/dev/null
    sleep 1
  fi
  echo "vLLM stopped."
}

vllm_status() {
  if pgrep -f "vllm serve" >/dev/null 2>&1; then
    echo "vLLM is RUNNING"
    echo "PID: $(pgrep -f 'vllm serve' | head -1)"
    echo "Model: ${VLLM_MODELS[$VLLM_CURRENT_MODEL]:-unknown} ($VLLM_CURRENT_MODEL)"
    echo "Modalities: ${VLLM_CURRENT_MODALITIES:-?}"
    echo "Concurrency: ${VLLM_CURRENT_CONCURRENCY:-?}"
    echo "Max Context: ${VLLM_CURRENT_MAX_LEN:-?} tokens"
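    # vLLM's /health normally answers with a bare HTTP 200 and an empty body,
    # so check the status code rather than the response text.
    local health=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/health 2>/dev/null)
    if [[ "$health" == "200" ]]; then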
      echo "Health: OK"
    else
      echo "Health: WAITING (model loading...)"
    fi
    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader 2>/dev/null
  else
    echo "vLLM is STOPPED"
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader 2>/dev/null
  fi
}

vllm_wait_health() {
  local timeout=${1:-300}
  local elapsed=0
  echo "Waiting for vLLM health (timeout: ${timeout}s)..."
  while [[ $elapsed -lt $timeout ]]; do
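    # vLLM's /health normally answers with a bare HTTP 200 and an empty body,
    # so check the status code rather than the response text.
    local health=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8000/health 2>/dev/null)
    if [[ "$health" == "200" ]]; then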
      echo "vLLM is healthy after ${elapsed}s"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
    echo -ne "  Waiting... ${elapsed}s\r"
  done
  echo ""
  echo "ERROR: vLLM did not become healthy within ${timeout}s" >&2
  return 1
}

# ═══════════════════════════════════════════════════════════════════════
# vllm_swap — Main entry point
# Usage: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]
# ═══════════════════════════════════════════════════════════════════════

vllm_swap() {
  local model_id=""
  local concurrency=""
  local modality=""

  for arg in "$@"; do
    case "$arg" in
      M[1-8])
        model_id="$arg"
        ;;
      1|2|3)
        concurrency="$arg"
        ;;
      --text|--vision|--video|--audio)
        modality="${arg#--}"
        ;;
      *)
        echo "ERROR: Unknown argument: $arg" >&2
        echo "USAGE: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]" >&2
        return 1
        ;;
    esac
  done

  if [[ -z "$model_id" ]] || [[ -z "${VLLM_HF_IDS[$model_id]}" ]]; then
    echo "ERROR: Invalid or missing model ID. Use M1-M8." >&2
    echo "Available: M1 M2 M3 M4 M5 M6 M7 M8" >&2
    echo "USAGE: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]" >&2
    return 1
  fi

  if [[ -z "$concurrency" ]]; then
    concurrency="${VLLM_DEFAULT_CONCURRENCY[$model_id]}"
  fi

  _vllm_check_concurrency "$concurrency" || return 1

  local range="${VLLM_CONCURRENCY_RANGE[$model_id]}"
  local range_lo="${range%-*}"
  local range_hi="${range#*-}"
  if [[ "$concurrency" -lt "$range_lo" || "$concurrency" -gt "$range_hi" ]]; then
    echo "WARNING: Concurrency $concurrency outside recommended range [$range] for $model_id"
  fi

  if [[ -z "$modality" ]]; then
    modality="text"
  fi

  local use_vision="no"
  local use_video="no"
  local use_audio="no"
  local modalities_str="text"

  case "$modality" in
    vision)
      if [[ "${VLLM_VISION[$model_id]}" != "YES" ]]; then
        echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support vision."
        echo "  Falling back to text-only."
        local vl_models=()
        for m in M1 M2 M3 M4 M5 M6 M7 M8; do
          [[ "${VLLM_VISION[$m]}" == "YES" ]] && vl_models+=("$m (${VLLM_MODELS[$m]})")
        done
        echo "  Vision-capable models: ${vl_models[*]}"
        modality="text"
      else
        use_vision="yes"
        modalities_str="text+vision"
      fi
      ;;
    video)
      if [[ "${VLLM_VIDEO[$model_id]}" != "YES" ]]; then
        echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support video."
        echo "  Falling back to text-only."
        local vid_models=()
        for m in M1 M2 M3 M4 M5 M6 M7 M8; do
          [[ "${VLLM_VIDEO[$m]}" == "YES" ]] && vid_models+=("$m (${VLLM_MODELS[$m]})")
        done
        echo "  Video-capable models: ${vid_models[*]}"
        modality="text"
      else
        use_video="yes"
        modalities_str="text+video"
      fi
      ;;
    audio)
      if [[ "${VLLM_AUDIO[$model_id]}" != "YES" ]]; then
        echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support audio."
        echo "  Falling back to text-only."
        local aud_models=()
        for m in M1 M2 M3 M4 M5 M6 M7 M8; do
          [[ "${VLLM_AUDIO[$m]}" == "YES" ]] && aud_models+=("$m (${VLLM_MODELS[$m]})")
        done
        echo "  Audio-capable models: ${aud_models[*]}"
        modality="text"
      else
        use_audio="yes"
        modalities_str="text+audio"
      fi
      ;;
    text)
      ;;
  esac

  # Vision concurrency guard: C=3 not supported for vision. Applied before the
  # banner so the reported concurrency matches what is actually served.
  if [[ "$use_vision" == "yes" || "$use_video" == "yes" ]]; then
    if [[ "$concurrency" -eq 3 ]]; then
      echo "WARNING: Vision mode does not support C=3 on RTX 5090. Overriding to C=2."
      concurrency=2
    fi
  fi

  echo "Swapping to ${VLLM_MODELS[$model_id]} | C=$concurrency | $modalities_str"

  vllm_kill

  # Wait for GPU memory to actually free after kill
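  local gpu_wait=0
  # Declare once outside the loop: in zsh, repeating a bare `local used` for an
  # existing variable prints its value on every later iteration.
  local used=""
  while [[ $gpu_wait -lt 30 ]]; do
    used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -d ' ')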
    if [[ -n "$used" && "$used" -lt 500 ]]; then
      break
    fi
    sleep 2
    gpu_wait=$((gpu_wait + 2))
  done
  if [[ $gpu_wait -ge 30 ]]; then
    echo "WARNING: GPU memory may not be fully freed (${used}MB still used)" >&2
  fi

  if ! source ~/.vllm_venv/bin/activate 2>/dev/null; then
    echo "ERROR: Failed to activate venv at ~/.vllm_venv/" >&2
    echo "ATTEMPTED_OP: source venv for vllm serve" >&2
    echo "ACTION_REQUIRED: Run setup_python_venv.sh or verify ~/.vllm_venv/bin/activate exists" >&2
    return 1
  fi

  # Validate local model path
  local model_path_check="${VLLM_LOCAL_PATHS[$model_id]}"
  if [[ -n "$model_path_check" && -d "$model_path_check" ]]; then
    local sf_files=("${model_path_check}"/*.safetensors(N))
    if [[ ${#sf_files} -eq 0 ]]; then
      echo "WARNING: No .safetensors files found in $model_path_check" >&2
      echo "  Model files may be incomplete. Proceeding with local path anyway." >&2
    fi
  elif [[ -n "$model_path_check" ]]; then
    echo "WARNING: Local path does not exist: $model_path_check" >&2
    echo "  Will fall back to HuggingFace download." >&2
  fi

  local hf_id="${VLLM_HF_IDS[$model_id]}"
  local quant="${VLLM_QUANT[$model_id]}"
  local r_parser="${VLLM_PARSER_REASON[$model_id]}"
  local t_parser="${VLLM_PARSER_TOOL[$model_id]}"
  local is_moe="${VLLM_MOE[$model_id]}"

  local model_path="${VLLM_LOCAL_PATHS[$model_id]}"
  local cmd
  if [[ -d "$model_path" ]]; then
    cmd="vllm serve $model_path"
  else
    echo "WARNING: Local path not found: $model_path" >&2
    echo "  Falling back to HuggingFace ID: $hf_id" >&2
    cmd="vllm serve $hf_id"
  fi
  cmd+=" --host 127.0.0.1"
  cmd+=" --port 8000"
  cmd+=" --tensor-parallel-size 1"
  cmd+=" --max-num-seqs $concurrency"
  cmd+=" --gpu-memory-utilization 0.94"
  cmd+=" --quantization $quant"
  cmd+=" --kv-cache-dtype fp8"
  cmd+=" --trust-remote-code"
  cmd+=" --no-calculate-kv-scales"
  cmd+=" --reasoning-parser $r_parser"
  cmd+=" --served-model-name local-llm"
  cmd+=" --enable-auto-tool-choice"
  cmd+=" --tool-call-parser $t_parser"

  local max_len
  if [[ "$use_vision" == "yes" || "$use_video" == "yes" ]]; then
    case "$concurrency" in
      1) max_len="90000" ;;
      2) max_len="65536" ;;
    esac
    case "$concurrency" in
      1) cmd+=" --limit-mm-per-prompt '{\"image\": 3, \"video\": 0}'" ;;
      2) cmd+=" --limit-mm-per-prompt '{\"image\": 1, \"video\": 0}'" ;;
    esac
  else
    case "$concurrency" in
      1) max_len="131072" ;;
      2) max_len="81920" ;;
      3) max_len="40960" ;;
    esac
    cmd+=" --limit-mm-per-prompt '{\"image\": 0}'"
  fi
  cmd+=" --max-model-len $max_len"

  if [[ $max_len -ge 131072 ]]; then
    cmd+=" --async-scheduling"
    cmd+=" --no-enable-prefix-caching"
    cmd+=" --max-num-batched-tokens 640"
  else
    cmd+=" --enable-prefix-caching"
    cmd+=" --max-num-batched-tokens 8192"
  fi

  if [[ "$is_moe" == "YES" ]]; then
    cmd+=" --enable-expert-parallel"
  fi

  if [[ "$model_id" == "M7" ]]; then
    cmd+=" --speculative-config '{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":3}'"
    cmd+=" --no-scheduler-reserve-full-isl"
  fi

  echo ""
  echo "+-- vLLM Serve Command -------------------------------------------------+"
  echo "$cmd" | sed 's/ --/ \\\n  --/g'
  echo "+-----------------------------------------------------------------------+"
  echo ""
  eval "nohup $cmd > /tmp/vllm_startup.log 2>&1 &"
  local pid=$!
  disown
  echo "vLLM PID: $pid"
  echo "Startup log: /tmp/vllm_startup.log"

  local health_timeout="${VLLM_HEALTH_TIMEOUT:-300}"
  if vllm_wait_health "$health_timeout"; then
    echo ""
    echo "========================================"
    echo " vLLM READY"
    echo "========================================"
  else
    echo ""
    echo "WARNING: vLLM not healthy after ${health_timeout}s" >&2
    echo "  The model may still be loading (FlashInfer kernel compilation)." >&2
    echo "  PID: $pid — check progress: tail -f /tmp/vllm_startup.log" >&2
    echo "  Run 'vllm_wait_health' or 'vllm_status' to check again." >&2
  fi

  export VLLM_CURRENT_MODEL="$model_id"
  export VLLM_CURRENT_MODALITIES="$modalities_str"
  export VLLM_CURRENT_CONCURRENCY="$concurrency"
  export VLLM_CURRENT_HF_ID="$hf_id"
  export VLLM_CURRENT_MAX_LEN="$max_len"
  export LOCAL_LLM_CURRENT_MODEL="local-llm"

  echo ""
  echo "========================================"
  echo " ACTIVE MODALITIES: ✓ $modalities_str"
  echo " CONTEXT: $max_len tokens | CONCURRENCY: $concurrency"
  echo "========================================"
  echo "Model:    ${VLLM_MODELS[$model_id]} ($model_id)"
  echo "API:      $LOCAL_LLM_BASE_URL"
  echo "Aliases:  local-llm, $hf_id"
  echo "========================================"
}

# ═══════════════════════════════════════════════════════════════════════
# CONVENIENCE FUNCTIONS — Per-model shortcuts
# Usage: vllm_M6 [concurrency] [--text|--vision|--video|--audio]
# ═══════════════════════════════════════════════════════════════════════

vllm_M1() { vllm_swap M1 "$@" }
vllm_M2() { vllm_swap M2 "$@" }
vllm_M3() { vllm_swap M3 "$@" }
vllm_M4() { vllm_swap M4 "$@" }
vllm_M5() { vllm_swap M5 "$@" }
vllm_M6() { vllm_swap M6 "$@" }
vllm_M7() { vllm_swap M7 "$@" }
vllm_M8() { vllm_swap M8 "$@" }

# ═══════════════════════════════════════════════════════════════════════
# MODE SHORTCUTS — Preset serving configs
# ═══════════════════════════════════════════════════════════════════════

alias vllm_agentic='vllm_swap M6 2'
alias vllm_agentic_fast='vllm_swap M3 3'

alias vllm_writer='vllm_swap M5 1'
alias vllm_writer_moe='vllm_swap M6 1'

alias vllm_deep='vllm_swap M6 1'
alias vllm_deep_gemma='vllm_swap M5 1'

alias vllm_vision='vllm_swap M4 1 --vision'
alias vllm_vision_gemma='vllm_swap M5 1 --vision'

alias vllm_mtp='vllm_swap M7'

# ═══════════════════════════════════════════════════════════════════════
# QUICK API TEST
# ═══════════════════════════════════════════════════════════════════════

vllm_test() {
  local prompt="${1:-Hello, respond with 'OK' and nothing else.}"
  echo "Testing: $LOCAL_LLM_BASE_URL"
  curl -s "$LOCAL_LLM_BASE_URL/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $LOCAL_LLM_API_KEY" \
    -d "$(jq -n --arg p "$prompt" '{
      model: "local-llm",
      messages: [{role: "user", content: $p}],
      max_tokens: 100,
      temperature: 0.1
    }')" | jq -r '.choices[0].message.content // .error // "FAILED"'
}

vllm_models() {
  # `.data[]?` keeps jq from erroring out when the response has no .data array
  curl -s "$LOCAL_LLM_BASE_URL/models" | jq -r '.data[]?.id // "No models found"'
}

# ═══════════════════════════════════════════════════════════════════════
# HELP — only shown when explicitly called, NOT on source
# ═══════════════════════════════════════════════════════════════════════

vllm_help() {
  cat <<'HELPEOF'
+======================================================================+
|            local_llm_yolo — Command Reference                        |
+======================================================================+
|                                                                      |
|  QUICK-START COMMAND MATRIX (Text-Only)                              |
|  No number = model default. Dashes = outside recommended range.      |
|  Context sizes: C=1 → 131K, C=2 → 82K, C=3 → 41K tokens            |
|                                                                      |
|  Model                             C=1          C=2          C=3     |
|  ─────────────────────────────────────────────────────────────────── |
|  M1 Firworks-Qwen3-VL-32B         vllm_M1      vllm_M1 2      —     |
|  M2 LilaRest-gemma-4-31B-turbo    vllm_M2 1    vllm_M2      vllm_M2 3|
|  M3 nvidia-Gemma-4-26B-A4B           —         vllm_M3      vllm_M3 3|
|  M4 OptimizeLLM-Qwen3-VL-30B-A3B  vllm_M4         —            —     |
|  M5 RedHatAI-gemma-4-31B-it       vllm_M5      vllm_M5 2      —     |
|  M6 RedHatAI-Qwen3.6-35B-A3B      vllm_M6      vllm_M6 2      —     |
|  M7 sakamakismile-Qwen3.6-27B-MTP    —         vllm_M7      vllm_M7 3|
|  M8 unsloth-Qwen3.6-27B           vllm_M8      vllm_M8 2      —     |
|                                                                      |
|  VISION MODE: Append --vision to any command.                        |
|  Context sizes: C=1 → 90K, C=2 → 66K (C=3 → override to C=2)         |
|  Image limits: C=1 → 3 images, C=2 → 1 image (C=3 unsupported)       |
|  --limit-mm-per-prompt format: '{"image": N, "video": 0}'            |
|  Vision-capable: M1, M3, M4, M5, M6, M8                              |
|  Text-only (no vision): M2, M7                                       |
|                                                                      |
|  EXAMPLES (both forms are equivalent):                               |
|    vllm_swap M6 2   ==   vllm_M6 2        (C=2 text)                |
|    vllm_swap M1 --vision == vllm_M1 --vision (C=1 vision)           |
|    vllm_swap M7 3   ==   vllm_M7 3        (C=3 text, MTP)           |
|                                                                      |
+======================================================================+
| Model Registry:                                                      |
|   M1  Firworks-Qwen3-VL-32B-Thinking      21.9GB  VL MoE-          |
|   M2  LilaRest-gemma-4-31B-it-turbo       15.3GB     MoE-          |
|   M3  nvidia-Gemma-4-26B-A4B              18.8GB  VL MoE+          |
|   M4  OptimizeLLM-Qwen3-VL-30B-A3B        19.2GB  VL MoE+          |
|   M5  RedHatAI-gemma-4-31B-it             23.3GB  VL MoE-          |
|   M6  RedHatAI-Qwen3.6-35B-A3B            25.1GB  VL MoE+          |
|   M7  sakamakismile-Qwen3.6-27B-Text-MTP  19.7GB     MoE-  (MTP)   |
|   M8  unsloth-Qwen3.6-27B                 ~19GB   VL MoE-          |
|                                                                      |
| Core:                                                                |
|   vllm_swap <M1..M8> [1|2|3] [--text|--vision]                     |
|       Swap model. Args in any order.                                 |
|       Default concurrency per model. Tools always enabled.           |
|       --text    Text-only (larger context, no vision encoder VRAM)   |
|       --vision  Enable vision (smaller context for VRAM overhead)    |
|   vllm_status                              Check status              |
|   vllm_kill                                Kill vLLM                 |
|   vllm_test [prompt]                       Test API                  |
|   vllm_wait_health [timeout]               Wait healthy              |
|   vllm_models                              List models               |
|   vllm_help                                This help                 |
|                                                                      |
| Quick Swap:  vllm_M{1..8} [conc] [--text|--vision]                 |
|                                                                      |
| Modes:                                                               |
|   vllm_agentic       -> M6 C=2          (agentic work)              |
|   vllm_agentic_fast  -> M3 C=3          (max throughput)            |
|   vllm_writer        -> M5 C=1          (JFM writing)               |
|   vllm_writer_moe    -> M6 C=1          (MoE reasoning)             |
|   vllm_deep          -> M6 C=1          (max context)               |
|   vllm_deep_gemma    -> M5 C=1          (Gemma deep)                |
|   vllm_vision        -> M4 C=1 --vision (primary vision)            |
|   vllm_vision_gemma  -> M5 C=1 --vision (Gemma vision)             |
|   vllm_mtp           -> M7              (MTP speculative)           |
|                                                                      |
| Claude Code:                                                         |
|   local-yolo  -> Local vLLM, no permission prompts                  |
|   cc-local    -> Local vLLM, with permission prompts                |
+======================================================================+
HELPEOF
}

@mergify
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ExtReMLapin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
Conflict resolutions (qwen3coder_tool_parser.py + its test file):

- _convert_param_value: kept this branch's detailed type-coercion logic
  (nullable string/None handling, container double-decode for buggy
  templates) instead of main's refactor to utils.coerce_to_schema_type /
  extract_types_from_schema (vllm-project#38973). Restored `import ast` that vllm-project#38973
  had removed. Kept main's vllm-project#42292 change
  (supports_required_and_named = not VLLM_ENFORCE_STRICT_TOOL_CALLING).

- Tests: kept this branch's rewritten Coder test file. Re-integrated the
  two anyOf tests vllm-project#38973 added: the comprehensive non-streaming + streaming
  cases stay Coder-specific (they assert {"type": ["integer","null"]} -> int
  and whitespace-stripped values, which only hold for the Coder parser);
  the genuinely cross-parser anyOf[array,null] -> list case was added to the
  shared test file, parametrized over both XML and Coder parsers.

All 190 qwen3 tool-parser tests pass; ruff check/format unchanged vs the
pre-merge branch head.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mergify mergify Bot removed the needs-rebase label May 27, 2026
Relocate the anyOf / nullable type-resolution tests (originally added by
vllm-project#38973 to the Coder-only file) into the shared XML/Coder suite,
parametrized over both parsers, so the coverage applies to both.

To make the JSON-Schema list-form type {"type": ["integer", "null"]}
resolve consistently across parsers, teach the XML parser's
_get_param_type to pick the first non-null entry of a list-form type
(it already did this for anyOf). Both parsers now coerce it to int.
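The rule described above (take the first non-null entry, whether the schema uses a list-form `type` or `anyOf`) can be sketched as follows. This is not the parsers' actual `_get_param_type`: the function names, the `"string"` fallback and the small coercion helper are assumptions made for illustration only.

```python
import json
from typing import Any


def resolve_param_type(schema: dict[str, Any], default: str = "string") -> str:
    """Return the first non-null JSON-Schema type named by `schema`."""
    declared = schema.get("type")
    # Treat a plain string type as a one-element list so both forms share a path.
    candidates = declared if isinstance(declared, list) else [declared]
    for name in candidates:
        if isinstance(name, str) and name != "null":
            return name
    # No usable `type`: look through the anyOf branches in order.
    for branch in schema.get("anyOf", []):
        if isinstance(branch, dict):
            resolved = resolve_param_type(branch, default="")
            if resolved:
                return resolved
    return default  # assumed fallback when nothing non-null is declared


def coerce_value(raw: str, schema: dict[str, Any]) -> Any:
    """Convert a raw parameter string according to the resolved type."""
    kind = resolve_param_type(schema)
    if kind == "integer":
        return int(raw.strip())
    if kind in ("array", "object"):
        return json.loads(raw)
    return raw
```

With this sketch, `{"type": ["integer", "null"]}` resolves to `"integer"` and `{"anyOf": [{"type": "string"}, {"type": "null"}]}` resolves to `"string"`, so a nullable string is no longer routed through `json.loads`.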

Ruff: replace try/except/pass with contextlib.suppress in both parsers
and run ruff format on the touched qwen3 files.
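For readers unfamiliar with the lint fix, the change is of the following shape. The commit does not show the affected call sites, so the JSON-decoding fallback below is a made-up stand-in and both function names are hypothetical; only `contextlib.suppress` itself is the standard-library tool the message refers to.

```python
import contextlib
import json
from typing import Any


def parse_with_try_pass(raw: str) -> Any:
    """Old style: swallow the decode error with try/except/pass."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    return raw


def parse_with_suppress(raw: str) -> Any:
    """Same behaviour, written with contextlib.suppress."""
    with contextlib.suppress(json.JSONDecodeError):
        return json.loads(raw)
    # Reached only when json.loads raised and the error was suppressed.
    return raw
```

Both functions return the decoded value for valid JSON and the original string otherwise; the second form just states the intent to ignore the exception more directly.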

Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mergify
Contributor

mergify Bot commented May 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ExtReMLapin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 28, 2026
@ExtReMLapin
Contributor Author

Do you think it would be better if I closed this PR and split it into several smaller ones?

@chaunceyjiang

One PR for the tests (because it also refactors them, moving them into a new file that runs against both qwen3_coder and qwen3_xml)?

And one PR for each fix?


Labels

bug, ci/build, cpu, deepseek, documentation, frontend, mistral, multi-modality, needs-rebase, new-model, nvidia, performance, qwen, rocm, speculative-decoding, structured-output, tool-calling, v1

7 participants