[Bugfix][ToolParser] Fix Qwen3 XML and Coder streaming tool call parser regressions #40861
ExtReMLapin wants to merge 25 commits into
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
…er tool calls Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
… + function name only) + delta2 (params + tool call end) was dropping params Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
… fallback Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Combined fixes for the XML and Coder tool parsers that surfaced once the two PR branches were merged together.

Qwen3XML parser:
* Reorder _convert_param_value: check string type BEFORE the "null" shortcut so a string param with literal value "null" stays "null" instead of becoming JSON null. Fix logger.warning argument count.
* _convert_for_json_streaming: emit "null" (not "") when converted_value is None so nullable integer/object params serialize correctly.
* _get_param_type: anyOf returns the first non-null type instead of falling back to "string" for nullable integer/boolean schemas.
* _preprocess_xml_chunk: defer streaming for boolean params (avoids emitting "false" on the first 't' of "true") and for all container types regardless of single-quote hint.
* _end_element deferred path: try json.loads BEFORE ast.literal_eval so arrays/objects containing JSON true/false/null parse natively; double-decode strings to recover from buggy json.dumps(str(dict)) templates.
* Add structural-aware helpers: _is_structural_tag_position, _get_valid_param_names, _is_structural_closing_tag (with partial-tag prefix safety), _chunk_has_structural_function_end, _chunk_has_structural_tool_call_end.
* _preprocess_xml_chunk: when SAX state is inside a parameter value, escape <tool_call>/<function=> always, and <parameter=NAME>/closing tags only when they are not structural delimiters.
* _process_complete_xml_elements: defer </parameter> when streaming with empty lookahead (more tokens may still arrive).
* parse_single_streaming_chunks: fallback close uses _chunk_has_structural_*_end instead of plain "in xml_chunk" so a literal </function> in a parameter value doesn't trigger a double close.
* extract_tool_calls_streaming: enable _streaming_mode=True on first delta.

Qwen3Coder parser:
* Reorder _convert_param_value the same way (string-first, then null).
* anyOf picks the first non-null type instead of treating it as "object".
* Container handling: try json.loads then double-decode via ast.literal_eval to recover from buggy json.dumps(str(dict)) outputs.
* Add structural-aware helpers: _next_structural_param_start, _find_true_function_end, _find_true_tool_call_end, _find_true_param_end (with require_lookahead for streaming).
* _parse_xml_function_call: top-level params are NOT filtered by schema (callers may rename fields) but nested boundaries inside a value ARE, so literal <parameter=...> lines in file content don't terminate the param early.
* _get_function_calls: structural-aware (</tool_call> must be followed by another <tool_call> or EOS; same for </function>).
* Streaming param_starts uses the helpers; </function> close check uses _find_true_function_end so a literal </function> in a value doesn't prematurely emit "}".
* tool_start_positions skips past each </tool_call> of completed calls so a literal <tool_call> inside a parameter value of a closed call doesn't spawn a phantom new tool call.
* Multi-tool-call delta (speculative decoding): when one tool call closes and another full <tool_call>...</tool_call> remains in current_text, advance manually and re-enter with a sentinel previous_text so reset_streaming_state isn't triggered (which would loop forever).

These fix the agentic-streaming bug where Qwen3.5 would freeze mid-tool-call when a parameter value contained <tool_call>, </parameter>, <parameter=NAME>, or </function> as literal text (e.g. writing a Jinja2 template, a heredoc, or any file describing the tool-call format), as well as several value-conversion bugs (string "null" -> JSON null, anyOf nullable -> wrong type, double-encoded objects -> string).

Add 16 regression tests in test_qwen3xml_tool_parser.py, 10 in test_qwen3coder_tool_parser.py, and a new test_qwen36_bugs.py covering bugs that span both parsers (XML array with JSON true/false, Coder multi-tool-call in one streaming delta). 98 tests pass across the three test files.
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
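The value-conversion fixes listed in the commit message above (string check before the "null" shortcut, first non-null anyOf branch, json.loads before ast.literal_eval with a double-decode step) can be illustrated with a minimal sketch. This is not the parsers' actual code: `get_param_type`, `convert_param_value` and `decode_container` are hypothetical names, and the schema handling is deliberately reduced to a few types.

```python
import ast
import json
from typing import Any


def get_param_type(schema: dict) -> str:
    """Return the declared type, picking the first non-null anyOf branch."""
    if "type" in schema:
        return schema["type"]
    for branch in schema.get("anyOf", []):
        branch_type = branch.get("type")
        if branch_type and branch_type != "null":
            return branch_type
    return "string"


def decode_container(raw: str) -> Any:
    """Try JSON first, then Python literals, then unwrap a double-encoded string."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = ast.literal_eval(raw)
    if isinstance(value, str):
        # Double-encoded case: json.dumps(str(some_dict)) produces a JSON
        # string whose content is a Python repr, so decode a second time.
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            value = ast.literal_eval(value)
    return value


def convert_param_value(raw: str, schema: dict) -> Any:
    """Convert raw parameter text to a Python value according to the schema."""
    param_type = get_param_type(schema)
    # The string check comes first, so the literal text "null" stays a string.
    if param_type == "string":
        return raw
    if raw.strip().lower() == "null":
        return None
    if param_type == "integer":
        return int(raw)
    if param_type == "boolean":
        return raw.strip().lower() == "true"
    if param_type in ("object", "array"):
        return decode_container(raw)
    return raw
```

Trying JSON first matters for arrays such as `[true, false, null]`, which are valid JSON but not valid Python literals. The sketch lets `ast.literal_eval` errors propagate; a real parser would presumably fall back to the raw string.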
Both the XML and Coder tool parsers were tested against nearly identical regression scenarios in their respective files (string "null" preservation, anyOf nullable schemas, double-encoded objects, content with literal XML structural tags, content with param-like lines, etc.). Split the shared expectations into a single file with a parametrized parser fixture so that:
* the same intent is tested against BOTH parsers automatically;
* divergent behaviour is caught immediately instead of drifting;
* parser-specific quirks (XML SAX double-close brace, char-by-char boolean streaming, Coder speculative-decoding chunk loss, etc.) stay in their parser-specific test file.

New: tests/tool_parsers/test_qwen3_xml_coder_shared.py exposes a ``parser_cls`` fixture parametrized over Qwen3XMLToolParser and Qwen3CoderToolParser. Each shared test runs twice and prints ``[xml]``/``[coder]`` in the test id.

Removed duplicates from:
* tests/tool_parsers/test_qwen3xml_tool_parser.py: anyOf object param (streaming + non-streaming), string null preservation, anyOf integer/null type detection, content with structural tags (streaming + non-streaming), content with param-like lines (streaming + non-streaming), double-encoded object (streaming + non-streaming).
* tests/tool_parsers/test_qwen3coder_tool_parser.py: anyOf parameter not double encoded, string null preservation, anyOf string/null numeric value, content with XML structural tags (streaming + non-streaming), content with param-like lines (streaming + non-streaming), double-encoded object (streaming + non-streaming), content param with tool_call tag (streaming + non-streaming; redundant with content_with_xml_structural_tags).

Removed: tests/tool_parsers/test_qwen36_bugs.py. Its two scenarios (XML array containing JSON ``true``, Coder two complete tool calls in a single streaming delta) are now in the shared file as ``test_array_with_json_bool`` and ``test_two_tool_calls_in_one_streaming_chunk``, both running against both parsers.

Net effect: 209 -> 183 tests, 0 failures, identical coverage.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
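A parametrized ``parser_cls`` fixture of the kind described in this commit message might look like the sketch below. `FakeXMLParser` and `FakeCoderParser` are hypothetical stand-ins so the example is self-contained; the real shared file parametrizes over Qwen3XMLToolParser and Qwen3CoderToolParser, whose constructors and methods are not reproduced here.

```python
import pytest


class FakeXMLParser:
    """Hypothetical stand-in for Qwen3XMLToolParser."""

    def convert(self, raw: str, param_type: str):
        # String-first: the literal text "null" is preserved for string params.
        if param_type == "string":
            return raw
        return None if raw == "null" else raw


class FakeCoderParser(FakeXMLParser):
    """Hypothetical stand-in for Qwen3CoderToolParser."""


@pytest.fixture(params=[FakeXMLParser, FakeCoderParser], ids=["xml", "coder"])
def parser_cls(request):
    # pytest runs every test that requests this fixture once per class.
    return request.param


def test_string_null_is_preserved(parser_cls):
    parser = parser_cls()
    assert parser.convert("null", "string") == "null"
```

With the `ids` given above, pytest reports the two runs as `test_string_null_is_preserved[xml]` and `test_string_null_is_preserved[coder]`, so a divergence between the parsers shows up as a single failing id.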
Move all generic regression tests (basic extraction, type conversion,
streaming variants, robustness) from the Coder-specific file into the
shared parametrized file so each test runs against both parsers. Only
behaviour that genuinely differs between the two parsers stays
parser-specific:
- Coder-only: ``streaming_split_tag`` (relies on ``is_tool_call_started``)
and ``streaming_various_chunk_sizes`` (XML SAX cannot tolerate
single-character chunks).
- XML-only: ``streaming_missing_opening_tool_call_tag`` (Coder does not
recover from a missing ``<tool_call>`` opener in streaming mode).
Two assertions were relaxed in the shared file to accept both legitimate
behaviours: content between parallel tool calls (``None`` vs ``"\\n"``)
and the streaming header arguments value (``""`` vs ``"{"``).
Test count rises from 99 to 138 (+39 from cross-parser parametrization)
while ``test_qwen3coder_tool_parser.py`` shrinks from 1260 to 162 lines.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
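The Coder-only tests named in this commit message concern tags split across chunks and arbitrary chunk sizes. One common way to make a streaming parser insensitive to chunk boundaries is to hold back any buffer suffix that could still turn into a tag. The sketch below shows that idea only; `tag_prefix_overlap` and `stream_content` are hypothetical names, not vLLM's helpers, and tool-call parsing itself is left out.

```python
def tag_prefix_overlap(text: str, tag: str) -> int:
    """Length of the longest suffix of text that is a proper prefix of tag."""
    for size in range(min(len(text), len(tag) - 1), 0, -1):
        if text.endswith(tag[:size]):
            return size
    return 0


def stream_content(chunks, tag="<tool_call>"):
    """Yield plain-text deltas, holding back anything that may start the tag."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if tag in buffer:
            before, _, _ = buffer.partition(tag)
            if before:
                yield before
            return  # a real parser would switch to tool-call mode here
        keep = tag_prefix_overlap(buffer, tag)
        safe = buffer[: len(buffer) - keep]
        if safe:
            yield safe
        # Keep only the ambiguous suffix for the next chunk.
        buffer = buffer[len(buffer) - keep:]
    if buffer:
        # End of stream: the held-back suffix was ordinary text after all.
        yield buffer
```

A chunk-size test then feeds the same text split at every possible width and checks that the concatenated deltas are identical each time.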
Code Review
This pull request significantly improves the robustness of the Qwen3 XML and Coder tool parsers, particularly for streaming scenarios involving speculative decoding and complex parameter types. Key changes include structural-aware parsing to correctly handle XML tags appearing as literal text within parameter values, improved handling of nullable/anyOf schemas, and fixes for streaming bugs where partial tokens or multi-tool bursts could lead to data loss or incorrect type conversion. I have reviewed the implementation and identified a potential issue where content from recursive tool call processing might be lost; please apply the suggested fix to ensure all model output is correctly concatenated.
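To make "structural-aware parsing" concrete: the commit message states that a `</tool_call>` only counts as a delimiter when it is followed by another `<tool_call>` or the end of the text. A minimal, non-streaming sketch of that rule follows; `find_structural_tool_call_end` is a hypothetical name and not the PR's helper.

```python
TOOL_CALL_START = "<tool_call>"
TOOL_CALL_END = "</tool_call>"


def find_structural_tool_call_end(text: str, start: int = 0) -> int:
    """Return the index of the first structural </tool_call>, or -1."""
    pos = text.find(TOOL_CALL_END, start)
    while pos != -1:
        rest = text[pos + len(TOOL_CALL_END):].lstrip()
        # Structural only if nothing follows, or another tool call starts.
        if rest == "" or rest.startswith(TOOL_CALL_START):
            return pos
        # Otherwise treat it as literal text inside a parameter value.
        pos = text.find(TOOL_CALL_END, pos + 1)
    return -1
```

This is a heuristic: a value that contains `</tool_call>` immediately followed by `<tool_call>` would still be misread. In streaming mode "end of text" is also not known yet, which is presumably why the PR's helpers take a lookahead requirement.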
Manual testing covered some extreme corner cases, such as writing special tokens inside a tool call: asking a tool to write Python code whose string literals contain special tokens into a file, with the whole exchange streamed.
Apologies if this is the wrong place to ask, but as you must be deeply familiar with these parsers: is there a technical reason why we (still) have two Qwen3 tool parsers? Does Qwen3Coder offer anything over Qwen3XML? I know that Qwen has Qwen3Coder in their Qwen3.5 documentation/releases, but there is also this: #25028 (comment) (+ follow-up comment; comments from the PR with the tool parser contribution from the Qwen team)
chaunceyjiang
left a comment
Thanks, I have a question. What issues does the current tool parser have?
I noticed you’ve made quite a lot of changes to these two tool parsers. This might take quite some time to review.
/cc @sfeng33
TL;DR
Every single added test is its own bug report. There is no added test that was passing on main.
git fetch origin pull/40861/head:pr-40861 && git checkout pr-40861
# 1) Run all qwen3 tests with this PR's code → everything passes
pytest tests/tool_parsers/test_qwen3coder_tool_parser.py \
tests/tool_parsers/test_qwen3xml_tool_parser.py \
tests/tool_parsers/test_qwen3_xml_coder_shared.py
# → 168 passed
# 2) Now restore ONLY the parsers from main (keep the new tests)
git checkout main -- vllm/tool_parsers/qwen3coder_tool_parser.py \
vllm/tool_parsers/qwen3xml_tool_parser.py
pytest tests/tool_parsers/test_qwen3coder_tool_parser.py \
tests/tool_parsers/test_qwen3xml_tool_parser.py \
tests/tool_parsers/test_qwen3_xml_coder_shared.py
# → 66 failed, 102 passed
# 3) Restore the parsers
git checkout pr-40861 -- vllm/tool_parsers/qwen3coder_tool_parser.py \
vllm/tool_parsers/qwen3xml_tool_parser.py
The 102 still-passing tests are tests that already existed on main. The 66 failing ones are exactly the new tests added in this PR; each one is a real symptom I hit in production with Qwen3.5 / Qwen3.6.
What kind of bugs (so you can review by category, not by line count)
The 66 failures fall into ~6 independent categories. Each category is self-contained, and you can read / merge them independently if you prefer to split the PR:
How I found them
AI assistance: to be explicit, Claude Opus wrote most of the test scaffolding and several of the fixes under my supervision; I read every changed line, ran them against my real Qwen3 traffic, and I'm the one defending the change end-to-end.
What really worries me as of today is the whole parsing ecosystem (not an issue restricted to vLLM). IMO each message should be tokenized on its own, isolated from the others, then reasoning, then tool calls, instead of losing message isolation by applying the chat template. Not sure if there is a dedicated place to discuss this.
I'm not following what you mean here. We aren't parsing for reasoning and tools on incoming messages. We only apply the chat template on incoming messages. We do not apply a chat template to the model's generated outputs. We do parse for reasoning and tool content in the model's generated outputs.
You're right that vLLM doesn't apply the chat template to the model's output directly; my point is that it can fail on the next conversation turn:
Step 4 is where a bad parse in step 2 breaks the whole thing. My point is that parsing/templating at the conversation level allows bad parses to propagate: we destroy the initially correctly "checkpointed" parsed messages, whereas parsing at the message level isolates the parsing issues. Again, I'm not sure this is the right place to discuss this; I'm fine with it, but I fear maintainers might not appreciate that. Edit: please pardon my previous phrasing, I'm out of bandwidth with everything at the office, I'm working overtime to get things together and it's a mess.
I also noticed that behavior, and my workaround was to ask the model not to use its actual special tokens but placeholders, e.g., Anyway, I think you only really hit this issue if you work on chat templates and model output parsers.
I agree, and I also agree it can be an edge case in this particular scenario. If tomorrow one model family takes 90% of the market share because it is simply better than the rest, you can't rely on the other, less intelligent models to fix an issue with this model family. Self-compiling compiler, self-fixing LLM.
Do you feel those PRs made things less stable, or just that they didn't fix the issues you had?
I don't believe it has introduced more instability, but I don't think it has fixed the model stopping issue. Can you confirm the two PRs fix model stopping issues for you, and if so, what arguments and model are you using?
I feel like it reduced issues on 3.6 27B, qwen3_coder (not much difference with xml). But it CLEARLY fixed issues with model chat template introspection, which is a very specific use case, I'll give you that. Tensor parallel 2, official FP8 model, with preserve_thinking set to true, 200k context.
Conflict resolutions:
- vllm/parser/abstract_parser.py: kept the reasoning_from_transition restoration block adjacent to the history_tool_call_cnt counter added by main; the two blocks are independent (no shared state).
- vllm/tool_parsers/qwen3coder_tool_parser.py: merged the new structural_tag_registry imports with the existing partial_tag_overlap import; preserved the speculative-decoding recursion and trailing free-text emission logic from HEAD and appended the get_structural_tag method introduced by main right after extract_tool_calls_streaming.
- tests/tool_parsers/test_qwen3coder_tool_parser.py: dropped the test bodies that were re-introduced by the merge but had already been factored into tests/tool_parsers/test_qwen3_xml_coder_shared.py during the qwen3_combined_fixes refactor. Cleaned the matching unused imports.

Cross-parser coverage:
- Added get_structural_tag to Qwen3XMLToolParser using the same qwen_3_5 model registration as the Coder parser, so the XML parser also exposes a valid StructuralTag.
- Moved the three structural_tag tests added in 844df54 (and the _as_chat_completion_tools helper) into test_qwen3_xml_coder_shared.py so they run against both Qwen3XMLToolParser and Qwen3CoderToolParser via the parser_cls fixture.

Note: Qwen3CoderToolParser.supports_required_and_named is set to False by main; the same flag was intentionally left at its True default on Qwen3XMLToolParser pending a separate decision.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Merged with Claude Code. @Seven-Streams, in #40894 you added some tests; I moved those tests into a shared Qwen3 XML/Coder file to ensure they cover both parsers.
As a harness user I noticed that with this PR and #40783 there are far fewer parsing issues when using the chat completion API, streaming with Qwen3.6-27B-FP8. I had runs with 100s of tool calls and they all seemed fine. Without the PRs there are:
Even with these PRs I sometimes see reasoning output that appears to be cut off (e.g., the last sentence ends without '.\n'). I think this is a parser/streaming issue, because with non-streaming I haven't observed the model generating sentences without a proper end. (Also, while this isn't relevant in most cases, special tokens can still confuse the parsers, e.g., if I let the model review these PRs, reasoning can bleed into content.) @chaunceyjiang @sfeng33 /cc: @qmx @hickeyma @hmellor @stakeswky @ywang96 (apologies for spamming all recent code contributors of the qwen parsers)
Because of this tool calling issue, it has been a mess to use Qwen 3.6 35B A3B for agentic tasks. It works fine as long as I'm asking it to do targeted fixes and I provide the full content of the files, since it's not able to use the file system, and rather than calling it through a tool call I call it directly using a curl command. I'm using it on a single RTX 5090 in NVFP4 quant.

Serving wrapper:

Not sure if this is the right place to ask this question, but what's the current status of this fix? I recently discovered this solution on Hugging Face, froggeric/Qwen-Fixed-Chat-Templates, but even after applying it I was still facing tool call issues. I still need to check whether I did something incorrectly or whether my agent messed it up (I was using Gemma 4 31B to fix this 💀), but it was late and I was feeling sleepy, so I still have to check the git logs more thoroughly and figure out whether my agent messed something up or whether this is a vLLM and/or Qwen 3.6 model family issue, because this doesn't happen with Gemma models.

FYI I'm getting this error in Opencode, Claude Code, and Droid; all three give the same error. It works if you simply invoke the local model with a direct curl command against the vLLM API (not sure what that curl thingy is called, just a random multiphase turbulence researcher here 💀). With concurrency 2 it rips through at over 200 TPS output tokens at a context of about 90k.

I just want to know how to get this damn tool call to work in Qwen 3.6 35B-A3B and Qwen 27B, and if possible Qwen 3V models too. Any help pointing me in the right direction would be deeply appreciated. Forgive me in case I barged into the wrong place.

My VLM serving wrapper.
This still needs more work, because I can squeeze more context window out of the Gemma and Qwen 3.6 families, since the attention mechanism in these models is very different from Qwen 3V. The script below treats the attention mechanism of all models like Qwen 3V's to be on the safe side, since I didn't understand that previously. So I've got a beta wrapper in progress which incorporates that, but that's still in the testing phase. Currently the wrapper below is the workhorse for everyday text and vision related grunt work, to save tokens from my Codex Plus plan 💀.

# ~/.config/zsh/local_llm_yolo.zsh
# Local LLM YOLO — Shell profile for vLLM model management
# Source: source ~/vllm_serving_scripts/v2/lib/local_llm_yolo.zsh
# RTX 5090 32GB | 8 NVFP4 models | Concurrency 1/2/3 only | 1 model at a time
# ═══════════════════════════════════════════════════════════════════════
# MODEL REGISTRY
# ═══════════════════════════════════════════════════════════════════════
declare -A VLLM_MODELS
VLLM_MODELS=(
M1 "Firworks-Qwen3-VL-32B-Thinking"
M2 "LilaRest-gemma-4-31B-it-turbo"
M3 "nvidia-Gemma-4-26B-A4B"
M4 "OptimizeLLM-Qwen3-VL-30B-A3B"
M5 "RedHatAI-gemma-4-31B-it"
M6 "RedHatAI-Qwen3.6-35B-A3B"
M7 "sakamakismile-Qwen3.6-27B-Text-MTP"
M8 "unsloth-Qwen3.6-27B"
)
declare -A VLLM_HF_IDS
VLLM_HF_IDS=(
M1 "Firworks/Qwen3-VL-32B-Thinking-NVFP4"
M2 "LilaRest/gemma-4-31B-it-NVFP4-turbo"
M3 "nvidia/Gemma-4-26B-A4B-NVFP4"
M4 "OptimizeLLM/Qwen3-VL-30B-A3B-Thinking-NVFP4"
M5 "RedHatAI/gemma-4-31B-it-NVFP4"
M6 "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
M7 "sakamakismile/Qwen3.6-27B-Text-MTP"
M8 "unsloth/Qwen3.6-27B-NVFP4"
)
declare -A VLLM_QUANT
VLLM_QUANT=(
M1 "compressed-tensors"
M2 "modelopt"
M3 "modelopt"
M4 "compressed-tensors"
M5 "compressed-tensors"
M6 "compressed-tensors"
M7 "modelopt"
M8 "compressed-tensors"
)
declare -A VLLM_PARSER_REASON
VLLM_PARSER_REASON=(
M1 "qwen3"
M2 "gemma4"
M3 "gemma4"
M4 "qwen3"
M5 "gemma4"
M6 "qwen3"
M7 "qwen3"
M8 "qwen3"
)
declare -A VLLM_PARSER_TOOL
VLLM_PARSER_TOOL=(
M1 "hermes"
M2 "gemma4"
M3 "gemma4"
M4 "hermes"
M5 "gemma4"
M6 "hermes"
M7 "hermes"
M8 "hermes"
)
declare -A VLLM_VISION
VLLM_VISION=(
M1 "YES"
M2 "NO"
M3 "YES"
M4 "YES"
M5 "YES"
M6 "YES"
M7 "NO"
M8 "YES"
)
declare -A VLLM_MOE
VLLM_MOE=(
M1 "NO"
M2 "NO"
M3 "YES"
M4 "YES"
M5 "NO"
M6 "YES"
M7 "NO"
M8 "NO"
)
declare -A VLLM_VIDEO
VLLM_VIDEO=(
M1 "NO" M2 "NO" M3 "NO" M4 "NO"
M5 "NO" M6 "NO" M7 "NO" M8 "NO"
)
declare -A VLLM_AUDIO
VLLM_AUDIO=(
M1 "NO" M2 "NO" M3 "NO" M4 "NO"
M5 "NO" M6 "NO" M7 "NO" M8 "NO"
)
declare -A VLLM_DEFAULT_CONCURRENCY
VLLM_DEFAULT_CONCURRENCY=(
M1 "1" M2 "2" M3 "2" M4 "1"
M5 "1" M6 "1" M7 "2" M8 "1"
)
declare -A VLLM_CONCURRENCY_RANGE
VLLM_CONCURRENCY_RANGE=(
M1 "1-2" M2 "1-3" M3 "2-3" M4 "1"
M5 "1-2" M6 "1-2" M7 "2-3" M8 "1-2"
)
declare -A VLLM_LOCAL_PATHS
VLLM_LOCAL_PATHS=(
M1 "/home/abhimanyu/local_llm_models/Firworks-Qwen3-VL-32B-Thinking-nvfp4"
M2 "/home/abhimanyu/local_llm_models/LilaRest-gemma-4-31B-it-NVFP4-turbo"
M3 "/home/abhimanyu/local_llm_models/nvidia-Gemma-4-26B-A4B-NVFP4"
M4 "/home/abhimanyu/local_llm_models/OptimizeLLM-Qwen3-VL-30B-A3B-Thinking-NVFP4"
M5 "/home/abhimanyu/local_llm_models/RedHatAI-gemma-4-31B-it-NVFP4"
M6 "/home/abhimanyu/local_llm_models/RedHatAI-Qwen3.6-35B-A3B-NVFP4"
M7 "/home/abhimanyu/local_llm_models/sakamakismile-Qwen3.6-27B-Text-NVFP4-MTP"
M8 "/home/abhimanyu/local_llm_models/unsloth-Qwen3.6-27B-NVFP4"
)
# ═══════════════════════════════════════════════════════════════════════
# ENDPOINT
# ═══════════════════════════════════════════════════════════════════════
export LOCAL_LLM_BASE_URL="http://127.0.0.1:8000/v1"
export LOCAL_LLM_API_KEY="dummy"
# ═══════════════════════════════════════════════════════════════════════
# CONCURRENCY GUARD
# ═══════════════════════════════════════════════════════════════════════
_vllm_check_concurrency() {
local c=$1
if [[ "$c" != "1" && "$c" != "2" && "$c" != "3" ]]; then
echo "ERROR: Concurrency must be 1, 2, or 3. Got: $c" >&2
echo "ATTEMPTED_OP: vllm serve with --max-num-seqs $c" >&2
echo "REASON: Only concurrency 1/2/3 allowed on RTX 5090 32GB" >&2
echo "ACTION_REQUIRED: Use concurrency 1 (Deep Thinker), 2 (Balanced), or 3 (Max throughput)" >&2
return 1
fi
return 0
}
# ═══════════════════════════════════════════════════════════════════════
# CORE FUNCTIONS
# ═══════════════════════════════════════════════════════════════════════
vllm_kill() {
echo "Killing existing vLLM processes..."
pkill -f "vllm serve" 2>/dev/null
sleep 2
if pgrep -f "vllm serve" >/dev/null 2>&1; then
echo "WARNING: vLLM still running, force killing..."
pkill -9 -f "vllm serve" 2>/dev/null
sleep 1
fi
# Kill orphaned vLLM GPU processes (parent APIServer dead)
# EngineCore children use multiprocessing.spawn — their /proc/cmdline may not
# contain "vllm", so we also check the nvidia-smi process name (VLLM::EngineCore)
# and the vllm venv python path as fallback signals.
local orphan_pids=""
local gpu_pids
local gpu_info
gpu_info=$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader 2>/dev/null)
gpu_pids=$(echo "$gpu_info" | awk -F', ' '{print $1}' | tr -d ' ')
for pid in ${(f)gpu_pids}; do
[[ -z "$pid" ]] && continue
local is_vllm=false
# Check 1: nvidia-smi process name contains VLLM (case-insensitive)
if echo "$gpu_info" | grep -qi "VLLM" 2>/dev/null; then
local pname=$(echo "$gpu_info" | grep "^ *${pid}," | awk -F', ' '{print $2}')
if [[ "$pname" == *"VLLM"* ]]; then
is_vllm=true
fi
fi
# Check 2: /proc/cmdline contains "vllm"
if ! $is_vllm && [[ -r /proc/$pid/cmdline ]] && grep -ql "vllm" /proc/$pid/cmdline 2>/dev/null; then
is_vllm=true
fi
# Check 3: process running from .vllm_venv python (vLLM's venv)
if ! $is_vllm && [[ -r /proc/$pid/cmdline ]] && grep -ql ".vllm_venv" /proc/$pid/cmdline 2>/dev/null; then
is_vllm=true
fi
if $is_vllm; then
orphan_pids="$orphan_pids $pid"
fi
done
if [[ -n "$orphan_pids" ]]; then
echo "Cleaning up orphaned vLLM GPU processes:$orphan_pids"
echo "$orphan_pids" | tr ' ' '\n' | grep -v '^$' | xargs -r kill -9 2>/dev/null
sleep 1
fi
echo "vLLM stopped."
}
vllm_status() {
if pgrep -f "vllm serve" >/dev/null 2>&1; then
echo "vLLM is RUNNING"
echo "PID: $(pgrep -f 'vllm serve' | head -1)"
echo "Model: ${VLLM_MODELS[$VLLM_CURRENT_MODEL]:-unknown} ($VLLM_CURRENT_MODEL)"
echo "Modalities: ${VLLM_CURRENT_MODALITIES:-?}"
echo "Concurrency: ${VLLM_CURRENT_CONCURRENCY:-?}"
echo "Max Context: ${VLLM_CURRENT_MAX_LEN:-?} tokens"
local health=$(curl -s http://127.0.0.1:8000/health 2>/dev/null)
if [[ "$health" == *"ok"* ]] || [[ "$health" == *"200"* ]]; then
echo "Health: OK"
else
echo "Health: WAITING (model loading...)"
fi
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader 2>/dev/null
else
echo "vLLM is STOPPED"
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader 2>/dev/null
fi
}
vllm_wait_health() {
local timeout=${1:-300}
local elapsed=0
echo "Waiting for vLLM health (timeout: ${timeout}s)..."
while [[ $elapsed -lt $timeout ]]; do
local health=$(curl -s http://127.0.0.1:8000/health 2>/dev/null)
if [[ "$health" == *"ok"* ]] || [[ "$health" == *"200"* ]]; then
echo "vLLM is healthy after ${elapsed}s"
return 0
fi
sleep 2
elapsed=$((elapsed + 2))
echo -ne " Waiting... ${elapsed}s\r"
done
echo ""
echo "ERROR: vLLM did not become healthy within ${timeout}s" >&2
return 1
}
# ═══════════════════════════════════════════════════════════════════════
# vllm_swap — Main entry point
# Usage: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]
# ═══════════════════════════════════════════════════════════════════════
vllm_swap() {
local model_id=""
local concurrency=""
local modality=""
for arg in "$@"; do
case "$arg" in
M[1-8])
model_id="$arg"
;;
1|2|3)
concurrency="$arg"
;;
--text|--vision|--video|--audio)
modality="${arg#--}"
;;
*)
echo "ERROR: Unknown argument: $arg" >&2
echo "USAGE: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]" >&2
return 1
;;
esac
done
if [[ -z "$model_id" ]] || [[ -z "${VLLM_HF_IDS[$model_id]}" ]]; then
echo "ERROR: Invalid or missing model ID. Use M1-M8." >&2
echo "Available: M1 M2 M3 M4 M5 M6 M7 M8" >&2
echo "USAGE: vllm_swap <M1..M8> [1|2|3] [--text|--vision|--video|--audio]" >&2
return 1
fi
if [[ -z "$concurrency" ]]; then
concurrency="${VLLM_DEFAULT_CONCURRENCY[$model_id]}"
fi
_vllm_check_concurrency "$concurrency" || return 1
local range="${VLLM_CONCURRENCY_RANGE[$model_id]}"
local range_lo="${range%-*}"
local range_hi="${range#*-}"
if [[ "$concurrency" -lt "$range_lo" || "$concurrency" -gt "$range_hi" ]]; then
echo "WARNING: Concurrency $concurrency outside recommended range [$range] for $model_id"
fi
if [[ -z "$modality" ]]; then
modality="text"
fi
local use_vision="no"
local use_video="no"
local use_audio="no"
local modalities_str="text"
case "$modality" in
vision)
if [[ "${VLLM_VISION[$model_id]}" != "YES" ]]; then
echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support vision."
echo " Falling back to text-only."
local vl_models=()
for m in M1 M2 M3 M4 M5 M6 M7 M8; do
[[ "${VLLM_VISION[$m]}" == "YES" ]] && vl_models+=("$m (${VLLM_MODELS[$m]})")
done
echo " Vision-capable models: ${vl_models[*]}"
modality="text"
else
use_vision="yes"
modalities_str="text+vision"
fi
;;
video)
if [[ "${VLLM_VIDEO[$model_id]}" != "YES" ]]; then
echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support video."
echo " Falling back to text-only."
local vid_models=()
for m in M1 M2 M3 M4 M5 M6 M7 M8; do
[[ "${VLLM_VIDEO[$m]}" == "YES" ]] && vid_models+=("$m (${VLLM_MODELS[$m]})")
done
echo " Video-capable models: ${vid_models[*]}"
modality="text"
else
use_video="yes"
modalities_str="text+video"
fi
;;
audio)
if [[ "${VLLM_AUDIO[$model_id]}" != "YES" ]]; then
echo "WARNING: $model_id (${VLLM_MODELS[$model_id]}) does not support audio."
echo " Falling back to text-only."
local aud_models=()
for m in M1 M2 M3 M4 M5 M6 M7 M8; do
[[ "${VLLM_AUDIO[$m]}" == "YES" ]] && aud_models+=("$m (${VLLM_MODELS[$m]})")
done
echo " Audio-capable models: ${aud_models[*]}"
modality="text"
else
use_audio="yes"
modalities_str="text+audio"
fi
;;
text)
;;
esac
echo "Swapping to ${VLLM_MODELS[$model_id]} | C=$concurrency | $modalities_str"
# Vision concurrency guard: C=3 not supported for vision
if [[ "$use_vision" == "yes" || "$use_video" == "yes" ]]; then
if [[ "$concurrency" -eq 3 ]]; then
echo "WARNING: Vision mode does not support C=3 on RTX 5090. Overriding to C=2."
concurrency=2
fi
fi
vllm_kill
# Wait for GPU memory to actually free after kill
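local gpu_wait=0
# Declare once outside the loop: re-running a bare `local used` in zsh prints the variable on later iterations.
local used=""
while [[ $gpu_wait -lt 30 ]]; do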
used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -d ' ')
if [[ -n "$used" && "$used" -lt 500 ]]; then
break
fi
sleep 2
gpu_wait=$((gpu_wait + 2))
done
if [[ $gpu_wait -ge 30 ]]; then
echo "WARNING: GPU memory may not be fully freed (${used}MB still used)" >&2
fi
if ! source ~/.vllm_venv/bin/activate 2>/dev/null; then
echo "ERROR: Failed to activate venv at ~/.vllm_venv/" >&2
echo "ATTEMPTED_OP: source venv for vllm serve" >&2
echo "ACTION_REQUIRED: Run setup_python_venv.sh or verify ~/.vllm_venv/bin/activate exists" >&2
return 1
fi
# Validate local model path
local model_path_check="${VLLM_LOCAL_PATHS[$model_id]}"
if [[ -n "$model_path_check" && -d "$model_path_check" ]]; then
local sf_files=("${model_path_check}"/*.safetensors(N))
if [[ ${#sf_files} -eq 0 ]]; then
echo "WARNING: No .safetensors files found in $model_path_check" >&2
echo " Model files may be incomplete. Proceeding with local path anyway." >&2
fi
elif [[ -n "$model_path_check" ]]; then
echo "WARNING: Local path does not exist: $model_path_check" >&2
echo " Will fall back to HuggingFace download." >&2
fi
local hf_id="${VLLM_HF_IDS[$model_id]}"
local quant="${VLLM_QUANT[$model_id]}"
local r_parser="${VLLM_PARSER_REASON[$model_id]}"
local t_parser="${VLLM_PARSER_TOOL[$model_id]}"
local is_moe="${VLLM_MOE[$model_id]}"
local model_path="${VLLM_LOCAL_PATHS[$model_id]}"
local cmd
if [[ -d "$model_path" ]]; then
cmd="vllm serve $model_path"
else
echo "WARNING: Local path not found: $model_path" >&2
echo " Falling back to HuggingFace ID: $hf_id" >&2
cmd="vllm serve $hf_id"
fi
cmd+=" --host 127.0.0.1"
cmd+=" --port 8000"
cmd+=" --tensor-parallel-size 1"
cmd+=" --max-num-seqs $concurrency"
cmd+=" --gpu-memory-utilization 0.94"
cmd+=" --quantization $quant"
cmd+=" --kv-cache-dtype fp8"
cmd+=" --trust-remote-code"
cmd+=" --no-calculate-kv-scales"
cmd+=" --reasoning-parser $r_parser"
cmd+=" --served-model-name local-llm"
cmd+=" --enable-auto-tool-choice"
cmd+=" --tool-call-parser $t_parser"
    # Pick the context length and multimodal limits from modality and concurrency.
    local max_len
    if [[ "$use_vision" == "yes" || "$use_video" == "yes" ]]; then
        # Only C=1 and C=2 are handled here; per the help text, C=3 is overridden to C=2.
        case "$concurrency" in
            1) max_len="90000" ;;
            2) max_len="65536" ;;
        esac
        case "$concurrency" in
            1) cmd+=" --limit-mm-per-prompt '{\"image\": 3, \"video\": 0}'" ;;
            2) cmd+=" --limit-mm-per-prompt '{\"image\": 1, \"video\": 0}'" ;;
        esac
    else
        case "$concurrency" in
            1) max_len="131072" ;;
            2) max_len="81920" ;;
            3) max_len="40960" ;;
        esac
        cmd+=" --limit-mm-per-prompt '{\"image\": 0}'"
    fi
    cmd+=" --max-model-len $max_len"

    # At the full 131072-token context: async scheduling, no prefix caching, small token budget.
    if [[ $max_len -ge 131072 ]]; then
        cmd+=" --async-scheduling"
        cmd+=" --no-enable-prefix-caching"
        cmd+=" --max-num-batched-tokens 640"
    else
        cmd+=" --enable-prefix-caching"
        cmd+=" --max-num-batched-tokens 8192"
    fi

    if [[ "$is_moe" == "YES" ]]; then
        cmd+=" --enable-expert-parallel"
    fi

    # M7 is the MTP model: enable speculative decoding.
    if [[ "$model_id" == "M7" ]]; then
        cmd+=" --speculative-config '{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":3}'"
        cmd+=" --no-scheduler-reserve-full-isl"
    fi
    # Print the command with one flag per line, then launch it in the background.
    echo ""
    echo "+-- vLLM Serve Command -------------------------------------------------+"
    echo "$cmd" | sed 's/ --/ \\\n --/g'
    echo "+-----------------------------------------------------------------------+"
    echo ""
    # eval is needed so the single-quoted JSON arguments inside $cmd are parsed as shell words.
    eval "nohup $cmd > /tmp/vllm_startup.log 2>&1 &"
    local pid=$!
    disown
    echo "vLLM PID: $pid"
    echo "Startup log: /tmp/vllm_startup.log"
    # Wait for the server to report healthy (default timeout: 300 s).
    local health_timeout="${VLLM_HEALTH_TIMEOUT:-300}"
    if vllm_wait_health "$health_timeout"; then
        echo ""
        echo "========================================"
        echo " vLLM READY"
        echo "========================================"
    else
        echo ""
        echo "WARNING: vLLM not healthy after ${health_timeout}s" >&2
        echo " The model may still be loading (FlashInfer kernel compilation)." >&2
        echo " PID: $pid — check progress: tail -f /tmp/vllm_startup.log" >&2
        echo " Run 'vllm_wait_health' or 'vllm_status' to check again." >&2
    fi

    # Record the active configuration for the other helper functions.
    export VLLM_CURRENT_MODEL="$model_id"
    export VLLM_CURRENT_MODALITIES="$modalities_str"
    export VLLM_CURRENT_CONCURRENCY="$concurrency"
    export VLLM_CURRENT_HF_ID="$hf_id"
    export VLLM_CURRENT_MAX_LEN="$max_len"
    export LOCAL_LLM_CURRENT_MODEL="local-llm"

    echo ""
    echo "========================================"
    echo " ACTIVE MODALITIES: ✓ $modalities_str"
    echo " CONTEXT: $max_len tokens | CONCURRENCY: $concurrency"
    echo "========================================"
    echo "Model: ${VLLM_MODELS[$model_id]} ($model_id)"
    echo "API: $LOCAL_LLM_BASE_URL"
    echo "Aliases: local-llm, $hf_id"
    echo "========================================"
}
# ═══════════════════════════════════════════════════════════════════════
# CONVENIENCE FUNCTIONS — Per-model shortcuts
# Usage: vllm_M6 [concurrency] [--text|--vision|--video|--audio]
# ═══════════════════════════════════════════════════════════════════════
vllm_M1() { vllm_swap M1 "$@" }
vllm_M2() { vllm_swap M2 "$@" }
vllm_M3() { vllm_swap M3 "$@" }
vllm_M4() { vllm_swap M4 "$@" }
vllm_M5() { vllm_swap M5 "$@" }
vllm_M6() { vllm_swap M6 "$@" }
vllm_M7() { vllm_swap M7 "$@" }
vllm_M8() { vllm_swap M8 "$@" }
# ═══════════════════════════════════════════════════════════════════════
# MODE SHORTCUTS — Preset serving configs
# ═══════════════════════════════════════════════════════════════════════
alias vllm_agentic='vllm_swap M6 2'
alias vllm_agentic_fast='vllm_swap M3 3'
alias vllm_writer='vllm_swap M5 1'
alias vllm_writer_moe='vllm_swap M6 1'
alias vllm_deep='vllm_swap M6 1'
alias vllm_deep_gemma='vllm_swap M5 1'
alias vllm_vision='vllm_swap M4 1 --vision'
alias vllm_vision_gemma='vllm_swap M5 1 --vision'
alias vllm_mtp='vllm_swap M7'
# ═══════════════════════════════════════════════════════════════════════
# QUICK API TEST
# ═══════════════════════════════════════════════════════════════════════
vllm_test() {
    local prompt="${1:-Hello, respond with 'OK' and nothing else.}"
    echo "Testing: $LOCAL_LLM_BASE_URL"
    # Build the request body with jq so the prompt is JSON-escaped correctly.
    curl -s "$LOCAL_LLM_BASE_URL/chat/completions" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $LOCAL_LLM_API_KEY" \
        -d "$(jq -n --arg p "$prompt" '{
            model: "local-llm",
            messages: [{role: "user", content: $p}],
            max_tokens: 100,
            temperature: 0.1
        }')" | jq -r '.choices[0].message.content // .error // "FAILED"'
}
vllm_models() {
    # Print the fallback when .data is missing or empty; ".data[].id // ..." never
    # reached it (an empty array yields no output, a null .data raises an error).
    curl -s "$LOCAL_LLM_BASE_URL/models" \
        | jq -r 'if (.data | length) > 0 then .data[].id else "No models found" end'
}
# ═══════════════════════════════════════════════════════════════════════
# HELP — only shown when explicitly called, NOT on source
# ═══════════════════════════════════════════════════════════════════════
vllm_help() {
cat <<'HELPEOF'
+======================================================================+
| local_llm_yolo — Command Reference |
+======================================================================+
| |
| QUICK-START COMMAND MATRIX (Text-Only) |
| Bold = default. Dashes = outside recommended range. |
| Context sizes: C=1 → 131K, C=2 → 82K, C=3 → 41K tokens |
| |
| Model C=1 C=2 C=3 |
| ─────────────────────────────────────────────────────────────────── |
| M1 Firworks-Qwen3-VL-32B vllm_M1 vllm_M1 2 — |
| M2 LilaRest-gemma-4-31B-turbo vllm_M2 1 vllm_M2 vllm_M2 3|
| M3 nvidia-Gemma-4-26B-A4B — vllm_M3 vllm_M3 3|
| M4 OptimizeLLM-Qwen3-VL-30B-A3B vllm_M4 — — |
| M5 RedHatAI-gemma-4-31B-it vllm_M5 vllm_M5 2 — |
| M6 RedHatAI-Qwen3.6-35B-A3B vllm_M6 vllm_M6 2 — |
| M7 sakamakismile-Qwen3.6-27B-MTP — vllm_M7 vllm_M7 3|
| M8 unsloth-Qwen3.6-27B vllm_M8 vllm_M8 2 — |
| |
| VISION MODE: Append --vision to any command. |
| Context sizes: C=1 → 90K, C=2 → 66K (C=3 → override to C=2) |
| Image limits: C=1 → 3 images, C=2 → 1 image (C=3 unsupported) |
| --limit-mm-per-prompt format: '{"image": N, "video": 0}' |
| Vision-capable: M1, M3, M4, M5, M6, M8 |
| Text-only (no vision): M2, M7 |
| |
| EXAMPLES (both forms are equivalent): |
| vllm_swap M6 2 == vllm_M6 2 (C=2 text) |
| vllm_swap M1 --vision == vllm_M1 --vision (C=1 vision) |
| vllm_swap M7 3 == vllm_M7 3 (C=3 text, MTP) |
| |
+======================================================================+
| Model Registry: |
| M1 Firworks-Qwen3-VL-32B-Thinking 21.9GB VL MoE- |
| M2 LilaRest-gemma-4-31B-it-turbo 15.3GB MoE- |
| M3 nvidia-Gemma-4-26B-A4B 18.8GB VL MoE+ |
| M4 OptimizeLLM-Qwen3-VL-30B-A3B 19.2GB VL MoE+ |
| M5 RedHatAI-gemma-4-31B-it 23.3GB VL MoE- |
| M6 RedHatAI-Qwen3.6-35B-A3B 25.1GB VL MoE+ |
| M7 sakamakismile-Qwen3.6-27B-Text-MTP 19.7GB MoE- (MTP) |
| M8 unsloth-Qwen3.6-27B ~19GB VL MoE- |
| |
| Core: |
| vllm_swap <M1..M8> [1|2|3] [--text|--vision] |
| Swap model. Args in any order. |
| Default concurrency per model. Tools always enabled. |
| --text Text-only (larger context, no vision encoder VRAM) |
| --vision Enable vision (smaller context for VRAM overhead) |
| vllm_status Check status |
| vllm_kill Kill vLLM |
| vllm_test [prompt] Test API |
| vllm_wait_health [timeout] Wait healthy |
| vllm_models List models |
| vllm_help This help |
| |
| Quick Swap: vllm_M{1..8} [conc] [--text|--vision] |
| |
| Modes: |
| vllm_agentic -> M6 C=2 (agentic work) |
| vllm_agentic_fast -> M3 C=3 (max throughput) |
| vllm_writer -> M5 C=1 (JFM writing) |
| vllm_writer_moe -> M6 C=1 (MoE reasoning) |
| vllm_deep -> M6 C=1 (max context) |
| vllm_deep_gemma -> M5 C=1 (Gemma deep) |
| vllm_vision -> M4 C=1 --vision (primary vision) |
| vllm_vision_gemma -> M5 C=1 --vision (Gemma vision) |
| vllm_mtp -> M7 (MTP speculative) |
| |
| Claude Code: |
| local-yolo -> Local vLLM, no permission prompts |
| cc-local -> Local vLLM, with permission prompts |
+======================================================================+
HELPEOF
}
This pull request has merge conflicts that must be resolved before it can be merged.
Conflict resolutions (qwen3coder_tool_parser.py + its test file):
- _convert_param_value: kept this branch's detailed type-coercion logic (nullable string/None handling, container double-decode for buggy templates) instead of main's refactor to utils.coerce_to_schema_type / extract_types_from_schema (vllm-project#38973). Restored `import ast` that vllm-project#38973 had removed. Kept main's vllm-project#42292 change (supports_required_and_named = not VLLM_ENFORCE_STRICT_TOOL_CALLING).
- Tests: kept this branch's rewritten Coder test file. Re-integrated the two anyOf tests vllm-project#38973 added: the comprehensive non-streaming + streaming cases stay Coder-specific (they assert {"type": ["integer","null"]} -> int and whitespace-stripped values, which only hold for the Coder parser); the genuinely cross-parser anyOf[array,null] -> list case was added to the shared test file, parametrized over both XML and Coder parsers.
All 190 qwen3 tool-parser tests pass; ruff check/format unchanged vs the pre-merge branch head.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate the anyOf / nullable type-resolution tests (originally added by vllm-project#38973 to the Coder-only file) into the shared XML/Coder suite, parametrized over both parsers, so the coverage applies to both.
To make the JSON-Schema list-form type {"type": ["integer", "null"]} resolve consistently across parsers, teach the XML parser's _get_param_type to pick the first non-null entry of a list-form type (it already did this for anyOf). Both parsers now coerce it to int.
Ruff: replace try/except/pass with contextlib.suppress in both parsers and run ruff format on the touched qwen3 files.
Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
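The type resolution described in this commit can be sketched roughly as follows. This is only an illustration, not the parser's code: resolve_param_type is a hypothetical name, and the "string" default is an assumption. It picks the first non-null entry from either a list-form type or an anyOf list.

```python
from typing import Any


def resolve_param_type(schema: dict[str, Any], default: str = "string") -> str:
    """Return the first non-null type declared by a JSON Schema fragment.

    Hypothetical helper for illustration; the real parsers use their own
    internal methods and may treat edge cases differently.
    """
    declared = schema.get("type")
    if isinstance(declared, str):
        return declared
    if isinstance(declared, list):
        # List form, e.g. {"type": ["integer", "null"]}
        for entry in declared:
            if entry != "null":
                return entry
    # anyOf form, e.g. {"anyOf": [{"type": "string"}, {"type": "null"}]}
    for option in schema.get("anyOf", []):
        if isinstance(option, dict):
            option_type = option.get("type")
            if isinstance(option_type, str) and option_type != "null":
                return option_type
    return default  # assumption: fall back to a plain string
```

With this shape, a nullable integer is coerced as an integer no matter which of the two spellings the tool schema uses.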
Do you think it would be better if I closed this PR and split it into several? One PR for the tests (it also refactors them into a new file that runs the tests against both qwen3_coder and qwen3_xml), and one PR for each fix?
To be used with #40783
Purpose
Fix several streaming regressions in both the Qwen3CoderToolParser and Qwen3XMLToolParser that caused dropped parameters, duplicated content, or incorrect type conversion in tool call responses.
Qwen3Coder (streaming)
- <tool_call> tag detection: when the tag was fragmented across two deltas (e.g. <tool_ then call>), it was not detected and the tool call was silently dropped.
- Parameters were dropped when the header (<tool_call><function=name>) arrived in delta 1 and the parameters + </function> arrived in delta 2. [...] completed.
- </tool_call>, </function> and </parameter> appearing as literal text inside a parameter value (e.g. documentation, Python code) were incorrectly treated as closing delimiters, truncating or corrupting parameter values.
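The fragmented-tag case can be handled by holding back any trailing text that might still turn into the opening tag. The sketch below is a minimal illustration under that assumption; TagScanner and longest_partial_suffix are hypothetical names and are not taken from the parser.

```python
TOOL_CALL_OPEN = "<tool_call>"


def longest_partial_suffix(text: str, tag: str) -> int:
    """Length of the longest suffix of `text` that is a proper prefix of `tag`."""
    for size in range(min(len(text), len(tag) - 1), 0, -1):
        if tag.startswith(text[-size:]):
            return size
    return 0


class TagScanner:
    """Hypothetical helper: buffers deltas until the tag is confirmed or ruled out."""

    def __init__(self, tag: str = TOOL_CALL_OPEN) -> None:
        self.tag = tag
        self.pending = ""  # text held back because it may be the start of the tag
        self.in_tool_call = False

    def feed(self, delta: str) -> str:
        """Consume one delta and return the plain text that is safe to emit."""
        text = self.pending + delta
        self.pending = ""
        if self.in_tool_call:
            return ""  # a real parser would route this text to the tool-call body
        idx = text.find(self.tag)
        if idx != -1:
            # Tag complete: emit only what came before it (text after the tag
            # is ignored in this sketch).
            self.in_tool_call = True
            return text[:idx]
        keep = longest_partial_suffix(text, self.tag)
        if keep:
            # The tail could still become the tag; hold it back for the next delta.
            self.pending = text[-keep:]
            return text[:-keep]
        return text
```

Feeding "<tool_" and then "call>" flips in_tool_call on the second delta, while a false alarm such as "<too" followed by "l box" is released as ordinary content once the tag is ruled out.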
Qwen3XML (streaming)
- anyOf schema type detection: nullable schemas ({"anyOf": [{"type": "string"}, {"type": "null"}]}) were classified as "object" (triggering json.loads) instead of resolving to the first non-null type ("string"), causing type conversion errors.
- [...] </parameter> appeared inside a parameter value.
Both parsers
- When multiple tool calls arrived in a single delta burst, only the first was emitted; subsequent ones were silently dropped.
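Assuming the item above refers to several complete tool calls arriving in one burst, the fix amounts to draining every complete block instead of stopping after the first. The sketch below uses a hypothetical drain_complete_tool_calls helper and a deliberately naive regex; it would mis-split a literal </tool_call> inside a parameter value, which the real parsers have to handle separately, and it discards any text between blocks.

```python
import re

# Naive matcher for illustration only: it does not distinguish structural tags
# from literal tag text inside parameter values.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def drain_complete_tool_calls(buffer: str) -> tuple[list[str], str]:
    """Return every complete tool-call body in `buffer` plus the unconsumed tail."""
    bodies: list[str] = []
    consumed = 0
    for match in TOOL_CALL_RE.finditer(buffer):
        bodies.append(match.group(1).strip())
        consumed = match.end()
    # Whatever follows the last complete block stays buffered for the next delta.
    return bodies, buffer[consumed:]
```

The caller would emit one tool call per returned body and prepend the remainder to the next delta.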
Refactor / tests
- _advance_to_next_tool() helper in Qwen3CoderToolParser to deduplicate identical state-advance logic that was copy-pasted between the normal delta path and the speculative-decoding recursion path.
- tests/tool_parsers/test_qwen3_xml_coder_shared.py, parametrized over both parser classes.
Not a duplicate of any open PR: existing Qwen3 tool parser PRs address
non-streaming (batch) parsing only. This PR focuses exclusively on the
streaming path and speculative decoding edge cases.
Test Plan
Test Result
249 passed, 16 warnings in 108.68s
All 249 tests pass. No regressions detected in the existing test suite.