Fix GLM-4.6v flash tool calling in transformers 5.x #31622

Merged: vllm-bot merged 1 commit into vllm-project:main from baonudesifeizhai:fixglm4.6toolcall on Jan 5, 2026

baonudesifeizhai (Contributor) commented Jan 2, 2026

Purpose

Fixes #31485 (tool calling for GLM-4.6V models with transformers 5.x).

Test Plan

export CUDA_VISIBLE_DEVICES=0,1,2,3

vllm serve "zai-org/GLM-4.6V-Flash" \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --limit-mm-per-prompt '{"image": 1, "video": 0}' \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
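
# In a second terminal, once the server is up:
cat > repro_glm46v_toolcall.py <<'EOF'
import base64, json, requests

def b64_image(path: str) -> str:
    # Read the image and return it base64-encoded for use in a data: URL.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")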

img_b64 = b64_image("test.jpg")

payload = {
  "model": "zai-org/GLM-4.6V-Flash",
  "temperature": 0.2,
  "max_tokens": 256,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Look at the image, then call get_weather for Tokyo. Only call the tool; do not answer in text."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}
      ]
    }
  ]
}

r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=300)
print("status:", r.status_code)
print(json.dumps(r.json(), indent=2, ensure_ascii=False))
EOF

python repro_glm46v_toolcall.py
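
The script prints the whole response, but only two fields decide whether tool calling worked: `finish_reason` and `tool_calls`. A minimal sketch of a helper that extracts them, assuming the OpenAI-style chat completion shape shown in the results below. `summarize_choice` is a hypothetical name for illustration; it is not part of vLLM or of the repro script.

```python
import json


def summarize_choice(response: dict) -> dict:
    """Reduce a chat completion response to the fields relevant to tool calling."""
    choice = response["choices"][0]
    message = choice["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # "arguments" is a JSON-encoded string in the OpenAI-style schema.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return {"finish_reason": choice["finish_reason"], "tool_calls": calls}


# Trimmed-down copies of the two responses in this PR description.
before = {
    "choices": [
        {"finish_reason": "stop", "message": {"content": "...", "tool_calls": []}}
    ]
}
after = {
    "choices": [
        {
            "finish_reason": "tool_calls",
            "message": {
                "content": "...",
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"location\": \"Tokyo\"}",
                        },
                    }
                ],
            },
        }
    ]
}

print(summarize_choice(before))
print(summarize_choice(after))
```

In the repro script, the same helper could be applied to `r.json()` in place of the full JSON dump.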

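Result on main branch (before this fix): `tool_calls` is empty and the call appears only as plain text in `content`.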

python repro_glm46v_toolcall.py
status: 200
{
  "id": "chatcmpl-86e1c31493b07575",
  "object": "chat.completion",
  "created": 1767344497,
  "model": "zai-org/GLM-4.6V-Flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Got it, let's see. The user wants me to call the get_weather function for Tokyo. The image is of a cat, but the instruction is to look at the image then call the tool for Tokyo. So I need to proceed with that. The function call should be get_weather with location \"Tokyo\".\nI will now call the get_weather function for Tokyo.\nget_weather\nlocation\nTokyo\n",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 151338,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 3436,
    "total_tokens": 3529,
    "completion_tokens": 93,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}

Result on this branch (after this fix): `tool_calls` is populated and `finish_reason` is `tool_calls`.

python repro_glm46v_toolcall.py
status: 200
{
 "id": "chatcmpl-b7ee6c5807040746",
 "object": "chat.completion",
 "created": 1767345552,
 "model": "zai-org/GLM-4.6V-Flash",
 "choices": [
   {
     "index": 0,
     "message": {
       "role": "assistant",
       "content": "<think>Got it, let's see. The user wants me to call the get_weather function for Tokyo. The image is of a cat, but the instruction is to look at the image then call the tool for Tokyo. So I need to proceed with that. The function call should be get_weather with location \"Tokyo\".</think>\nI will now call the get_weather function for Tokyo.\n",
       "refusal": null,
       "annotations": null,
       "audio": null,
       "function_call": null,
       "tool_calls": [
         {
           "id": "chatcmpl-tool-a7572587686f09fd",
           "type": "function",
           "function": {
             "name": "get_weather",
             "arguments": "{\"location\": \"Tokyo\"}"
           }
         }
       ],
       "reasoning": null,
       "reasoning_content": null
     },
     "logprobs": null,
     "finish_reason": "tool_calls",
     "stop_reason": 151338,
     "token_ids": null
   }
 ],
 "service_tier": null,
 "system_fingerprint": null,
 "usage": {
   "prompt_tokens": 3436,
   "total_tokens": 3529,
   "completion_tokens": 93,
   "prompt_tokens_details": null
 },
 "prompt_logprobs": null,
 "prompt_token_ids": null,
 "kv_transfer_params": null
}
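
The difference between the two responses is consistent with the tool-call delimiters being dropped during detokenization: on main the call survives only as the bare lines `get_weather`, `location`, `Tokyo` in `content`, which gives a parser nothing to anchor on. A minimal sketch of that effect, assuming GLM-style `<tool_call>`, `<arg_key>`, and `<arg_value>` delimiters (an assumption, since this page does not show the raw model output). `extract_tool_calls` is a hypothetical helper and not the actual `glm45` parser.

```python
import json
import re

# Assumed delimiter format; the raw model output is not shown in this PR.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def extract_tool_calls(text: str) -> list[dict]:
    """Return tool calls found between <tool_call> ... </tool_call> delimiters."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        # The function name is whatever precedes the first <arg_key>.
        name = body.split("<arg_key>", 1)[0].strip()
        args = {k.strip(): v.strip() for k, v in ARG_RE.findall(body)}
        calls.append({"name": name, "arguments": json.dumps(args)})
    return calls


with_delimiters = (
    "I will now call the get_weather function for Tokyo.\n"
    "<tool_call>get_weather\n"
    "<arg_key>location</arg_key>\n"
    "<arg_value>Tokyo</arg_value>\n"
    "</tool_call>"
)
# Same text with every delimiter stripped, as in the main-branch response.
without_delimiters = re.sub(
    r"</?(?:tool_call|arg_key|arg_value)>", "", with_delimiters
)

print(extract_tool_calls(with_delimiters))     # one get_weather call
print(extract_tool_calls(without_delimiters))  # nothing to parse
```

With the delimiters present, the sketch recovers the same `name` and `arguments` as the response on this branch; with them stripped, it finds no calls, matching the empty `tool_calls` on main.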

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison of results, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>

mergify bot commented Jan 2, 2026

Documentation preview: https://vllm--31622.org.readthedocs.build/en/31622/

mergify bot added the documentation and tool-calling labels Jan 2, 2026

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request fixes an issue with tool calling for GLM-4.6v models by ensuring tool call tokens are not skipped during decoding. The core logic of setting skip_special_tokens=False is correct. However, I've identified a potential high-severity issue where the call to super().adjust_request() could enable structured output generation, which conflicts with this model's text-based tool call format and could break tool calling for certain tool_choice options. I have provided a suggestion to rectify this.
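
The change the review describes can be sketched without importing vLLM. This is a minimal illustration under stated assumptions, not the merged diff: `FakeRequest`, `BaseToolParser`, and `GlmLikeToolParser` are hypothetical stand-ins, and the condition on `tools` and `tool_choice` is an assumption about when the override should apply.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a chat completion request."""

    tools: list = field(default_factory=list)
    tool_choice: str = "auto"
    skip_special_tokens: bool = True


class BaseToolParser:
    """Hypothetical base class; a real base may do more in adjust_request."""

    def adjust_request(self, request: FakeRequest) -> FakeRequest:
        return request


class GlmLikeToolParser(BaseToolParser):
    def adjust_request(self, request: FakeRequest) -> FakeRequest:
        request = super().adjust_request(request)
        # Assumed condition: only touch requests that can produce tool calls.
        if request.tools and request.tool_choice != "none":
            # Keep tool-call delimiter tokens in the decoded text so the
            # parser can find them.
            request.skip_special_tokens = False
        return request


parser = GlmLikeToolParser()
with_tools = parser.adjust_request(FakeRequest(tools=[{"type": "function"}]))
no_tools = parser.adjust_request(FakeRequest())
opted_out = parser.adjust_request(
    FakeRequest(tools=[{"type": "function"}], tool_choice="none")
)
print(with_tools.skip_special_tokens, no_tools.skip_special_tokens,
      opted_out.skip_special_tokens)
```

The base class here is a no-op, so the sketch does not reproduce the reviewer's concern about what `super().adjust_request()` might enable; it only shows where `skip_special_tokens=False` is set.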

chaunceyjiang (Collaborator) left a comment

thanks~

chaunceyjiang added the ready label Jan 4, 2026
vllm-bot merged commit 02dbb93 into vllm-project:main Jan 5, 2026
47 of 51 checks passed
Commits in forks that referenced this pull request (each signed off by baonudesifeizhai <baonudesifeizhai@gmail.com>):
  • LucasWilkinson pushed a commit to neuralmagic/vllm, Jan 6, 2026
  • yugong333 pushed a commit to yugong333/vllm, Jan 9, 2026
  • akh64bit pushed a commit to akh64bit/vllm, Jan 16, 2026
  • dsuhinin pushed a commit to dsuhinin/vllm, Jan 21, 2026 (also signed off by dsuhinin <suhinin.dmitriy@gmail.com>)
  • ItzDEXX pushed a commit to ItzDEXX/vllm, Feb 19, 2026
bbrowning added a commit to bbrowning/vllm that referenced this pull request Mar 26, 2026
Audited recent tool parser bug-fix PRs and found that several
landed without corresponding test coverage. Added unit tests
for each fix to prevent regressions.

- Mistral: fast detokenization text detection (PR vllm-project#37209)
- Qwen3Coder: malformed XML crash, anyOf double-encoding,
  speculative decode streaming (PRs vllm-project#36774, vllm-project#36032, vllm-project#35615)
- DeepSeekV32: delimiter preservation with fast detokenization,
  skip_special_tokens adjustment (PR vllm-project#33964)
- GLM-4 MoE: zero-argument tool calls, transformers 5.x delimiter
  handling, Unicode character preservation (PRs vllm-project#32321, vllm-project#31622, vllm-project#30920)
- MiniMax M2: anyOf nullable parameter handling for non-null and
  null values (PR vllm-project#32342)
- Step3p5: MTP-style variable-chunk and multi-token streaming
  (PR vllm-project#33690)
- Kimi K2: native tool call ID extraction and multi-turn ID
  continuity (PR vllm-project#32768)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Ben Browning <bbrownin@redhat.com>

Labels

documentation, ready, tool-calling

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature]: Support Tool Calling with transformers 5.x for GLM-4.6V Models

3 participants