WIP: Fix glm-4.6 tool call streaming parse #11951

Draft
tonylt wants to merge 1 commit into sgl-project:main from tonylt:fix-glm-4.6-tool-call-streaming-output

Conversation

@tonylt

@tonylt tonylt commented Oct 22, 2025

Motivation

Summary

I have implemented a fix for GitHub issue #11888 regarding GLM-4.6 tool calls not supporting streaming output for arguments in SGLang.

Problem Analysis

The issue was that GLM-4.6 tool calls were being returned all at once rather than being streamed progressively. The original implementation waited for complete tool calls (until it found </tool_call>) before parsing and streaming them, which caused arguments to appear in a single chunk after a long wait.

Modifications

Solution Implemented

I implemented incremental streaming support for GLM-4.6 tool call arguments by modifying both the Rust and Python implementations:

Key Changes Made:

  1. Added streaming state tracking:
     • Added current_tool_name_sent field to track whether the tool name has been streamed
     • Enhanced the streaming logic to handle partial tool calls
  2. Implemented incremental parsing:
     • Created parse_partial_tool_call() method that can parse incomplete tool calls
     • Added logic to detect and stream tool names first, then arguments incrementally
  3. Enhanced argument streaming:
     • Modified the streaming logic to calculate differences between current and previously streamed arguments
     • Implemented proper diff calculation to stream only new argument content
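
The diff calculation in item 3 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `find_common_prefix` only mirrors the name of the `_find_common_prefix()` helper mentioned under Files Modified, and `argument_delta` and the sample argument strings are hypothetical.

```python
def find_common_prefix(a: str, b: str) -> str:
    """Return the longest common prefix of two strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return a[:n]


def argument_delta(previously_sent: str, current: str) -> str:
    """Return the part of `current` that has not been streamed yet.

    Assumption: text that was already streamed cannot be retracted, so if
    `current` no longer extends `previously_sent`, nothing is emitted.
    """
    prefix = find_common_prefix(previously_sent, current)
    if len(prefix) < len(previously_sent):
        return ""
    return current[len(prefix):]


# Only the new tail of the serialized arguments is streamed.
print(argument_delta('{"city": "Be', '{"city": "Beijing"'))
```

Returning an empty delta when the serialization changes retroactively is one possible policy; a real parser might instead buffer until the two strings agree again.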

Files Modified:

  1. Rust Implementation (sgl-router/src/tool_parser/parsers/glm4_moe_parser.rs):
     • Added current_tool_name_sent field to the parser struct
     • Implemented parse_partial_tool_call() method for incremental parsing
     • Enhanced parse_incremental() method to support streaming tool names and arguments
     • Updated the reset method to include the new field
  2. Python Implementation (python/sglang/srt/function_call/glm4_moe_detector.py):
     • Added current_tool_name_sent field to track streaming state
     • Implemented _parse_partial_tool_call() method for incremental parsing
     • Enhanced parse_streaming_increment() method to support streaming
     • Added _find_common_prefix() helper method for diff calculation
  3. Tests (sgl-router/tests/tool_parser_glm4_moe.rs):
     • Added comprehensive tests for streaming functionality
     • Tests verify that tool names are streamed first, followed by incremental argument streaming

Expected Behavior After Fix

With this implementation, GLM-4.6 tool calls now support proper streaming:
  • Tool name streaming: The function name is streamed first as soon as it's detected
  • Incremental argument streaming: Arguments are streamed progressively as they are parsed from the XML format
  • Better user experience: Users will see tool calls building up incrementally rather than waiting for complete tool calls
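
The intended ordering (name first, then argument fragments) could look roughly like the sketch below. Only the `current_tool_name_sent` field name comes from the description above; `ToolCallDelta`, `StreamState`, `streamed_args`, and the `get_weather` example are hypothetical and do not reflect the actual classes in SGLang.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolCallDelta:
    """One streamed fragment: either a tool name or a piece of the arguments."""
    name: Optional[str] = None
    arguments: str = ""


class StreamState:
    """Hypothetical per-tool-call streaming state (not the PR's actual class)."""

    def __init__(self) -> None:
        self.current_tool_name_sent = False  # field name taken from the PR description
        self.streamed_args = ""              # assumed: arguments text already emitted

    def update(self, name: Optional[str], args_so_far: str) -> List[ToolCallDelta]:
        deltas: List[ToolCallDelta] = []
        # Emit the name exactly once, as soon as the caller knows it is complete.
        if not self.current_tool_name_sent and name is not None:
            deltas.append(ToolCallDelta(name=name))
            self.current_tool_name_sent = True
        # Emit only the unsent tail of the arguments, and only after the name.
        if (
            self.current_tool_name_sent
            and len(args_so_far) > len(self.streamed_args)
            and args_so_far.startswith(self.streamed_args)
        ):
            deltas.append(ToolCallDelta(arguments=args_so_far[len(self.streamed_args):]))
            self.streamed_args = args_so_far
        return deltas


state = StreamState()
for name, args in [(None, ""), ("get_weather", ""), ("get_weather", '{"city": "'),
                   ("get_weather", '{"city": "Beijing"}')]:
    print(state.update(name, args))
```

Each call returns zero, one, or two deltas, so a client sees the name in one chunk and the arguments growing across later chunks.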

Testing

I created and ran comprehensive tests that verify:

  1. Tool names are streamed immediately when detected
  2. Arguments are streamed incrementally as they are parsed
  3. The streaming behavior matches the expected format for GLM-4.6 models

The fix ensures that GLM-4.6 tool calls now provide the same streaming experience as other model formats in SGLang, addressing the user's concern about better responsiveness and user experience.

Accuracy Tests

Benchmarking and Profiling

Checklist

@tonylt tonylt force-pushed the fix-glm-4.6-tool-call-streaming-output branch from 11bcdbd to 8e1a949 Compare October 22, 2025 03:06
@JustinTong0323 JustinTong0323 self-assigned this Oct 22, 2025
@tonylt tonylt force-pushed the fix-glm-4.6-tool-call-streaming-output branch from 8e1a949 to e3faf3d Compare October 22, 2025 03:18
@tonylt tonylt marked this pull request as ready for review October 22, 2025 08:26
@tonylt tonylt changed the title from "Fix glm-4.6 tool call streaming parse" to "WIP: Fix glm-4.6 tool call streaming parse" Oct 22, 2025
@tonylt tonylt marked this pull request as draft October 22, 2025 08:54
@gaoganlsz

Keep it up, tonylt! Eagerly waiting for the fix.

@gaoganlsz

Keep it up, tonylt! Eagerly waiting for the fix.

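The current implementation has a problem. For example, the following sequence of new_text chunks causes the tool name to be parsed as "read", when it should be "read-file":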

1:<tool_call>read
2:-file\n
3:<arg_key>target_file
4:</arg_key>\n<arg_value>
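
One way to avoid the truncated name in the sequence above is to hold the name back until a terminator has arrived. The sketch below is an assumption, not the PR's code: it treats the first newline or the first `<` after `<tool_call>` as the end of the name, and `extract_complete_tool_name` is a hypothetical helper.

```python
from typing import Optional

TOOL_CALL_OPEN = "<tool_call>"


def extract_complete_tool_name(buffer: str) -> Optional[str]:
    """Return the tool name only once its end is certain, else None.

    Assumption: the name ends at the first newline or at the first '<'
    (the start of <arg_key> or </tool_call>) after the opening tag.
    """
    start = buffer.find(TOOL_CALL_OPEN)
    if start == -1:
        return None
    rest = buffer[start + len(TOOL_CALL_OPEN):]
    ends = [i for i in (rest.find("\n"), rest.find("<")) if i != -1]
    if not ends:
        return None  # name may still be growing, e.g. "read" before "-file"
    name = rest[:min(ends)].strip()
    return name or None


# Replaying the first two chunks from the report above.
buffer = ""
for chunk in ["<tool_call>read", "-file\n"]:
    buffer += chunk
    print(repr(chunk), "->", extract_complete_tool_name(buffer))
```

With this rule the name is reported only after the second chunk, at the cost of delaying it until the terminator is received.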

@gaoganlsz

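Parsing complete key/value nodes with json.loads(prev_args_str) does not meet the requirement. For example, when creating a large file the value is very large, so the client ends up waiting for a single huge streamed chunk.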
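
A possible alternative is to stream the text of an open `<arg_value>` before its closing tag arrives, holding back only a trailing fragment that might be the beginning of `</arg_value>`. This is a hedged sketch of that idea, not the approach taken in this PR; `streamable_value` is a hypothetical helper and its input is assumed to be the text received so far after the opening `<arg_value>` tag.

```python
from typing import Tuple

ARG_VALUE_CLOSE = "</arg_value>"


def streamable_value(body: str) -> Tuple[str, bool]:
    """Return (value text that is safe to emit so far, whether the value is closed)."""
    end = body.find(ARG_VALUE_CLOSE)
    if end != -1:
        return body[:end], True
    # Hold back any suffix that could be the start of the closing tag.
    for k in range(min(len(ARG_VALUE_CLOSE) - 1, len(body)), 0, -1):
        if body.endswith(ARG_VALUE_CLOSE[:k]):
            return body[:-k], False
    return body, False


# The value is emitted progressively even though the closing tag is incomplete.
print(streamable_value("line 1\nline 2</arg"))
```

The caller would still need to diff the returned text against what was already sent and escape it appropriately before embedding it in the streamed arguments.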

@Leoyzen
Contributor

Leoyzen commented Oct 23, 2025

Maybe consider reusing the streaming XML parser from this PR: #10035.

Or a more general streaming XML parser for all kinds of LLMs that use XML as their tool-use template?

@tonylt
Author

tonylt commented Oct 29, 2025

Maybe consider reusing the streaming XML parser from this PR: #10035.

Or a more general streaming XML parser for all kinds of LLMs that use XML as their tool-use template?

@Leoyzen Yes, a unified parser is much better. I'll take a look, thanks.

@cynial
Contributor

cynial commented Dec 5, 2025

@tonylt @gaoganlsz I'm trying to fix this issue. Could you take a look?
#13989
