[Bugfix] Add missing auto_create_handle_loop to communicator methods #19610

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from Kangyan-Zhou:fix/server-info-handle-loop
Mar 1, 2026

@Kangyan-Zhou Kangyan-Zhou commented Mar 1, 2026

Summary

  • Several communicator methods in TokenizerCommunicatorMixin (e.g., get_internal_state, flush_cache, get_load) were missing the self.auto_create_handle_loop() call that ensures the ZMQ response-receiving loop is running.
  • Without this call, these methods hang indefinitely when invoked before any inference request has been processed, because the handle_loop asyncio task has never been started. This is easy to hit when the server is launched with --skip-server-warmup, which skips the single forward run normally performed for server warmup.

Motivation

In PD disaggregation mode, the router calls /server_info (which invokes get_internal_state()) immediately after a worker pod starts, before any inference request arrives. At that point nothing has called auto_create_handle_loop() yet: get_internal_state() lacks the call, and no generate_request() has run. The handle_loop that receives scheduler responses via ZMQ is therefore never created, and get_internal_state() waits forever.

This also affects flush_cache, get_load, get_loads, set_internal_state, dumper_control, and the HiCache management endpoints — any communicator method called on an idle server will hang.

Fix

Add self.auto_create_handle_loop() to the 10 communicator methods that were missing it, matching the pattern already used by generate_request(), slow_down(), and other working methods.

The call is idempotent — it returns immediately if the handle_loop is already running.
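
The failure mode and the fix can be illustrated with a minimal, self-contained asyncio sketch. This is not the sglang implementation: ToyTokenizerManager, the in-memory queue standing in for the ZMQ socket, and the _buggy/_fixed method names are all hypothetical and exist only to show the lazy, idempotent start pattern described above.

```python
import asyncio


class ToyTokenizerManager:
    """Toy stand-in for illustration only; not the real TokenizerManager."""

    def __init__(self):
        self._responses = asyncio.Queue()  # stands in for the ZMQ receive socket
        self._pending = {}                 # request id -> Future awaiting a reply
        self._handle_loop_task = None
        self._next_id = 0

    def auto_create_handle_loop(self):
        # Idempotent: returns immediately if the receiver task already exists.
        if self._handle_loop_task is not None:
            return
        loop = asyncio.get_running_loop()
        self._handle_loop_task = loop.create_task(self._handle_loop())

    async def _handle_loop(self):
        # Receive replies and hand each one to the caller waiting on it.
        while True:
            req_id, payload = await self._responses.get()
            fut = self._pending.pop(req_id, None)
            if fut is not None and not fut.done():
                fut.set_result(payload)

    async def _call(self, payload):
        req_id = self._next_id
        self._next_id += 1
        fut = asyncio.get_running_loop().create_future()
        self._pending[req_id] = fut
        # Toy "scheduler": the reply is available immediately, but someone
        # still has to read it off the queue.
        self._responses.put_nowait((req_id, payload))
        return await fut

    async def get_internal_state_buggy(self):
        # Missing auto_create_handle_loop(): nobody reads the reply.
        return await self._call({"state": "ok"})

    async def get_internal_state_fixed(self):
        self.auto_create_handle_loop()
        return await self._call({"state": "ok"})


async def demo():
    mgr = ToyTokenizerManager()
    try:
        await asyncio.wait_for(mgr.get_internal_state_buggy(), timeout=0.2)
        buggy_hung = False
    except asyncio.TimeoutError:
        buggy_hung = True
    fixed = await asyncio.wait_for(mgr.get_internal_state_fixed(), timeout=0.2)
    mgr.auto_create_handle_loop()  # second call is a no-op
    return buggy_hung, fixed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the sketch, the buggy variant only returns because of the wait_for timeout; on a real idle server there is no such timeout, which is why the endpoint appears to hang.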

Test plan

  • Verified on a live PD disaggregation cluster (2 engines + 4 decoders) that /server_info hangs before the fix
  • Confirmed that triggering a generate request (which calls auto_create_handle_loop) unblocks the pending /server_info
  • After applying the fix, /server_info responds immediately on freshly started pods without any prior inference traffic
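
A check like the one in the test plan could be scripted roughly as follows. This is a hedged sketch using only the Python standard library: the helper name server_info_responds, the base URL, and the timeout value are assumptions for illustration, and only the /server_info path comes from this PR.

```python
import http.client
import urllib.request


def server_info_responds(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET {base_url}/server_info answers HTTP 200 within timeout_s.

    A server affected by the bug accepts the connection but never replies,
    so the read times out and this returns False.
    """
    url = f"{base_url}/server_info"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers URLError, HTTPError, connection refusal and
        # socket timeouts; HTTPException covers malformed responses.
        return False


if __name__ == "__main__":
    # Hypothetical address of a freshly started worker pod.
    print(server_info_responds("http://127.0.0.1:30000", timeout_s=5.0))
```

Running this against a freshly started pod, before sending any generate request, would distinguish the pre-fix behavior (False after the timeout) from the post-fix behavior (True almost immediately).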

🤖 Generated with Claude Code

The handle_loop asyncio task in TokenizerManager is responsible for
receiving responses from schedulers via ZMQ and dispatching them to
the appropriate _Communicator. However, handle_loop is lazily started
by auto_create_handle_loop() and several communicator methods were
missing this call.

This caused /server_info (and other endpoints like /flush_cache,
/get_loads) to hang indefinitely when called on freshly-started
servers that had not yet processed any inference request -- because
no inference request had triggered auto_create_handle_loop() yet,
the scheduler responses were never received.

This is particularly critical for PD disaggregation setups where
the sglang router's service discovery calls /server_info as the
very first interaction with worker pods during the discover_metadata
step, before any generate request is sent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Kangyan-Zhou Kangyan-Zhou merged commit 98224de into sgl-project:main Mar 1, 2026
54 of 62 checks passed
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Mar 4, 2026
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026