
Conversation

@sundar24295s (Collaborator) commented Aug 21, 2025

Motivation

  • The previous PR #5141 introduced batch tokenization for parallel processing.
  • However, when a batch of prompts (e.g., 50) is sent, the tokenizer manager currently transmits them to the Scheduler one at a time over the ZMQ socket. This can cause the batch to be split (e.g., 10 + 40) at the Scheduler, increasing per-request latency even when max-prefill-tokens is set appropriately.

Modifications

  • Send the entire batch in a single ZMQ message so the Scheduler can admit the full batch into prefill together (a minimal sketch follows this list).
  • Added two small structs to make the batch handling and dispatch logic explicit and clean.
  • Note: this change only takes effect when the enable-tokenizer-batch-encode server argument is set; it is disabled by default.
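
For illustration only, here is a minimal sketch of the dispatch change, not the actual sglang code: the class, struct, attribute names, and socket endpoint are hypothetical, and it only assumes a pyzmq PUSH socket that serializes Python objects with send_pyobj.

from dataclasses import dataclass
from typing import Any, List

import zmq


@dataclass
class BatchTokenizedRequests:  # hypothetical container struct for a tokenized batch
    reqs: List[Any]  # already-tokenized per-request inputs


class TokenizerManagerSketch:
    def __init__(self, enable_tokenizer_batch_encode: bool):
        ctx = zmq.Context.instance()
        self.send_to_scheduler = ctx.socket(zmq.PUSH)
        self.send_to_scheduler.connect("ipc:///tmp/scheduler")  # placeholder endpoint
        self.enable_tokenizer_batch_encode = enable_tokenizer_batch_encode

    def dispatch(self, tokenized_reqs: List[Any]) -> None:
        if self.enable_tokenizer_batch_encode:
            # One ZMQ message for the whole batch, so the Scheduler can admit it
            # into a single prefill batch (subject to max-prefill-tokens).
            self.send_to_scheduler.send_pyobj(BatchTokenizedRequests(tokenized_reqs))
        else:
            # Previous behavior: one message per request, which lets the Scheduler
            # pick up a partial batch and split the prefill (e.g., 23 + 77).
            for req in tokenized_reqs:
                self.send_to_scheduler.send_pyobj(req)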

Benchmarking and Profiling

  • Model = Qwen3-0.6B
  • Input Token Length = 300
  • GPU Type = H100
  • Traffic distribution = Poisson
  • Benchmark file: sglang/benchmark/score/bench_score.py
(sglang-repo3) jobuser [ /shared/user/repos3/sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-0.6B --port 30000 --host 0.0.0.0 --chunked-prefill-size -1 --enable-torch-compile --dtype float16 --max-prefill-tokens 30000 --mem-fraction-static 0.3 --enable-tokenizer-batch-encode  --disable-cuda-graph

Single Request Latency Comparison

Batch Size Method avg_response_time_ms p50_response_time_ms p90_response_time_ms p99_response_time_ms
50 Baseline 70.39 70.39 70.39 70.39
50 Batch Send 41.12 41.12 41.12 41.12
100 Baseline 94.28 94.28 94.28 94.28
100 Batch Send 61.30 61.30 61.30 61.30

Latency Reduction:

  • For batch size 50, average latency decreased by 41.5% (from 70.39 ms to 41.12 ms).
  • For batch size 100, average latency decreased by 34.9% (from 94.28 ms to 61.30 ms).

Baseline logs: the logs show a single batch request being split across two prefill batches.

When sending a batch of 100 prompts

[2025-08-21 06:32:59] Prefill batch. #new-seq: 23, #new-token: 23, #cached-token: 6877, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:59] Prefill batch. #new-seq: 77, #new-token: 77, #cached-token: 23023, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:59] INFO:     127.0.0.1:56618 - "POST /v1/score HTTP/1.1" 200 OK

When sending a batch of 50 prompts

[2025-08-21 06:32:15] Prefill batch. #new-seq: 12, #new-token: 12, #cached-token: 3588, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:15] Prefill batch. #new-seq: 38, #new-token: 38, #cached-token: 11362, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:15] INFO:     127.0.0.1:54044 - "POST /v1/score HTTP/1.1" 200 OK

Profile showing two forward batches:


Batch Send (this PR) logs

When sending a batch of 100 prompts

[2025-08-21 06:30:48] Prefill batch. #new-seq: 100, #new-token: 100, #cached-token: 29900, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:30:48] INFO:     127.0.0.1:53488 - "POST /v1/score HTTP/1.1" 200 OK

When sending a batch of 50 prompts

[2025-08-21 06:31:06] Prefill batch. #new-seq: 50, #new-token: 50, #cached-token: 14950, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:31:06] INFO:     127.0.0.1:50392 - "POST /v1/score HTTP/1.1" 200 OK

Profile showing a single forward batch:

Longer Benchmarks

Baseline vs. Batch Send

target_rps item_count avg_response_time_ms p50_response_time_ms p90_response_time_ms p99_response_time_ms Method
14 50 82.17 67.92 109.43 337.63 Baseline
14 50 81.71 67.74 110.58 413.82 Baseline
14 50 85.54 67.29 108.52 431.74 Baseline
14 50 63.61 38.59 93.66 373.32 Batch Send
14 50 59.92 38.37 80.54 343.12 Batch Send
14 50 66.64 40.42 88.54 444.47 Batch Send


@sundar24295s marked this pull request as ready for review August 21, 2025 08:00
@mickqian (Collaborator) left a comment

LGTM, but this might slightly change the scheduler performance by increasing the latency of the leading requests in the batch

@hnyls2002 (Collaborator)

@sundar24295s Just as @mickqian said, this causes some behavior changes. It looks like, regardless of the batch size, the whole batch is sent to the scheduler at once. This does not seem reasonable. How about adding an argument to set a token limit for batch sending?

Cc @fzyzcjy, what do you think of this, since you have used the batch API so much?
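
For illustration, a rough sketch of the token-capped batch send suggested above; the argument name, helper, and input_ids field are assumptions and are not part of this PR.

def chunk_by_token_limit(tokenized_reqs, max_batch_send_tokens):
    """Split a tokenized batch into chunks whose total token count stays under a cap."""
    chunk, chunk_tokens = [], 0
    for req in tokenized_reqs:
        n_tokens = len(req.input_ids)  # assumes each request carries its token ids
        if chunk and chunk_tokens + n_tokens > max_batch_send_tokens:
            yield chunk
            chunk, chunk_tokens = [], 0
        chunk.append(req)
        chunk_tokens += n_tokens
    if chunk:
        yield chunk

Each chunk would then go out as one ZMQ message, so a very large batch would no longer delay its leading requests behind the whole batch.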

@sundar24295s (Collaborator, Author)

@hnyls2002

  • The scheduler already has robust batch management through max_prefill_tokens and the PrefillAdder class.
  • My PR simply ensures the scheduler receives requests as the intended batches, letting the existing max_prefill_tokens budget kick in (see the illustration after this list).
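
For intuition only, a simplified illustration of how a prefill token budget caps what enters one prefill pass; this is not sglang's actual PrefillAdder, and the field names are assumptions.

def admit_for_prefill(waiting_reqs, max_prefill_tokens):
    """Greedily admit queued requests until the prefill token budget is exhausted."""
    admitted, budget = [], max_prefill_tokens
    for req in waiting_reqs:
        # Only tokens that still need a forward pass count against the budget;
        # prefix-cached tokens are assumed to be free here.
        need = req.num_input_tokens - req.num_cached_tokens
        if need > budget:
            break  # the rest waits for the next prefill batch
        admitted.append(req)
        budget -= need
    return admitted

With --max-prefill-tokens 30000 and ~300-token prompts, the full 100-request batch (30000 total tokens) fits in a single prefill pass, matching the single "Prefill batch. #new-seq: 100" line in the logs above.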

@hnyls2002 (Collaborator) commented Aug 23, 2025

@sundar24295s I agree with what you said. But if the batch size is very large, won't the scheduler receive the first request quite late (since the whole batch arrives as a single batch request)?

@hebiao064 (Collaborator)

@hnyls2002
It seems this PR only affects cases where enable_tokenizer_batch_encode=True, which is False by default. If that’s correct, I believe it should be safe.

Also, if users send very large batch requests, they should understand this may slow things down — that’s ultimately their choice.


@sundar24295s (Collaborator, Author)

@zhyncs / @hnyls2002 Are we good to merge the PR?

@hnyls2002 merged commit ea0696b into sgl-project:main on Aug 25, 2025 (68 of 72 checks passed).
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request on Sep 8, 2025.
@narutolhy mentioned this pull request on Sep 30, 2025.