
Conversation

@sundar24295s (Collaborator) commented Aug 21, 2025

Motivation

  • The previous PR #5141 introduced batch tokenization for parallel processing.
  • However, when a batch of prompts (e.g., 50) is sent, the tokenizer manager currently transmits them to the Scheduler one at a time over the ZMQ socket. This can cause the batch to be split (e.g., 10 + 40) at the Scheduler, increasing per-request latency even when max-prefill-tokens is set appropriately.

Modifications

  • Send the entire batch in a single ZMQ message so the Scheduler can admit the full batch into prefill together (a minimal sketch follows this list).
  • Added two small structs to make the batch handling and dispatch logic explicit and clean.
  • Note: this change only takes effect when the enable-tokenizer-batch-encode server argument is set; it is disabled by default.
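
For illustration only, here is a minimal sketch of the dispatch change, not the actual sglang code: the class, struct, attribute names, and socket endpoint are hypothetical, and it only assumes a pyzmq PUSH socket that serializes Python objects with send_pyobj.

from dataclasses import dataclass
from typing import Any, List

import zmq


@dataclass
class BatchTokenizedRequests:  # hypothetical container struct for a tokenized batch
    reqs: List[Any]  # already-tokenized per-request inputs


class TokenizerManagerSketch:
    def __init__(self, enable_tokenizer_batch_encode: bool):
        ctx = zmq.Context.instance()
        self.send_to_scheduler = ctx.socket(zmq.PUSH)
        self.send_to_scheduler.connect("ipc:///tmp/scheduler")  # placeholder endpoint
        self.enable_tokenizer_batch_encode = enable_tokenizer_batch_encode

    def dispatch(self, tokenized_reqs: List[Any]) -> None:
        if self.enable_tokenizer_batch_encode:
            # One ZMQ message for the whole batch, so the Scheduler can admit it
            # into a single prefill batch (subject to max-prefill-tokens).
            self.send_to_scheduler.send_pyobj(BatchTokenizedRequests(tokenized_reqs))
        else:
            # Previous behavior: one message per request, which lets the Scheduler
            # pick up a partial batch and split the prefill (e.g., 23 + 77).
            for req in tokenized_reqs:
                self.send_to_scheduler.send_pyobj(req)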

Benchmarking and Profiling

  • Model = Qwen3-0.6B
  • Input Token Length = 300
  • GPU Type = H100
  • Traffic distribution = Poisson
  • Benchmark file: sglang/benchmark/score/bench_score.py
(sglang-repo3) jobuser [ /shared/user/repos3/sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-0.6B --port 30000 --host 0.0.0.0 --chunked-prefill-size -1 --enable-torch-compile --dtype float16 --max-prefill-tokens 30000 --mem-fraction-static 0.3 --enable-tokenizer-batch-encode  --disable-cuda-graph

Single Request Latency Comparison

Batch Size Method avg_response_time_ms p50_response_time_ms p90_response_time_ms p99_response_time_ms
50 Baseline 70.39 70.39 70.39 70.39
50 Batch Send 41.12 41.12 41.12 41.12
100 Baseline 94.28 94.28 94.28 94.28
100 Batch Send 61.30 61.30 61.30 61.30

Latency Reduction:

  • For batch size 50, average latency decreased by 41.5% (from 70.39 ms to 41.12 ms).
  • For batch size 100, average latency decreased by 34.9% (from 94.28 ms to 61.30 ms).

Baseline logs: the logs show a single batch request being split across two prefill batches.

When sending a batch of 100 prompts

[2025-08-21 06:32:59] Prefill batch. #new-seq: 23, #new-token: 23, #cached-token: 6877, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:59] Prefill batch. #new-seq: 77, #new-token: 77, #cached-token: 23023, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:59] INFO:     127.0.0.1:56618 - "POST /v1/score HTTP/1.1" 200 OK

When sending a batch of 50 prompts

[2025-08-21 06:32:15] Prefill batch. #new-seq: 12, #new-token: 12, #cached-token: 3588, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:15] Prefill batch. #new-seq: 38, #new-token: 38, #cached-token: 11362, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:32:15] INFO:     127.0.0.1:54044 - "POST /v1/score HTTP/1.1" 200 OK

Profile showing two forward batches:


Batch Send (this PR) logs

When sending a batch of 100 prompts

[2025-08-21 06:30:48] Prefill batch. #new-seq: 100, #new-token: 100, #cached-token: 29900, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:30:48] INFO:     127.0.0.1:53488 - "POST /v1/score HTTP/1.1" 200 OK

When sending a batch of 50 prompts

[2025-08-21 06:31:06] Prefill batch. #new-seq: 50, #new-token: 50, #cached-token: 14950, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-21 06:31:06] INFO:     127.0.0.1:50392 - "POST /v1/score HTTP/1.1" 200 OK

Profile showing a single forward batch:

Longer Benchmarks

Baseline vs. Batch Send

target_rps item_count avg_response_time_ms p50_response_time_ms p90_response_time_ms p99_response_time_ms Method
14 50 82.17 67.92 109.43 337.63 Baseline
14 50 81.71 67.74 110.58 413.82 Baseline
14 50 85.54 67.29 108.52 431.74 Baseline
14 50 63.61 38.59 93.66 373.32 Batch Send
14 50 59.92 38.37 80.54 343.12 Batch Send
14 50 66.64 40.42 88.54 444.47 Batch Send


@sundar24295s marked this pull request as ready for review August 21, 2025 08:00
@mickqian (Collaborator) left a comment

LGTM, but this might slightly change the scheduler performance by increasing the latency of the leading requests in the batch

@hnyls2002 (Collaborator)

@sundar24295s Just as @mickqian said, this causes some behavior changes. It looks like, regardless of the batch size, the whole batch is sent to the scheduler at once. This does not seem reasonable. How about adding an argument to set a token limit for batch sending?

Cc @fzyzcjy, what do you think of this, since you have used the batch API so much?
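
For illustration, a rough sketch of the token-capped batch send suggested above; the argument name, helper, and input_ids field are assumptions and are not part of this PR.

def chunk_by_token_limit(tokenized_reqs, max_batch_send_tokens):
    """Split a tokenized batch into chunks whose total token count stays under a cap."""
    chunk, chunk_tokens = [], 0
    for req in tokenized_reqs:
        n_tokens = len(req.input_ids)  # assumes each request carries its token ids
        if chunk and chunk_tokens + n_tokens > max_batch_send_tokens:
            yield chunk
            chunk, chunk_tokens = [], 0
        chunk.append(req)
        chunk_tokens += n_tokens
    if chunk:
        yield chunk

Each chunk would then go out as one ZMQ message, so a very large batch would no longer delay its leading requests behind the whole batch.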

@sundar24295s (Collaborator, Author)

@hnyls2002

  • The scheduler already has robust batch management through max_prefill_tokens and the PrefillAdder class.
  • My PR simply ensures the scheduler receives requests as the intended batches, letting the existing max_prefill_tokens budget kick in (see the illustration after this list).
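
For intuition only, a simplified illustration of how a prefill token budget caps what enters one prefill pass; this is not sglang's actual PrefillAdder, and the field names are assumptions.

def admit_for_prefill(waiting_reqs, max_prefill_tokens):
    """Greedily admit queued requests until the prefill token budget is exhausted."""
    admitted, budget = [], max_prefill_tokens
    for req in waiting_reqs:
        # Only tokens that still need a forward pass count against the budget;
        # prefix-cached tokens are assumed to be free here.
        need = req.num_input_tokens - req.num_cached_tokens
        if need > budget:
            break  # the rest waits for the next prefill batch
        admitted.append(req)
        budget -= need
    return admitted

With --max-prefill-tokens 30000 and ~300-token prompts, the full 100-request batch (30000 total tokens) fits in a single prefill pass, matching the single "Prefill batch. #new-seq: 100" line in the logs above.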

@hnyls2002 (Collaborator) commented Aug 23, 2025

@sundar24295s I agree with what you said. But if the batch size is very large, won't the scheduler receive the first request quite late (since the whole batch arrives as a single batch request)?

@hebiao064 (Collaborator)

@hnyls2002
It seems this PR only affects cases where enable_tokenizer_batch_encode=True, which is False by default. If that’s correct, I believe it should be safe.

Also, if users send very large batch requests, they should understand this may slow things down — that’s ultimately their choice.


@sundar24295s (Collaborator, Author)

@zhyncs / @hnyls2002 Are we good to merge the PR?

@hnyls2002 merged commit ea0696b into sgl-project:main on Aug 25, 2025 (68 of 72 checks passed).
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request on Sep 8, 2025.
@narutolhy mentioned this pull request on Sep 30, 2025.