[Performance] Batch Send from Tokenizer Manager. #9436
Conversation

mickqian left a comment:
LGTM, but this might slightly change the scheduler performance by increasing the latency of the leading requests in the batch.
@sundar24295s Just as @mickqian said, this causes some behavior changes. It looks like, regardless of the batch size, the requests are sent to the scheduler as a whole batch, which does not seem reasonable. How about adding an argument to set a token limit for batch sending? Cc @fzyzcjy: what do you think of this, since you have used the batch API so much?
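For illustration, the token-limit idea suggested above could look roughly like the following sketch. All names here (`TokenizedRequest`, `chunk_by_token_limit`, `max_batch_send_tokens`) are hypothetical and not part of the sglang codebase; this is only one way the chunking might be done.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class TokenizedRequest:
    rid: str
    input_ids: List[int]


def chunk_by_token_limit(
    reqs: List[TokenizedRequest], max_batch_send_tokens: int
) -> Iterator[List[TokenizedRequest]]:
    """Split a batch into chunks whose total token count stays under the
    limit, so the scheduler receives the leading requests without waiting
    for the entire batch to be sent."""
    chunk: List[TokenizedRequest] = []
    tokens = 0
    for req in reqs:
        n = len(req.input_ids)
        # Start a new chunk once adding this request would exceed the budget.
        # A single oversized request still gets its own chunk.
        if chunk and tokens + n > max_batch_send_tokens:
            yield chunk
            chunk, tokens = [], 0
        chunk.append(req)
        tokens += n
    if chunk:
        yield chunk
```

With a scheme like this, a very large batch would still be forwarded promptly in pieces, while small batches would go through in a single send.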
@sundar24295s I agree with what you said. But if a batch is very large, won't the scheduler receive the first request quite late, since the whole batch arrives as a single batched request?
@hnyls2002 Also, if users send very large batch requests, they should understand this may slow things down — that’s ultimately their choice.
@zhyncs / @hnyls2002 Are we good to merge the PR?
Motivation
When a client sends a batch request, the tokenizer manager currently tokenizes and forwards the requests to the scheduler one at a time. As a result, the scheduler can split a single batch request across multiple forward batches (see the baseline profiles below). Sending the tokenized requests to the scheduler as one batch lets the scheduler run them in a single forward batch, as long as max-prefill-tokens is set appropriately.

Modifications
Added an enable-tokenizer-batch-encode server arg, which is disabled by default. When enabled, the requests in a batch are tokenized together and sent to the scheduler as a whole batch (see the sketch below).
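The following is a minimal sketch of what the gated dispatch path could look like. It assumes a `send_pyobj`-style IPC channel between the tokenizer manager and the scheduler; the method and attribute names (`dispatch`, `send_to_scheduler`, `server_args`) are illustrative, not the exact sglang internals.

```python
# Illustrative only: gate batch vs. per-request sends on the new server arg.
def dispatch(self, tokenized_reqs):
    if self.server_args.enable_tokenizer_batch_encode:
        # Batch path: one IPC message carrying the whole batch, so the
        # scheduler can schedule it as a single prefill forward batch.
        self.send_to_scheduler.send_pyobj(tokenized_reqs)
    else:
        # Default path: one IPC message per request; a large batch may be
        # split across multiple forward batches on the scheduler side.
        for req in tokenized_reqs:
            self.send_to_scheduler.send_pyobj(req)
```

To try the new path, the flag would be passed at server launch, e.g. `python -m sglang.launch_server --model-path <model> --enable-tokenizer-batch-encode`.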
Benchmarking and Profiling
Benchmarks were run with sglang/benchmark/score/bench_score.py.

Single Request Latency Comparison
Latency Reduction:
Baseline logs: the logs show a single batch request being split across multiple forward batches.
When sending a batch of 100 prompts
When sending a batch of 50 prompts
Profile showing two forward batches:
Batch send (this PR) logs
When sending a batch of 100 prompts
When sending a batch of 50 prompts
Profile showing a single forward batch:

Longer Benchmarks
Baseline
Checklist