[SGLang-Diffusion] Add offline throughput benchmark script for multi-modal models #18154
BBuf merged 6 commits into sgl-project:main
Conversation
Also, could you clean up the code a bit?
cc @zhaochenyang20 Refactored as requested.
python/sglang/multimodal_gen/benchmarks/bench_offline_throughput.py
We can first have this PR merged. And I think the profiling of the diffusion router could be interesting:
8ffe7c1 to
17e64df
Compare
zhaochenyang20 left a comment:
- Refactor the print lines in LLM and Diffusion. I think you can put a helper function in https://github.com/sgl-project/sglang/blob/main/python/sglang/test/test_utils.py
- Debug the bench_offline launch commands for the engine over multiple GPUs.
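The suggested shared print helper could look roughly like the sketch below. The helper name, column widths, and section layout are illustrative assumptions, not the actual `test_utils.py` implementation; the point is only that both the LLM and diffusion benchmark scripts would render their report banners through one function.

```python
# Hypothetical sketch of a shared report printer for sglang/test/test_utils.py.
# Each section is a dict of label -> value, separated by dashed rules.
def print_benchmark_report(title: str, sections: list[dict]) -> str:
    """Render a banner-style benchmark report from ordered sections."""
    width = 110
    lines = [f" {title} ".center(width, "=")]
    for section in sections:
        for key, value in section.items():
            lines.append(f"{key + ':':<34}{value}")
        lines.append("-" * 75)
    lines[-1] = "=" * width  # close the banner instead of a trailing separator
    report = "\n".join(lines)
    print(report)
    return report
```

A caller would then pass the same section dicts from both scripts, so the offline and serving reports stay visually identical.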
Also, could you modify this document:
zhaochenyang20 left a comment:
I have a strong suggestion regarding the architecture of our benchmark tools. Instead of maintaining two separate scripts (bench_offline_throughput.py and bench_serving.py), we should merge them into a single, unified entry point (e.g., bench_throughput.py).
Both scenarios share identical logic for argument parsing, dataset loading, and result reporting/metrics calculation. The only distinct logic is the inference backend execution.
- Unified argument parsing: add a --backend argument (e.g., choices=["engine", "server"]) to switch modes.
- Shared data loading: reuse the datasets.py logic for both modes.
- Backend abstraction: if backend == "engine", initialize and launch the GPUWorker; if backend == "server", check the health of the endpoint.
- Execution loop: send requests via the selected backend interface.
- Unified reporting: calculate and print metrics using shared logic to ensure a fair comparison between offline and online performance.
This refactoring would significantly increase code reuse and improve maintainability. What do you think?
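The proposed unified entry point could be sketched as below. The function names `run_engine_backend`/`run_server_backend` and the prompt placeholders are hypothetical stand-ins; the real GPUWorker launch and endpoint health check would come from the existing scripts.

```python
# Illustrative sketch of a unified bench_throughput.py entry point.
import argparse


def run_engine_backend(prompts):
    # Placeholder for: initialize and launch the GPUWorker, run offline.
    return [f"engine:{p}" for p in prompts]


def run_server_backend(prompts):
    # Placeholder for: check endpoint health, then send HTTP requests.
    return [f"server:{p}" for p in prompts]


BACKENDS = {"engine": run_engine_backend, "server": run_server_backend}


def main(argv=None):
    # Unified argument parsing: one --backend flag switches modes.
    parser = argparse.ArgumentParser(description="Unified diffusion benchmark")
    parser.add_argument("--backend", choices=sorted(BACKENDS), default="engine")
    parser.add_argument("--num-prompts", type=int, default=3)
    args = parser.parse_args(argv)

    prompts = [f"prompt-{i}" for i in range(args.num_prompts)]  # shared data loading
    results = BACKENDS[args.backend](prompts)  # backend abstraction
    print(f"{args.backend}: {len(results)} requests completed")  # unified reporting
    return results
```

Because both modes flow through the same parse/load/report path, any metric fix or dataset change automatically applies to offline and online runs alike.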
He has already switched to DiffGenerator. Could you please take a look? Thanks. @mickqian
Add default value `eps=1e-5` to `register_fake` implementations of `fused_norm_scale_shift` and `fused_scale_residual_norm_scale_shift` custom ops, matching the default in the actual custom_op signatures. Made-with: Cursor
My testing for this is:

uv pip install -e ".[diffusion]"
# This is for GLM image
pip install --upgrade transformers

python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --num-prompts 3

==================================== Offline Throughput Benchmark Result =====================================
Model: zai-org/GLM-Image
Dataset: random
Resolution: 512x512x1
Num Inference Steps: 3
---------------------------------------------------------------------------
Total Requests: 3
Successful Requests: 3
Failed Requests: 0
Total Duration (seconds): 31.16
---------------------------------------------------------------------------
Frames Generated: 3
Megapixels Generated: 0.79
---------------------------------------------------------------------------
Frame Throughput (frames/sec): 0.10
MP Throughput (MP/sec): 0.03
Requests Per Second: 0.10
Latency Per Request (sec): 10.39
Peak Memory (MB): 35610.00
==============================================================================================================

python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --enable-torch-compile \
    --num-prompts 3

==================================== Offline Throughput Benchmark Result =====================================
Model: zai-org/GLM-Image
Dataset: random
Resolution: 512x512x1
Num Inference Steps: 3
---------------------------------------------------------------------------
Total Requests: 3
Successful Requests: 3
Failed Requests: 0
Total Duration (seconds): 31.47
---------------------------------------------------------------------------
Frames Generated: 3
Megapixels Generated: 0.79
---------------------------------------------------------------------------
Frame Throughput (frames/sec): 0.10
MP Throughput (MP/sec): 0.02
Requests Per Second: 0.10
Latency Per Request (sec): 10.49
Peak Memory (MB): 35634.00
==============================================================================================================

python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput \
    --model-path zai-org/GLM-Image \
    --height 512 --width 512 \
    --num-inference-steps 3 \
    --backend sglang \
    --num-prompts 3 \
    --skip-warmup \
    --output-file /tmp/bench_result.json
cat /tmp/bench_result.json

==================================== Offline Throughput Benchmark Result =====================================
Model: zai-org/GLM-Image
Dataset: random
Resolution: 512x512x1
Num Inference Steps: 3
---------------------------------------------------------------------------
Total Requests: 3
Successful Requests: 3
Failed Requests: 0
Total Duration (seconds): 39.99
---------------------------------------------------------------------------
Frames Generated: 3
Megapixels Generated: 0.79
---------------------------------------------------------------------------
Frame Throughput (frames/sec): 0.08
MP Throughput (MP/sec): 0.02
Requests Per Second: 0.08
Latency Per Request (sec): 13.33
Peak Memory (MB): 35610.00
==============================================================================================================
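The derived numbers in the reports above follow directly from the raw counters. A minimal sketch of the metric arithmetic (the helper name is hypothetical, but the formulas reproduce the 512x512, 3-prompt run):

```python
def compute_throughput_metrics(
    num_frames: int, height: int, width: int, duration_s: float, num_requests: int
) -> dict:
    """Derive the report's throughput metrics from raw benchmark counters."""
    megapixels = num_frames * height * width / 1e6
    return {
        "Megapixels Generated": round(megapixels, 2),
        "Frame Throughput (frames/sec)": round(num_frames / duration_s, 2),
        "MP Throughput (MP/sec)": round(megapixels / duration_s, 2),
        "Requests Per Second": round(num_requests / duration_s, 2),
        "Latency Per Request (sec)": round(duration_s / num_requests, 2),
    }
```

For the first run (3 frames at 512x512 over 31.16 s) this yields 0.79 MP generated, 0.10 frames/sec, and 10.39 s latency per request, matching the printed report.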
Why is peak memory 0 MiB?

Updated in #18154 (comment)
/tag-and-rerun-ci
…modal models (sgl-project#18154) Co-authored-by: Hao Jin <Hao Jin> Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>


Motivation
Address part of step 1 for #18077
Modifications
Accuracy Tests
N/A
Benchmarking and Profiling
Need all diffusion dependencies:
pip install imageio cache_dit remote-pdb accelerate addict
Need to install the source versions of transformers and diffusers:
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/diffusers
Sample single-GPU (RTX 6000 Pro) run with GLM-Image + sglang backend + torch.compile:
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput --model-path zai-org/GLM-Image --height 512 --width 512 --num-inference-steps 20 --backend sglang --enable-torch-compile --num-prompts 20 --batch-size 1
with resulting report:
GLM-Image + diffusers backend:
python3 -m sglang.multimodal_gen.benchmarks.bench_offline_throughput --model-path zai-org/GLM-Image --height 512 --width 512 --num-inference-steps 20 --backend diffusers --num-prompts 20 --batch-size 1
with resulting report:
Server:
sglang serve --model-path zai-org/GLM-Image --backend sglang
bench_serving:
python3 -m sglang.multimodal_gen.benchmarks.bench_serving --dataset random --num-prompts 10 --width 512 --height 512 --model zai-org/GLM-Image
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci