[WIP] Fix piecewise cuda graph recompile#13622

Closed
zminglei wants to merge 3 commits into sgl-project:main from zminglei:grok-piecewise

Conversation


@zminglei zminglei commented Nov 20, 2025

Motivation

Closes #13469

Modifications

Without piecewise CUDA graph:

python3 -m sglang.launch_server --model /shared/public/elr-models/xai-org/grok-2/ --tokenizer-path /shared/public/elr-models/xai-org/grok-2/tokenizer.tok.json --tp 8 --quantization fp8 --attention-backend triton --disable-radix-cache

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl --num-questions 1319
Accuracy: 0.929
Invalid: 0.000
Latency: 133.680 s
Output throughput: 1111.674 token/s

python3 -m sglang.bench_serving --backend sglang --dataset-name random-ids --num-prompts 1 --random-input-len 1024 --random-output-len 1 --random-range-ratio 1 --tokenizer /shared/public/elr-models/xai-org/grok-2/tokenizer.tok.json
---------------Time to First Token----------------
Mean TTFT (ms):                          104.61    
Median TTFT (ms):                        104.61    
P99 TTFT (ms):                           104.61    

With piecewise CUDA graph:

python3 -m sglang.launch_server --model /shared/public/elr-models/xai-org/grok-2/ --tokenizer-path /shared/public/elr-models/xai-org/grok-2/tokenizer.tok.json --tp 8 --quantization fp8 --attention-backend triton --enable-piecewise-cuda-graph --disable-radix-cache

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl --num-questions 1319
Accuracy: 0.934
Invalid: 0.001
Latency: 121.737 s
Output throughput: 1219.835 token/s

python3 -m sglang.bench_serving --backend sglang --dataset-name random-ids --num-prompts 1 --random-input-len 1024 --random-output-len 1 --random-range-ratio 1 --tokenizer /shared/public/elr-models/xai-org/grok-2/tokenizer.tok.json
---------------Time to First Token----------------
Mean TTFT (ms):                          79.50     
Median TTFT (ms):                        79.50     
P99 TTFT (ms):                           79.50    
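To summarize the two runs, the relative deltas can be computed directly from the numbers reported above (this small script only restates the benchmark figures; it introduces no new measurements):

```python
# Relative change between the runs without and with piecewise CUDA graph,
# using the figures copied from the benchmark output above.
def rel_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative = reduction)."""
    return (after - before) / before * 100

ttft = rel_change(104.61, 79.50)             # mean TTFT in ms -> ~-24.0%
latency = rel_change(133.680, 121.737)       # gsm8k latency in s -> ~-8.9%
throughput = rel_change(1111.674, 1219.835)  # output tokens/s -> ~+9.7%
```

So enabling piecewise CUDA graph cuts mean TTFT by roughly 24% and improves gsm8k output throughput by roughly 10% with comparable accuracy.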

Accuracy Tests

Benchmarking and Profiling

Checklist


assert not self._called, "SGLangBackend can only be called once"
# assert not self._called, "SGLangBackend can only be called once"
if self._called:
Collaborator
?
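The change under review swaps a one-shot assertion for a guard on `self._called`. A minimal sketch of that idempotent-call pattern, where a repeat call reuses the cached compilation result instead of recompiling (class and method names here are illustrative, not the actual SGLangBackend API):

```python
# Hypothetical sketch: a backend that compiles on first call and returns
# the cached artifact on later calls, rather than asserting it is only
# ever called once. This mirrors the guard discussed in the review.
class PiecewiseBackend:
    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._called = False
        self._compiled = None

    def __call__(self, graph_module):
        if self._called:
            # Reuse the cached result; do not trigger a recompile.
            return self._compiled
        self._called = True
        self._compiled = self._compile_fn(graph_module)
        return self._compiled

compile_calls = []
backend = PiecewiseBackend(
    lambda gm: compile_calls.append(gm) or f"compiled:{gm}"
)
first = backend("gm0")
second = backend("gm1")  # guard hit: returns the first compiled artifact
```

The design choice is that repeat invocations become cheap no-ops rather than errors, which is what avoids the recompile path this PR targets.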

@hebiao064 hebiao064 changed the title [WIP] Fix grok2 piecewise cuda graph [WIP] Fix piecewise cuda graph recompile Nov 20, 2025
@zminglei
Collaborator Author

Moved to this PR #13667

@zminglei zminglei closed this Dec 12, 2025


Development

Successfully merging this pull request may close these issues.

[Bug] Mixtral & Grok2 Piecewise CUDA Graph Accuracy Drop
