fix: piecewise_cuda_graph get correct qo_indptr#21452

Merged
Fridge003 merged 8 commits into sgl-project:main from yyihuang:fix_21218
Mar 28, 2026

Conversation

@yyihuang
Collaborator

@yyihuang yyihuang commented Mar 26, 2026

Motivation

#21218

Modifications

For padding tokens, append a fake (bs+1)-th request with pad_tokens extend tokens whose KV indices all point to scratch slot 0. This makes qo_indptr[-1] = static_num_tokens without affecting the causal masks of the real requests.
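The padding trick above can be sketched in pure Python (hypothetical helper name and numbers; the real code operates on GPU tensors and also remaps the fake request's KV indices, which is not modeled here):

```python
def pad_qo_indptr(qo_indptr, static_num_tokens):
    """Append a fake (bs+1)-th request that owns all padding tokens.

    qo_indptr has length bs + 1 and qo_indptr[-1] is the number of real
    extend tokens. The fake entry spans the remaining pad_tokens, so the
    padded qo_indptr[-1] equals static_num_tokens while the causal masks
    of the real requests are unchanged.
    """
    assert static_num_tokens >= qo_indptr[-1]
    return qo_indptr + [static_num_tokens]

# Example: 2 real requests with 3 and 2 extend tokens, padded to 8 tokens.
print(pad_qo_indptr([0, 3, 5], static_num_tokens=8))  # [0, 3, 5, 8]
```

Note that when the batch already fills the captured size, the fake request is simply empty (its start and end offsets coincide), so real requests are never perturbed.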

Accuracy Tests

python -m sglang.launch_server --model-path Qwen/Qwen3-14B --attention-backend flashinfer --disable-cuda-graph

python3 benchmark/gsm8k/bench_sglang.py --num-questions 100
100%|███████████████████████████████| 100/100 [00:05<00:00, 18.24it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 5.524 s
Output throughput: 2255.160 token/s

With CUDA graph enabled:
python3 benchmark/gsm8k/bench_sglang.py --num-questions 100
100%|█████████████████████████████| 100/100 [00:04<00:00, 24.05it/s]
Accuracy: 0.940
Invalid: 0.000
Latency: 4.198 s
Output throughput: 2968.234 token/s

  • update after review comment:
    python3 benchmark/gsm8k/bench_sglang.py --num-questions 100
    100%|███████████████████████████████████████████████████████| 100/100 [00:04<00:00, 24.82it/s]
    Accuracy: 0.930
    Invalid: 0.000
    Latency: 4.035 s
    Output throughput: 3180.599 token/s

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yyihuang yyihuang marked this pull request as draft March 26, 2026 04:55
@yyihuang yyihuang marked this pull request as ready for review March 26, 2026 05:16

@ispobock
Collaborator

/tag-and-rerun-ci


@Oasis-Git
Collaborator

Oasis-Git commented Mar 27, 2026

Ran the tests locally on H100 with tp=1 and tp=8; the GSM8K test passes.

Collaborator

@Oasis-Git Oasis-Git left a comment

In general the change is reasonable. Here are some suggestions for revision.

    num_tokens = len(forward_batch.input_ids)
    index = bisect.bisect_left(self.capture_num_tokens, num_tokens)
    static_num_tokens = self.capture_num_tokens[index]
    with enable_piecewise_cuda_graph(num_tokens=static_num_tokens):
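The snippet above rounds the batch's token count up to the nearest captured graph size. A self-contained illustration of that bucket selection (the capture sizes here are made up):

```python
import bisect

# Hypothetical list of token counts at which graphs were captured;
# it must be sorted ascending for bisect_left to work.
capture_num_tokens = [1, 2, 4, 8, 16, 32]

def select_static_num_tokens(num_tokens):
    """Return the smallest captured size that can hold num_tokens."""
    index = bisect.bisect_left(capture_num_tokens, num_tokens)
    return capture_num_tokens[index]

print(select_static_num_tokens(5))  # 8  (rounded up to the next bucket)
print(select_static_num_tokens(8))  # 8  (exact match, no padding needed)
```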
Collaborator


I think we can move num_tokens into the ForwardContext. Also, to skip the extra computation and the GPU-CPU sync from .item(), I suggest that variables such as num_dummy_pages be pre-calculated.

Collaborator Author


Hi, I took your suggestions and updated the code:

  • Added self.num_tokens: Optional[int] = None field to ForwardContext
  • Eliminated both .item() GPU-CPU syncs in the dummy-request block
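The second bullet can be sketched as follows (hypothetical function and page size; the point is that the computation stays on host-side integers, so no .item() call forces a CUDA sync):

```python
PAGE_SIZE = 16  # hypothetical KV-cache page size

def num_dummy_pages(real_num_tokens, static_num_tokens):
    """Pages needed for the padding tokens, computed from Python ints.

    Both inputs are already host-side integers (e.g. cached on the
    ForwardContext), so no GPU tensor is read back with .item() and the
    CUDA stream is never forced to synchronize.
    """
    pad_tokens = static_num_tokens - real_num_tokens
    return (pad_tokens + PAGE_SIZE - 1) // PAGE_SIZE  # ceiling division

print(num_dummy_pages(real_num_tokens=5, static_num_tokens=64))  # 4
```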

@Oasis-Git
Collaborator

Oasis-Git commented Mar 28, 2026

With --max-running-requests 1, disable the piecewise CUDA graph to pass the CI.

@Fridge003 Fridge003 merged commit 3ab9afd into sgl-project:main Mar 28, 2026
237 of 270 checks passed