[FlashInfer v0.6.10] [RL] [DSv32] [GLM-5] Add `--dsa-topk-backend` and integrate FlashInfer and pytorch topk by zianglih · Pull Request #22851 · sgl-project/sglang

zianglih · 2026-04-15T04:38:38Z

Motivation

Add --dsa-topk-backend for configurable topk backend implementation selection.

torch.topk is used by GLM-5 for RL.
FlashInfer topk supports deterministic execution and a configurable tie-break (flashinfer-ai/flashinfer#3095), and performs better at long context.
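
To make the tie-break point concrete, here is a minimal pure-Python reference for a deterministic top-k with an explicit tie-break rule. It is an illustrative sketch, not FlashInfer's kernel. The name `topk_indices` and the `prefer` parameter are hypothetical, and reading the tie-break as "keep the smaller index" versus "keep the larger index" among equal scores is an assumption that this description does not spell out.

```python
from typing import List


def topk_indices(scores: List[float], k: int, prefer: str = "small") -> List[int]:
    """Return the indices of the k largest scores, in ascending index order.

    Equal scores are ordered by index, so the result is fully determined by
    the input: prefer="small" keeps the smaller index, prefer="large" keeps
    the larger one. (Assumed semantics, for illustration only.)
    """
    if prefer not in ("small", "large"):
        raise ValueError(f"unknown tie-break preference: {prefer!r}")
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    sign = 1 if prefer == "small" else -1
    # Sort by descending score, then by index in the preferred direction.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], sign * i))
    return sorted(ranked[:k])


# Three entries tie at 0.5, so the second selected index depends on the rule.
scores = [0.5, 0.9, 0.5, 0.5, 0.1]
picked_small = topk_indices(scores, k=2, prefer="small")
picked_large = topk_indices(scores, k=2, prefer="large")
```

Without a fixed rule, a parallel kernel may pick any of the tied entries, which is why reproducible RL runs tend to care about this setting.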

Modifications

Add --dsa-topk-backend, defaulting to the existing sgl-kernel implementation
Integrate flashinfer and torch topk for unfused code path
Integrate flashinfer topk for fused code path
Add SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK and SGLANG_DSA_TOPK_FLASHINFER_DETERMINISTIC
Add new unit test
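
The configuration surface listed above can be sketched roughly as follows. This is a standalone illustration, not the code added by this PR: the helper names `parse_topk_backend` and `read_flashinfer_topk_env` are hypothetical, the accepted tie-break values follow the later commit "Refactor to [None, "small", "large"]", and the boolean parsing and the "off by default" choice for the deterministic flag are assumptions.

```python
import argparse
import os
from typing import List, Mapping, Optional, Tuple

# Backend names as they appear in the launch commands of this PR.
TOPK_BACKENDS = ("sgl-kernel", "torch", "flashinfer")


def parse_topk_backend(argv: List[str]) -> str:
    """Parse --dsa-topk-backend, falling back to the existing sgl-kernel path."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dsa-topk-backend", choices=TOPK_BACKENDS, default="sgl-kernel"
    )
    args, _unknown = parser.parse_known_args(argv)
    return args.dsa_topk_backend


def read_flashinfer_topk_env(
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], bool]:
    """Read the two FlashInfer top-k environment variables (assumed parsing)."""
    raw = env.get("SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK", "").strip().lower()
    if raw in ("", "none"):
        tie_break = None
    elif raw in ("small", "large"):
        tie_break = raw
    else:
        raise ValueError(f"unsupported tie-break value: {raw!r}")
    flag = env.get("SGLANG_DSA_TOPK_FLASHINFER_DETERMINISTIC", "0")
    # Assumption: the flag is off unless explicitly enabled.
    deterministic = flag.strip().lower() in ("1", "true")
    return tie_break, deterministic


backend = parse_topk_backend(["--dsa-topk-backend", "flashinfer"])
settings = read_flashinfer_topk_env(
    {"SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK": "large"}
)
```

Passing the environment as a mapping keeps the helper easy to test without mutating `os.environ`.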

Accuracy Tests

The new unit test passes: python3 -m pytest -q test/registered/kernels/test_dsa_indexer.py -k test_topk_unfused_backends_valid_selection

# sgl-kernel
SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 python3 -m sglang.launch_server --dsa-topk-backend sgl-kernel --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend dsa --dsa-decode-backend flashmla_sparse --dsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.977
Invalid: 0.000
Latency: 13.146 s
Output throughput: 8566.746 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 12.749 s
Output throughput: 8878.813 token/s
Accuracy: 0.981
Invalid: 0.000
Latency: 17.272 s
Output throughput: 6584.294 token/s
# torch unfused
SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_DSA_FUSE_TOPK=0 python3 -m sglang.launch_server --dsa-topk-backend torch --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend dsa --dsa-decode-backend flashmla_sparse --dsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.982
Invalid: 0.000
Latency: 18.256 s
Output throughput: 6183.790 token/s
Accuracy: 0.983
Invalid: 0.000
Latency: 17.637 s
Output throughput: 6388.987 token/s
Accuracy: 0.980
Invalid: 0.000
Latency: 17.609 s
Output throughput: 6403.039 token/s
# flashinfer unfused
SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_DSA_FUSE_TOPK=0 python3 -m sglang.launch_server --dsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend dsa --dsa-decode-backend flashmla_sparse --dsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.978
Invalid: 0.000
Latency: 20.846 s
Output throughput: 5413.876 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 24.896 s
Output throughput: 4557.003 token/s
Accuracy: 0.979
Invalid: 0.000
Latency: 21.313 s
Output throughput: 5292.839 token/s
# flashinfer fused
SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_DSA_FUSE_TOPK=1 python3 -m sglang.launch_server --dsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend dsa --dsa-decode-backend flashmla_sparse --dsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.980
Invalid: 0.000
Latency: 13.531 s
Output throughput: 8320.213 token/s
Accuracy: 0.981
Invalid: 0.000
Latency: 12.771 s
Output throughput: 8832.274 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 12.121 s
Output throughput: 9267.255 token/s
# flashinfer fused with tie_break=1
SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK=1 SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_DSA_FUSE_TOPK=1 python3 -m sglang.launch_server --dsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend dsa --dsa-decode-backend flashmla_sparse --dsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.978
Invalid: 0.000
Latency: 13.716 s
Output throughput: 8219.616 token/s
Accuracy: 0.980
Invalid: 0.000
Latency: 13.008 s
Output throughput: 8652.700 token/s
Accuracy: 0.978
Invalid: 0.000
Latency: 17.669 s
Output throughput: 6457.714 token/s
# flashinfer fused with tie_break=2
SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK=2 SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 SGLANG_DSA_FUSE_TOPK=1 python3 -m sglang.launch_server --dsa-topk-backend flashinfer --kv-cache-dtype bf16 --model /data/models/ziangli_v32/DeepSeek-V3.2 --tp 8 --dp 8 --enable-dp-attention --moe-runner-backend flashinfer_trtllm_routed --attention-backend dsa --dsa-decode-backend flashmla_sparse --dsa-prefill-backend flashmla_sparse --page-size 64 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.979
Invalid: 0.000
Latency: 13.370 s
Output throughput: 8438.633 token/s
Accuracy: 0.982
Invalid: 0.000
Latency: 13.129 s
Output throughput: 8628.890 token/s
Accuracy: 0.980
Invalid: 0.000
Latency: 12.498 s
Output throughput: 9047.713 token/s
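
For a quick side-by-side view, the three runs of each configuration above can be averaged. The sketch below only restates the posted numbers (accuracy and output throughput); the configuration labels are shorthand for the launch commands, with "sgl-kernel" standing for the first block launched with --dsa-topk-backend sgl-kernel.

```python
from statistics import mean

# (accuracy, output throughput in token/s) for the three runs of each
# configuration, copied from the logs above.
RUNS = {
    "sgl-kernel": [(0.977, 8566.746), (0.978, 8878.813), (0.981, 6584.294)],
    "torch unfused": [(0.982, 6183.790), (0.983, 6388.987), (0.980, 6403.039)],
    "flashinfer unfused": [(0.978, 5413.876), (0.978, 4557.003), (0.979, 5292.839)],
    "flashinfer fused": [(0.980, 8320.213), (0.981, 8832.274), (0.978, 9267.255)],
    "flashinfer fused, tie_break=1": [(0.978, 8219.616), (0.980, 8652.700), (0.978, 6457.714)],
    "flashinfer fused, tie_break=2": [(0.979, 8438.633), (0.982, 8628.890), (0.980, 9047.713)],
}


def summarize(runs):
    """Return {config: (mean accuracy, mean throughput)} over the listed runs."""
    return {
        name: (mean([acc for acc, _ in rows]), mean([tput for _, tput in rows]))
        for name, rows in runs.items()
    }


summary = summarize(RUNS)
for name, (acc, tput) in summary.items():
    print(f"{name:32s} accuracy={acc:.3f} throughput={tput:.0f} token/s")
```

With only three runs per configuration and visible run-to-run variance (for example, one sgl-kernel run at 6584 token/s), these means are a rough summary rather than a precise ranking.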

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ✅ Run #26347798409
Latest PR Test (Extra): ✅ Run #26347798377

gemini-code-assist · 2026-04-15T04:38:41Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

nvpohanh · 2026-04-21T04:58:32Z

cc @nvjullin

DarkSharpness · 2026-04-21T06:04:53Z

qq: Does flashinfer kernel support cuda-graph? I know flashinfer may dispatch to different algorithms based on static sequence length, but is that safe under CUDA graph?

zianglih · 2026-04-21T06:49:10Z

Hi @DarkSharpness, thank you for calling this out. This is indeed a valid concern. FlashInfer's current dispatch heuristics use max_len, which is not CUDA graph safe in the current implementation. We are also working with the CCCL team on a graph-safe topk (flashinfer-ai/flashinfer#3091 etc.), which will be integrated into FlashInfer soon. For now, this PR can disallow CUDA graph when the flashinfer topk backend is used.
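
The concern can be illustrated with a toy model of capture and replay. This is a hedged sketch in plain Python, not FlashInfer or CUDA code: `pick_algorithm`, its threshold, the variant names, and the `ToyGraph` class are all hypothetical. It only shows the general property that host-side dispatch runs once at capture time and is not re-evaluated when the graph is replayed.

```python
from typing import List


def pick_algorithm(max_len: int, threshold: int = 8192) -> str:
    """Hypothetical host-side heuristic: choose a kernel variant from max_len."""
    return "single_pass" if max_len <= threshold else "multi_pass"


class ToyGraph:
    """Stand-in for graph capture: launches are recorded once, replayed verbatim."""

    def __init__(self) -> None:
        self.recorded: List[str] = []

    def capture(self, max_len: int) -> None:
        # The host-side dispatch runs here, exactly once.
        self.recorded = [pick_algorithm(max_len)]

    def replay(self, max_len: int) -> List[str]:
        # Replay reissues the recorded launches. The heuristic is not
        # re-evaluated, so the current max_len cannot change the variant.
        return list(self.recorded)


graph = ToyGraph()
graph.capture(max_len=1024)               # captured with a short sequence
replayed = graph.replay(max_len=100_000)  # replayed with a long sequence
eager = [pick_algorithm(100_000)]         # what eager dispatch would choose
```

Here `replayed` and `eager` disagree, which is the kind of mismatch a graph-safe mode has to rule out, for instance by making the launched work independent of values that are only known at replay time.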

zianglih · 2026-04-21T07:21:27Z

Hold until flashinfer-ai/flashinfer#3133, which introduces a graph-safe mode.

## 📌 Description

@HumansAnd

Parent PR: #3095
SGLang PR: sgl-project/sglang#22851

Add `row_starts` and `dsa_graph_safe` for SGLang DSA integration.

## 🔍 Related Issues

sgl-project/sglang#22851 (comment)

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **New Features**
  * Added dsa_graph_safe flag to top-k APIs to opt into DSA-graph safe execution.
  * Added optional row_starts parameter to page-table and ragged top-k transforms to support per-row score offsets.
* **Behavior**
  * When dsa_graph_safe=True the optimized clusters fast-path is disabled to ensure safe execution.
* **Tests**
  * Added tests covering row_starts behavior for page-table and ragged transforms.

zianglih · 2026-04-24T17:36:24Z

Hold until flashinfer v0.6.10 release.

zianglih · 2026-05-10T01:07:55Z

Hold until #24452. We also need to add V4 support.

zianglih · 2026-05-15T06:32:55Z

Following discussion with @DarkSharpness, we will add V4 support in later PRs. V4 also has pending metadata refactoring work.

nvpohanh · 2026-05-22T08:17:30Z

@Fridge003 could you check if your comments have been addressed? thanks!

zianglih · 2026-05-22T09:26:24Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a modular backend system for DeepSeek Sparse Attention (DSA) top-k operations, allowing users to choose between sgl-kernel, torch, and flashinfer implementations via the new --dsa-topk-backend argument. The changes include the addition of a DSATopKBackend class to encapsulate the logic for both fused and unfused top-k transformations, along with new environment variables to configure FlashInfer's deterministic behavior and tie-breaking modes. The existing DSA attention backend was refactored to integrate this new system, and extensive tests were added to verify the correctness and equivalence of the backends. Reviewer feedback focused on performance optimizations, specifically suggesting the replacement of torch.diff with manual slicing and subtraction in performance-critical sections.

zianglih · 2026-05-22T18:48:24Z

Posted an additional GSM8K Platinum accuracy run for the FlashInfer fused DSA topk path. Server was launched once, then the benchmark was repeated three times against the same server process.

Server command:

SGLANG_DSA_TOPK_FLASHINFER_TIE_BREAK=large \
SGLANG_DSA_PREFILL_DENSE_ATTN_KV_LEN_THRESHOLD=0 \
SGLANG_DSA_FUSE_TOPK=1 \
PYTHONPATH=/sgl-workspace/sglang/python:${PYTHONPATH:-} \
python3 -m sglang.launch_server \
  --host 127.0.0.1 \
  --port 30000 \
  --dsa-topk-backend flashinfer \
  --kv-cache-dtype bf16 \
  --model /data/models/ziangli_v32/DeepSeek-V3.2 \
  --tp 8 \
  --dp 8 \
  --enable-dp-attention \
  --moe-runner-backend flashinfer_trtllm_routed \
  --attention-backend dsa \
  --dsa-decode-backend flashmla_sparse \
  --dsa-prefill-backend flashmla_sparse \
  --page-size 64 \
  --trust-remote-code

Benchmark command, repeated 3 times:

python3 benchmark/gsm8k/bench_sglang.py \
  --num-shots 8 \
  --num-questions 1209 \
  --parallel 1209 \
  --platinum

gsm8k_1.log final output:

Accuracy: 0.976
Invalid: 0.000
Latency: 13.918 s
Output throughput: 8094.320 token/s

gsm8k_2.log final output:

Accuracy: 0.979
Invalid: 0.000
Latency: 13.249 s
Output throughput: 8498.528 token/s

gsm8k_3.log final output:

Accuracy: 0.981
Invalid: 0.000
Latency: 11.975 s
Output throughput: 9395.941 token/s

The large tie-break setting is the current env-var API equivalent of the previous numeric tie-break mode 2. Runtime logs were captured under run id dsa_topk_flashinfer_gsm8k_3x_20260522_184257.

Fridge003 · 2026-05-23T05:05:24Z

/tag-and-rerun-ci

zianglih · 2026-05-23T10:21:26Z

/rerun-failed-ci

zianglih · 2026-05-23T10:23:40Z

/rerun-failed-ci

zianglih · 2026-05-23T18:13:11Z

/rerun-failed-ci

zianglih · 2026-05-24T00:45:32Z

/rerun-stage base-c-test-4-gpu-b200 base-c-test-4-gpu-h100 base-c-test-8-gpu-h200

github-actions · 2026-05-24T00:45:56Z

⚠️ /rerun-stage has been deprecated.

Stage granularity is too coarse — a stage usually doesn't map to one feature, so rerunning a stage re-pays the cost of unrelated tests. If you don't know which exact test files to rerun, you shouldn't be using /rerun-stage or /rerun-test in the first place.

Use one of these instead:

Selective tests (you know exactly which files to rerun):
```
/rerun-test test_foo.py test_bar.py
```
Rerun only failed jobs:
```
/rerun-failed-ci
```
Full CI rerun (with extra coverage): add the run-ci or run-ci-extra label and push a new commit (or use /tag-and-rerun-ci).

AMD CI: stage-level dispatch is still available via Actions UI → PR Test (AMD) / PR Test ROCm 7.2 (AMD) → Run workflow → pick a stage from the dropdown.

zianglih · 2026-05-24T05:12:00Z

/rerun-failed-ci

zianglih · 2026-05-24T08:03:24Z

/rerun-failed-ci

zianglih · 2026-05-24T22:08:08Z

nv ci passed

nvpohanh · 2026-05-25T01:00:54Z

@Fridge003 this PR has passed the CI. Could you help to merge? Thanks!

zianglih requested review from Fridge003, HaiShaw, Qiaolin-Yu, hebiao064, ispobock and merrymercy as code owners April 15, 2026 04:38

github-actions Bot added the documentation Improvements or additions to documentation label Apr 15, 2026

ziang-and force-pushed the torch-topk branch from 3e9baab to a62152a Compare April 21, 2026 02:54

ziang-and requested a review from wisclmy0611 as a code owner April 21, 2026 02:54

zianglih changed the title ~~[RL] [V3.2] [GLM-5] Add SGLANG_NSA_TORCH_TOPK~~ [RL] [DSv32] [GLM-5] Add --nsa-topk-backend and integrate FlashInfer and pytorch topk Apr 21, 2026

This was referenced Apr 21, 2026

- [Feature] DSv32: Optimize topk for long context decode #16858 (Open)
- [Roadmap] DeepSeek v3.2 (GLM 5) Optimization #15025 (Open)

zianglih mentioned this pull request Apr 22, 2026

- feat: Add row_starts and dsa_graph_safe to topk flashinfer-ai/flashinfer#3133 (Merged, 5 tasks)

This was referenced May 1, 2026

- Add --miles-nsa-topk-backend radixark/miles#1058 (Open)
- [Perf & Feat] Add deepseek32 topk opt : Introduction to the ultra low latency attention #23761 (Open)

b8zhong added the run-ci label May 14, 2026

zianglih marked this pull request as draft May 14, 2026 17:42

ziang-and force-pushed the torch-topk branch from 56815b1 to c1531a8 Compare May 14, 2026 17:42

zianglih marked this pull request as ready for review May 14, 2026 18:02

zianglih requested a review from zijiexia as a code owner May 16, 2026 07:27

zianglih changed the title ~~[RL] [DSv32] [GLM-5] Add --dsa-topk-backend and integrate FlashInfer and pytorch topk~~ [FlashInfer v0.6.10] [RL] [DSv32] [GLM-5] Add --dsa-topk-backend and integrate FlashInfer and pytorch topk May 21, 2026

Refactor

dcce26d

ziang-and requested review from 1am9trash, YAMY1234, hubertlu-tw and kkHuang-amd as code owners May 21, 2026 01:18

zianglih added 3 commits May 20, 2026 18:24

Clean up

f20dcdd

Refactor to [None, "small", "large"]

fb7fa68

Clean up

aeeedc8

gemini-code-assist Bot reviewed May 22, 2026

Comment thread python/sglang/srt/layers/attention/dsa/dsa_topk_backend.py Outdated

Comment thread python/sglang/srt/layers/attention/dsa/dsa_topk_backend.py Outdated

Address DSA topk review comments

c430da9

Fridge003 approved these changes May 23, 2026

Fridge003 added the run-ci-extra label May 23, 2026

Merge branch 'main' into torch-topk

2685a15

Merge branch 'main' into torch-topk

23d7552

Fridge003 merged commit 2b9dd9c into sgl-project:main May 25, 2026
224 of 279 checks passed
