add indexer-topk capture (V3.2 NSA + infra) #24392
Conversation
Code Review
This pull request implements a framework for capturing and returning top-k indices from indexer layers, including new base classes for device and host caching and integration throughout the request lifecycle. Key feedback includes correcting a redundant method override in `IndexerTopkCapturer` that bypasses necessary data parallelism logic, fixing an off-by-one error and redundant tensor cloning during sequence indexing, and resolving a type hint inconsistency where the output was incorrectly typed as a list of integers instead of a base64-encoded string. Additionally, recommendations were made to improve code robustness and idiomaticity by replacing assertions with explicit exceptions, substituting magic numbers with named constants, and using native PyTorch methods for tensor metadata.
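As an illustration of those last recommendations, a minimal sketch of the suggested patterns (the function, constant, and checks here are hypothetical, not code from the PR):

```python
import torch

INDEX_TOPK = 2048  # named constant instead of a magic number


def validate_topk_indices(indices: torch.Tensor) -> None:
    # Explicit exceptions instead of bare asserts, so the checks
    # survive `python -O` and produce actionable error messages.
    if indices.dtype != torch.int32:
        raise TypeError(f"expected int32 indices, got {indices.dtype}")
    # Native tensor metadata (`shape`) instead of hand-derived sizes.
    if indices.shape[-1] != INDEX_TOPK:
        raise ValueError(
            f"expected last dim {INDEX_TOPK}, got {indices.shape[-1]}"
        )
```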
LGTM since this looks like a naive copy-paste of my code and introduces no risk. EDIT: I hear there is some cleanup, which is reasonable. I originally wanted to do abstractions but heard @ocss884 was doing a refactor on main, so I implemented a naive version to avoid us both abstracting separately and getting conflicts.
Force-pushed from ba80996 to 655411f
Hi, have you tried end-to-end runs? I previously implemented something similar in #16881 but gave up, since this information is larger than the KV cache, so capturing and returning it via the endpoint is not practical. For V3.2, 61 layers × 2048 × 4 bytes ≈ 488 KB per token. For V4, 30 layers × 1024 × 4 bytes = 120 KB per compressed token.
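A quick back-of-the-envelope check of the payload sizes quoted above (a minimal sketch; the layer counts, top-k widths, and int32 dtype are taken from the comment):

```python
# Per-token payload of captured top-k indices; int32 = 4 bytes per index.
v32_bytes = 61 * 2048 * 4  # DeepSeek V3.2: 61 indexer layers, top-k 2048
v4_bytes = 30 * 1024 * 4   # V4: 30 layers, top-k 1024

print(f"V3.2: {v32_bytes / 1024:.0f} KiB per token")            # ~488 KiB
print(f"V4:   {v4_bytes / 1024:.0f} KiB per compressed token")  # 120 KiB
```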
# Conflicts:
#   python/sglang/srt/hardware_backend/npu/moe/topk.py
#   python/sglang/srt/layers/moe/routed_experts_capturer.py
#   python/sglang/srt/layers/moe/topk.py
#   python/sglang/srt/layers/topk_capturer_base.py
#   python/sglang/srt/managers/scheduler_output_processor_mixin.py
#   python/sglang/srt/managers/utils.py
#   python/sglang/srt/model_executor/model_runner.py
/rerun-test test_return_indexer_topk.py test_return_routed_experts.py test_deepseek_v32_indexcache.py
✅ ✅ |
@zianglih Thanks for this feedback. Yeah, the current implementation is limited by the large amount of information in the host cache. The current PR unblocks the dsv4 rebase, and you can use …
/rerun-test test_return_indexer_topk.py |
✅ |
/tag-and-rerun-ci |
Stacked on #24403. Adds the `IndexerTopkCapturer` (built on `BaseTopkCapturer` from #24403) and wires V3.2 NSA models as the first producer.

**API**
- `--enable-return-indexer-topk` (default off)
- `return_indexer_topk: bool` on `GenerateReqInput`
- `meta_info["indexer_topk"]` is a base64-encoded int32 tensor of shape `(seqlen, num_indexer_layers, index_topk)` (see the decoding sketch at the end of this description)

**Activation gating** — `model_config.get_num_indexer_layers(hf_text_config)`:
- `num_hidden_layers` (one indexer per transformer layer)
- `num_indexer_layers` directly off `hf_text_config`, default 0
- if `None`, the flag is a logged no-op

**Producer wiring (CUDA)**
- `Indexer.forward_cuda` (NSA): `_maybe_capture_topk` at every return point.
- `forward_mla.py` `skip_topk` reuse paths: explicit capture call so the reused layer's slot reflects the indices actually in use.

**Stack**
- `lsyin/routed-experts-cleanup` (consolidate routed-experts capturer onto reusable base #24403)
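For reference, a minimal client-side sketch of decoding `meta_info["indexer_topk"]` as documented above, assuming the payload is the raw int32 buffer base64-encoded in row-major order (the helper name and the example layer/top-k values are illustrative):

```python
import base64

import numpy as np


def decode_indexer_topk(
    b64: str, seqlen: int, num_indexer_layers: int, index_topk: int
) -> np.ndarray:
    """Decode the base64 payload into a (seqlen, num_indexer_layers, index_topk) array."""
    buf = base64.b64decode(b64)
    return np.frombuffer(buf, dtype=np.int32).reshape(
        seqlen, num_indexer_layers, index_topk
    )


# e.g. for V3.2 with 61 indexer layers and index_topk=2048:
# topk = decode_indexer_topk(resp["meta_info"]["indexer_topk"], seqlen, 61, 2048)
```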