
add indexer-topk capture (V3.2 NSA + infra) #24392

Merged
hnyls2002 merged 8 commits into main from lsyin/indexer-topk-infra on May 5, 2026

Conversation

@hnyls2002
Collaborator

@hnyls2002 hnyls2002 commented May 4, 2026

Stacked on #24403. Adds the IndexerTopkCapturer (built on BaseTopkCapturer from #24403) and wires V3.2 NSA models as the first producer.

API

  • Server flag: --enable-return-indexer-topk (default off)
  • Per-request flag: return_indexer_topk: bool on GenerateReqInput
  • Response: meta_info["indexer_topk"] is a base64-encoded int32 tensor of shape (seqlen, num_indexer_layers, index_topk)
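A client would decode the response payload roughly as below. This is a sketch: it assumes the base64 string wraps the raw native-endian bytes of the int32 tensor, and the `decode_indexer_topk` helper is ours, not part of the PR.

```python
import base64

import numpy as np


def decode_indexer_topk(meta_info: dict, seqlen: int,
                        num_indexer_layers: int, index_topk: int) -> np.ndarray:
    """Decode meta_info["indexer_topk"] into an int32 array of shape
    (seqlen, num_indexer_layers, index_topk)."""
    raw = base64.b64decode(meta_info["indexer_topk"])
    flat = np.frombuffer(raw, dtype=np.int32)
    return flat.reshape(seqlen, num_indexer_layers, index_topk)
```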

Activation gating via model_config.get_num_indexer_layers(hf_text_config):

  • NSA models (V3.2 family) → num_hidden_layers (one indexer per transformer layer)
  • Other architectures → reads num_indexer_layers directly off hf_text_config, default 0
  • 0 → capturer stays None, flag is a logged no-op

Producer wiring (CUDA)

  • Indexer.forward_cuda (NSA): _maybe_capture_topk at every return point.
  • forward_mla.py skip_topk reuse paths: explicit capture call so the reused layer's slot reflects the indices actually in use.
  • NPU path is left for follow-up.
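The producer hook pattern described above can be sketched as follows; everything beyond the names `Indexer.forward_cuda` and `_maybe_capture_topk` (which the PR mentions) is an assumption about the actual implementation.

```python
class IndexerTopkCapturer:
    """Minimal stand-in for the real capturer: records per-layer top-k indices."""

    def __init__(self, num_indexer_layers: int):
        self.slots = [None] * num_indexer_layers

    def capture(self, layer_id: int, topk_indices):
        self.slots[layer_id] = topk_indices


class Indexer:
    def __init__(self, layer_id: int, capturer=None):
        self.layer_id = layer_id
        self.capturer = capturer  # None when --enable-return-indexer-topk is off

    def _maybe_capture_topk(self, topk_indices):
        # No-op when capture is disabled, so the hot path stays cheap.
        if self.capturer is not None:
            self.capturer.capture(self.layer_id, topk_indices)

    def forward_cuda(self, topk_indices):
        # The real kernel computes topk_indices; here they are passed in.
        # Every return point calls _maybe_capture_topk before handing the
        # indices to attention, so reuse paths can also capture explicitly.
        self._maybe_capture_topk(topk_indices)
        return topk_indices
```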

Stack

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a framework for capturing and returning top-k indices from indexer layers, including new base classes for device and host caching and integration throughout the request lifecycle. Key feedback:

  • Correct a redundant method override in IndexerTopkCapturer that bypasses necessary data-parallelism logic.
  • Fix an off-by-one error and redundant tensor cloning during sequence indexing.
  • Resolve a type-hint inconsistency where the output was typed as a list of integers instead of a base64-encoded string.
  • For robustness and idiomaticity: replace assertions with explicit exceptions, substitute magic numbers with named constants, and use native PyTorch methods for tensor metadata.

Comment thread python/sglang/srt/layers/attention/indexer_topk_capturer.py Outdated
Comment thread python/sglang/srt/layers/topk_capturer_base.py Outdated
Comment thread python/sglang/srt/managers/io_struct.py Outdated
Comment thread python/sglang/srt/layers/attention/indexer_topk_capturer.py Outdated
Comment thread python/sglang/srt/layers/attention/indexer_topk_capturer.py
Comment thread python/sglang/srt/layers/attention/indexer_topk_capturer.py Outdated
Comment thread python/sglang/srt/layers/topk_capturer_base.py Outdated
Collaborator

@fzyzcjy fzyzcjy left a comment


LGTM since this looks like a naive copy-paste of my code and introduces no risk. EDIT: I hear there is some cleanup, which is reasonable. I originally wanted to do abstractions, but heard @ocss884 was doing a refactor on main, so I implemented a naive version to avoid the two of us each abstracting once and getting conflicts.

@hnyls2002 hnyls2002 force-pushed the lsyin/indexer-topk-infra branch from ba80996 to 655411f Compare May 5, 2026 03:07
@hnyls2002 hnyls2002 changed the title add indexer-topk capture infra add indexer-topk capture (V3.2 NSA + infra) May 5, 2026
@hnyls2002 hnyls2002 changed the base branch from main to lsyin/routed-experts-cleanup May 5, 2026 03:07
Base automatically changed from lsyin/routed-experts-cleanup to main May 5, 2026 19:41
@zianglih
Contributor

zianglih commented May 5, 2026

Hi, have you tried end-to-end runs? I previously implemented something similar in #16881 but gave up, since this information is larger than the KV cache, so capturing it and returning it via the endpoint is not practical.

For V3.2: 61 layers × 2048 indices × 4 bytes = 488 KiB per token.

For V4: 30 layers × 1024 indices × 4 bytes = 120 KiB per compressed token.
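The arithmetic behind these per-token figures checks out; as a quick verification (the helper name is ours):

```python
def topk_bytes_per_token(num_indexer_layers: int, index_topk: int,
                         itemsize: int = 4) -> int:
    """Bytes of int32 top-k indices captured per (compressed) token."""
    return num_indexer_layers * index_topk * itemsize


v32 = topk_bytes_per_token(61, 2048)  # 499,712 bytes = 488 KiB
v4 = topk_bytes_per_token(30, 1024)   # 122,880 bytes = 120 KiB
```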

# Conflicts:
#	python/sglang/srt/hardware_backend/npu/moe/topk.py
#	python/sglang/srt/layers/moe/routed_experts_capturer.py
#	python/sglang/srt/layers/moe/topk.py
#	python/sglang/srt/layers/topk_capturer_base.py
#	python/sglang/srt/managers/scheduler_output_processor_mixin.py
#	python/sglang/srt/managers/utils.py
#	python/sglang/srt/model_executor/model_runner.py
@hnyls2002
Collaborator Author

/rerun-test test_return_indexer_topk.py test_return_routed_experts.py test_deepseek_v32_indexcache.py

@github-actions
Contributor

github-actions Bot commented May 5, 2026

8-gpu-h200 (2 tests): View workflow run

cd test/ && python3 registered/8-gpu-models/test_return_indexer_topk.py
cd test/ && python3 registered/8-gpu-models/test_deepseek_v32_indexcache.py

2-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/rl/test_return_routed_experts.py

@hnyls2002
Collaborator Author

@zianglih Thanks for this feedback. Yeah, the current implementation is limited by the large amount of information in the host cache. The current PR unblocks the dsv4 rebase, and you can use --max-total-tokens to keep the buffer manageable for now. In the future, we can still improve this.

cc @yueming-yuan @fzyzcjy
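As a rough sizing aid for the --max-total-tokens suggestion above (a sketch; `host_buffer_bytes` is a hypothetical helper, not part of the PR):

```python
def host_buffer_bytes(max_total_tokens: int, num_indexer_layers: int,
                      index_topk: int, itemsize: int = 4) -> int:
    """Worst-case host cache size for captured int32 top-k indices."""
    return max_total_tokens * num_indexer_layers * index_topk * itemsize


# Capping a V3.2 server at 8192 total tokens (illustrative value):
# 8192 * 61 * 2048 * 4 = 4,093,640,704 bytes, about 3.8 GiB of host cache.
v32_buffer = host_buffer_bytes(8192, 61, 2048)
```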

@hnyls2002
Collaborator Author

/rerun-test test_return_indexer_topk.py

@github-actions
Contributor

github-actions Bot commented May 5, 2026

8-gpu-h200 (1 test): View workflow run

cd test/ && python3 registered/8-gpu-models/test_return_indexer_topk.py

@hnyls2002
Collaborator Author

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label May 5, 2026
@hnyls2002 hnyls2002 merged commit 47a416f into main May 5, 2026
90 of 113 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/indexer-topk-infra branch May 5, 2026 22:05