add indexer-topk capture (V3.2 NSA + infra) #24392
Conversation
Code Review
This pull request implements a framework for capturing and returning top-k indices from indexer layers, including new base classes for device and host caching and integration throughout the request lifecycle. Key feedback includes correcting a redundant method override in `IndexerTopkCapturer` that bypasses necessary data parallelism logic, fixing an off-by-one error and redundant tensor cloning during sequence indexing, and resolving a type hint inconsistency where the output was incorrectly typed as a list of integers instead of a base64-encoded string. Additionally, recommendations were made to improve code robustness and idiomaticity by replacing assertions with explicit exceptions, substituting magic numbers with named constants, and using native PyTorch methods for tensor metadata.
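As an illustration of those last recommendations, a minimal sketch of the suggested patterns (the function, constant, and checks here are hypothetical, not code from the PR):

```python
import torch

INDEX_TOPK = 2048  # named constant instead of a magic number


def validate_topk_indices(indices: torch.Tensor) -> None:
    # Explicit exceptions instead of bare asserts, so the checks
    # survive `python -O` and produce actionable error messages.
    if indices.dtype != torch.int32:
        raise TypeError(f"expected int32 indices, got {indices.dtype}")
    # Native tensor metadata (`shape`) instead of hand-derived sizes.
    if indices.shape[-1] != INDEX_TOPK:
        raise ValueError(
            f"expected last dim {INDEX_TOPK}, got {indices.shape[-1]}"
        )
```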
LGTM since this looks like a naive copy-paste of my code and introduces no risk. EDIT: I hear there is some cleanup, which is reasonable. I originally wanted to do abstractions but heard @ocss884 was doing a refactor on main, so I implemented a naive version to avoid us both abstracting separately and getting conflicts.
Force-pushed from ba80996 to 655411f
Hi, have you tried end-to-end runs? I previously implemented something similar in #16881 but gave up, since this information is larger than the KV cache, so capturing and returning it via the endpoint is not practical. For V3.2, 61 layers × 2048 × 4 bytes ≈ 488 KB per token. For V4, 30 layers × 1024 × 4 bytes = 120 KB per compressed token.
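A quick back-of-the-envelope check of the payload sizes quoted above (a minimal sketch; the layer counts, top-k widths, and int32 dtype are taken from the comment):

```python
# Per-token payload of captured top-k indices; int32 = 4 bytes per index.
v32_bytes = 61 * 2048 * 4  # DeepSeek V3.2: 61 indexer layers, top-k 2048
v4_bytes = 30 * 1024 * 4   # V4: 30 layers, top-k 1024

print(f"V3.2: {v32_bytes / 1024:.0f} KiB per token")            # ~488 KiB
print(f"V4:   {v4_bytes / 1024:.0f} KiB per compressed token")  # 120 KiB
```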
# Conflicts:
#   python/sglang/srt/hardware_backend/npu/moe/topk.py
#   python/sglang/srt/layers/moe/routed_experts_capturer.py
#   python/sglang/srt/layers/moe/topk.py
#   python/sglang/srt/layers/topk_capturer_base.py
#   python/sglang/srt/managers/scheduler_output_processor_mixin.py
#   python/sglang/srt/managers/utils.py
#   python/sglang/srt/model_executor/model_runner.py
/rerun-test test_return_indexer_topk.py test_return_routed_experts.py test_deepseek_v32_indexcache.py
✅ ✅ |
@zianglih Thanks for this feedback. Yeah, the current implementation is limited by the large amount of information in the host cache. The current PR unblocks the dsv4 rebase, and you can use …
/rerun-test test_return_indexer_topk.py |
✅ |
/tag-and-rerun-ci |
Stacked on #24403. Adds the `IndexerTopkCapturer` (built on `BaseTopkCapturer` from #24403) and wires V3.2 NSA models as the first producer.

**API**
- `--enable-return-indexer-topk` (default off)
- `return_indexer_topk: bool` on `GenerateReqInput`
- `meta_info["indexer_topk"]` is a base64-encoded int32 tensor of shape `(seqlen, num_indexer_layers, index_topk)` (see the decoding sketch at the end of this description)

**Activation gating** — `model_config.get_num_indexer_layers(hf_text_config)`:
- `num_hidden_layers` (one indexer per transformer layer)
- `num_indexer_layers` directly off `hf_text_config`, default 0
- if `None`, the flag is a logged no-op

**Producer wiring (CUDA)**
- `Indexer.forward_cuda` (NSA): `_maybe_capture_topk` at every return point.
- `forward_mla.py` `skip_topk` reuse paths: explicit capture call so the reused layer's slot reflects the indices actually in use.

**Stack**
- `lsyin/routed-experts-cleanup` (consolidate routed-experts capturer onto reusable base #24403)
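For reference, a minimal client-side sketch of decoding `meta_info["indexer_topk"]` as documented above, assuming the payload is the raw int32 buffer base64-encoded in row-major order (the helper name and the example layer/top-k values are illustrative):

```python
import base64

import numpy as np


def decode_indexer_topk(
    b64: str, seqlen: int, num_indexer_layers: int, index_topk: int
) -> np.ndarray:
    """Decode the base64 payload into a (seqlen, num_indexer_layers, index_topk) array."""
    buf = base64.b64decode(b64)
    return np.frombuffer(buf, dtype=np.int32).reshape(
        seqlen, num_indexer_layers, index_topk
    )


# e.g. for V3.2 with 61 indexer layers and index_topk=2048:
# topk = decode_indexer_topk(resp["meta_info"]["indexer_topk"], seqlen, 61, 2048)
```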