[None][feat] Update the indexer topK #9255
base: main
Conversation
Signed-off-by: Christina Zhang <[email protected]>
📝 Walkthrough

The changes rename the top-K indexing kernels and PyTorch operators to `indexer_topk_prefill` / `indexer_topk_decode` and turn the previously hardcoded top-K constant into a runtime `topK` parameter (default 2048) that flows from the Python callers through the Torch ops and host wrappers into the kernels.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Host
    participant HeaderAPI as IndexerTopK.h<br/>(Public API)
    participant Kernel as indexerTopK.cu<br/>(Device Code)
    participant PyTorch as PyTorch Op<br/>(IndexerTopKOp.cpp)
    participant Python as Python Caller<br/>(dsa.py)
    rect rgb(220, 240, 255)
        Note over Host,Python: Runtime topK Parameter Flow (New)
        Python->>PyTorch: indexer_topk_decode(logits,<br/>seq_lens, indices, topK)
        PyTorch->>HeaderAPI: invokeIndexerTopKDecode(...,<br/>topK=runtime_value, stream)
        HeaderAPI->>Kernel: Launch with topK param +<br/>shared_memory=topK*sizeof(int32_t)
        Kernel->>Kernel: Use topK at runtime<br/>for SMEM indexing & logic
        Kernel-->>Kernel: Flat SMEM layout:<br/>smemOutput[topK]
    end
    rect rgb(240, 220, 255)
        Note over Host,Python: Parameter Name Changes
        Note over HeaderAPI: indices, outLogitsAux,<br/>outIndicesAux (renamed from outIndices)
    end
```
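To make the new flow concrete, here is a minimal Python sketch of calling the renamed ops with the runtime top-K parameter. It assumes a built `tensorrt_llm` wheel and a CUDA device; the shapes and dtypes are illustrative, and only the operator names and argument orders are taken from this PR.

```python
import torch
import tensorrt_llm  # noqa: F401  (importing registers the trtllm custom ops)

num_rows, num_columns, index_topk = 4, 8192, 2048

logits = torch.randn(num_rows, num_columns, device="cuda", dtype=torch.float32)
seq_lens = torch.full((num_rows,), num_columns, device="cuda", dtype=torch.int32)
indices = torch.empty(num_rows, index_topk, device="cuda", dtype=torch.int32)

# Decode: next_n is explicit, index_topk falls back to the schema default (2048).
torch.ops.trtllm.indexer_topk_decode(logits, seq_lens, indices, 1)  # next_n = 1

# Prefill: per-row [row_starts, row_ends) windows, same defaulted index_topk.
row_starts = torch.zeros(num_rows, device="cuda", dtype=torch.int32)
row_ends = seq_lens.clone()
torch.ops.trtllm.indexer_topk_prefill(logits, row_starts, row_ends, indices)
```

Both ops are in-place: they fill `indices` and return nothing, which is why the fake registrations discussed below are void as well.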
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
186-195: Fake op signatures must declare defaults to match the C++ operator schema

The fake registrations for `indexer_topk_prefill` and `indexer_topk_decode` currently require `index_topk` as a positional parameter, but the C++ schema declares it with a default value (`int index_topk=2048`). The FakeTensor implementation must have the same signature as the operator, including defaults.

Without the default in the Python signature, calls that omit `index_topk` (relying on the C++ default) will pass fewer arguments than the fake function expects, causing a `TypeError` when these ops run under FakeTensor/meta or `torch.compile`.

Update the fake signatures to declare the default:

```python
@torch.library.register_fake("trtllm::indexer_topk_prefill")
def _(logits, row_starts, row_ends, indices, index_topk: int = 2048):
    # In-place operation, no return value (void function)
    pass


@torch.library.register_fake("trtllm::indexer_topk_decode")
def _(logits, seq_lens, indices, next_n, index_topk: int = 2048):
    # In-place operation, no return value (void function)
    pass
```
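To see why the defaults matter under FakeTensor tracing, here is a hedged sketch with a toy op; the `demo::inplace_topk` name and its CPU implementation are hypothetical and only illustrate the "fake signature must mirror the schema, defaults included" rule. It assumes a PyTorch version where `torch.library.define` and `torch.library.register_fake` are available (>= 2.4).

```python
import torch

# Hypothetical op with the same shape of schema: in-place output, defaulted int.
torch.library.define(
    "demo::inplace_topk",
    "(Tensor logits, Tensor(a!) indices, int index_topk=2048) -> ()")

@torch.library.impl("demo::inplace_topk", "CPU")
def _impl(logits, indices, index_topk=2048):
    # Eager reference: write the top-k column ids of each row into `indices`.
    k = min(index_topk, logits.shape[-1])
    indices[:, :k] = torch.topk(logits, k, dim=-1).indices.to(indices.dtype)

@torch.library.register_fake("demo::inplace_topk")
def _fake(logits, indices, index_topk: int = 2048):
    # Same signature as the schema, including the default, so calls that omit
    # index_topk still match under FakeTensor/meta tracing.
    pass

logits = torch.randn(4, 4096)
indices = torch.zeros(4, 2048, dtype=torch.int32)
torch.ops.demo.inplace_topk(logits, indices)  # relies on the schema default
```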
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/indexerTopK.cu (1)
643-683: Host wrappers correctly propagate runtime `topK` (minor redundancy only)

`invokeIndexerTopKDecode` / `invokeIndexerTopKPrefill` now accept a runtime `topK` and use it consistently in shared‑memory sizing and kernel arguments. The extra `splitWorkThreshold` parameter is now redundant with the internal `kSplitWorkThreshold` constant, but that's a minor cleanup opportunity, not a blocker.

You could either drop the `splitWorkThreshold` parameter from the signature or use it instead of the local constexpr to avoid confusion.

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)
1062-1068: Nice guard for non‑2048 `index_topk`, but consider tightening the warning condition

The new logic:

```python
use_custom_topk = use_custom_topk and self.index_topk == 2048
if self.index_topk != 2048:
    logger.warning_once(
        f"Using custom topk indexer with index_topk={self.index_topk} is not supported. "
        f"Please use index_topk=2048 instead.",
        key="indexer_topk_not_2048_warning")
```

correctly prevents the CUDA top‑k kernel from being used unless `index_topk == 2048`, avoiding the hard C++ check.

However, this warning will also trigger when callers explicitly pass `use_custom_topk=False`, even though they are not asking for the custom kernel. To reduce noise, you could gate the warning on the original intent, e.g.:

```diff
-        use_custom_topk = use_custom_topk and self.index_topk == 2048
-        if self.index_topk != 2048:
+        if use_custom_topk and self.index_topk != 2048:
+            use_custom_topk = False
             logger.warning_once(
                 f"Using custom topk indexer with index_topk={self.index_topk} is not supported. "
                 f"Please use index_topk=2048 instead.",
                 key="indexer_topk_not_2048_warning")
+        else:
+            use_custom_topk = use_custom_topk and self.index_topk == 2048
```

Functionally you're already safe; this is just a usability tweak for callers that deliberately opt out of the custom kernel.
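For illustration, the gating difference can be reduced to a standalone sketch (plain `warnings` instead of TensorRT-LLM's `logger.warning_once`, and hypothetical helper names):

```python
import warnings

def select_topk_path(use_custom_topk: bool, index_topk: int) -> bool:
    # Current form: warns whenever index_topk != 2048, regardless of caller intent.
    if index_topk != 2048:
        warnings.warn("custom topk indexer only supports index_topk=2048")
    return use_custom_topk and index_topk == 2048

def select_topk_path_gated(use_custom_topk: bool, index_topk: int) -> bool:
    # Suggested form: warn only when the custom kernel was actually requested.
    if use_custom_topk and index_topk != 2048:
        warnings.warn("custom topk indexer only supports index_topk=2048")
        return False
    return use_custom_topk and index_topk == 2048

select_topk_path(False, 1024)        # warns even though the caller opted out
select_topk_path_gated(False, 1024)  # silent, as intended
```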
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cpp/tensorrt_llm/kernels/IndexerTopK.h (1 hunks)
- cpp/tensorrt_llm/kernels/indexerTopK.cu (16 hunks)
- cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (3 hunks)
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py (4 hunks)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (2 hunks)
- tests/unittest/_torch/thop/parallel/test_indexer_topk.py (3 hunks)
🧰 Additional context used
🧠 Learnings (25)
📓 Common learnings
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
cpp/tensorrt_llm/kernels/IndexerTopK.h
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-10-17T13:21:31.724Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-28T10:21:46.652Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1068-1085
Timestamp: 2025-08-28T10:21:46.652Z
Learning: torch.index_select works with int32 indices in practice despite documentation stating LongTensor requirement. In TensorRT-LLM codebase, int32 indices are used intentionally and work correctly.
Applied to files:
tests/unittest/_torch/thop/parallel/test_indexer_topk.py
cpp/tensorrt_llm/kernels/IndexerTopK.h
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
cpp/tensorrt_llm/kernels/IndexerTopK.h
cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
📚 Learning: 2025-11-07T09:18:04.997Z
Learnt from: Funatiq
Repo: NVIDIA/TensorRT-LLM PR: 8587
File: tensorrt_llm/_torch/pyexecutor/llm_request.py:129-139
Timestamp: 2025-11-07T09:18:04.997Z
Learning: In `LogitsStorage.get()` method in `tensorrt_llm/_torch/pyexecutor/llm_request.py`, when `exclude_last=True`, there is an invariant that at least 2 chunks must have been appended to `_logits_indices`. The parameter is designed to drop the entire last chunk (not just the last token), which is expected behavior for the overlap scheduler that generates one extra token in a separate chunk.
Applied to files:
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
cpp/tensorrt_llm/kernels/indexerTopK.cu
🧬 Code graph analysis (5)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (3)
_ (476-512), _ (775-798), _ (1917-1964)
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (1)
cpp/tensorrt_llm/kernels/indexerTopK.cu (2)
invokeIndexerTopKDecode (643-679), invokeIndexerTopKDecode (643-645)
tests/unittest/_torch/thop/parallel/test_indexer_topk.py (1)
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (4)
indexer_topk_decode (37-85), indexer_topk_decode (37-38), indexer_topk_prefill (87-122), indexer_topk_prefill (87-88)
cpp/tensorrt_llm/kernels/IndexerTopK.h (1)
cpp/tensorrt_llm/kernels/indexerTopK.cu (4)
invokeIndexerTopKDecode (643-679), invokeIndexerTopKDecode (643-645), invokeIndexerTopKPrefill (681-702), invokeIndexerTopKPrefill (681-683)
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (2)
tensorrt_llm/logger.py (1)
warning_once (135-136)
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (4)
indexer_topk_prefill (87-122), indexer_topk_prefill (87-88), indexer_topk_decode (37-85), indexer_topk_decode (37-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (14)
cpp/tensorrt_llm/kernels/indexerTopK.cu (3)
32-60: Bit extraction and pattern matching logic looks consistent with the new 4‑step scheme

The reworked `extractBinIdx` and `isPartialMatch` implementation (half at step 0, float bits for steps 1–3, with step‑dependent `patternShift`) is internally consistent with the "half‑11‑11‑10" iteration comment and how `logitPattern` is built and consumed in later steps. No functional issues spotted here.
153-351: Runtime `topK` threading in histogram step is sound

Using the runtime `topK` in `processHistogramStep` for threshold detection and final‑bin handling (Lines 231–253, 285–313) preserves the earlier behavior while decoupling it from a hardcoded constant. Combined with the early `rowLen <= topK` shortcut, this keeps prefix sums well‑defined and avoids overfilling. Looks correct.
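For readers new to this selection scheme, the sketch below is a rough CPU illustration of the general histogram/threshold-bin idea the comment refers to: count per bin, suffix-sum the counts to find the threshold bin, keep everything above it, and fill the remainder from the threshold bin itself. It is a conceptual sketch only, not the kernel's bit-exact four-step scheme, and the 256-bin choice is arbitrary.

```python
import torch

def histogram_topk_reference(row: torch.Tensor, top_k: int, num_bins: int = 256):
    """Return indices of the top_k largest entries of a 1-D row."""
    if row.numel() <= top_k:                 # mirrors the early rowLen <= topK shortcut
        return torch.arange(row.numel())
    edges = torch.linspace(float(row.min()), float(row.max()), num_bins + 1)
    bins = torch.bucketize(row, edges).clamp(1, num_bins) - 1       # 0 .. num_bins-1
    counts = torch.bincount(bins, minlength=num_bins)
    # Suffix sum: how many elements sit in this bin or any higher-valued bin.
    suffix = torch.flip(torch.cumsum(torch.flip(counts, [0]), 0), [0])
    threshold_bin = int((suffix >= top_k).nonzero().max())
    keep = (bins > threshold_bin).nonzero().flatten()               # strictly above threshold
    remaining = top_k - keep.numel()                                # filled from the threshold bin
    in_threshold = (bins == threshold_bin).nonzero().flatten()
    partial = in_threshold[torch.topk(row[in_threshold], remaining).indices]
    return torch.cat([keep, partial])

idx = histogram_topk_reference(torch.randn(8192), top_k=2048)
assert idx.numel() == 2048 and idx.unique().numel() == 2048
```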
353-576: Shared‑memory layout for indices/logits matches dynamic allocation

The switch to a flat `extern __shared__ int32_t smemOutput[]` plus `reinterpret_cast<float*>(smemOutput + topK)` when `multipleBlocksPerRow` is `true` is consistent with the launches:

- non‑chunked decode & prefill: `topK * sizeof(int32_t)` and no float view
- chunked decode: `2 * topK * sizeof(int32_t)` and both indices/logits stored

The final writeback logic for both single‑block and multi‑block cases is aligned with this layout. No correctness issues found.
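As a back-of-envelope check of those launch sizes (the 4-byte element sizes and the 2048 default come from this PR; the rest is plain arithmetic):

```python
INT32_BYTES = FLOAT_BYTES = 4
top_k = 2048

# Non-chunked decode & prefill: only the int32 index slots are staged in SMEM.
single_pass_smem = top_k * INT32_BYTES                            # 8 KiB

# Chunked decode: float logits are stored right after the topK int32 slots,
# i.e. the float view starts at byte offset top_k * INT32_BYTES.
chunked_decode_smem = top_k * INT32_BYTES + top_k * FLOAT_BYTES   # 16 KiB

assert single_pass_smem == 8 * 1024 and chunked_decode_smem == 16 * 1024
```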
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (2)
1333-1348: Docstrings now correctly reference the renamed decode kernel

The updated description in `test_indexer_decode_custom_vs_fallback` matches the new op name (`indexer_topk_decode`). This keeps the test documentation aligned with the public API rename.

1553-1562: Prefill docstring aligned with `indexer_topk_prefill` rename

The chunked prefill test docstring now mentions `indexer_topk_prefill`, consistent with the C++/Torch bindings and other tests.

tests/unittest/_torch/thop/parallel/test_indexer_topk.py (3)
4-6: Import comment accurately reflects the renamed CUDA ops

The updated comment on Line 4 now mentions `indexer_topk_decode` and `indexer_topk_prefill`, matching the new operator names and helping future readers understand why `tensorrt_llm` is imported here.
181-183: Decode test correctly targets the new `indexer_topk_decode` op

`test_indexer_topk_decode` now calls:

```python
torch.ops.trtllm.indexer_topk_decode(logits, seq_lens, indices, next_n)
```

which matches the new schema (`next_n` explicit, `index_topk` defaulted to 2048). The rest of the test logic (masking against `row_ends` and comparing to `torch.topk`) remains valid.
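For context, the reference check described there can be sketched as follows; this is an illustrative recreation, not the actual code in `test_indexer_topk.py`, and the shapes are arbitrary.

```python
import torch

def reference_topk_indices(logits: torch.Tensor, row_ends: torch.Tensor, k: int):
    # Mask out columns at or beyond each row's valid end before taking top-k.
    cols = torch.arange(logits.shape[1], device=logits.device)
    masked = logits.masked_fill(cols[None, :] >= row_ends[:, None], float("-inf"))
    return torch.topk(masked, k, dim=-1).indices

logits = torch.randn(4, 4096)
row_ends = torch.tensor([4096, 3000, 2500, 2048])
ref = reference_topk_indices(logits, row_ends, k=2048)

# The custom op writes into `indices` in place; comparing sorted index sets
# per row avoids ordering/tie differences:
# assert torch.equal(ref.sort(dim=-1).values, indices.sort(dim=-1).values)
```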
217-218: Prefill test updated to use `indexer_topk_prefill` with default top‑k

`test_indexer_topk_prefill` now calls:

```python
torch.ops.trtllm.indexer_topk_prefill(logits, row_starts, row_ends, indices)
```

which is consistent with the new op name and relies on the default `index_topk=2048` as intended. The reference comparison to `torch.topk` is unchanged.

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (3)
1099-1103: Chunked prefill path now calls `indexer_topk_prefill` with correct arguments

In the chunked prefill branch, the call:

```python
torch.ops.trtllm.indexer_topk_prefill(
    logits, chunk.cu_seqlen_ks, chunk.cu_seqlen_ke,
    topk_indices_buffer[chunk.token_start:chunk.token_end, :])
```

matches the C++ schema (`logits`, `row_starts`, `row_ends`, `indices`, with `index_topk` defaulted). This keeps the CUDA fast path aligned with the new op name and signature.

1135-1138: Single‑pass prefill fallback uses the renamed `indexer_topk_prefill` op

The non‑chunked prefill path now consistently uses `indexer_topk_prefill` as well, preserving the previous behavior while matching the new operator naming.

1200-1207: Decode path correctly switches to `indexer_topk_decode`

The decode branch's call:

```python
torch.ops.trtllm.indexer_topk_decode(
    logits_decode, gen_kv_lens_cuda,
    topk_indices_buffer[num_ctx_tokens:num_ctx_tokens + num_gen_tokens, :],
    next_n)
```

uses `kv_lens` (total cache length) and `next_n` exactly as required by the C++ binding, with `index_topk` defaulted to 2048. Together with the `use_custom_topk` guard, this keeps the DSA decode path robust.

cpp/tensorrt_llm/kernels/IndexerTopK.h (1)
27-33: Header signatures aligned with kernel implementations and runtime `topK`

The updated declarations for `invokeIndexerTopKDecode` and `invokeIndexerTopKPrefill` (including aux buffers and the defaulted `topK = 2048`) match the definitions in `indexerTopK.cu`. This keeps the public kernel interface coherent after the internal refactor.

cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (2)
37-85: Decode wrapper and aux buffer configuration match the updated kernel contract

`torch_ext::indexer_topk_decode` now:

- Enforces the usual shape/contiguity checks and `index_topk == 2048`.
- Defines `multipleBlocksPerRowConfig = 10` and, when `num_columns >= splitWorkThreshold`, allocates `aux_indices`/`aux_logits` as `{num_rows, multipleBlocksPerRowConfig, index_topk}`.

The call into `tk::invokeIndexerTopKDecode` passes `indices`, `aux_logits`, `aux_indices`, and the runtime `index_topk` exactly as expected by the CUDA implementation. This wiring looks correct.
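A small Python sketch of that allocation rule (the `{num_rows, multipleBlocksPerRowConfig, index_topk}` shape and the factor of 10 come from the review; the threshold value and dtypes here are placeholders):

```python
import torch

def make_aux_buffers(num_rows: int, num_columns: int, index_topk: int = 2048,
                     split_work_threshold: int = 8192,      # placeholder value
                     multiple_blocks_per_row_config: int = 10):
    if num_columns < split_work_threshold:
        return None, None  # single-block path needs no aux scratch
    shape = (num_rows, multiple_blocks_per_row_config, index_topk)
    aux_indices = torch.empty(shape, dtype=torch.int32)
    aux_logits = torch.empty(shape, dtype=torch.float32)
    return aux_indices, aux_logits

aux_idx, aux_log = make_aux_buffers(num_rows=4, num_columns=65536)
assert aux_idx.shape == (4, 10, 2048)
```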
87-122: Prefill wrapper and Torch op schema are in sync with the rename

`torch_ext::indexer_topk_prefill` calls `tk::invokeIndexerTopKPrefill` with the new signature and passes `index_topk` as `topK`. The TORCH_LIBRARY fragments:

```cpp
m.def("indexer_topk_prefill(Tensor logits, Tensor row_starts, Tensor row_ends, Tensor indices, int index_topk=2048) -> ()");
...
m.impl("indexer_topk_prefill", &torch_ext::indexer_topk_prefill);
```

align with the Python/test usage and with the fake registrations once those are adjusted. No issues here.
Signed-off-by: Christina Zhang <[email protected]>
/bot run
PR_Github #24908 [ run ] triggered by Bot. Commit:
PR_Github #24908 [ run ] completed with state
Summary by CodeRabbit
- Renamed operators: `indexer_topk_decode` and `indexer_topk_prefill`.

Description
Update the indexer topK for Deepseek v3.2
Test Coverage
pytest -v -s tests/unittest/_torch/thop/parallel/test_indexer_topk.py
pytest -v -s tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.