
Conversation

@ChristinaZ (Collaborator) commented Nov 18, 2025

Summary by CodeRabbit

  • Refactor
    • Simplified CUDA operator names by removing the "_op" suffix for a cleaner interface: indexer_topk_decode and indexer_topk_prefill.
    • Standardized parameter naming across indexing functions for improved consistency and clarity.
    • The top-K threshold parameter now supports runtime configuration, with a default value of 2048.

Description

Update the indexer topK implementation for DeepSeek V3.2.

Test Coverage

pytest -v -s tests/unittest/_torch/thop/parallel/test_indexer_topk.py
pytest -v -s tests/unittest/_torch/attention/sparse/test_dsa_indexer.py

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

Signed-off-by: Christina Zhang <[email protected]>
@ChristinaZ ChristinaZ requested review from a team as code owners November 18, 2025 07:10
@ChristinaZ ChristinaZ requested review from brb-nv and liji-nv November 18, 2025 07:10
@coderabbitai coderabbitai bot (Contributor) commented Nov 18, 2025

📝 Walkthrough

The changes rename top-K indexing kernels and PyTorch operators (removing _op suffix), convert compile-time topK template parameters to runtime parameters, refactor CUDA shared memory usage from structured outputs to a flat int32_t array, update bit extraction logic, and propagate parameter changes throughout all calling code.
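
To make the compile-time→runtime change concrete, here is a minimal CUDA sketch of the pattern; the kernel and launcher names are hypothetical and the selection logic is a placeholder, as the real kernels in indexerTopK.cu are far more involved.

#include <cuda_runtime.h>
#include <cstdint>

// Before (conceptually): top-K baked in at compile time.
// template <int kTopK> __global__ void topKKernel(...) { __shared__ int32_t smemOutput[kTopK]; ... }

// After (conceptually): topK is a runtime argument and shared memory is sized at launch time.
__global__ void topKKernelSketch(int32_t* indices, int numColumns, int topK)
{
    extern __shared__ int32_t smemOutput[]; // topK int32 slots, sized by the launcher below
    for (int i = threadIdx.x; i < topK; i += blockDim.x)
    {
        smemOutput[i] = (i < numColumns) ? i : -1; // placeholder for the real selection logic
    }
    __syncthreads();
    for (int i = threadIdx.x; i < topK; i += blockDim.x)
    {
        indices[blockIdx.x * topK + i] = smemOutput[i];
    }
}

void launchTopKSketch(int32_t* indices, int numRows, int numColumns, int topK, cudaStream_t stream)
{
    size_t smemBytes = static_cast<size_t>(topK) * sizeof(int32_t); // runtime-sized dynamic SMEM
    topKKernelSketch<<<numRows, 1024, smemBytes, stream>>>(indices, numColumns, topK);
}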

Changes

  • Header API renaming and parameter updates (cpp/tensorrt_llm/kernels/IndexerTopK.h): Renamed output parameters in invokeIndexerTopKDecode (outIndices→indices; added outLogitsAux, outIndicesAux) and invokeIndexerTopKPrefill (outIndices→indices); renamed the topK parameter from index_topk to topK with a default value of 2048.
  • Kernel implementation refactoring (cpp/tensorrt_llm/kernels/indexerTopK.cu): Replaced the bit-extraction logic in extractBinIdx; converted topK from a compile-time template parameter (kTopK) to a runtime parameter; restructured shared memory from a per-block struct of outputs (indices/logits) to a single flat int32_t array with reinterpret_cast for float access; updated all conditional branches, histogram processing, and step dispatch logic to use the runtime topK; adjusted kernel launch configurations and parameter passing.
  • PyTorch operator renaming and config (cpp/tensorrt_llm/thop/IndexerTopKOp.cpp): Renamed exported ops from indexer_topk_decode_op→indexer_topk_decode and indexer_topk_prefill_op→indexer_topk_prefill; updated Torch library definitions and impl bindings; added multipleBlocksPerRowConfig=10 and expanded the auxiliary tensor shape logic.
  • Python operator call updates (tensorrt_llm/_torch/attention_backend/sparse/dsa.py, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py): Updated CUDA operator references from trtllm.indexer_topk_prefill_op→trtllm.indexer_topk_prefill and trtllm.indexer_topk_decode_op→trtllm.indexer_topk_decode; updated the registered fake op names.
  • Test documentation and calls (tests/unittest/_torch/attention/sparse/test_dsa_indexer.py, tests/unittest/_torch/thop/parallel/test_indexer_topk.py): Updated docstring references and operator call sites from the *_op names to the new names; updated torch.ops.trtllm invocations to use indexer_topk_prefill and indexer_topk_decode.

Sequence Diagram(s)

sequenceDiagram
    actor Host
    participant HeaderAPI as IndexerTopK.h<br/>(Public API)
    participant Kernel as indexerTopK.cu<br/>(Device Code)
    participant PyTorch as PyTorch Op<br/>(IndexerTopKOp.cpp)
    participant Python as Python Caller<br/>(dsa.py)

    rect rgb(220, 240, 255)
    Note over Host,Python: Runtime topK Parameter Flow (New)
    Python->>PyTorch: indexer_topk_decode(logits,<br/>seq_lens, indices, topK)
    PyTorch->>HeaderAPI: invokeIndexerTopKDecode(...,<br/>topK=runtime_value, stream)
    HeaderAPI->>Kernel: Launch with topK param +<br/>shared_memory=topK*sizeof(int32_t)
    Kernel->>Kernel: Use topK at runtime<br/>for SMEM indexing & logic
    Kernel-->>Kernel: Flat SMEM layout:<br/>smemOutput[topK]
    end

    rect rgb(240, 220, 255)
    Note over Host,Python: Parameter Name Changes
    Note over HeaderAPI: indices, outLogitsAux,<br/>outIndicesAux (renamed from outIndices)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • SMEM refactoring in indexerTopK.cu: Verify correctness of conversion from structured SMEM outputs to flat int32_t array with reinterpret_cast usage for float access; ensure alignment and offset calculations are correct across all code paths
  • Runtime topK parameter propagation: Trace topK parameter flow through template instantiation, kernel dispatch, and SMEM allocation to confirm no mismatches between launch-time allocation and kernel usage
  • Bit extraction logic changes: Review extractBinIdx modifications and patternShift computation logic with step-dependent conditionals to ensure pattern extraction remains correct
  • Kernel launch configuration updates: Verify all kernels (topKPerRowJob, topKPerRowPrefill, topKPerRowDecode) correctly receive and use runtime topK parameter instead of compile-time kTopK
  • Broad but consistent changes: While the pattern is systematic across files (parameter renaming + operator renaming), the volume and variety of affected layers (header, kernel, PyTorch bindings, Python callers) increase review scope

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check (⚠️ Warning): The PR description lacks the critical details about the changes and rationale required by the template. Resolution: expand the Description section to explain what indexer topK updates are needed for DeepSeek V3.2, why these changes are necessary, and what specific modifications were made to the API and implementation.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title clearly identifies the main change (updating the indexer topK functionality), uses the correct format with the [None][feat] prefix, and is specific to the primary refactoring effort.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

186-195: Fake op signatures must declare defaults to match the C++ operator schema

The fake registrations for indexer_topk_prefill and indexer_topk_decode currently require index_topk as a positional parameter, but the C++ schema declares it with a default value (int index_topk=2048). The FakeTensor implementation must have the same signature as the operator, including defaults.

Without the default in the Python signature, calls that omit index_topk (relying on the C++ default) will pass fewer arguments than the fake function expects, causing a TypeError when these ops run under FakeTensor/meta or torch.compile.

Update the fake signatures to declare the default:

@torch.library.register_fake("trtllm::indexer_topk_prefill")
def _(logits, row_starts, row_ends, indices, index_topk: int = 2048):
    # In-place operation, no return value (void function)
    pass

@torch.library.register_fake("trtllm::indexer_topk_decode")
def _(logits, seq_lens, indices, next_n, index_topk: int = 2048):
    # In-place operation, no return value (void function)
    pass
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/indexerTopK.cu (1)

643-683: Host wrappers correctly propagate runtime topK (minor redundancy only)

invokeIndexerTopKDecode/invokeIndexerTopKPrefill now accept a runtime topK and use it consistently in shared‑memory sizing and kernel arguments. The extra splitWorkThreshold parameter is now redundant with the internal kSplitWorkThreshold constant, but that’s a minor cleanup opportunity, not a blocker.

You could either drop the splitWorkThreshold parameter from the signature or use it instead of the local constexpr to avoid confusion.

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)

1062-1068: Nice guard for non‑2048 index_topk, but consider tightening the warning condition

The new logic:

use_custom_topk = use_custom_topk and self.index_topk == 2048
if self.index_topk != 2048:
    logger.warning_once(
        f"Using custom topk indexer with index_topk={self.index_topk} is not supported. "
        f"Please use index_topk=2048 instead.",
        key="indexer_topk_not_2048_warning")

correctly prevents the CUDA top‑k kernel from being used unless index_topk == 2048, avoiding the hard C++ check.

However, this warning will also trigger when callers explicitly pass use_custom_topk=False, even though they are not asking for the custom kernel. To reduce noise, you could gate the warning on the original intent, e.g.:

-    use_custom_topk = use_custom_topk and self.index_topk == 2048
-    if self.index_topk != 2048:
+    if use_custom_topk and self.index_topk != 2048:
+        use_custom_topk = False
         logger.warning_once(
             f"Using custom topk indexer with index_topk={self.index_topk} is not supported. "
             f"Please use index_topk=2048 instead.",
             key="indexer_topk_not_2048_warning")
+    else:
+        use_custom_topk = use_custom_topk and self.index_topk == 2048

Functionally you’re already safe; this is just a usability tweak for callers that deliberately opt out of the custom kernel.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96cfdd8 and b301bbe.

📒 Files selected for processing (7)
  • cpp/tensorrt_llm/kernels/IndexerTopK.h (1 hunks)
  • cpp/tensorrt_llm/kernels/indexerTopK.cu (16 hunks)
  • cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (3 hunks)
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py (4 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (2 hunks)
  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py (3 hunks)
🧰 Additional context used
🧠 Learnings (25)
📓 Common learnings
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
  • cpp/tensorrt_llm/kernels/IndexerTopK.h
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
  • cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • cpp/tensorrt_llm/thop/IndexerTopKOp.cpp
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-10-17T13:21:31.724Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 8398
File: tensorrt_llm/_torch/pyexecutor/sampling_utils.py:237-272
Timestamp: 2025-10-17T13:21:31.724Z
Learning: The setup.py file in TensorRT-LLM explicitly requires Python 3.10+ via `python_requires=">=3.10, <4"`, making match/case statements and other Python 3.10+ features appropriate throughout the codebase.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
📚 Learning: 2025-08-28T10:21:46.652Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1068-1085
Timestamp: 2025-08-28T10:21:46.652Z
Learning: torch.index_select works with int32 indices in practice despite documentation stating LongTensor requirement. In TensorRT-LLM codebase, int32 indices are used intentionally and work correctly.

Applied to files:

  • tests/unittest/_torch/thop/parallel/test_indexer_topk.py
  • cpp/tensorrt_llm/kernels/IndexerTopK.h
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • cpp/tensorrt_llm/kernels/IndexerTopK.h
  • cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
📚 Learning: 2025-11-07T09:18:04.997Z
Learnt from: Funatiq
Repo: NVIDIA/TensorRT-LLM PR: 8587
File: tensorrt_llm/_torch/pyexecutor/llm_request.py:129-139
Timestamp: 2025-11-07T09:18:04.997Z
Learning: In `LogitsStorage.get()` method in `tensorrt_llm/_torch/pyexecutor/llm_request.py`, when `exclude_last=True`, there is an invariant that at least 2 chunks must have been appended to `_logits_indices`. The parameter is designed to drop the entire last chunk (not just the last token), which is expected behavior for the overlap scheduler that generates one extra token in a separate chunk.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/kernels/indexerTopK.cu
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • cpp/tensorrt_llm/kernels/indexerTopK.cu
🧬 Code graph analysis (5)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (3)
  • _ (476-512)
  • _ (775-798)
  • _ (1917-1964)
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (1)
cpp/tensorrt_llm/kernels/indexerTopK.cu (2)
  • invokeIndexerTopKDecode (643-679)
  • invokeIndexerTopKDecode (643-645)
tests/unittest/_torch/thop/parallel/test_indexer_topk.py (1)
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (4)
  • indexer_topk_decode (37-85)
  • indexer_topk_decode (37-38)
  • indexer_topk_prefill (87-122)
  • indexer_topk_prefill (87-88)
cpp/tensorrt_llm/kernels/IndexerTopK.h (1)
cpp/tensorrt_llm/kernels/indexerTopK.cu (4)
  • invokeIndexerTopKDecode (643-679)
  • invokeIndexerTopKDecode (643-645)
  • invokeIndexerTopKPrefill (681-702)
  • invokeIndexerTopKPrefill (681-683)
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (2)
tensorrt_llm/logger.py (1)
  • warning_once (135-136)
cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (4)
  • indexer_topk_prefill (87-122)
  • indexer_topk_prefill (87-88)
  • indexer_topk_decode (37-85)
  • indexer_topk_decode (37-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
cpp/tensorrt_llm/kernels/indexerTopK.cu (3)

32-60: Bit extraction and pattern matching logic looks consistent with the new 4‑step scheme

The reworked extractBinIdx and isPartialMatch implementation (half at step 0, float bits for steps 1–3, with step‑dependent patternShift) is internally consistent with the “half‑11‑11‑10” iteration comment and how logitPattern is built and consumed in later steps. No functional issues spotted here.
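
As a rough illustration of the radix-style scheme referenced here (step 0 bucketing on the half representation, later steps slicing the float bits 11-11-10), the sketch below shows one common way such a bin index can be derived; the helper names, exact shifts, and bin counts are assumptions for illustration and are not taken from indexerTopK.cu.

#include <cuda_fp16.h>
#include <cstdint>

// Map float bits to an unsigned key whose integer order matches the float order
// (standard radix-select trick; assumed here, not copied from the kernel).
__device__ inline uint32_t orderPreservingBits(float x)
{
    uint32_t b = __float_as_uint(x);
    return (b & 0x80000000u) ? ~b : (b | 0x80000000u);
}

__device__ inline uint32_t extractBinIdxSketch(float logit, int step)
{
    if (step == 0)
    {
        // Coarse pass on the 16-bit half representation of the logit.
        uint16_t h = __half_as_ushort(__float2half(logit));
        return (h & 0x8000u) ? static_cast<uint16_t>(~h) : static_cast<uint16_t>(h | 0x8000u);
    }
    // Refinement passes slice the 32 float bits as 11 + 11 + 10 (steps 1, 2, 3).
    uint32_t bits = orderPreservingBits(logit);
    int const numBits = (step == 3) ? 10 : 11;
    int const patternShift = (step == 1) ? 21 : (step == 2) ? 10 : 0;
    return (bits >> patternShift) & ((1u << numBits) - 1u);
}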


153-351: Runtime topK threading in histogram step is sound

Using the runtime topK in processHistogramStep for threshold detection and final‑bin handling (Lines 231–253, 285–313) preserves the earlier behavior while decoupling it from a hardcoded constant. Combined with the early rowLen <= topK shortcut, this keeps prefix sums well‑defined and avoids overfilling. Looks correct.
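
For readers unfamiliar with the pattern, the following serial sketch shows what threshold detection and final-bin handling against a runtime topK typically look like; the names and the sequential loop are illustrative assumptions and do not mirror the parallel prefix-sum implementation in processHistogramStep.

#include <cstdint>

// Walk the histogram from the bins holding the largest logits downward and find
// the bin where the running count first reaches topK; remainingK says how many
// elements of that final (threshold) bin still need to be kept.
__host__ __device__ inline void findThresholdBinSketch(
    int const* histogram, int numBins, int rowLen, int topK, int& thresholdBin, int& remainingK)
{
    if (rowLen <= topK)
    {
        // Early shortcut: every element of the row survives, no threshold needed.
        thresholdBin = -1;
        remainingK = rowLen;
        return;
    }
    int prefix = 0;
    for (int bin = numBins - 1; bin >= 0; --bin) // higher bin index == larger logits (assumed)
    {
        if (prefix + histogram[bin] >= topK)
        {
            thresholdBin = bin;         // bins above this one are kept in full
            remainingK = topK - prefix; // partial take from the final bin
            return;
        }
        prefix += histogram[bin];
    }
    thresholdBin = 0;
    remainingK = topK - prefix;
}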


353-576: Shared‑memory layout for indices/logits matches dynamic allocation

The switch to a flat extern __shared__ int32_t smemOutput[] plus reinterpret_cast<float*>(smemOutput + topK) when multipleBlocksPerRow is true is consistent with the launches:

  • non‑chunked decode & prefill: topK * sizeof(int32_t) and no float view
  • chunked decode: 2 * topK * sizeof(int32_t) and both indices/logits stored

The final writeback logic for both single‑block and multi‑block cases is aligned with this layout. No correctness issues found.
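
A minimal sketch of the layout and sizing described above; the kernel and launcher names are hypothetical, and only the flat int32_t array, the reinterpret_cast float view, and the two shared-memory sizes mirror the review text.

#include <cuda_runtime.h>
#include <cstdint>

__global__ void flatSmemLayoutSketch(int32_t* outIndices, float* outLogits, int topK, bool multipleBlocksPerRow)
{
    extern __shared__ int32_t smemOutput[];           // first topK slots: selected indices
    float* smemLogits = multipleBlocksPerRow
        ? reinterpret_cast<float*>(smemOutput + topK) // next topK slots reused as floats
        : nullptr;                                    // non-chunked launches allocate indices only

    for (int i = threadIdx.x; i < topK; i += blockDim.x)
    {
        smemOutput[i] = i;                            // placeholder candidate index
        if (smemLogits) { smemLogits[i] = 0.0f; }     // placeholder candidate logit
    }
    __syncthreads();

    for (int i = threadIdx.x; i < topK; i += blockDim.x)
    {
        outIndices[blockIdx.x * topK + i] = smemOutput[i];
        if (smemLogits) { outLogits[blockIdx.x * topK + i] = smemLogits[i]; }
    }
}

void launchFlatSmemSketch(int32_t* outIndices, float* outLogits, int numRows, int topK,
    bool multipleBlocksPerRow, cudaStream_t stream)
{
    // Non-chunked decode & prefill: topK * sizeof(int32_t); chunked decode: 2 * topK * sizeof(int32_t).
    size_t smemBytes = (multipleBlocksPerRow ? 2 : 1) * static_cast<size_t>(topK) * sizeof(int32_t);
    flatSmemLayoutSketch<<<numRows, 1024, smemBytes, stream>>>(outIndices, outLogits, topK, multipleBlocksPerRow);
}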

tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (2)

1333-1348: Docstrings now correctly reference the renamed decode kernel

The updated description in test_indexer_decode_custom_vs_fallback matches the new op name (indexer_topk_decode). This keeps the test documentation aligned with the public API rename.


1553-1562: Prefill docstring aligned with indexer_topk_prefill rename

The chunked prefill test docstring now mentions indexer_topk_prefill, consistent with the C++/Torch bindings and other tests.

tests/unittest/_torch/thop/parallel/test_indexer_topk.py (3)

4-6: Import comment accurately reflects the renamed CUDA ops

The updated comment on Line 4 now mentions indexer_topk_decode and indexer_topk_prefill, matching the new operator names and helping future readers understand why tensorrt_llm is imported here.


181-183: Decode test correctly targets the new indexer_topk_decode op

test_indexer_topk_decode now calls:

torch.ops.trtllm.indexer_topk_decode(logits, seq_lens, indices, next_n)

which matches the new schema (next_n explicit, index_topk defaulted to 2048). The rest of the test logic (masking against row_ends and comparing to torch.topk) remains valid.


217-218: Prefill test updated to use indexer_topk_prefill with default top‑k

test_indexer_topk_prefill now calls:

torch.ops.trtllm.indexer_topk_prefill(logits, row_starts, row_ends, indices)

which is consistent with the new op name and relies on the default index_topk=2048 as intended. The reference comparison to torch.topk is unchanged.

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (3)

1099-1103: Chunked prefill path now calls indexer_topk_prefill with correct arguments

In the chunked prefill branch, the call:

torch.ops.trtllm.indexer_topk_prefill(
    logits, chunk.cu_seqlen_ks, chunk.cu_seqlen_ke,
    topk_indices_buffer[chunk.token_start:chunk.token_end, :])

matches the C++ schema (logits, row_starts, row_ends, indices, with index_topk defaulted). This keeps the CUDA fast path aligned with the new op name and signature.


1135-1138: Single‑pass prefill fallback uses the renamed indexer_topk_prefill op

The non‑chunked prefill path now consistently uses indexer_topk_prefill as well, preserving the previous behavior while matching the new operator naming.


1200-1207: Decode path correctly switches to indexer_topk_decode

The decode branch’s call:

torch.ops.trtllm.indexer_topk_decode(
    logits_decode, gen_kv_lens_cuda,
    topk_indices_buffer[num_ctx_tokens:num_ctx_tokens + num_gen_tokens, :], next_n)

uses kv_lens (total cache length) and next_n exactly as required by the C++ binding, with index_topk defaulted to 2048. Together with the use_custom_topk guard, this keeps the DSA decode path robust.

cpp/tensorrt_llm/kernels/IndexerTopK.h (1)

27-33: Header signatures aligned with kernel implementations and runtime topK

The updated declarations for invokeIndexerTopKDecode and invokeIndexerTopKPrefill (including aux buffers and the defaulted topK = 2048) match the definitions in indexerTopK.cu. This keeps the public kernel interface coherent after the internal refactor.

cpp/tensorrt_llm/thop/IndexerTopKOp.cpp (2)

37-85: Decode wrapper and aux buffer configuration match the updated kernel contract

torch_ext::indexer_topk_decode now:

  • Enforces the usual shape/contiguity checks and index_topk == 2048.
  • Defines multipleBlocksPerRowConfig = 10 and, when num_columns >= splitWorkThreshold, allocates
    aux_indices / aux_logits as {num_rows, multipleBlocksPerRowConfig, index_topk}.

The call into tk::invokeIndexerTopKDecode passes indices, aux_logits, aux_indices, and the runtime index_topk exactly as expected by the CUDA implementation. This wiring looks correct.
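
A hedged sketch of how such an auxiliary-buffer allocation typically looks in a Torch extension wrapper; the function name, dtypes, and the undefined-tensor convention are assumptions, and only the {num_rows, multipleBlocksPerRowConfig, index_topk} shape and the num_columns >= splitWorkThreshold condition come from the review text.

#include <torch/extension.h>
#include <cstdint>

void allocateAuxBuffersSketch(torch::Tensor const& logits, int64_t num_rows, int64_t num_columns,
    int64_t index_topk, int64_t splitWorkThreshold, torch::Tensor& aux_indices, torch::Tensor& aux_logits)
{
    int64_t constexpr multipleBlocksPerRowConfig = 10;
    if (num_columns >= splitWorkThreshold)
    {
        // One {multipleBlocksPerRowConfig, index_topk} scratch slab per row for the multi-block path.
        aux_indices = torch::empty({num_rows, multipleBlocksPerRowConfig, index_topk},
            logits.options().dtype(torch::kInt32));
        aux_logits = torch::empty({num_rows, multipleBlocksPerRowConfig, index_topk},
            logits.options().dtype(torch::kFloat32));
    }
    else
    {
        aux_indices = torch::Tensor(); // undefined tensors signal the single-block path (assumption)
        aux_logits = torch::Tensor();
    }
}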


87-122: Prefill wrapper and Torch op schema are in sync with the rename

torch_ext::indexer_topk_prefill calls tk::invokeIndexerTopKPrefill with the new signature and passes index_topk as topK. The TORCH_LIBRARY fragments:

m.def("indexer_topk_prefill(Tensor logits, Tensor row_starts, Tensor row_ends, Tensor indices, int index_topk=2048) -> ()");
...
m.impl("indexer_topk_prefill", &torch_ext::indexer_topk_prefill);

align with the Python/test usage and with the fake registrations once those are adjusted. No issues here.

Signed-off-by: Christina Zhang <[email protected]>
@lfr-0531 (Collaborator) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #24908 [ run ] triggered by Bot. Commit: 0e38f75

@tensorrt-cicd (Collaborator) commented:

PR_Github #24908 [ run ] completed with state SUCCESS. Commit: 0e38f75
/LLM/main/L0_MergeRequest_PR pipeline #18809 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.
