
Conversation

@liji-nv
Collaborator

@liji-nv liji-nv commented Nov 25, 2025

Also applies a different fix with the same effect as #8780 for issues introduced in #7999

Summary by CodeRabbit

  • New Features

    • Added FP8 block scaling GEMM matrix multiplication operation.
  • Improvements

    • KV-cache estimation now constrains token budget by available memory and CUDA graph warmup requirements.
    • Improved warmup flow for CUDA graph captures and torch.compile handling.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
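
For example, a pre-merge run limited to a single test stage with fail-fast disabled can be requested by commenting:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast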

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@liji-nv liji-nv requested review from a team as code owners on November 25, 2025 11:18
@liji-nv liji-nv force-pushed the dev-liji-cherry-pick-conflict-changes branch from e025364 to fca3543 on November 25, 2025 11:20
@liji-nv liji-nv changed the title from [None][cherry-pick] Cherry-Pick Conflict Changes #7999 #8515 to [None][cherrypick] Cherry-Pick Conflict Changes #7999 #8515 on Nov 25, 2025
@liji-nv liji-nv changed the title from [None][cherrypick] Cherry-Pick Conflict Changes #7999 #8515 to [None][fix] Cherry-Pick Conflict Changes #7999 #8515 on Nov 25, 2025
@liji-nv
Collaborator Author

liji-nv commented Nov 25, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #25732 [ run ] triggered by Bot. Commit: fca3543

@liji-nv liji-nv changed the title from [None][fix] Cherry-Pick Conflict Changes #7999 #8515 to [None][fix] Cherry-Pick Conflict Changes for PR 7999 PR 8515 on Nov 25, 2025
@liji-nv liji-nv changed the title from [None][fix] Cherry-Pick Conflict Changes for PR 7999 PR 8515 to [None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 on Nov 25, 2025
@liji-nv liji-nv changed the title from [None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 to [None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 on Nov 25, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 25, 2025

📝 Walkthrough

Walkthrough

This PR introduces a renamed FP8 block scaling GEMM operation with new autotuning support, refactors the warmup mechanism for torch.compile and CUDA graph captures, and improves KV-cache memory estimation to account for CUDA graph warmup overhead.

Changes

Cohort / File(s) Summary
Torch binding rename
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
Renamed exported binding from fp8_block_scaling_gemm to fp8_block_scaling_gemm_impl in TORCH_LIBRARY_FRAGMENT and TORCH_LIBRARY_IMPL sections; underlying C++ function remains unchanged. Fake op registration updated to match.
FP8 block scaling GEMM implementation
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
Renamed fp8_swap_ab_gen_tuning_buckets to deep_gemm_gen_tuning_buckets; introduced Fp8BlockScalingGemmRunner (TunableRunner subclass) with tuning configuration; added get_fp8_block_scaling_gemm_constraint_spec() helper; registered new custom op trtllm::fp8_block_scaling_gemm with autotuner integration and fake implementation.
Warmup flow refactoring
tensorrt_llm/_torch/pyexecutor/model_engine.py
Added _general_warmup() method to consolidate warmup iterations; updated _create_warmup_request() signature to accept num_gen_requests and least_requests parameters; introduced resource-availability checks and safety constraints; enhanced CUDA graph warmup sequencing; modified batch-allocation failure handling to continue instead of early-return.
KV-cache memory estimation
tensorrt_llm/_torch/pyexecutor/_util.py
Updated KV-cache estimation to include CUDA graph warmup block calculation; introduced memory-bound cap (max_num_tokens_in_memory); enforced minimum num_cache_blocks based on batch size; clamped final token count to memory limits.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py: Requires careful review of the new Fp8BlockScalingGemmRunner class, autotuner integration, constraint specification logic, and the interplay between tuning config and custom op implementation.
  • tensorrt_llm/_torch/pyexecutor/model_engine.py: Complex warmup refactoring with multiple control-flow paths, resource constraint checks, and state-management changes across CUDA graph and torch.compile warmup scenarios; verify correctness of parameter mappings in _create_warmup_request() calls.
  • tensorrt_llm/_torch/pyexecutor/_util.py: KV-cache estimation logic changes require verification that clamping logic and memory bounds are mathematically sound and handle edge cases (e.g., zero batch size, insufficient memory).
  • C++ and fake op bindings: Ensure renaming is consistently applied across all entry points and that the underlying implementation dispatch is correct.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is largely incomplete, consisting mainly of the template boilerplate with only a brief mention of the fix without substantive details in the Description, Test Coverage, or PR Checklist sections. Complete the Description section explaining the issues from #7999 and how the fix differs from #8780, list specific test cases in Test Coverage, and mark completed PR Checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title mentions cherry-picking conflicts between PRs #7999 and #8515 but lacks specificity about the actual changes; it is more of a process descriptor than a technical summary. Replace with a more specific technical summary of the main changes, e.g., '[None][fix] Rename FP8 block scaling GEMM torch binding and improve KV-cache estimation' or similar.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

241-291: KV-cache estimation warmup logic looks good; clean up Ruff issues and dead access.

The new logic that:

  • folds CUDA graph warmup requirements into num_cache_blocks,
  • enforces a minimum of batch_size blocks, and
  • caps by free_gpu_memory_fraction * free_mem

is a reasonable, conservative estimate and matches the subsequent use in try_prepare_estimation.

Two minor issues from static analysis are worth fixing:

  • Line 281: total_mem from torch.cuda.mem_get_info() is never used.
  • Line 283: self._dummy_reqs[0].sampling_config.beam_width is a no-op; the value is already read in the return expression below.

You can address both with a small refactor:

-        free_mem, total_mem = torch.cuda.mem_get_info()
-        max_memory = self._kv_cache_config.free_gpu_memory_fraction * free_mem
-        self._dummy_reqs[0].sampling_config.beam_width
+        free_mem, _total_mem = torch.cuda.mem_get_info()
+        max_memory = self._kv_cache_config.free_gpu_memory_fraction * free_mem
         max_num_tokens_in_memory = max_memory // self._get_kv_size_per_token(
         ) // self._tokens_per_block * self._tokens_per_block
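
To make the overall clamping concrete, here is a small self-contained sketch of the arithmetic described above (the names are illustrative stand-ins for the actual attributes, and the real code additionally scales by the sampling beam width):

def estimate_max_num_tokens(num_cache_blocks, batch_size, tokens_per_block,
                            kv_size_per_token, free_mem,
                            free_gpu_memory_fraction):
    # Never plan fewer blocks than one per request in the warmup batch.
    num_cache_blocks = max(num_cache_blocks, batch_size)
    # Memory-based cap, rounded down to a whole number of blocks.
    max_memory = int(free_gpu_memory_fraction * free_mem)
    max_num_tokens_in_memory = (max_memory // kv_size_per_token
                                // tokens_per_block * tokens_per_block)
    # Final estimate: block-based token count clamped by the memory limit.
    return min(num_cache_blocks * tokens_per_block, max_num_tokens_in_memory)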
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

747-792: Extra piecewise CUDA-graph warmup pass is reasonable once context warmup requests are fixed.

The second loop that reruns warmup over piecewise_cuda_graph_num_tokens with least_requests=False is a sensible heuristic to allocate larger, more fragmented blocks after capture and help PyTorch reuse them, matching the intent in the comment.

However, this relies on _create_warmup_request(..., num_gen_requests=0, least_requests=False) actually producing a non-empty set of context requests for the given num_tokens. The current implementation of _create_warmup_request never updates num_ctx_requests, so context-only warmups end up with no context requests at all (see separate comment on that function). Once that bug is fixed, this extra pass should behave as intended.


823-915: Bug in _create_warmup_request: num_ctx_requests never set, so context-only warmups are empty.

Within _create_warmup_request:

  • num_ctx_requests is initialized to 0 and never updated.
  • When num_ctx_tokens > 0 and num_gen_requests == 0 (pure context warmups used by autotuner and piecewise CUDA-graph warmup), you compute num_full_seqs / num_left_over_tokens and ctx_token_nums, but still call:
ctx_requests = kv_cache_manager.add_dummy_requests(
    list(range(num_ctx_requests)),  # always []
    token_nums=ctx_token_nums,
    ...
)

so:

  • no context dummy requests are actually created,
  • the batch passed into forward is empty,
  • autotuner/piecewise warmup never exercise real context lengths, and
  • ctx_token_nums is effectively dead.

This breaks the core purpose of these warmups (they still run, but with zero real workload).

You should derive num_ctx_requests from the token partitioning and use it both in the batch-size check and the request_ids passed to add_dummy_requests. For example:

-        num_ctx_tokens = num_tokens - num_gen_tokens
-        num_ctx_requests = 0
+        num_ctx_tokens = num_tokens - num_gen_tokens
+        num_ctx_requests = 0
@@
-        if num_ctx_tokens > 0:
-            if least_requests:
-                num_full_seqs = num_ctx_tokens // max_seq_len
-                num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len
-
-            else:
-                max_bs = min(num_ctx_tokens, max_context_requests)
-                if num_ctx_tokens % max_bs == 0:
-                    num_full_seqs = max_bs
-                else:
-                    num_full_seqs = max_bs - 1
-                max_seq_len = num_ctx_tokens // num_full_seqs
-                num_left_over_tokens = num_ctx_tokens - max_seq_len * num_full_seqs
+        if num_ctx_tokens > 0:
+            if least_requests:
+                # Use as few context requests as possible, up to max_seq_len-1 tokens each.
+                num_full_seqs = num_ctx_tokens // max_seq_len
+                num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len
+            else:
+                # Use as many context requests as allowed (more balanced lengths).
+                max_bs = min(num_ctx_tokens, max_context_requests)
+                if num_ctx_tokens % max_bs == 0:
+                    num_full_seqs = max_bs
+                else:
+                    num_full_seqs = max_bs - 1
+                max_seq_len = num_ctx_tokens // num_full_seqs
+                num_left_over_tokens = num_ctx_tokens - max_seq_len * num_full_seqs
+
+            num_ctx_requests = num_full_seqs + (1 if num_left_over_tokens > 0 else 0)
@@
-        if num_ctx_requests + num_gen_requests > self.batch_size:
+        if num_ctx_requests + num_gen_requests > self.batch_size:
             return None  # Not enough batch size to fill the request
@@
-        if num_ctx_tokens > 0:
-            ctx_token_nums = [max_seq_len] * num_full_seqs
-            if num_left_over_tokens > 0:
-                ctx_token_nums.append(num_left_over_tokens)
-
-            ctx_requests = kv_cache_manager.add_dummy_requests(
-                list(range(num_ctx_requests)),
-                token_nums=ctx_token_nums,
-                is_gen=False,
-                max_num_draft_tokens=self.runtime_draft_len,
-                use_mrope=self.use_mrope)
+        if num_ctx_tokens > 0:
+            ctx_token_nums = [max_seq_len] * num_full_seqs
+            if num_left_over_tokens > 0:
+                ctx_token_nums.append(num_left_over_tokens)
+
+            ctx_requests = kv_cache_manager.add_dummy_requests(
+                list(range(num_ctx_requests)),
+                token_nums=ctx_token_nums,
+                is_gen=False,
+                max_num_draft_tokens=self.runtime_draft_len,
+                use_mrope=self.use_mrope)

This makes the number of created dummy context requests consistent with the token partitioning, and ensures autotuner and CUDA-graph warmups actually stress realistic context workloads.
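
As a quick sanity check of the suggested partitioning, a standalone helper (illustrative only, mirroring the least_requests path above) behaves like this:

def partition_ctx_tokens(num_ctx_tokens, max_seq_len):
    # Pack tokens into as few context requests as possible,
    # each holding up to max_seq_len tokens.
    num_full_seqs = num_ctx_tokens // max_seq_len
    num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len
    num_ctx_requests = num_full_seqs + (1 if num_left_over_tokens > 0 else 0)
    ctx_token_nums = [max_seq_len] * num_full_seqs
    if num_left_over_tokens > 0:
        ctx_token_nums.append(num_left_over_tokens)
    return num_ctx_requests, ctx_token_nums

# e.g. partition_ctx_tokens(2500, 1024) == (3, [1024, 1024, 452])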

🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

579-627: Centralized _general_warmup flow looks solid; consider trimming unused reverse for now.

Unifying torch.compile warmup through _general_warmup and reusing _create_warmup_request is a good cleanup, and the batch/num-token bounds (curr_max_num_tokens, max_batch_size) are consistent with how KV capacity and draft tokens are used elsewhere.

The only minor nit is that the reverse parameter is currently only used to flip the sort order of warmup_requests_configs and is never passed as True from within this file. If there is no immediate plan to use it (e.g., for CUDA-graph-first warmups), you might either:

  • drop the argument for now, or
  • add a brief comment on intended future use to avoid confusion.
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

201-205: Fake FP8 GEMM op rename to fp8_block_scaling_gemm_impl is consistent and shape-correct.

The fake registration now targets trtllm::fp8_block_scaling_gemm_impl, returning an [m, n] BF16 tensor with m = a.shape[0], n = b.shape[0], which matches the C++ implementation (mat1[M,K], mat2[N,K] → out[M,N]).

Optionally, you could use a.shape[-2] / b.shape[-2] for a bit more robustness to accidental higher-rank inputs, but given the op is defined as 2D in C++, the current form is acceptable.
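
For reference, the shape contract described above boils down to something like this hypothetical helper (the actual fake registration targets trtllm::fp8_block_scaling_gemm_impl and its exact signature lives in cpp_custom_ops.py):

import torch

def fake_fp8_block_scaling_gemm_impl(a, b, a_scale, b_scale):
    # Fake/meta kernels only need to describe output metadata for tracing:
    # mat1 is [M, K], mat2 is [N, K], so the result is [M, N] in BF16.
    return a.new_empty((a.shape[0], b.shape[0]), dtype=torch.bfloat16)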

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

1111-1121: Typo in local function name.

Minor typo: fp8_quantize_1x128_sm90_constrant should be fp8_quantize_1x128_sm90_constraint.

-    def fp8_quantize_1x128_sm90_constrant(inputs: List[List[int]]):
+    def fp8_quantize_1x128_sm90_constraint(inputs: List[List[int]]):
         pad_m = fp4_utils.pad_up(inputs[0][0], 4)
         blocked_n = (inputs[0][1] + 127) // 128
         return fp4_utils.pad_up(pad_m * blocked_n * 4, 128) // 4

     if get_sm_version() >= 100:
         return (ConstraintSpec(2, 1, lambda inputs: inputs[0][0]), )
     else:
-        return (ConstraintSpec(2, 0, fp8_quantize_1x128_sm90_constrant), )
+        return (ConstraintSpec(2, 0, fp8_quantize_1x128_sm90_constraint), )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a38d91a and e025364.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (2 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...

Files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
🧠 Learnings (19)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: MrGeva
Repo: NVIDIA/TensorRT-LLM PR: 7219
File: tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py:162-168
Timestamp: 2025-09-04T07:33:10.618Z
Learning: When users explicitly provide cuda_graph_batch_sizes in TorchCudagraphCompiler, respect their choices and only sanitize the values (clamp, dedupe, sort) without forcing additional batch sizes like 1 or max_batch_size. Only add commonly-used batch sizes when falling back to the heuristic.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
🧬 Code graph analysis (3)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
tensorrt_llm/_torch/autotuner.py (6)
  • TunableRunner (159-215)
  • TuningConfig (54-107)
  • get_valid_tactics (162-180)
  • forward (186-212)
  • AutoTuner (520-1319)
  • get (550-553)
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • fp8_block_scaling_gemm (1125-1148)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
  • add_dummy_requests (78-79)
  • add_dummy_requests (453-531)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • batch_size (37-38)
  • ScheduledRequests (20-41)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/pyexecutor/_util.py

281-281: Unpacked variable total_mem is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


283-283: Found useless attribute access. Either assign it to a variable or remove it.

(B018)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1096-1096: Unused method argument: inputs

(ARG002)


1097-1097: Unused method argument: profile

(ARG002)


1104-1104: Unused method argument: tactic

(ARG002)


1152-1152: Unused function argument: a_scale

(ARG001)


1152-1152: Unused function argument: b_scale

(ARG001)


1152-1152: Unused function argument: tune_max_num_tokens

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

711-736: Using continue on KV-cache allocation failure in CUDA-graph capture is the right trade-off.

Switching to continue when batch is None during _capture_generation_cuda_graphs means we still warm up and capture graphs for smaller batch sizes or other draft lengths even if the largest configuration can't be allocated. This improves robustness without affecting correctness.

cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)

385-405: Torch op binding rename to fp8_block_scaling_gemm_impl is consistent with Python side.

The TORCH_LIBRARY_FRAGMENT definition and TORCH_LIBRARY_IMPL mapping now expose the kernel under trtllm::fp8_block_scaling_gemm_impl while still routing to torch_ext::fp8_block_scaling_gemm. This lines up with the updated fake op in cpp_custom_ops.py and keeps the implementation unchanged.

No functional issues spotted here.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

997-1001: LGTM!

The function rename to deep_gemm_gen_tuning_buckets appropriately reflects its broader usage across multiple deep GEMM runners. The bucket generation logic is correct.


1004-1009: LGTM!

Reference correctly updated to use the renamed deep_gemm_gen_tuning_buckets function.


1086-1109: LGTM! Design is appropriate for JIT warmup.

The runner's purpose (triggering DeepGEMM JIT during autotune) is well-documented. The unused inputs, profile, and tactic parameters are required by the TunableRunner interface - static analysis false positives.

One minor observation: since there's only a single tactic [0], the tactic parameter in forward will always be 0, which is correct behavior.


1124-1148: Class-level tuning_config mutation is consistent but worth noting.

The pattern of mutating Fp8BlockScalingGemmRunner.tuning_config at lines 1134 and 1136-1137 is consistent with existing patterns in this file (e.g., MoERunner.tuning_config at line 199, fp8SwapABGemmRunner.tuning_config at line 1061).

This approach works because the autotuner typically runs single-threaded during warmup. However, if concurrent calls with different tune_max_num_tokens values occur, there could be a race condition. This is a pre-existing pattern, so no immediate action required.


1151-1156: Hardcoded output dtype is correct—no issues found.

The C++ implementation explicitly creates output tensors with ScalarType::BFloat16, confirming the hardcoded dtype in the fake registration matches the actual behavior. Unlike fp8_block_scaling_bmm which accepts an optional output_dtype parameter, fp8_block_scaling_gemm_impl has no dtype parameter in its signature, making the fixed bfloat16 output intentional.

The unused a_scale, b_scale, and tune_max_num_tokens parameters are expected for fake registrations (required for signature matching with the real op).

@coderabbitai
Contributor

coderabbitai bot commented Nov 25, 2025

📝 Walkthrough

Walkthrough

This PR renames the fp8_block_scaling_gemm Torch binding to fp8_block_scaling_gemm_impl across multiple files, introduces a new Fp8BlockScalingGemmRunner with auto-tuning capabilities, refactors warmup logic into a generalized flow, and enhances KV-cache estimation with memory-aware constraints and CUDA graph warmup requirements.

Changes

Cohort / File(s) Summary
Torch Binding Renaming
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
Renames public Torch binding API from fp8_block_scaling_gemm to fp8_block_scaling_gemm_impl in both fragment declaration and CUDA implementation registration. Internal C++ function reference remains unchanged.
New FP8 Block Scaling GEMM Operator
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
Introduces new Fp8BlockScalingGemmRunner class with auto-tuning support, get_fp8_block_scaling_gemm_constraint_spec() helper function, and public fp8_block_scaling_gemm() operator with fake implementation registration for torch.compile compatibility. Renames tuning bucket generator from fp8_swap_ab_gen_tuning_buckets() to deep_gemm_gen_tuning_buckets() with explicit return value.
KV-Cache Estimation Enhancement
tensorrt_llm/_torch/pyexecutor/_util.py
Enhances token estimation logic to constrain calculations by CUDA graph warmup token requirements (based on model batch size and graph batch size) and free memory limits (percentage-based cap). Final estimate is minimum of beam-width scaled block-based count and memory-based limit.
Warmup Flow Generalization
tensorrt_llm/_torch/pyexecutor/model_engine.py
Introduces _general_warmup() method consolidating warmup iterations with dynamic batch sizing. Refactors _run_torch_compile_warmup() to delegate to generalized approach. Updates _create_warmup_request() signature with num_gen_requests and least_requests parameters to support new warmup semantics. Modifies CUDA graph capture to use continue for batch allocation failures, enabling multi-batch processing.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant fp8_block_scaling_gemm as fp8_block_scaling_gemm()
    participant AutoTuner
    participant Fp8BlockScalingGemmRunner
    participant fp8_block_scaling_gemm_impl

    User->>fp8_block_scaling_gemm: Call with tensors (a, b, a_scale, b_scale)
    fp8_block_scaling_gemm->>AutoTuner: Select best tactic based on<br/>constraint specs & input shape
    AutoTuner->>Fp8BlockScalingGemmRunner: Get valid tactics for SM version
    Fp8BlockScalingGemmRunner-->>AutoTuner: Return tactic list
    AutoTuner-->>fp8_block_scaling_gemm: Return selected tactic
    fp8_block_scaling_gemm->>fp8_block_scaling_gemm_impl: Call with selected tactic
    fp8_block_scaling_gemm_impl-->>fp8_block_scaling_gemm: Return result tensor
    fp8_block_scaling_gemm-->>User: Return fused GEMM output
sequenceDiagram
    participant PyTorchModelEngine
    participant _general_warmup
    participant _capture_generation_cuda_graphs
    participant _create_warmup_request as _create_warmup_request<br/>(num_gen_requests, least_requests)
    participant Forward

    PyTorchModelEngine->>_general_warmup: Start warmup (resource_manager, reverse)
    _general_warmup->>_general_warmup: Compute dynamic max_batch_size
    loop For each warmup config (sorted)
        _general_warmup->>_capture_generation_cuda_graphs: Process batch size/draft length
        alt Batch allocation success
            _capture_generation_cuda_graphs->>_create_warmup_request: Create warmup request<br/>(num_gen_requests, least_requests)
            _create_warmup_request->>_create_warmup_request: Enforce per-batch constraints<br/>Compute num_ctx_tokens, num_full_seqs
            _create_warmup_request-->>_capture_generation_cuda_graphs: Return ScheduledRequests
            _capture_generation_cuda_graphs->>Forward: Execute forward pass
            Forward-->>_capture_generation_cuda_graphs: Allocate memory blocks
        else Batch allocation fails
            _capture_generation_cuda_graphs->>_capture_generation_cuda_graphs: Continue to next config
        end
    end
    _general_warmup-->>PyTorchModelEngine: Warmup complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Key areas requiring attention:
    • torch_custom_ops.py: New Fp8BlockScalingGemmRunner implementation and auto-tuning integration logic; constraint spec selection across SM versions
    • model_engine.py: Significant refactoring of warmup flow with new _create_warmup_request() signature and control flow changes in CUDA graph capture (continue vs. return semantics); memory allocation handling under new parameter semantics
    • _util.py: KV-cache estimation now includes CUDA graph warmup constraints and memory-aware capping; verify calculation correctness for edge cases (e.g., batch size vs. warmup token requirements)
    • Cross-file consistency: Verify renamed tuning function deep_gemm_gen_tuning_buckets is correctly integrated in all call sites

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description is largely incomplete; it provides only a one-line summary and leaves required template sections (Description, Test Coverage, PR Checklist items) empty or unaddressed. Complete the PR description template by filling in the Description section with details about the specific changes, listing relevant tests in Test Coverage, and addressing PR Checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title references cherry-picking conflict changes from PRs 7999 and 8515 but lacks specificity about the actual changes being made. Provide a more specific title that clearly describes the main technical change being applied, such as '[None][fix] Rename fp8_block_scaling_gemm operations and update warmup flow' or similar.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

850-902: Critical bug: num_ctx_requests is never computed.

The variable num_ctx_requests is initialized to 0 at line 851 and is never updated, but it's used in:

  1. Line 877: Batch size check if num_ctx_requests + num_gen_requests > self.batch_size
  2. Line 894: Generating request IDs list(range(num_ctx_requests))
  3. Line 902: Adding spec resource manager requests
  4. Lines 906-918: Range calculations for generation requests

This results in:

  • The batch size check being effectively if num_gen_requests > self.batch_size
  • Context requests getting an empty ID range range(0), producing no dummy requests even when ctx_token_nums is non-empty

The fix should compute num_ctx_requests after determining num_full_seqs and num_left_over_tokens:

             if least_requests:
                 num_full_seqs = num_ctx_tokens // max_seq_len
                 num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len

             else:
                 max_bs = min(num_ctx_tokens, max_context_requests)
                 if num_ctx_tokens % max_bs == 0:
                     num_full_seqs = max_bs
                 else:
                     num_full_seqs = max_bs - 1
                 max_seq_len = num_ctx_tokens // num_full_seqs
                 num_left_over_tokens = num_ctx_tokens - max_seq_len * num_full_seqs

+            num_ctx_requests = num_full_seqs + (1 if num_left_over_tokens > 0 else 0)
+
         if num_ctx_requests + num_gen_requests > self.batch_size:
             return None  # Not enough batch size to fill the request
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

793-803: w4a8_mxfp4_fp8_gemm fake implementation uses undefined act_fp8

In the fake implementation of w4a8_mxfp4_fp8_gemm, the parameter is named act_fp4 but the body calls act_fp8.new_empty(...). This will raise a NameError in meta/fake runs (e.g., torch.compile) and prevent graph construction.

You likely intended to use the actual input tensor argument:

 @w4a8_mxfp4_fp8_gemm.register_fake
 def _(
-    act_fp4: torch.Tensor,
+    act_fp4: torch.Tensor,
     weight: torch.Tensor,
     act_sf: torch.Tensor,
     weight_scale: torch.Tensor,
     alpha: torch.Tensor,
     output_dtype: torch.dtype,
     to_userbuffers: bool = False,
 ) -> torch.Tensor:
-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
-                             dtype=output_dtype)
+    return act_fp4.new_empty((act_fp4.size(0), weight.size(0)),
+                             dtype=output_dtype)
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

602-610: Variable naming mismatch with parameter name.

The loop variable num_gen_tokens is passed to _create_warmup_request which expects num_gen_requests. While semantically correct (for generation batches each request contributes 1 + runtime_draft_len tokens), the naming is misleading and inconsistent with the function signature.

Consider renaming for clarity:

-        for num_tokens, num_gen_tokens in warmup_requests_configs:
+        for num_tokens, num_gen_requests in warmup_requests_configs:
             with self._release_batch_context(
                     self._create_warmup_request(resource_manager, num_tokens,
-                                                num_gen_tokens),
+                                                num_gen_requests),
                     resource_manager) as batch:
tensorrt_llm/_torch/pyexecutor/_util.py (1)

281-284: Unused variable total_mem.

The total_mem variable from torch.cuda.mem_get_info() is unpacked but never used. Per static analysis hint.

-        free_mem, total_mem = torch.cuda.mem_get_info()
+        free_mem, _ = torch.cuda.mem_get_info()
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)

1086-1109: Fp8BlockScalingGemmRunner correctly wraps the impl op; unused-arg lints are cosmetic

The runner’s tuning_config reuses deep_gemm_gen_tuning_buckets, get_valid_tactics returning [0] matches the fact there is only one underlying kernel configuration, and forward delegating straight to torch.ops.trtllm.fp8_block_scaling_gemm_impl is exactly what you want to both trigger JIT and route execution through the C++ binding.

Ruff’s ARG002 hints about the unused inputs, profile (in get_valid_tactics) and tactic (in forward) are purely stylistic; if you care about a clean lint run, you can safely rename them to _inputs, _profile, and _tactic without changing behavior.


1111-1122: ConstraintSpec logic matches fp8_quantize_1x128 behavior; consider renaming helper for clarity

The SM‑dependent ConstraintSpec setup mirrors the shapes produced by fp8_quantize_1x128:

  • For get_sm_version() >= 100, constraining dim‑1 of the third tensor to inputs[0][0] enforces a_scale.shape[1] == a.shape[0] (M).
  • For lower SMs, the fp8_quantize_1x128_sm90_constrant helper reproduces the 1‑D scale shape formula used in the fake fp8 quantize op.

Functionally this looks correct. Minor readability nit: fp8_quantize_1x128_sm90_constrant seems to have a typo in “constrant”; renaming it to “…_constraint” would make its purpose clearer.
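
For a concrete feel of that formula, here is a quick numeric check with a local pad_up stand-in (assumed, like fp4_utils.pad_up, to round up to the nearest multiple):

def pad_up(x, multiple):
    # Round x up to the nearest multiple of `multiple`.
    return (x + multiple - 1) // multiple * multiple

def sm90_scale_constraint(m, n):
    pad_m = pad_up(m, 4)
    blocked_n = (n + 127) // 128
    return pad_up(pad_m * blocked_n * 4, 128) // 4

# e.g. m=5, n=200: pad_m=8, blocked_n=2, so pad_up(64, 128) // 4 == 32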


1124-1148: Auto-tuned fp8_block_scaling_gemm public op is wired correctly; small fake-impl cleanup opportunity

The new public custom op:

  • Uses AutoTuner with Fp8BlockScalingGemmRunner and the shared deep‑GEMM buckets.
  • Updates tune_max_num_tokens and SM‑specific constraint_specs before tuning.
  • Invokes the runner, which in turn calls torch.ops.trtllm.fp8_block_scaling_gemm_impl, cleanly separating public op from impl op.

The fake implementation returns a BF16 tensor of shape (a.shape[0], b.shape[0]), matching the real C++ implementation’s output shape and dtype, so it should behave well under torch.compile.

Ruff’s ARG001 hints here are just about unused parameters (a_scale, b_scale, tune_max_num_tokens) in the fake path. If you want to silence them, you can rename those to _a_scale, _b_scale, and _tune_max_num_tokens (or explicitly del them), without affecting behavior.

Also applies to: 1151-1155
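
If the rename route is taken, the fake path could look roughly like this (decorator form and exact parameter list are assumptions; the shape and dtype mirror what the comment above describes):

```python
# Sketch only: decorator form and parameter names are assumptions.
import torch


@torch.library.register_fake("trtllm::fp8_block_scaling_gemm")
def _(a: torch.Tensor, b: torch.Tensor, _a_scale: torch.Tensor,
      _b_scale: torch.Tensor, _tune_max_num_tokens: int) -> torch.Tensor:
    # Matches the real kernel: output is (M, N) in BF16.
    return a.new_empty((a.shape[0], b.shape[0]), dtype=torch.bfloat16)
```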

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a38d91a and fca3543.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (2 hunks)
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...

Files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
🧠 Learnings (17)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧬 Code graph analysis (2)
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • fp8_block_scaling_gemm (1125-1148)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • batch_size (37-38)
  • ScheduledRequests (20-41)
tensorrt_llm/_torch/attention_backend/sparse/rocket.py (1)
  • add_dummy_requests (923-950)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/pyexecutor/_util.py

281-281: Unpacked variable total_mem is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1096-1096: Unused method argument: inputs

(ARG002)


1097-1097: Unused method argument: profile

(ARG002)


1104-1104: Unused method argument: tactic

(ARG002)


1152-1152: Unused function argument: a_scale

(ARG001)


1152-1152: Unused function argument: b_scale

(ARG001)


1152-1152: Unused function argument: tune_max_num_tokens

(ARG001)

🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

617-626: LGTM!

Clean refactor that delegates warmup logic to the new generalized _general_warmup method while preserving the no_cuda_graph() context wrapper.
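
The shape of the refactor, roughly (the enclosing method name and argument list are placeholders; only _general_warmup and no_cuda_graph come from the change itself):

```python
# Illustrative sketch of the delegation pattern; names other than
# _general_warmup and no_cuda_graph are placeholders.
def _torch_compile_warmup(self, resource_manager):
    with self.no_cuda_graph():
        # Batch-size/token-count iteration now lives in the shared helper.
        self._general_warmup(resource_manager)
```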


720-722: Good improvement to CUDA graph capture resilience.

Changing from return to continue allows the capture process to proceed with other batch sizes even when a specific configuration cannot be allocated. This is more robust as smaller batch sizes may still fit in available memory.
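
Sketched, the pattern looks like this (loop structure and helper names are approximations of model_engine.py, not the literal code):

```python
# Approximate sketch; names are placeholders for the actual capture loop.
def _capture_cuda_graphs(self, resource_manager, cuda_graph_batch_sizes):
    for batch_size in cuda_graph_batch_sizes:
        warmup_request = self._create_warmup_request(resource_manager,
                                                     batch_size)
        if warmup_request is None:
            # Previously a `return` here aborted all remaining captures when
            # one batch size could not be allocated; `continue` lets smaller
            # batch sizes still be captured.
            continue
        self._capture_cuda_graph(warmup_request)
```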


772-791: LGTM!

The post-capture warmup loop addresses memory fragmentation by pre-allocating blocks for "most requests" patterns. The comment clearly explains the rationale for this additional warmup phase.

tensorrt_llm/_torch/pyexecutor/_util.py (2)

268-279: LGTM!

The CUDA graph warmup token constraints ensure sufficient KV cache blocks are allocated for successful CUDA graph capture. The formula accounts for one maximum-length sequence plus additional blocks for the remaining batch entries, and the minimum batch size constraint ensures basic operation is always possible.
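
A back-of-the-envelope version of that budget (variable names are illustrative; the actual arithmetic in _util.py may differ in details such as draft tokens):

```python
# Illustrative arithmetic only; mirrors the described constraint, not the code.
def cuda_graph_warmup_token_floor(max_seq_len: int, max_batch_size: int,
                                  tokens_per_block: int) -> int:
    # One maximum-length sequence...
    tokens = max_seq_len
    # ...plus at least one KV-cache block for each remaining entry in the
    # largest CUDA graph batch, so every dummy request can be allocated
    # during warmup.
    tokens += (max_batch_size - 1) * tokens_per_block
    return tokens
```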


287-290: LGTM!

The memory-aware capping ensures the KV cache token estimate doesn't exceed available GPU memory. The block alignment and final min() operation provide a conservative, safe upper bound.
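
Conceptually, the cap works like this (parameter names and the bytes-per-token bookkeeping are assumptions for illustration):

```python
# Illustrative sketch of the memory-aware cap; not the literal _util.py code.
import torch


def cap_tokens_by_free_memory(estimated_tokens: int, bytes_per_kv_token: int,
                              tokens_per_block: int) -> int:
    free_mem, _ = torch.cuda.mem_get_info()
    max_tokens = free_mem // bytes_per_kv_token
    # Align down to whole KV-cache blocks, then take the conservative minimum.
    max_tokens = (max_tokens // tokens_per_block) * tokens_per_block
    return min(estimated_tokens, max_tokens)
```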

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)

201-205: Fake fp8_block_scaling_gemm_impl wiring and shape/dtype look correct

Renaming the fake registration to trtllm::fp8_block_scaling_gemm_impl matches the C++ Torch binding and the new Python runner call site, and the fake output (m = a.shape[0], n = b.shape[0]) with BF16 dtype is consistent with the real implementation in cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

997-1001: Shared deep‑GEMM tuning bucket helper and reuse in fp8SwapABGemmRunner look fine

deep_gemm_gen_tuning_buckets produces a dense set of M buckets in steps of 8 (and then 128) and is now reused by fp8SwapABGemmRunner.tuning_config, which keeps the tuning behavior uniform for these FP8 deep‑GEMM paths. No functional issues stand out here.

Also applies to: 1004-1009
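
For reference, a bucket generator along the described lines could look like this; the starting value and the crossover point from step 8 to step 128 are assumed placeholders:

```python
# Sketch only: the 1024 crossover and the start at 8 are assumed placeholders.
def deep_gemm_gen_tuning_buckets(max_num_tokens: int) -> tuple:
    buckets = list(range(8, min(max_num_tokens, 1024) + 1, 8))
    if max_num_tokens > 1024:
        buckets += list(range(1024 + 128, max_num_tokens + 1, 128))
    return tuple(buckets)
```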

cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)

387-397: Torch binding rename to fp8_block_scaling_gemm_impl is correctly integrated; no broken callers

Verification confirms the rename is safe. The Python wrapper trtllm::fp8_block_scaling_gemm at torch_custom_ops.py:1124 correctly routes through Fp8BlockScalingGemmRunner, which invokes the new torch.ops.trtllm.fp8_block_scaling_gemm_impl binding at line 1107. All production callers (linear.py, tests) use the Python wrapper, not the C++ binding directly, so they remain unaffected. The _impl suffix properly marks it as an implementation detail.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #25732 [ run ] completed with state SUCCESS. Commit: fca3543
/LLM/main/L0_MergeRequest_PR pipeline #19512 completed with status: 'FAILURE'

@liji-nv liji-nv force-pushed the dev-liji-cherry-pick-conflict-changes branch from fca3543 to ef0d44c Compare December 1, 2025 08:10
@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 1, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #26403 [ run ] triggered by Bot. Commit: ef0d44c

@tensorrt-cicd
Copy link
Collaborator

PR_Github #26403 [ run ] completed with state SUCCESS. Commit: ef0d44c
/LLM/main/L0_MergeRequest_PR pipeline #20061 completed with status: 'FAILURE'

@liji-nv liji-nv requested a review from a team as a code owner December 2, 2025 07:19
@liji-nv liji-nv requested a review from symphonylyh December 2, 2025 07:19
@liji-nv liji-nv force-pushed the dev-liji-cherry-pick-conflict-changes branch from 9f8c9e1 to 2e1df1f Compare December 2, 2025 07:20
@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 2, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #26553 [ run ] triggered by Bot. Commit: 2e1df1f

@tensorrt-cicd
Copy link
Collaborator

PR_Github #26553 [ run ] completed with state FAILURE. Commit: 2e1df1f
/LLM/main/L0_MergeRequest_PR pipeline #20192 completed with status: 'FAILURE'

@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 2, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #26606 [ run ] triggered by Bot. Commit: 2e1df1f

@tensorrt-cicd
Copy link
Collaborator

PR_Github #26606 [ run ] completed with state FAILURE. Commit: 2e1df1f
/LLM/main/L0_MergeRequest_PR pipeline #20233 completed with status: 'FAILURE'

@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@liji-nv liji-nv force-pushed the dev-liji-cherry-pick-conflict-changes branch 2 times, most recently from 658f058 to e12a16d Compare December 23, 2025 06:38
@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 23, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29538 [ run ] triggered by Bot. Commit: e12a16d

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29538 [ run ] completed with state FAILURE. Commit: e12a16d

@liji-nv liji-nv force-pushed the dev-liji-cherry-pick-conflict-changes branch from e12a16d to d24bb34 Compare December 24, 2025 03:19
@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29711 [ run ] triggered by Bot. Commit: d24bb34

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
NVIDIA#8515)

@liji-nv liji-nv force-pushed the dev-liji-cherry-pick-conflict-changes branch from d24bb34 to ad7c561 Compare December 24, 2025 07:30
@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29766 [ run ] triggered by Bot. Commit: ad7c561

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29766 [ run ] completed with state SUCCESS. Commit: ad7c561
/LLM/main/L0_MergeRequest_PR pipeline #22874 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 25, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29876 [ run ] triggered by Bot. Commit: ad7c561

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29876 [ run ] completed with state SUCCESS. Commit: ad7c561
/LLM/main/L0_MergeRequest_PR pipeline #22977 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@liji-nv
Copy link
Collaborator Author

liji-nv commented Dec 25, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29959 [ run ] triggered by Bot. Commit: ad7c561

@tensorrt-cicd
Copy link
Collaborator

PR_Github #29959 [ run ] completed with state SUCCESS. Commit: ad7c561
/LLM/main/L0_MergeRequest_PR pipeline #23044 completed with status: 'SUCCESS'

@liji-nv liji-nv merged commit 7e4cef9 into NVIDIA:main Dec 25, 2025
5 checks passed
liji-nv added a commit that referenced this pull request Dec 31, 2025
…#9446 (#10334)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
videodanchik pushed a commit to videodanchik/TensorRT-LLM that referenced this pull request Jan 14, 2026
…9446)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
videodanchik pushed a commit to videodanchik/TensorRT-LLM that referenced this pull request Jan 14, 2026
…NVIDIA#9446 (NVIDIA#10334)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>