[None][fix] Cherry-pick conflict changes for PR 7999 PR 8515 #9446
Conversation
e025364 to fca3543 (Compare)

/bot run

PR_Github #25732 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This PR introduces a renamed FP8 block scaling GEMM operation with new autotuning support, refactors the warmup mechanism for torch.compile and CUDA graph captures, and improves KV-cache memory estimation to account for CUDA graph warmup overhead.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings, 1 inconclusive)
✨ Finishing touches
🧪 Generate unit tests (beta)
Tip: 📝 Customizable high-level summaries are now available in beta! You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
241-291: KV-cache estimation warmup logic looks good; clean up Ruff issues and dead access.

The new logic that:

- folds CUDA graph warmup requirements into `num_cache_blocks`,
- enforces a minimum of `batch_size` blocks, and
- caps by `free_gpu_memory_fraction * free_mem`

is a reasonable, conservative estimate and matches the subsequent use in `try_prepare_estimation`.

Two minor issues from static analysis are worth fixing:

- Line 281: `total_mem` from `torch.cuda.mem_get_info()` is never used.
- Line 283: `self._dummy_reqs[0].sampling_config.beam_width` is a no-op; the value is already read in the return expression below.

You can address both with a small refactor:

```diff
-        free_mem, total_mem = torch.cuda.mem_get_info()
-        max_memory = self._kv_cache_config.free_gpu_memory_fraction * free_mem
-        self._dummy_reqs[0].sampling_config.beam_width
+        free_mem, _total_mem = torch.cuda.mem_get_info()
+        max_memory = self._kv_cache_config.free_gpu_memory_fraction * free_mem
         max_num_tokens_in_memory = max_memory // self._get_kv_size_per_token(
         ) // self._tokens_per_block * self._tokens_per_block
```
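For reference, a minimal, self-contained sketch of the capping arithmetic described in that comment; the function and its parameters are stand-ins for the executor's real attributes (`_get_kv_size_per_token`, `_tokens_per_block`, `free_gpu_memory_fraction`), and it assumes a CUDA device is available:

```python
import torch

def estimate_max_tokens_in_memory(kv_size_per_token: int,
                                  tokens_per_block: int,
                                  free_gpu_memory_fraction: float) -> int:
    """Sketch of the cap: a fraction of free memory, rounded down to whole KV blocks."""
    free_mem, _total_mem = torch.cuda.mem_get_info()
    max_memory = free_gpu_memory_fraction * free_mem
    # Integer-divide by the per-token KV size, then round down to a block multiple.
    return int(max_memory // kv_size_per_token
               // tokens_per_block * tokens_per_block)

# Example (hypothetical sizes): 90% of free memory, 64 KiB per token, 32-token blocks.
# print(estimate_max_tokens_in_memory(64 * 1024, 32, 0.9))
```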
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

747-792: Extra piecewise CUDA-graph warmup pass is reasonable once context warmup requests are fixed.

The second loop that reruns warmup over `piecewise_cuda_graph_num_tokens` with `least_requests=False` is a sensible heuristic to allocate larger, more fragmented blocks after capture and help PyTorch reuse them, matching the intent in the comment.

However, this relies on `_create_warmup_request(..., num_gen_requests=0, least_requests=False)` actually producing a non-empty set of context requests for the given `num_tokens`. The current implementation of `_create_warmup_request` never updates `num_ctx_requests`, so context-only warmups end up with no context requests at all (see separate comment on that function). Once that bug is fixed, this extra pass should behave as intended.
823-915: Bug in `_create_warmup_request`: `num_ctx_requests` never set, so context-only warmups are empty.

Within `_create_warmup_request`:

- `num_ctx_requests` is initialized to 0 and never updated.
- When `num_ctx_tokens > 0` and `num_gen_requests == 0` (pure context warmups used by autotuner and piecewise CUDA-graph warmup), you compute `num_full_seqs`/`num_left_over_tokens` and `ctx_token_nums`, but still call:

```python
ctx_requests = kv_cache_manager.add_dummy_requests(
    list(range(num_ctx_requests)),  # always []
    token_nums=ctx_token_nums,
    ...)
```

so:

- no context dummy requests are actually created,
- the batch passed into `forward` is empty,
- autotuner/piecewise warmup never exercise real context lengths, and
- `ctx_token_nums` is effectively dead.

This breaks the core purpose of these warmups (they still run, but with zero real workload).

You should derive `num_ctx_requests` from the token partitioning and use it both in the batch-size check and the `request_ids` passed to `add_dummy_requests`. For example:

```diff
-        num_ctx_tokens = num_tokens - num_gen_tokens
-        num_ctx_requests = 0
+        num_ctx_tokens = num_tokens - num_gen_tokens
+        num_ctx_requests = 0
@@
-        if num_ctx_tokens > 0:
-            if least_requests:
-                num_full_seqs = num_ctx_tokens // max_seq_len
-                num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len
-
-            else:
-                max_bs = min(num_ctx_tokens, max_context_requests)
-                if num_ctx_tokens % max_bs == 0:
-                    num_full_seqs = max_bs
-                else:
-                    num_full_seqs = max_bs - 1
-                max_seq_len = num_ctx_tokens // num_full_seqs
-                num_left_over_tokens = num_ctx_tokens - max_seq_len * num_full_seqs
+        if num_ctx_tokens > 0:
+            if least_requests:
+                # Use as few context requests as possible, up to max_seq_len-1 tokens each.
+                num_full_seqs = num_ctx_tokens // max_seq_len
+                num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len
+            else:
+                # Use as many context requests as allowed (more balanced lengths).
+                max_bs = min(num_ctx_tokens, max_context_requests)
+                if num_ctx_tokens % max_bs == 0:
+                    num_full_seqs = max_bs
+                else:
+                    num_full_seqs = max_bs - 1
+                max_seq_len = num_ctx_tokens // num_full_seqs
+                num_left_over_tokens = num_ctx_tokens - max_seq_len * num_full_seqs
+
+            num_ctx_requests = num_full_seqs + (1 if num_left_over_tokens > 0 else 0)
@@
-        if num_ctx_requests + num_gen_requests > self.batch_size:
+        if num_ctx_requests + num_gen_requests > self.batch_size:
             return None  # Not enough batch size to fill the request
@@
-        if num_ctx_tokens > 0:
-            ctx_token_nums = [max_seq_len] * num_full_seqs
-            if num_left_over_tokens > 0:
-                ctx_token_nums.append(num_left_over_tokens)
-
-            ctx_requests = kv_cache_manager.add_dummy_requests(
-                list(range(num_ctx_requests)),
-                token_nums=ctx_token_nums,
-                is_gen=False,
-                max_num_draft_tokens=self.runtime_draft_len,
-                use_mrope=self.use_mrope)
+        if num_ctx_tokens > 0:
+            ctx_token_nums = [max_seq_len] * num_full_seqs
+            if num_left_over_tokens > 0:
+                ctx_token_nums.append(num_left_over_tokens)
+
+            ctx_requests = kv_cache_manager.add_dummy_requests(
+                list(range(num_ctx_requests)),
+                token_nums=ctx_token_nums,
+                is_gen=False,
+                max_num_draft_tokens=self.runtime_draft_len,
+                use_mrope=self.use_mrope)
```

This makes the number of created dummy context requests consistent with the token partitioning, and ensures autotuner and CUDA-graph warmups actually stress realistic context workloads.
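To see the proposed partitioning concretely, here is a small stand-alone check with made-up numbers (not values taken from the PR):

```python
# Illustrative check of the least_requests=True branch described above.
num_ctx_tokens = 8192
max_seq_len = 3000

num_full_seqs = num_ctx_tokens // max_seq_len                          # 2 full sequences
num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len    # 2192 tokens left
num_ctx_requests = num_full_seqs + (1 if num_left_over_tokens > 0 else 0)  # 3 requests

ctx_token_nums = [max_seq_len] * num_full_seqs
if num_left_over_tokens > 0:
    ctx_token_nums.append(num_left_over_tokens)

assert sum(ctx_token_nums) == num_ctx_tokens
assert len(ctx_token_nums) == num_ctx_requests
print(ctx_token_nums)  # [3000, 3000, 2192]
```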
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
579-627: Centralized `_general_warmup` flow looks solid; consider trimming unused `reverse` for now.

Unifying torch.compile warmup through `_general_warmup` and reusing `_create_warmup_request` is a good cleanup, and the batch/num-token bounds (`curr_max_num_tokens`, `max_batch_size`) are consistent with how KV capacity and draft tokens are used elsewhere.

The only minor nit is that the `reverse` parameter is currently only used to flip the sort order of `warmup_requests_configs` and is never passed as `True` from within this file. If there is no immediate plan to use it (e.g., for CUDA-graph-first warmups), you might either:
- drop the argument for now, or
- add a brief comment on intended future use to avoid confusion.
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
201-205: Fake FP8 GEMM op rename to `fp8_block_scaling_gemm_impl` is consistent and shape-correct.

The fake registration now targets `trtllm::fp8_block_scaling_gemm_impl`, returning an `[m, n]` BF16 tensor with `m = a.shape[0]`, `n = b.shape[0]`, which matches the C++ implementation (mat1 `[M, K]`, mat2 `[N, K]` → out `[M, N]`).

Optionally, you could use `a.shape[-2]`/`b.shape[-2]` for a bit more robustness to accidental higher-rank inputs, but given the op is defined as 2D in C++, the current form is acceptable.
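For readers unfamiliar with fake (meta) registrations, here is a minimal sketch of the shape/dtype contract described above. The `trtllm_demo` namespace and the eager fallback body are hypothetical stand-ins, not the repository's actual registration:

```python
import torch

@torch.library.custom_op("trtllm_demo::fp8_block_scaling_gemm_impl", mutates_args=())
def fp8_block_scaling_gemm_impl(a: torch.Tensor, b: torch.Tensor,
                                a_scale: torch.Tensor,
                                b_scale: torch.Tensor) -> torch.Tensor:
    # Reference (non-FP8) fallback just to make the sketch executable.
    return (a.float() @ b.float().t()).to(torch.bfloat16)

@fp8_block_scaling_gemm_impl.register_fake
def _(a, b, a_scale, b_scale):
    # Shape contract described in the review: mat1 [M, K], mat2 [N, K] -> out [M, N], BF16.
    return a.new_empty((a.shape[0], b.shape[0]), dtype=torch.bfloat16)
```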
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

1111-1121: Typo in local function name.

Minor typo: `fp8_quantize_1x128_sm90_constrant` should be `fp8_quantize_1x128_sm90_constraint`.

```diff
-    def fp8_quantize_1x128_sm90_constrant(inputs: List[List[int]]):
+    def fp8_quantize_1x128_sm90_constraint(inputs: List[List[int]]):
         pad_m = fp4_utils.pad_up(inputs[0][0], 4)
         blocked_n = (inputs[0][1] + 127) // 128
         return fp4_utils.pad_up(pad_m * blocked_n * 4, 128) // 4

     if get_sm_version() >= 100:
         return (ConstraintSpec(2, 1, lambda inputs: inputs[0][0]), )
     else:
-        return (ConstraintSpec(2, 0, fp8_quantize_1x128_sm90_constrant), )
+        return (ConstraintSpec(2, 0, fp8_quantize_1x128_sm90_constraint), )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (2 hunks)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`if` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
🧠 Learnings (19)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: MrGeva
Repo: NVIDIA/TensorRT-LLM PR: 7219
File: tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py:162-168
Timestamp: 2025-09-04T07:33:10.618Z
Learning: When users explicitly provide cuda_graph_batch_sizes in TorchCudagraphCompiler, respect their choices and only sanitize the values (clamp, dedupe, sort) without forcing additional batch sizes like 1 or max_batch_size. Only add commonly-used batch sizes when falling back to the heuristic.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
🧬 Code graph analysis (3)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
tensorrt_llm/_torch/autotuner.py (6)
TunableRunner(159-215), TuningConfig(54-107), get_valid_tactics(162-180), forward(186-212), AutoTuner(520-1319), get(550-553)
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
fp8_block_scaling_gemm(1125-1148)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
add_dummy_requests(78-79), add_dummy_requests(453-531)

tensorrt_llm/_torch/pyexecutor/scheduler.py (2)

batch_size(37-38), ScheduledRequests(20-41)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/pyexecutor/_util.py
281-281: Unpacked variable total_mem is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
283-283: Found useless attribute access. Either assign it to a variable or remove it.
(B018)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
1096-1096: Unused method argument: inputs
(ARG002)
1097-1097: Unused method argument: profile
(ARG002)
1104-1104: Unused method argument: tactic
(ARG002)
1152-1152: Unused function argument: a_scale
(ARG001)
1152-1152: Unused function argument: b_scale
(ARG001)
1152-1152: Unused function argument: tune_max_num_tokens
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
711-736: Using `continue` on KV-cache allocation failure in CUDA-graph capture is the right trade-off.

Switching to `continue` when `batch is None` during `_capture_generation_cuda_graphs` means we still warm up and capture graphs for smaller batch sizes or other draft lengths even if the largest configuration can't be allocated. This improves robustness without affecting correctness.

cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)

385-405: Torch op binding rename to `fp8_block_scaling_gemm_impl` is consistent with the Python side.

The `TORCH_LIBRARY_FRAGMENT` definition and `TORCH_LIBRARY_IMPL` mapping now expose the kernel under `trtllm::fp8_block_scaling_gemm_impl` while still routing to `torch_ext::fp8_block_scaling_gemm`. This lines up with the updated fake op in `cpp_custom_ops.py` and keeps the implementation unchanged.

No functional issues spotted here.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

997-1001: LGTM!

The function rename to `deep_gemm_gen_tuning_buckets` appropriately reflects its broader usage across multiple deep GEMM runners. The bucket generation logic is correct.

1004-1009: LGTM!

Reference correctly updated to use the renamed `deep_gemm_gen_tuning_buckets` function.

1086-1109: LGTM! Design is appropriate for JIT warmup.

The runner's purpose (triggering DeepGEMM JIT during autotune) is well-documented. The unused `inputs`, `profile`, and `tactic` parameters are required by the `TunableRunner` interface - static analysis false positives.

One minor observation: since there's only a single tactic `[0]`, the `tactic` parameter in `forward` will always be 0, which is correct behavior.

1124-1148: Class-level `tuning_config` mutation is consistent but worth noting.

The pattern of mutating `Fp8BlockScalingGemmRunner.tuning_config` at lines 1134 and 1136-1137 is consistent with existing patterns in this file (e.g., `MoERunner.tuning_config` at line 199, `fp8SwapABGemmRunner.tuning_config` at line 1061).

This approach works because the autotuner typically runs single-threaded during warmup. However, if concurrent calls with different `tune_max_num_tokens` values occur, there could be a race condition. This is a pre-existing pattern, so no immediate action required.

1151-1156: Hardcoded output dtype is correct—no issues found.

The C++ implementation explicitly creates output tensors with `ScalarType::BFloat16`, confirming the hardcoded dtype in the fake registration matches the actual behavior. Unlike `fp8_block_scaling_bmm`, which accepts an optional `output_dtype` parameter, `fp8_block_scaling_gemm_impl` has no dtype parameter in its signature, making the fixed bfloat16 output intentional.

The unused `a_scale`, `b_scale`, and `tune_max_num_tokens` parameters are expected for fake registrations (required for signature matching with the real op).
📝 Walkthrough

This PR renames the fp8_block_scaling_gemm Torch binding to fp8_block_scaling_gemm_impl across multiple files, introduces a new Fp8BlockScalingGemmRunner with auto-tuning capabilities, refactors warmup logic into a generalized flow, and enhances KV-cache estimation with memory-aware constraints and CUDA graph warmup requirements.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant fp8_block_scaling_gemm as fp8_block_scaling_gemm()
participant AutoTuner
participant Fp8BlockScalingGemmRunner
participant fp8_block_scaling_gemm_impl
User->>fp8_block_scaling_gemm: Call with tensors (a, b, a_scale, b_scale)
fp8_block_scaling_gemm->>AutoTuner: Select best tactic based on<br/>constraint specs & input shape
AutoTuner->>Fp8BlockScalingGemmRunner: Get valid tactics for SM version
Fp8BlockScalingGemmRunner-->>AutoTuner: Return tactic list
AutoTuner-->>fp8_block_scaling_gemm: Return selected tactic
fp8_block_scaling_gemm->>fp8_block_scaling_gemm_impl: Call with selected tactic
fp8_block_scaling_gemm_impl-->>fp8_block_scaling_gemm: Return result tensor
fp8_block_scaling_gemm-->>User: Return fused GEMM output
sequenceDiagram
participant PyTorchModelEngine
participant _general_warmup
participant _capture_generation_cuda_graphs
participant _create_warmup_request as _create_warmup_request<br/>(num_gen_requests, least_requests)
participant Forward
PyTorchModelEngine->>_general_warmup: Start warmup (resource_manager, reverse)
_general_warmup->>_general_warmup: Compute dynamic max_batch_size
loop For each warmup config (sorted)
_general_warmup->>_capture_generation_cuda_graphs: Process batch size/draft length
alt Batch allocation success
_capture_generation_cuda_graphs->>_create_warmup_request: Create warmup request<br/>(num_gen_requests, least_requests)
_create_warmup_request->>_create_warmup_request: Enforce per-batch constraints<br/>Compute num_ctx_tokens, num_full_seqs
_create_warmup_request-->>_capture_generation_cuda_graphs: Return ScheduledRequests
_capture_generation_cuda_graphs->>Forward: Execute forward pass
Forward-->>_capture_generation_cuda_graphs: Allocate memory blocks
else Batch allocation fails
_capture_generation_cuda_graphs->>_capture_generation_cuda_graphs: Continue to next config
end
end
_general_warmup-->>PyTorchModelEngine: Warmup complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings, 1 inconclusive)
✨ Finishing touches
🧪 Generate unit tests (beta)
Tip: 📝 Customizable high-level summaries are now available in beta! You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
850-902: Critical bug: `num_ctx_requests` is never computed.

The variable `num_ctx_requests` is initialized to `0` at line 851 and is never updated, but it's used in:

- Line 877: Batch size check `if num_ctx_requests + num_gen_requests > self.batch_size`
- Line 894: Generating request IDs `list(range(num_ctx_requests))`
- Line 902: Adding spec resource manager requests
- Lines 906-918: Range calculations for generation requests

This results in:

- The batch size check being effectively `if num_gen_requests > self.batch_size`
- Context requests getting an empty ID range `range(0)`, producing no dummy requests even when `ctx_token_nums` is non-empty

The fix should compute `num_ctx_requests` after determining `num_full_seqs` and `num_left_over_tokens`:

```diff
 if least_requests:
     num_full_seqs = num_ctx_tokens // max_seq_len
     num_left_over_tokens = num_ctx_tokens - num_full_seqs * max_seq_len
 else:
     max_bs = min(num_ctx_tokens, max_context_requests)
     if num_ctx_tokens % max_bs == 0:
         num_full_seqs = max_bs
     else:
         num_full_seqs = max_bs - 1
     max_seq_len = num_ctx_tokens // num_full_seqs
     num_left_over_tokens = num_ctx_tokens - max_seq_len * num_full_seqs

+num_ctx_requests = num_full_seqs + (1 if num_left_over_tokens > 0 else 0)
+
 if num_ctx_requests + num_gen_requests > self.batch_size:
     return None  # Not enough batch size to fill the request
```

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
793-803: `w4a8_mxfp4_fp8_gemm` fake implementation uses undefined `act_fp8`

In the fake implementation of `w4a8_mxfp4_fp8_gemm`, the parameter is named `act_fp4` but the body calls `act_fp8.new_empty(...)`. This will raise a `NameError` in meta/fake runs (e.g., `torch.compile`) and prevent graph construction.

You likely intended to use the actual input tensor argument:

```diff
 @w4a8_mxfp4_fp8_gemm.register_fake
 def _(
-    act_fp4: torch.Tensor,
+    act_fp4: torch.Tensor,
     weight: torch.Tensor,
     act_sf: torch.Tensor,
     weight_scale: torch.Tensor,
     alpha: torch.Tensor,
     output_dtype: torch.dtype,
     to_userbuffers: bool = False,
 ) -> torch.Tensor:
-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
-                             dtype=output_dtype)
+    return act_fp4.new_empty((act_fp4.size(0), weight.size(0)),
+                             dtype=output_dtype)
```
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
602-610: Variable naming mismatch with parameter name.

The loop variable `num_gen_tokens` is passed to `_create_warmup_request`, which expects `num_gen_requests`. While semantically correct (for generation batches each request contributes `1 + runtime_draft_len` tokens), the naming is misleading and inconsistent with the function signature.

Consider renaming for clarity:

```diff
-        for num_tokens, num_gen_tokens in warmup_requests_configs:
+        for num_tokens, num_gen_requests in warmup_requests_configs:
             with self._release_batch_context(
                     self._create_warmup_request(resource_manager, num_tokens,
-                                                num_gen_tokens),
+                                                num_gen_requests),
                     resource_manager) as batch:
```

tensorrt_llm/_torch/pyexecutor/_util.py (1)
281-284: Unused variable `total_mem`.

The `total_mem` variable from `torch.cuda.mem_get_info()` is unpacked but never used. Per static analysis hint.

```diff
- free_mem, total_mem = torch.cuda.mem_get_info()
+ free_mem, _ = torch.cuda.mem_get_info()
```

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)
1086-1109: Fp8BlockScalingGemmRunner correctly wraps the impl op; unused-arg lints are cosmetic

The runner's `tuning_config` reuses `deep_gemm_gen_tuning_buckets`, `get_valid_tactics` returning `[0]` matches the fact there is only one underlying kernel configuration, and `forward` delegating straight to `torch.ops.trtllm.fp8_block_scaling_gemm_impl` is exactly what you want to both trigger JIT and route execution through the C++ binding.

Ruff's ARG002 hints about the unused `inputs`, `profile` (in `get_valid_tactics`) and `tactic` (in `forward`) are purely stylistic; if you care about a clean lint run, you can safely rename them to `_inputs`, `_profile`, and `_tactic` without changing behavior.
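As an illustration of the runner shape being discussed, here is a stand-alone mock with the same `get_valid_tactics`/`forward` surface; it is not TensorRT-LLM's real `TunableRunner`/`AutoTuner` API and uses plain BF16 math instead of the FP8 impl op:

```python
from typing import Any, List
import torch

class DemoGemmRunner:
    def get_valid_tactics(self, inputs: List[torch.Tensor], profile: Any) -> List[int]:
        # A single kernel configuration -> a single tactic.
        return [0]

    def forward(self, inputs: List[torch.Tensor], tactic: int = 0) -> torch.Tensor:
        a, b = inputs[0], inputs[1]
        # Reference math only; the real runner delegates to
        # torch.ops.trtllm.fp8_block_scaling_gemm_impl(a, b, a_scale, b_scale).
        return (a.float() @ b.float().t()).to(torch.bfloat16)

runner = DemoGemmRunner()
out = runner.forward([torch.randn(4, 8), torch.randn(16, 8)])
print(out.shape)  # torch.Size([4, 16])
```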
1111-1122: ConstraintSpec logic matches fp8_quantize_1x128 behavior; consider renaming helper for clarity

The SM-dependent `ConstraintSpec` setup mirrors the shapes produced by `fp8_quantize_1x128`:

- For `get_sm_version() >= 100`, constraining dim-1 of the third tensor to `inputs[0][0]` enforces `a_scale.shape[1] == a.shape[0]` (M).
- For lower SMs, the `fp8_quantize_1x128_sm90_constrant` helper reproduces the 1-D scale shape formula used in the fake fp8 quantize op.

Functionally this looks correct. Minor readability nit: `fp8_quantize_1x128_sm90_constrant` seems to have a typo in "constrant"; renaming it to "…_constraint" would make its purpose clearer.
1124-1148: Auto-tuned `fp8_block_scaling_gemm` public op is wired correctly; small fake-impl cleanup opportunity

The new public custom op:

- Uses `AutoTuner` with `Fp8BlockScalingGemmRunner` and the shared deep-GEMM buckets.
- Updates `tune_max_num_tokens` and SM-specific `constraint_specs` before tuning.
- Invokes the runner, which in turn calls `torch.ops.trtllm.fp8_block_scaling_gemm_impl`, cleanly separating the public op from the impl op.

The fake implementation returns a BF16 tensor of shape `(a.shape[0], b.shape[0])`, matching the real C++ implementation's output shape and dtype, so it should behave well under `torch.compile`.

Ruff's ARG001 hints here are just about unused parameters (`a_scale`, `b_scale`, and `tune_max_num_tokens`) in the fake path. If you want to silence them, you can rename those to `_a_scale`, `_b_scale`, and `_tune_max_num_tokens` (or explicitly `del` them), without affecting behavior.

Also applies to: 1151-1155
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (2 hunks)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
tensorrt_llm/_torch/pyexecutor/_util.py
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`if` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
🧠 Learnings (17)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧬 Code graph analysis (2)
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
fp8_block_scaling_gemm(1125-1148)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
batch_size(37-38), ScheduledRequests(20-41)

tensorrt_llm/_torch/attention_backend/sparse/rocket.py (1)
add_dummy_requests(923-950)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/pyexecutor/_util.py
281-281: Unpacked variable total_mem is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
1096-1096: Unused method argument: inputs
(ARG002)
1097-1097: Unused method argument: profile
(ARG002)
1104-1104: Unused method argument: tactic
(ARG002)
1152-1152: Unused function argument: a_scale
(ARG001)
1152-1152: Unused function argument: b_scale
(ARG001)
1152-1152: Unused function argument: tune_max_num_tokens
(ARG001)
🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
617-626: LGTM!Clean refactor that delegates warmup logic to the new generalized
_general_warmupmethod while preserving theno_cuda_graph()context wrapper.
720-722: Good improvement to CUDA graph capture resilience.Changing from
returntocontinueallows the capture process to proceed with other batch sizes even when a specific configuration cannot be allocated. This is more robust as smaller batch sizes may still fit in available memory.
772-791: LGTM!The post-capture warmup loop addresses memory fragmentation by pre-allocating blocks for "most requests" patterns. The comment clearly explains the rationale for this additional warmup phase.
tensorrt_llm/_torch/pyexecutor/_util.py (2)
268-279: LGTM! The CUDA graph warmup token constraints ensure sufficient KV cache blocks are allocated for successful CUDA graph capture. The formula accounts for one maximum-length sequence plus additional blocks for the remaining batch entries, and the minimum batch size constraint ensures basic operation is always possible.
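A back-of-the-envelope sketch of the estimate described above, with every name hypothetical and only the shape of the formula (one maximum-length sequence, extra blocks for the remaining batch entries, never fewer than `batch_size` blocks) taken from the comment:

```python
import math


def estimate_warmup_cache_blocks(max_seq_len: int,
                                 tokens_per_block: int,
                                 batch_size: int,
                                 extra_blocks_per_request: int = 1) -> int:
    # Blocks needed to hold one request at the maximum sequence length.
    blocks_for_longest = math.ceil(max_seq_len / tokens_per_block)
    # Remaining warmup requests only need a small amount of KV space during capture.
    blocks_for_rest = (batch_size - 1) * extra_blocks_per_request
    # Never go below one block per request in the largest captured batch.
    return max(batch_size, blocks_for_longest + blocks_for_rest)


print(estimate_warmup_cache_blocks(max_seq_len=4096, tokens_per_block=32, batch_size=128))
```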
287-290: LGTM! The memory-aware capping ensures the KV cache token estimate doesn't exceed available GPU memory. The block alignment and final `min()` operation provide a conservative, safe upper bound.

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
201-205: Fake fp8_block_scaling_gemm_impl wiring and shape/dtype look correct. Renaming the fake registration to `trtllm::fp8_block_scaling_gemm_impl` matches the C++ Torch binding and the new Python runner call site, and the fake output (m = a.shape[0], n = b.shape[0]) with BF16 dtype is consistent with the real implementation in cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp.
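For context, a fake (meta) kernel only describes output shape and dtype so tracing and `torch.compile` can run without the real CUDA kernel. Below is a self-contained sketch registered under a scratch `sketch::` namespace; the op name, the (m, n) output shape, and the BF16 dtype follow the comment above, while the argument list is an assumption and the Python shown is not the file's actual code:

```python
import torch


# Stand-in op so the sketch is runnable without the TensorRT-LLM extension.
@torch.library.custom_op("sketch::fp8_block_scaling_gemm_impl", mutates_args=())
def fp8_block_scaling_gemm_impl(a: torch.Tensor, b: torch.Tensor,
                                a_scale: torch.Tensor,
                                b_scale: torch.Tensor) -> torch.Tensor:
    raise NotImplementedError("the real kernel lives in C++ in the actual repo")


@fp8_block_scaling_gemm_impl.register_fake
def _(a, b, a_scale, b_scale):
    # The fake kernel only reports the output metadata: a (m, n) BF16 tensor.
    m, n = a.shape[0], b.shape[0]
    return a.new_empty((m, n), dtype=torch.bfloat16)
```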
997-1001: Shared deep‑GEMM tuning bucket helper and reuse in fp8SwapABGemmRunner look fine. `deep_gemm_gen_tuning_buckets` produces a dense set of M buckets in steps of 8 (and then 128) and is now reused by `fp8SwapABGemmRunner.tuning_config`, which keeps the tuning behavior uniform for these FP8 deep‑GEMM paths. No functional issues stand out here.

Also applies to: 1004-1009
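A hypothetical sketch of a bucket generator with that kind of output: the step sizes (8, then 128) come from the comment above, while the crossover point, the cap, and the function name suffix are illustrative assumptions rather than the actual helper:

```python
def deep_gemm_gen_tuning_buckets_sketch(max_m: int,
                                        fine_step: int = 8,
                                        fine_limit: int = 1024,
                                        coarse_step: int = 128):
    """Dense M buckets: every `fine_step` up to `fine_limit`, then every `coarse_step`."""
    buckets = list(range(fine_step, min(fine_limit, max_m) + 1, fine_step))
    buckets += list(range(fine_limit + coarse_step, max_m + 1, coarse_step))
    return tuple(buckets)


# Example: tuning buckets for up to 2048 tokens.
buckets = deep_gemm_gen_tuning_buckets_sketch(2048)
print(buckets[:5], "...", buckets[-3:])
```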
cpp/tensorrt_llm/thop/fp8BlockScalingGemm.cpp (1)
387-397: Torch binding rename to `fp8_block_scaling_gemm_impl` is correctly integrated; no broken callers. Verification confirms the rename is safe. The Python wrapper `trtllm::fp8_block_scaling_gemm` at torch_custom_ops.py:1124 correctly routes through `Fp8BlockScalingGemmRunner`, which invokes the new `torch.ops.trtllm.fp8_block_scaling_gemm_impl` binding at line 1107. All production callers (linear.py, tests) use the Python wrapper, not the C++ binding directly, so they remain unaffected. The `_impl` suffix properly marks it as an implementation detail.
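To make the layering explicit, here is a hypothetical sketch of the call path the comment describes (public wrapper → tunable runner → `_impl` binding). Only the routing and the `_impl` naming convention are taken from the comment; the class bodies and argument names are assumptions, and the `torch.ops.trtllm` call requires the TensorRT-LLM torch extension to be loaded:

```python
import torch


class Fp8BlockScalingGemmRunnerSketch:
    """Illustrative stand-in for the tunable runner; the real one picks a tuned tactic."""

    def __call__(self, a, b, a_scale, b_scale):
        # Only the runner touches the low-level `_impl` binding.
        return torch.ops.trtllm.fp8_block_scaling_gemm_impl(a, b, a_scale, b_scale)


def fp8_block_scaling_gemm_sketch(a, b, a_scale, b_scale):
    # Public-facing wrapper used by callers such as linear layers and tests;
    # they never reference the `_impl` op directly.
    return Fp8BlockScalingGemmRunnerSketch()(a, b, a_scale, b_scale)
```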
|
PR_Github #25732 [ run ] completed with state |
fca3543 to
ef0d44c
Compare
|
/bot run |
|
PR_Github #26403 [ run ] triggered by Bot. Commit: |
|
PR_Github #26403 [ run ] completed with state |
9f8c9e1 to
2e1df1f
Compare
|
/bot run |
|
PR_Github #26553 [ run ] triggered by Bot. Commit: |
|
PR_Github #26553 [ run ] completed with state |
|
/bot run |
|
PR_Github #26606 [ run ] triggered by Bot. Commit: |
|
PR_Github #26606 [ run ] completed with state |
|
/bot run --disable-fail-fast |
658f058 to
e12a16d
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #29538 [ run ] triggered by Bot. Commit: |
|
PR_Github #29538 [ run ] completed with state |
e12a16d to
d24bb34
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #29711 [ run ] triggered by Bot. Commit: |
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
NVIDIA#8515) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
d24bb34 to
ad7c561
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #29766 [ run ] triggered by Bot. Commit: |
|
PR_Github #29766 [ run ] completed with state
|
|
/bot run |
|
PR_Github #29876 [ run ] triggered by Bot. Commit: |
|
PR_Github #29876 [ run ] completed with state
|
|
/bot run |
|
PR_Github #29959 [ run ] triggered by Bot. Commit: |
|
PR_Github #29959 [ run ] completed with state |
…9446) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
…NVIDIA#9446 (NVIDIA#10334) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
Also applies a different fix, with the same effect as #8780, for issues introduced in #7999.
Summary by CodeRabbit
New Features
Improvements
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"`.
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with the pull request.
skip
`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.