[https://nvbugs/5322131][feat] Multi-LoRA serving with CUDA Graph #8279
Conversation
📝 Walkthrough
Adds CUDA Graph-based multi-LoRA execution: new grouped GEMM kernels and a fused param-fill/reorder kernel; Torch bindings and THOP entry points; Python-side CUDA Graph LoRA params, slot and manager classes; engine integration with optional CUDA Graph path; PEFT cache/device-lookup extensions; tests and NVTX profiling hooks. Also adjusts attention and miscellaneous bindings.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Engine as PyTorchModelEngine
participant LoraMgr as CudaGraphLoraManager
participant SlotMgr as AdapterSlotManager
participant Peft as PeftCacheManager
participant Params as CudaGraphLoraParams
participant THOP as thop lora ops
participant Kern as CUDA Graph GEMM Kernels
Engine->>LoraMgr: prepare_cuda_graph_lora_params(scheduled_requests, attn_metadata, peft_cache_manager)
LoraMgr->>Peft: get_and_reset_batch_peft_table()
LoraMgr->>SlotMgr: update_slots(requests, peft_cache_manager)
SlotMgr-->>LoraMgr: batch_slot_ids, slots_changed
LoraMgr->>Params: update_sorted_indices(batch_slot_ids)
alt slots_changed
LoraMgr->>Params: update_weight_pointers(peft_table)
LoraMgr->>SlotMgr: reset_changed_flag()
end
LoraMgr->>Params: update_slots_params(batch_slot_ids)
LoraMgr-->>Engine: {cuda_graph_params, use_cuda_graph_mode, ...}
Engine->>THOP: lora_group_gemm_param_fill_row_reorder_fusion(...)
THOP->>Kern: launchLoraGroupGEMMParamFillRowReorderFusion(...)
Engine->>THOP: lora_grouped_gemm_cuda_graph(... in/out ...)
THOP->>Kern: cuda_graph_grouped_gemm / splitk_grouped_gemm(...)
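To make the flow above concrete, here is a minimal Python sketch of the manager-side preparation step. It uses only the class and method names that appear in the diagram; the attribute names (slot_manager, params), return types, and dictionary keys are illustrative assumptions, not the actual implementation.

```python
# Sketch only: mirrors the sequence diagram above. Method names come from the
# diagram; everything else (attributes, return shapes) is assumed.
def prepare_cuda_graph_lora_params(self, scheduled_requests, attn_metadata,
                                   peft_cache_manager):
    # Pull the per-batch PEFT table collected by the cache manager.
    peft_table = peft_cache_manager.get_and_reset_batch_peft_table()

    # Map each request's LoRA task to an adapter slot.
    batch_slot_ids, slots_changed = self.slot_manager.update_slots(
        scheduled_requests, peft_cache_manager)

    # Keep row-reorder indices in sync with the current slot assignment.
    self.params.update_sorted_indices(batch_slot_ids)

    # Refresh weight pointers only when the slot-to-adapter mapping changed.
    if slots_changed:
        self.params.update_weight_pointers(peft_table)
        self.slot_manager.reset_changed_flag()

    self.params.update_slots_params(batch_slot_ids)
    return {"cuda_graph_params": self.params, "use_cuda_graph_mode": True}
```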
sequenceDiagram
autonumber
participant Layer as LoraLayer.forward
participant CG as CUDA-Graph path
participant Legacy as Legacy path
Layer->>Layer: decide mode (cuda_graph_enabled && params available)
alt CUDA Graph mode
Layer->>CG: _forward_cuda_graph_mode(...)
CG->>CG: prepare_grouped_gemm_buffers / fused prep
CG-->>Layer: output tensor or None
else Legacy mode
Layer->>Legacy: _forward_legacy_mode(...)
Legacy-->>Layer: output tensor or None
end
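The second diagram reduces to a small dispatch inside LoraLayer.forward. A hedged sketch of that decision, reusing the method names from the diagram (the guard condition and argument names are assumptions):

```python
# Sketch only: dispatch implied by the diagram above.
def forward(self, x, lora_params):
    if self.cuda_graph_enabled and lora_params is not None:
        # CUDA Graph path: fill grouped-GEMM parameters into pre-allocated,
        # graph-captured buffers and run the fused kernels.
        return self._forward_cuda_graph_mode(x, lora_params)
    # Legacy path: per-step parameter preparation outside any captured graph.
    return self._forward_legacy_mode(x, lora_params)
```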
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 18
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
362-372: Initialize CUDA Graph LoRA manager when LoRA is configured
_set_lora_model_config does not call _init_cuda_graph_lora_manager, so graph mode will never be used. Call it after setting lora_model_config when cuda_graph_runner is enabled.
  def set_lora_model_config(self,
                            lora_target_modules: list[str],
                            trtllm_modules_to_hf_modules: dict[str, str],
                            swap_gate_up_proj_lora_b_weight: bool = True):
      self.lora_model_config = LoraModelConfig(
          lora_target_modules=lora_target_modules,
          trtllm_modules_to_hf_modules=trtllm_modules_to_hf_modules,
          hidden_size=self.model.config.hidden_size,
          dtype=torch_dtype_to_str(self.model.config.torch_dtype),
          swap_gate_up_proj_lora_b_weight=swap_gate_up_proj_lora_b_weight)
+     # Initialize CUDA Graph LoRA manager if possible
+     lora_config = getattr(self.pytorch_backend_config, "lora_config", None)
+     if lora_config is not None:
+         self._init_cuda_graph_lora_manager(lora_config)
Also applies to: 373-392
tests/unittest/llmapi/test_llm_pytorch.py (2)
883-895: Remove duplicate class definition.
The TestLlmError class is defined twice (lines 883-895 and 1013-1024) with identical test_max_num_token_check methods. The second definition will shadow the first. Remove one of the duplicate class definitions.
Also applies to: 1013-1024
1-1: Add NVIDIA copyright header.
The coding guidelines require prepending the NVIDIA Apache-2.0 copyright header with the current year to the top of all source files.
As per coding guidelines.
🧹 Nitpick comments (13)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (1)
1-75: LGTM! Header declaration is well-documented and follows conventions.
The new kernel launcher header:
- Comprehensive documentation with @brief and detailed parameter descriptions
- Proper copyright header with current year (2025)
- Correct namespace usage (tensorrt_llm::kernels)
- Function signature matches implementation (verified against .cu file snippet)
- Appropriate use of #pragma once for include guard
The function consolidates multiple operations (param fill, row reorder, zero fill) into a single CUDA graph-compatible kernel, which is a good design for performance.
Optional suggestion: Consider using traditional include guards instead of #pragma once to align with the coding guideline that specifies the format TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H. However, #pragma once is widely supported and more concise, so this is a minor stylistic preference.
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (1)
28-59: Prefer Doxygen-style API docs for public headers.
Convert to //! and document all parameters (e.g., lda/ldb/ldc/ldd and host vs device pointers).
As per coding guidelines
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
1238-1243: Fix return type to Optional (can be None).
Method can return None after reset; align annotation to avoid confusion.
-    def get_and_reset_batch_peft_table(
-            self) -> Dict[int, list[TaskLayerModuleConfig]]:
+    def get_and_reset_batch_peft_table(
+            self) -> Optional[Dict[int, List[TaskLayerModuleConfig]]]:
As per coding guidelines
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (1)
151-151: Remove no-op statement.
len(request_list) has no effect.
-    len(request_list)
+    # no-op
tensorrt_llm/_torch/peft/lora/layer.py (5)
89-98: Avoid mutable default args; silence unused variable
Use None defaults and initialize inside. Prefix unused local to satisfy linters.
-def compare_grouped_gemm_params(
-    params: GroupedGemmParamsOutput,
-    ref: GroupedGemmParamsOutput,
-    params_input: GroupedGemmParamsInput,
-    params_to_store_msg: List[str] | None = ['splitk_offsets'],
-    params_exclude_msg: List[str] | None = None,
-):
+def compare_grouped_gemm_params(
+    params: GroupedGemmParamsOutput,
+    ref: GroupedGemmParamsOutput,
+    params_input: GroupedGemmParamsInput,
+    params_to_store_msg: Optional[List[str]] = None,
+    params_exclude_msg: Optional[List[str]] = None,
+):
@@
-    bs, input_hidden_size = params.reordered_input.shape
+    bs, _input_hidden_size = params.reordered_input.shape
@@
-    if not params_to_store_msg:
-        params_to_store_msg = set(params_dict.keys())
+    if not params_to_store_msg:
+        params_to_store_msg = set(params_dict.keys())
Based on learnings
114-123: Type-safe equality for integer tensors in debug compare
torch.allclose is for floating dtypes. Use torch.equal for integral tensors to avoid runtime errors when debug is enabled.
-    if name not in ("reordered_input", "a_offset"):
-        asserter.add(
-            v.allclose(ref_v),
-            get_msg(name, v, ref_v),
-        )
+    if name not in ("reordered_input", "a_offset"):
+        if v.dtype.is_floating_point:
+            ok = torch.allclose(v, ref_v)
+        else:
+            ok = torch.equal(v, ref_v)
+        asserter.add(ok, get_msg(name, v, ref_v))
290-291: Tuple construction nit: use splat for shape_3d
Cleaner and avoids creating an intermediate tuple.
-    shape_3d = shape_2d + (3, )
+    shape_3d = (*shape_2d, 3)
Apply in all three sites.
Also applies to: 413-414, 482-483
593-597: Remove f-prefix from constant strings
Minor lint (F541). Drop the f where no placeholders exist.
-    print(
-        f'--------------------------------layer key: {layer_key}--------------------------------'
-    )
-    print(f'cuda graph params values:')
+    print(
+        f'--------------------------------layer key: {layer_key}--------------------------------'
+    )
+    print('cuda graph params values:')
@@
-    print(f'buffers values:')
+    print('buffers values:')
@@
-    print(f'calculated buffers')
+    print('calculated buffers')
Also applies to: 610-618
44-52: Gate or remove debug flags before merge
These globals alter runtime paths and include heavy printing/assert scaffolding. Consider gating via env vars or removing before release.
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (3)
144-146: Silence unused args in _create_layer_params
Rename to underscore to satisfy linters and reflect non-use.
-    def _create_layer_params(
-            self, key: LoraLayerKey, layer_module_num: int,
-            module_output_sizes: torch.Tensor) -> LoraLayerParams:
+    def _create_layer_params(
+            self, _key: LoraLayerKey, layer_module_num: int,
+            _module_output_sizes: torch.Tensor) -> LoraLayerParams:
183-187: Silence unused local in get_sorted_indices
Prefix with underscore.
-    sorted_slot_ids, sorted_indices = torch.sort(slot_ids, stable=True)
+    _sorted_slot_ids, sorted_indices = torch.sort(slot_ids, stable=True)
70-79: Avoid print in library init
Prefer logger or remove to keep init silent.
-    print(
-        f'cuda graph lora params init max batch size: {max_batch_size}, max lora size: {max_lora_size}, max rank: {max_rank}'
-    )
+    # Consider using logger.debug for initialization info.
tests/unittest/llmapi/test_llm_pytorch.py (1)
270-270: Avoid global CUDA device setting in tests. Replace torch.cuda.set_device(0) with a scoped approach (e.g., a pytest fixture or with torch.cuda.device(0):) to isolate device selection and prevent interference when tests run in parallel.
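As an illustration of the scoped approach suggested above, a minimal pytest fixture could look like the sketch below; the fixture and test names are hypothetical, not part of this PR.

```python
import pytest
import torch


@pytest.fixture
def cuda_device_0():
    # Scope device selection to the test body instead of mutating global state
    # with torch.cuda.set_device(0).
    with torch.cuda.device(0):
        yield torch.device("cuda", 0)


def test_runs_on_device_0(cuda_device_0):  # hypothetical test
    x = torch.ones(4, device=cuda_device_0)
    assert x.device.index == 0
```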
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (3 hunks)
- cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (1 hunks)
- cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (1 hunks)
- cpp/tensorrt_llm/kernels/lora/lora.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (1 hunks)
- cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
- cpp/tensorrt_llm/thop/loraOp.cpp (3 hunks)
- tensorrt_llm/_torch/modules/attention.py (1 hunks)
- tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (1 hunks)
- tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (1 hunks)
- tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (1 hunks)
- tensorrt_llm/_torch/peft/lora/layer.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/guided_decoder.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (21 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (5 hunks)
- tensorrt_llm/_utils.py (1 hunks)
- tensorrt_llm/executor/worker.py (2 hunks)
- tests/unittest/llmapi/lora_test_utils.py (2 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (20 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/kernels/lora/lora.cpp, cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/thop/loraOp.cpp, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/kernels/lora/lora.cpp, cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/thop/loraOp.cpp, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/kernels/lora/lora.cpp, tensorrt_llm/_torch/pyexecutor/py_executor.py, cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py, tensorrt_llm/_torch/pyexecutor/guided_decoder.py, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu, tests/unittest/llmapi/lora_test_utils.py, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/thop/loraOp.cpp, tensorrt_llm/_torch/pyexecutor/_util.py, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu, tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py, tensorrt_llm/executor/worker.py, tensorrt_llm/_utils.py, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, tensorrt_llm/_torch/peft/lora/layer.py, tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/modules/attention.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, tests/unittest/llmapi/test_llm_pytorch.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/kernels/lora/lora.cpp, cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/thop/loraOp.cpp, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/kernels/lora/lora.cpp, tensorrt_llm/_torch/pyexecutor/py_executor.py, cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py, tensorrt_llm/_torch/pyexecutor/guided_decoder.py, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu, tests/unittest/llmapi/lora_test_utils.py, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/thop/loraOp.cpp, tensorrt_llm/_torch/pyexecutor/_util.py, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu, tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py, tensorrt_llm/executor/worker.py, tensorrt_llm/_utils.py, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, tensorrt_llm/_torch/peft/lora/layer.py, tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/modules/attention.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, tests/unittest/llmapi/test_llm_pytorch.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py, tensorrt_llm/_torch/pyexecutor/guided_decoder.py, tests/unittest/llmapi/lora_test_utils.py, tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py, tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py, tensorrt_llm/executor/worker.py, tensorrt_llm/_utils.py, tensorrt_llm/_torch/peft/lora/layer.py, tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/modules/attention.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, tests/unittest/llmapi/test_llm_pytorch.py
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
🧠 Learnings (2)
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/thop/loraOp.cpp
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/pyexecutor/_util.py
🧬 Code graph analysis (18)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
- tensorrt_llm/_utils.py (1): nvtx_pytorch_emit (54-66)
- tensorrt_llm/_torch/pyexecutor/seq_slot_manager.py (1): SeqSlotManager (6-32)
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1)
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2): isTaskCachedDevice (487-490), isTaskCachedDevice (487-487)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (2)
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1): PeftCacheManager (231-255)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2): PeftCacheManager (1135-1245), is_task_cached_device (1244-1245)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2)
- tensorrt_llm/_torch/peft/lora/layer.py (1): slot_offsets (85-86)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h (1): ceilDiv (41-44)
tests/unittest/llmapi/lora_test_utils.py (3)
- tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (4): CudaGraphLoraParams (30-370), get_slot_counts (313-324), get_offset_from_counts (286-310), get_sorted_indices (175-187)
- tensorrt_llm/_torch/peft/lora/layer.py (5): GroupedGemmParamsInput (71-86), LoraLayer (244-1110), compare_grouped_gemm_params (89-163), _prepare_grouped_gemm_buffers_fused (407-474), prepare_grouped_gemm_buffers (285-405)
- tensorrt_llm/llmapi/llm_args.py (1): CudaGraphConfig (109-166)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (2)
- cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (2): tensorrt_llm (23-62), kernels (25-61)
- cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2): launchLoraGroupGEMMParamFillRowReorderFusion (354-412), launchLoraGroupGEMMParamFillRowReorderFusion (354-361)
cpp/tensorrt_llm/thop/loraOp.cpp (2)
- cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2): launchLoraGroupGEMMParamFillRowReorderFusion (354-412), launchLoraGroupGEMMParamFillRowReorderFusion (354-361)
- cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (4): cuda_graph_splitk_grouped_gemm (320-379), cuda_graph_splitk_grouped_gemm (320-323), cuda_graph_grouped_gemm (143-202), cuda_graph_grouped_gemm (143-146)
tensorrt_llm/_torch/pyexecutor/_util.py (2)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1): _init_cuda_graph_lora_manager (373-391)
- tensorrt_llm/_torch/models/modeling_phi4mm.py (1): lora_config (1082-1100)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (2)
- cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (2): tensorrt_llm (23-75), kernels (25-74)
- cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (4): cuda_graph_grouped_gemm (143-202), cuda_graph_grouped_gemm (143-146), cuda_graph_splitk_grouped_gemm (320-379), cuda_graph_splitk_grouped_gemm (320-323)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (6)
- tensorrt_llm/lora_manager.py (2): LoraManager (639-1242), LoraModelConfig (241-246)
- tensorrt_llm/_torch/attention_backend/interface.py (2): AttentionMetadata (40-336), num_seqs (245-249)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2): PeftCacheManager (1135-1245), get_and_reset_batch_peft_table (1238-1242)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (1): ScheduledRequests (18-39)
- tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (5): AdapterSlotManager (15-137), update_slots (92-120), get_slot_to_task_mapping (122-129), has_slots_changed (131-133), reset_changed_flag (135-137)
- tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (5): CudaGraphLoraParams (30-370), LoraLayerInfo (41-51), update_sorted_indices (189-216), update_weight_pointers (218-283), update_slots_params (326-341)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (1)
- tensorrt_llm/_torch/peft/lora/layer.py (1): slot_offsets (85-86)
tensorrt_llm/executor/worker.py (1)
- tensorrt_llm/_utils.py (3): mpi_comm (504-505), mpi_rank (538-545), nvtx_pytorch_emit (54-66)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2): isTaskCachedDevice (487-490), isTaskCachedDevice (487-487)
tensorrt_llm/_torch/peft/lora/layer.py (2)
- tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (4): CudaGraphLoraParams (30-370), get_offset_from_counts (286-310), get_layer_params (359-370), get_problem_count (343-357)
- cpp/tensorrt_llm/thop/loraOp.cpp (4): lora_grouped_gemm_cuda_graph (178-285), lora_grouped_gemm_cuda_graph (178-190), lora_grouped_gemm (54-176), lora_grouped_gemm (54-58)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
- tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (2): CudaGraphLoraManager (23-191), prepare_cuda_graph_lora_params (128-191)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (5): PeftCacheManager (1135-1245), ResourceManager (1097-1132), ResourceManagerType (49-54), get_and_reset_batch_peft_table (1238-1242), get_resource_manager (1109-1110)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2): batch_size (35-36), ScheduledRequests (18-39)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3): attn_metadata (121-122), needs_capture (193-195), replay (268-302)
- tensorrt_llm/_torch/attention_backend/interface.py (1): AttentionMetadata (40-336)
- tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (1): remove_evicted_slots_in_cpp (83-90)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1): tensorrt_llm (39-275)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2): tensorrt_llm (49-52), tensorrt_llm (54-168)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
- cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2): isTaskCachedDevice (487-490), isTaskCachedDevice (487-487)
tests/unittest/llmapi/test_llm_pytorch.py (1)
- tests/unittest/llmapi/lora_test_utils.py (5): create_mock_nemo_lora_checkpoint (132-242), compare_cuda_graph_lora_params_filler (360-378), CUDAGraphLoRATestParams (246-284), module_count (271-272), slot_count (275-276)
🪛 Ruff (0.13.3)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py
118-118: Unpacked variable evicted_task is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tests/unittest/llmapi/lora_test_utils.py
288-297: Do not perform function call CUDAGraphLoRATestParams in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py
144-144: Unused method argument: key
(ARG002)
145-145: Unused method argument: module_output_sizes
(ARG002)
183-183: Unpacked variable sorted_slot_ids is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tensorrt_llm/_torch/peft/lora/layer.py
93-93: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
98-98: Unpacked variable input_hidden_size is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
290-290: Consider (*shape_2d, 3) instead of concatenation
Replace with (*shape_2d, 3)
(RUF005)
413-413: Consider (*shape_2d, 3) instead of concatenation
Replace with (*shape_2d, 3)
(RUF005)
482-482: Consider (*shape_2d, 3) instead of concatenation
Replace with (*shape_2d, 3)
(RUF005)
593-593: f-string without any placeholders
Remove extraneous f prefix
(F541)
610-610: f-string without any placeholders
Remove extraneous f prefix
(F541)
612-612: Undefined name reordered_input
(F821)
618-618: f-string without any placeholders
Remove extraneous f prefix
(F541)
637-637: Undefined name ldb
(F821)
637-637: Undefined name ldb
(F821)
642-642: Undefined name ldd
(F821)
642-642: Undefined name ldd
(F821)
819-819: Undefined name reordered_input
(F821)
820-820: Undefined name in_sizes
(F821)
821-821: Undefined name out_sizes
(F821)
822-822: Undefined name a_offset
(F821)
826-826: Undefined name d_offset
(F821)
830-830: Undefined name d_prime_offset
(F821)
832-832: Undefined name reordered_input
(F821)
840-840: Undefined name lda
(F821)
841-841: Undefined name ldb
(F821)
842-842: Undefined name ldd
(F821)
843-843: Undefined name ldb_prime
(F821)
844-844: Undefined name ldd_prime
(F821)
847-847: Undefined name splitk_offsets
(F821)
969-969: Undefined name splitk_offsets
(F821)
1103-1106: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (14)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)
487-490: LGTM! Device cache lookup follows existing patterns.
The isTaskCachedDevice implementation correctly delegates to mDeviceLoraCache->has(taskId), mirroring the pattern used by isTaskCached for host cache and isTaskDoneDevice for device status.
465-466: Verify ensureBatch callers expect taskId-based keys
ensureBatch is only invoked in the kvCacheManager nanobind/pybind wrappers, which forward the result to Python. Since peftTable now uses taskId (not requestId) as its key, confirm all downstream Python consumers of ensure_batch have been updated to handle taskId-based mappings.
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1)
118-118: LGTM! API declaration is consistent with existing patterns.
The isTaskCachedDevice method declaration:
- Properly uses the [[nodiscard]] attribute
- Correctly marked const
- Mirrors the existing isTaskDoneDevice pattern
- Follows naming conventions
tensorrt_llm/executor/worker.py (1)
14-14: LGTM! Profiling instrumentation added correctly.
The nvtx_pytorch_emit decorator addition:
- Properly imported from _utils
- Applied to the worker_main entry point for profiling visibility
- Defaults to enabled=False, making it opt-in with no performance impact
- Aligns with NVTX instrumentation strategy across the PR
Also applies to: 232-232
tensorrt_llm/_utils.py (1)
54-66: LGTM! Decorator factory implementation is correct.
The nvtx_pytorch_emit decorator factory:
- Follows standard Python decorator factory pattern
- Correctly uses functools.wraps to preserve function metadata
- Properly delegates to torch.autograd.profiler.emit_nvtx
- Safe defaults (enabled=False) ensure no performance impact unless explicitly enabled
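For context, a decorator factory matching the description above could look roughly like this sketch; it illustrates the pattern, not the exact _utils.py code, and the parameter names are assumptions.

```python
import functools

import torch


def nvtx_pytorch_emit(enabled=False, record_shapes=False):
    """Wrap a function so its PyTorch ops emit NVTX ranges when enabled."""

    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped function's metadata
        def wrapper(*args, **kwargs):
            # emit_nvtx is a context manager; with enabled=False it is a no-op,
            # so the default adds no profiling overhead.
            with torch.autograd.profiler.emit_nvtx(
                    enabled=enabled, record_shapes=record_shapes):
                return func(*args, **kwargs)

        return wrapper

    return decorator
```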
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
533-535: LGTM! Python binding follows established patterns.
The is_task_cached_device binding:
- Correctly uses nb::call_guard<nb::gil_scoped_release>() to release GIL during the call
- Follows the same pattern as the existing is_task_cached binding
- Properly exposes the C++ isTaskCachedDevice method to Python
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
526-528: LGTM! Pybind11 binding mirrors nanobind implementation.
The is_task_cached_device binding:
- Correctly uses py::call_guard<py::gil_scoped_release>() to release GIL
- Follows the same pattern as the existing is_task_cached binding
- Mirrors the nanobind implementation in the other binding file
- Properly exposes PeftCacheManager::isTaskCachedDevice to Python
tests/unittest/llmapi/lora_test_utils.py (1)
261-268: Validate slot_id sentinel behavior.
You allow slot_id == slot_count (Line 261). Please confirm this is intentional to represent base model (ignored) rows, given get_slot_counts truncates counts to max_lora_size.
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (1)
99-107: Avoid per-call workspace allocation during CUDA Graph capture
Dynamic calls to at::empty inside capture can break CUDA Graph replay unless the allocator is warmed up or guaranteed capture-safe. Pre-allocate or accept an external workspace (pointer/size), or document that workspace size is zeroed on warmup. Verify that any dynamic allocation here won't invalidate graph capture.
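To illustrate the pre-allocation idea in PyTorch terms, here is a minimal sketch of capturing work into a CUDA graph with every buffer (including a reusable workspace) allocated before capture; the workspace size and the captured operation are placeholders, not the actual grouped-GEMM op.

```python
import torch

# Allocate everything the captured region will touch before capture starts,
# so no allocator calls happen while the graph is being recorded.
static_in = torch.zeros(16, 1024, device="cuda")
static_out = torch.zeros(16, 1024, device="cuda")
workspace = torch.empty(4 << 20, dtype=torch.uint8, device="cuda")  # placeholder size, passed to the real kernel

# Warm up on a side stream first, as recommended before graph capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    torch.mul(static_in, 2.0, out=static_out)  # stands in for the real kernel
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    # Writes only into pre-allocated tensors; no new allocations here.
    torch.mul(static_in, 2.0, out=static_out)

# Replay: refresh inputs in place, then relaunch the recorded work.
static_in.copy_(torch.randn(16, 1024, device="cuda"))
graph.replay()
```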
cpp/tensorrt_llm/thop/loraOp.cpp (1)
205-213: Confirm intent: A' pointers reuse D offsets
a_prime_ptrs_gpu is initialized from d_offsets (Line 210), implying A' = D for the second GEMM. This matches the design (LoRA-out uses the intermediate as A). If intentional, LGTM. If not, switch to a_offsets.
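The aliasing is consistent with how LoRA is computed as two chained GEMMs: the output (D) of the LoRA-in GEMM is exactly the input (A) of the LoRA-out GEMM. A minimal single-request sketch, with shapes assumed purely for illustration:

```python
import torch

# Assumed shapes: 8 tokens, hidden size 4096, LoRA rank 16, output size 4096.
x = torch.randn(8, 4096)        # activations (tokens x hidden)
lora_a = torch.randn(16, 4096)  # LoRA-in weight (rank x hidden)
lora_b = torch.randn(4096, 16)  # LoRA-out weight (out x rank)

# First GEMM (LoRA-in): D = x @ A^T -> intermediate of shape (tokens x rank).
intermediate = x @ lora_a.T

# Second GEMM (LoRA-out): its "A" operand is the intermediate produced above,
# which is why the A' pointers can legitimately alias the first GEMM's D offsets.
lora_out = intermediate @ lora_b.T
```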
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
2046-2080: LoRA manager initialization and routing LGTM
Initialization via_init_cuda_graph_lora_managerin_util.pyis present; CUDA Graph and legacy paths are correctly handled.tests/unittest/llmapi/test_llm_pytorch.py (3)
20-23: LGTM!The new imports are appropriate for adding CUDA graph LoRA testing capabilities. The utilities from
lora_test_utilsandreplacefromdataclassesare used consistently throughout the file.Also applies to: 46-46
273-326: LGTM!The test comprehensively covers multiple edge cases for CUDA graph LoRA parameter filling:
- All requests with the same LoRA ID
- No LoRA in batch
- Multiple modules
- Invalid weight pointers
- Mixed slot IDs
The use of
replacefrom dataclasses to create test variations is clean and maintainable.
362-365: LGTM!
The widespread updates to test functions follow a consistent pattern:
- Tests now accept
cuda_graph_config parameter
- Using
@test_lora_with_and_without_cuda_graph decorator ensures tests run with both CUDA graph enabled and disabled
- LLM initializations consistently pass
cuda_graph_config
- Helper functions updated to accept
**llm_kwargs for flexibility
This comprehensive approach ensures CUDA graph support is tested across diverse LoRA configurations (LoraConfig variants, NeMo LoRA, CodeLlama LoRA, Gemma, Bielik, etc.).
Also applies to: 368-375, 418-430, 433-445, 448-458, 461-482, 486-516, 519-535, 541-555, 575-617, 636-668, 685-726, 805-852
Force-pushed from d22f2a6 to f14382b (Compare)
/bot run
PR_Github #21382 [ run ] triggered by Bot
PR_Github #21382 [ run ] completed with state
Force-pushed from f14382b to 7d7d903 (Compare)
/bot run --disable-fail-fast
PR_Github #21388 [ run ] triggered by Bot
PR_Github #21388 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #21435 [ run ] triggered by Bot
PR_Github #21435 [ run ] completed with state
PR_Github #32984 [ reuse-pipeline ] completed with state
Force-pushed from 61a5d2b to d272821 (Compare)
/bot reuse-pipeline
PR_Github #33108 [ reuse-pipeline ] triggered by Bot. Commit:
Force-pushed from d272821 to ff96e95 (Compare)
PR_Github #33108 [ reuse-pipeline ] completed with state
Force-pushed from ff96e95 to 065918b (Compare)
/bot reuse-pipeline
PR_Github #33119 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #33119 [ reuse-pipeline ] completed with state
Force-pushed from 065918b to 5ddaccf (Compare)
/bot reuse-pipeline
PR_Github #33131 [ reuse-pipeline ] triggered by Bot. Commit:
/bot reuse-pipeline
PR_Github #33137 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #33131 [ reuse-pipeline ] completed with state
PR_Github #33137 [ reuse-pipeline ] completed with state
Force-pushed from 5ddaccf to ab3110e (Compare)
/bot reuse-pipeline
This is a combination of 16 commits (squashed; all signed off by Jiayu Chang <jiayuc@nvidia.com>):
- Implement CUDA Graph compatible multi LoRAs
- Refactor CUDA Graph LoRA integration to support precomputed leading dimensions:
  - Updated `cuda_graph_grouped_gemm` and `cuda_graph_splitk_grouped_gemm` functions to accept leading dimension pointers for A, B, C, and D matrices.
  - Modified `LoraImpl` to retrieve and pass leading dimension pointers during GEMM operations.
  - Enhanced `CudaGraphLoraParams` to manage leading dimensions for each layer and module.
  - Adjusted `CudaGraphLoraManager` to initialize parameters based on actual layer configurations from the PEFT table.
  - Improved handling of layer-specific parameters to ensure compatibility with CUDA Graph operations.
  - This refactor aims to optimize performance by leveraging precomputed leading dimensions, reducing overhead during GEMM execution.
- bug fixes
- Move input prep to graph
- Fix bug in adapter size
- Pass all but `test_llama_7b_lora_config_overrides_peft_cache_config` on L40s (graph seems to capture code outside of the captured function?)
- Pass all tests
- sync slot manager with c++
- Update kernel alignment selection
- Fix kernel workspace sizes
- memcpy use pinned memory; remove assert in slot manager eviction
- Add param fill fused kernel
- Disable torch nvtx emit
- Disable init manager without cuda graph
- Update CI
- Moved files
- Fix meta tensor loading
- Add graph config to new test
- Fix bug without lora
- Fix CI: custom torch op fix impl; get lora params signature
- Fix failed TRT path
- Cleanup
- Fix review comments
- Change var names following review
- Address comment: Changed PeftTable, TaskPeftTable, and TaskIdToReqIds from std::map to std::unordered_map for better average time complexity in lookups and insertions.
- Address comments: Add None as default for cuda_graph_config to llama_7b_lora_from_dir_test_harness if not provided; Remove typo
- Address comments: update function precondition; refactor
- add multi gpu test
- Address comments
Force-pushed from ab3110e to c9063df (Compare)
/bot reuse-pipeline
PR_Github #33159 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #33159 [ reuse-pipeline ] completed with state
…IDIA#8279) Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>
NvBug
https://nvbugspro.nvidia.com/bug/5322131
https://nvbugspro.nvidia.com/bug/5441746
Benchmark Perf
merge base
43c46a09
Llama 3.3 70B, TP8; p5.48xlarge 8xH100
ISL: 1600; OSL: 600; Concurrency: 16; All requests query the same LoRA
Still need to remove some code for logging / testing
Potential Future Optimizations not included in this PR
Summary by CodeRabbit
New Features
Performance
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.