
Conversation

@JyChang012
Member

@JyChang012 JyChang012 commented Oct 11, 2025

NvBug

https://nvbugspro.nvidia.com/bug/5322131
https://nvbugspro.nvidia.com/bug/5441746

Benchmark Perf

Merge base: 43c46a09

| Configuration | Avg ITL (ms) | Throughput (tok/s) | Avg TTFT (ms) |
| --- | --- | --- | --- |
| CUDA Graph + multi LoRA (after changes) | 26 | 568 | 227 |
| No CUDA Graph + multi LoRA (before changes) | 119 | 127 | 435 |
| CUDA Graph + no LoRA (before changes) | 14 | 1068 | 137 |

Llama 3.3 70B, TP8; p5.48xlarge 8xH100
ISL: 1600; OSL: 600; Concurrency: 16; All requests query the same LoRA

Still need to remove some code for logging / testing

Potential Future Optimizations not included in this PR

  • Update prefill + decode fused batch to the new LoRA path, which might reduce bubbles in all-reduce
  • Update the sm80 (split-K) group GEMMs currently used

Summary by CodeRabbit

  • New Features

    • Added CUDA Graph mode for LoRA with multi-adapter batching and slot management.
    • Introduced fused parameter preparation and row reordering to reduce kernel launches.
    • Exposed a device-side cache check for tasks in Python.
    • Enabled optional NVTX profiling wrappers for easier performance tracing.
  • Performance

    • Implemented CUDA Graph–compatible grouped and split-K GEMM paths for faster LoRA execution.
    • Reduced per-step overhead via persistent buffers and slot reuse.
  • Tests

    • Expanded test coverage to run LoRA scenarios with and without CUDA Graph, including edge cases.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
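For example, a typical invocation combining options documented above (the stage name is just the illustrative value from the examples) would be:

/bot run --disable-fail-fast --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"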

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@JyChang012 JyChang012 requested review from a team as code owners October 11, 2025 01:43
@JyChang012 JyChang012 requested review from hlu1, pcastonguay and venkywonka and removed request for hlu1, pcastonguay and venkywonka October 11, 2025 01:43
@coderabbitai
Contributor

coderabbitai bot commented Oct 11, 2025

📝 Walkthrough

Walkthrough

Adds CUDA Graph-based multi-LoRA execution: new grouped GEMM kernels and a fused param-fill/reorder kernel; Torch bindings and THOP entry points; Python-side CUDA Graph LoRA params, slot and manager classes; engine integration with optional CUDA Graph path; PEFT cache/device-lookup extensions; tests and NVTX profiling hooks. Also adjusts attention and miscellaneous bindings.
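To make the slot-management idea concrete, here is a minimal illustrative Python sketch of LRU-style adapter-slot assignment. The class and method names are assumptions for illustration only and do not mirror the actual AdapterSlotManager API added in this PR.

from collections import OrderedDict


class LruAdapterSlots:
    """Illustrative-only sketch of LRU adapter-slot assignment (names are assumptions)."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self._task_to_slot: "OrderedDict[int, int]" = OrderedDict()
        self._free_slots = list(range(num_slots))
        self.slots_changed = False  # signals that slot weight pointers must be refreshed

    def get_or_assign_slot(self, task_id: int) -> int:
        if task_id in self._task_to_slot:
            self._task_to_slot.move_to_end(task_id)  # mark as most recently used
            return self._task_to_slot[task_id]
        if self._free_slots:
            slot = self._free_slots.pop()
        else:
            # Evict the least recently used adapter and reuse its slot.
            _evicted_task, slot = self._task_to_slot.popitem(last=False)
        self._task_to_slot[task_id] = slot
        self.slots_changed = True
        return slot

Keeping slot assignments stable across steps is what lets the captured graph keep fixed weight-pointer buffers; only when slots change do the pointers need to be rewritten, as the first sequence diagram below shows.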

Changes

Cohort / File(s) Summary
PEFT cache API updates
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
Add PeftCacheManager::isTaskCachedDevice; ensureBatch maps taskId to device-resolved LoRA config.
Bindings for PEFT cache
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
Expose is_task_cached and new is_task_cached_device to Python with GIL release.
CUDA Graph grouped GEMM kernels
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
New CUDA Graph-compatible grouped GEMM and split-K grouped GEMM implementations and declarations.
LoRA fused prep kernel
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
New fused kernel to fill params, row-reorder, zero-fill, with launcher API.
LoRA kernel comment
cpp/tensorrt_llm/kernels/lora/lora.cpp
Add clarifying comment on GemmCoord usage.
THOP LoRA ops
cpp/tensorrt_llm/thop/loraOp.cpp
Add CUDA Graph grouped GEMM path and fused param-fill/reorder entry; register Torch ops.
Attention module tweak
tensorrt_llm/_torch/modules/attention.py
Wrap o_lora init in string literal; o_lora assignment skipped.
CUDA Graph LoRA manager and params
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py, .../cuda_graph_lora_manager.py, .../cuda_graph_lora_params.py
Add AdapterSlotManager (LRU slots), CudaGraphLoraManager (prep flow), and CudaGraphLoraParams (persistent CUDA Graph buffers, pointers, sizes).
LoRA layer integration
tensorrt_llm/_torch/peft/lora/layer.py
Add CUDA Graph mode in forward, buffer prep helpers, dataclasses for grouped GEMM params, tensorized size metadata.
Engine integration
tensorrt_llm/_torch/pyexecutor/model_engine.py, .../_util.py, .../py_executor.py, tensorrt_llm/executor/worker.py
Initialize CUDA Graph LoRA manager, propagate maybe_graph, add NVTX emit decorator and tracing wrapper; minor prints.
Resource/PEFT plumbing
tensorrt_llm/_torch/pyexecutor/resource_manager.py
Add batch PEFT table getters/reset; expose is_task_cached_device; track batch PEFT state.
NVTX utility
tensorrt_llm/_utils.py
Add nvtx_pytorch_emit decorator factory.
Tests: CUDA Graph LoRA
tests/unittest/llmapi/lora_test_utils.py, tests/unittest/llmapi/test_llm_pytorch.py
Add CUDA Graph LoRA test params and helpers; parametrize many tests with cuda_graph_config; add kernel special-case tests.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Engine as PyTorchModelEngine
  participant LoraMgr as CudaGraphLoraManager
  participant SlotMgr as AdapterSlotManager
  participant Peft as PeftCacheManager
  participant Params as CudaGraphLoraParams
  participant THOP as thop lora ops
  participant Kern as CUDA Graph GEMM Kernels

  Engine->>LoraMgr: prepare_cuda_graph_lora_params(scheduled_requests, attn_metadata, peft_cache_manager)
  LoraMgr->>Peft: get_and_reset_batch_peft_table()
  LoraMgr->>SlotMgr: update_slots(requests, peft_cache_manager)
  SlotMgr-->>LoraMgr: batch_slot_ids, slots_changed
  LoraMgr->>Params: update_sorted_indices(batch_slot_ids)
  alt slots_changed
    LoraMgr->>Params: update_weight_pointers(peft_table)
    LoraMgr->>SlotMgr: reset_changed_flag()
  end
  LoraMgr->>Params: update_slots_params(batch_slot_ids)
  LoraMgr-->>Engine: {cuda_graph_params, use_cuda_graph_mode, ...}
  Engine->>THOP: lora_group_gemm_param_fill_row_reorder_fusion(...)
  THOP->>Kern: launchLoraGroupGEMMParamFillRowReorderFusion(...)
  Engine->>THOP: lora_grouped_gemm_cuda_graph(... in/out ...)
  THOP->>Kern: cuda_graph_grouped_gemm / splitk_grouped_gemm(...)
sequenceDiagram
  autonumber
  participant Layer as LoraLayer.forward
  participant CG as CUDA-Graph path
  participant Legacy as Legacy path

  Layer->>Layer: decide mode (cuda_graph_enabled && params available)
  alt CUDA Graph mode
    Layer->>CG: _forward_cuda_graph_mode(...)
    CG->>CG: prepare_grouped_gemm_buffers / fused prep
    CG-->>Layer: output tensor or None
  else Legacy mode
    Layer->>Legacy: _forward_legacy_mode(...)
    Legacy-->>Layer: output tensor or None
  end
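In code, the mode decision in this second diagram reduces to a simple guard. The following hypothetical sketch follows the names used in the diagram and is not the actual implementation:

def lora_forward_dispatch(layer, x, lora_params):
    # Hypothetical reduction of the dispatch: take the CUDA Graph path only when
    # it is enabled and prepared CudaGraphLoraParams are available for this step.
    use_cuda_graph_mode = (
        getattr(layer, "cuda_graph_enabled", False)
        and lora_params is not None
        and lora_params.get("cuda_graph_params") is not None
    )
    if use_cuda_graph_mode:
        return layer._forward_cuda_graph_mode(x, lora_params)  # grouped-GEMM path
    return layer._forward_legacy_mode(x, lora_params)  # pre-existing path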

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check (⚠️ Warning): The PR description is largely incomplete, with empty required sections (Description, Test Coverage); it contains only benchmark results and NvBug links. Resolution: complete the Description section explaining the multi-LoRA CUDA Graph feature and its benefits, and fill in the Test Coverage section listing the test cases that validate this functionality.

✅ Passed checks (1 passed)

  • Title Check (✅ Passed): The pull request title "[https://nvbugs/5322131][feat] Multi-LoRA serving with CUDA Graph" accurately describes the primary intent of the changeset. The modifications comprehensively implement multi-LoRA serving support with CUDA Graph compatibility, including new GPU kernels for grouped GEMM operations, slot management infrastructure (AdapterSlotManager, CudaGraphLoraParams, CudaGraphLoraManager), Python bindings, TorchScript operations, and integration into the execution engine and resource managers. The title is clear, concise, and specific: it conveys that the feature enables serving multiple LoRA adapters using CUDA Graph, which directly matches the scope of changes across all modified files.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

362-372: Initialize CUDA Graph LoRA manager when LoRA is configured

set_lora_model_config does not call _init_cuda_graph_lora_manager, so graph mode will never be used. Call it after setting lora_model_config when cuda_graph_runner is enabled.

 def set_lora_model_config(self,
                           lora_target_modules: list[str],
                           trtllm_modules_to_hf_modules: dict[str, str],
                           swap_gate_up_proj_lora_b_weight: bool = True):
     self.lora_model_config = LoraModelConfig(
         lora_target_modules=lora_target_modules,
         trtllm_modules_to_hf_modules=trtllm_modules_to_hf_modules,
         hidden_size=self.model.config.hidden_size,
         dtype=torch_dtype_to_str(self.model.config.torch_dtype),
         swap_gate_up_proj_lora_b_weight=swap_gate_up_proj_lora_b_weight)
+    # Initialize CUDA Graph LoRA manager if possible
+    lora_config = getattr(self.pytorch_backend_config, "lora_config", None)
+    if lora_config is not None:
+        self._init_cuda_graph_lora_manager(lora_config)

Also applies to: 373-392

tests/unittest/llmapi/test_llm_pytorch.py (2)

883-895: Remove duplicate class definition.

The TestLlmError class is defined twice (lines 883-895 and 1013-1024) with identical test_max_num_token_check methods. The second definition will shadow the first. Remove one of the duplicate class definitions.

Also applies to: 1013-1024


1-1: Add NVIDIA copyright header.

The coding guidelines require prepending the NVIDIA Apache-2.0 copyright header with the current year to the top of all source files.

As per coding guidelines.

🧹 Nitpick comments (13)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (1)

1-75: LGTM! Header declaration is well-documented and follows conventions.

The new kernel launcher header:

  • Comprehensive documentation with @brief and detailed parameter descriptions
  • Proper copyright header with current year (2025)
  • Correct namespace usage (tensorrt_llm::kernels)
  • Function signature matches implementation (verified against .cu file snippet)
  • Appropriate use of #pragma once for include guard

The function consolidates multiple operations (param fill, row reorder, zero fill) into a single CUDA graph-compatible kernel, which is a good design for performance.

Optional suggestion: Consider using traditional include guards instead of #pragma once to align with the coding guideline that specifies the format TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H. However, #pragma once is widely supported and more concise, so this is a minor stylistic preference.
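As a rough mental model of the row-reorder step described above (purely illustrative, PyTorch-level pseudocode rather than the kernel's implementation), the fused kernel conceptually performs:

import torch

# Rows that share a LoRA slot are gathered to be contiguous; the per-slot token
# counts then define the problem sizes consumed by the grouped GEMM.
hidden = torch.randn(6, 8)                   # [num_tokens, hidden_size]
slot_ids = torch.tensor([2, 0, 1, 0, 2, 1])  # LoRA slot assigned to each token
_sorted_slots, sorted_indices = torch.sort(slot_ids, stable=True)
reordered_input = hidden.index_select(0, sorted_indices)
tokens_per_slot = torch.bincount(slot_ids, minlength=3)

Doing the fill, reorder, and zero-fill in one launch keeps the per-step launch overhead constant, which is what makes the preparation CUDA Graph friendly.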

cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (1)

28-59: Prefer Doxygen-style API docs for public headers.

Convert to //! and document all parameters (e.g., lda/ldb/ldc/ldd and host vs device pointers).

As per coding guidelines

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1238-1243: Fix return type to Optional (can be None).

Method can return None after reset; align annotation to avoid confusion.

-    def get_and_reset_batch_peft_table(
-            self) -> Dict[int, list[TaskLayerModuleConfig]]:
+    def get_and_reset_batch_peft_table(
+            self) -> Optional[Dict[int, List[TaskLayerModuleConfig]]]:

As per coding guidelines

tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (1)

151-151: Remove no-op statement.

len(request_list) has no effect.

-        len(request_list)
+        # no-op
tensorrt_llm/_torch/peft/lora/layer.py (5)

89-98: Avoid mutable default args; silence unused variable

Use None defaults and initialize inside. Prefix unused local to satisfy linters.

-def compare_grouped_gemm_params(
-    params: GroupedGemmParamsOutput,
-    ref: GroupedGemmParamsOutput,
-    params_input: GroupedGemmParamsInput,
-    params_to_store_msg: List[str] | None = ['splitk_offsets'],
-    params_exclude_msg: List[str] | None = None,
-):
+def compare_grouped_gemm_params(
+    params: GroupedGemmParamsOutput,
+    ref: GroupedGemmParamsOutput,
+    params_input: GroupedGemmParamsInput,
+    params_to_store_msg: Optional[List[str]] = None,
+    params_exclude_msg: Optional[List[str]] = None,
+):
@@
-    bs, input_hidden_size = params.reordered_input.shape
+    bs, _input_hidden_size = params.reordered_input.shape
@@
-    if not params_to_store_msg:
-        params_to_store_msg = set(params_dict.keys())
+    if params_to_store_msg is None:
+        params_to_store_msg = ['splitk_offsets']
+    if not params_to_store_msg:
+        params_to_store_msg = set(params_dict.keys())

Based on learnings


114-123: Type-safe equality for integer tensors in debug compare

torch.allclose is for floating dtypes. Use torch.equal for integral tensors to avoid runtime errors when debug is enabled.

-        if name not in ("reordered_input", "a_offset"):
-            asserter.add(
-                v.allclose(ref_v),
-                get_msg(name, v, ref_v),
-            )
+        if name not in ("reordered_input", "a_offset"):
+            if v.dtype.is_floating_point:
+                ok = torch.allclose(v, ref_v)
+            else:
+                ok = torch.equal(v, ref_v)
+            asserter.add(ok, get_msg(name, v, ref_v))

290-291: Tuple construction nit: use splat for shape_3d

Cleaner and avoids creating an intermediate tuple.

-        shape_3d = shape_2d + (3, )
+        shape_3d = (*shape_2d, 3)

Apply in all three sites.

Also applies to: 413-414, 482-483


593-597: Remove f-prefix from constant strings

Minor lint (F541). Drop the f where no placeholders exist.

-            print(
-                f'--------------------------------layer key: {layer_key}--------------------------------'
-            )
-            print(f'cuda graph params values:')
+            print(
+                f'--------------------------------layer key: {layer_key}--------------------------------'
+            )
+            print('cuda graph params values:')
@@
-            print(f'buffers values:')
+            print('buffers values:')
@@
-            print(f'calculated buffers')
+            print('calculated buffers')

Also applies to: 610-618


44-52: Gate or remove debug flags before merge

These globals alter runtime paths and include heavy printing/assert scaffolding. Consider gating via env vars or removing before release.

tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (3)

144-146: Silence unused args in _create_layer_params

Rename to underscore to satisfy linters and reflect non-use.

-    def _create_layer_params(
-            self, key: LoraLayerKey, layer_module_num: int,
-            module_output_sizes: torch.Tensor) -> LoraLayerParams:
+    def _create_layer_params(
+            self, _key: LoraLayerKey, layer_module_num: int,
+            _module_output_sizes: torch.Tensor) -> LoraLayerParams:

183-187: Silence unused local in get_sorted_indices

Prefix with underscore.

-        sorted_slot_ids, sorted_indices = torch.sort(slot_ids, stable=True)
+        _sorted_slot_ids, sorted_indices = torch.sort(slot_ids, stable=True)

70-79: Avoid print in library init

Prefer logger or remove to keep init silent.

-        print(
-            f'cuda graph lora params init max batch size: {max_batch_size}, max lora size: {max_lora_size}, max rank: {max_rank}'
-        )
+        # Consider using logger.debug for initialization info.
tests/unittest/llmapi/test_llm_pytorch.py (1)

270-270: Avoid global CUDA device setting in tests. Replace torch.cuda.set_device(0) with a scoped approach (e.g., a pytest fixture or with torch.cuda.device(0):) to isolate device selection and prevent interference when tests run in parallel.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84d2f12 and 1d61b14.

📒 Files selected for processing (24)
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (1 hunks)
  • cpp/tensorrt_llm/kernels/lora/lora.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (1 hunks)
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/loraOp.cpp (3 hunks)
  • tensorrt_llm/_torch/modules/attention.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (1 hunks)
  • tensorrt_llm/_torch/peft/lora/layer.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (21 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (5 hunks)
  • tensorrt_llm/_utils.py (1 hunks)
  • tensorrt_llm/executor/worker.py (2 hunks)
  • tests/unittest/llmapi/lora_test_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (20 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • tests/unittest/llmapi/lora_test_utils.py
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_utils.py
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/kernels/lora/lora.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py
  • cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu
  • tests/unittest/llmapi/lora_test_utils.py
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/thop/loraOp.cpp
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_utils.py
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py
  • tensorrt_llm/_torch/pyexecutor/guided_decoder.py
  • tests/unittest/llmapi/lora_test_utils.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py
  • tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/_torch/peft/lora/layer.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h
  • cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h
  • cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h
🧠 Learnings (2)
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/thop/loraOp.cpp
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/_util.py
🧬 Code graph analysis (18)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
tensorrt_llm/_utils.py (1)
  • nvtx_pytorch_emit (54-66)
tensorrt_llm/_torch/pyexecutor/seq_slot_manager.py (1)
  • SeqSlotManager (6-32)
cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)
  • isTaskCachedDevice (487-490)
  • isTaskCachedDevice (487-487)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (2)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (1)
  • PeftCacheManager (231-255)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
  • PeftCacheManager (1135-1245)
  • is_task_cached_device (1244-1245)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2)
tensorrt_llm/_torch/peft/lora/layer.py (1)
  • slot_offsets (85-86)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h (1)
  • ceilDiv (41-44)
tests/unittest/llmapi/lora_test_utils.py (3)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (4)
  • CudaGraphLoraParams (30-370)
  • get_slot_counts (313-324)
  • get_offset_from_counts (286-310)
  • get_sorted_indices (175-187)
tensorrt_llm/_torch/peft/lora/layer.py (5)
  • GroupedGemmParamsInput (71-86)
  • LoraLayer (244-1110)
  • compare_grouped_gemm_params (89-163)
  • _prepare_grouped_gemm_buffers_fused (407-474)
  • prepare_grouped_gemm_buffers (285-405)
tensorrt_llm/llmapi/llm_args.py (1)
  • CudaGraphConfig (109-166)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (2)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (2)
  • tensorrt_llm (23-62)
  • kernels (25-61)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-412)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-361)
cpp/tensorrt_llm/thop/loraOp.cpp (2)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu (2)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-412)
  • launchLoraGroupGEMMParamFillRowReorderFusion (354-361)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (4)
  • cuda_graph_splitk_grouped_gemm (320-379)
  • cuda_graph_splitk_grouped_gemm (320-323)
  • cuda_graph_grouped_gemm (143-202)
  • cuda_graph_grouped_gemm (143-146)
tensorrt_llm/_torch/pyexecutor/_util.py (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
  • _init_cuda_graph_lora_manager (373-391)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (1082-1100)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h (2)
cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h (2)
  • tensorrt_llm (23-75)
  • kernels (25-74)
cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (4)
  • cuda_graph_grouped_gemm (143-202)
  • cuda_graph_grouped_gemm (143-146)
  • cuda_graph_splitk_grouped_gemm (320-379)
  • cuda_graph_splitk_grouped_gemm (320-323)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (6)
tensorrt_llm/lora_manager.py (2)
  • LoraManager (639-1242)
  • LoraModelConfig (241-246)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • AttentionMetadata (40-336)
  • num_seqs (245-249)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
  • PeftCacheManager (1135-1245)
  • get_and_reset_batch_peft_table (1238-1242)
tensorrt_llm/_torch/pyexecutor/scheduler.py (1)
  • ScheduledRequests (18-39)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (5)
  • AdapterSlotManager (15-137)
  • update_slots (92-120)
  • get_slot_to_task_mapping (122-129)
  • has_slots_changed (131-133)
  • reset_changed_flag (135-137)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (5)
  • CudaGraphLoraParams (30-370)
  • LoraLayerInfo (41-51)
  • update_sorted_indices (189-216)
  • update_weight_pointers (218-283)
  • update_slots_params (326-341)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (1)
tensorrt_llm/_torch/peft/lora/layer.py (1)
  • slot_offsets (85-86)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/_utils.py (3)
  • mpi_comm (504-505)
  • mpi_rank (538-545)
  • nvtx_pytorch_emit (54-66)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)
  • isTaskCachedDevice (487-490)
  • isTaskCachedDevice (487-487)
tensorrt_llm/_torch/peft/lora/layer.py (2)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py (4)
  • CudaGraphLoraParams (30-370)
  • get_offset_from_counts (286-310)
  • get_layer_params (359-370)
  • get_problem_count (343-357)
cpp/tensorrt_llm/thop/loraOp.cpp (4)
  • lora_grouped_gemm_cuda_graph (178-285)
  • lora_grouped_gemm_cuda_graph (178-190)
  • lora_grouped_gemm (54-176)
  • lora_grouped_gemm (54-58)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
tensorrt_llm/_torch/peft/lora/cuda_graph_lora_manager.py (2)
  • CudaGraphLoraManager (23-191)
  • prepare_cuda_graph_lora_params (128-191)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (5)
  • PeftCacheManager (1135-1245)
  • ResourceManager (1097-1132)
  • ResourceManagerType (49-54)
  • get_and_reset_batch_peft_table (1238-1242)
  • get_resource_manager (1109-1110)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • batch_size (35-36)
  • ScheduledRequests (18-39)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
  • attn_metadata (121-122)
  • needs_capture (193-195)
  • replay (268-302)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • AttentionMetadata (40-336)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py (1)
  • remove_evicted_slots_in_cpp (83-90)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • tensorrt_llm (39-275)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
  • tensorrt_llm (49-52)
  • tensorrt_llm (54-168)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)
  • isTaskCachedDevice (487-490)
  • isTaskCachedDevice (487-487)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/lora_test_utils.py (5)
  • create_mock_nemo_lora_checkpoint (132-242)
  • compare_cuda_graph_lora_params_filler (360-378)
  • CUDAGraphLoRATestParams (246-284)
  • module_count (271-272)
  • slot_count (275-276)
🪛 Ruff (0.13.3)
tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py

118-118: Unpacked variable evicted_task is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/unittest/llmapi/lora_test_utils.py

288-297: Do not perform function call CUDAGraphLoRATestParams in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

tensorrt_llm/_torch/peft/lora/cuda_graph_lora_params.py

144-144: Unused method argument: key

(ARG002)


145-145: Unused method argument: module_output_sizes

(ARG002)


183-183: Unpacked variable sorted_slot_ids is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/_torch/peft/lora/layer.py

93-93: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


98-98: Unpacked variable input_hidden_size is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


290-290: Consider (*shape_2d, 3) instead of concatenation

Replace with (*shape_2d, 3)

(RUF005)


413-413: Consider (*shape_2d, 3) instead of concatenation

Replace with (*shape_2d, 3)

(RUF005)


482-482: Consider (*shape_2d, 3) instead of concatenation

Replace with (*shape_2d, 3)

(RUF005)


593-593: f-string without any placeholders

Remove extraneous f prefix

(F541)


610-610: f-string without any placeholders

Remove extraneous f prefix

(F541)


612-612: Undefined name reordered_input

(F821)


618-618: f-string without any placeholders

Remove extraneous f prefix

(F541)


637-637: Undefined name ldb

(F821)


637-637: Undefined name ldb

(F821)


642-642: Undefined name ldd

(F821)


642-642: Undefined name ldd

(F821)


819-819: Undefined name reordered_input

(F821)


820-820: Undefined name in_sizes

(F821)


821-821: Undefined name out_sizes

(F821)


822-822: Undefined name a_offset

(F821)


826-826: Undefined name d_offset

(F821)


830-830: Undefined name d_prime_offset

(F821)


832-832: Undefined name reordered_input

(F821)


840-840: Undefined name lda

(F821)


841-841: Undefined name ldb

(F821)


842-842: Undefined name ldd

(F821)


843-843: Undefined name ldb_prime

(F821)


844-844: Undefined name ldd_prime

(F821)


847-847: Undefined name splitk_offsets

(F821)


969-969: Undefined name splitk_offsets

(F821)


1103-1106: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp (2)

487-490: LGTM! Device cache lookup follows existing patterns.

The isTaskCachedDevice implementation correctly delegates to mDeviceLoraCache->has(taskId), mirroring the pattern used by isTaskCached for host cache and isTaskDoneDevice for device status.


465-466: Verify ensureBatch callers expect taskId-based keys
ensureBatch is only invoked in the kvCacheManager nanobind/pybind wrappers, which forward the result to Python. Since peftTable now uses taskId (not requestId) as its key, confirm all downstream Python consumers of ensure_batch have been updated to handle taskId-based mappings.

cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h (1)

118-118: LGTM! API declaration is consistent with existing patterns.

The isTaskCachedDevice method declaration:

  • Properly uses [[nodiscard]] attribute
  • Correctly marked const
  • Mirrors the existing isTaskDoneDevice pattern
  • Follows naming conventions
tensorrt_llm/executor/worker.py (1)

14-14: LGTM! Profiling instrumentation added correctly.

The nvtx_pytorch_emit decorator addition:

  • Properly imported from _utils
  • Applied to worker_main entry point for profiling visibility
  • Defaults to enabled=False, making it opt-in with no performance impact
  • Aligns with NVTX instrumentation strategy across the PR

Also applies to: 232-232

tensorrt_llm/_utils.py (1)

54-66: LGTM! Decorator factory implementation is correct.

The nvtx_pytorch_emit decorator factory:

  • Follows standard Python decorator factory pattern
  • Correctly uses functools.wraps to preserve function metadata
  • Properly delegates to torch.autograd.profiler.emit_nvtx
  • Safe defaults (enabled=False) ensure no performance impact unless explicitly enabled
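For reference, a minimal sketch of such a decorator factory (an assumed shape, not the PR's exact code) could look like:

import functools

import torch


def nvtx_pytorch_emit(enabled: bool = False, record_shapes: bool = False):
    # Decorator factory: wrap the target function in torch.autograd.profiler.emit_nvtx,
    # which is a no-op context manager when enabled=False.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with torch.autograd.profiler.emit_nvtx(enabled=enabled, record_shapes=record_shapes):
                return func(*args, **kwargs)
        return wrapper
    return decorator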
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

533-535: LGTM! Python binding follows established patterns.

The is_task_cached_device binding:

  • Correctly uses nb::call_guard<nb::gil_scoped_release>() to release GIL during the call
  • Follows the same pattern as the existing is_task_cached binding
  • Properly exposes the C++ isTaskCachedDevice method to Python
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

526-528: LGTM! Pybind11 binding mirrors nanobind implementation.

The is_task_cached_device binding:

  • Correctly uses py::call_guard<py::gil_scoped_release>() to release GIL
  • Follows the same pattern as existing is_task_cached binding
  • Mirrors the nanobind implementation in the other binding file
  • Properly exposes PeftCacheManager::isTaskCachedDevice to Python
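A hypothetical Python-side use of this binding (the manager instance and task id below are placeholders) might be:

def adapter_ready_on_device(peft_cache_manager, task_id: int) -> bool:
    # is_task_cached_device releases the GIL while the device LoRA cache is queried,
    # so other Python threads are not blocked during the lookup.
    return peft_cache_manager.is_task_cached_device(task_id)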
tests/unittest/llmapi/lora_test_utils.py (1)

261-268: Validate slot_id sentinel behavior.

You allow slot_id == slot_count (Line 261). Please confirm this is intentional to represent base model (ignored) rows, given get_slot_counts truncates counts to max_lora_size.

cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu (1)

99-107: Avoid per-call workspace allocation during CUDA Graph capture
Dynamic calls to at::empty inside capture can break CUDA Graph replay unless the allocator is warmed up or guaranteed capture-safe. Pre-allocate or accept an external workspace (pointer/size), or document that workspace size is zeroed on warmup. Verify that any dynamic allocation here won’t invalidate graph capture.

cpp/tensorrt_llm/thop/loraOp.cpp (1)

205-213: Confirm intent: A' pointers reuse D offsets

a_prime_ptrs_gpu is initialized from d_offsets (Line 210), implying A' = D for the second GEMM. This matches the design (LoRA-out uses intermediate as A). If intentional, LGTM. If not, switch to a_offsets.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

2046-2080: LoRA manager initialization and routing LGTM
Initialization via _init_cuda_graph_lora_manager in _util.py is present; CUDA Graph and legacy paths are correctly handled.

tests/unittest/llmapi/test_llm_pytorch.py (3)

20-23: LGTM!

The new imports are appropriate for adding CUDA graph LoRA testing capabilities. The utilities from lora_test_utils and replace from dataclasses are used consistently throughout the file.

Also applies to: 46-46


273-326: LGTM!

The test comprehensively covers multiple edge cases for CUDA graph LoRA parameter filling:

  • All requests with the same LoRA ID
  • No LoRA in batch
  • Multiple modules
  • Invalid weight pointers
  • Mixed slot IDs

The use of replace from dataclasses to create test variations is clean and maintainable.


362-365: LGTM!

The widespread updates to test functions follow a consistent pattern:

  • Tests now accept cuda_graph_config parameter
  • Using @test_lora_with_and_without_cuda_graph decorator ensures tests run with both CUDA graph enabled and disabled
  • LLM initializations consistently pass cuda_graph_config
  • Helper functions updated to accept **llm_kwargs for flexibility

This comprehensive approach ensures CUDA graph support is tested across diverse LoRA configurations (LoraConfig variants, NeMo LoRA, CodeLlama LoRA, Gemma, Bielik, etc.).

Also applies to: 368-375, 418-430, 433-445, 448-458, 461-482, 486-516, 519-535, 541-555, 575-617, 636-668, 685-726, 805-852

@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from d22f2a6 to f14382b on October 14, 2025 17:24
@JyChang012
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21382 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21382 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #16147 completed with status: 'FAILURE'

@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from f14382b to 7d7d903 on October 14, 2025 19:53
@JyChang012
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21388 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21388 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16153 completed with status: 'FAILURE'

@JyChang012
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21435 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21435 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16187 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #32984 [ reuse-pipeline ] completed with state SUCCESS. Commit: 61a5d2b
Reusing PR_Github #32926 for commit 61a5d2b

@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from 61a5d2b to d272821 on January 22, 2026 06:51
@JyChang012
Member Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #33108 [ reuse-pipeline ] triggered by Bot. Commit: d272821

@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from d272821 to ff96e95 on January 22, 2026 06:58
@tensorrt-cicd
Collaborator

PR_Github #33108 [ reuse-pipeline ] completed with state SUCCESS. Commit: d272821
Reusing PR_Github #32926 for commit d272821

@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from ff96e95 to 065918b on January 22, 2026 07:38
@JyChang012
Member Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #33119 [ reuse-pipeline ] triggered by Bot. Commit: 065918b

@tensorrt-cicd
Collaborator

PR_Github #33119 [ reuse-pipeline ] completed with state SUCCESS. Commit: 065918b
Reusing PR_Github #32926 for commit 065918b

@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from 065918b to 5ddaccf on January 22, 2026 09:02
@JyChang012
Member Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #33131 [ reuse-pipeline ] triggered by Bot. Commit: 5ddaccf

@JyChang012
Member Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #33137 [ reuse-pipeline ] triggered by Bot. Commit: 5ddaccf

@tensorrt-cicd
Collaborator

PR_Github #33131 [ reuse-pipeline ] completed with state ABORTED. Commit: 5ddaccf
Can't reuse PR_Github #32926 with status: SUCCESS

@tensorrt-cicd
Collaborator

PR_Github #33137 [ reuse-pipeline ] completed with state SUCCESS. Commit: 5ddaccf
Reusing PR_Github #32926 for commit 5ddaccf

@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from 5ddaccf to ab3110e on January 22, 2026 09:38
@JyChang012
Member Author

/bot reuse-pipeline

This is a combination of 16 commits.

Implement CUDA Graph compatible multi LoRAs

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Refactor CUDA Graph LoRA integration to support precomputed leading dimensions

- Updated `cuda_graph_grouped_gemm` and `cuda_graph_splitk_grouped_gemm` functions to accept leading dimension pointers for A, B, C, and D matrices.
- Modified `LoraImpl` to retrieve and pass leading dimension pointers during GEMM operations.
- Enhanced `CudaGraphLoraParams` to manage leading dimensions for each layer and module.
- Adjusted `CudaGraphLoraManager` to initialize parameters based on actual layer configurations from the PEFT table.
- Improved handling of layer-specific parameters to ensure compatibility with CUDA Graph operations.

This refactor aims to optimize performance by leveraging precomputed leading dimensions, reducing overhead during GEMM execution.

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

bug fixes

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Move input prep to graph

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Fix bug in adapter size

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Pass all but `test_llama_7b_lora_config_overrides_peft_cache_config` on L40s
Graph seems to capture code outside of the captured function?

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Pass all tests

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

sync slot manager with c++

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Update kernel alignment selection

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Fix kernel workspace sizes

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

memcpy use pinned memory; remove assert in slot manager eviction

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Add param fill fused kernel

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Disable torch nvtx emit

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Disable init manager without cuda graph

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Update CI

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Moved files

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Fix meta tensor loading

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Add graph config to new test

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Fix bug without lora

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Fix CI: custom torch op fix impl; get lora params signature

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Fix failed TRT path

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Cleanup

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Fix review comments

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Change var names following review

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Address comment: Changed PeftTable, TaskPeftTable, and TaskIdToReqIds from std::map to std::unordered_map for better average time complexity in lookups and insertions.

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Address comments
Add None as default for cuda_graph_config to llama_7b_lora_from_dir_test_harness if not provided
Remove typo

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Address comments
- update function precondition
- refactor

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

add multi gpu test

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>

Address comments

Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>
Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>
Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>
Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>
Signed-off-by: Jiayu Chang <jiayuc@nvidia.com>
@JyChang012 JyChang012 force-pushed the feat/cuda_graph_multiLora branch from ab3110e to c9063df on January 22, 2026 10:58
@JyChang012
Member Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #33159 [ reuse-pipeline ] triggered by Bot. Commit: c9063df

@tensorrt-cicd
Collaborator

PR_Github #33159 [ reuse-pipeline ] completed with state SUCCESS. Commit: c9063df
Reusing PR_Github #32926 for commit c9063df

@Funatiq Funatiq merged commit 1dc49b2 into NVIDIA:main Jan 22, 2026
5 checks passed
greg-kwasniewski1 pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Jan 22, 2026