
Conversation

venkywonka
Collaborator

@venkywonka venkywonka commented Jul 14, 2025

  • Re-use the NemoLoraLoader from the TRT flow
  • Modify core + model-specific files to route based on checkpoint source ("nemo" vs "hf"); a sketch of this dispatch follows the note below
  • Enforce the limitation that NeMo LoRA checkpoints currently only support loading fused "attn_qkv" adapters
  • Add unit tests + an e2e test on TinyLlama with a dummy NeMo LoRA checkpoint
  • Manually verify externally with a working .nemo checkpoint
  • Fix a bug that expected the NeMo file path in the lora_dir variable in the NeMo loading case

NOTE: This merely re-uses the pre-existing core NeMo LoRA checkpoint loading functionality, so it is at parity with the TRT flow w.r.t. LoRA checkpoint loading.
The limitations of NeMo LoRA checkpoint loading therefore still apply.
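
A minimal sketch of that source-based dispatch, assuming the load_torch_lora / load_torch_nemo_lora names this PR introduces; the loader bodies are illustrative stubs, not the actual implementation:

from dataclasses import dataclass, field
from typing import List

@dataclass
class LoraConfig:  # trimmed to the two fields relevant here
    lora_dir: List[str] = field(default_factory=list)
    lora_ckpt_source: str = "hf"  # new field: "hf" or "nemo"

def load_torch_hf_lora(cfg: LoraConfig) -> None:
    print("loading HF LoRA adapters from", cfg.lora_dir)  # stub

def load_torch_nemo_lora(cfg: LoraConfig) -> None:
    print("loading NeMo LoRA adapters from", cfg.lora_dir)  # stub

def load_torch_lora(cfg: LoraConfig) -> None:
    """Dispatch to the source-specific loader based on lora_ckpt_source."""
    if cfg.lora_ckpt_source == "hf":
        load_torch_hf_lora(cfg)
    elif cfg.lora_ckpt_source == "nemo":
        load_torch_nemo_lora(cfg)
    else:
        raise ValueError(
            f"Unsupported LoRA checkpoint source: {cfg.lora_ckpt_source!r}")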

Summary by CodeRabbit

  • New Features

    • Added support for loading and applying NeMo-format LoRA (Low-Rank Adaptation) checkpoints, including robust discovery of .nemo files and validation of LoRA sources.
    • Enhanced handling of key-value attention heads for better compatibility with models using grouped query attention (GQA).
    • Added unified PyTorch LoRA loader supporting multiple checkpoint sources with source-based dispatch.
    • Introduced a new field to specify LoRA checkpoint source, enabling source-aware loading logic.
  • Bug Fixes

    • Restricted custom vocabulary and embedding loading to Hugging Face LoRA checkpoints, preventing unintended behavior with other sources (a sketch of this guard follows the summary).
    • Improved handling of missing feed-forward multiplier values to prevent errors during model configuration.
    • Added warnings and safeguards for non-uniform key-value attention heads per layer in LoRA modules.
  • Tests

    • Introduced comprehensive tests for NeMo LoRA integration, including GQA support and validation of unsupported module configurations.
    • Added utilities to generate mock NeMo LoRA checkpoints for testing.
  • Documentation

    • Improved docstrings and type annotations for LoRA-related functions, clarifying usage and expected behavior.
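
Regarding the custom-vocabulary restriction above, a minimal sketch of the source guard, assuming the lora_ckpt_source field from this PR; the helper name below is hypothetical:

def load_custom_vocab_embeddings(model, lora_config):
    # hypothetical stand-in for the HF-specific vocab/embedding loading
    print("loading custom vocab/embeddings from", lora_config.lora_dir)

def maybe_load_custom_vocab(model, lora_config):
    # Only HF LoRA checkpoints may carry custom vocabulary/embedding tensors;
    # other sources (e.g. "nemo") skip this path entirely.
    if lora_config is not None and lora_config.lora_ckpt_source == "hf":
        load_custom_vocab_embeddings(model, lora_config)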

@venkywonka venkywonka requested review from a team, Naveassaf, byshiue, shaharmor98 and tijyojwad and removed request for a team July 14, 2025 19:30
@venkywonka venkywonka force-pushed the user/venky/nemo-ckpt-lora-load-pyt branch from f6b09d7 to 18ab327 on July 14, 2025 20:01
@venkywonka venkywonka marked this pull request as ready for review July 14, 2025 20:02
@venkywonka venkywonka requested review from a team as code owners July 14, 2025 20:02
@venkywonka
Collaborator Author

/bot run --extra-stage "H100_PCIe-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #11846 [ run ] triggered by Bot

@venkywonka venkywonka self-assigned this Jul 14, 2025
@venkywonka venkywonka requested a review from Copilot July 14, 2025 20:24

@venkywonka venkywonka requested a review from amitz-nv July 14, 2025 20:28
@tensorrt-cicd
Collaborator

PR_Github #11846 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8777 completed with status: 'FAILURE'

@venkywonka
Collaborator Author

/bot run --extra-stage "H100_PCIe-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #11861 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11861 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #8790 completed with status: 'FAILURE'

Collaborator

@amitz-nv amitz-nv left a comment


Nice work :)
A few small comments, and some questions I'm not sure about.

@venkywonka venkywonka force-pushed the user/venky/nemo-ckpt-lora-load-pyt branch from 4bfb756 to 11da27b on July 15, 2025 16:21
@venkywonka venkywonka requested a review from a team as a code owner July 15, 2025 16:21
@venkywonka venkywonka requested a review from nv-guomingz July 15, 2025 16:21
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11967 [ run ] triggered by Bot

@venkywonka venkywonka requested review from amitz-nv and Copilot July 15, 2025 18:41
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds end-to-end support for loading NeMo-formatted LoRA checkpoints in the PyTorch workflow. Key changes include:

  • Introducing helper functions (find_nemo_files, _find_nemo_files_single_path), enhancing NemoLoraLoader, and adding load_torch_nemo_lora with routing via load_torch_lora.
  • Updating the executor, request, and model initialization code to respect a new lora_ckpt_source flag and only apply HF-specific logic when appropriate.
  • Adding unit tests that generate a minimal .nemo archive and verify loader behavior, and updating the integration test matrix.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/unittest/llmapi/test_llm_pytorch.py: Add NeMo LoRA checkpoint generation helper and loader unit tests
  • tests/integration/test_lists/test-db/l0_h100.yml: Update the integration test list to include the new NeMo LoRA tests
  • tensorrt_llm/lora_manager.py: Introduce file-finding utilities, PyTorch routing, and docstrings
  • tensorrt_llm/executor/worker.py: Pass ckpt_source through when loading adapters
  • tensorrt_llm/executor/request.py: Extend LoRARequest with lora_ckpt_source and validation (see the sketch after this list)
  • tensorrt_llm/_torch/pyexecutor/_util.py: Route via the new load_torch_lora instead of the HF-only loader
  • tensorrt_llm/_torch/models/modeling_utils.py: Guard HF-only LoRA head checks under lora_ckpt_source == "hf"
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py: Same HF-only guard for custom vocab in the NeMo flow
  • tensorrt_llm/_torch/models/modeling_llama.py: Same HF-only guard for custom vocab in the NeMo flow
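
A hedged sketch of the LoRARequest extension noted above; the lora_ckpt_source field and the ckpt_source accessor appear in this PR, while the validation body here is an assumption:

from dataclasses import dataclass

@dataclass
class LoRARequest:  # trimmed to the fields relevant here
    lora_name: str
    lora_int_id: int
    lora_path: str = ""
    lora_ckpt_source: str = "hf"  # new field added by this PR

    def __post_init__(self):
        if self.lora_ckpt_source not in ("hf", "nemo"):
            raise ValueError(f"lora_ckpt_source must be 'hf' or 'nemo', "
                             f"got {self.lora_ckpt_source!r}")

    @property
    def ckpt_source(self) -> str:  # consumed as lora_request.ckpt_source
        return self.lora_ckpt_source
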
Comments suppressed due to low confidence (4)

tensorrt_llm/lora_manager.py:383

  • [nitpick] There are now two similarly named functions (load_nemo_lora and load_torch_nemo_lora), which can be confusing. Consider renaming one to clarify their distinct purposes.
def load_nemo_lora(model, lora_config: LoraConfig):

tensorrt_llm/executor/worker.py:353

  • The call to load_from_ckpt now includes a ckpt_source keyword—please confirm that the load_from_ckpt signature accepts this parameter to avoid unexpected keyword argument errors.
            ckpt_source=lora_request.ckpt_source)

tests/unittest/llmapi/test_llm_pytorch.py:41

  • The helper uses tempfile.TemporaryDirectory() but I don’t see import tempfile in this diff. Please verify that import tempfile is present at the top of the test file to avoid a NameError.
def create_mock_nemo_lora_checkpoint(
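
For context, a sketch of what such a helper can look like; the archive layout (a tarball holding a torch ckpt plus a YAML config) matches NeMo conventions, but the exact key and file names here are assumptions:

import tarfile
import tempfile
from pathlib import Path

import torch
import yaml

def create_mock_nemo_lora_checkpoint(out_dir: Path, hidden_size: int = 64,
                                     rank: int = 8, num_layers: int = 2) -> Path:
    """Write a minimal .nemo archive containing fused attn_qkv LoRA weights."""
    weights = {}
    for layer in range(num_layers):
        prefix = (f"model.language_model.encoder.layers.{layer}"
                  ".self_attention.adapter_layer.lora_kqv_adapter")
        weights[f"{prefix}.linear_in.weight"] = torch.randn(rank, hidden_size)
        weights[f"{prefix}.linear_out.weight"] = torch.randn(3 * hidden_size, rank)
    nemo_path = out_dir / "mock_lora.nemo"
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "model_weights.ckpt"
        torch.save(weights, ckpt)
        cfg = Path(tmp) / "model_config.yaml"
        cfg.write_text(yaml.safe_dump(
            {"peft": {"lora_tuning": {"adapter_dim": rank}}}))
        with tarfile.open(nemo_path, "w") as tar:  # .nemo files are tarballs
            tar.add(ckpt, arcname=ckpt.name)
            tar.add(cfg, arcname=cfg.name)
    return nemo_path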

@tensorrt-cicd
Collaborator

PR_Github #11967 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8882 completed with status: 'SUCCESS'

@venkywonka venkywonka force-pushed the user/venky/nemo-ckpt-lora-load-pyt branch from aeae974 to 56266c8 on July 22, 2025 17:52
@venkywonka
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
tensorrt_llm/_torch/model_config.py (1)

300-342: LGTM! Comprehensive KV heads handling with proper fallbacks.

The enhanced logic properly handles both uniform and per-layer KV head configurations with appropriate fallbacks and LoRA compatibility validation. The TP/CP scaling is applied consistently.

One minor formatting issue to address:

-                    # For uniform models, check: num_key_value_heads (standard) -> num_query_groups (NeMo) -> num_attention_heads
+                    # For uniform models, check: num_key_value_heads (standard) -> 
+                    # num_query_groups (NeMo) -> num_attention_heads
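
For reference, a minimal sketch of the fallback chain that comment mentions; resolve_num_kv_heads is an invented name, and the real code in model_config.py additionally applies TP/CP scaling, omitted here:

def resolve_num_kv_heads(config) -> int:
    # Uniform-model fallback: num_key_value_heads (standard HF name)
    # -> num_query_groups (NeMo's name) -> num_attention_heads (MHA: KV == Q).
    for attr in ("num_key_value_heads", "num_query_groups",
                 "num_attention_heads"):
        value = getattr(config, attr, None)
        if value is not None:
            return value
    raise ValueError("config defines no attention head count")
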
tests/unittest/llmapi/test_llm_pytorch.py (1)

493-567: LGTM! Comprehensive integration test for GQA NeMo LoRA support.

This test excellently validates the entire pipeline from checkpoint creation to generation, with proper deterministic setup and validation that LoRA has a measurable effect. The GQA configuration matches TinyLlama's specifications perfectly.

Fix the line length issues flagged by static analysis:

-    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and applied to a TinyLlama model,
+    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and
+       applied to a TinyLlama model,
-       and that generation with this LoRA produces a deterministic, expected output for a fixed prompt and temperature=0.0.
+       and that generation with this LoRA produces a deterministic, expected output for a fixed
+       prompt and temperature=0.0.
-    The test uses a deterministic dummy LoRA checkpoint (seed=42) and checks both the positive (LoRA applied) and negative
+    The test uses a deterministic dummy LoRA checkpoint (seed=42) and checks both the positive
+    (LoRA applied) and negative
tensorrt_llm/lora_manager.py (1)

349-385: LGTM! Enhanced loader with honest documentation about design limitations.

The comprehensive documentation and new get_target_modules method improve the class significantly. The honest acknowledgment of the misleading parameter name lora_dirs in the docstring is good practice.

Consider renaming the parameter in a future version:

-    def __init__(self, lora_dirs: List[str]):
+    def __init__(self, lora_paths: List[str]):  # Future improvement
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aeae974 and 56266c8.

📒 Files selected for processing (10)
  • tensorrt_llm/_torch/model_config.py (3 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
  • tensorrt_llm/executor/request.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tensorrt_llm/lora_manager.py (10 hunks)
  • tests/unittest/llmapi/lora_test_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py
  • tensorrt_llm/executor/request.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tests/unittest/llmapi/lora_test_utils.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/lora_manager.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/model_config.py

321-321: Line too long (129 > 120)

(E501)

tests/unittest/llmapi/test_llm_pytorch.py

498-498: Line too long (124 > 120)

(E501)


499-499: Line too long (123 > 120)

(E501)


503-503: Line too long (122 > 120)

(E501)

🔇 Additional comments (12)
tensorrt_llm/_torch/model_config.py (2)

363-366: LGTM! Correct routing to C++ binding methods.

The logic properly sets the KV heads configuration on the C++ model config based on whether per-layer or uniform KV heads are being used.


416-419: LGTM! Safe handling of None ffn_mult values.

The defensive programming approach correctly handles cases where some layers' ffn_mult attributes may be None by substituting zero, preventing runtime errors during max computation.
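
The pattern being praised reduces to a None-safe max; a tiny illustration with made-up values:

ffn_mults = [1.5, None, 4.0]  # hypothetical per-layer ffn_mult values
max_ffn_mult = max((m if m is not None else 0) for m in ffn_mults)
assert max_ffn_mult == 4.0  # a None entry no longer breaks max()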

tests/unittest/llmapi/test_llm_pytorch.py (2)

432-467: LGTM! Well-structured unit test for NeMo LoRA loading.

The parameterized test effectively covers different LoRA rank configurations and validates that the loader correctly sets up the configuration with proper module mapping.


469-490: LGTM! Important validation test for error handling.

The test properly validates that the loader rejects unsupported module configurations and raises appropriate errors, which is crucial for user guidance.

tensorrt_llm/lora_manager.py (8)

26-60: LGTM! Excellent addition of type annotations and comprehensive documentation.

The enhanced function signature with type hints and detailed docstring significantly improves code clarity and maintainability without changing the core functionality.


69-157: LGTM! Consistent improvement in type safety and documentation.

The addition of comprehensive type annotations and docstrings follows good practices and makes these complex functions much more understandable, especially the callback signature documentation in iterate_hf_lora.


169-188: LGTM! Clear documentation for module mapping inversion.

The type annotations and docstring make this potentially confusing operation much clearer, especially the clarification that HF module names can be either strings or lists of strings.
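
A small sketch of the inversion being documented; the module names are representative, not the library's full table:

trtllm_to_hf = {
    "attn_qkv": ["q_proj", "k_proj", "v_proj"],  # fused: list of HF names
    "mlp_4h_to_h": "down_proj",                  # single HF name (a string)
}
hf_to_trtllm = {}
for trtllm_name, hf_names in trtllm_to_hf.items():
    for hf_name in ([hf_names] if isinstance(hf_names, str) else hf_names):
        hf_to_trtllm[hf_name] = trtllm_name
# hf_to_trtllm now maps q_proj/k_proj/v_proj -> attn_qkv, down_proj -> mlp_4h_to_h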


280-347: LGTM! Well-designed file discovery with intelligent caching strategy.

The LRU caching on individual paths is a smart optimization that maximizes cache efficiency when paths are reused across different collections. The comprehensive error handling provides clear user feedback for various failure scenarios.
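
A sketch of the caching strategy described; the two function names appear in this PR, while the bodies here only illustrate the per-path LRU idea:

from functools import lru_cache
from pathlib import Path
from typing import List, Tuple

@lru_cache(maxsize=None)
def _find_nemo_files_single_path(path: str) -> Tuple[str, ...]:
    """Resolve one path to .nemo files; cached so repeated paths are free."""
    p = Path(path)
    if p.is_file() and p.suffix == ".nemo":
        return (str(p),)
    if p.is_dir():
        found = tuple(str(f) for f in sorted(p.glob("*.nemo")))
        if found:
            return found
    raise ValueError(f"No .nemo file found at {path}")

def find_nemo_files(paths: List[str]) -> List[str]:
    """Flatten .nemo discovery over a mixed list of files and directories."""
    files: List[str] = []
    for path in paths:
        files.extend(_find_nemo_files_single_path(path))
    return files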


388-396: LGTM! Important addition of validation logic.

The validation check prevents silent failures when NeMo LoRA loading fails, providing clear error messages to users.


442-485: LGTM! Well-designed PyTorch-specific NeMo LoRA loader.

The function provides a clean abstraction for NeMo LoRA loading with appropriate validation, clear error messages, and proper documentation of current limitations. The hardcoded "attn_qkv" mapping aligns with NeMo's current supported functionality.
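
A hedged sketch of that validation (the function name and message wording are assumptions; the attn_qkv-only limitation is stated in this PR):

SUPPORTED_NEMO_LORA_MODULES = {"attn_qkv"}  # NeMo fuses Q/K/V into one adapter

def validate_nemo_lora_modules(target_modules) -> None:
    unsupported = set(target_modules) - SUPPORTED_NEMO_LORA_MODULES
    if unsupported:
        raise ValueError(
            "NeMo LoRA checkpoints currently support only the fused "
            f"'attn_qkv' adapter; unsupported modules: {sorted(unsupported)}")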


487-507: LGTM! Clean dispatcher pattern for LoRA checkpoint loading.

The dispatcher function provides a clean abstraction that routes to appropriate loaders based on checkpoint source, with proper error handling for unsupported sources.


772-777: LGTM! Proper integration of file discovery functionality.

The integration of find_nemo_files in the LoraManager ensures that NeMo checkpoint loading properly handles both files and directories, maintaining consistency with the enhanced loader capabilities.

@tensorrt-cicd
Collaborator

PR_Github #12599 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12599 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9375 completed with status: 'SUCCESS'

Collaborator

@syuoni syuoni left a comment


The LLM part changes look good to me.

@venkywonka venkywonka enabled auto-merge (squash) July 23, 2025 02:33
@venkywonka venkywonka merged commit 9538c8d into NVIDIA:main Jul 23, 2025
3 checks passed
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
