
Conversation

venkywonka
Collaborator

@venkywonka venkywonka commented Jul 14, 2025

  • Re-use the NemoLoraLoader from the TRT flow
  • Modify core + model-specific files to route based on checkpoint source ("nemo" vs "hf"); a sketch of this dispatch follows the note below
  • Enforce the limitation that NeMo LoRA checkpoints currently only support loading fused "attn_qkv" adapters
  • Add unit tests + an e2e test on TinyLlama with a dummy NeMo LoRA checkpoint
  • Manually verify externally with a working .nemo checkpoint
  • Fix a bug that expected the NeMo file path in the lora_dir variable in the NeMo loading case

NOTE: This merely re-uses the pre-existing core NeMo LoRA checkpoint loading functionality, so it is at parity with the TRT flow w.r.t. LoRA checkpoint loading.
The limitations of NeMo LoRA checkpoint loading therefore still apply.
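
A minimal sketch of that source-based dispatch, assuming the load_torch_lora / load_torch_nemo_lora names this PR introduces; the loader bodies are illustrative stubs, not the actual implementation:

from dataclasses import dataclass, field
from typing import List

@dataclass
class LoraConfig:  # trimmed to the two fields relevant here
    lora_dir: List[str] = field(default_factory=list)
    lora_ckpt_source: str = "hf"  # new field: "hf" or "nemo"

def load_torch_hf_lora(cfg: LoraConfig) -> None:
    print("loading HF LoRA adapters from", cfg.lora_dir)  # stub

def load_torch_nemo_lora(cfg: LoraConfig) -> None:
    print("loading NeMo LoRA adapters from", cfg.lora_dir)  # stub

def load_torch_lora(cfg: LoraConfig) -> None:
    """Dispatch to the source-specific loader based on lora_ckpt_source."""
    if cfg.lora_ckpt_source == "hf":
        load_torch_hf_lora(cfg)
    elif cfg.lora_ckpt_source == "nemo":
        load_torch_nemo_lora(cfg)
    else:
        raise ValueError(
            f"Unsupported LoRA checkpoint source: {cfg.lora_ckpt_source!r}")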

Summary by CodeRabbit

  • New Features

    • Added support for loading and applying NeMo-format LoRA (Low-Rank Adaptation) checkpoints, including robust discovery of .nemo files and validation of LoRA sources.
    • Enhanced handling of key-value attention heads for better compatibility with models using grouped query attention (GQA).
    • Added unified PyTorch LoRA loader supporting multiple checkpoint sources with source-based dispatch.
    • Introduced a new field to specify LoRA checkpoint source, enabling source-aware loading logic.
  • Bug Fixes

    • Restricted custom vocabulary and embedding loading to Hugging Face LoRA checkpoints, preventing unintended behavior with other sources (a sketch of this guard follows the summary).
    • Improved handling of missing feed-forward multiplier values to prevent errors during model configuration.
    • Added warnings and safeguards for non-uniform key-value attention heads per layer in LoRA modules.
  • Tests

    • Introduced comprehensive tests for NeMo LoRA integration, including GQA support and validation of unsupported module configurations.
    • Added utilities to generate mock NeMo LoRA checkpoints for testing.
  • Documentation

    • Improved docstrings and type annotations for LoRA-related functions, clarifying usage and expected behavior.
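
Regarding the custom-vocabulary restriction above, a minimal sketch of the source guard, assuming the lora_ckpt_source field from this PR; the helper name below is hypothetical:

def load_custom_vocab_embeddings(model, lora_config):
    # hypothetical stand-in for the HF-specific vocab/embedding loading
    print("loading custom vocab/embeddings from", lora_config.lora_dir)

def maybe_load_custom_vocab(model, lora_config):
    # Only HF LoRA checkpoints may carry custom vocabulary/embedding tensors;
    # other sources (e.g. "nemo") skip this path entirely.
    if lora_config is not None and lora_config.lora_ckpt_source == "hf":
        load_custom_vocab_embeddings(model, lora_config)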

@venkywonka venkywonka requested review from a team, Naveassaf, byshiue, shaharmor98 and tijyojwad and removed request for a team July 14, 2025 19:30
@venkywonka venkywonka force-pushed the user/venky/nemo-ckpt-lora-load-pyt branch from f6b09d7 to 18ab327 on July 14, 2025 20:01
@venkywonka venkywonka marked this pull request as ready for review July 14, 2025 20:02
@venkywonka venkywonka requested review from a team as code owners July 14, 2025 20:02
@venkywonka
Collaborator Author

/bot run --extra-stage "H100_PCIe-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #11846 [ run ] triggered by Bot

@venkywonka venkywonka self-assigned this Jul 14, 2025
@venkywonka venkywonka requested a review from Copilot July 14, 2025 20:24

@venkywonka venkywonka requested a review from amitz-nv July 14, 2025 20:28
@tensorrt-cicd
Collaborator

PR_Github #11846 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8777 completed with status: 'FAILURE'

@venkywonka
Collaborator Author

/bot run --extra-stage "H100_PCIe-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #11861 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11861 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #8790 completed with status: 'FAILURE'

Collaborator

@amitz-nv amitz-nv left a comment


Nice work :)
A few small comments, and some questions I'm not sure about.

@venkywonka venkywonka force-pushed the user/venky/nemo-ckpt-lora-load-pyt branch from 4bfb756 to 11da27b on July 15, 2025 16:21
@venkywonka venkywonka requested a review from a team as a code owner July 15, 2025 16:21
@venkywonka venkywonka requested a review from nv-guomingz July 15, 2025 16:21
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11967 [ run ] triggered by Bot

@venkywonka venkywonka requested review from amitz-nv and Copilot July 15, 2025 18:41
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds end-to-end support for loading NeMo-formatted LoRA checkpoints in the PyTorch workflow. Key changes include:

  • Introducing helper functions (find_nemo_files, _find_nemo_files_single_path), enhancing NemoLoraLoader, and adding load_torch_nemo_lora with routing via load_torch_lora.
  • Updating the executor, request, and model initialization code to respect a new lora_ckpt_source flag and only apply HF-specific logic when appropriate.
  • Adding unit tests that generate a minimal .nemo archive and verify loader behavior, and updating the integration test matrix.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/unittest/llmapi/test_llm_pytorch.py: Add NeMo LoRA checkpoint generation helper and loader unit tests
  • tests/integration/test_lists/test-db/l0_h100.yml: Update the integration test list to include the new NeMo LoRA tests
  • tensorrt_llm/lora_manager.py: Introduce file-finding utilities, PyTorch routing, and docstrings
  • tensorrt_llm/executor/worker.py: Pass ckpt_source through when loading adapters
  • tensorrt_llm/executor/request.py: Extend LoRARequest with lora_ckpt_source and validation (see the sketch after this list)
  • tensorrt_llm/_torch/pyexecutor/_util.py: Route via the new load_torch_lora instead of the HF-only loader
  • tensorrt_llm/_torch/models/modeling_utils.py: Guard HF-only LoRA head checks under lora_ckpt_source == "hf"
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py: Same HF-only guard for custom vocab in the NeMo flow
  • tensorrt_llm/_torch/models/modeling_llama.py: Same HF-only guard for custom vocab in the NeMo flow
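
A hedged sketch of the LoRARequest extension noted above; the lora_ckpt_source field and the ckpt_source accessor appear in this PR, while the validation body here is an assumption:

from dataclasses import dataclass

@dataclass
class LoRARequest:  # trimmed to the fields relevant here
    lora_name: str
    lora_int_id: int
    lora_path: str = ""
    lora_ckpt_source: str = "hf"  # new field added by this PR

    def __post_init__(self):
        if self.lora_ckpt_source not in ("hf", "nemo"):
            raise ValueError(f"lora_ckpt_source must be 'hf' or 'nemo', "
                             f"got {self.lora_ckpt_source!r}")

    @property
    def ckpt_source(self) -> str:  # consumed as lora_request.ckpt_source
        return self.lora_ckpt_source
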
Comments suppressed due to low confidence (4)

tensorrt_llm/lora_manager.py:383

  • [nitpick] There are now two similarly named functions (load_nemo_lora and load_torch_nemo_lora), which can be confusing. Consider renaming one to clarify their distinct purposes.
def load_nemo_lora(model, lora_config: LoraConfig):

tensorrt_llm/executor/worker.py:353

  • The call to load_from_ckpt now includes a ckpt_source keyword—please confirm that the load_from_ckpt signature accepts this parameter to avoid unexpected keyword argument errors.
            ckpt_source=lora_request.ckpt_source)

tests/unittest/llmapi/test_llm_pytorch.py:41

  • The helper uses tempfile.TemporaryDirectory() but I don’t see import tempfile in this diff. Please verify that import tempfile is present at the top of the test file to avoid a NameError.
def create_mock_nemo_lora_checkpoint(
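
For context, a sketch of what such a helper can look like; the archive layout (a tarball holding a torch ckpt plus a YAML config) matches NeMo conventions, but the exact key and file names here are assumptions:

import tarfile
import tempfile
from pathlib import Path

import torch
import yaml

def create_mock_nemo_lora_checkpoint(out_dir: Path, hidden_size: int = 64,
                                     rank: int = 8, num_layers: int = 2) -> Path:
    """Write a minimal .nemo archive containing fused attn_qkv LoRA weights."""
    weights = {}
    for layer in range(num_layers):
        prefix = (f"model.language_model.encoder.layers.{layer}"
                  ".self_attention.adapter_layer.lora_kqv_adapter")
        weights[f"{prefix}.linear_in.weight"] = torch.randn(rank, hidden_size)
        weights[f"{prefix}.linear_out.weight"] = torch.randn(3 * hidden_size, rank)
    nemo_path = out_dir / "mock_lora.nemo"
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "model_weights.ckpt"
        torch.save(weights, ckpt)
        cfg = Path(tmp) / "model_config.yaml"
        cfg.write_text(yaml.safe_dump(
            {"peft": {"lora_tuning": {"adapter_dim": rank}}}))
        with tarfile.open(nemo_path, "w") as tar:  # .nemo files are tarballs
            tar.add(ckpt, arcname=ckpt.name)
            tar.add(cfg, arcname=cfg.name)
    return nemo_path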

@tensorrt-cicd
Collaborator

PR_Github #11967 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8882 completed with status: 'SUCCESS'

@venkywonka venkywonka force-pushed the user/venky/nemo-ckpt-lora-load-pyt branch from aeae974 to 56266c8 on July 22, 2025 17:52
@venkywonka
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
tensorrt_llm/_torch/model_config.py (1)

300-342: LGTM! Comprehensive KV heads handling with proper fallbacks.

The enhanced logic properly handles both uniform and per-layer KV head configurations with appropriate fallbacks and LoRA compatibility validation. The TP/CP scaling is applied consistently.

One minor formatting issue to address:

-                    # For uniform models, check: num_key_value_heads (standard) -> num_query_groups (NeMo) -> num_attention_heads
+                    # For uniform models, check: num_key_value_heads (standard) -> 
+                    # num_query_groups (NeMo) -> num_attention_heads
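
For reference, a minimal sketch of the fallback chain that comment mentions; resolve_num_kv_heads is an invented name, and the real code in model_config.py additionally applies TP/CP scaling, omitted here:

def resolve_num_kv_heads(config) -> int:
    # Uniform-model fallback: num_key_value_heads (standard HF name)
    # -> num_query_groups (NeMo's name) -> num_attention_heads (MHA: KV == Q).
    for attr in ("num_key_value_heads", "num_query_groups",
                 "num_attention_heads"):
        value = getattr(config, attr, None)
        if value is not None:
            return value
    raise ValueError("config defines no attention head count")
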
tests/unittest/llmapi/test_llm_pytorch.py (1)

493-567: LGTM! Comprehensive integration test for GQA NeMo LoRA support.

This test excellently validates the entire pipeline from checkpoint creation to generation, with proper deterministic setup and validation that LoRA has a measurable effect. The GQA configuration matches TinyLlama's specifications perfectly.

Fix the line length issues flagged by static analysis:

-    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and applied to a TinyLlama model,
+    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and
+       applied to a TinyLlama model,
-       and that generation with this LoRA produces a deterministic, expected output for a fixed prompt and temperature=0.0.
+       and that generation with this LoRA produces a deterministic, expected output for a fixed
+       prompt and temperature=0.0.
-    The test uses a deterministic dummy LoRA checkpoint (seed=42) and checks both the positive (LoRA applied) and negative
+    The test uses a deterministic dummy LoRA checkpoint (seed=42) and checks both the positive
+    (LoRA applied) and negative
tensorrt_llm/lora_manager.py (1)

349-385: LGTM! Enhanced loader with honest documentation about design limitations.

The comprehensive documentation and new get_target_modules method improve the class significantly. The honest acknowledgment of the misleading parameter name lora_dirs in the docstring is good practice.

Consider renaming the parameter in a future version:

-    def __init__(self, lora_dirs: List[str]):
+    def __init__(self, lora_paths: List[str]):  # Future improvement
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aeae974 and 56266c8.

📒 Files selected for processing (10)
  • tensorrt_llm/_torch/model_config.py (3 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
  • tensorrt_llm/executor/request.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tensorrt_llm/lora_manager.py (10 hunks)
  • tests/unittest/llmapi/lora_test_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py
  • tensorrt_llm/executor/request.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tests/unittest/llmapi/lora_test_utils.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/lora_manager.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/model_config.py

321-321: Line too long (129 > 120)

(E501)

tests/unittest/llmapi/test_llm_pytorch.py

498-498: Line too long (124 > 120)

(E501)


499-499: Line too long (123 > 120)

(E501)


503-503: Line too long (122 > 120)

(E501)

🔇 Additional comments (12)
tensorrt_llm/_torch/model_config.py (2)

363-366: LGTM! Correct routing to C++ binding methods.

The logic properly sets the KV heads configuration on the C++ model config based on whether per-layer or uniform KV heads are being used.


416-419: LGTM! Safe handling of None ffn_mult values.

The defensive programming approach correctly handles cases where some layers' ffn_mult attributes may be None by substituting zero, preventing runtime errors during max computation.
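
The pattern being praised reduces to a None-safe max; a tiny illustration with made-up values:

ffn_mults = [1.5, None, 4.0]  # hypothetical per-layer ffn_mult values
max_ffn_mult = max((m if m is not None else 0) for m in ffn_mults)
assert max_ffn_mult == 4.0  # a None entry no longer breaks max()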

tests/unittest/llmapi/test_llm_pytorch.py (2)

432-467: LGTM! Well-structured unit test for NeMo LoRA loading.

The parameterized test effectively covers different LoRA rank configurations and validates that the loader correctly sets up the configuration with proper module mapping.


469-490: LGTM! Important validation test for error handling.

The test properly validates that the loader rejects unsupported module configurations and raises appropriate errors, which is crucial for user guidance.

tensorrt_llm/lora_manager.py (8)

26-60: LGTM! Excellent addition of type annotations and comprehensive documentation.

The enhanced function signature with type hints and detailed docstring significantly improves code clarity and maintainability without changing the core functionality.


69-157: LGTM! Consistent improvement in type safety and documentation.

The addition of comprehensive type annotations and docstrings follows good practices and makes these complex functions much more understandable, especially the callback signature documentation in iterate_hf_lora.


169-188: LGTM! Clear documentation for module mapping inversion.

The type annotations and docstring make this potentially confusing operation much clearer, especially the clarification that HF module names can be either strings or lists of strings.
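
A small sketch of the inversion being documented; the module names are representative, not the library's full table:

trtllm_to_hf = {
    "attn_qkv": ["q_proj", "k_proj", "v_proj"],  # fused: list of HF names
    "mlp_4h_to_h": "down_proj",                  # single HF name (a string)
}
hf_to_trtllm = {}
for trtllm_name, hf_names in trtllm_to_hf.items():
    for hf_name in ([hf_names] if isinstance(hf_names, str) else hf_names):
        hf_to_trtllm[hf_name] = trtllm_name
# hf_to_trtllm now maps q_proj/k_proj/v_proj -> attn_qkv, down_proj -> mlp_4h_to_h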


280-347: LGTM! Well-designed file discovery with intelligent caching strategy.

The LRU caching on individual paths is a smart optimization that maximizes cache efficiency when paths are reused across different collections. The comprehensive error handling provides clear user feedback for various failure scenarios.
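
A sketch of the caching strategy described; the two function names appear in this PR, while the bodies here only illustrate the per-path LRU idea:

from functools import lru_cache
from pathlib import Path
from typing import List, Tuple

@lru_cache(maxsize=None)
def _find_nemo_files_single_path(path: str) -> Tuple[str, ...]:
    """Resolve one path to .nemo files; cached so repeated paths are free."""
    p = Path(path)
    if p.is_file() and p.suffix == ".nemo":
        return (str(p),)
    if p.is_dir():
        found = tuple(str(f) for f in sorted(p.glob("*.nemo")))
        if found:
            return found
    raise ValueError(f"No .nemo file found at {path}")

def find_nemo_files(paths: List[str]) -> List[str]:
    """Flatten .nemo discovery over a mixed list of files and directories."""
    files: List[str] = []
    for path in paths:
        files.extend(_find_nemo_files_single_path(path))
    return files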


388-396: LGTM! Important addition of validation logic.

The validation check prevents silent failures when NeMo LoRA loading fails, providing clear error messages to users.


442-485: LGTM! Well-designed PyTorch-specific NeMo LoRA loader.

The function provides a clean abstraction for NeMo LoRA loading with appropriate validation, clear error messages, and proper documentation of current limitations. The hardcoded "attn_qkv" mapping aligns with NeMo's current supported functionality.
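
A hedged sketch of that validation (the function name and message wording are assumptions; the attn_qkv-only limitation is stated in this PR):

SUPPORTED_NEMO_LORA_MODULES = {"attn_qkv"}  # NeMo fuses Q/K/V into one adapter

def validate_nemo_lora_modules(target_modules) -> None:
    unsupported = set(target_modules) - SUPPORTED_NEMO_LORA_MODULES
    if unsupported:
        raise ValueError(
            "NeMo LoRA checkpoints currently support only the fused "
            f"'attn_qkv' adapter; unsupported modules: {sorted(unsupported)}")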


487-507: LGTM! Clean dispatcher pattern for LoRA checkpoint loading.

The dispatcher function provides a clean abstraction that routes to appropriate loaders based on checkpoint source, with proper error handling for unsupported sources.


772-777: LGTM! Proper integration of file discovery functionality.

The integration of find_nemo_files in the LoraManager ensures that NeMo checkpoint loading properly handles both files and directories, maintaining consistency with the enhanced loader capabilities.

@tensorrt-cicd
Collaborator

PR_Github #12599 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12599 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9375 completed with status: 'SUCCESS'

Collaborator

@syuoni syuoni left a comment


The LLM part changes look good to me.

@venkywonka venkywonka enabled auto-merge (squash) July 23, 2025 02:33
@venkywonka venkywonka merged commit 9538c8d into NVIDIA:main Jul 23, 2025
3 checks passed
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
