
Conversation

@venkywonka (Collaborator) commented on Jul 23, 2025

Summary by CodeRabbit

  • New Features

    • Added support for loading LoRA adapters from both Hugging Face and NeMo checkpoints, including robust handling of NeMo-format LoRA weights.
    • Enhanced LoRA loading with improved error messages, diagnostics, and validation for missing or unsupported modules.
    • Enabled LoRA checkpoint source selection and validation in requests and model loading.
    • Added utility for creating mock NeMo LoRA checkpoints for testing.
  • Bug Fixes

    • Improved handling of per-layer key-value attention heads and FFN multipliers for models with non-uniform configurations.
    • Restricted custom vocabulary and lm_head loading to Hugging Face LoRA checkpoints only.
  • Tests

    • Introduced comprehensive tests for NeMo LoRA adapter loading, validation, and generation correctness, including support for grouped query attention (GQA).
    • Added integration test coverage for LoRA manager functionality.
  • Documentation

    • Added and refined docstrings and type annotations for LoRA-related functions and classes to improve clarity.

coderabbitai bot (Contributor) commented on Jul 23, 2025

Walkthrough

This update introduces robust NeMo LoRA checkpoint loading support, including new file discovery logic, enhanced error handling, and explicit validation for both NeMo and HuggingFace LoRA adapters. It restricts custom vocab and head loading to HF sources, updates model config logic for per-layer KV heads, and adds comprehensive tests and utilities for NeMo LoRA.

Changes

Cohort / File(s): Change Summary

  • NeMo & HF LoRA Adapter Loading Logic (tensorrt_llm/lora_manager.py): Refactored and extended NeMo/HF LoRA loading: added type annotations, docstrings, error handling, LRU-cached file discovery, PyTorch loader functions, module validation, and improved diagnostics for missing/unsupported modules.
  • Model Config: KV Heads Support (tensorrt_llm/_torch/model_config.py): Updated to support both uniform and per-layer KV attention heads; improved FFN multiplier inference robustness.
  • LoRA Custom Vocab/Head Loading Restriction (tensorrt_llm/_torch/models/modeling_llama.py, tensorrt_llm/_torch/models/modeling_nemotron_nas.py, tensorrt_llm/_torch/models/modeling_utils.py): Now loads custom vocabularies and lm_head from the LoRA directory only if the checkpoint source is "hf", not for NeMo or other sources.
  • LoRA Loader Routing & KV Head Handling in PyExecutor (tensorrt_llm/_torch/pyexecutor/_util.py): Switched to the generic load_torch_lora loader; added logic to handle non-uniform KV heads for LoRA module creation, with warnings for untested paths.
  • LoRARequest Extension (tensorrt_llm/executor/request.py): Added a lora_ckpt_source field (default "hf") with validation to LoRARequest; added a property for accessing the checkpoint source.
  • LoRA Adapter Loading in Worker (tensorrt_llm/executor/worker.py): Passes ckpt_source from LoRARequest when loading LoRA adapters.
  • Integration Test List Update (tests/integration/test_lists/test-db/l0_a100.yml): Added the new LoRA manager unit test to the pre-merge test suite for the PyTorch backend on A100/Ubuntu.
  • NeMo LoRA Test Utilities (tests/unittest/llmapi/lora_test_utils.py): Added a utility to create mock NeMo LoRA checkpoints for testing, supporting GQA and deterministic weights.
  • NeMo LoRA PyTorch Tests (tests/unittest/llmapi/test_llm_pytorch.py): Added tests for the NeMo LoRA loader, validation of unsupported modules, and end-to-end GQA NeMo LoRA generation correctness.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LoRARequest
    participant Worker
    participant LoraManager
    participant NemoLoraLoader
    participant Model

    User->>LoRARequest: Create (with lora_ckpt_source)
    LoRARequest->>Worker: Submit request
    Worker->>LoraManager: load_from_ckpt(..., ckpt_source)
    alt ckpt_source == "nemo"
        LoraManager->>NemoLoraLoader: find_nemo_files(...)
        NemoLoraLoader->>LoraManager: Return .nemo files
        LoraManager->>Model: Load NeMo LoRA weights
    else ckpt_source == "hf"
        LoraManager->>Model: Load HF LoRA weights
    end
    Model-->>User: Ready for inference
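To make the request-side flow above concrete, here is a minimal, hedged usage sketch. The lora_ckpt_source field and its "hf"/"nemo" validation are what this PR describes; the other constructor fields (lora_name, lora_int_id, lora_path) are assumed from the existing LoRARequest shape and may differ.

```python
from tensorrt_llm.executor.request import LoRARequest

# Illustrative only: lora_name/lora_int_id/lora_path are assumed field names.
nemo_request = LoRARequest(
    lora_name="my_nemo_adapter",        # adapter identifier (assumed field)
    lora_int_id=1,                      # numeric adapter id (assumed field)
    lora_path="/path/to/adapter.nemo",  # file or directory holding the checkpoint
    lora_ckpt_source="nemo",            # added by this PR; defaults to "hf"
)

# Per the PR, any value other than "hf" or "nemo" should fail validation.
assert nemo_request.lora_ckpt_source == "nemo"
```

The worker then forwards the checkpoint source to LoraManager, which routes to the NeMo or HF loading path as shown in the diagram.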

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Documentation, Community want to contribute

Suggested reviewers

  • shaharmor98
  • nv-guomingz
  • tijyojwad
  • litaotju

Poem

A bunny hops through code so neat,
NeMo and HF now both complete!
With loaders smart and warnings clear,
LoRA's magic draws quite near.
Tests and configs, all aligned,
Robust adapters, well-defined.
🐇✨ Hooray for LoRA—refined!



@venkywonka marked this pull request as draft on July 23, 2025 03:37
@venkywonka marked this pull request as ready for review on July 23, 2025 03:38
@@ -0,0 +1,518 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
venkywonka (Collaborator, Author) commented on this line:


Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

coderabbitai bot changed the title from "@coderabbitai" to "Improve LoRA adapter loading robustness and add comprehensive unit tests" on Jul 23, 2025
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/lora_manager.py (1)

1232-1234: Consider using logging for consistency

Same as the NeMo loader - consider using logging.info() instead of print() for better control and consistency.

🧹 Nitpick comments (1)
tensorrt_llm/lora_manager.py (1)

959-961: Consider using logging instead of print

While print() works, consider using Python's logging module for better control over output verbosity and consistency with the rest of the codebase.

-        if new_uids:
-            print(f"Successfully loaded NeMo LoRA adapters with UIDs: {new_uids}")
+        if new_uids:
+            import logging
+            logging.info(f"Successfully loaded NeMo LoRA adapters with UIDs: {new_uids}")
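For reference, a hedged alternative to plain logging is the repository's own logger (tensorrt_llm/logger.py, which the code-graph section of the later review also points at); this keeps verbosity control consistent with the rest of the codebase. A minimal sketch, assuming the logger exposes info():

```python
from tensorrt_llm.logger import logger

new_uids = ["lora-0", "lora-1"]  # illustrative UIDs, not real output
if new_uids:
    logger.info(f"Successfully loaded NeMo LoRA adapters with UIDs: {new_uids}")
```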
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2193ad3 and 73f2a38.

📒 Files selected for processing (3)
  • tensorrt_llm/lora_manager.py (6 hunks)
  • tests/integration/test_lists/test-db/l0_a100.yml (1 hunks)
  • tests/unittest/test_lora_manager.py (1 hunks)

🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/lora_manager.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🔇 Additional comments (6)
tensorrt_llm/lora_manager.py (4)

5-5: LGTM!

The warnings module import is correctly added and properly placed with other standard library imports.


846-855: Consider edge case with non-contiguous layer indices

The logic for inferring num_layers from available weights could be affected by gaps in layer indices. For example, if only layers 0 and 2 have weights, this would infer 3 layers. However, I see that the subsequent code handles missing layers gracefully by creating zero tensors, so this should work correctly.
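A small, hedged illustration of that behaviour (the names are illustrative, not the actual lora_manager internals): with weights present only for layers 0 and 2, the layer count is inferred as 3 and the gap at layer 1 is filled with zero tensors.

```python
import torch

# Weights exist only for layers 0 and 2; layer 1 is missing.
all_lora_weights = {0: {"in": torch.randn(8, 512)}, 2: {"in": torch.randn(8, 512)}}
num_layers = max(all_lora_weights.keys()) + 1  # infers 3 despite the gap

for layer_idx in range(num_layers):
    layer_weights = all_lora_weights.get(layer_idx)
    if layer_weights is None:
        # Missing layers are tolerated by substituting zero tensors,
        # so the adapter still loads end to end.
        layer_weights = {"in": torch.zeros(8, 512)}
```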


870-920: Excellent error handling and fallback logic!

The implementation robustly handles missing matrices with appropriate warnings and zero tensor creation. The rank determination logic has multiple fallback strategies that ensure the adapter can still be loaded even with incomplete checkpoints. The tensor dimensions are correct for NeMo's fused QKV architecture.


1127-1137: Good defensive handling of missing matrices

The code properly handles missing matrices by creating appropriately sized zero tensors and warning the user. This ensures the adapter can still be loaded even with incomplete checkpoints.

tests/integration/test_lists/test-db/l0_a100.yml (1)

18-18: LGTM!

The new unit test file is correctly added to the pre-merge PyTorch test suite.

tests/unittest/test_lora_manager.py (1)

1-519: Excellent test coverage!

This comprehensive test suite thoroughly validates the enhanced LoRA loading logic with:

  • Good coverage of edge cases for both HF and NeMo formats
  • Proper testing of warning messages and error handling
  • Effective use of mocking to verify internal behavior
  • Regression tests to prevent known issues
  • Well-structured code with reusable helper methods

The tests provide confidence that the robustness improvements work as intended.

@venkywonka requested review from byshiue and shaharmor98, and removed the review request for shaharmor98, on July 23, 2025 03:44
@venkywonka marked this pull request as draft on July 23, 2025 03:44
Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka changed the title from "Improve LoRA adapter loading robustness and add comprehensive unit tests" to "@coderabbitai title" on Jul 29, 2025
@venkywonka force-pushed the user/venky/robust-lora-manager branch from 73f2a38 to bc9f9dd on July 29, 2025 00:19
coderabbitai bot changed the title from "@coderabbitai title" to "Add NeMo LoRA checkpoint loading, validation, and related utilities" on Jul 29, 2025
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
tensorrt_llm/_torch/model_config.py (1)

301-342: Comprehensive support for per-layer KV heads configuration.

The implementation properly handles both uniform and per-layer KV heads with:

  • Appropriate attribute priority checking (num_kv_heads_per_layer → num_key_value_heads → num_query_groups → num_attention_heads)
  • Correct TP/CP scaling applied per layer
  • Essential validation that LoRA requires uniform KV heads

This aligns well with the broader PR changes for NeMo LoRA support.

Consider breaking line 321 to comply with the 120-character limit:

-                    # For uniform models, check: num_key_value_heads (standard) -> num_query_groups (NeMo) -> num_attention_heads
+                    # For uniform models, check: num_key_value_heads (standard) ->
+                    # num_query_groups (NeMo) -> num_attention_heads
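To make the resolution order concrete, here is a hedged sketch; the attribute names follow the priority described above, but the helper itself is illustrative rather than the actual model_config.py code.

```python
def resolve_kv_heads_per_layer(hf_config, num_layers, tp_size, cp_size, lora_enabled):
    """Illustrative resolution of per-layer KV heads (not the real implementation)."""
    if getattr(hf_config, "num_kv_heads_per_layer", None) is not None:
        per_layer = list(hf_config.num_kv_heads_per_layer)
    else:
        # Uniform models: num_key_value_heads (standard) -> num_query_groups (NeMo)
        # -> num_attention_heads (MHA fallback).
        uniform = (getattr(hf_config, "num_key_value_heads", None)
                   or getattr(hf_config, "num_query_groups", None)
                   or hf_config.num_attention_heads)
        per_layer = [uniform] * num_layers

    # TP/CP scaling applied per layer.
    per_layer = [max(1, kv // (tp_size * cp_size)) for kv in per_layer]

    if lora_enabled and len(set(per_layer)) > 1:
        raise ValueError("LoRA currently requires a uniform number of KV heads per layer")
    return per_layer
```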
tests/unittest/llmapi/test_llm_pytorch.py (1)

493-567: Comprehensive test for GQA support with NeMo LoRA.

This test provides excellent coverage by:

  • Using realistic TinyLlama GQA configuration
  • Testing deterministic output with fixed seed
  • Verifying LoRA effect by comparing with non-LoRA output
  • Including proper resource cleanup

The dependency on seed=42 for the expected output is clearly documented.

Consider breaking the long lines in the docstring to comply with the 120-character limit:

-    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and applied to a TinyLlama model,
-       and that generation with this LoRA produces a deterministic, expected output for a fixed prompt and temperature=0.0.
+    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and applied to a
+       TinyLlama model, and that generation with this LoRA produces a deterministic, expected output for a
+       fixed prompt and temperature=0.0.
-    2. That the LoRA weights have a significant effect: generating with LoRA produces a different output than generating
-       without LoRA, confirming that the LoRA adapter is actually being applied.
+    2. That the LoRA weights have a significant effect: generating with LoRA produces a different output than
+       generating without LoRA, confirming that the LoRA adapter is actually being applied.
tensorrt_llm/lora_manager.py (3)

351-387: Consider tracking the parameter naming issue for future refactoring.

The docstring correctly identifies that lora_dirs is misleading since it accepts both files and directories. While the note is helpful, consider creating a tracking issue to rename this parameter to lora_paths in a future major version to avoid confusion.


825-826: Consider validating the unpacked weights structure.

Before processing the weights, consider adding validation to ensure the unpacked NeMo weights have the expected structure to fail fast with a clear error message.

 nemo_model_config, nemo_weights = unpack_nemo_weights(model_file)
+if not isinstance(nemo_weights, dict):
+    raise ValueError(f"Invalid NeMo weights structure in {model_file}: expected dict, got {type(nemo_weights)}")
 all_lora_weights = get_all_nemo_lora_weights(nemo_weights)

875-898: Consider extracting rank determination logic.

The rank determination logic is complex with multiple fallbacks. Consider extracting it into a separate helper function for better readability and testability.

+def _determine_lora_rank(config_rank, layer_idx, layer_weights, all_weights):
+    """Determine LoRA rank with fallback strategies."""
+    if config_rank is not None:
+        return config_rank
+    
+    # Try current layer
+    if "in" in layer_weights:
+        return layer_weights["in"].shape[0]
+    elif "out" in layer_weights:
+        return layer_weights["out"].shape[1]
+    
+    # Try other layers
+    for other_layer_idx in sorted(all_weights.keys()):
+        if other_layer_idx != layer_idx and "in" in all_weights[other_layer_idx]:
+            return all_weights[other_layer_idx]["in"].shape[0]
+    
+    # Final fallback
+    warnings.warn(
+        f"Layer {layer_idx}: No reference rank found for attn_qkv, using default rank 64"
+    )
+    return 64

 # Determine rank: prefer config rank, fall back to tensor-based derivation
-rank = config_rank
-if rank is None:
-    if "in" in layer_weights:
-        rank = layer_weights["in"].shape[0]
-    elif "out" in layer_weights:
-        rank = layer_weights["out"].shape[1]
-    else:
-        # Both matrices missing - look for rank from other layers or use default
-        for other_layer_idx in sorted(all_lora_weights.keys()):
-            if (
-                other_layer_idx != layer_idx
-                and "in" in all_lora_weights[other_layer_idx]
-            ):
-                rank = all_lora_weights[other_layer_idx]["in"].shape[0]
-                break
-        if rank is None:
-            # Final fallback to a reasonable default rank
-            rank = 64
-            warnings.warn(
-                f"Layer {layer_idx}: No reference rank found for attn_qkv, "
-                f"using default rank {rank}"
-            )
+rank = _determine_lora_rank(config_rank, layer_idx, layer_weights, all_lora_weights)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73f2a38 and bc9f9dd.

📒 Files selected for processing (11)
  • tensorrt_llm/_torch/model_config.py (3 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
  • tensorrt_llm/executor/request.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tensorrt_llm/lora_manager.py (15 hunks)
  • tests/integration/test_lists/test-db/l0_a100.yml (1 hunks)
  • tests/unittest/llmapi/lora_test_utils.py (2 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/executor/worker.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_a100.yml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case, and prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py
  • tensorrt_llm/executor/request.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/model_config.py
  • tests/unittest/llmapi/lora_test_utils.py
  • tensorrt_llm/lora_manager.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/models/modeling_nemotron_nas.py
  • tensorrt_llm/executor/request.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/model_config.py
  • tests/unittest/llmapi/lora_test_utils.py
  • tensorrt_llm/lora_manager.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/models/modeling_llama.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tensorrt_llm/_torch/models/modeling_utils.py (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/_torch/pyexecutor/_util.py (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/executor/request.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tests/unittest/llmapi/test_llm_pytorch.py (1)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tests/unittest/llmapi/lora_test_utils.py (1)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/lora_manager.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🧬 Code Graph Analysis (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
tensorrt_llm/lora_manager.py (1)
  • HfLoraLoader (228-278)
tensorrt_llm/_torch/pyexecutor/_util.py (4)
tensorrt_llm/lora_manager.py (1)
  • load_torch_lora (488-507)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (242-262)
tensorrt_llm/logger.py (1)
  • warning (131-132)
tensorrt_llm/runtime/generation.py (2)
  • hidden_size (1152-1154)
  • num_heads (1148-1149)
tensorrt_llm/_torch/model_config.py (3)
tests/unittest/_torch/test_resource_manager.py (1)
  • num_kv_heads_per_layer (74-75)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • cp_size (38-39)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_config (242-262)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_pytorch.py

498-498: Line too long (124 > 120)

(E501)


499-499: Line too long (123 > 120)

(E501)


503-503: Line too long (122 > 120)

(E501)

tensorrt_llm/_torch/model_config.py

321-321: Line too long (129 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (36)
tensorrt_llm/executor/request.py (3)

28-28: LGTM!

The new lora_ckpt_source field is properly typed with a sensible default value for backward compatibility.


33-36: LGTM!

The validation logic is well-implemented with appropriate exception type and clear error messaging.


50-52: LGTM!

The read-only property provides a clean interface to access the checkpoint source.
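Taken together, the three changes above might look roughly like the following hedged sketch; it mirrors the description, not the literal request.py diff, and the property name is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LoRARequestSketch:
    """Illustrative stand-in for the extended LoRARequest."""
    lora_name: str
    lora_int_id: int
    lora_path: str = ""
    lora_ckpt_source: str = "hf"  # new field; "hf" default keeps old requests working

    def __post_init__(self):
        if self.lora_ckpt_source not in ("hf", "nemo"):
            raise ValueError(
                f"lora_ckpt_source must be 'hf' or 'nemo', got '{self.lora_ckpt_source}'")

    @property
    def ckpt_source(self) -> str:
        # Read-only accessor the worker can pass through when loading the adapter.
        return self.lora_ckpt_source
```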

tensorrt_llm/_torch/models/modeling_utils.py (1)

367-373: LGTM!

The conditional check properly restricts custom lm_head loading to HF LoRA checkpoints only, preventing incompatible usage with NeMo checkpoints. The implementation is clean and well-commented.

tensorrt_llm/_torch/models/modeling_nemotron_nas.py (1)

195-201: LGTM!

The conditional check correctly restricts custom vocabulary loading to HF LoRA checkpoints, maintaining consistency with the pattern used across other model files. The implementation is clean and well-commented.

tensorrt_llm/_torch/models/modeling_llama.py (1)

706-712: LGTM!

The conditional check properly restricts custom vocabulary loading to HF LoRA checkpoints, following the consistent pattern established across all model files in this PR. The implementation is clean and well-documented.
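The guard shared by these three modeling files boils down to a single checkpoint-source check; a hedged sketch of the predicate (the attribute name is taken from the PR description, everything else is illustrative):

```python
from types import SimpleNamespace


def should_load_custom_vocab(lora_config) -> bool:
    """Custom vocabulary/lm_head overrides apply only to HF-format LoRA checkpoints."""
    return bool(lora_config) and getattr(lora_config, "lora_ckpt_source", "hf") == "hf"


assert should_load_custom_vocab(SimpleNamespace(lora_ckpt_source="hf"))
assert not should_load_custom_vocab(SimpleNamespace(lora_ckpt_source="nemo"))
```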

tensorrt_llm/_torch/pyexecutor/_util.py (3)

17-17: LGTM!

The import change from load_torch_hf_lora to load_torch_lora correctly aligns with the new loader routing logic that supports multiple checkpoint sources.


440-441: LGTM!

The function call update to use load_torch_lora is consistent with the enhanced LoRA loading framework that now routes to the appropriate loader based on the checkpoint source.


454-466: Conservative handling of non-uniform KV heads is valid and remains untested
Our review across model configs and tests did not uncover any pretrained setups that actually specify varying KV heads per layer. Retaining this fallback (using the max) along with the warning (TRTLLM-6561) is prudent for future heterogeneous-head models. No changes required.
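A hedged sketch of that conservative fallback (the helper name and warning text are illustrative; the real code uses the project logger and tracks the gap as TRTLLM-6561):

```python
import warnings


def kv_heads_for_lora_modules(num_kv_heads_per_layer):
    """Pick a single KV head count for LoRA module creation."""
    if len(set(num_kv_heads_per_layer)) > 1:
        warnings.warn("Non-uniform KV heads per layer; using the maximum for LoRA "
                      "module creation (untested path, TRTLLM-6561).")
    return max(num_kv_heads_per_layer)


assert kv_heads_for_lora_modules([8, 8, 8]) == 8
assert kv_heads_for_lora_modules([4, 8, 8]) == 8  # emits a warning
```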

tests/unittest/llmapi/lora_test_utils.py (2)

1-8: LGTM!

The imports are appropriate for creating mock NeMo LoRA checkpoints, including file operations, PyTorch tensors, and necessary utility functions.


124-234: Well-designed test utility for NeMo LoRA checkpoint creation.

The function provides comprehensive support for creating deterministic mock NeMo LoRA checkpoints with:

  • Proper parameter validation for head divisibility constraints
  • Correct GQA support with appropriate QKV dimension calculations
  • Clear documentation about the hardcoded coefficient's impact on tests
  • Pragmatic use of JSON for configuration serialization

The implementation correctly handles the NeMo checkpoint structure and will effectively support the new NeMo LoRA loading tests.
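For intuition, a heavily hedged sketch of what such a mock checkpoint could look like: a .nemo-style tar archive holding a config file plus a torch-saved dict of linear_in/linear_out adapter weights. The key layout and config schema below are assumptions for illustration only; the actual utility in lora_test_utils.py defines its own layout (including GQA dimensions and deterministic weights).

```python
import json
import os
import tarfile
import tempfile

import torch


def write_mock_nemo_lora(path, num_layers=2, rank=8, hidden=64):
    """Write a minimal mock .nemo LoRA archive (illustrative layout, not NeMo-exact)."""
    qkv_out = 3 * hidden  # fused QKV output dim for a toy MHA config
    with tempfile.TemporaryDirectory() as tmp:
        weights = {}
        for i in range(num_layers):
            prefix = f"model.layers.{i}.self_attention.adapter_layer.lora_kqv_adapter"
            weights[f"{prefix}.linear_in.weight"] = torch.zeros(rank, hidden)
            weights[f"{prefix}.linear_out.weight"] = torch.zeros(qkv_out, rank)
        torch.save(weights, os.path.join(tmp, "model_weights.ckpt"))
        with open(os.path.join(tmp, "model_config.yaml"), "w") as f:
            json.dump({"peft": {"lora_tuning": {"adapter_dim": rank}}}, f)
        with tarfile.open(path, "w") as tar:
            tar.add(os.path.join(tmp, "model_weights.ckpt"), arcname="model_weights.ckpt")
            tar.add(os.path.join(tmp, "model_config.yaml"), arcname="model_config.yaml")
```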

tensorrt_llm/_torch/model_config.py (2)

363-367: LGTM!

The code correctly updates the ModelConfigCpp with either per-layer KV heads or uniform KV heads based on the configuration.


417-419: Good defensive programming for None handling.

The fix properly handles None values in ffn_mult by replacing them with 0 before taking the maximum, preventing potential AttributeError exceptions.
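A tiny hedged illustration of that None handling (the values are hypothetical):

```python
ffn_mult_per_layer = [2.66, None, 2.66, 4.0]  # hypothetical per-layer multipliers
max_ffn_mult = max((m if m is not None else 0) for m in ffn_mult_per_layer)
assert max_ffn_mult == 4.0
```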

tests/unittest/llmapi/test_llm_pytorch.py (3)

8-8: LGTM!

The import of create_mock_nemo_lora_checkpoint is properly added to support the new NeMo LoRA tests.


432-467: Well-structured parameterized test for NeMo LoRA loading.

The test effectively validates the load_torch_nemo_lora function with different rank configurations and properly asserts the expected module configurations.


469-491: Good validation test for unsupported modules.

The test correctly verifies that attempting to use unsupported LoRA modules with NeMo format raises an appropriate ValueError.

tensorrt_llm/lora_manager.py (20)

5-5: LGTM! Good addition for user feedback.

Adding the warnings import is appropriate for the new warning messages throughout the code.


8-8: LGTM! Appropriate for caching optimization.

The lru_cache import is well-suited for the new _find_nemo_files_single_path function.


10-10: LGTM! Comprehensive type hints.

The expanded type imports improve type safety and code clarity.


27-62: LGTM! Well-structured NeMo weight extraction.

The function is well-documented with clear parameter descriptions and error handling. The regex pattern matching approach is efficient and the error messages are descriptive.


70-133: LGTM! Excellent documentation improvements.

The added docstring clearly explains the function's purpose, parameters, and return value. The type annotations improve code clarity.


135-159: LGTM! Clear documentation for HF weight extraction.

The docstring additions make the function's purpose and parameters clear.


170-190: LGTM! Well-documented utility function.

The docstring clearly explains the module mapping inversion logic with examples.


281-316: LGTM! Smart caching strategy for file discovery.

The function is well-designed with comprehensive error handling and an efficient caching strategy. Caching individual paths maximizes cache efficiency when the same paths appear in different collections.


317-349: LGTM! Efficient wrapper for batch file discovery.

The function effectively uses the cached helper to process multiple paths efficiently. The documentation clearly explains the optimization strategy.
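A hedged sketch of the cached single-path helper plus the batch wrapper; the function names mirror the review, but the bodies are illustrative rather than the actual lora_manager.py code.

```python
import glob
import os
from functools import lru_cache
from typing import List, Tuple


@lru_cache(maxsize=None)
def _find_nemo_files_single_path(path: str) -> Tuple[str, ...]:
    """Resolve one path to .nemo files (illustrative)."""
    if os.path.isfile(path) and path.endswith(".nemo"):
        return (path,)
    if os.path.isdir(path):
        return tuple(sorted(glob.glob(os.path.join(path, "*.nemo"))))
    raise ValueError(f"No .nemo checkpoint found at: {path}")


def find_nemo_files(paths: List[str]) -> List[str]:
    # Caching per individual path means repeated paths hit the cache even when
    # the surrounding collections differ.
    files: List[str] = []
    for p in paths:
        files.extend(_find_nemo_files_single_path(p))
    return files
```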


391-394: LGTM! Improved error handling.

The added validation with a clear error message improves the user experience when NeMo LoRA loading fails.


443-487: LGTM! Well-implemented NeMo LoRA loader for PyTorch.

The function is well-documented and includes proper validation for supported modules. The error messages clearly explain NeMo's limitations regarding module support.


488-509: LGTM! Clean routing implementation.

The function provides a clean abstraction for loading different LoRA checkpoint sources with clear error messages.
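The routing itself is small; a hedged sketch of the dispatch (the loader names come from this PR, but the exact signature of load_torch_lora here is an assumption):

```python
from tensorrt_llm.lora_manager import load_torch_hf_lora, load_torch_nemo_lora


def load_torch_lora_sketch(lora_config):
    """Dispatch to the HF or NeMo loader based on the checkpoint source."""
    ckpt_source = getattr(lora_config, "lora_ckpt_source", "hf")
    if ckpt_source == "hf":
        return load_torch_hf_lora(lora_config)
    if ckpt_source == "nemo":
        return load_torch_nemo_lora(lora_config)
    raise ValueError(
        f"Unsupported lora_ckpt_source '{ckpt_source}'; expected 'hf' or 'nemo'.")
```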


611-647: LGTM! Robust archive extraction with fallback paths.

The function handles both possible archive structures gracefully and includes comprehensive error handling.


773-782: LGTM! Proper integration with NeMo file discovery.

The method now correctly uses find_nemo_files to resolve .nemo files before passing them to the loader.


869-916: Excellent error handling for missing matrices.

The detailed error messages for missing "in" and "out" matrices provide clear guidance to users about incomplete or corrupted checkpoints. This will significantly improve the debugging experience.


961-963: LGTM! Helpful success message.

The success message with loaded UIDs helps users track which adapters were loaded.


1102-1106: LGTM! Helpful warning for skipped modules.

The warning message helps users understand why certain LoRA modules are being skipped.


1108-1108: LGTM! Clearer MoE detection logic.

Checking for the absence of both "in" and "out" keys makes the MoE detection more explicit and readable.


1129-1143: LGTM! Consistent error handling with NeMo loader.

The detailed error messages for missing matrices match the quality of the NeMo loader, providing consistent user experience across both checkpoint sources.


1238-1240: LGTM! Consistent success messaging.

The success message matches the NeMo loader's pattern.

@venkywonka (Collaborator, Author) commented:

closing this as the scope has been reduced to just logging.

@venkywonka closed this on Jul 29, 2025
