Add NeMo LoRA checkpoint loading, validation, and related utilities #6276
Conversation
Signed-off-by: Venky Ganesh <[email protected]>
Walkthrough

This update introduces robust NeMo LoRA checkpoint loading support, including new file discovery logic, enhanced error handling, and explicit validation for both NeMo and HuggingFace LoRA adapters. It restricts custom vocab and head loading to HF sources, updates model config logic for per-layer KV heads, and adds comprehensive tests and utilities for NeMo LoRA.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant LoRARequest
    participant Worker
    participant LoraManager
    participant NemoLoraLoader
    participant Model
    User->>LoRARequest: Create (with lora_ckpt_source)
    LoRARequest->>Worker: Submit request
    Worker->>LoraManager: load_from_ckpt(..., ckpt_source)
    alt ckpt_source == "nemo"
        LoraManager->>NemoLoraLoader: find_nemo_files(...)
        NemoLoraLoader->>LoraManager: Return .nemo files
        LoraManager->>Model: Load NeMo LoRA weights
    else ckpt_source == "hf"
        LoraManager->>Model: Load HF LoRA weights
    end
    Model-->>User: Ready for inference
```
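For orientation, a hypothetical user-side call matching the diagram; apart from `lora_ckpt_source`, the constructor arguments are assumptions about `LoRARequest`, not confirmed by this PR page:

```python
# Hypothetical usage; only lora_ckpt_source is confirmed by this PR.
from tensorrt_llm.executor.request import LoRARequest

lora_request = LoRARequest(
    lora_name="my-nemo-adapter",        # assumed field
    lora_int_id=1,                      # assumed field
    lora_path="/path/to/adapter.nemo",  # assumed field
    lora_ckpt_source="nemo",            # "hf" (default) or "nemo"
)
```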
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
tests/unittest/test_lora_manager.py (outdated)

A review comment on the file header suggests updating the copyright year:

```diff
@@ -0,0 +1,518 @@
- # SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/lora_manager.py (1)
1232-1234: Consider using logging for consistency. Same as the NeMo loader: consider using `logging.info()` instead of `print()` for better control and consistency.
🧹 Nitpick comments (1)
tensorrt_llm/lora_manager.py (1)
959-961: Consider using logging instead of print. While `print()` works, consider using Python's `logging` module for better control over output verbosity and consistency with the rest of the codebase.

```diff
- if new_uids:
-     print(f"Successfully loaded NeMo LoRA adapters with UIDs: {new_uids}")
+ if new_uids:
+     import logging
+     logging.info(f"Successfully loaded NeMo LoRA adapters with UIDs: {new_uids}")
```
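As a variant of the suggestion above, a minimal sketch of the module-level logger pattern (the helper name is hypothetical, and the codebase's own tensorrt_llm.logger wrapper may be preferable for consistency):

```python
import logging

logger = logging.getLogger(__name__)  # module-level logger, created once

def report_loaded_adapters(new_uids):
    """Hypothetical helper: log instead of printing unconditionally."""
    if new_uids:
        logger.info("Successfully loaded NeMo LoRA adapters with UIDs: %s", new_uids)
```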
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/lora_manager.py (6 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (1 hunks)
- tests/unittest/test_lora_manager.py (1 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/lora_manager.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🔇 Additional comments (6)
tensorrt_llm/lora_manager.py (4)
5-5: LGTM! The `warnings` module import is correctly added and properly placed with other standard library imports.
846-855: Consider edge case with non-contiguous layer indices. The logic for inferring `num_layers` from available weights could be affected by gaps in layer indices. For example, if only layers 0 and 2 have weights, this would infer 3 layers. However, I see that the subsequent code handles missing layers gracefully by creating zero tensors, so this should work correctly.
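To make the edge case concrete, here is a small illustration of the inference behavior described above (the function name and weight layout are invented for the example, not taken from the implementation):

```python
import torch

def infer_num_layers(layer_weights):
    """Infer layer count from the highest layer index present (illustrative)."""
    return max(layer_weights.keys()) + 1

# Only layers 0 and 2 carry weights -> 3 layers are inferred; the gap at
# layer 1 is later backfilled with zero tensors of matching shape.
weights = {0: torch.randn(8, 16), 2: torch.randn(8, 16)}
assert infer_num_layers(weights) == 3
```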
870-920: Excellent error handling and fallback logic! The implementation robustly handles missing matrices with appropriate warnings and zero tensor creation. The rank determination logic has multiple fallback strategies that ensure the adapter can still be loaded even with incomplete checkpoints. The tensor dimensions are correct for NeMo's fused QKV architecture.
1127-1137: Good defensive handling of missing matrices. The code properly handles missing matrices by creating appropriately sized zero tensors and warning the user. This ensures the adapter can still be loaded even with incomplete checkpoints.
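For illustration, a minimal sketch of this warn-and-zero-fill pattern (all names are hypothetical; only the behavior mirrors the comment above):

```python
import warnings
import torch

def matrix_or_zeros(layer_weights, key, shape, layer_idx):
    """Return the requested LoRA matrix, or a zero tensor with a warning."""
    if key in layer_weights:
        return layer_weights[key]
    warnings.warn(
        f"Layer {layer_idx}: missing '{key}' LoRA matrix; substituting zeros "
        f"of shape {shape} (checkpoint may be incomplete or corrupted)"
    )
    return torch.zeros(*shape)
```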
tests/integration/test_lists/test-db/l0_a100.yml (1)
18-18: LGTM! The new unit test file is correctly added to the pre-merge PyTorch test suite.
tests/unittest/test_lora_manager.py (1)
1-519: Excellent test coverage! This comprehensive test suite thoroughly validates the enhanced LoRA loading logic with:
- Good coverage of edge cases for both HF and NeMo formats
- Proper testing of warning messages and error handling
- Effective use of mocking to verify internal behavior
- Regression tests to prevent known issues
- Well-structured code with reusable helper methods
The tests provide confidence that the robustness improvements work as intended.
Signed-off-by: Venky Ganesh <[email protected]>
Force-pushed from 73f2a38 to bc9f9dd
Actionable comments posted: 0
🧹 Nitpick comments (5)
tensorrt_llm/_torch/model_config.py (1)
301-342: Comprehensive support for per-layer KV heads configuration. The implementation properly handles both uniform and per-layer KV heads with:
- Appropriate attribute priority checking (num_kv_heads_per_layer → num_key_value_heads → num_query_groups → num_attention_heads)
- Correct TP/CP scaling applied per layer
- Essential validation that LoRA requires uniform KV heads
This aligns well with the broader PR changes for NeMo LoRA support.
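For readers skimming the review, a hedged sketch of the priority chain from the list above (attribute names come from this comment, not from the actual model_config.py code):

```python
from types import SimpleNamespace

def resolve_kv_heads(config, tp_size=1, cp_size=1):
    """Resolve KV head count(s) using the priority chain described above."""
    per_layer = getattr(config, "num_kv_heads_per_layer", None)
    if per_layer is not None:
        # Per-layer KV heads, each scaled for tensor/context parallelism.
        return [max(1, n // (tp_size * cp_size)) for n in per_layer]
    # Uniform fallback chain: HF standard -> NeMo -> MHA default.
    n = (getattr(config, "num_key_value_heads", None)
         or getattr(config, "num_query_groups", None)
         or config.num_attention_heads)
    return max(1, n // (tp_size * cp_size))

cfg = SimpleNamespace(num_attention_heads=32, num_query_groups=4)  # NeMo-style GQA
assert resolve_kv_heads(cfg, tp_size=2) == 2
```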
Consider breaking line 321 to comply with the 120-character limit:

```diff
- # For uniform models, check: num_key_value_heads (standard) -> num_query_groups (NeMo) -> num_attention_heads
+ # For uniform models, check: num_key_value_heads (standard) ->
+ # num_query_groups (NeMo) -> num_attention_heads
```

tests/unittest/llmapi/test_llm_pytorch.py (1)
493-567: Comprehensive test for GQA support with NeMo LoRA. This test provides excellent coverage by:
- Using realistic TinyLlama GQA configuration
- Testing deterministic output with fixed seed
- Verifying LoRA effect by comparing with non-LoRA output
- Including proper resource cleanup
The dependency on seed=42 for the expected output is clearly documented.
Consider breaking the long lines in the docstring to comply with the 120-character limit:

```diff
-    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and applied to a TinyLlama model,
-       and that generation with this LoRA produces a deterministic, expected output for a fixed prompt and temperature=0.0.
+    1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and applied to a
+       TinyLlama model, and that generation with this LoRA produces a deterministic, expected output for a
+       fixed prompt and temperature=0.0.
-    2. That the LoRA weights have a significant effect: generating with LoRA produces a different output than generating
-       without LoRA, confirming that the LoRA adapter is actually being applied.
+    2. That the LoRA weights have a significant effect: generating with LoRA produces a different output than
+       generating without LoRA, confirming that the LoRA adapter is actually being applied.
```

tensorrt_llm/lora_manager.py (3)
351-387: Consider tracking the parameter naming issue for future refactoring. The docstring correctly identifies that `lora_dirs` is misleading since it accepts both files and directories. While the note is helpful, consider creating a tracking issue to rename this parameter to `lora_paths` in a future major version to avoid confusion.
825-826: Consider validating the unpacked weights structure. Before processing the weights, consider adding validation to ensure the unpacked NeMo weights have the expected structure, to fail fast with a clear error message.

```diff
 nemo_model_config, nemo_weights = unpack_nemo_weights(model_file)
+if not isinstance(nemo_weights, dict):
+    raise ValueError(f"Invalid NeMo weights structure in {model_file}: expected dict, got {type(nemo_weights)}")
 all_lora_weights = get_all_nemo_lora_weights(nemo_weights)
```
875-898: Consider extracting rank determination logic. The rank determination logic is complex with multiple fallbacks. Consider extracting it into a separate helper function for better readability and testability.

```diff
+def _determine_lora_rank(config_rank, layer_idx, layer_weights, all_weights):
+    """Determine LoRA rank with fallback strategies."""
+    if config_rank is not None:
+        return config_rank
+
+    # Try current layer
+    if "in" in layer_weights:
+        return layer_weights["in"].shape[0]
+    elif "out" in layer_weights:
+        return layer_weights["out"].shape[1]
+
+    # Try other layers
+    for other_layer_idx in sorted(all_weights.keys()):
+        if other_layer_idx != layer_idx and "in" in all_weights[other_layer_idx]:
+            return all_weights[other_layer_idx]["in"].shape[0]
+
+    # Final fallback
+    warnings.warn(
+        f"Layer {layer_idx}: No reference rank found for attn_qkv, using default rank 64"
+    )
+    return 64

 # Determine rank: prefer config rank, fall back to tensor-based derivation
-rank = config_rank
-if rank is None:
-    if "in" in layer_weights:
-        rank = layer_weights["in"].shape[0]
-    elif "out" in layer_weights:
-        rank = layer_weights["out"].shape[1]
-    else:
-        # Both matrices missing - look for rank from other layers or use default
-        for other_layer_idx in sorted(all_lora_weights.keys()):
-            if (
-                other_layer_idx != layer_idx
-                and "in" in all_lora_weights[other_layer_idx]
-            ):
-                rank = all_lora_weights[other_layer_idx]["in"].shape[0]
-                break
-        if rank is None:
-            # Final fallback to a reasonable default rank
-            rank = 64
-            warnings.warn(
-                f"Layer {layer_idx}: No reference rank found for attn_qkv, "
-                f"using default rank {rank}"
-            )
+rank = _determine_lora_rank(config_rank, layer_idx, layer_weights, all_lora_weights)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- tensorrt_llm/_torch/model_config.py (3 hunks)
- tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_nemotron_nas.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (3 hunks)
- tensorrt_llm/executor/request.py (2 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
- tensorrt_llm/lora_manager.py (15 hunks)
- tests/integration/test_lists/test-db/l0_a100.yml (1 hunks)
- tests/unittest/llmapi/lora_test_utils.py (2 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tensorrt_llm/executor/worker.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_a100.yml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case, and prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/models/modeling_utils.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/models/modeling_nemotron_nas.py
- tensorrt_llm/executor/request.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/_torch/model_config.py
- tests/unittest/llmapi/lora_test_utils.py
- tensorrt_llm/lora_manager.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/models/modeling_utils.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/models/modeling_nemotron_nas.py
- tensorrt_llm/executor/request.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/_torch/model_config.py
- tests/unittest/llmapi/lora_test_utils.py
- tensorrt_llm/lora_manager.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

(The same two learnings are attached per-file to modeling_llama.py, modeling_utils.py, _util.py, request.py, test_llm_pytorch.py, lora_test_utils.py, and lora_manager.py.)
🧬 Code Graph Analysis (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
- tensorrt_llm/lora_manager.py (1): HfLoraLoader (228-278)

tensorrt_llm/_torch/pyexecutor/_util.py (4)
- tensorrt_llm/lora_manager.py (1): load_torch_lora (488-507)
- tensorrt_llm/_torch/models/modeling_phi4mm.py (1): lora_config (242-262)
- tensorrt_llm/logger.py (1): warning (131-132)
- tensorrt_llm/runtime/generation.py (2): hidden_size (1152-1154), num_heads (1148-1149)

tensorrt_llm/_torch/model_config.py (3)
- tests/unittest/_torch/test_resource_manager.py (1): num_kv_heads_per_layer (74-75)
- tensorrt_llm/_torch/distributed/communicator.py (2): tp_size (46-47), cp_size (38-39)
- tensorrt_llm/_torch/models/modeling_phi4mm.py (1): lora_config (242-262)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_pytorch.py
498-498: Line too long (124 > 120)
(E501)
499-499: Line too long (123 > 120)
(E501)
503-503: Line too long (122 > 120)
(E501)
tensorrt_llm/_torch/model_config.py
321-321: Line too long (129 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (36)
tensorrt_llm/executor/request.py (3)
28-28: LGTM! The new `lora_ckpt_source` field is properly typed with a sensible default value for backward compatibility.
33-36: LGTM! The validation logic is well-implemented with an appropriate exception type and clear error messaging.
50-52: LGTM! The read-only property provides a clean interface to access the checkpoint source.
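Taken together, the three comments above describe a small, self-contained pattern; here is a simplified stand-in (not the actual class, whose other fields are omitted):

```python
from dataclasses import dataclass

@dataclass
class LoraRequestSketch:
    """Simplified stand-in showing the new field, its validation, and the property."""
    lora_ckpt_source: str = "hf"  # backward-compatible default

    def __post_init__(self):
        if self.lora_ckpt_source not in ("hf", "nemo"):
            raise ValueError(
                f"lora_ckpt_source must be 'hf' or 'nemo', "
                f"got {self.lora_ckpt_source!r}"
            )

    @property
    def ckpt_source(self) -> str:
        return self.lora_ckpt_source

assert LoraRequestSketch().ckpt_source == "hf"
```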
tensorrt_llm/_torch/models/modeling_utils.py (1)
367-373: LGTM! The conditional check properly restricts custom lm_head loading to HF LoRA checkpoints only, preventing incompatible usage with NeMo checkpoints. The implementation is clean and well-commented.
tensorrt_llm/_torch/models/modeling_nemotron_nas.py (1)
195-201: LGTM! The conditional check correctly restricts custom vocabulary loading to HF LoRA checkpoints, maintaining consistency with the pattern used across other model files. The implementation is clean and well-commented.
tensorrt_llm/_torch/models/modeling_llama.py (1)
706-712: LGTM! The conditional check properly restricts custom vocabulary loading to HF LoRA checkpoints, following the consistent pattern established across all model files in this PR. The implementation is clean and well-documented.
tensorrt_llm/_torch/pyexecutor/_util.py (3)
17-17: LGTM! The import change from `load_torch_hf_lora` to `load_torch_lora` correctly aligns with the new loader routing logic that supports multiple checkpoint sources.
440-441: LGTM! The function call update to use `load_torch_lora` is consistent with the enhanced LoRA loading framework that now routes to the appropriate loader based on the checkpoint source.
454-466: Conservative handling of non-uniform KV heads is valid and remains untested. Our review across model configs and tests did not uncover any pretrained setups that actually specify varying KV heads per layer. Retaining this fallback (using the max) along with the warning (TRTLLM-6561) is prudent for future heterogeneous-head models. No changes required.

tests/unittest/llmapi/lora_test_utils.py (2)
1-8: LGTM! The imports are appropriate for creating mock NeMo LoRA checkpoints, including file operations, PyTorch tensors, and the necessary utility functions.
124-234: Well-designed test utility for NeMo LoRA checkpoint creation. The function provides comprehensive support for creating deterministic mock NeMo LoRA checkpoints with:
- Proper parameter validation for head divisibility constraints
- Correct GQA support with appropriate QKV dimension calculations
- Clear documentation about the hardcoded coefficient's impact on tests
- Pragmatic use of JSON for configuration serialization
The implementation correctly handles the NeMo checkpoint structure and will effectively support the new NeMo LoRA loading tests.
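As a rough mental model of what such a utility produces, here is a minimal sketch of a mock .nemo LoRA archive. The tar layout and the NeMo-style weight key names are assumptions based on this review's descriptions (fused QKV adapter, JSON-as-YAML config), not the actual test utility:

```python
import json
import tarfile
from pathlib import Path

import torch

def create_mock_nemo_lora_sketch(tmp_dir: Path, hidden_size: int = 64,
                                 num_layers: int = 2, lora_rank: int = 8) -> Path:
    """Build a tiny .nemo-style tar archive with a config and LoRA weights."""
    config = {"num_layers": num_layers, "hidden_size": hidden_size}
    weights = {}
    for layer in range(num_layers):
        prefix = (f"model.language_model.encoder.layers.{layer}"
                  ".self_attention.adapter_layer.lora_kqv_adapter")
        # Fused QKV adapter: one "in" projection, one "out" projection per layer.
        weights[f"{prefix}.linear_in.weight"] = torch.zeros(lora_rank, hidden_size)
        weights[f"{prefix}.linear_out.weight"] = torch.zeros(3 * hidden_size, lora_rank)

    config_path = tmp_dir / "model_config.yaml"
    config_path.write_text(json.dumps(config))  # JSON is valid YAML
    weights_path = tmp_dir / "model_weights.ckpt"
    torch.save(weights, weights_path)

    nemo_path = tmp_dir / "mock_lora.nemo"
    with tarfile.open(nemo_path, "w") as tar:
        tar.add(config_path, arcname="model_config.yaml")
        tar.add(weights_path, arcname="model_weights.ckpt")
    return nemo_path
```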
tensorrt_llm/_torch/model_config.py (2)
363-367: LGTM! The code correctly updates the ModelConfigCpp with either per-layer KV heads or uniform KV heads based on the configuration.
417-419: Good defensive programming for None handling. The fix properly handles None values in `ffn_mult` by replacing them with 0 before taking the maximum, preventing potential AttributeError exceptions.
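A tiny illustration of that None-safe max pattern (the list contents are made up):

```python
# None entries are coerced to 0 so max() never compares against None.
ffn_mult = [1.5, None, 4.0]
largest = max((m if m is not None else 0) for m in ffn_mult)
assert largest == 4.0
```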
tests/unittest/llmapi/test_llm_pytorch.py (3)
8-8: LGTM! The import of `create_mock_nemo_lora_checkpoint` is properly added to support the new NeMo LoRA tests.
432-467: Well-structured parameterized test for NeMo LoRA loading. The test effectively validates the `load_torch_nemo_lora` function with different rank configurations and properly asserts the expected module configurations.
469-491: Good validation test for unsupported modules. The test correctly verifies that attempting to use unsupported LoRA modules with the NeMo format raises an appropriate ValueError.
tensorrt_llm/lora_manager.py (20)
5-5: LGTM! Good addition for user feedback. Adding the `warnings` import is appropriate for the new warning messages throughout the code.
8-8: LGTM! Appropriate for caching optimization. The `lru_cache` import is well-suited for the new `_find_nemo_files_single_path` function.
10-10: LGTM! Comprehensive type hints. The expanded type imports improve type safety and code clarity.
27-62: LGTM! Well-structured NeMo weight extraction. The function is well-documented with clear parameter descriptions and error handling. The regex pattern matching approach is efficient, and the error messages are descriptive.
70-133: LGTM! Excellent documentation improvements. The added docstring clearly explains the function's purpose, parameters, and return value. The type annotations improve code clarity.
135-159: LGTM! Clear documentation for HF weight extraction. The docstring additions make the function's purpose and parameters clear.
170-190: LGTM! Well-documented utility function. The docstring clearly explains the module mapping inversion logic with examples.
281-316: LGTM! Smart caching strategy for file discovery. The function is well-designed with comprehensive error handling and an efficient caching strategy. Caching individual paths maximizes cache efficiency when the same paths appear in different collections.
317-349: LGTM! Efficient wrapper for batch file discovery. The function effectively uses the cached helper to process multiple paths efficiently. The documentation clearly explains the optimization strategy.
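A hedged sketch of this two-level design: the per-path helper is cached, so repeated paths across different collections hit the cache. Names mirror the review comments; the real signatures may differ:

```python
from functools import lru_cache
from pathlib import Path
from typing import Iterable, List, Tuple

@lru_cache(maxsize=None)
def _find_nemo_files_single_path(path: str) -> Tuple[str, ...]:
    """Resolve one path (file or directory) to .nemo files; cached per path."""
    p = Path(path)
    if p.is_file() and p.suffix == ".nemo":
        return (str(p),)
    if p.is_dir():
        return tuple(sorted(str(f) for f in p.glob("*.nemo")))
    raise ValueError(f"No .nemo files found at: {path}")

def find_nemo_files(paths: Iterable[str]) -> List[str]:
    """Batch wrapper: repeated paths across collections hit the per-path cache."""
    found: List[str] = []
    for path in paths:
        found.extend(_find_nemo_files_single_path(path))
    return found
```

Returning an immutable tuple from the cached helper avoids callers mutating a shared cached value, which is a common pitfall when combining `lru_cache` with list results.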
391-394: LGTM! Improved error handling. The added validation with a clear error message improves the user experience when NeMo LoRA loading fails.
443-487: LGTM! Well-implemented NeMo LoRA loader for PyTorch. The function is well-documented and includes proper validation for supported modules. The error messages clearly explain NeMo's limitations regarding module support.
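The module restriction lends itself to a compact check. A hedged sketch (constant and function names are invented; the attn_qkv-only support follows this review's description of NeMo's fused QKV adapter):

```python
NEMO_SUPPORTED_LORA_MODULES = {"attn_qkv"}  # NeMo checkpoints fuse Q, K, V

def validate_nemo_lora_modules(requested_modules):
    """Raise if the request asks for modules a NeMo LoRA checkpoint cannot provide."""
    unsupported = set(requested_modules) - NEMO_SUPPORTED_LORA_MODULES
    if unsupported:
        raise ValueError(
            f"NeMo LoRA only supports modules {sorted(NEMO_SUPPORTED_LORA_MODULES)}; "
            f"got unsupported modules: {sorted(unsupported)}"
        )

validate_nemo_lora_modules(["attn_qkv"])  # passes; ["mlp_h_to_4h"] would raise
```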
488-509: LGTM! Clean routing implementation. The function provides a clean abstraction for loading different LoRA checkpoint sources with clear error messages.
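A minimal sketch of what such source-based routing typically looks like, reusing the loader names mentioned elsewhere in this review; the stand-in loader bodies and the dict-shaped config are placeholders, not the real API:

```python
def load_torch_hf_lora(lora_config):    # stand-in for the real HF loader
    return f"hf weights from {lora_config['lora_dir']}"

def load_torch_nemo_lora(lora_config):  # stand-in for the real NeMo loader
    return f"nemo weights from {lora_config['lora_dir']}"

def load_torch_lora(lora_config):
    """Dispatch on the checkpoint source with a clear failure mode."""
    source = lora_config.get("lora_ckpt_source", "hf")
    if source == "hf":
        return load_torch_hf_lora(lora_config)
    if source == "nemo":
        return load_torch_nemo_lora(lora_config)
    raise ValueError(f"Unsupported lora_ckpt_source: {source!r}; expected 'hf' or 'nemo'")

print(load_torch_lora({"lora_dir": "/tmp/adapter.nemo", "lora_ckpt_source": "nemo"}))
```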
611-647: LGTM! Robust archive extraction with fallback paths. The function handles both possible archive structures gracefully and includes comprehensive error handling.
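For intuition, a hedged sketch of handling two possible tar layouts (the './'-prefixed member name is an assumption about the alternate structure; the real member names may differ):

```python
import tarfile

def read_nemo_weights_member(nemo_path):
    """Try both member layouts before giving up, mirroring the fallback idea."""
    with tarfile.open(nemo_path, "r") as tar:
        names = set(tar.getnames())
        for candidate in ("model_weights.ckpt", "./model_weights.ckpt"):
            if candidate in names:
                return tar.extractfile(candidate).read()
    raise ValueError(f"model_weights.ckpt not found in {nemo_path}")
```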
773-782: LGTM! Proper integration with NeMo file discovery. The method now correctly uses `find_nemo_files` to resolve .nemo files before passing them to the loader.
869-916: Excellent error handling for missing matrices. The detailed error messages for missing "in" and "out" matrices provide clear guidance to users about incomplete or corrupted checkpoints. This will significantly improve the debugging experience.
961-963: LGTM! Helpful success message. The success message with loaded UIDs helps users track which adapters were loaded.
1102-1106: LGTM! Helpful warning for skipped modules. The warning message helps users understand why certain LoRA modules are being skipped.
1108-1108: LGTM! Clearer MoE detection logic. Checking for the absence of both "in" and "out" keys makes the MoE detection more explicit and readable.
1129-1143: LGTM! Consistent error handling with the NeMo loader. The detailed error messages for missing matrices match the quality of the NeMo loader, providing a consistent user experience across both checkpoint sources.
1238-1240: LGTM! Consistent success messaging. The success message matches the NeMo loader's pattern.
Closing this as the scope has been reduced to just logging.
Documentation