-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add basic Nemo Ckpt Lora Loading in pytorch flow #6019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add basic Nemo Ckpt Lora Loading in pytorch flow #6019
Conversation
f6b09d7
to
18ab327
Compare
/bot run --extra-stage "H100_PCIe-PyTorch-Post-Merge-1" |
PR_Github #11846 [ run ] triggered by Bot |
PR_Github #11846 [ run ] completed with state |
/bot run --extra-stage "H100_PCIe-PyTorch-Post-Merge-1" |
PR_Github #11861 [ run ] triggered by Bot |
PR_Github #11861 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work :)
a few small comments and questions I'm not sure about
4bfb756
to
11da27b
Compare
/bot run |
PR_Github #11967 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds end-to-end support for loading NeMo-formatted LoRA checkpoints in the PyTorch workflow. Key changes include:
- Introducing helper functions (
find_nemo_files
,_find_nemo_files_single_path
), enhancingNemoLoraLoader
, and addingload_torch_nemo_lora
with routing viaload_torch_lora
. - Updating the executor, request, and model initialization code to respect a new
lora_ckpt_source
flag and only apply HF-specific logic when appropriate. - Adding unit tests that generate a minimal
.nemo
archive and verify loader behavior, and updating the integration test matrix.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
tests/unittest/llmapi/test_llm_pytorch.py | Add NeMo LoRA checkpoint generation helper and loader unit tests |
tests/integration/test_lists/test-db/l0_h100.yml | Update integration test list to include new NeMo LoRA tests |
tensorrt_llm/lora_manager.py | Introduce file‐finding utilities, PyTorch routing, and docstrings |
tensorrt_llm/executor/worker.py | Pass ckpt_source through when loading adapters |
tensorrt_llm/executor/request.py | Extend LoRARequest with lora_ckpt_source and validation |
tensorrt_llm/_torch/pyexecutor/_util.py | Route via new load_torch_lora instead of HF‐only loader |
tensorrt_llm/_torch/models/modeling_utils.py | Guard HF‐only LoRA head checks under lora_ckpt_source == "hf" |
tensorrt_llm/_torch/models/modeling_nemotron_nas.py | Same HF‐only guard for custom vocab in NeMo flow |
tensorrt_llm/_torch/models/modeling_llama.py | Same HF‐only guard for custom vocab in NeMo flow |
Comments suppressed due to low confidence (4)
tensorrt_llm/lora_manager.py:383
- [nitpick] There are now two similarly named functions (
load_nemo_lora
andload_torch_nemo_lora
), which can be confusing. Consider renaming one to clarify their distinct purposes.
def load_nemo_lora(model, lora_config: LoraConfig):
tensorrt_llm/executor/worker.py:353
- The call to
load_from_ckpt
now includes ackpt_source
keyword—please confirm that theload_from_ckpt
signature accepts this parameter to avoid unexpected keyword argument errors.
ckpt_source=lora_request.ckpt_source)
tests/unittest/llmapi/test_llm_pytorch.py:41
- The helper uses
tempfile.TemporaryDirectory()
but I don’t seeimport tempfile
in this diff. Please verify thatimport tempfile
is present at the top of the test file to avoid a NameError.
def create_mock_nemo_lora_checkpoint(
PR_Github #11967 [ run ] completed with state |
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
aeae974
to
56266c8
Compare
/bot run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tensorrt_llm/_torch/model_config.py (1)
300-342
: LGTM! Comprehensive KV heads handling with proper fallbacks.The enhanced logic properly handles both uniform and per-layer KV head configurations with appropriate fallbacks and LoRA compatibility validation. The TP/CP scaling is applied consistently.
One minor formatting issue to address:
- # For uniform models, check: num_key_value_heads (standard) -> num_query_groups (NeMo) -> num_attention_heads + # For uniform models, check: num_key_value_heads (standard) -> + # num_query_groups (NeMo) -> num_attention_headstests/unittest/llmapi/test_llm_pytorch.py (1)
493-567
: LGTM! Comprehensive integration test for GQA NeMo LoRA support.This test excellently validates the entire pipeline from checkpoint creation to generation, with proper deterministic setup and validation that LoRA has a measurable effect. The GQA configuration matches TinyLlama's specifications perfectly.
Fix the line length issues flagged by static analysis:
- 1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and applied to a TinyLlama model, + 1. That a NeMo-format LoRA checkpoint with GQA (grouped query attention) can be loaded and + applied to a TinyLlama model, - and that generation with this LoRA produces a deterministic, expected output for a fixed prompt and temperature=0.0. + and that generation with this LoRA produces a deterministic, expected output for a fixed + prompt and temperature=0.0. - The test uses a deterministic dummy LoRA checkpoint (seed=42) and checks both the positive (LoRA applied) and negative + The test uses a deterministic dummy LoRA checkpoint (seed=42) and checks both the positive + (LoRA applied) and negativetensorrt_llm/lora_manager.py (1)
349-385
: LGTM! Enhanced loader with honest documentation about design limitations.The comprehensive documentation and new
get_target_modules
method improve the class significantly. The honest acknowledgment of the misleading parameter namelora_dirs
in the docstring is good practice.Consider renaming the parameter in a future version:
- def __init__(self, lora_dirs: List[str]): + def __init__(self, lora_paths: List[str]): # Future improvement
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
tensorrt_llm/_torch/model_config.py
(3 hunks)tensorrt_llm/_torch/models/modeling_llama.py
(1 hunks)tensorrt_llm/_torch/models/modeling_nemotron_nas.py
(1 hunks)tensorrt_llm/_torch/models/modeling_utils.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/_util.py
(3 hunks)tensorrt_llm/executor/request.py
(2 hunks)tensorrt_llm/executor/worker.py
(1 hunks)tensorrt_llm/lora_manager.py
(10 hunks)tests/unittest/llmapi/lora_test_utils.py
(2 hunks)tests/unittest/llmapi/test_llm_pytorch.py
(2 hunks)
🧠 Learnings (1)
tensorrt_llm/lora_manager.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/model_config.py
321-321: Line too long (129 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
498-498: Line too long (124 > 120)
(E501)
499-499: Line too long (123 > 120)
(E501)
503-503: Line too long (122 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/models/modeling_nemotron_nas.py
- tensorrt_llm/executor/request.py
- tensorrt_llm/_torch/models/modeling_utils.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/unittest/llmapi/lora_test_utils.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/lora_manager.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/model_config.py
321-321: Line too long (129 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
498-498: Line too long (124 > 120)
(E501)
499-499: Line too long (123 > 120)
(E501)
503-503: Line too long (122 > 120)
(E501)
🔇 Additional comments (12)
tensorrt_llm/_torch/model_config.py (2)
363-366
: LGTM! Correct routing to C++ binding methods.The logic properly sets the KV heads configuration on the C++ model config based on whether per-layer or uniform KV heads are being used.
416-419
: LGTM! Safe handling of None ffn_mult values.The defensive programming approach correctly handles cases where some layers'
ffn_mult
attributes may be None by substituting zero, preventing runtime errors during max computation.tests/unittest/llmapi/test_llm_pytorch.py (2)
432-467
: LGTM! Well-structured unit test for NeMo LoRA loading.The parameterized test effectively covers different LoRA rank configurations and validates that the loader correctly sets up the configuration with proper module mapping.
469-490
: LGTM! Important validation test for error handling.The test properly validates that the loader rejects unsupported module configurations and raises appropriate errors, which is crucial for user guidance.
tensorrt_llm/lora_manager.py (8)
26-60
: LGTM! Excellent addition of type annotations and comprehensive documentation.The enhanced function signature with type hints and detailed docstring significantly improves code clarity and maintainability without changing the core functionality.
69-157
: LGTM! Consistent improvement in type safety and documentation.The addition of comprehensive type annotations and docstrings follows good practices and makes these complex functions much more understandable, especially the callback signature documentation in
iterate_hf_lora
.
169-188
: LGTM! Clear documentation for module mapping inversion.The type annotations and docstring make this potentially confusing operation much clearer, especially the clarification that HF module names can be either strings or lists of strings.
280-347
: LGTM! Well-designed file discovery with intelligent caching strategy.The LRU caching on individual paths is a smart optimization that maximizes cache efficiency when paths are reused across different collections. The comprehensive error handling provides clear user feedback for various failure scenarios.
388-396
: LGTM! Important addition of validation logic.The validation check prevents silent failures when NeMo LoRA loading fails, providing clear error messages to users.
442-485
: LGTM! Well-designed PyTorch-specific NeMo LoRA loader.The function provides a clean abstraction for NeMo LoRA loading with appropriate validation, clear error messages, and proper documentation of current limitations. The hardcoded "attn_qkv" mapping aligns with NeMo's current supported functionality.
487-507
: LGTM! Clean dispatcher pattern for LoRA checkpoint loading.The dispatcher function provides a clean abstraction that routes to appropriate loaders based on checkpoint source, with proper error handling for unsupported sources.
772-777
: LGTM! Proper integration of file discovery functionality.The integration of
find_nemo_files
in the LoraManager ensures that NeMo checkpoint loading properly handles both files and directories, maintaining consistency with the enhanced loader capabilities.
PR_Github #12599 [ run ] triggered by Bot |
PR_Github #12599 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LLM part changes look good to me.
Signed-off-by: Shreyas Misra <[email protected]>
Signed-off-by: Ransiki Zhang <[email protected]>
NemoLoraLoader
used in trt flowlora_dir
for nemo loading caseNOTE: This merely re-uses the pre-existing core Nemo Lora ckpt loading functionality, so its on parity with the TRT flow w.r.t Lora Ckpt Loading.
The limitations of Nemo Lora Ckpt loading, therefore, still apply.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation