[TRTLLM-5830][feat] Improve LoRA cache memory control #6220
📝 Walkthrough
This update refactors LoRA and PEFT cache configuration management in both core and test code. Deprecated LoRA fields are removed, LoRA config handling is unified, and PEFT cache merging is improved. Type annotations are updated for flexibility, and new tests are added to verify cache sizing and config override behaviors.
Sequence Diagram(s)
sequenceDiagram
    participant User
    participant LLM
    participant EngineConfig
    participant LoraConfig
    participant PeftCacheConfig
    participant Executor
    User->>LLM: Build model (with/without lora_config)
    LLM->>EngineConfig: Load engine config
    alt lora_plugin enabled
        EngineConfig->>LoraConfig: Load lora_config
        alt User provides lora_config
            LLM->>LoraConfig: Override with user lora_config
        end
        LLM->>PeftCacheConfig: Merge existing config, update fields if lora_config present
    end
    LLM->>Executor: Create with merged lora_config and peft_cache_config
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
331-344: Fix the line length violation.
The test logic is excellent - it validates that `lora_config` cache size parameters override conflicting `peft_cache_config` values by successfully running with small cache sizes in `peft_cache_config` but adequate sizes in `lora_config`. However, there's a line length issue that needs to be addressed.
Apply this diff to fix the line length violation:
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Good factory method implementation with minor formatting suggestion.
The `create_from_pybind` method correctly implements the factory pattern for converting pybind objects to Python objects, supporting the flexible PEFT cache configuration mentioned in the PR objectives.
Minor nitpick: Consider formatting the docstring as a single line per the static analysis hint.
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
tests/unittest/llmapi/test_llm.py (1)
1438-1453: Fix line length and approve override testing logic
The test effectively validates that LoRA config cache size parameters override PEFT cache config by creating a scenario where PEFT config would fail but LoRA config succeeds.
However, line 1438 exceeds the 120-character limit:
-def test_llama_7b_lora_config_overrides_peft_cache_config():
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+def test_llama_7b_lora_config_overrides_peft_cache_config():
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py(2 hunks)
- tensorrt_llm/llmapi/llm.py(3 hunks)
- tensorrt_llm/llmapi/llm_args.py(3 hunks)
- tensorrt_llm/lora_manager.py(1 hunks)
- tests/unittest/llmapi/test_llm.py(3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py(2 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_pytorch.py
332-332: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm.py
1438-1438: Line too long (124 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🔇 Additional comments (12)
tensorrt_llm/lora_manager.py (1)
147-148: LGTM! Clean type annotation update for flexible cache configuration.
The change from `int = 4` to `int | None = None` for both `max_loras` and `max_cpu_loras` aligns perfectly with the PR objective of making LoRA cache sizes optional. This allows the system to determine appropriate defaults via `PeftCacheConfig` when these values are unset.
tests/unittest/llmapi/test_llm_pytorch.py (3)
4-4: LGTM! Proper import addition for test functionality.
The import of `PeftCacheConfig` is necessary for the new cache configuration tests.
9-11: LGTM! Required import for multi-LoRA test harness.
The import of `check_llama_7b_multi_lora_from_request_test_harness` is correctly added to support the new cache configuration tests.
297-329: LGTM! Well-designed test for PEFT cache configuration validation.
This test effectively verifies that `PeftCacheConfig` parameters directly impact cache sizing by intentionally setting values too small to hold a single adapter and expecting `RuntimeError` exceptions. The approach of testing failure cases is a solid strategy when direct cache size inspection isn't possible.
The test covers both `host_cache_size` and `device_cache_percent` parameters, providing comprehensive validation.
tensorrt_llm/_torch/pyexecutor/_util.py (2)
14-14: LGTM! Required import for PEFT cache configuration merging.
The import of `PeftCacheConfig` is necessary for the new cache configuration merging logic.
471-481: Excellent refactoring for flexible PEFT cache configuration.
The updated logic elegantly handles the merging of existing PEFT cache configuration with LoRA-specific parameters:
- Preserves existing config: Creates `PeftCacheConfig` from existing pybind config when available, or uses defaults
- Conditional overrides: Only updates `num_device_module_layer` and `num_host_module_layer` when `lora_config.max_loras` and `lora_config.max_cpu_loras` are explicitly set (not `None`)
- Maintains backwards compatibility: Falls back to defaults when no existing config is present
This approach is much more flexible than the previous direct construction and aligns perfectly with the PR objective of making LoRA cache sizes optional while allowing overrides.
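The merge described above can be sketched roughly as follows. This is a minimal illustration, not the code in `_util.py`: the two dataclasses are stand-ins for `LoraConfig` and `PeftCacheConfig`, their default values are placeholders, and `slots_per_adapter` (how many module-layer slots one adapter needs) is a hypothetical parameter, since the review does not say how the layer counts are derived from `max_loras` and `max_cpu_loras`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class LoraConfigSketch:
    # Stand-in for LoraConfig: both limits are optional, as in the PR.
    max_loras: Optional[int] = None
    max_cpu_loras: Optional[int] = None


@dataclass
class PeftCacheConfigSketch:
    # Stand-in for PeftCacheConfig: defaults here are placeholders.
    num_host_module_layer: int = 0
    num_device_module_layer: int = 0
    host_cache_size: Optional[int] = None
    device_cache_percent: Optional[float] = None


def merge_peft_cache_config(existing, lora_config, slots_per_adapter):
    """Apply explicitly set LoRA limits on top of an existing cache config."""
    # Start from a copy of the existing config, or from defaults if absent.
    merged = replace(existing) if existing is not None else PeftCacheConfigSketch()
    # Override a field only when its LoRA limit was explicitly set.
    # slots_per_adapter is an assumed sizing factor, not taken from the PR.
    if lora_config.max_loras is not None:
        merged.num_device_module_layer = slots_per_adapter * lora_config.max_loras
    if lora_config.max_cpu_loras is not None:
        merged.num_host_module_layer = slots_per_adapter * lora_config.max_cpu_loras
    return merged
```

With `max_loras=2` set and `max_cpu_loras` left as `None`, only the device-side layer count changes; `host_cache_size` and the host-side layer count keep whatever the existing config held.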
tensorrt_llm/llmapi/llm_args.py (2)
644-644: Good documentation improvement!
The clarifying suffixes help users understand that these parameters affect cache sizes and take precedence over other cache size settings, which aligns well with the PR's goal of improving LoRA cache memory control.
Also applies to: 649-649
1424-1428: Correctly implements optional LoraConfig handling.
The changes properly make `lora_config` optional by:
- Always enabling `lora_plugin` when `enable_lora` is true (for non-PyTorch backends), regardless of `lora_config` presence
- Adding a safety check before accessing `lora_config.max_lora_rank`
This aligns with the PR objective of making `LoraConfig.max_loras` and `LoraConfig.max_cpu_loras` optional parameters.
tests/unittest/llmapi/test_llm.py (3)
38-39: LGTM: Import changes support new PEFT cache testing functionality
The new imports are correctly added to support the PEFT cache configuration testing and test harness usage.
Also applies to: 53-55
1393-1400: LGTM: Test harness integration improves consistency
The modification to use the test harness with additional parameters is a good refactoring that standardizes the LoRA testing approach while maintaining the original test logic.
1403-1436: LGTM: Effective negative testing approach for cache configuration validation
The test correctly validates that PEFT cache configuration parameters affect actual cache sizes by testing failure scenarios with inadequately small cache sizes. The approach of using `pytest.raises(RuntimeError)` is appropriate since direct cache size inspection isn't available.
tensorrt_llm/llmapi/llm.py (1)
34-35: LGTM! Import addition is appropriate.
The addition of `PeftCacheConfig` import is necessary for the new PEFT cache configuration management logic.
Some minor comments, overall looks good
53208f5 to 0221b5b
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring format and validate method implementation.
The new `create_from_pybind` factory method enables conversion from pybind objects to Python models, which supports the refactored PEFT cache configuration handling.
Apply this diff to fix the docstring format:
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
tests/unittest/llmapi/test_llm.py (1)
437-452: LGTM - Test correctly validates config override behavior with minor formatting issue.
This test effectively validates that `LoraConfig` cache size parameters take precedence over `PeftCacheConfig` parameters, which aligns with the PR objectives. The test design demonstrates this by using small cache sizes in `PeftCacheConfig` that would normally cause failures, but providing adequate cache sizes in `LoraConfig` to ensure success.
Apply this diff to fix the line length issue:
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache
+    size parameters in peft_cache_config LLM arg."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py(2 hunks)
- tensorrt_llm/llmapi/llm.py(3 hunks)
- tensorrt_llm/llmapi/llm_args.py(3 hunks)
- tensorrt_llm/lora_manager.py(1 hunks)
- tests/unittest/llmapi/test_llm.py(3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py(2 hunks)
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
tests/unittest/llmapi/test_llm.py
1437-1437: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
329-329: Line too long (124 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/lora_manager.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/llmapi/test_llm_pytorch.py (3)
4-4: LGTM: Import addition supports new test functionality.
The import of `PeftCacheConfig` is correctly added to support the new cache configuration tests.
9-11: LGTM: Import addition enables test harness usage.
The import of `check_llama_7b_multi_lora_from_request_test_harness` is properly added to support the new LoRA cache testing functionality.
294-326: LGTM: Comprehensive test validates PEFT cache size impact.
The test correctly validates that small cache sizes cause runtime failures when loading LoRA adapters. The approach of testing failure conditions is appropriate since direct cache size inspection isn't available.
Key strengths:
- Tests both `host_cache_size` and `device_cache_percent` parameters
- Uses minimal LoRA config without explicit cache size values
- Properly expects `RuntimeError` for insufficient cache sizes
- Includes appropriate CUDA graph disabling for known issues
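The failure-based testing strategy praised here can be illustrated with a small self-contained sketch. Everything in it is hypothetical: `TinyAdapterCache` is a toy stand-in for the real PEFT cache, and the byte sizes are arbitrary. The point is only the shape of the check: when a size cannot be inspected directly, confirm that an undersized cache fails and an adequate one succeeds.

```python
class TinyAdapterCache:
    """Hypothetical cache that rejects adapters larger than its capacity."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes

    def load(self, adapter_bytes: int) -> None:
        if adapter_bytes > self.capacity_bytes:
            raise RuntimeError("adapter does not fit in cache")


def raises_runtime_error(action) -> bool:
    """Return True if calling `action` raises RuntimeError, else False."""
    try:
        action()
    except RuntimeError:
        return True
    return False


def check_cache_size_is_respected() -> None:
    adapter_bytes = 4096  # arbitrary size for the illustration
    # A cache too small for a single adapter must fail to load it.
    assert raises_runtime_error(lambda: TinyAdapterCache(1).load(adapter_bytes))
    # An adequately sized cache must load the same adapter without error.
    assert not raises_runtime_error(lambda: TinyAdapterCache(8192).load(adapter_bytes))
```

In a pytest suite, `pytest.raises(RuntimeError)` would take the place of the `raises_runtime_error` helper; the helper is used here only to keep the sketch free of third-party imports.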
tensorrt_llm/llmapi/llm_args.py (2)
644-644: LGTM: Field descriptions clarified for better understanding.
The updated descriptions for `num_host_module_layer` and `num_device_module_layer` provide clearer explanations of their impact on cache sizes and overriding behavior.
Also applies to: 648-650
1424-1428: LGTM: Simplified LoRA plugin configuration logic.
The updated logic correctly:
- Always sets `lora_plugin` to 'auto' when LoRA is enabled for non-pytorch backends
- Only assigns `max_lora_rank` when `lora_config` is present
- Removes previous conditional checks that are no longer needed
This aligns with the PR's goal of removing deprecated LoRA fields and simplifying configuration.
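As a rough sketch of the logic summarized above, assuming stand-in types rather than the real `llm_args.py` classes: `PluginConfigSketch`, `BuildConfigSketch`, `LoraSettingsSketch`, and `apply_lora_settings` are hypothetical names, and both the placement of `max_lora_rank` on the build config and its nesting under the backend check are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PluginConfigSketch:
    lora_plugin: Optional[str] = None


@dataclass
class BuildConfigSketch:
    plugin_config: PluginConfigSketch = field(default_factory=PluginConfigSketch)
    # Where the rank is stored is an assumption for this sketch.
    max_lora_rank: Optional[int] = None


@dataclass
class LoraSettingsSketch:
    # Stand-in for the user-supplied lora_config.
    max_lora_rank: int = 64


def apply_lora_settings(build_config, enable_lora, backend, lora_config):
    """Enable the LoRA plugin and copy the rank only when a config exists."""
    if enable_lora and backend != "pytorch":
        # The plugin is enabled whether or not lora_config was provided.
        build_config.plugin_config.lora_plugin = "auto"
        # Guard the attribute access: lora_config is optional.
        if lora_config is not None:
            build_config.max_lora_rank = lora_config.max_lora_rank
    return build_config
```

Calling `apply_lora_settings(BuildConfigSketch(), True, "tensorrt", None)` enables the plugin and leaves `max_lora_rank` untouched, which is the case the safety check exists for.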
tests/unittest/llmapi/test_llm.py (4)
38-39: LGTM - Import addition supports new cache configuration tests.
The import of `PeftCacheConfig` is necessary for the new test functions that verify cache sizing behavior.
53-55: LGTM - Test utility import is correctly added.
The import of `check_llama_7b_multi_lora_from_request_test_harness` is necessary for the new test functions that verify cache behavior.
396-399: LGTM - Test function updated with proper parameter passing.
The modification correctly updates the test to use the standardized test harness with appropriate parameters for LoRA functionality.
402-434: LGTM - Well-designed test for cache size validation.
This test effectively validates that `PeftCacheConfig` parameters affect cache sizing by testing failure scenarios with extremely small cache sizes. The approach is appropriate given that actual cache sizes cannot be directly inspected.
The test covers both `host_cache_size` and `device_cache_percent` parameters, ensuring comprehensive validation of the cache sizing functionality.
0221b5b to 7021f17
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)
329-329: Fix line length violation.
The docstring exceeds the project's line-length limit of 120 characters.
Apply this diff to split the docstring:
-    """Tests that cache size args in lora_config LLM arg override the cache size parameters in peft_cache_config LLM arg."""
+    """Tests that cache size args in lora_config LLM arg override the cache size
+    parameters in peft_cache_config LLM arg."""
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring formatting and validate factory method logic.
The static factory method correctly copies all fields from the pybind object to create a Python `PeftCacheConfig` instance. However, there's a docstring formatting issue.
Apply this diff to fix the docstring format:
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind object."""
The method implementation correctly maps all pybind fields to the Python model fields, supporting the conversion pattern used elsewhere in the codebase.
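The field-copying factory pattern can be sketched as below. This is an illustration under stated assumptions, not the method in `llm_args.py`: `PeftCacheConfigSketch` is a stand-in that carries only the four fields named in this review, and a `SimpleNamespace` plays the role of the pybind object.

```python
from dataclasses import dataclass, fields
from types import SimpleNamespace
from typing import Optional


@dataclass
class PeftCacheConfigSketch:
    # Stand-in carrying only the fields mentioned in this review.
    num_host_module_layer: int = 0
    num_device_module_layer: int = 0
    host_cache_size: Optional[int] = None
    device_cache_percent: Optional[float] = None

    @staticmethod
    def create_from_pybind(pybind_config) -> "PeftCacheConfigSketch":
        """Create PeftCacheConfigSketch from a pybind-like object."""
        # Copy every declared field by name from the source object.
        values = {
            f.name: getattr(pybind_config, f.name)
            for f in fields(PeftCacheConfigSketch)
        }
        return PeftCacheConfigSketch(**values)


# Usage: SimpleNamespace stands in for the pybind _PeftCacheConfig object.
native = SimpleNamespace(
    num_host_module_layer=3,
    num_device_module_layer=5,
    host_cache_size=1024,
    device_cache_percent=0.05,
)
python_config = PeftCacheConfigSketch.create_from_pybind(native)
```

Iterating over the declared fields keeps the copy in step with the class definition, at the cost of requiring the source object to expose the same attribute names.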
tests/unittest/llmapi/test_llm.py (1)
437-452: LGTM - Excellent test for configuration override behavior
This test perfectly validates the priority behavior where `lora_config` cache size parameters should override `peft_cache_config` parameters. The test design is clever and effective.
However, please fix the line length issue:
-        peft_cache_config=PeftCacheConfig(host_cache_size=1,
-                                          device_cache_percent=0.000001))
+        peft_cache_config=PeftCacheConfig(
+            host_cache_size=1, device_cache_percent=0.000001))
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py(2 hunks)
- tensorrt_llm/llmapi/llm.py(3 hunks)
- tensorrt_llm/llmapi/llm_args.py(3 hunks)
- tensorrt_llm/lora_manager.py(1 hunks)
- tests/unittest/llmapi/test_llm.py(3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py(2 hunks)
🧠 Learnings (4)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
tests/unittest/llmapi/test_llm.py
1437-1437: Line too long (124 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
329-329: Line too long (124 > 120)
(E501)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/llmapi/llm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/unittest/llmapi/test_llm_pytorch.py (4)
4-4: LGTM: Import addition supports new test functionality.
The import of `PeftCacheConfig` is correctly added to support the new cache configuration tests.
9-11: LGTM: Test utility imports are appropriate.
The import of the LoRA test harness function is correctly added to support the new cache configuration tests.
294-326: LGTM: Test correctly validates PEFT cache size behavior.
The test effectively verifies that PEFT cache configuration parameters (`host_cache_size` and `device_cache_percent`) impact cache behavior by expecting `RuntimeError` when values are too small to contain a single adapter. This is a sound testing approach when direct cache size inspection isn't available.
330-342: LGTM: Test correctly validates cache parameter override behavior.
The test properly verifies that LoRA config cache parameters (`max_loras`, `max_cpu_loras`) override conflicting small cache sizes in `peft_cache_config`, ensuring the precedence hierarchy works as expected.
tensorrt_llm/llmapi/llm_args.py (2)
644-650: LGTM: Enhanced field descriptions improve clarity.
The updated descriptions for `num_host_module_layer` and `num_device_module_layer` effectively clarify their impact on cache sizes and their override behavior on related cache size parameters.
1424-1428: LGTM: Updated LoRA config logic removes deprecated field dependencies.
The logic correctly updates to use `lora_config` directly instead of deprecated fields, with proper null checking to prevent attribute access on `None`. This aligns with the PR objective to centralize LoRA configuration through the `lora_config` parameter.
tests/unittest/llmapi/test_llm.py (3)
38-39: LGTM - Necessary imports for new PEFT cache testsThe import additions are appropriate and necessary to support the new PEFT cache configuration test functions.
Also applies to: 53-55
392-399: LGTM - Updated test function call with explicit argumentsThe addition of explicit keyword arguments (
LLM,enable_lora=True,build_config=build_config,fast_build=True) appears to be updating the function call to match the expected signature of the test harness function.
402-434: LGTM - Well-designed test for PEFT cache size validationThis test effectively validates that PEFT cache configuration parameters affect cache behavior by testing failure scenarios with intentionally small cache sizes. The approach is sound given that actual cache sizes cannot be directly inspected. The test covers both host and device cache parameters and has clear documentation.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/llmapi/llm_args.py(5 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
713-714: One-line docstring should fit on one line
Reformat to one line
(D200)
🔇 Additional comments (5)
tensorrt_llm/llmapi/llm_args.py (5)
644-645: Good documentation improvement. The clarification about how `num_host_module_layer` affects host cache size and overrides `host_cache_size` helps users understand the parameter relationships better.
648-650: Good documentation improvement. The clarification about how `num_device_module_layer` affects device cache size and overrides `device_cache_percent` provides important context for users configuring PEFT cache parameters.
685-686: Good documentation improvement. Adding the note about pytorch backend not supporting LoRA prefetch aligns well with the validation logic added later in the file and helps prevent user confusion.
702-719: Well-implemented factory method. The `create_from_pybind` method correctly implements the standard pattern for converting pybind objects to Python objects by copying all fields. This supports the improved PEFT cache configuration management mentioned in the PR objectives.
Note: There's a static analysis hint about docstring formatting, but it appears to be a false positive as no docstring is visible in this segment.
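The field-copying factory pattern described in this comment can be sketched as follows. This is a minimal illustration, not the PR's code: `FakePybindPeftCacheConfig` and `PeftCacheConfigSketch` are hypothetical stand-ins, and only two of the real config's fields are shown.

```python
from dataclasses import dataclass
from typing import Optional


class FakePybindPeftCacheConfig:
    """Hypothetical stand-in for the pybind `_PeftCacheConfig` object."""

    def __init__(self, host_cache_size: Optional[int],
                 device_cache_percent: Optional[float]) -> None:
        self.host_cache_size = host_cache_size
        self.device_cache_percent = device_cache_percent


@dataclass
class PeftCacheConfigSketch:
    """Hypothetical Python-side config holding the same fields."""
    host_cache_size: Optional[int] = None
    device_cache_percent: Optional[float] = None

    @staticmethod
    def create_from_pybind(
            src: FakePybindPeftCacheConfig) -> "PeftCacheConfigSketch":
        """Build the Python-side config by copying each field explicitly."""
        return PeftCacheConfigSketch(
            host_cache_size=src.host_cache_size,
            device_cache_percent=src.device_cache_percent,
        )


converted = PeftCacheConfigSketch.create_from_pybind(
    FakePybindPeftCacheConfig(host_cache_size=1024, device_cache_percent=0.05))
```

Copying fields one by one keeps the mapping explicit, at the cost of having to touch the factory whenever a field is added.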
1424-1428: Good refactoring of LoRA configuration logic. The changes improve the validation logic by:
- Simplifying the `lora_plugin` setting to always enable when LoRA is enabled (for non-pytorch backends)
- Adding a defensive check before accessing `lora_config.max_lora_rank` to prevent potential AttributeError
These changes align well with the PR's goal of removing deprecated LoRA arguments and making configuration handling more robust.
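The two points in this comment can be sketched roughly as below. The classes and the `apply_lora_settings` helper are hypothetical stand-ins written for illustration, and where the rank is stored on the build config is an assumption; only the `lora_plugin = 'auto'` setting and the `None` guard come from the review text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoraConfigSketch:
    max_lora_rank: int = 64


@dataclass
class PluginConfigSketch:
    lora_plugin: Optional[str] = None


@dataclass
class BuildConfigSketch:
    plugin_config: PluginConfigSketch = field(default_factory=PluginConfigSketch)
    max_lora_rank: int = 64  # assumed location of the rank setting


def apply_lora_settings(build_config: BuildConfigSketch, enable_lora: bool,
                        backend: str,
                        lora_config: Optional[LoraConfigSketch]) -> BuildConfigSketch:
    """Enable the LoRA plugin for non-pytorch backends; copy the rank only if a config exists."""
    if enable_lora and backend != "pytorch":
        build_config.plugin_config.lora_plugin = "auto"
        # Guard: lora_config is optional, so never dereference it blindly.
        if lora_config is not None:
            build_config.max_lora_rank = lora_config.max_lora_rank
    return build_config
```

With `lora_config=None` the plugin is still enabled and the build config keeps its own rank, which is the behavior the guard is meant to protect.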
| PR_Github #12560 [ run ] triggered by Bot | 
| PR_Github #12561 [ run ] triggered by Bot | 
| /bot run | 
| PR_Github #12560 [ run ] completed with state  | 
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Approve the factory method implementation with a minor formatting fix. The `create_from_pybind` method correctly copies all fields from the pybind object to create a Python `PeftCacheConfig` instance, supporting the cache configuration improvements in this PR.
Fix the docstring formatting issue:
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create a PeftCacheConfig instance from a pybind _PeftCacheConfig object."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tensorrt_llm/llmapi/llm_args.py(5 hunks)
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (3)
644-644: Good documentation improvements. The extended field descriptions clearly explain the cache sizing impact and override behavior, which helps users understand how these parameters affect memory allocation.
Also applies to: 648-649, 685-685
1424-1427: Good refactoring of LoRA configuration logic. The simplified condition and added null check for `lora_config` improve robustness while aligning with the goal of removing deprecated LoRA fields.
1558-1564: Proper implementation of PEFT cache validation. The validator correctly includes the null check for `peft_cache_config` before accessing `lora_prefetch_dir`, addressing the potential AttributeError concern from previous reviews. The warning appropriately informs users about PyTorch backend limitations.
| PR_Github #12562 [ run ] triggered by Bot | 
| PR_Github #12561 [ run ] completed with state  | 
| PR_Github #12562 [ run ] completed with state  | 
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: Fix docstring formatting for the new static method. The method implementation is correct and provides a clean way to convert from pybind objects to Python objects. However, the docstring should be reformatted to one line.
Apply this fix to reformat the docstring:
-    @staticmethod
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    @staticmethod
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create a PeftCacheConfig from a pybind _PeftCacheConfig object."""
         return PeftCacheConfig(
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/llmapi/llm_args.py(5 hunks)
- tests/unittest/llmapi/test_llm.py(3 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unittest/llmapi/test_llm_pytorch.py
- tests/unittest/llmapi/test_llm.py
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (3)
644-644: Good documentation improvements. These description updates clarify the behavior and relationships between PEFT cache configuration fields, making it clearer to users how these parameters interact and what their current support status is.
Also applies to: 648-649, 684-684
1424-1427: Good improvement to LoRA configuration logic. The changes correctly:
- Simplify the condition for setting `lora_plugin` to 'auto' - now it only depends on `enable_lora` and backend type
- Add proper null checking before accessing `lora_config.max_lora_rank`
This aligns with the PR objective of making LoRA config parameters optional.
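Making the LoRA cache fields optional implies a fallback rule: a value set in `lora_config` wins, otherwise sizing is left to `peft_cache_config`. The sketch below only reports which config would decide each cache. `LoraConfigSketch` and `cache_sizing_sources` are hypothetical names, and the pairing of `max_loras` with the device cache and `max_cpu_loras` with the host cache is stated here as an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LoraConfigSketch:
    max_loras: Optional[int] = None      # assumed to size the device (GPU) cache
    max_cpu_loras: Optional[int] = None  # assumed to size the host (CPU) cache


def cache_sizing_sources(
        lora_config: Optional[LoraConfigSketch]) -> Dict[str, str]:
    """Report which config decides each cache size under the precedence rule."""

    def pick(value: Optional[int]) -> str:
        # A value set in lora_config takes precedence over peft_cache_config.
        return "lora_config" if value is not None else "peft_cache_config"

    device_value = lora_config.max_loras if lora_config is not None else None
    host_value = lora_config.max_cpu_loras if lora_config is not None else None
    return {"device": pick(device_value), "host": pick(host_value)}
```

Calling `cache_sizing_sources(LoraConfigSketch(max_loras=2))` reports `lora_config` for the device cache and `peft_cache_config` for the host cache, since only one of the two optional fields is set.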
1558-1562: Well-implemented validator with proper null checking. This validator correctly warns users when they attempt to use the unsupported `lora_prefetch_dir` feature. The null checking is properly implemented, addressing the concern from previous reviews.
d1dfbeb to b3c3e60
    | /bot run | 
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
702-719: LGTM: Well-implemented factory method with minor formatting fix needed. This factory method correctly creates `PeftCacheConfig` instances from pybind objects, supporting the PEFT cache merging functionality described in the PR objectives.
Address the docstring formatting issue flagged by static analysis:
-    def create_from_pybind(
-            peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+    def create_from_pybind(peft_cache_config: _PeftCacheConfig) -> "PeftCacheConfig":
+        """Create PeftCacheConfig from pybind _PeftCacheConfig object."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/_util.py(2 hunks)
- tensorrt_llm/llmapi/llm.py(3 hunks)
- tensorrt_llm/llmapi/llm_args.py(5 hunks)
- tensorrt_llm/lora_manager.py(1 hunks)
- tests/unittest/llmapi/test_llm.py(4 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py(6 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/lora_manager.py
- tests/unittest/llmapi/test_llm.py
- tensorrt_llm/llmapi/llm.py
🔇 Additional comments (4)
tensorrt_llm/llmapi/llm_args.py (4)
643-649: LGTM: Improved field descriptions for better clarity. The enhanced descriptions clearly explain how these fields affect cache sizes and their overriding behavior, which aligns with the PR's goal of improving LoRA cache memory control.
684-684: LGTM: Clear indication of unsupported feature. The updated description correctly indicates that LoRA prefetch is currently not supported, which helps set proper user expectations.
1424-1427: LGTM: Proper handling of optional lora_config. The changes correctly:
- Always set `lora_plugin` to 'auto' when LoRA is enabled (removing dependency on `lora_config` being None)
- Conditionally access `lora_config.max_lora_rank` only when `lora_config` exists
This properly supports the optional nature of `lora_config` parameters as described in the PR objectives.
1558-1562: LGTM: Proper validation with null safety. The validator correctly:
- Performs null check on `peft_cache_config` before accessing its attributes
- Provides clear warning about unsupported LoRA prefetch functionality
- Addresses the previous review comment about potential `AttributeError`
This implementation aligns with the updated field documentation indicating LoRA prefetch is not supported.
| PR_Github #12573 [ run ] triggered by Bot | 
| PR_Github #12573 [ run ] completed with state  | 
b3c3e60 to 3fe1f26
    …rrelevant to pytorch backend Signed-off-by: Amit Zuker <[email protected]>
… PybindMirror, updated its PeftCacheConfig tests accordingly, removed default values from description, raise exception when unused peft_cache_config.lora_prefetch_dir was set instead of writing a warning log message Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
…che sizes, fix incorrect lora request creation Signed-off-by: Amit Zuker <[email protected]>
50e940e to bce06ad
    | /bot run --disable-fail-fast | 
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
604-658: Fix docstring formatting. The implementation of the generic `from_pybind` method is well-designed and handles optional fields correctly. However, there's a minor docstring formatting issue.
Apply this fix for the docstring formatting:
-        """Construct an instance of the given class from the fields in the given
-        pybind class instance.
+        """Construct an instance of the given class from the fields in the given pybind class instance.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- examples/llm-api/llm_multilora.py(1 hunks)
- examples/llm-api/quickstart_multimodal.py(1 hunks)
- tensorrt_llm/_torch/models/modeling_phi4mm.py(1 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py(2 hunks)
- tensorrt_llm/llmapi/llm.py(3 hunks)
- tensorrt_llm/llmapi/llm_args.py(7 hunks)
- tensorrt_llm/lora_manager.py(1 hunks)
- tests/unittest/llmapi/apps/_test_openai_lora.py(1 hunks)
- tests/unittest/llmapi/apps/_test_trtllm_serve_lora.py(1 hunks)
- tests/unittest/llmapi/test_llm.py(4 hunks)
- tests/unittest/llmapi/test_llm_args.py(1 hunks)
- tests/unittest/llmapi/test_llm_multi_gpu.py(0 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py(6 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/llmapi/test_llm_multi_gpu.py
✅ Files skipped from review due to trivial changes (1)
- examples/llm-api/quickstart_multimodal.py
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/unittest/llmapi/apps/_test_trtllm_serve_lora.py
- tests/unittest/llmapi/apps/_test_openai_lora.py
- examples/llm-api/llm_multilora.py
- tensorrt_llm/lora_manager.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/models/modeling_phi4mm.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tensorrt_llm/llmapi/llm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
- tests/unittest/llmapi/test_llm.py
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/llmapi/test_llm_args.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
- tests/unittest/llmapi/test_llm.py
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/llmapi/test_llm_args.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tests/unittest/llmapi/test_llm.py (4)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.350Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx} : In function calls where parameters are not obvious from inspection, use an inline C comment to document the parameter for readers.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.350Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
tensorrt_llm/llmapi/llm_args.py (3)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
657-658: One-line docstring should fit on one line
Reformat to one line
(D200)
tests/unittest/llmapi/test_llm_args.py
269-269: PeftCacheConfig may be undefined, or defined from star imports
(F405)
299-299: PeftCacheConfig may be undefined, or defined from star imports
(F405)
309-309: PeftCacheConfig may be undefined, or defined from star imports
(F405)
311-311: PeftCacheConfig may be undefined, or defined from star imports
(F405)
🔇 Additional comments (13)
tests/unittest/llmapi/test_llm_args.py (3)
255-282: LGTM! Comprehensive test for pybind conversion. The test thoroughly validates the `from_pybind` class method by setting all fields explicitly and verifying they transfer correctly to the Python `PeftCacheConfig` instance.
284-314: Excellent test for default value handling. This test validates the critical behavior where Python-side defaults are applied when pybind fields are `None`. The use of `PeftCacheConfig.model_fields` to access expected defaults is the correct approach and ensures the test remains maintainable if default values change.
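The default-handling behavior these two tests cover can be sketched with a generic converter that skips `None` fields so the Python-side defaults survive. This is an illustration only: it uses a stdlib dataclass and `dataclasses.fields` in place of the real pydantic model and `model_fields`, `PeftCacheConfigSketch` is a hypothetical name, and `SimpleNamespace` stands in for the pybind object. The 2% and 1 GiB defaults are the ones named in this PR.

```python
from dataclasses import dataclass, fields
from types import SimpleNamespace


@dataclass
class PeftCacheConfigSketch:
    device_cache_percent: float = 0.02  # 2% default named in the PR
    host_cache_size: int = 1024**3      # 1 GiB default named in the PR
    num_host_module_layer: int = 0

    @classmethod
    def from_pybind(cls, pybind_obj: object) -> "PeftCacheConfigSketch":
        """Copy every non-None field; leave the rest at the Python defaults."""
        kwargs = {}
        for f in fields(cls):
            value = getattr(pybind_obj, f.name, None)
            if value is not None:
                kwargs[f.name] = value
        return cls(**kwargs)


# Stand-in for a pybind object whose size fields were never set.
partial = SimpleNamespace(device_cache_percent=None,
                          host_cache_size=None,
                          num_host_module_layer=4)
config = PeftCacheConfigSketch.from_pybind(partial)
```

Reading the expected defaults from the class definition, rather than hard-coding them in a test, is what keeps such a test valid if the defaults change.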
14-14: Static analysis false positives on star imports. The Ruff warnings about `PeftCacheConfig` being undefined are false positives. The class is correctly imported via the star import and used consistently throughout the test file, including in existing tests like `test_PeftCacheConfig_declaration`.
tests/unittest/llmapi/test_llm.py (6)
38-39: LGTM! The import of `PeftCacheConfig` is necessary for the new PEFT cache configuration tests and follows the existing import pattern.
54-56: LGTM! The import of the test harness function is necessary for the new PEFT cache tests and follows the existing import pattern.
1430-1431: LGTM! The addition of explicit LoRA cache size parameters to the `BuildConfig` aligns with the PR objective to improve LoRA cache memory control. The parameter values are appropriate for testing.
1484-1487: LGTM! The additional arguments to the test harness function are appropriate and consistent with the LoRA testing requirements.
1490-1523: Well-designed test for PEFT cache size validation. The test approach of using intentionally small cache sizes to trigger failures is clever since the actual cache sizes aren't directly accessible. The test covers both host and device cache configurations effectively.
The extremely small values (1 byte for host cache and 0.0000001 percent for device cache) should be sufficient to trigger failures on any realistic system configuration.
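The "provoke a failure with a tiny cache" approach can be sketched without the real engine. Everything below is a hypothetical stand-in: `load_adapter_into_host_cache` imitates a loader that rejects an adapter larger than the cache, and `ADAPTER_BYTES` is an invented size. Only the idea of expecting `RuntimeError` for an undersized cache comes from the review.

```python
ADAPTER_BYTES = 4 * 1024 * 1024  # invented adapter size, for illustration only


def load_adapter_into_host_cache(host_cache_size: int) -> None:
    """Hypothetical loader: fails when the cache cannot hold one adapter."""
    if host_cache_size < ADAPTER_BYTES:
        raise RuntimeError("host cache too small for a single adapter")


def raises_runtime_error(host_cache_size: int) -> bool:
    """Return True if loading with this cache size raises RuntimeError."""
    try:
        load_adapter_into_host_cache(host_cache_size)
    except RuntimeError:
        return True
    return False


# A 1-byte cache must fail; a 1 GiB cache must not.
tiny_cache_fails = raises_runtime_error(1)
large_cache_fails = raises_runtime_error(1024**3)
```

The failure is used as an observable proxy: if the configured size were ignored, the tiny-cache case would succeed and the check would catch it.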
1526-1544: Excellent test for configuration override behavior. The test effectively validates that `lora_config` parameters properly override `peft_cache_config` parameters. Using intentionally problematic values in `peft_cache_config` that are overridden by proper values in `lora_config` is a solid approach to verify the override mechanism.
tensorrt_llm/llmapi/llm_args.py (4)
6-6: LGTM! The new imports support the generic `from_pybind` method implementation and follow Python typing best practices.
Also applies to: 12-12, 65-65
757-757: LGTM! The changes properly establish default values for cache configuration and clarify the override behavior between different cache size parameters. The explicit defaults (2% for device cache, 1 GiB for host cache) align with the C++ implementation and improve configuration transparency.
Also applies to: 761-762, 789-800
1539-1542: LGTM! The simplification correctly removes dependency on deprecated LoRA fields and adds proper null checking for `lora_config` before accessing its properties. This aligns with making LoRA configuration parameters optional.
1672-1678: LGTM! The validator correctly enforces that the unsupported `lora_prefetch_dir` feature cannot be used. The null check prevents AttributeError when `peft_cache_config` is None, addressing the concern from previous reviews.
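A validator of this shape might look like the sketch below. The names are hypothetical and the choice of `ValueError` is an assumption (the PR only says an exception is raised); the point is the order of the checks, with the `None` test on the config coming before any attribute access.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeftCacheConfigSketch:
    lora_prefetch_dir: Optional[str] = None


def validate_peft_cache_config(
        peft_cache_config: Optional[PeftCacheConfigSketch]
) -> Optional[PeftCacheConfigSketch]:
    """Reject the unsupported prefetch directory; tolerate a missing config."""
    # Check the config itself first so a None config never raises AttributeError.
    if (peft_cache_config is not None
            and peft_cache_config.lora_prefetch_dir is not None):
        # Exception type is an assumption made for this sketch.
        raise ValueError("lora_prefetch_dir is not supported")
    return peft_cache_config
```

Raising instead of logging makes the unsupported setting impossible to miss, which matches the later commit that replaced the warning with an exception.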
| PR_Github #13503 [ run ] triggered by Bot | 
| PR_Github #13484 [ run ] completed with state  | 
| PR_Github #13503 [ run ] completed with state  | 
Signed-off-by: Amit Zuker <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Amit Zuker <[email protected]>
Description
- Changed `LoraConfig.max_loras` and `LoraConfig.max_cpu_loras` to be optional. When they're not set, the cache size is determined by the `PeftCacheConfig`.
- Merged `peft_cache_config` and `lora_config`: when cache size fields in `lora_config` have a value, they take precedence over the relevant fields in `peft_cache_config`.
- Removed the deprecated LoRA LLM args that duplicate fields of the `lora_config: LoraConfig` LLM arg: `max_lora_rank`, `max_loras`, `max_cpu_loras`.
- Set defaults in the `PeftCacheConfig` class for `device_cache_percent` to 2% and `host_cache_size` to 1GiB, the same default values that the CPP code uses when these fields have no value.
- `lora_config` in LLM args takes precedence over `lora_config` from the engine build config.
- Raise an exception when `peft_cache_config.lora_prefetch_dir` has a value, as it is currently not supported.
Summary by CodeRabbit
New Features
Tests
Examples
Documentation & Validation
Test Coverage
tests/unittest/llmapi/test_llm.py::test_llama_7b_peft_cache_config_affects_peft_cache_sizetests/unittest/llmapi/test_llm.py::test_llama_7b_lora_config_overrides_peft_cache_configtests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_peft_cache_config_affects_peft_cache_sizetests/unittest/llmapi/test_llm_pytorch.py::test_llama_7b_lora_config_overrides_peft_cache_configGitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.