Revert "[TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig … #6209
Conversation
…in LLM class for pytorch backend (NVIDIA#5752)"
This reverts commit a02606a.
Walkthrough
The changes refactor how the key-value cache data type (kv_cache_dtype) is configured: it is decoupled from KvCacheConfig and exposed as a top-level setting on the LLM / PyTorch backend configuration, while quantization is configured through an explicit quant_config (QuantConfig) passed to the LLM.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ConfigLoader
    participant LLM
    participant QuantConfig
    User->>ConfigLoader: Provide config dict (with kv_cache_dtype, quant_config)
    ConfigLoader->>LLM: Instantiate LLM(kv_cache_dtype, quant_config)
    LLM->>QuantConfig: Use quant_config for quantization setup
    LLM->>LLM: Use kv_cache_dtype for KV cache setup
```
Estimated code review effort: 4 (approximately one day)
Suggested reviewers
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
145-165: Consistent implementation of explicit quantization configuration.
The pattern of explicitly creating QuantConfig objects, setting quant_algo and kv_cache_quant_algo, and adding appropriate assertions is well implemented and consistent across the test suite.
This pattern is repeated throughout the file in multiple test methods. Consider extracting a helper function to reduce code duplication:
```python
def create_fp8_quant_config(fp8kv=False):
    quant_config = QuantConfig(QuantAlgo.FP8)
    if fp8kv:
        quant_config.kv_cache_quant_algo = QuantAlgo.FP8
    return quant_config
```
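As a rough illustration, each duplicated block could then shrink to a single call (hypothetical call site; the assertions mirror the ones already present in the tests):

```python
quant_config = create_fp8_quant_config(fp8kv=True)
assert quant_config.quant_algo == QuantAlgo.FP8
assert quant_config.kv_cache_quant_algo == QuantAlgo.FP8
```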
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- examples/llm-api/quickstart_advanced.py (1 hunks)
- tensorrt_llm/bench/benchmark/utils/general.py (2 hunks)
- tensorrt_llm/bench/dataclasses/configuration.py (1 hunks)
- tensorrt_llm/bench/dataclasses/reporting.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (4 hunks)
- tensorrt_llm/llmapi/llm_utils.py (0 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (34 hunks)
- tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2 hunks)
- tests/integration/defs/perf/pytorch_model_config.py (0 hunks)
- tests/unittest/_torch/modeling/test_modeling_deepseek.py (2 hunks)
- tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1 hunks)
- tests/unittest/api_stability/references_committed/llm.yaml (1 hunks)
💤 Files with no reviewable changes (2)
- tests/integration/defs/perf/pytorch_model_config.py
- tensorrt_llm/llmapi/llm_utils.py
🧰 Additional context used
🧠 Learnings (2)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/llmapi/llm_args.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tests/unittest/_torch/modeling/test_modeling_deepseek.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
KvCacheConfig (753-827)
🔇 Additional comments (20)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)
66-66: LGTM: Consistent with the KV cache dtype refactor.
This change correctly moves the kv_cache_dtype specification from the KvCacheConfig constructor to the top-level PyTorch configuration dictionary, aligning with the broader refactoring effort to decouple the KV cache data type from the cache config object.
examples/llm-api/quickstart_advanced.py (1)
197-197: LGTM: Improved configuration separation.
Moving kv_cache_dtype to the LLM constructor level provides better separation of concerns and makes the configuration more explicit, rather than embedding it within the KvCacheConfig object.
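A minimal sketch of the resulting call pattern, assuming the constructor keyword names quoted elsewhere in this review (the model id and dtype value below are placeholders):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# kv_cache_dtype sits next to the other LLM kwargs instead of inside KvCacheConfig.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",                # placeholder model id
    kv_cache_dtype="fp8",                                    # top-level setting
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),  # no dtype field here
)
```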
tests/unittest/_torch/modeling/test_modeling_deepseek.py (1)
71-71: LGTM: Follows the refactoring pattern correctly.
This change properly moves kv_cache_dtype from the KvCacheConfig constructor to the PyTorch configuration dictionary, consistent with the broader architectural refactoring to centralize KV cache data type configuration.
tensorrt_llm/bench/dataclasses/configuration.py (1)
117-117: LGTM: Adds KV cache dtype to autodeploy configuration.
This addition ensures that the kv_cache_dtype configuration is available for the autodeploy backend, with "auto" being a sensible default value that allows automatic data type determination.
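For illustration, the autodeploy options would then carry the key alongside their existing fields; everything below except the kv_cache_dtype entry and its "auto" default is hypothetical:

```python
# Sketch of the autodeploy backend options with the new key.
autodeploy_options = {
    "backend": "autodeploy",    # hypothetical pre-existing field
    "kv_cache_dtype": "auto",   # new: defers KV cache dtype selection to the backend
}
```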
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (3)
124-124: LGTM: Consistent KV cache dtype configuration for workers.
These changes properly add kv_cache_dtype="auto" to both the context and generation worker configurations in the verify_disaggregated function, ensuring consistent KV cache data type handling across the disaggregated serving setup.
Also applies to: 131-131
250-250: LGTM: Worker configuration updates for capacity test.
The addition of kv_cache_dtype="auto" to both worker configurations in the capacity test ensures proper KV cache data type configuration for the disaggregated context capacity testing scenario.
Also applies to: 257-257
261-263: LGTM: Simplified KvCacheConfig constructor.
The KvCacheConfig constructor now only includes the essential cache configuration parameters (max_tokens and enable_block_reuse), with the data type configuration moved to the higher-level worker configuration. This improves separation of concerns.
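A small sketch of the resulting split, with placeholder values (only the parameter names come from the review):

```python
from tensorrt_llm.llmapi import KvCacheConfig

# The cache config keeps only cache-behaviour parameters ...
kv_cache_config = KvCacheConfig(max_tokens=2048, enable_block_reuse=False)

# ... while the dtype lives in the higher-level worker configuration.
worker_config = {"kv_cache_dtype": "auto"}
```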
tests/unittest/api_stability/references_committed/llm.yaml (1)
60-66: LGTM: Parameter additions align with revert objectives.
The addition of the quant_config and calib_config parameters to the LLM __init__ method is consistent with the revert's goal of undoing the KvCacheConfig refactoring. The parameter annotations and positioning are appropriate.
tensorrt_llm/bench/benchmark/utils/general.py (2)
95-96: Configuration simplification approved.
The direct check for "kv_cache_dtype" in the YAML configuration removes unnecessary nesting and aligns with the revert's goal of simplifying KV cache dtype handling.
163-163: All downstream consumers now use top-level kv_cache_dtype
Verified across the codebase—including integration tests (pytorch_config["kv_cache_dtype"]), CLI parsers, and all pyexecutor components—there are no remaining references to pyt_options["kv_cache_config"]["dtype"]. The new top-level key is correctly consumed everywhere.
Location to note:
- tensorrt_llm/bench/benchmark/utils/general.py:163 ("kv_cache_dtype": kv_cache_dtype,)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)
26-26: LGTM: Import addition aligns with refactoring goals.
The addition of the QuantConfig import supports the new pattern of explicit quantization configuration management as described in the PR summary.
163-165: Good test coverage for the new quantization configuration.
The conditional assertions properly verify that kv_cache_quant_algo is set to QuantAlgo.FP8 when the FP8 key-value cache is enabled, ensuring the refactored configuration is working as expected.
156-158: Correct implementation of decoupled kv_cache_dtype configuration.
The pattern of setting kv_cache_dtype directly in pytorch_config when the FP8 key-value cache is enabled properly implements the refactoring goal of decoupling this setting from KvCacheConfig.
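Roughly, the test-side pattern looks like this sketch (the baseline pytorch_config contents and the fp8kv flag value are illustrative; only the conditional kv_cache_dtype assignment is described above):

```python
fp8kv = True                                   # whether the test enables the FP8 KV cache
pytorch_config = dict(attn_backend="TRTLLM")   # hypothetical baseline options
if fp8kv:
    pytorch_config["kv_cache_dtype"] = "fp8"   # set directly, decoupled from KvCacheConfig
```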
770-774: Appropriate pattern for complex quantization configurations.
The two-step approach of creating an empty QuantConfig and then setting quant_algo and kv_cache_quant_algo is correctly implemented for the FP8_BLOCK_SCALES configuration in the DeepSeek tests.
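For reference, a minimal sketch of that two-step construction (the import paths are assumptions, not copied from the diff):

```python
from tensorrt_llm.llmapi import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

quant_config = QuantConfig()                            # start from an empty config
quant_config.quant_algo = QuantAlgo.FP8_BLOCK_SCALES    # block-scaled FP8 for the model
quant_config.kv_cache_quant_algo = QuantAlgo.FP8        # FP8 KV cache on top
```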
1269-1278: Well-structured conditional quantization configuration.
The conditional logic properly handles different quantization scenarios by setting quant_config to None when no quantization is needed, and creating appropriate QuantConfig objects for the FP8 and NVFP4 quantization types.
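The conditional roughly follows this shape (the selector variable is illustrative; the None / FP8 / NVFP4 branches come from the comment above, with QuantConfig and QuantAlgo imported as in the previous sketch):

```python
quant_name = "fp8"                                 # illustrative selector
if quant_name is None:
    quant_config = None                            # no quantization requested
elif quant_name == "fp8":
    quant_config = QuantConfig(QuantAlgo.FP8)
elif quant_name == "nvfp4":
    quant_config = QuantConfig(QuantAlgo.NVFP4)
```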
1-1663: Excellent execution of the quantization configuration refactoring.
The changes throughout this file demonstrate a well-planned and consistently executed refactoring that:
- Decouples kv_cache_dtype from KvCacheConfig - Moving it to a dedicated field as described in the AI summary
- Centralizes quantization configuration - Making QuantConfig usage explicit and consistent
- Maintains comprehensive test coverage - Adding appropriate assertions to verify the new configuration approach
- Preserves functionality - No breaking changes to the test API while implementing the new pattern
The consistent application across all model test classes (Llama, Mixtral, DeepSeek, Qwen, etc.) shows thorough attention to detail and proper testing of the refactored functionality.
tensorrt_llm/llmapi/llm_args.py (4)
1037-1039: LGTM! Centralized quantization config.
Moving quant_config to the base class promotes better code reuse and consistency across both the PyTorch and TensorRT backends.
1220-1226: Good default initialization pattern.
The validator ensures quant_config is always initialized with a default QuantConfig() instance, preventing potential None reference errors downstream.
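As a sketch of that pattern (the real validator lives in tensorrt_llm/llmapi/llm_args.py; the simplified pydantic model below is a stand-in, not the actual class):

```python
from typing import Optional
from pydantic import BaseModel, model_validator
from tensorrt_llm.llmapi import QuantConfig   # assumed import path

class LlmArgsSketch(BaseModel):
    quant_config: Optional[QuantConfig] = None

    @model_validator(mode="after")
    def _default_quant_config(self):
        # Fall back to a default QuantConfig so downstream code never sees None.
        if self.quant_config is None:
            self.quant_config = QuantConfig()
        return self
```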
1832-1834: Appropriate backend-specific configuration.
Decoupling kv_cache_dtype from KvCacheConfig and making it PyTorch-specific is a good architectural decision, as it allows for backend-specific dtype handling while keeping the shared config clean.
2066-2066: Compatibility check passed: no remaining references to kv_cache_config.dtype
Ran the searches and AST grep—no occurrences of the old kv_cache_config.dtype pattern remain.
- Confirmed that self.kv_cache_dtype is now used consistently in get_pytorch_backend_config.
- Please manually verify any downstream code or deserialization logic (e.g., loading from saved checkpoints or serialized configs) to ensure it handles the new kv_cache_dtype field correctly.
```python
validate_and_set_kv_cache_quant(model_config,
                                self.kwargs["kv_cache_dtype"])
```
💡 Verification agent
🧩 Analysis chain
Simplification looks good, but verify kwargs consistency.
The direct access to self.kwargs["kv_cache_dtype"] simplifies the logic by removing the intermediate KvCacheConfig extraction. However, ensure that all callers consistently provide kv_cache_dtype in kwargs to avoid KeyError exceptions.
🏁 Script executed:
```bash
#!/bin/bash
# Description: Verify that kv_cache_dtype is consistently provided in kwargs across the codebase
# Expected: All ReportUtility instantiations should include kv_cache_dtype in kwargs

# Search for ReportUtility instantiations
echo "=== ReportUtility instantiations ==="
rg -A 10 -B 5 "ReportUtility\(" --type py

# Search for places where kwargs might be passed to ReportUtility
echo -e "\n=== Potential kwargs sources ==="
rg -A 5 -B 5 "kv_cache_dtype.*kwargs\|kwargs.*kv_cache_dtype" --type py
```

Length of output: 3274
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "=== ReportUtility context in low_latency.py ==="
rg -n -C 5 "report_utility" tensorrt_llm/bench/benchmark/low_latency.py || true
echo -e "\n=== 'kwargs' declarations/usages in low_latency.py ==="
rg -n "kwargs" tensorrt_llm/bench/benchmark/low_latency.py || true
echo -e "\n=== ReportUtility context in throughput.py ==="
rg -n -C 5 "report_utility" tensorrt_llm/bench/benchmark/throughput.py || true
echo -e "\n=== 'kwargs' declarations/usages in throughput.py ==="
rg -n "kwargs" tensorrt_llm/bench/benchmark/throughput.py || true
```

Length of output: 4469
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "=== Searching for get_llm_args definition and usage ==="
rg -n "get_llm_args" -t py
echo -e "\n=== Inspecting RuntimeConfig class for kv_cache_dtype ==="
rg -n -C 3 "class .*RuntimeConfig" -t py || true
```

Length of output: 1901
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "=== Extracting RuntimeConfig class and get_llm_args method ==="
rg -n -C 50 "class RuntimeConfig" tensorrt_llm/bench/dataclasses/configuration.py
echo -e "\n=== Extracting get_llm_args implementation ==="
rg -n -C 50 "def get_llm_args" tensorrt_llm/bench/dataclasses/configuration.py
echo -e "\n=== Searching for kv_cache_dtype in configuration ==="
rg -n "kv_cache_dtype" -t py
```

Length of output: 35393
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "=== validate_and_set_kv_cache_quant definition ==="
rg -n -C 20 "def validate_and_set_kv_cache_quant" tensorrt_llm/bench/benchmark/utils/general.py || true
```

Length of output: 224
Add kv_cache_dtype to kwargs before calling ReportUtility
The new call to validate_and_set_kv_cache_quant(model_config, self.kwargs["kv_cache_dtype"]) will raise a KeyError unless every ReportUtility instantiation's kwargs includes "kv_cache_dtype". Please update both benchmark entry points:
• In tensorrt_llm/bench/benchmark/low_latency.py (around line 353):
```diff
  # merge runtime_config args
  kwargs = kwargs | runtime_config.get_llm_args()
  kwargs["backend"] = backend
+ # ensure kv_cache_dtype is passed through for reporting
+ kwargs["kv_cache_dtype"] = runtime_config.settings_config.get_kvcache_config().kv_cache_dtype
  report_utility = ReportUtility(statistics, metadata, runtime_config,
-                                logger, kwargs, True)
+                                logger, kwargs, True)
```

• In tensorrt_llm/bench/benchmark/throughput.py (around line 436):
```diff
  kwargs = kwargs | runtime_config.get_llm_args()
  kwargs["backend"] = backend
+ # ensure kv_cache_dtype is passed through for reporting
+ kwargs["kv_cache_dtype"] = runtime_config.settings_config.get_kvcache_config().kv_cache_dtype
  report_utility = ReportUtility(statistics, metadata, runtime_config,
-                                logger, kwargs, streaming)
+                                logger, kwargs, streaming)
```

With these additions, self.kwargs["kv_cache_dtype"] will always be defined and validate_and_set_kv_cache_quant will receive the expected dtype.
🤖 Prompt for AI Agents
In tensorrt_llm/bench/dataclasses/reporting.py at lines 278-279, the code
accesses self.kwargs["kv_cache_dtype"] which can cause a KeyError if
"kv_cache_dtype" is not present in kwargs. To fix this, update the benchmark
entry points in tensorrt_llm/bench/benchmark/low_latency.py around line 353 and
tensorrt_llm/bench/benchmark/throughput.py around line 436 to include
"kv_cache_dtype" in the kwargs passed to ReportUtility. This ensures
self.kwargs["kv_cache_dtype"] is always defined before calling
validate_and_set_kv_cache_quant.
…in LLM class for pytorch backend (#5752)"
This reverts commit a02606a.
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.