[KVCache] support unified cache backend #4903
Conversation
ltd0924 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.

Thanks for your contribution!
    self.mla_cache = envs.FD_ATTENTION_BACKEND == "MLA_ATTN"
    for i in range(self.model_config.num_hidden_layers):
        key_cache_name = f"key_caches_{i}_rank{local_rank}.device{self.device_id}"
        if not self.mla_cache:
This is now inconsistent with gpu_model_runner. Should this also branch on `if value_cache_shape` instead of the MLA flag?
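A minimal sketch of the shape-based check being suggested, assuming the variables from the diff above; the helper name, its parameters, and the value-cache name format are illustrative, not the actual FastDeploy code:

```python
def init_key_value_cache_names(model_config, value_cache_shape, local_rank, device_id):
    """Hypothetical helper: rely on value_cache_shape instead of an MLA flag."""
    names = []
    for i in range(model_config.num_hidden_layers):
        names.append(f"key_caches_{i}_rank{local_rank}.device{device_id}")
        # An empty value_cache_shape means the backend keeps no value cache (MLA),
        # so only register a value cache when the shape is non-empty.
        if value_cache_shape:
            names.append(f"value_caches_{i}_rank{local_rank}.device{device_id}")
    return names
```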
            self.head_dim,
        )
    ]
    return key_cache_shape, value_cache_shape
Returning it this way is risky. Please make key_cache_shape and value_cache_shape separate lists. Right now they are the same list, so if a caller modifies one of the returned values, the other return value changes as well without the caller knowing.
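A minimal, self-contained sketch of the aliasing risk described above; the function name, parameters, and numbers are illustrative only:

```python
# Illustrative only: why returning the same list for both shapes is risky.
def get_kv_cache_shape_aliased(max_num_blocks, kv_num_heads, block_size, head_dim):
    key_cache_shape = [max_num_blocks, kv_num_heads, block_size, head_dim]
    value_cache_shape = key_cache_shape  # both names point at the same list object
    return key_cache_shape, value_cache_shape

key_shape, value_shape = get_kv_cache_shape_aliased(1024, 8, 64, 128)
key_shape[0] *= 2
print(value_shape[0])  # 2048: the "other" return value changed as well

# Returning independent lists, e.g. value_cache_shape = list(key_cache_shape),
# avoids this shared-state surprise.
```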
Fixed.
Pull Request Overview
This PR refactors the KV cache shape handling across the codebase to support separate key and value cache shapes, particularly for Multi-Head Latent Attention (MLA) architectures, where the value cache may not be needed. The refactoring changes the `get_kv_cache_shape()` method to return a tuple of two shapes instead of a single shape (an illustrative sketch follows the change list below).
Key Changes:
- Modified all attention backends to return separate `key_cache_shape` and `value_cache_shape` from `get_kv_cache_shape()`
- Updated model runners and cache managers to handle separate shapes for key and value caches
- Replaced MLA-specific flags (`self.mla_cache`) with shape-based logic (checking whether `value_cache_shape` is empty)
- Updated cache transfer and cache messager modules to pass shape information as command-line arguments
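A hedged sketch of the refactored interface described above; the class, attribute names, shape layout, and the `use_mla` flag are assumptions for illustration and may differ from the actual FastDeploy backends:

```python
from typing import List, Tuple

# Illustrative sketch only; real FastDeploy attention backends differ in detail.
class AttentionBackendSketch:
    def __init__(self, kv_num_heads: int, block_size: int, head_dim: int, use_mla: bool):
        self.kv_num_heads = kv_num_heads
        self.block_size = block_size
        self.head_dim = head_dim
        self.use_mla = use_mla  # assumed flag standing in for an MLA-style backend

    def get_kv_cache_shape(self, max_num_blocks: int) -> Tuple[List[int], List[int]]:
        key_cache_shape = [max_num_blocks, self.kv_num_heads, self.block_size, self.head_dim]
        # MLA-style backends allocate no separate value cache, signalled by an empty shape.
        value_cache_shape = [] if self.use_mla else list(key_cache_shape)
        return key_cache_shape, value_cache_shape

# Callers can then branch on the shape instead of an MLA-specific flag:
backend = AttentionBackendSketch(kv_num_heads=8, block_size=64, head_dim=128, use_mla=True)
key_shape, value_shape = backend.get_kv_cache_shape(max_num_blocks=1024)
if value_shape:
    pass  # allocate the value cache; skipped entirely for MLA backends
```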
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/cache_manager/test_cache_transfer_manager.py | Updated test Args to use new shape-based parameters |
| fastdeploy/model_executor/layers/attention/*.py | Modified all attention backends to return tuple of (key_cache_shape, value_cache_shape) |
| fastdeploy/worker/*_model_runner.py | Updated model runners to unpack and use separate cache shapes |
| fastdeploy/cache_manager/*.py | Refactored cache managers to compute and pass shapes as command-line arguments |
| fastdeploy/demo/offline_disaggregated_demo.py | Changed demo configuration (model path and port) |
Motivation
Standardize how the unified KV cache shape is obtained in prefix cache and PD disaggregation scenarios, reducing the cost of integrating different attention backends with the prefix cache.
Modifications
Refactored `get_kv_cache_shape` to distinguish the shapes of the key cache and the value cache.
Usage or Command
No change.
Accuracy Tests
The existing CI already covers this; no new tests are needed.
Checklist
- Add at least one tag to the PR title, chosen from: `[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.