[Bugfix] Disable cross-layer KV cache for MLA attention backends#37090
orozery merged 5 commits into vllm-project:main
Conversation
MLA attention kernels assume contiguous per-layer KV cache views. When KV offloading enables cross-layer blocks, per-layer views become non-contiguous (stride(0) includes a num_layers factor), causing decode kernels to read from wrong memory locations and produce garbage output. Have MLA backends raise NotImplementedError from get_kv_cache_stride_order(include_num_layers_dimension=True) so that use_uniform_kv_cache() falls back to per-layer registration with contiguous tensors. Fixes: vllm-project#37032 Co-authored-by: Claude Signed-off-by: haosdent <haosdent@gmail.com>
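The non-contiguity described above can be reproduced outside vLLM. The sketch below uses NumPy as a stand-in for torch (same stride semantics); all sizes are hypothetical, chosen only to illustrate how a cross-layer buffer makes per-layer views non-contiguous.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
num_layers, num_blocks, block_size, head_dim = 4, 8, 16, 32

# Cross-layer allocation: one physical buffer with the blocks dimension
# outermost, then permuted so the logical shape is (num_layers, num_blocks, ...).
physical = np.zeros((num_blocks, num_layers, block_size, head_dim), dtype=np.float32)
logical = physical.transpose(1, 0, 2, 3)  # logical shape (num_layers, num_blocks, ...)

layer_view = logical[0]  # per-layer view handed to an attention backend
# The stride over blocks now includes a num_layers factor, so the view is
# non-contiguous -- the layout MLA decode kernels cannot handle.
print(layer_view.flags["C_CONTIGUOUS"])              # False
print(layer_view.strides[0] // layer_view.itemsize)  # num_layers * block_size * head_dim

# A plain per-layer allocation (the fallback path) stays contiguous:
per_layer = np.zeros((num_blocks, block_size, head_dim), dtype=np.float32)
print(per_layer.flags["C_CONTIGUOUS"])               # True
```

Note that for block 0 the view's data pointer still coincides with a valid block, which is why the corruption only shows up for `block_id > 0`.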
Code Review
This pull request addresses a bug causing garbage output with MLA models when KV offloading is enabled. The root cause is the incorrect handling of non-contiguous memory layouts by MLA attention kernels when the cross-layer KV cache is used. The fix correctly disables this feature for MLA backends by raising a NotImplementedError, which leverages an existing fallback mechanism to use per-layer contiguous KV caches. The changes are well-targeted, clearly explained, and accompanied by new tests that verify the corrected behavior. The solution appears robust and effectively resolves the issue.
Hi @haosdent, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
The OffloadingConnectorWorker.register_kv_caches() used `Attention` as the layer type filter, which excluded MLA attention layers (MLAAttention is not a subclass of Attention). This caused a KeyError when MLA models fell back to per-layer KV caches. Use `AttentionLayerBase` (the common base class) instead to match both Attention and MLAAttention layers. Co-authored-by: Claude Signed-off-by: haosdent <haosdent@gmail.com>
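The class-hierarchy pitfall behind this KeyError can be shown in a few lines. The classes below are minimal stand-ins mirroring the hierarchy the commit describes (MLAAttention is not a subclass of Attention, but both derive from AttentionLayerBase); the dict of layers is hypothetical.

```python
# Stand-in classes mirroring vLLM's attention layer hierarchy.
class AttentionLayerBase: ...
class Attention(AttentionLayerBase): ...
class MLAAttention(AttentionLayerBase): ...  # NOT a subclass of Attention

layers = {"layers.0.attn": Attention(), "layers.1.attn": MLAAttention()}

# Filtering on Attention silently drops the MLA layer -- any later lookup
# of "layers.1.attn" in the filtered dict raises KeyError:
attn_only = {n: l for n, l in layers.items() if isinstance(l, Attention)}
print(sorted(attn_only))  # ['layers.0.attn']

# Filtering on the common base class matches both layer types:
all_attn = {n: l for n, l in layers.items() if isinstance(l, AttentionLayerBase)}
print(sorted(all_attn))   # ['layers.0.attn', 'layers.1.attn']
```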
@haosdent thanks for this fix! However, instead of raising NotImplementedError, consider returning the identity permutation from get_kv_cache_stride_order(include_num_layers_dimension=True).
Thanks @orozery! Let me update the pull request.
Hi @orozery, I checked again, and this approach does not seem to work. The identity permutation still allocates a single cross-layer tensor. Slicing tensor[i] gives a non-contiguous view where stride(0) includes a num_layers factor.
It will allocate a single cross-layer tensor, but will create per-layer views. I think that if stride_order[0] == 0, then use_uniform_kv_cache() can skip the cross-layer allocation entirely. This also aligns with #34742, aiming to remove all exception throwing by get_kv_cache_stride_order().
MLA backends now return identity permutation (0,1,2,3) from get_kv_cache_stride_order(include_num_layers_dimension=True) instead of raising NotImplementedError. use_uniform_kv_cache() checks stride_order[0] == 0 (layers dim first) to skip cross-layer allocation. This aligns with the direction of removing exceptions from get_kv_cache_stride_order. Signed-off-by: haosdent <haosdent@gmail.com>
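The gating check described in this commit can be sketched as follows. This is a hypothetical simplification, not vLLM's actual `use_uniform_kv_cache()` body; the helper name and signature are invented for illustration.

```python
# Sketch of the check described above: if a backend's stride order keeps
# the layers dimension (logical index 0) first in the physical layout,
# treat cross-layer (uniform) KV cache as unsupported.
def supports_cross_layer(stride_order: tuple[int, ...]) -> bool:
    # stride_order[0] == 0 means num_layers stays outermost, so a shared
    # cross-layer buffer would yield non-contiguous per-layer views;
    # skip cross-layer allocation and fall back to per-layer caches.
    return stride_order[0] != 0

print(supports_cross_layer((0, 1, 2, 3)))  # False: MLA's identity permutation
print(supports_cross_layer((1, 0, 2, 3)))  # True: blocks-first layout
```

Signalling through the returned permutation, rather than an exception, keeps `get_kv_cache_stride_order()` total and lets the caller make the allocation decision.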
@haosdent LGTM!
Signed-off-by: haosdent <haosdent@gmail.com>
Done, thanks @orozery
…m-project#37090) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Or Ozeri <oro@il.ibm.com>
…m-project#37090) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…m-project#37090) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…m-project#37090) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
Fixes #37032
MLA models (e.g., GLM-4.7-Flash) produce garbage output with `--kv-offloading-size` because cross-layer KV cache allocation creates non-contiguous per-layer views. MLA decode kernels assume a contiguous block layout, so they read the wrong memory for block_id > 0.
Fix (3 parts)
- `mla_attention.py`, `indexer.py`: MLA backends return the identity permutation `(0, 1, 2, 3)` from `get_kv_cache_stride_order(include_num_layers_dimension=True)`, keeping `num_layers` first in the physical layout to signal that cross-layer is unsupported.
- `kv_connector_model_runner_mixin.py`: `use_uniform_kv_cache()` checks `stride_order[0] == 0` (layers dim first) and skips cross-layer allocation, falling back to per-layer contiguous KV caches.
- `offloading_connector.py`: `register_kv_caches()` uses `AttentionLayerBase` instead of `Attention`, matching both `Attention` and `MLAAttention` layers.
Test Plan
End-to-end with `deepseek-ai/DeepSeek-V2-Lite-Chat` + KV offloading on NVIDIA GB10.
Unit tests: 2/2 passed. End-to-end:
- Before: `' 夹缝缝缝缝缝缝缝缝缝缝缝缝缝缝缝缝缝缝'` (garbage)
- After: `' 5, 6, 7, 8, 9, 10.\n'` (correct)