[Bugfix] Preserve TurboQuant sliding-window KV specs #41497
lesj0610 wants to merge 3 commits into vllm-project:main
Keep TurboQuant page-size accounting when a layer also uses sliding-window attention by preserving tq_slot_size through the KV cache spec and manager dispatch paths. Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Code Review
This pull request introduces support for TurboQuant (TQ) with sliding window attention in the v1 engine. It adds the TQSlidingWindowSpec class to handle TQ-aware page sizes and updates the attention layer, KV cache utilities, and manager mappings to support this new specification. Unit tests have been added to verify the correct behavior of TQ sliding window specs and their integration with the cache manager. I have no feedback to provide.
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
@heheda12345 Sorry, one more. This is another small fix in the hybrid quantized-KV path, related to #40308 but independent of it.

Problem: a TurboQuant layer with sliding_window was created as a plain SlidingWindowSpec, because get_kv_cache_spec checked sliding_window before the TurboQuant path. Hybrid unification then converted it through the normal FullAttentionSpec route, and the TQ page/slot size information was lost.

This PR adds TQSlidingWindowSpec, which keeps the TQ-aware page size, makes TurboQuant sliding-window attention return it, and preserves it as TQFullAttentionSpec during hybrid unification. SlidingWindowManager also handles TQSlidingWindowSpec, so the sliding-window cap still works.

This touches kv_cache_interface.py and v1/core, so your review would help a lot. It is a small patch with tests included.
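To make the ordering problem concrete, here is a minimal sketch, not the actual vLLM code. Every class and function name below (the `Toy*` specs, `select_spec_old`, `select_spec_new`, `unify_to_full`) is a hypothetical stand-in, and both page-size formulas and the `tq_slot_size=72` value are assumptions chosen only to show the effect.

```python
from dataclasses import dataclass
from typing import Optional


# Toy stand-ins for illustration only. Field names and page-size formulas
# are assumptions, not vLLM's real definitions.
@dataclass(frozen=True)
class ToySpec:
    block_size: int
    num_kv_heads: int
    head_size: int
    dtype_bytes: int

    @property
    def page_size_bytes(self) -> int:
        # Assumed unquantized layout: K and V for every token in the block.
        return 2 * self.block_size * self.num_kv_heads * self.head_size * self.dtype_bytes


@dataclass(frozen=True)
class ToyFullAttentionSpec(ToySpec):
    pass


@dataclass(frozen=True)
class ToyTQFullAttentionSpec(ToySpec):
    tq_slot_size: int

    @property
    def page_size_bytes(self) -> int:
        # Assumed TQ layout: one packed slot per token per KV head.
        return self.block_size * self.num_kv_heads * self.tq_slot_size


@dataclass(frozen=True)
class ToySlidingWindowSpec(ToySpec):
    sliding_window: int


@dataclass(frozen=True)
class ToyTQSlidingWindowSpec(ToySlidingWindowSpec):
    tq_slot_size: int

    @property
    def page_size_bytes(self) -> int:
        return self.block_size * self.num_kv_heads * self.tq_slot_size


def select_spec_old(common: dict, sliding_window: Optional[int],
                    tq_slot_size: Optional[int]) -> ToySpec:
    # Old order: the sliding-window branch returns first, so tq_slot_size is dropped.
    if sliding_window is not None:
        return ToySlidingWindowSpec(**common, sliding_window=sliding_window)
    if tq_slot_size is not None:
        return ToyTQFullAttentionSpec(**common, tq_slot_size=tq_slot_size)
    return ToyFullAttentionSpec(**common)


def select_spec_new(common: dict, sliding_window: Optional[int],
                    tq_slot_size: Optional[int]) -> ToySpec:
    # New order: handle TurboQuant first and keep the window when both are set.
    if tq_slot_size is not None:
        if sliding_window is not None:
            return ToyTQSlidingWindowSpec(**common, sliding_window=sliding_window,
                                          tq_slot_size=tq_slot_size)
        return ToyTQFullAttentionSpec(**common, tq_slot_size=tq_slot_size)
    if sliding_window is not None:
        return ToySlidingWindowSpec(**common, sliding_window=sliding_window)
    return ToyFullAttentionSpec(**common)


def unify_to_full(spec: ToySpec) -> ToySpec:
    # Hybrid unification: drop the window, but keep TQ sizing if it was present.
    common = dict(block_size=spec.block_size, num_kv_heads=spec.num_kv_heads,
                  head_size=spec.head_size, dtype_bytes=spec.dtype_bytes)
    if isinstance(spec, ToyTQSlidingWindowSpec):
        return ToyTQFullAttentionSpec(**common, tq_slot_size=spec.tq_slot_size)
    return ToyFullAttentionSpec(**common)


common = dict(block_size=16, num_kv_heads=8, head_size=128, dtype_bytes=2)
old = select_spec_old(common, sliding_window=1024, tq_slot_size=72)
new = select_spec_new(common, sliding_window=1024, tq_slot_size=72)
for spec in (old, new, unify_to_full(new)):
    print(type(spec).__name__, spec.page_size_bytes)
```

In this toy model the old selection reports the unquantized page size for a TurboQuant layer, while the new selection and the unified spec both report the TQ-aware size.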
Summary
Fix TurboQuant KV cache spec selection for sliding-window attention.
TurboQuant + sliding-window layers need both tq_slot_size page sizing and sliding-window manager routing. The old code could collapse this into a plain SlidingWindowSpec and lose TurboQuant page-size behavior.

Changes
- Add TQSlidingWindowSpec.
- Map TQSlidingWindowSpec to SlidingWindowManager.

Related upstream PRs
This replaces the KV-spec fix from the larger TurboQuant PRs. No kernel or model-loading changes.
- SlidingWindowSpec, which loses TurboQuant page sizing.
- TQSlidingWindowSpec, so tq_slot_size and sliding-window routing stay together.

Validation
- ruff check on changed files: passed
- pytest tests/v1/core/test_kv_cache_utils.py tests/v1/core/test_single_type_kv_cache_manager.py -q -k 'tq_sliding_window or preserves_tq_page_size or sliding_window_uses_sliding_window_manager': 2 passed

AI assistance was used (Codex, Claude, Gemini)
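For readers unfamiliar with why the manager mapping needs its own entry, the following is a rough sketch under one assumption: that manager dispatch is a dictionary keyed by the exact spec type. The registry, `get_manager`, and all `Toy*` classes are hypothetical names used for illustration, not the real vLLM API.

```python
from dataclasses import dataclass


# Toy stand-ins; names and fields are illustrative assumptions.
@dataclass(frozen=True)
class ToySlidingWindowSpec:
    sliding_window: int


@dataclass(frozen=True)
class ToyTQSlidingWindowSpec(ToySlidingWindowSpec):
    tq_slot_size: int


class ToySlidingWindowManager:
    def __init__(self, spec: ToySlidingWindowSpec) -> None:
        self.spec = spec


def get_manager(spec, registry: dict):
    # Exact-type lookup: a subclass is NOT found unless it is registered itself.
    return registry[type(spec)](spec)


registry_before = {ToySlidingWindowSpec: ToySlidingWindowManager}
registry_after = {
    ToySlidingWindowSpec: ToySlidingWindowManager,
    ToyTQSlidingWindowSpec: ToySlidingWindowManager,  # the added mapping
}

tq_spec = ToyTQSlidingWindowSpec(sliding_window=1024, tq_slot_size=72)

try:
    get_manager(tq_spec, registry_before)
    routed_before = True
except KeyError:
    # Without the extra entry the TQ spec has no manager at all.
    routed_before = False

manager = get_manager(tq_spec, registry_after)
print(routed_before, type(manager).__name__, manager.spec.tq_slot_size)
```

With an exact-type registry like this one, subclassing alone is not enough: the TQ spec only reaches the sliding-window manager once it has its own entry, and the manager still sees `tq_slot_size` on the spec it receives.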