
[Bugfix] Preserve TurboQuant sliding-window KV specs #41497

Open
lesj0610 wants to merge 3 commits into vllm-project:main from lesj0610:lesj/tq-sliding-window-kv-spec-pr

Conversation

@lesj0610 (Contributor) commented May 2, 2026

Summary

Fix TurboQuant KV cache spec selection for sliding-window attention.

TurboQuant + sliding-window layers need both tq_slot_size page sizing and sliding-window manager routing. The old code collapsed this combination into a plain SlidingWindowSpec, losing TurboQuant's page-size behavior.

Changes

  • Add TQSlidingWindowSpec (see the sketch after this list).
  • Return it from attention spec selection when TurboQuant and sliding-window are both enabled.
  • Preserve TurboQuant page sizing during hybrid KV spec unification.
  • Route TQSlidingWindowSpec to SlidingWindowManager.
  • Add focused tests for spec selection, page-size preservation, and manager routing.
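
A minimal sketch of the new spec's shape follows; everything except the tq_slot_size idea is an illustrative stand-in, not the actual definitions in vllm/v1/kv_cache_interface.py:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlidingWindowSpec:
    """Plain sliding-window KV spec; slots are sized by the KV dtype."""
    block_size: int
    num_kv_heads: int
    head_size: int
    dtype_size: int      # bytes per unquantized KV element, e.g. 2 for fp16
    sliding_window: int

    @property
    def page_size_bytes(self) -> int:
        # One K and one V element per token, per head, per head dim.
        return (2 * self.block_size * self.num_kv_heads
                * self.head_size * self.dtype_size)


@dataclass(frozen=True)
class TQSlidingWindowSpec(SlidingWindowSpec):
    """Sliding-window spec that carries TurboQuant's slot size."""
    tq_slot_size: int = 1  # bytes per quantized KV element (illustrative)

    @property
    def page_size_bytes(self) -> int:
        # TurboQuant sizing: tq_slot_size replaces the dtype element size,
        # so pages stay sized for the quantized cache layout.
        return (2 * self.block_size * self.num_kv_heads
                * self.head_size * self.tq_slot_size)
```

Because TQSlidingWindowSpec is still a SlidingWindowSpec, window-aware code paths keep working, while page-size accounting follows the quantized layout.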

Related upstream PRs

This replaces the KV-spec fix from the larger TurboQuant PRs. No kernel or model-loading changes.

| Area | Existing PRs | This PR |
| --- | --- | --- |
| Scope | Broader PRs mix model support, kernels, platform/config paths, and KV-spec changes (#40108, #39931). | Fixes only KV spec selection and manager routing for TurboQuant + sliding-window. |
| TurboQuant + sliding-window KV spec | The spec can collapse into a plain SlidingWindowSpec, which loses TurboQuant page sizing. | Adds TQSlidingWindowSpec so tq_slot_size and sliding-window routing stay together. |
| Hybrid TurboQuant config | #41123 handles config/argument acceptance for hybrid TurboQuant. | Fixes the underlying KV-spec object after attention spec creation and during hybrid spec unification. |
| Review scope | Larger PRs are harder to review for a small bugfix. | Narrow scope, direct tests, no kernel/platform changes. |
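
On the hybrid-unification row: when window specs are widened to full attention, the TQ variant must stay TQ. A rough sketch of that branch under the same illustrative naming (TQFullAttentionSpec mirrors the PR's wording; the stub fields are placeholders):

```python
from dataclasses import dataclass


# Minimal stubs standing in for the real spec classes.
@dataclass(frozen=True)
class FullAttentionSpec:
    page_size_bytes: int


@dataclass(frozen=True)
class TQFullAttentionSpec(FullAttentionSpec):
    tq_slot_size: int = 1


@dataclass(frozen=True)
class SlidingWindowSpec(FullAttentionSpec):
    sliding_window: int = 0


@dataclass(frozen=True)
class TQSlidingWindowSpec(SlidingWindowSpec):
    tq_slot_size: int = 1


def widen_for_hybrid(spec: FullAttentionSpec) -> FullAttentionSpec:
    """Widen a per-layer spec to full attention during hybrid unification."""
    if isinstance(spec, TQSlidingWindowSpec):
        # Fixed path: keep TurboQuant page sizing while dropping the window.
        return TQFullAttentionSpec(spec.page_size_bytes, spec.tq_slot_size)
    if isinstance(spec, SlidingWindowSpec):
        # Old path: a TQ window spec matching here came back as a plain
        # FullAttentionSpec, losing tq_slot_size.
        return FullAttentionSpec(spec.page_size_bytes)
    return spec
```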

Validation

  • ruff check on changed files: passed
  • pytest tests/v1/core/test_kv_cache_utils.py tests/v1/core/test_single_type_kv_cache_manager.py -q -k 'tq_sliding_window or preserves_tq_page_size or sliding_window_uses_sliding_window_manager': 2 passed
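
For reference, the shape of the page-size preservation check, reusing the illustrative spec classes sketched under Changes (values are placeholders, not this PR's test code):

```python
def test_tq_sliding_window_preserves_tq_page_size():
    common = dict(block_size=16, num_kv_heads=8, head_size=128,
                  dtype_size=2, sliding_window=1024)
    plain = SlidingWindowSpec(**common)
    tq = TQSlidingWindowSpec(**common, tq_slot_size=1)

    # Page size must follow tq_slot_size (1 byte here), not the
    # unquantized dtype size (2 bytes); the window itself is unchanged.
    assert tq.page_size_bytes == plain.page_size_bytes // 2
    assert tq.sliding_window == plain.sliding_window
```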

AI assistance was used (Codex, Claude, Gemini)

Keep TurboQuant page-size accounting when a layer also uses sliding-window attention by preserving tq_slot_size through the KV cache spec and manager dispatch paths.

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

@claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify (Bot) added the v1 and bug (Something isn't working) labels May 2, 2026
@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces support for TurboQuant (TQ) with sliding window attention in the v1 engine. It adds the TQSlidingWindowSpec class to handle TQ-aware page sizes and updates the attention layer, KV cache utilities, and manager mappings to support this new specification. Unit tests have been added to verify the correct behavior of TQ sliding window specs and their integration with the cache manager. I have no feedback to provide.

@lesj0610 (Contributor, Author) commented May 8, 2026

@heheda12345 Sorry, one more. This is another small fix in the hybrid quantized-KV path, related to #40308 but independent of it.

Problem: a TurboQuant layer with sliding_window was created as a plain SlidingWindowSpec, because get_kv_cache_spec checked sliding_window before the TurboQuant path. Hybrid unification then converted it through the plain FullAttentionSpec route, and the TQ page/slot-size info was lost.
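
In other words, the combined case has to be tested before the plain window branch. A simplified sketch of the branch order (the arguments and spec stubs are placeholders for illustration):

```python
from dataclasses import dataclass
from typing import Optional


# Placeholder specs; see the earlier sketches for the TQ page-size idea.
@dataclass
class FullAttentionSpec:
    pass


@dataclass
class TQFullAttentionSpec(FullAttentionSpec):
    pass


@dataclass
class SlidingWindowSpec(FullAttentionSpec):
    sliding_window: int = 0


@dataclass
class TQSlidingWindowSpec(SlidingWindowSpec):
    pass


def get_kv_cache_spec(sliding_window: Optional[int], use_turboquant: bool):
    """Simplified spec selection showing only the branch order at issue."""
    if use_turboquant and sliding_window is not None:
        # Fixed path: handle the combined case first so TurboQuant page
        # sizing is not lost to the plain sliding-window branch below.
        return TQSlidingWindowSpec(sliding_window)
    if sliding_window is not None:
        # The old code reached this branch for TQ layers too, returning a
        # plain SlidingWindowSpec without tq_slot_size.
        return SlidingWindowSpec(sliding_window)
    if use_turboquant:
        return TQFullAttentionSpec()
    return FullAttentionSpec()
```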

This PR adds TQSlidingWindowSpec, which keeps the TQ-aware page size; makes TurboQuant sliding-window attention return it; and preserves it as TQFullAttentionSpec during hybrid unification. SlidingWindowManager also handles TQSlidingWindowSpec, so the sliding-window cap still works. This touches kv_cache_interface.py and v1/core, so your review would help a lot. It is a small patch with tests included.
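
On the manager side the change amounts to one extra dispatch entry. Assuming dispatch is keyed on the exact spec type, as in the spec-to-manager mapping in v1/core/single_type_kv_cache_manager.py, subclassing SlidingWindowSpec alone would not route correctly; the map needs an explicit entry (class bodies are stubs):

```python
# Stubs standing in for the real spec and manager classes.
class FullAttentionSpec: ...
class SlidingWindowSpec(FullAttentionSpec): ...
class TQSlidingWindowSpec(SlidingWindowSpec): ...

class FullAttentionManager: ...
class SlidingWindowManager: ...

spec_manager_map: dict[type, type] = {
    FullAttentionSpec: FullAttentionManager,
    SlidingWindowSpec: SlidingWindowManager,
    # The fix: TQ window specs reuse the sliding-window eviction logic,
    # since only their page sizing differs from the plain window spec.
    TQSlidingWindowSpec: SlidingWindowManager,
}


def manager_cls_for(spec: object) -> type:
    # Exact-type lookup: without the explicit entry above, a
    # TQSlidingWindowSpec would raise KeyError despite subclassing
    # SlidingWindowSpec.
    return spec_manager_map[type(spec)]
```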
