Fix TURBOQUANT backend selection in cuda.py #40060
Conversation
Added TURBOQUANT to the list of attention backends and removed the specialized TurboQuant KV cache handling. Signed-off-by: Michael Goin <mgoin64@gmail.com>
Code Review
This pull request integrates TURBOQUANT into the standard attention backend priority lists for CUDA platforms and removes the previous hardcoded bypass for TurboQuant KV cache types. However, moving TURBOQUANT to the end of the priority list is a regression that could lead to incorrect backend selection: an earlier backend will be chosen even when TurboQuant was explicitly requested. The recommendation is to use the kv_cache_dtype parameter to prioritize TURBOQUANT when it is explicitly requested, while still allowing standard validation.
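A minimal sketch of that recommendation, assuming a priority-list selection scheme; the enum members, dtype strings, and function below are hypothetical illustrations, not vLLM's actual API:

```python
from enum import Enum, auto


class AttentionBackend(Enum):
    # Hypothetical stand-ins for the real backend enum in vLLM.
    FLASH_ATTN = auto()
    FLASHINFER = auto()
    TURBOQUANT = auto()


DEFAULT_PRIORITY = [AttentionBackend.FLASH_ATTN, AttentionBackend.FLASHINFER]


def select_backend_priority(kv_cache_dtype: str | None) -> list[AttentionBackend]:
    """Return candidate backends in priority order (illustrative only)."""
    if kv_cache_dtype and kv_cache_dtype.startswith("turboquant"):
        # Explicitly requested TurboQuant KV cache: try TURBOQUANT first,
        # but keep the standard candidates so the usual validation and
        # fallback logic still apply.
        return [AttentionBackend.TURBOQUANT, *DEFAULT_PRIORITY]
    # Default path: TURBOQUANT is not silently preferred.
    return DEFAULT_PRIORITY
```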
Hi @mgoin, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Documentation preview: https://vllm--40060.org.readthedocs.build/en/40060/
Purpose
Added TURBOQUANT to the selection list of attention backends and removed the specialized TurboQuant KV cache handling; a sketch of the shape of the change follows.
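For context, a rough before/after sketch of the change, assuming backend selection walks a priority list; every name below is illustrative, not the actual cuda.py symbol:

```python
# Before: specialized handling short-circuited selection for TurboQuant
# KV cache dtypes, bypassing the validation applied to other backends.
def select_backend_before(kv_cache_dtype: str, priority: list[str]) -> str:
    if kv_cache_dtype.startswith("turboquant"):  # hypothetical dtype prefix
        return "TURBOQUANT"
    return priority[0]


# After: TURBOQUANT joins the ordinary priority list and is validated
# like any other backend.
BACKEND_PRIORITY = ["FLASH_ATTN", "FLASHINFER", "TURBOQUANT"]


def backend_is_supported(backend: str, kv_cache_dtype: str) -> bool:
    # Stand-in for vLLM's real capability checks.
    return backend != "TURBOQUANT" or kv_cache_dtype.startswith("turboquant")


def select_backend_after(kv_cache_dtype: str) -> str:
    for backend in BACKEND_PRIORITY:
        if backend_is_supported(backend, kv_cache_dtype):
            return backend
    raise RuntimeError("no supported attention backend")
```

Note that with TURBOQUANT appended last, any earlier supported backend wins even when a TurboQuant KV cache dtype was requested, which is the regression the review comment flags.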
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.