[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache#37252
Conversation
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Documentation preview: https://vllm--37252.org.readthedocs.build/en/37252/
Code Review
This pull request modifies the attention backend selection logic to prioritize Flashinfer sparse MLA for FP8 KV cache on Blackwell GPUs. The change is implemented in vllm/platforms/cuda.py by updating _get_backend_priorities to consider the kv_cache_dtype. The corresponding documentation in docs/design/attention_backends.md has been updated to reflect the new backend priority. The pull request also refactors type hints in vllm/platforms/cuda.py by introducing from __future__ import annotations.
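To illustrate the selection logic described above, here is a minimal sketch of a priority function that considers the KV cache dtype. The function name, backend names, and the `is_blackwell` flag are assumptions for illustration, not the actual `vllm/platforms/cuda.py` code:

```python
# Hypothetical sketch of kv_cache_dtype-aware MLA backend prioritization.
# Names (get_mla_backend_priorities, backend strings, is_blackwell) are
# illustrative assumptions, not vLLM's real identifiers.

def get_mla_backend_priorities(kv_cache_dtype: str, is_blackwell: bool) -> list[str]:
    """Return candidate MLA backends, highest priority first."""
    if is_blackwell and kv_cache_dtype.startswith("fp8"):
        # FP8 KV cache on Blackwell: prefer the FlashInfer sparse MLA kernel.
        return ["FLASHINFER_MLA", "CUTLASS_MLA", "TRITON_MLA"]
    # Otherwise keep the previous default ordering.
    return ["CUTLASS_MLA", "FLASHINFER_MLA", "TRITON_MLA"]
```

The key design point is that the priority list is now a function of the KV cache dtype rather than a static ordering, so FP8 workloads pick up the faster kernel without any user configuration.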
MatthewBonanni left a comment:
LGTM, thanks! Can you update the documentation to discuss when one is preferred over the other? I should have done this when I conditioned it on num_heads and neglected to. You'll need to modify the generator script. It can be as simple as just adding an asterisk to each of those and a footnote somewhere
@MatthewBonanni Done. Thanks!
…llm-project#37252) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Purpose
This PR sets Flashinfer sparse MLA as the default attention backend for FP8 KV cache on Blackwell GPUs for better performance.
Test Plan
Test Result
Kernel microbenchmark results: #35807
E2E results with different TP (with EP enabled)

Command:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.