
Conversation

@MasterJH5574 (Contributor)

This PR consists of the following parts:

  • We reorganized `paged_kv_cache.cc` by moving some of its utilities to `attn_utils.h`.

  • To integrate with the JIT kernel compilation in the latest FlashInfer project, while still supporting attention kernels written in TIR, we introduced `AttnBackendFunc` in `attn_backend.h`, which exposes attention interfaces (e.g., `MHA`, `MLA`) to PagedKVCache. We subclass `AttnBackendFunc` to implement the FlashInfer and TIR backends, respectively (a sketch of this abstraction is included after the list).

  • With `AttnBackendFunc`, we refactored the PagedKVCache constructor. The new constructor is not backward compatible and will break existing compiled model libraries.

  • For both the TIR and FlashInfer attention implementations, we now require an explicit attention softmax scale factor `sm_scale` to be passed in. Previously, the kernels inlined an `sm_scale` of `head_dim ** -0.5`. With recent LLM inference techniques such as MLA weight absorption in DeepSeek models, the head dimension the kernel sees no longer matches the dimension the scale should be derived from, so the inlined `sm_scale` causes confusion and inconvenience. To keep the attention interface standard and clear, we now require `sm_scale` to be passed explicitly (a small example after the list illustrates why).

  • We refactored the existing GPU unit tests of the PagedKVCache, switching the std calculation from NumPy to PyTorch. This significantly reduces the test run time.
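
For reference, here is a minimal sketch of the kind of backend abstraction described in the second bullet. Only the name `AttnBackendFunc` comes from this PR; the method signatures, the placeholder `Tensor` type, and the subclass names `FlashInferAttnBackend` / `TIRAttnBackend` are illustrative assumptions, not the actual declarations in `attn_backend.h`.

```cpp
// Placeholder for the runtime tensor type; the real code works on TVM runtime arrays.
struct Tensor;

// Abstract attention backend that PagedKVCache talks to.
class AttnBackendFunc {
 public:
  virtual ~AttnBackendFunc() = default;
  // Multi-head attention over the paged KV cache; sm_scale is passed explicitly.
  virtual void MHA(const Tensor& q, const Tensor& paged_kv, double sm_scale,
                   Tensor* out) = 0;
  // Multi-head latent attention (MLA), e.g. for DeepSeek-style models.
  virtual void MLA(const Tensor& q, const Tensor& compressed_kv, double sm_scale,
                   Tensor* out) = 0;
};

// Backend that dispatches to FlashInfer's JIT-compiled kernels.
class FlashInferAttnBackend : public AttnBackendFunc {
 public:
  void MHA(const Tensor& q, const Tensor& paged_kv, double sm_scale,
           Tensor* out) override { /* invoke FlashInfer JIT kernel */ }
  void MLA(const Tensor& q, const Tensor& compressed_kv, double sm_scale,
           Tensor* out) override { /* invoke FlashInfer JIT kernel */ }
};

// Backend that dispatches to attention kernels written in TIR.
class TIRAttnBackend : public AttnBackendFunc {
 public:
  void MHA(const Tensor& q, const Tensor& paged_kv, double sm_scale,
           Tensor* out) override { /* invoke compiled TIR kernel */ }
  void MLA(const Tensor& q, const Tensor& compressed_kv, double sm_scale,
           Tensor* out) override { /* invoke compiled TIR kernel */ }
};
```

The point of the indirection is that PagedKVCache depends only on the abstract interface, so choosing between FlashInfer's JIT-compiled kernels and TIR kernels becomes a construction-time decision rather than a fork inside the cache code.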

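To see why the explicit `sm_scale` matters, the sketch below contrasts regular MHA with MLA weight absorption. The DeepSeek-style dimensions (query head dims 128 + 64, KV LoRA rank 512) are used purely for illustration, and the code is not taken from this PR.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Regular MHA: the scale really is head_dim ** -0.5, so a scale inlined from
  // the kernel's own input shape happens to be correct.
  const int head_dim = 128;
  const double mha_sm_scale = 1.0 / std::sqrt(static_cast<double>(head_dim));

  // MLA with weight absorption (DeepSeek-style sizes, for illustration only):
  // the kernel sees queries/keys of width kv_lora_rank + qk_rope_head_dim,
  // but the correct scale still comes from the un-absorbed head dimension
  // qk_nope_head_dim + qk_rope_head_dim.
  const int qk_nope_head_dim = 128;
  const int qk_rope_head_dim = 64;
  const int kv_lora_rank = 512;
  const int kernel_head_dim = kv_lora_rank + qk_rope_head_dim;        // 576: what the kernel sees
  const int original_head_dim = qk_nope_head_dim + qk_rope_head_dim;  // 192: what the scale needs
  const double mla_sm_scale = 1.0 / std::sqrt(static_cast<double>(original_head_dim));

  // An inlined head_dim ** -0.5 would silently use 576 here, which is why the
  // caller must now pass sm_scale explicitly.
  std::printf("MHA sm_scale = %f\n", mha_sm_scale);
  std::printf("MLA sm_scale = %f (kernel head dim = %d)\n", mla_sm_scale, kernel_head_dim);
  return 0;
}
```
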
MasterJH5574 force-pushed the tvm-dev/2025-02-24-attn-jit branch 4 times, most recently from 1053886 to 8bf03d0 on February 26, 2025 at 14:54.
MasterJH5574 force-pushed the tvm-dev/2025-02-24-attn-jit branch from 8bf03d0 to d7fb5e4 on February 26, 2025 at 16:44.
Hzfengsy merged commit 61f6e7f into apache:main on Feb 27, 2025.
15 checks passed
ShiboXing pushed a commit (…pache#17674) to ShiboXing/tvm that referenced this pull request on Aug 10, 2025.
