Refactor MLA kv cache #21835

Open

nvjullin wants to merge 11 commits into sgl-project:main from nvjullin:refactor-mla-kv-cache

Conversation

nvjullin (Contributor) commented Apr 1, 2026

Motivation

See #21011. This PR splits out the flashmla part, which will be submitted later, for easier management. It also fixes several merge issues with hisparse.

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@nvjullin nvjullin changed the title Refactor mla kv cache Refactor MLA kv cache Apr 1, 2026
@Fridge003 Fridge003 self-assigned this Apr 1, 2026
gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request refactors the MLA KV cache management by introducing the MLAKVCacheLayout enum and kv_cache_size attribute, replacing the previous boolean-based and dimension-based tracking. These changes are propagated through the memory pools, model runner, and attention backends to provide a more robust way of handling different quantization layouts (FP4, BF16, FP8). A critical regression was identified in the nsa_backend.py where the refactored flashmla_sparse path incorrectly raises a ValueError for BF16 layouts. This occurs because BF16 defaults to the PAGED transform method, while the new logic strictly requires RAGGED, leading to a runtime crash for those models.
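For orientation, here is a minimal sketch of the shape this summary implies. Only the MLAKVCacheLayout name, the kv_cache_size attribute, and the FP8_NOPE_WITH_BLOCK_SCALE_BF16_ROPE member (which appears verbatim in the diff below) come from this page; the other enum members and the pool class are illustrative assumptions, not the PR's actual code:

from enum import Enum, auto


class MLAKVCacheLayout(Enum):
    # FP8_NOPE_WITH_BLOCK_SCALE_BF16_ROPE appears in the diff below; the
    # remaining members are hypothetical stand-ins for the FP4/BF16 layouts
    # the review summary mentions.
    FP8_NOPE_WITH_BLOCK_SCALE_BF16_ROPE = auto()
    BF16 = auto()
    FP4 = auto()


class MLATokenToKVPoolSketch:
    """Hypothetical pool: a single enum plus an explicit size replace the
    previous boolean-based and dimension-based tracking."""

    def __init__(self, kv_cache_layout: MLAKVCacheLayout, kv_cache_size: int):
        self.kv_cache_layout = kv_cache_layout
        self.kv_cache_size = kv_cache_size  # replaces dimension-based size tracking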

Comment on lines +1417 to +1428
if topk_transform_method != TopkTransformMethod.RAGGED:
    raise ValueError(
        "Internal error: Unexpected topk transform method for NSA backend flashmla_sparse."
    )

if any(forward_batch.extend_prefix_lens_cpu):
    page_table_1_flattened = self.forward_metadata.page_table_1_flattened
    assert page_table_1_flattened is not None
    kv_cache = dequantize_k_cache_paged(kv_cache, page_table_1_flattened)
else:
    kv_cache = _cat([k, k_rope], dim=-1)
    page_table_1 = topk_indices

Severity: high

The refactoring of the flashmla_sparse path in forward_extend introduces a regression for models using the BF16 layout.

Currently, get_topk_transform_method returns TopkTransformMethod.PAGED for the BF16 layout (line 2153). However, the new code in forward_extend explicitly raises a ValueError if the method is not RAGGED (line 1417). This means any BF16 model using the NSA backend with flashmla_sparse (which is the default for BF16 in set_nsa_impl, line 2141) will crash at runtime.

By restoring the conditional check for TopkTransformMethod.RAGGED, the PAGED method (used by BF16) can correctly proceed using the physical indices in page_table_1 (computed at line 1388) and the global kv_cache buffer, while the RAGGED-specific logic (including dequantization) remains restricted to the appropriate layouts.

Suggested change

Remove:

    if topk_transform_method != TopkTransformMethod.RAGGED:
        raise ValueError(
            "Internal error: Unexpected topk transform method for NSA backend flashmla_sparse."
        )
    if any(forward_batch.extend_prefix_lens_cpu):
        page_table_1_flattened = self.forward_metadata.page_table_1_flattened
        assert page_table_1_flattened is not None
        kv_cache = dequantize_k_cache_paged(kv_cache, page_table_1_flattened)
    else:
        kv_cache = _cat([k, k_rope], dim=-1)
        page_table_1 = topk_indices

Replace with:

    if topk_transform_method == TopkTransformMethod.RAGGED:
        if any(forward_batch.extend_prefix_lens_cpu):
            page_table_1_flattened = self.forward_metadata.page_table_1_flattened
            assert page_table_1_flattened is not None
            kv_cache = dequantize_k_cache_paged(kv_cache, page_table_1_flattened)
        else:
            kv_cache = _cat([k, k_rope], dim=-1)
            page_table_1 = topk_indices
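To make the reported failure concrete, here is a hedged, self-contained repro sketch of the crash path the reviewer describes. The class and the string constants are simplifications of nsa_backend.py (not reproduced on this page); only the BF16-to-PAGED default and the unconditional ValueError come from the comments above:

# Simplified, hypothetical model of the dispatch described in the review.
class NSABackendSketch:
    def __init__(self, kv_cache_layout):
        self.kv_cache_layout = kv_cache_layout

    def get_topk_transform_method(self):
        # Per the review: the BF16 layout defaults to PAGED (nsa_backend.py:2153).
        if self.kv_cache_layout == "BF16":
            return "PAGED"
        return "RAGGED"

    def forward_extend(self):
        # The refactored flashmla_sparse path rejects everything but RAGGED,
        # so the BF16 default crashes here instead of taking the PAGED path.
        if self.get_topk_transform_method() != "RAGGED":
            raise ValueError(
                "Internal error: Unexpected topk transform method for NSA "
                "backend flashmla_sparse."
            )


try:
    NSABackendSketch("BF16").forward_extend()
except ValueError as e:
    print("repro:", e)  # BF16 + flashmla_sparse crashes at runtime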

(quoting get_topk_transform_method)

if forward_mode is None or forward_mode.is_decode_or_idle():
    return TopkTransformMethod.PAGED
elif (
    self.kv_cache_layout == MLAKVCacheLayout.FP8_NOPE_WITH_BLOCK_SCALE_BF16_ROPE
nvjullin (Contributor, Author) replied

This is a faithful refactor of the original code, but it got flagged as a bug by Gemini. I don't believe flashmla_sparse accepts PAGED topk, though, so this seems like a bug in the original code?
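A hedged reconstruction of where the two readings diverge, assembled from the quoted snippet and the line numbers cited above. This is not the PR's verbatim code: the FP8 branch's return value is an assumption (the quote truncates mid-condition), and the method is flattened into a standalone function with stub enums so the sketch is self-contained:

from enum import Enum, auto


class TopkTransformMethod(Enum):  # stub for the sketch
    PAGED = auto()
    RAGGED = auto()


class MLAKVCacheLayout(Enum):  # stub; BF16 member is hypothetical
    FP8_NOPE_WITH_BLOCK_SCALE_BF16_ROPE = auto()
    BF16 = auto()


def get_topk_transform_method(kv_cache_layout, forward_mode=None):
    # Reconstructed from the quoted snippet above; not verbatim PR code.
    if forward_mode is None or forward_mode.is_decode_or_idle():
        return TopkTransformMethod.PAGED  # decode/idle path, quoted above
    elif kv_cache_layout == MLAKVCacheLayout.FP8_NOPE_WITH_BLOCK_SCALE_BF16_ROPE:
        return TopkTransformMethod.RAGGED  # assumed: the quote truncates here
    # Gemini's reading: BF16 extend falls through to PAGED (line 2153), so the
    # unconditional ValueError in forward_extend is a regression.
    # The author's reading: flashmla_sparse never accepted a PAGED topk on the
    # extend path, so the old fallthrough was itself a latent bug.
    return TopkTransformMethod.PAGED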

