[HiSparse] Support FP8 KV cache by routing to flashmla_kv backend #23013

Merged
whybeyoung merged 3 commits into sgl-project:main from whybeyoung:hisparse_fp8
May 6, 2026

Conversation

whybeyoung (Collaborator) commented Apr 17, 2026

flashmla_sparse does not accept FP8 input, so HiSparse was previously pinned to BF16 KV. The flashmla_kv kernel already supports native FP8 + sparse attention (is_fp8_kvcache=True + indices=...), and HiSparse's hot-buffer indices are drop-in compatible with its indices contract.

Related resources:
#13841
#13832
sgl-project/FlashMLA#1

Pair HiSparse with the correct backend by KV dtype (a routing sketch follows the list):

  • bfloat16 -> flashmla_sparse (unchanged)
  • fp8_e4m3 -> flashmla_kv (new)
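A minimal sketch, in Python, of the dtype-based routing above. The helper name select_hisparse_backend and the error handling are hypothetical; only the dtype strings and backend names come from this PR:

  # Hypothetical helper; only the dtype -> backend mapping is from this PR.
  def select_hisparse_backend(kv_cache_dtype: str) -> str:
      """Pick the attention backend HiSparse pairs with for a given KV dtype."""
      if kv_cache_dtype == "fp8_e4m3":
          # flashmla_kv runs natively on FP8 KV and accepts sparse indices,
          # so HiSparse's hot-buffer indices plug in unchanged.
          return "flashmla_kv"
      if kv_cache_dtype == "bfloat16":
          # flashmla_sparse rejects FP8 input, so it remains the BF16 path.
          return "flashmla_sparse"
      raise ValueError(f"HiSparse: unsupported kv-cache-dtype {kv_cache_dtype!r}")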

Decode CLI additions:

  --page-size 64 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-hisparse \
  --disable-radix-cache \
  --hisparse-config '{"top_k": 2048, "device_buffer_size": 4096}'
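A note on the flags: --page-size 64 presumably matches FlashMLA's fixed 64-token KV block granularity (an inference, not something stated in this PR).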

Accuracy:
[accuracy comparison screenshot]

CC @xiezhq-hermann @hzh0425

No attention-path code changes are needed; the existing
_forward_flashmla_kv path handles HiSparse indices as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
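For illustration, a hedged sketch of that existing flashmla_kv call shape. Only the is_fp8_kvcache and indices kwargs are taken from the description above; the wrapper name hisparse_fp8_decode, the argument plumbing, and the comments are assumptions, not sglang's actual code:

  # Sketch only: is_fp8_kvcache/indices come from this PR's description;
  # everything else is illustrative, not the real _forward_flashmla_kv body.
  from flash_mla import flash_mla_with_kvcache  # sgl-project/FlashMLA fork

  def hisparse_fp8_decode(q, kv_cache_fp8, block_table, cache_seqlens,
                          head_dim_v, tile_scheduler_metadata, num_splits,
                          hot_buffer_indices):
      # HiSparse's hot-buffer indices satisfy the kernel's `indices`
      # contract directly, which is why no attention-path changes are needed.
      out, softmax_lse = flash_mla_with_kvcache(
          q, kv_cache_fp8, block_table, cache_seqlens, head_dim_v,
          tile_scheduler_metadata, num_splits,
          causal=True,
          is_fp8_kvcache=True,         # KV cache stored as fp8_e4m3
          indices=hot_buffer_indices,  # sparse attention over hot tokens
      )
      return out, softmax_lse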
@gemini-code-assist (Contributor)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@whybeyoung (Collaborator, Author)

/tag-and-rerun-ci

hzh0425 (Collaborator) commented Apr 30, 2026

/rerun-test test/registered/8-gpu-models/test_dsa_models_hisparse.py

@github-actions (Contributor)

8-gpu-h200 (1 test): View workflow run

cd test/ && python3 registered/8-gpu-models/test_dsa_models_hisparse.py

whybeyoung merged commit 3da8790 into sgl-project:main on May 6, 2026
393 of 449 checks passed
whybeyoung deleted the hisparse_fp8 branch on May 6, 2026 at 12:22
