[HiSparse] Support FP8 KV cache by routing to flashmla_kv backend by whybeyoung · Pull Request #23013 · sgl-project/sglang

whybeyoung · 2026-04-17T01:47:56Z

flashmla_sparse does not accept FP8 input, so HiSparse was previously pinned to BF16 KV. The flashmla_kv kernel already supports native FP8 + sparse attention (is_fp8_kvcache=True + indices=...), and HiSparse's hot-buffer indices are drop-in compatible with its indices contract.

Related resources:
#13841
#13832
sgl-project/FlashMLA#1

Pair HiSparse with the correct backend by KV dtype:

bfloat16 -> flashmla_sparse (unchanged)
fp8_e4m3 -> flashmla_kv (new)
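The dtype-to-backend pairing above can be sketched as a small lookup. This is only an illustration: `HISPARSE_BACKEND_BY_KV_DTYPE` and `select_hisparse_backend` are hypothetical names, not the identifiers used in the SGLang source, and the error handling for other dtypes is an assumption.

```python
# Hypothetical sketch of the pairing described above; names are illustrative
# and do not come from the SGLang code base.
HISPARSE_BACKEND_BY_KV_DTYPE = {
    "bfloat16": "flashmla_sparse",  # unchanged: flashmla_sparse does not accept FP8
    "fp8_e4m3": "flashmla_kv",      # new: flashmla_kv supports FP8 + sparse indices
}


def select_hisparse_backend(kv_cache_dtype: str) -> str:
    """Return the attention backend paired with the given KV cache dtype."""
    try:
        return HISPARSE_BACKEND_BY_KV_DTYPE[kv_cache_dtype]
    except KeyError:
        # Assumption: any other dtype is rejected rather than silently routed.
        supported = ", ".join(sorted(HISPARSE_BACKEND_BY_KV_DTYPE))
        raise ValueError(
            f"HiSparse has no backend for KV cache dtype {kv_cache_dtype!r}; "
            f"expected one of: {supported}"
        ) from None
```

Keeping the pairing in one table makes it explicit that only the backend choice depends on the KV dtype, which matches the PR's note that the attention path itself is unchanged.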

Add the following flags to the decode CLI:

  --page-size 64 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-hisparse \
  --disable-radix-cache \
  --hisparse-config '{"top_k": 2048, "device_buffer_size": 4096}'
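When these flags are assembled from a script rather than typed by hand, the JSON value of `--hisparse-config` is the part most likely to be mangled by shell quoting. A minimal sketch, assuming the flags are kept as a Python list (the variable names are hypothetical, and the launcher command itself is deliberately omitted):

```python
import json
import shlex

# Values copied from the flags above.
hisparse_config = {"top_k": 2048, "device_buffer_size": 4096}

decode_flags = [
    "--page-size", "64",
    "--kv-cache-dtype", "fp8_e4m3",
    "--enable-hisparse",
    "--disable-radix-cache",
    # Serialize the config so it is passed as a single JSON argument.
    "--hisparse-config", json.dumps(hisparse_config),
]

# shlex.join quotes each argument for a POSIX shell, so the JSON stays one token.
flag_string = shlex.join(decode_flags)
print(flag_string)
```

Passing `decode_flags` directly to `subprocess.run` (appended to the launcher command) avoids shell quoting altogether; `flag_string` is only useful for logging or copy-pasting.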

Accuracy:

CC @xiezhq-hermann @hzh0425

flashmla_sparse does not accept FP8 input, so HiSparse was previously pinned to BF16 KV. The flashmla_kv kernel already supports native FP8 + sparse attention (is_fp8_kvcache=True + indices=...), and HiSparse's hot-buffer indices are drop-in compatible with its indices contract.

Pair HiSparse with the correct backend by KV dtype:

- bfloat16 -> flashmla_sparse (unchanged)
- fp8_e4m3 -> flashmla_kv (new)

No attention-path code changes are needed; the existing _forward_flashmla_kv path handles HiSparse indices as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-04-17T01:48:00Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

whybeyoung · 2026-04-30T08:03:55Z

/tag-and-rerun-ci

hzh0425 · 2026-04-30T09:54:51Z

/rerun-test test/registered/8-gpu-models/test_dsa_models_hisparse.py

github-actions · 2026-04-30T09:55:21Z

✅ 8-gpu-h200 (1 test): View workflow run

cd test/ && python3 registered/8-gpu-models/test_dsa_models_hisparse.py

xiezhq-hermann self-assigned this Apr 17, 2026

xiezhq-hermann added the run-ci label Apr 17, 2026

Merge branch 'main' into hisparse_fp8

d7526cc

xiezhq-hermann approved these changes Apr 17, 2026

Merge branch 'main' into hisparse_fp8

59ecf96

Kangyan-Zhou mentioned this pull request Apr 28, 2026

ci: clean up stale-CUDA mooncake variant in install_extra_deps #23960

Merged

2 tasks

whybeyoung enabled auto-merge (squash) April 30, 2026 08:01

ShangmingCai approved these changes Apr 30, 2026

hzh0425 approved these changes Apr 30, 2026

whybeyoung merged commit 3da8790 into sgl-project:main May 6, 2026
393 of 449 checks passed

whybeyoung deleted the hisparse_fp8 branch May 6, 2026 12:22

whybeyoung merged 3 commits into sgl-project:main from whybeyoung:hisparse_fp8

5 participants
