Fix FlashMLA Shared-Memory Overflow in SGLang's Pure-TP Mode with Low-SMEM Fallback Scheduler #2
Merged
Fridge003 merged 2 commits into sgl-project:sgl on Nov 20, 2025
Motivation
When running SGLang + DeepSeek V3.2 in pure Tensor Parallel (TP) mode, the NSA backend can produce large expanded sequence lists (`seqlens_k`), especially when `topk=2048` and batch sizes are large. Under these workloads, the existing `get_mla_metadata_kernel` allocates dynamic shared memory proportional to `batch_size` (`smem_size = sizeof(int) * (batch_size * 5 + 1)`). In pure-TP setups, `batch_size` can easily exceed 10k–20k expanded rows, causing `smem_size` to exceed the GPU's shared-memory limit (e.g., on Hopper/Blackwell). As a result, `cudaFuncSetAttribute` fails due to insufficient shared memory, which prevents SGLang from using FlashMLA efficiently in DeepSeek V3.2 pure-TP mode, a major limitation for users running large-scale inference.
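For a sense of scale (illustrative numbers, not taken from this PR): at `batch_size = 20,000` expanded rows, that allocation is `sizeof(int) * (20,000 * 5 + 1) ≈ 391 KB` of dynamic shared memory per block, well above the roughly 227 KB per-block opt-in limit of Hopper-class GPUs, so the opt-in request is rejected before the kernel can even launch.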
Modifications
1. Added a low-shared-memory fallback kernel
Introduced `get_mla_metadata_kernel_low_smem`.
Key characteristics:
- Uses no dynamic shared memory.
- Only `threadIdx.x == 0` performs the scheduling computation.
- Recomputes `num_blocks` and `first/last_block_idx`, and writes `tile_scheduler_metadata_ptr` and `num_splits_ptr` directly.
- Scheduling logic is semantically identical to the original high-sMem kernel.
- Removes all `O(batch_size)` shared-memory allocations.

This ensures correctness while avoiding sMem overflow entirely; a simplified sketch of the approach follows below.
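As a rough illustration of that structure (this is not the actual FlashMLA code: the metadata layout, the split policy, and the names `block_size_n`, `num_sm_parts`, and the `*_sketch` suffixes are simplified assumptions), a zero-dynamic-smem scheduler in which a single thread walks the batch could look like this:

```cuda
#include <cuda_runtime.h>

// Sketch only: one thread recomputes the schedule and writes metadata straight
// to global memory, so no O(batch_size) dynamic shared memory is required.
__global__ void get_mla_metadata_kernel_low_smem_sketch(
    const int* __restrict__ seqlens_k,              // expanded per-row KV lengths
    int batch_size,
    int block_size_n,                               // KV tile size (assumed)
    int num_sm_parts,                               // scheduler partitions (assumed)
    int* __restrict__ tile_scheduler_metadata_ptr,  // 2 ints per partition in this sketch
    int* __restrict__ num_splits_ptr) {             // batch_size + 1 entries
  // Only threadIdx.x == 0 does any work; all other threads exit immediately.
  if (threadIdx.x != 0 || blockIdx.x != 0) return;

  // Total number of KV tiles across the expanded batch.
  long long total_blocks = 0;
  for (int i = 0; i < batch_size; ++i) {
    total_blocks += (seqlens_k[i] + block_size_n - 1) / block_size_n;
  }
  long long blocks_per_part = (total_blocks + num_sm_parts - 1) / num_sm_parts;
  if (blocks_per_part == 0) blocks_per_part = 1;

  // Recompute num_blocks per row on the fly (instead of caching per-row values
  // in shared memory) and assign contiguous tile ranges to partitions.
  long long acc = 0;
  int part = 0;
  num_splits_ptr[0] = 0;
  for (int i = 0; i < batch_size; ++i) {
    int num_blocks = (seqlens_k[i] + block_size_n - 1) / block_size_n;
    acc += num_blocks;
    num_splits_ptr[i + 1] =
        static_cast<int>((acc + blocks_per_part - 1) / blocks_per_part);
    while (part < num_sm_parts &&
           acc >= (long long)(part + 1) * blocks_per_part) {
      tile_scheduler_metadata_ptr[part * 2 + 0] = i;                     // last row handled by this partition
      tile_scheduler_metadata_ptr[part * 2 + 1] = static_cast<int>(acc); // cumulative tile count at the boundary
      ++part;
    }
  }
}
```

The real kernel reproduces the original scheduler's exact split semantics; the sketch only demonstrates the single-thread, zero-smem structure.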
2. Kernel selection controlled at runtime
Updated `run_get_mla_metadata_kernel` to:

- Compute the required dynamic shared memory: `smem_size = sizeof(int) * (batch_size * 5 + 1);`
- Query the device limit via `cudaDevAttrMaxSharedMemoryPerBlockOptin`.
- Decide which kernel to launch:
  - If `smem_size <= max_smem`, use the original high-performance shared-memory kernel.
  - Otherwise, fall back to `get_mla_metadata_kernel_low_smem` (zero dynamic sMem).
Both kernels preserve existing APIs and metadata formats.
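A hedged host-side sketch of this dispatch is shown below. It uses stub kernel symbols and placeholder launch configurations so it compiles standalone; the real `run_get_mla_metadata_kernel` passes FlashMLA's metadata pointers and launch parameters, which are omitted here. Only the selection logic mirrors the description above.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels so the sketch is self-contained; the real kernels take
// the seqlens_k / metadata pointers described above.
__global__ void original_mla_metadata_kernel_stub() {}
__global__ void low_smem_mla_metadata_kernel_stub() {}

void run_get_mla_metadata_kernel_sketch(int batch_size, cudaStream_t stream) {
  // Dynamic shared memory the original scheduler needs (5 ints per row + 1),
  // exactly the expression quoted above.
  size_t smem_size = sizeof(int) * (static_cast<size_t>(batch_size) * 5 + 1);

  // Per-block opt-in shared-memory ceiling for the current device.
  int device = 0;
  int max_smem = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&max_smem, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  if (smem_size <= static_cast<size_t>(max_smem)) {
    // Fits: opt in to the larger shared-memory carve-out and launch the
    // original high-performance kernel.
    cudaFuncSetAttribute(original_mla_metadata_kernel_stub,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_size));
    original_mla_metadata_kernel_stub<<<1, 256, smem_size, stream>>>();
  } else {
    // Too large for this GPU: fall back to the zero-dynamic-smem scheduler.
    low_smem_mla_metadata_kernel_stub<<<1, 32, 0, stream>>>();
  }
}
```

Keeping the decision on the host means the fast path is untouched whenever the metadata fits in shared memory, and the fallback only triggers for the oversized pure-TP batches described in the motivation.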
Impact