amd/deepseek_v4 integration 4/N - TilelangAttn 0428 #24033
Merged
HaiShaw merged 1 commit on Apr 29, 2026
Conversation
Contributor
Code Review
This pull request replaces the BF16 sparse attention kernel with a more efficient 2-stage FP8 attention kernel system consisting of partial and combine kernels. The new implementation supports in-kernel dequantization and dual KV caches. Additionally, it includes a workaround for a tilelang mutation bug and utility functions for cache reinterpretation. Review feedback focuses on improving code maintainability by replacing magic numbers with constants, using more idiomatic PyTorch APIs for integer limits, and addressing the use of deprecated storage methods.
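For intuition on the combine stage, here is a minimal PyTorch sketch of the standard log-sum-exp merge used for split-KV attention. This is an illustration of the general technique, not this PR's kernel code; the shapes and names are assumptions.

```python
import torch

def combine_partial_attention(partial_out: torch.Tensor, partial_lse: torch.Tensor):
    """
    partial_out: (num_splits, batch, heads, d_v) per-split attention outputs,
                 each already softmax-normalized over its own KV slice.
    partial_lse: (num_splits, batch, heads) per-split log-sum-exp of the scores.
    """
    # Global log-sum-exp across splits, computed stably via the running max.
    lse_max = partial_lse.max(dim=0, keepdim=True).values
    lse = lse_max.squeeze(0) + torch.log(torch.exp(partial_lse - lse_max).sum(dim=0))
    # Each split's contribution is reweighted by exp(lse_i - lse_global).
    weights = torch.exp(partial_lse - lse.unsqueeze(0))      # (num_splits, batch, heads)
    out = (weights.unsqueeze(-1) * partial_out).sum(dim=0)   # (batch, heads, d_v)
    return out, lse
```

The partial kernel produces `partial_out` and `partial_lse` per KV split; the combine kernel performs this reduction in a single launch.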
Comment on lines +1404 to +1426
```python
def _build_fp8_combined_view(k_cache: torch.Tensor) -> Tuple[torch.Tensor, int, int]:
    """
    Reinterpret a MODEL1_FP8Sparse KV cache as a contiguous uint32 view.
    Input:  k_cache (num_blocks, block_size, 1, d_qk) fp8/uint8
            -- per-block storage also holds scales + padding past d_qk.
    Output: (num_blocks, block_pad_u32) uint32 covering the full block
            stride. Same storage as the input, no copy.
    """
    k_u8 = k_cache.view(torch.uint8) if k_cache.dtype != torch.uint8 else k_cache
    num_blocks = k_u8.shape[0]
    block_size = k_u8.shape[1]
    block_pad_u32 = k_u8.stride(0) // 4
    storage = k_u8.untyped_storage()
    flat_u32 = torch.empty(0, dtype=torch.uint32, device=k_u8.device).set_(
        storage, 0, (storage.nbytes() // 4,), (1,)
    )
    k_combined = torch.as_strided(
        flat_u32,
        size=(num_blocks, block_pad_u32),
        stride=(block_pad_u32, 1),
        storage_offset=k_u8.storage_offset() // 4,
    )
    return k_combined, num_blocks, block_size
```
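Related to the review note on deprecated storage methods: one hypothetical alternative avoids `Tensor.set_`/`untyped_storage()` by re-striding a uint8 view over the padded per-block byte range and reinterpreting it as uint32. This sketch assumes a recent PyTorch with `torch.uint32`, a per-block byte stride that is a multiple of 4, and a cache starting at storage offset 0; it is not the PR's code.

```python
import torch

def _build_fp8_combined_view_alt(k_cache: torch.Tensor):
    # Hypothetical rewrite of _build_fp8_combined_view, for illustration only.
    k_u8 = k_cache.view(torch.uint8) if k_cache.dtype != torch.uint8 else k_cache
    num_blocks, block_size = k_u8.shape[0], k_u8.shape[1]
    block_pad_bytes = k_u8.stride(0)  # full per-block byte stride incl. scales/padding
    # Cover each block's full padded byte range with a dense 2-D view.
    flat_u8 = torch.as_strided(
        k_u8, size=(num_blocks, block_pad_bytes), stride=(block_pad_bytes, 1)
    )
    # The re-strided view is dense, so dtype reinterpretation is legal
    # whenever block_pad_bytes is a multiple of 4.
    return flat_u8.view(torch.uint32), num_blocks, block_size
```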
Contributor
```diff
-    return max(tilelang.math.next_power_of_2(head_kv), 16)
+    return max(_next_power_of_2(head_kv), 16)

 _TOPK_LEN_SENTINEL_CACHE: dict = {}
 _INT32_MAX = 2**31 - 1
```
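For reference, a minimal sketch of the local helper referenced in this hunk. The name `_next_power_of_2` is taken from the diff; its semantics are assumed to match `tilelang.math.next_power_of_2`, i.e. the smallest power of two greater than or equal to `n`.

```python
def _next_power_of_2(n: int) -> int:
    # Assumed semantics: smallest power of two >= n, returning 1 for n <= 0.
    return 1 if n <= 0 else 1 << (n - 1).bit_length()
```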
Contributor
```python
cur = _TOPK_LEN_SENTINEL_CACHE.get(device)
if cur is None or cur.numel() < batch:
    cur = torch.full(
        (max(batch, 256),), _INT32_MAX, dtype=torch.int32, device=device
    )
```
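Applying the review feedback above, one possible shape for this sentinel cache follows. The `torch.iinfo` call and the named minimum-length constant are suggestions, and the helper name `_get_topk_len_sentinel` is assumed for illustration; none of this is the PR's code as written.

```python
import torch

_INT32_MAX = torch.iinfo(torch.int32).max  # idiomatic replacement for 2**31 - 1
_MIN_SENTINEL_LEN = 256                    # named constant instead of a magic number
_TOPK_LEN_SENTINEL_CACHE: dict = {}

def _get_topk_len_sentinel(batch: int, device: torch.device) -> torch.Tensor:
    # Grow-only per-device cache of int32 sentinel values.
    cur = _TOPK_LEN_SENTINEL_CACHE.get(device)
    if cur is None or cur.numel() < batch:
        cur = torch.full(
            (max(batch, _MIN_SENTINEL_LEN),), _INT32_MAX,
            dtype=torch.int32, device=device,
        )
        _TOPK_LEN_SENTINEL_CACHE[device] = cur
    return cur[:batch]
```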
Motivation
Update the amd/deepseek_v4 integration branch. The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate in parallel:
#23600
#23608
The original `flash_mla_with_kvcache_torch` launches 101 kernels per call; switching to the tilelang kernel reduces this to 2. The new path is opt-in, controlled by `export SGLANG_HACK_FLASHMLA_BACKEND=tilelang`, as sketched below.
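A minimal sketch of this kind of environment-variable dispatch. The two backend functions are placeholders standing in for the real entry points, not the actual implementation:

```python
import os

def _mla_torch_ref(*args, **kwargs):
    raise NotImplementedError  # placeholder for flash_mla_with_kvcache_torch

def _mla_tilelang(*args, **kwargs):
    raise NotImplementedError  # placeholder for the 2-kernel tilelang path

def pick_mla_backend():
    # Opt in with: export SGLANG_HACK_FLASHMLA_BACKEND=tilelang
    if os.environ.get("SGLANG_HACK_FLASHMLA_BACKEND") == "tilelang":
        return _mla_tilelang
    return _mla_torch_ref
```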
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci