
amd/deepseek_v4 integration 5/N - indexer TilelangAttn 0428 #24050

Merged
HaiShaw merged 1 commit into sgl-project:amd/deepseek_v4 from HaiShaw:amd/deepseek_v4_0428_indexer_tilelang_attn
Apr 29, 2026

Conversation

@1am9trash (Collaborator) commented Apr 29, 2026

Motivation

Update amd/deepseek_v4 integration branch

The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate them in parallel:
#23600
#23608

The original fp8_paged_mqa_logits_torch launches 12 kernels per call; switching to this TileLang kernel reduces it to 1.
Controlled by export SGLANG_OPT_USE_TILELANG_INDEXER=true.
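The environment-variable gate can be sketched as follows. The variable name comes from the PR; the helper function itself is hypothetical and not the PR's actual dispatch code:

```python
import os

def use_tilelang_indexer() -> bool:
    # Opt-in flag, enabled via: export SGLANG_OPT_USE_TILELANG_INDEXER=true
    # When true, the wrapper would dispatch to the fused TileLang kernel
    # (1 launch) instead of the original torch path (12 launches).
    return os.environ.get("SGLANG_OPT_USE_TILELANG_INDEXER", "false").lower() in ("1", "true")
```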

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a TileLang-based FP8 paged MQA logits kernel and its corresponding wrapper function. Key changes include the implementation of the fp8_paged_mqa_logits_kernel with shared-memory optimization and pipelining, as well as updates to the FP8 data type definitions. Feedback highlights:
  • a potential out-of-bounds write when max_seq_len is not aligned with the block size,
  • a logic contradiction between the clean_logits parameter's default value and its assertion,
  • a redundant variable assignment, and
  • a hardcoded compute unit count which may affect performance across different GPU architectures.

```python
T.reduce_sum(logits, logits_sum, dim=1)
for j2 in T.Parallel(B):
    logits_sum[j2] *= k_s_frag[j2]
T.copy(logits_sum, o[bx, i * B])
```
high

Potential out-of-bounds write: T.copy(logits_sum, o[bx, i * B]) writes B (block_size) elements to the output buffer. If max_seq_len (symbolic S) is not a multiple of block_size, the last block write will exceed the tensor bounds. Consider adding an assertion in the wrapper function to ensure max_seq_len is aligned to block_size or implementing a masked copy in the kernel.
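The suggested wrapper-side guard can be illustrated with a small helper that either verifies alignment or rounds the allocation up to the next block boundary. The names mirror the wrapper's arguments, but this is a sketch, not the PR's code:

```python
def padded_seq_len(max_seq_len: int, block_size: int) -> int:
    """Round max_seq_len up to a multiple of block_size so the kernel's
    per-block copy of `block_size` logits never writes out of bounds.
    The tail of the padded buffer is simply left unused."""
    if max_seq_len % block_size == 0:
        return max_seq_len
    return ((max_seq_len + block_size - 1) // block_size) * block_size
```

The alternative mentioned in the comment, a masked copy inside the kernel, avoids the extra allocation but complicates the TileLang code path.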

```python
    page_table: torch.Tensor,
    deep_gemm_metadata: Any,
    max_seq_len: int,
    clean_logits: bool = True,
```
high

The default value True for clean_logits contradicts the assertion assert clean_logits == False on line 1515. This will cause the function to crash when called with default arguments. If clean_logits=True is not currently supported by the kernel, the default value should be changed to False.
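The contradiction reduces to a two-line repro (function names here are hypothetical, not from the PR):

```python
def buggy(clean_logits: bool = True):
    # Crashes immediately when called with default arguments.
    assert clean_logits == False

def fixed(clean_logits: bool = False):
    # Default now matches the assertion, as the review suggests.
    assert clean_logits == False
```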


```python
logits = page_table.new_empty((batch_size, max_seq_len), dtype=torch.float32)

NUM_CU = 256
```
medium

NUM_CU is hardcoded to 256. It is recommended to use the actual compute unit count of the device (e.g., via get_device_core_count()) to ensure the split_kv heuristic is correctly optimized for different GPU architectures (e.g., MI250 vs MI300X).
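A device-aware version could look like the sketch below. It uses the standard torch API rather than sglang's own helper (the `get_device_core_count()` named in the comment), and falls back to the hardcoded value when no GPU is available:

```python
def get_num_cus(default: int = 256) -> int:
    """Query the device's compute-unit / SM count instead of hardcoding it,
    so the split_kv heuristic adapts to the actual GPU (e.g. MI250 vs MI300X).
    Falls back to `default` when torch or a CUDA/ROCm device is unavailable."""
    try:
        import torch
        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).multi_processor_count
    except ImportError:
        pass
    return default
```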

```python
logits = page_table.new_empty((batch_size, max_seq_len), dtype=torch.float32)

NUM_CU = 256
split_kv = split_kv = max(1, min(max_seq_len // block_size, NUM_CU // batch_size))
```
medium

Redundant assignment split_kv = split_kv = ....
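With the duplicate assignment dropped, the heuristic is a one-liner. Pulling it into a named function (hypothetical, not in the PR) also makes the capping behavior easy to check:

```python
def compute_split_kv(max_seq_len: int, block_size: int, num_cu: int, batch_size: int) -> int:
    """KV-split heuristic from the flagged line, with the redundant
    `split_kv = split_kv = ...` collapsed to a single assignment.
    Caps the split at the number of KV blocks and at the CUs available
    per batch element, never dropping below 1."""
    return max(1, min(max_seq_len // block_size, num_cu // batch_size))
```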

@HaiShaw HaiShaw merged commit 18afbf1 into sgl-project:amd/deepseek_v4 Apr 29, 2026
1 check passed
