
amd/deepseek_v4 integration 5/N - indexer TilelangAttn 0428 #24050

Merged
HaiShaw merged 1 commit into sgl-project:amd/deepseek_v4 from HaiShaw:amd/deepseek_v4_0428_indexer_tilelang_attn
Apr 29, 2026

Conversation

@1am9trash (Collaborator) commented Apr 29, 2026

Motivation

Update amd/deepseek_v4 integration branch

The following PRs have a large set of conflicts, so we use this PR and the upstream amd/deepseek_v4 branch to integrate them in parallel:
#23600
#23608

The original fp8_paged_mqa_logits_torch launches 12 kernels per call; switching to this TileLang kernel reduces it to 1.
Controlled by export SGLANG_OPT_USE_TILELANG_INDEXER=true.
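The environment-variable gate can be sketched as follows. The variable name comes from the PR; the helper function itself is hypothetical and not the PR's actual dispatch code:

```python
import os

def use_tilelang_indexer() -> bool:
    # Opt-in flag, enabled via: export SGLANG_OPT_USE_TILELANG_INDEXER=true
    # When true, the wrapper would dispatch to the fused TileLang kernel
    # (1 launch) instead of the original torch path (12 launches).
    return os.environ.get("SGLANG_OPT_USE_TILELANG_INDEXER", "false").lower() in ("1", "true")
```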

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a TileLang-based FP8 paged MQA logits kernel and its corresponding wrapper function. Key changes include the implementation of the fp8_paged_mqa_logits_kernel with shared-memory optimization and pipelining, as well as updates to the FP8 data type definitions. Feedback highlights:
  • a potential out-of-bounds write when max_seq_len is not aligned with the block size,
  • a logic contradiction between the clean_logits parameter's default value and its assertion,
  • a redundant variable assignment, and
  • a hardcoded compute unit count which may affect performance across different GPU architectures.

```python
T.reduce_sum(logits, logits_sum, dim=1)
for j2 in T.Parallel(B):
    logits_sum[j2] *= k_s_frag[j2]
T.copy(logits_sum, o[bx, i * B])
```
high

Potential out-of-bounds write: T.copy(logits_sum, o[bx, i * B]) writes B (block_size) elements to the output buffer. If max_seq_len (symbolic S) is not a multiple of block_size, the last block write will exceed the tensor bounds. Consider adding an assertion in the wrapper function to ensure max_seq_len is aligned to block_size or implementing a masked copy in the kernel.
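The suggested wrapper-side guard can be illustrated with a small helper that either verifies alignment or rounds the allocation up to the next block boundary. The names mirror the wrapper's arguments, but this is a sketch, not the PR's code:

```python
def padded_seq_len(max_seq_len: int, block_size: int) -> int:
    """Round max_seq_len up to a multiple of block_size so the kernel's
    per-block copy of `block_size` logits never writes out of bounds.
    The tail of the padded buffer is simply left unused."""
    if max_seq_len % block_size == 0:
        return max_seq_len
    return ((max_seq_len + block_size - 1) // block_size) * block_size
```

The alternative mentioned in the comment, a masked copy inside the kernel, avoids the extra allocation but complicates the TileLang code path.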

```python
    page_table: torch.Tensor,
    deep_gemm_metadata: Any,
    max_seq_len: int,
    clean_logits: bool = True,
```
high

The default value True for clean_logits contradicts the assertion assert clean_logits == False on line 1515. This will cause the function to crash when called with default arguments. If clean_logits=True is not currently supported by the kernel, the default value should be changed to False.
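The contradiction reduces to a two-line repro (function names here are hypothetical, not from the PR):

```python
def buggy(clean_logits: bool = True):
    # Crashes immediately when called with default arguments.
    assert clean_logits == False

def fixed(clean_logits: bool = False):
    # Default now matches the assertion, as the review suggests.
    assert clean_logits == False
```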


```python
logits = page_table.new_empty((batch_size, max_seq_len), dtype=torch.float32)

NUM_CU = 256
```
medium

NUM_CU is hardcoded to 256. It is recommended to use the actual compute unit count of the device (e.g., via get_device_core_count()) to ensure the split_kv heuristic is correctly optimized for different GPU architectures (e.g., MI250 vs MI300X).
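A device-aware version could look like the sketch below. It uses the standard torch API rather than sglang's own helper (the `get_device_core_count()` named in the comment), and falls back to the hardcoded value when no GPU is available:

```python
def get_num_cus(default: int = 256) -> int:
    """Query the device's compute-unit / SM count instead of hardcoding it,
    so the split_kv heuristic adapts to the actual GPU (e.g. MI250 vs MI300X).
    Falls back to `default` when torch or a CUDA/ROCm device is unavailable."""
    try:
        import torch
        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).multi_processor_count
    except ImportError:
        pass
    return default
```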

```python
logits = page_table.new_empty((batch_size, max_seq_len), dtype=torch.float32)

NUM_CU = 256
split_kv = split_kv = max(1, min(max_seq_len // block_size, NUM_CU // batch_size))
```
medium

Redundant assignment split_kv = split_kv = ....
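With the duplicate assignment dropped, the heuristic is a one-liner. Pulling it into a named function (hypothetical, not in the PR) also makes the capping behavior easy to check:

```python
def compute_split_kv(max_seq_len: int, block_size: int, num_cu: int, batch_size: int) -> int:
    """KV-split heuristic from the flagged line, with the redundant
    `split_kv = split_kv = ...` collapsed to a single assignment.
    Caps the split at the number of KV blocks and at the CUs available
    per batch element, never dropping below 1."""
    return max(1, min(max_seq_len // block_size, num_cu // batch_size))
```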

@HaiShaw HaiShaw merged commit 18afbf1 into sgl-project:amd/deepseek_v4 Apr 29, 2026
1 check passed
