[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)#22051
froststeam wants to merge 1 commit into sgl-project:main
Conversation
Code Review
This pull request introduces support for the MUSA (Moore Threads GPU) hardware backend, specifically focusing on Flash Attention integration. It adds necessary dependencies, configuration parameters, and a new MUSA-specific attention module that wraps the mate library's flash attention functions. The implementation uses a thread-local context manager to automatically inject scheduler metadata into attention calls. Key changes include updates to the attention registry, the FlashAttentionBackend to handle MUSA-specific logic, and server argument adjustments for MUSA compatibility. Feedback highlights potential issues with global buffer safety in multi-GPU environments, metadata cache collisions due to non-unique keys, and the implications of ignoring cu_seqlens_k_new in the MUSA implementation.
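The thread-local context-manager pattern described above could look roughly like the following. This is a hypothetical sketch, not the actual MATE code: the names `attention_metadata` and `current_metadata` are invented for illustration, and the metadata is an opaque object the wrapped flash-attention functions would read.

```python
import threading
from contextlib import contextmanager

# Thread-local storage so concurrent scheduler threads do not
# clobber each other's metadata (illustrative names, not the MATE API).
_local = threading.local()

@contextmanager
def attention_metadata(metadata):
    """Make `metadata` visible to attention calls made in this thread."""
    prev = getattr(_local, "forward_metadata", None)
    _local.forward_metadata = metadata
    try:
        yield
    finally:
        # Restore the previous value so nested contexts compose safely.
        _local.forward_metadata = prev

def current_metadata():
    """What a wrapped flash-attention function would call to fetch metadata."""
    return getattr(_local, "forward_metadata", None)
```

A wrapped kernel would call `current_metadata()` internally, so callers never pass scheduler metadata explicitly.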
yeahdongcn left a comment
I think it would be better to split this into two commits: one carrying over changes from the previous PR, and another fixing the regression in selecting FA kernels for different NVIDIA GPU architectures. This should make it easier for the SGLang core team to review.
Force-pushed 3369ebb to ba20eee
Force-pushed 08c699e to 9cb257c
Force-pushed 9cb257c to 0af5fe5
```python
flash_attn_varlen_func = self.flash_attn_varlen_func
flash_attn_with_kvcache = self.flash_attn_with_kvcache
```
The key updates to resolve the previous regression in FA3/FA4 kernel wiring on NVIDIA GPUs should be here, since upstream main now selects and stores the kernels during __init__.
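For context, the select-and-store-during-`__init__` pattern this comment refers to can be sketched as follows. This is a simplified illustration, not SGLang's actual code: the `_fa3_*`/`_fa4_*` functions are stand-ins for the real FA3/FA4 (and MUSA) kernels, and the real backend takes many more constructor arguments.

```python
# Stub kernel callables standing in for the real flash-attention kernels
# (in SGLang these are imported from sgl-kernel or, on MUSA, from mate).
def _fa3_varlen(*args, **kwargs): return "fa3_varlen"
def _fa3_kvcache(*args, **kwargs): return "fa3_kvcache"
def _fa4_varlen(*args, **kwargs): return "fa4_varlen"
def _fa4_kvcache(*args, **kwargs): return "fa4_kvcache"

class FlashAttentionBackend:
    """Selects the kernel pair once, at construction time."""

    def __init__(self, fa_impl_ver: int = 3):
        if fa_impl_ver == 4:
            self.flash_attn_varlen_func = _fa4_varlen
            self.flash_attn_with_kvcache = _fa4_kvcache
        else:  # default to the FA3 kernels
            self.flash_attn_varlen_func = _fa3_varlen
            self.flash_attn_with_kvcache = _fa3_kvcache

    def _forward_extend_impl(self, *args, **kwargs):
        # Uses the stored kernel rather than a hard-coded default,
        # which was the bug in the original MUSA adaptation.
        return self.flash_attn_varlen_func(*args, **kwargs)

    def _forward_decode_impl(self, *args, **kwargs):
        return self.flash_attn_with_kvcache(*args, **kwargs)
```

The design point is that version dispatch happens exactly once, so the per-token forward paths stay branch-free.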
I'll go ahead and approve this PR to trigger CI. Could @Fridge003 and @Kangyan-Zhou please take a final look? Thanks!
/tag-and-rerun-ci
Motivation
This PR fixes the Flash Attention backend support that was previously merged in PR #17985 but later reverted in PR #22002 due to a bug. The original commit 2373552 caused CI failures (see failed CI job).
Previously, the MUSA-adapted flash attention implementation had a bug in the `_forward_extend_impl` method. The code was missing a proper mechanism to select the correct kernel implementation based on the `fa_impl_ver` parameter, causing it to always use the default FA3 implementation regardless of the specified version.

Fix Applied
After rebasing onto the latest main branch, the kernel selection logic has been refactored and moved into the `FlashAttentionBackend.__init__` method. This ensures that the appropriate flash attention implementation is selected during initialization based on the `fa_impl_ver` parameter.

- Moved kernel selection to `__init__`: the logic to select the correct flash attention kernel (including the MUSA-specific implementations) is now handled in the `FlashAttentionBackend.__init__` method, where two instance variables are initialized:
  - `self.flash_attn_with_kvcache`: for cached attention operations
  - `self.flash_attn_varlen_func`: for variable-length attention operations
- Updated forward methods: both `_forward_extend_impl` and `_forward_decode_impl` now use these instance variables instead of calling the default implementations directly, ensuring the correct kernel is used based on the initialized configuration.

Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci

Related Links: