
[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)#22051

Open
froststeam wants to merge 1 commit into sgl-project:main from froststeam:qzg/musa-fa-fix

Conversation

@froststeam
Contributor

@froststeam froststeam commented Apr 3, 2026

Motivation

This PR fixes the Flash Attention backend support that was previously merged in PR #17985 but later reverted in PR #22002 due to a bug. The original commit 2373552 caused CI failures (see failed CI job).

Previously, the MUSA-adapted flash attention implementation had a bug in the _forward_extend_impl method. The code was missing a proper mechanism to select the correct kernel implementation based on the fa_impl_ver parameter, causing it to always use the default FA3 implementation regardless of the specified version.

Fix Applied

After rebasing to the latest main branch, the kernel selection logic has been refactored and moved to the FlashAttentionBackend.__init__ method. This ensures that the appropriate flash attention implementation is selected during initialization based on the fa_impl_ver parameter.

  1. Moved kernel selection to __init__: The logic to select the correct flash attention kernel (including MUSA-specific implementations) is now handled in the FlashAttentionBackend.__init__ method, where two instance variables are initialized:

    • self.flash_attn_with_kvcache: For cached attention operations
    • self.flash_attn_varlen_func: For variable-length attention operations
  2. Updated forward methods: Both _forward_extend_impl and _forward_decode_impl now use these instance variables instead of directly calling the default implementations, ensuring the correct kernel is used based on the initialized configuration.
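The two-step refactor above can be sketched as follows. Only `fa_impl_ver`, `flash_attn_with_kvcache`, `flash_attn_varlen_func`, and the `_forward_*_impl` method names come from the PR description; the stub kernel functions, constructor signature, and selection conditions are illustrative assumptions, not SGLang's actual implementation.

```python
# Illustrative stand-ins for the real FA3 / MUSA (MATE) kernels.
def _fa3_with_kvcache(*args, **kwargs):
    return "fa3_kvcache"

def _fa3_varlen(*args, **kwargs):
    return "fa3_varlen"

def _musa_with_kvcache(*args, **kwargs):
    return "musa_kvcache"

def _musa_varlen(*args, **kwargs):
    return "musa_varlen"


class FlashAttentionBackend:
    def __init__(self, fa_impl_ver: int = 3, use_musa: bool = False):
        # Step 1: select the kernel pair once, during initialization,
        # so every forward path uses a consistent implementation.
        if use_musa:
            self.flash_attn_with_kvcache = _musa_with_kvcache
            self.flash_attn_varlen_func = _musa_varlen
        else:
            self.flash_attn_with_kvcache = _fa3_with_kvcache
            self.flash_attn_varlen_func = _fa3_varlen

    # Step 2: forward methods use the stored instance variables
    # instead of calling a default kernel directly.
    def _forward_decode_impl(self, *args, **kwargs):
        return self.flash_attn_with_kvcache(*args, **kwargs)

    def _forward_extend_impl(self, *args, **kwargs):
        return self.flash_attn_varlen_func(*args, **kwargs)
```

This mirrors the bug fix: before the change, `_forward_extend_impl` always called the default FA3 kernel; binding the functions in `__init__` makes the `fa_impl_ver`/MUSA choice authoritative for both decode and extend paths.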

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md#pull-request-merge-process .
  2. Get approvals from the reviewers listed in https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS and from other reviewers.
  3. Trigger CI tests following https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests , or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Related Links:

@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Apr 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the MUSA (Moore Threads GPU) hardware backend, specifically focusing on Flash Attention integration. It adds necessary dependencies, configuration parameters, and a new MUSA-specific attention module that wraps the mate library's flash attention functions. The implementation uses a thread-local context manager to automatically inject scheduler metadata into attention calls. Key changes include updates to the attention registry, the FlashAttentionBackend to handle MUSA-specific logic, and server argument adjustments for MUSA compatibility. Feedback highlights potential issues with global buffer safety in multi-GPU environments, metadata cache collisions due to non-unique keys, and the implications of ignoring cu_seqlens_k_new in the MUSA implementation.
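The thread-local metadata injection mentioned above can be sketched like this. The names `attention_metadata` and `current_metadata` are hypothetical stand-ins for whatever the PR's MUSA attention module actually calls them; only the general pattern (a thread-local slot plus a context manager that sets and restores it) is taken from the review summary.

```python
import threading
from contextlib import contextmanager

# One slot per thread, so concurrent scheduler threads don't
# clobber each other's metadata.
_local = threading.local()


@contextmanager
def attention_metadata(metadata):
    # Stash scheduler metadata for the duration of the block so
    # attention wrappers can read it without every call site having
    # to thread it through explicitly.
    prev = getattr(_local, "metadata", None)
    _local.metadata = metadata
    try:
        yield
    finally:
        # Restore whatever was there before, supporting nesting.
        _local.metadata = prev


def current_metadata():
    # Attention wrappers call this to pick up the injected metadata.
    return getattr(_local, "metadata", None)
```

A wrapper around a flash-attention kernel would call `current_metadata()` internally, which is why the concerns about cache-key collisions and cross-thread buffer safety matter in multi-GPU setups.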

@froststeam froststeam changed the title [MUSA][9/N] Re-introduce FA3 attention backend support through MATE (MUSA AI Tensor Engine) [MUSA][9/N] Re-introduce FA3 attention backend support through MATE Apr 3, 2026
Collaborator

@yeahdongcn yeahdongcn left a comment


I think it would be better to split this into two commits: one carrying over changes from the previous PR, and another fixing the regression in selecting FA kernels for different NVIDIA GPU architectures. This should make it easier for the SGLang core team to review.

@yeahdongcn yeahdongcn requested a review from Kangyan-Zhou April 5, 2026 13:20
@froststeam froststeam changed the title [MUSA][9/N] Re-introduce FA3 attention backend support through MATE [MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) Apr 6, 2026
@froststeam froststeam force-pushed the qzg/musa-fa-fix branch 2 times, most recently from 08c699e to 9cb257c Compare April 6, 2026 12:44
Comment on lines +752 to +753
flash_attn_varlen_func = self.flash_attn_varlen_func
flash_attn_with_kvcache = self.flash_attn_with_kvcache
Collaborator


The key updates to resolve the previous regression in FA3/FA4 kernel wiring on NVIDIA GPUs should be here, since upstream main now selects and stores the kernels during __init__.

Collaborator


I'll go ahead and approve this PR to trigger CI. Could @Fridge003 and @Kangyan-Zhou please take a final look? Thanks!

@yeahdongcn
Collaborator

/tag-and-rerun-ci


Labels

dependencies Pull requests that update a dependency file mthreads run-ci


2 participants