[diffusion] hardware: support sage attention backend on MUSA (attn backend, 21/N) #24752

Merged
Kangyan-Zhou merged 2 commits into sgl-project:main from yeahdongcn:xd/musa_sage on May 12, 2026

@yeahdongcn
Collaborator

Motivation

Add Sage Attention backend support to the MUSA platform.

Modifications

  1. Update musa.py to add Sage Attention backend selection.
  2. Update the docs.
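
A minimal sketch of what platform-level backend selection of this kind could look like. Everything in it is hypothetical: the `AttentionBackend` enum, the `select_attn_backend` function, and the FlashAttention fallback are assumptions made for illustration and are not the contents of musa.py. Only the `fa` and `sage_attn` names and the log message text come from the log in this PR.

```python
import logging
from enum import Enum
from typing import Optional

logger = logging.getLogger(__name__)


class AttentionBackend(Enum):
    """Hypothetical enum; the values mirror backend names that appear in the log."""

    FA = "fa"
    SAGE_ATTN = "sage_attn"


def select_attn_backend(requested: Optional[str]) -> AttentionBackend:
    """Resolve a requested backend name (illustrative, not the real musa.py logic)."""
    if requested is None:
        # Assumption: fall back to FlashAttention when nothing is requested.
        return AttentionBackend.FA
    try:
        backend = AttentionBackend(requested)
    except ValueError:
        raise ValueError(f"Unsupported attention backend: {requested!r}") from None
    if backend is AttentionBackend.SAGE_ATTN:
        logger.info("Using Sage Attention backend")
    return backend
```

Rejecting unknown names early, rather than silently falling back, is a design choice of this sketch; the actual platform code may behave differently.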

Accuracy Tests

Key evidence:

[05-09 11:26:38] Using Sage Attention backend
[05-09 11:26:38] Using sage_attn backend for transformer
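
One way to check this evidence mechanically is to scan a saved log for the per-component backend lines. The sketch below assumes the log has been captured as text; the `backends_by_component` helper and its regular expression are illustrative and are not part of sglang.

```python
import re
from typing import Dict

# Matches lines such as "[05-09 11:26:38] Using sage_attn backend for transformer".
BACKEND_LINE = re.compile(r"Using (?P<backend>\w+) backend for (?P<component>\w+)")


def backends_by_component(log_text: str) -> Dict[str, str]:
    """Return a mapping of component name to the backend the log reports for it."""
    result: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = BACKEND_LINE.search(line)
        if match:
            result[match.group("component")] = match.group("backend")
    return result


# Sample lines copied from the log in this PR.
sample = """\
[05-09 11:26:20] Using fa backend for text_encoder
[05-09 11:26:38] Using Sage Attention backend
[05-09 11:26:38] Using sage_attn backend for transformer
"""
found = backends_by_component(sample)
```

With the sample above, `found["transformer"]` is `"sage_attn"`, while the text encoder still reports `fa`.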

Full log:

root@worker3218:/ws# sglang generate --model-path /home/dist/FLUX.1-dev     --prompt "A logo With Bold Large text: SGL Diffusion"     --save-output --attention-backend sage_attn
WARNING:py.warnings:/ws/python/sglang/srt/layers/quantization/awq/awq.py:52: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-05-09 11:26:07 | warnings | 140678467908736 | WARNING : /ws/python/sglang/srt/layers/quantization/awq/awq.py:52: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

INFO:sglang.multimodal_gen.runtime.utils.hf_diffusers_utils:Diffusers version: 0.30.0.dev0
2026-05-09 11:26:09 | hf_diffusers_utils | 140678467908736 | INFO : Diffusers version: 0.30.0.dev0
[05-09 11:26:09] Disabling some offloading (except dit, text_encoder) for image generation model
[05-09 11:26:09] server_args: {"model_path": "/home/dist/FLUX.1-dev", "model_id": null, "backend": "auto", "attention_backend": "sage_attn", "attention_backend_config": {}, "component_attention_backends": {}, "cache_dit_config": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "cfg_parallel_degree": 1, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": 3600, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "lora_scale": 1.0, "lora_weight_name": null, "component_paths": {}, "transformer_weights_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "dit_offload_prefetch_size": 0.0, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "ltx2_two_stage_device_mode": null, "comfyui_mode": false, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "warmup_steps": 1, "disable_autocast": true, "quantization": null, "master_port": 30005, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5611, "batching_mode": "dynamic", "batching_max_size": 1, "batching_delay_ms": 0.0, "batching_config": null, "enable_batching_metrics": false, "strict_ports": false, "output_path": "outputs/", "input_save_path": "inputs/uploads", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true, "video_vae": true, "audio_vae": true, "video_dit": true, "audio_dit": true, "dual_tower_bridge": true}, "boundary_ratio": null, "base_gpu_id": 0, "disagg_role": "monolithic", "disagg_timeout": 600, "disagg_dispatch_policy": "round_robin", "disagg_mode": false, "disagg_server_addr": null, "encoder_urls": null, "denoiser_urls": null, "decoder_urls": null, "encoder_tp": null, "denoiser_tp": null, "denoiser_sp": null, "denoiser_ulysses": null, "denoiser_ring": null, "decoder_tp": null, "disagg_transfer_pool_size": 268435456, "disagg_p2p_hostname": "127.0.0.1", "disagg_ib_device": null, "pool_work_endpoint": null, "pool_result_endpoint": null, "log_level": "info", "uvicorn_access_log_exclude_prefixes": [], "enable_trace": false, "otlp_traces_endpoint": "localhost:4317"}
[05-09 11:26:09] Diffusers version: 0.30.0.dev0
[05-09 11:26:09] Local mode: True
[05-09 11:26:09] Starting server...
WARNING:py.warnings:/ws/python/sglang/srt/layers/quantization/awq/awq.py:52: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-05-09 11:26:17 | warnings | 140094605653120 | WARNING : /ws/python/sglang/srt/layers/quantization/awq/awq.py:52: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

[05-09 11:26:19] Scheduler bind at endpoint: tcp://127.0.0.1:5611
[05-09 11:26:19] Initializing distributed environment with world_size=1, device=musa:0, timeout=3600
[05-09 11:26:19] Setting distributed timeout to 3600 seconds
[05-09 11:26:19] No pipeline_class_name specified, using model_index.json
[05-09 11:26:19] Diffusers version: 0.30.0.dev0
[05-09 11:26:19] Using pipeline from model_index.json: FluxPipeline
[05-09 11:26:19] Loading pipeline modules...
[05-09 11:26:19] Model already exists locally and is complete
[05-09 11:26:19] Model path: /home/dist/FLUX.1-dev
[05-09 11:26:19] Diffusers version: 0.30.0.dev0
[05-09 11:26:19] Loading pipeline modules from config: {'_class_name': 'FluxPipeline', '_diffusers_version': '0.30.0.dev0', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'CLIPTextModel'], 'text_encoder_2': ['transformers', 'T5EncoderModel'], 'tokenizer': ['transformers', 'CLIPTokenizer'], 'tokenizer_2': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'FluxTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[05-09 11:26:19] Loading required components: ['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                                                                                                        | 0/7 [00:00<?, ?it/s][05-09 11:26:19] Loading text_encoder from /home/dist/FLUX.1-dev/text_encoder. avail mem: 79.30 GB
[05-09 11:26:20] Using FlashAttention (FA3) backend
[05-09 11:26:20] Using fa backend for text_encoder
[05-09 11:26:20] [RunAI Streamer] Overall time to stream 234.7 MiB of all files to cpu: 0.06s, 4.1 GiB/s
[05-09 11:26:24] Applied FSDP to 13 submodules in FSDPCLIPTextModel using explicit shard conditions
[05-09 11:26:24] Loaded text_encoder: FSDPCLIPTextModel (sgl-diffusion version). model size: 0.23 GB, consumed GPU mem: 0.25 GB, avail GPU mem: 79.05 GB
Loading required modules:  14%|█████████████▋                                                                                  | 1/7 [00:05<00:30,  5.13s/it][05-09 11:26:24] Loading text_encoder_2 from /home/dist/FLUX.1-dev/text_encoder_2. avail mem: 79.05 GB
[05-09 11:26:26] [RunAI Streamer] Overall time to stream 8.9 GiB of all files to cpu: 1.95s, 4.6 GiB/s
[05-09 11:26:38] Applied FSDP to 26 submodules in FSDPT5EncoderModel using explicit shard conditions
[05-09 11:26:38] Loaded text_encoder_2: FSDPT5EncoderModel (sgl-diffusion version). model size: 8.87 GB, consumed GPU mem: 0.03 GB, avail GPU mem: 79.03 GB
Loading required modules:  29%|███████████████████████████▍                                                                    | 2/7 [00:18<00:50, 10.10s/it][05-09 11:26:38] Loading tokenizer from /home/dist/FLUX.1-dev/tokenizer. avail mem: 79.03 GB
[05-09 11:26:38] Loaded tokenizer: CLIPTokenizerFast (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 79.03 GB
Loading required modules:  43%|█████████████████████████████████████████▏                                                      | 3/7 [00:18<00:22,  5.53s/it][05-09 11:26:38] Loading tokenizer_2 from /home/dist/FLUX.1-dev/tokenizer_2. avail mem: 79.03 GB
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
[05-09 11:26:38] Loaded tokenizer_2: T5TokenizerFast (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 79.03 GB
Loading required modules:  57%|██████████████████████████████████████████████████████▊                                         | 4/7 [00:19<00:10,  3.44s/it][05-09 11:26:38] Loading vae from /home/dist/FLUX.1-dev/vae. avail mem: 79.03 GB
[05-09 11:26:38] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.31 GB, consumed GPU mem: 0.35 GB, avail GPU mem: 78.68 GB
Loading required modules:  71%|████████████████████████████████████████████████████████████████████▌                           | 5/7 [00:19<00:04,  2.28s/it][05-09 11:26:38] Loading transformer from /home/dist/FLUX.1-dev/transformer. avail mem: 78.68 GB
[05-09 11:26:38] Loading FluxTransformer2DModel from 3 safetensors file(s) , param_dtype: torch.bfloat16
[05-09 11:26:38] Using Sage Attention backend
[05-09 11:26:38] Using sage_attn backend for transformer
[05-09 11:26:42] [RunAI Streamer] Overall time to stream 22.2 GiB of all files to cpu: 3.24s, 6.8 GiB/s
[05-09 11:26:52] Loaded model with 11.90B parameters
[05-09 11:26:52] Loaded transformer: FluxTransformer2DModel (sgl-diffusion version). model size: 22.17 GB, consumed GPU mem: 0.00 GB, avail GPU mem: 78.68 GB
Loading required modules:  86%|██████████████████████████████████████████████████████████████████████████████████▎             | 6/7 [00:33<00:06,  6.25s/it][05-09 11:26:52] Loading scheduler from /home/dist/FLUX.1-dev/scheduler. avail mem: 78.68 GB
[05-09 11:26:52] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: NA GB, consumed GPU mem: 0.00 GB, avail GPU mem: 78.68 GB
Loading required modules: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:33<00:00,  4.75s/it]
[05-09 11:26:52] Creating pipeline stages...
[05-09 11:26:52] Using Sage Attention backend
[05-09 11:26:52] Pipeline instantiated
[05-09 11:26:52] Worker 0: Initialized device, model, and distributed environment.
[05-09 11:26:52] Worker 0: Scheduler loop started.
[05-09 11:26:52] Processing 1 grouped request(s)
[05-09 11:26:52] Sampling params:
                       width: 1024
                      height: 1024
                  num_frames: 1
                         fps: 24
                      prompt: <redacted, len=42>
                  neg_prompt: None
                        seed: 42
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 3.5
     embedded_guidance_scale: 3.5
                    n_tokens: None
                  flow_shift: None
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260509-112652_e04df28b.png
        
[05-09 11:26:52] Running pipeline stages: ['InputValidationStage', 'prompt_encoding_stage_primary', 'TimestepPreparationStage', 'LatentPreparationStage', 'DenoisingStage', 'DecodingStage']
[05-09 11:26:52] [InputValidationStage] started...
[05-09 11:26:52] [InputValidationStage] finished in 0.0001 seconds
[05-09 11:26:52] [TextEncodingStage] started...
[05-09 11:27:07] [TextEncodingStage] finished in 14.0563 seconds
[05-09 11:27:13] [TimestepPreparationStage] started...
[05-09 11:27:13] [TimestepPreparationStage] finished in 0.0022 seconds
[05-09 11:27:13] [LatentPreparationStage] started...
[05-09 11:27:13] [LatentPreparationStage] finished in 0.0057 seconds
[05-09 11:27:13] [DenoisingStage] started...
  0%|                                                                                                                                 | 0/50 [00:00<?, ?it/s][05-09 11:27:18] FlashInfer not available, using Triton fallback for RoPE
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:25<00:00,  1.93it/s]
[05-09 11:27:39] [DenoisingStage] average time per step: 0.5188 seconds
[05-09 11:27:39] [DenoisingStage] finished in 25.9497 seconds
[05-09 11:27:39] [DecodingStage] started...
[05-09 11:27:39] [DecodingStage] finished in 0.0463 seconds
[05-09 11:27:39] Output saved to outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260509-112652_e04df28b.png
[05-09 11:27:39] Pixel data generated successfully in 46.80 seconds
[05-09 11:27:39] Memory usage - Max peak: 29740.00 MB, Avg peak: 29740.00 MB
[05-09 11:27:39] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
[05-09 11:27:40] Worker 0: Shutdown complete.
root@worker3218:/ws# 
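
The per-stage timings in the log above can be pulled out with a short script, which makes it easier to compare runs across attention backends. This is a sketch under the assumption that the "finished in N seconds" line format stays as shown; the `stage_durations` helper is illustrative and not part of sglang.

```python
import re
from typing import Dict

# Matches e.g. "[DenoisingStage] finished in 25.9497 seconds".
STAGE_DONE = re.compile(r"\[(?P<stage>\w+)\] finished in (?P<seconds>[\d.]+) seconds")


def stage_durations(log_text: str) -> Dict[str, float]:
    """Map each pipeline stage name to its reported duration in seconds."""
    return {
        match.group("stage"): float(match.group("seconds"))
        for match in STAGE_DONE.finditer(log_text)
    }


# Stage completion lines copied from the log above.
sample = """\
[05-09 11:26:52] [InputValidationStage] finished in 0.0001 seconds
[05-09 11:27:07] [TextEncodingStage] finished in 14.0563 seconds
[05-09 11:27:13] [TimestepPreparationStage] finished in 0.0022 seconds
[05-09 11:27:13] [LatentPreparationStage] finished in 0.0057 seconds
[05-09 11:27:39] [DenoisingStage] finished in 25.9497 seconds
[05-09 11:27:39] [DecodingStage] finished in 0.0463 seconds
"""
durations = stage_durations(sample)
slowest = max(durations, key=durations.get)  # stage with the largest duration
total_in_stages = sum(durations.values())
```

For this run the denoising stage dominates, and the six stage durations sum to roughly 40.06 seconds, below the 46.80 seconds reported for the whole request. The log shows a gap between 11:27:07 and 11:27:13 that no stage line accounts for.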

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the documentation, diffusion, and mthreads labels May 9, 2026
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@yeahdongcn
Collaborator Author

@mickqian Please take a look. Thanks!

@mickqian
Collaborator

/tag-and-rerun-ci

@Kangyan-Zhou Kangyan-Zhou merged commit 0a37d24 into sgl-project:main May 12, 2026
169 of 189 checks passed
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026
…ckend, 21/N) (sgl-project#24752)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
SpencerGarnets added a commit to ai-blaise/optimization-playground that referenced this pull request May 12, 2026
…ack)

Brings in upstream sgl-project/sglang main commits since
096ad02 (merge base, Laguna-XS.2 model support).
Total: 28 upstream commits composed.

Custom-stack files preserved intact (entirely-ours, byte-identical to
origin/main):
  - Blackwell CuTe kernel suite (warp_decode_cute, g1_attention_cute,
    gated_norm_cute, layersplit_cute, fused_store_index_cache)
  - TurboQuant 2.5-bit dense KV cache path
  - HIGGS 2-bit dense KV cache path (with split-K decode)
  - NVFP4 IndexCache dispatcher (active gate)
  - quantization_config_dispatch (HF-config-driven runtime routing)
  - All custom server-args flags and runtime methods preserved

Verification:
  - 200+ merged Python files compile cleanly
  - Dispatcher symbol presence verified
  - HIGGS pool / TurboQuant pool classes present at expected lines
  - compressed_tensors_w4a4_nvfp4_moe imports clean
  - All custom server-args flags present (enable_higgs_dense_2bit_kv_cache,
    enable_turboquant_dense_kv_cache, turboquant_dense_kv_preset,
    indexer_quantization_declared, higgs_mla_decode_num_splits, etc.)

Manual-merged shared files (auto-merge gave broken/mixed output; cleaned
up post-merge):
  - python/sglang/srt/disaggregation/mooncake/conn.py: upstream's PR#24932
    refactored maybe_send_extra into a state-types-loop. Replayed our
    LayerSplit NSA state-index-length-mismatch check inside the SWA/NSA
    branch of the new loop body.
  - sgl-kernel/python/sgl_kernel/__init__.py: upstream's PR#23449 (Apple
    Silicon Metal kernel) wrapped the entire module body in
    `if darwin/arm64: from sgl_kernel.metal import * else: ...`. The
    auto-merge duplicated the file body; rewrote cleanly with upstream's
    structure and re-injected our `g1_gate_forward`,
    `warp_decode_cute_moe_forward`, and
    `warp_decode_cute_moe_packed_forward` imports plus `g1_gate_forward`
    in _DEBUG_EXPORT_NAMES.
  - python/sglang/srt/managers/scheduler_output_processor_mixin.py: line
    628 still referenced `result.num_accepted_drafts` (renamed by PR
    sgl-project#25038 to `num_correct_drafts`). Renamed in place.
  - python/sglang/srt/observability/scheduler_metrics_mixin.py: a block
    around the spec-decode logging path had mixed old/new names from
    auto-merge (lines 553/557/560). Renamed `spec_num_accepted_tokens`
    -> `spec_num_accept_tokens` and local `num_accepted_drafts` ->
    `num_correct_drafts` to match the rest of the file.
  - test/test_smc_info.py: stub Req mock used the old field names
    `spec_accepted_drafts` and `update_spec_acceptance_histogram`.
    Renamed to `spec_num_correct_drafts` and
    `update_spec_correct_drafts_histogram` per PR sgl-project#24081.

Auto-merge cleanly integrated upstream changes to:
  - server_args.py (new fields: prefill_only_disable_kv_cache,
    weight_loader_drop_cache_after_load, prefill_delayer_queue_min_ratio,
    prefill_delayer_max_delay_ms, speculative_draft_window_size, etc.)
  - mem_cache/memory_pool.py (new NoOpMHATokenToKVPool)
  - model_executor/model_runner_kv_cache_mixin.py (NoOpMHATokenToKVPool
    pool factory + _validate_prefill_only_disable_kv_cache_pool_family)
  - layers/attention/nsa_backend.py (spec rename
    num_accepted_drafts -> num_correct_drafts;
    num_accepted_tokens -> num_accept_tokens)
  - layers/attention/nsa/nsa_indexer.py (new _apply_q_scale_and_softmax_scale
    compile method; torch.mm replaces deep_gemm wrapper)
  - 28+ disaggregation/spec/runner files with mostly clean
    upstream-side-only integration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

----- upstream commit subjects (28) -----
fd3eb77 [Cookbook]: add Laguna-XS.2 (Poolside) (sgl-project#24730)
6be1a45 Fix swa component host hit (sgl-project#25085)
693f497 [NPU] use causal_conv1d_update_v2 for performance (sgl-project#24595)
1efe9e2 [Bug Fix] Reject incompatible combination of --disable-cuda-graph-padding and --enable-torch-compile (sgl-project#23903)
8d27ce7 Optimize uvicorn startup command (sgl-project#25041)
b35fd5f [fix] skip legacy minicpmv conv template for MiniCPM-V 4.6 (sgl-project#24998)
7582237 [Tiny Fix] Disable BCG when inner layer_model unresolved (sgl-project#25021)
ca3bc05 Deepseek-v4-Pro share expert tp1 (sgl-project#24949)
a72d3ae [Spec] Multi-layer mamba scatter cleanup; fix positional call bug (sgl-project#25030)
7128533 Revert "Migrate Intel CPU cases to the test/registered." (sgl-project#25044)
1f985c5 [Spec] Rename `accepted_indices` -> `accept_indices`; drop `_token_id` suffix per Rule 5 (sgl-project#25038)
ecf5d84 Migrate Intel CPU cases to the test/registered. (sgl-project#22670)
d7f4761 [PD] Refactor hybrid state transfer (sgl-project#24932)
91907b7 [UnifiedTree]: Fix Unified HiCache tombstone lock release replay (sgl-project#24972)
4ad63ad [Spec] Rename `accepted_drafts` -> `correct_drafts` for unambiguous naming (sgl-project#24081)
6bfb365 [PD] Rate limit prefill inflight polling warnings (sgl-project#24967)
6bb79c1 [Linear Attn] Add CUSTOM enum and plugin extensibility for kernel backends (sgl-project#24937)
cfc41d5 Fix kimi k2.5 mla eagle + dp attention (sgl-project#25033)
0f3932c [Fix] Qwen3-ASR config: set thinker_config before super().__init__ (sgl-project#24187)
f526e3f [Spec] Mamba scatter cleanup; fix multi-layer positional bug; dflash naming (sgl-project#25029)
10375a1 [NIXL][XPU] Fix uint64 overflow for mismatched P/D TP sizes (e.g. prefill_tp=1, decode_tp=2) (sgl-project#24648)
0a37d24 [diffusion] hardware: support sage attention backend on MUSA (attn backend, 21/N) (sgl-project#24752)
5495026 [HiCache] feat: default storage prefetch timeout (sgl-project#23309)
186eb42 Feat: Support SWA (Sliding Window Attention) for EAGLE-3 drafter (sgl-project#24664)
a75b79e Feat: Support newer EAGLE-3 drafters (sgl-project#24663)
f3a8189 [Spec] Internal rename per N2 v2 naming rule (sgl-project#25014)
bfc2eda [MUSA] Use MUSA-optimized operators in piecewise CUDA graph (sgl-project#23633)
74d70af [Apple Silicon] Add Metal kernel support in sgl-kernel (sgl-project#23449)
xjpang pushed a commit to xjpang/sglang that referenced this pull request May 13, 2026
…ckend, 21/N) (sgl-project#24752)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

Labels

diffusion, documentation, mthreads, run-ci


3 participants