
[Diffusion] Refactor diffusion Flash Attention backend #16000

Closed
BBuf wants to merge 3 commits into main from refactor_diffusion_fa_backend

Conversation

@BBuf (Collaborator) commented Dec 28, 2025

Motivation

Refers to: #15812

This PR refactors the flash attention implementation in the diffusion module to use only the official flash-attention library instead of sgl-kernel, simplifying the codebase and improving performance.

H100 Single-GPU Performance Benchmarks

Tested on a single GPU with various diffusion models:

| Model                | Steps | Main (s) | PR (s) | Speedup | Improvement  |
|----------------------|-------|----------|--------|---------|--------------|
| FLUX.1-dev           | 50    | 8.26     | 7.96   | 1.04x   | 3.6% faster  |
| FLUX.2-dev           | 50    | 26.49    | 23.45  | 1.13x   | 11.5% faster |
| Qwen-Image           | 50    | 14.16    | 13.55  | 1.04x   | 4.3% faster  |
| Qwen-Image-Edit-2511 | 40    | 27.74    | 25.58  | 1.08x   | 7.8% faster  |
| Average              | -     | -        | -      | 1.07x   | 6.8% faster  |

Per-Step Performance

| Model                | Main (s/step) | PR (s/step) | Improvement  |
|----------------------|---------------|-------------|--------------|
| FLUX.1-dev           | 0.1649        | 0.1591      | 3.5% faster  |
| FLUX.2-dev           | 0.5209        | 0.4614      | 11.4% faster |
| Qwen-Image           | 0.2831        | 0.2709      | 4.3% faster  |
| Qwen-Image-Edit-2511 | 0.6934        | 0.6395      | 7.8% faster  |
| Average              | -             | -           | 6.8% faster  |
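The Speedup and Improvement columns follow directly from the end-to-end timings; a small self-contained snippet recomputing them (model names and times taken from the table above, nothing else assumed):

```python
# Recompute Speedup and Improvement from the end-to-end timings
# in the benchmark table above (seconds; lower is better).
main_s = {
    "FLUX.1-dev": 8.26,
    "FLUX.2-dev": 26.49,
    "Qwen-Image": 14.16,
    "Qwen-Image-Edit-2511": 27.74,
}
pr_s = {
    "FLUX.1-dev": 7.96,
    "FLUX.2-dev": 23.45,
    "Qwen-Image": 13.55,
    "Qwen-Image-Edit-2511": 25.58,
}

results = {}
for model, t_main in main_s.items():
    t_pr = pr_s[model]
    speedup = t_main / t_pr                         # e.g. 26.49 / 23.45 ≈ 1.13x
    improvement = (t_main - t_pr) / t_main * 100.0  # percent faster than main
    results[model] = (round(speedup, 2), round(improvement, 1))

for model, (s, i) in results.items():
    print(f"{model}: {s:.2f}x, {i}% faster")
```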

main (baseline; benchmarked with the following patch, which makes `_should_use_upstream_flash_attention` return `False` unconditionally so the sgl-kernel path is measured)

--- a/python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py
+++ b/python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py
@@ -59,6 +59,7 @@ def _should_use_upstream_flash_attention(
     k_shape: tuple[int, ...],
     v_shape: tuple[int, ...],
 ) -> bool:
+    return False
     if not upstream_available or not upstream_heads_ok:
         return False
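The inserted `return False` works because Python hits it before any of the existing checks run. A minimal standalone sketch of the same early-return gating pattern (a simplified toy with only two flags, not the actual sglang function, which also takes the q/k/v shapes):

```python
# Simplified sketch of the benchmark patch above: an early return
# short-circuits a feature gate, forcing the fallback path.
def should_use_upstream_flash_attention(
    upstream_available: bool,
    upstream_heads_ok: bool,
) -> bool:
    return False  # patch: always fall back to the sgl-kernel path
    # Everything below is now unreachable.
    if not upstream_available or not upstream_heads_ok:
        return False
    return True

# Even when every condition would allow the upstream library,
# the early return keeps it disabled:
print(should_use_upstream_flash_attention(True, True))  # prints False
```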
 

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --output-file-name=FLUX.1.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:08<00:00, 6.06it/s]
[12-27 15:00:40] [DenoisingStage] average time per step: 0.1649 seconds
[12-27 15:00:40] [DenoisingStage] finished in 8.2553 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.2-dev --output-file-name=FLUX.2.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:26<00:00, 1.92it/s]
[12-27 14:59:45] [DenoisingStage] average time per step: 0.5209 seconds
[12-27 14:59:46] [DenoisingStage] finished in 26.4943 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=Qwen/Qwen-Image --output-file-name=Qwen-Image

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00, 3.53it/s]
[12-27 15:03:20] [DenoisingStage] average time per step: 0.2831 seconds
[12-27 15:03:20] [DenoisingStage] finished in 14.1618 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --model-path Qwen/Qwen-Image-Edit-2511 --prompt "Change the person to a standing position, bending over to hold the dog's front paws." --image-path "/home/lmsys/bbuf/LightX2V/examples/qwen_image/1.png"

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00, 1.44it/s]
[12-27 15:07:31] [DenoisingStage] average time per step: 0.6934 seconds
[12-27 15:07:31] [DenoisingStage] finished in 27.7448 seconds

PR (this branch)

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --output-file-name=FLUX.1.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:07<00:00, 6.29it/s]
[12-28 06:09:44] [DenoisingStage] average time per step: 0.1591 seconds
[12-28 06:09:44] [DenoisingStage] finished in 7.9619 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.2-dev --output-file-name=FLUX.2.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:23<00:00, 2.17it/s]
[12-28 06:12:50] [DenoisingStage] average time per step: 0.4614 seconds
[12-28 06:12:51] [DenoisingStage] finished in 23.4544 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=Qwen/Qwen-Image --output-file-name=Qwen-Image

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:13<00:00, 3.69it/s]
[12-28 06:14:14] [DenoisingStage] average time per step: 0.2709 seconds
[12-28 06:14:14] [DenoisingStage] finished in 13.5527 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --model-path Qwen/Qwen-Image-Edit-2511 --prompt "Change the person to a standing position, bending over to hold the dog's front paws." --image-path "/home/lmsys/bbuf/LightX2V/examples/qwen_image/1.png"

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:25<00:00, 1.56it/s]
[12-28 06:54:22] [DenoisingStage] average time per step: 0.6395 seconds
[12-28 06:54:22] [DenoisingStage] finished in 25.5836 seconds

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor)

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the diffusion SGLang Diffusion label Dec 28, 2025
@yuan-luo (Collaborator)

Can this enhancement be adopted in VLM?

@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Dec 28, 2025
@BBuf (Collaborator, Author) commented Dec 28, 2025

> Can this enhancement be adopted in VLM?

Does the VLM use the attention backend under the diffusion module when calling flash attention?

@BBuf (Collaborator, Author) commented Dec 28, 2025

/tag-and-rerun-ci

@BBuf BBuf added the run-ci label Dec 28, 2025
@yuan-luo (Collaborator) commented Jan 2, 2026

> > Can this enhancement be adopted in VLM?
>
> Does the VLM use the attention backend under the diffusion module when calling flash attention?

FA3 is an optional backend of VisionAttention in VLM. I reckon a similar refactor might be helpful.

@BBuf (Collaborator, Author) commented Jan 4, 2026

It's hard to install flash-attention in the sglang Docker image, so I'm closing this; we need to upgrade sgl-kernel's flash-attention instead.

@BBuf BBuf closed this Jan 4, 2026
@BBuf BBuf deleted the refactor_diffusion_fa_backend branch January 5, 2026 05:53

Labels

dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), run-ci

3 participants