
[Diffusion] Refactor diffusion Flash Attention backend #16000

Closed
BBuf wants to merge 3 commits into main from refactor_diffusion_fa_backend

Conversation

@BBuf (Collaborator) commented Dec 28, 2025

Motivation

Refers to: #15812

This PR refactors the flash attention implementation in the diffusion module to use only the official flash-attention library instead of sgl-kernel, simplifying the codebase and improving performance.

H100 Single-GPU Performance Benchmarks

Tested on a single GPU with various diffusion models:

| Model                | Steps | Main (s) | PR (s) | Speedup | Improvement  |
|----------------------|-------|----------|--------|---------|--------------|
| FLUX.1-dev           | 50    | 8.26     | 7.96   | 1.04x   | 3.6% faster  |
| FLUX.2-dev           | 50    | 26.49    | 23.45  | 1.13x   | 11.5% faster |
| Qwen-Image           | 50    | 14.16    | 13.55  | 1.04x   | 4.3% faster  |
| Qwen-Image-Edit-2511 | 40    | 27.74    | 25.58  | 1.08x   | 7.8% faster  |
| Average              | -     | -        | -      | 1.07x   | 6.8% faster  |

Per-Step Performance

| Model                | Main (s/step) | PR (s/step) | Improvement  |
|----------------------|---------------|-------------|--------------|
| FLUX.1-dev           | 0.1649        | 0.1591      | 3.5% faster  |
| FLUX.2-dev           | 0.5209        | 0.4614      | 11.4% faster |
| Qwen-Image           | 0.2831        | 0.2709      | 4.3% faster  |
| Qwen-Image-Edit-2511 | 0.6934        | 0.6395      | 7.8% faster  |
| Average              | -             | -           | 6.8% faster  |
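The Speedup and Improvement columns follow directly from the end-to-end timings; a small self-contained snippet recomputing them (model names and times taken from the table above, nothing else assumed):

```python
# Recompute Speedup and Improvement from the end-to-end timings
# in the benchmark table above (seconds; lower is better).
main_s = {
    "FLUX.1-dev": 8.26,
    "FLUX.2-dev": 26.49,
    "Qwen-Image": 14.16,
    "Qwen-Image-Edit-2511": 27.74,
}
pr_s = {
    "FLUX.1-dev": 7.96,
    "FLUX.2-dev": 23.45,
    "Qwen-Image": 13.55,
    "Qwen-Image-Edit-2511": 25.58,
}

results = {}
for model, t_main in main_s.items():
    t_pr = pr_s[model]
    speedup = t_main / t_pr                         # e.g. 26.49 / 23.45 ≈ 1.13x
    improvement = (t_main - t_pr) / t_main * 100.0  # percent faster than main
    results[model] = (round(speedup, 2), round(improvement, 1))

for model, (s, i) in results.items():
    print(f"{model}: {s:.2f}x, {i}% faster")
```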

main (baseline; benchmarked with the following patch, which makes `_should_use_upstream_flash_attention` return `False` unconditionally so the sgl-kernel path is measured)

--- a/python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py
+++ b/python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py
@@ -59,6 +59,7 @@ def _should_use_upstream_flash_attention(
     k_shape: tuple[int, ...],
     v_shape: tuple[int, ...],
 ) -> bool:
+    return False
     if not upstream_available or not upstream_heads_ok:
         return False
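The inserted `return False` works because Python hits it before any of the existing checks run. A minimal standalone sketch of the same early-return gating pattern (a simplified toy with only two flags, not the actual sglang function, which also takes the q/k/v shapes):

```python
# Simplified sketch of the benchmark patch above: an early return
# short-circuits a feature gate, forcing the fallback path.
def should_use_upstream_flash_attention(
    upstream_available: bool,
    upstream_heads_ok: bool,
) -> bool:
    return False  # patch: always fall back to the sgl-kernel path
    # Everything below is now unreachable.
    if not upstream_available or not upstream_heads_ok:
        return False
    return True

# Even when every condition would allow the upstream library,
# the early return keeps it disabled:
print(should_use_upstream_flash_attention(True, True))  # prints False
```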
 

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --output-file-name=FLUX.1.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:08<00:00, 6.06it/s]
[12-27 15:00:40] [DenoisingStage] average time per step: 0.1649 seconds
[12-27 15:00:40] [DenoisingStage] finished in 8.2553 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.2-dev --output-file-name=FLUX.2.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:26<00:00, 1.92it/s]
[12-27 14:59:45] [DenoisingStage] average time per step: 0.5209 seconds
[12-27 14:59:46] [DenoisingStage] finished in 26.4943 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=Qwen/Qwen-Image --output-file-name=Qwen-Image

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00, 3.53it/s]
[12-27 15:03:20] [DenoisingStage] average time per step: 0.2831 seconds
[12-27 15:03:20] [DenoisingStage] finished in 14.1618 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --model-path Qwen/Qwen-Image-Edit-2511 --prompt "Change the person to a standing position, bending over to hold the dog's front paws." --image-path "/home/lmsys/bbuf/LightX2V/examples/qwen_image/1.png"

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00, 1.44it/s]
[12-27 15:07:31] [DenoisingStage] average time per step: 0.6934 seconds
[12-27 15:07:31] [DenoisingStage] finished in 27.7448 seconds

PR (this branch)

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --output-file-name=FLUX.1.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:07<00:00, 6.29it/s]
[12-28 06:09:44] [DenoisingStage] average time per step: 0.1591 seconds
[12-28 06:09:44] [DenoisingStage] finished in 7.9619 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.2-dev --output-file-name=FLUX.2.dev

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:23<00:00, 2.17it/s]
[12-28 06:12:50] [DenoisingStage] average time per step: 0.4614 seconds
[12-28 06:12:51] [DenoisingStage] finished in 23.4544 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=Qwen/Qwen-Image --output-file-name=Qwen-Image

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:13<00:00, 3.69it/s]
[12-28 06:14:14] [DenoisingStage] average time per step: 0.2709 seconds
[12-28 06:14:14] [DenoisingStage] finished in 13.5527 seconds

CUDA_VISIBLE_DEVICES=7 sglang generate --model-path Qwen/Qwen-Image-Edit-2511 --prompt "Change the person to a standing position, bending over to hold the dog's front paws." --image-path "/home/lmsys/bbuf/LightX2V/examples/qwen_image/1.png"

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:25<00:00, 1.56it/s]
[12-28 06:54:22] [DenoisingStage] average time per step: 0.6395 seconds
[12-28 06:54:22] [DenoisingStage] finished in 25.5836 seconds

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor)

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the diffusion SGLang Diffusion label Dec 28, 2025
@yuan-luo (Collaborator)

Can this enhancement be adopted in VLM?

@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Dec 28, 2025
@BBuf (Collaborator, Author) commented Dec 28, 2025

> Can this enhancement be adopted in VLM?

Does the VLM use the attention backend under the diffusion module when calling flash attention?

@BBuf (Collaborator, Author) commented Dec 28, 2025

/tag-and-rerun-ci

@BBuf BBuf added the run-ci label Dec 28, 2025
@yuan-luo (Collaborator) commented Jan 2, 2026

> > Can this enhancement be adopted in VLM?
>
> Does the VLM use the attention backend under the diffusion module when calling flash attention?

FA3 is an optional backend of VisionAttention in VLM. I reckon a similar refactor might be helpful.

@BBuf (Collaborator, Author) commented Jan 4, 2026

It's hard to install flash-attention in the sglang Docker image, so I'm closing this; we need to upgrade sgl-kernel's flash-attention instead.

@BBuf BBuf closed this Jan 4, 2026
@BBuf BBuf deleted the refactor_diffusion_fa_backend branch January 5, 2026 05:53

Labels

dependencies (Pull requests that update a dependency file), diffusion (SGLang Diffusion), run-ci

3 participants