[Diffusion] Refactor diffusion Flash Attention backend#16000
Closed
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
Can this enhancement be adopted in VLM?
mickqian
reviewed
Dec 28, 2025
python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py
Collaborator
Author
Does the VLM use the attention backend under the diffusion module when calling flash attention?
Collaborator
Author
/tag-and-rerun-ci
Collaborator
FA3 is an optional backend of VisionAttention in VLM. I reckon a similar refactor might be helpful.
Collaborator
Author
It's hard to install flash-attention in the sglang Docker image, so I'm closing this; we need to upgrade sgl-kernel's flash-attention instead.
Motivation
Refer to: #15812

This PR refactors the flash attention implementation in the diffusion module to use only the official flash-attention library instead of sgl-kernel, simplifying the codebase and improving performance.

H100 single-GPU Performance Benchmarks
Tested on single GPU with various diffusion models:
Per-Step Performance
main (baseline; the early `return False` in the patch below forces the existing sgl-kernel path so the baseline run does not use the upstream library)

--- a/python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py
+++ b/python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py
@@ -59,6 +59,7 @@ def _should_use_upstream_flash_attention(
     k_shape: tuple[int, ...],
     v_shape: tuple[int, ...],
 ) -> bool:
+    return False
     if not upstream_available or not upstream_heads_ok:
         return False

CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --output-file-name=FLUX.1.dev
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:08<00:00, 6.06it/s]
[12-27 15:00:40] [DenoisingStage] average time per step: 0.1649 seconds
[12-27 15:00:40] [DenoisingStage] finished in 8.2553 seconds
CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.2-dev --output-file-name=FLUX.2.dev
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:26<00:00, 1.92it/s]
[12-27 14:59:45] [DenoisingStage] average time per step: 0.5209 seconds
[12-27 14:59:46] [DenoisingStage] finished in 26.4943 seconds
CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=Qwen/Qwen-Image --output-file-name=Qwen-Image
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00, 3.53it/s]
[12-27 15:03:20] [DenoisingStage] average time per step: 0.2831 seconds
[12-27 15:03:20] [DenoisingStage] finished in 14.1618 seconds
CUDA_VISIBLE_DEVICES=7 sglang generate --model-path Qwen/Qwen-Image-Edit-2511 --prompt "Change the person to a standing position, bending over to hold the dog's front paws." --image-path "/home/lmsys/bbuf/LightX2V/examples/qwen_image/1.png"
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:27<00:00, 1.44it/s]
[12-27 15:07:31] [DenoisingStage] average time per step: 0.6934 seconds
[12-27 15:07:31] [DenoisingStage] finished in 27.7448 seconds
pr
CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.1-dev --output-file-name=FLUX.1.dev
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:07<00:00, 6.29it/s]
[12-28 06:09:44] [DenoisingStage] average time per step: 0.1591 seconds
[12-28 06:09:44] [DenoisingStage] finished in 7.9619 seconds
CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=black-forest-labs/FLUX.2-dev --output-file-name=FLUX.2.dev
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:23<00:00, 2.17it/s]
[12-28 06:12:50] [DenoisingStage] average time per step: 0.4614 seconds
[12-28 06:12:51] [DenoisingStage] finished in 23.4544 seconds
CUDA_VISIBLE_DEVICES=7 sglang generate --prompt='A curious raccoon' --save-output --log-level=debug --output-path=outputs --model-path=Qwen/Qwen-Image --output-file-name=Qwen-Image
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:13<00:00, 3.69it/s]
[12-28 06:14:14] [DenoisingStage] average time per step: 0.2709 seconds
[12-28 06:14:14] [DenoisingStage] finished in 13.5527 seconds
CUDA_VISIBLE_DEVICES=7 sglang generate --model-path Qwen/Qwen-Image-Edit-2511 --prompt "Change the person to a standing position, bending over to hold the dog's front paws." --image-path "/home/lmsys/bbuf/LightX2V/examples/qwen_image/1.png"
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:25<00:00, 1.56it/s]
[12-28 06:54:22] [DenoisingStage] average time per step: 0.6395 seconds
[12-28 06:54:22] [DenoisingStage] finished in 25.5836 seconds
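Summarizing the per-step times logged above, the refactor is faster on every model tested. A small script (numbers copied verbatim from the logs; model labels follow the --model-path arguments) to compute the per-step speedups:

```python
# Per-step denoising times in seconds, copied from the benchmark logs above:
# (main branch, this PR) for each model.
times = {
    "FLUX.1-dev": (0.1649, 0.1591),
    "FLUX.2-dev": (0.5209, 0.4614),
    "Qwen-Image": (0.2831, 0.2709),
    "Qwen-Image-Edit-2511": (0.6934, 0.6395),
}

for model, (main_s, pr_s) in times.items():
    # Relative improvement of the PR over main, as a percentage.
    speedup = (main_s - pr_s) / main_s * 100
    print(f"{model}: {main_s:.4f}s -> {pr_s:.4f}s ({speedup:.1f}% faster per step)")
```

The largest gain is on FLUX.2-dev (about 11% faster per step); the others improve by roughly 3-8%.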
Modifications
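The backend gate patched in the baseline diff above can be sketched as follows. This is a hedged illustration, not the actual sglang code: the real `_should_use_upstream_flash_attention` lives in python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py, and only its signature and first check appear in this PR's excerpt; the K/V shape-agreement check here is a hypothetical placeholder for whatever additional conditions the real function applies.

```python
def should_use_upstream_flash_attention(
    upstream_available: bool,
    upstream_heads_ok: bool,
    k_shape: tuple[int, ...],
    v_shape: tuple[int, ...],
) -> bool:
    """Decide whether a call can be served by the official flash-attention
    package rather than the sgl-kernel fallback (sketch only)."""
    # First check, as shown in the diff excerpt: the upstream package must
    # import successfully and support the configured number of heads.
    if not upstream_available or not upstream_heads_ok:
        return False
    # Hypothetical extra condition for illustration: K and V layouts agree.
    return k_shape == v_shape
```

Prepending `return False` to this function, as the baseline patch does, makes every call fall through to the sgl-kernel path, which is what allows an apples-to-apples comparison between the two backends above.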
Accuracy Tests
Benchmarking and Profiling
Checklist