[diffusion][hot fix] fix torch.compile graph break caused by torch._dynamo.disable by yingluosanqian · Pull Request #18336 · sgl-project/sglang

yingluosanqian · 2026-02-06T03:46:13Z

Motivation

PR14717 decorated the JIT kernel with @torch._dynamo.disable, which introduces a torch.compile graph break. The break increases CPU overhead and, in certain models (e.g. QwenImage: 263 ms -> 287 ms per step), leads to GPU bubbles, as shown below.

This PR fixes the issue by registering the kernel as a torch.library.custom_op, following the approach described in the PyTorch documentation.

after fix (this pr):

PR14717 did not rerun the e2e benchmarks after its latest commits, so the regression was not caught in time. After the fix, this PR reran the e2e benchmarks: QwenImage had shown a ~9% slowdown (fixed in this PR), while no regressions were observed in the other models (Wan, Hunyuan).
- QwenImage denoise stage: ~1.9% speedup;
- Wan2.2-T2V denoise stage: ~2.0% speedup;
- Wan2.1-T2V denoise stage: ~2.8% speedup;
- hunyuan denoise stage: <1% speedup;
- (detailed in Benchmarking and Profiling)
The ~2% speedup is expected, since the time saved by the fused kernel accounts for roughly the same proportion of a single layer's execution time.
- Taking QwenImage as an example, a single layer takes approximately 2200 µs and kernel fusion saves about 50–60 µs, which is roughly 2–3% of the layer execution time.
- The previously observed ~10% speedup came from two factors: faster GPU kernels and the elimination of GPU bubbles. The bubbles now appear to have been removed by other mechanisms as well (e.g., torch.compile or other CPU-side optimizations), leaving only the kernel-level speedup.
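The QwenImage estimate above can be checked with a few lines of arithmetic. This sketch uses only the approximate figures quoted in the bullet (2200 µs per layer, 50 to 60 µs saved); the variable and function names are illustrative.

```python
layer_time_us = 2200.0   # approximate single-layer time quoted above
saved_us = (50.0, 60.0)  # approximate time saved by kernel fusion


def share_percent(saved: float, total: float) -> float:
    """Time saved as a percentage of the total layer time."""
    return 100.0 * saved / total


low, high = (share_percent(s, layer_time_us) for s in saved_us)
print(f"fusion saves {low:.1f}% to {high:.1f}% of one layer")
# prints "fusion saves 2.3% to 2.7% of one layer"
```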

Modifications

Accuracy Tests

Compare the images generated by this PR and by 669a9bd (before PR14717).

Qwen

669a9bd:

this pr:

hunyuan video

669a9bd:

this pr:

Wan-AI/Wan2.2-T2V-A14B-Diffusers

669a9bd:

this pr:

Benchmarking and Profiling

Benchmark on H200

QwenImage

Command

sglang generate --model-path=Qwen/Qwen-Image-2512 --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" '--negative-prompt= ' --width=1024 --height=1024 --num-inference-steps=50 --guidance-scale=4.0 --seed=42 --save-output --enable-torch-compile --warmup --dit-cpu-offload false --text-encoder-cpu-offload false

before regression (669a9bd):

[02-06 02:04:27] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 50/50 [00:13<00:00,  3.79it/s]
[02-06 02:04:40] [DenoisingStage] average time per step: 0.2637 seconds
[02-06 02:04:40] [DenoisingStage] finished in 13.1895 seconds

after regression (4739f2e):

[02-06 02:02:08] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 50/50 [00:14<00:00,  3.48it/s]
[02-06 02:02:23] [DenoisingStage] average time per step: 0.2876 seconds
[02-06 02:02:23] [DenoisingStage] finished in 14.3843 seconds

after fix (this pr):

[02-06 01:20:15] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 50/50 [00:12<00:00,  3.87it/s]
[02-06 01:20:27] [DenoisingStage] average time per step: 0.2582 seconds
[02-06 01:20:27] [DenoisingStage] finished in 12.9134 seconds

Wan2.2

Command

sglang generate --model-path=Wan-AI/Wan2.2-T2V-A14B-Diffusers --log-level=info --prompt="A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." --negative-prompt=" " --720p --num-inference-steps=40 --num-frames=81 --guidance-scale=5.0 --seed=42 --save-output  --num-gpus=8 --enable-cfg-parallel --ulysses-degree=4 --dit-layerwise-offload true --dit-cpu-offload false --vae-cpu-offload false --text-encoder-cpu-offload true --warmup --enable-torch-compile true

before regression (669a9bd):

[02-06 01:44:49] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [03:19<00:00,  4.98s/it]
[02-06 01:48:09] [DenoisingStage] average time per step: 4.9828 seconds
[02-06 01:48:09] [DenoisingStage] finished in 199.3217 seconds

after regression (4739f2e, no regression)

[02-06 01:56:30] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [03:15<00:00,  4.89s/it]
[02-06 01:59:45] [DenoisingStage] average time per step: 4.8867 seconds
[02-06 01:59:45] [DenoisingStage] finished in 195.4734 seconds

after fix (this pr):

[02-06 02:45:31] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 40/40 [03:15<00:00,  4.89s/it]
[02-06 02:48:47] [DenoisingStage] average time per step: 4.8879 seconds
[02-06 02:48:47] [DenoisingStage] finished in 195.5214 seconds

Wan-AI/Wan2.1-T2V-1.3B-Diffusers

Command

sglang generate --model-path=Wan-AI/Wan2.1-T2V-1.3B-Diffusers --log-level=info --prompt="A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." --negative-prompt=" " --720p --num-inference-steps=40 --num-frames=81 --guidance-scale=5.0 --seed=42 --save-output  --num-gpus=8 --enable-cfg-parallel --ulysses-degree=4 --dit-layerwise-offload true --dit-cpu-offload false --vae-cpu-offload false --text-encoder-cpu-offload true --warmup --enable-torch-compile true

before regression (669a9bd):

[02-06 02:25:02] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [00:43<00:00,  1.08s/it]
[02-06 02:25:45] [DenoisingStage] average time per step: 1.0801 seconds
[02-06 02:25:45] [DenoisingStage] finished in 43.2059 seconds

after regression (4739f2e, no regression)

[02-06 02:29:26] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [00:42<00:00,  1.05s/it]
[02-06 02:30:08] [DenoisingStage] average time per step: 1.0508 seconds
[02-06 02:30:08] [DenoisingStage] finished in 42.0333 seconds

after fix (this pr):

[02-06 02:33:34] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [00:42<00:00,  1.05s/it]
[02-06 02:34:16] [DenoisingStage] average time per step: 1.0533 seconds
[02-06 02:34:16] [DenoisingStage] finished in 42.1340 seconds

hunyuan

Command

sglang generate --model-path hunyuanvideo-community/HunyuanVideo --text-encoder-cpu-offload --pin-cpu-memory --prompt "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." --save-output --num-frames 65 --width 848 --height 480 --num-inference-steps 30 --seed=42 --save-output --warmup --enable-torch-compile true --dit-layerwise-offload true --dit-cpu-offload false --vae-cpu-offload false --text-encoder-cpu-offload true --warmup --enable-torch-compile true

before regression (669a9bd):

[02-06 03:33:08] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 30/30 [00:52<00:00,  1.76s/it]
[02-06 03:34:01] [DenoisingStage] average time per step: 1.7645 seconds
[02-06 03:34:01] [DenoisingStage] finished in 52.9382 seconds

after regression (4739f2e, no regression)

[02-06 03:36:20] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 30/30 [00:52<00:00,  1.76s/it]
[02-06 03:37:13] [DenoisingStage] average time per step: 1.7637 seconds
[02-06 03:37:13] [DenoisingStage] finished in 52.9137 seconds

after fix (this pr):

[02-06 03:30:12] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 30/30 [00:52<00:00,  1.76s/it]
[02-06 03:31:05] [DenoisingStage] average time per step: 1.7611 seconds
[02-06 03:31:05] [DenoisingStage] finished in 52.8349 seconds
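To make the logs above easier to compare, the following sketch computes the relative change in average time per step against the 669a9bd baseline. The numbers are copied from the logs; the dictionary layout and function names are illustrative. Each configuration was run once here, so the percentages may differ slightly from the rounded figures summarized in the Motivation section.

```python
# Average seconds per denoising step, copied from the logs above.
# Tuple order: (669a9bd baseline, 4739f2e, this PR).
STEP_TIMES = {
    "QwenImage": (0.2637, 0.2876, 0.2582),
    "Wan2.2-T2V": (4.9828, 4.8867, 4.8879),
    "Wan2.1-T2V": (1.0801, 1.0508, 1.0533),
    "HunyuanVideo": (1.7645, 1.7637, 1.7611),
}


def percent_change(new: float, old: float) -> float:
    """Relative change of new vs. old in percent; negative means faster."""
    return 100.0 * (new - old) / old


def summarize(times):
    """Map each model to (change at 4739f2e, change with this PR) vs. baseline."""
    return {
        model: (percent_change(regressed, base), percent_change(fixed, base))
        for model, (base, regressed, fixed) in times.items()
    }


summary = summarize(STEP_TIMES)
for model, (reg, fix) in summary.items():
    print(f"{model:<14} 4739f2e: {reg:+.2f}%   this PR: {fix:+.2f}%")
```

Only QwenImage shows a positive (slower) value at 4739f2e, which matches the observation that the graph break hurt that model and not the others.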

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

yingluosanqian · 2026-02-06T03:59:49Z

/tag-and-rerun-ci

BBuf

LGTM

BBuf · 2026-02-06T05:34:49Z

Diffusion ci has passed.

mickqian · 2026-02-06T06:45:39Z

only diffusion is affected, bypassing

mickqian · 2026-02-06T06:47:57Z

brilliant!

We should be more careful about performance improvements and regressions, and build a more mature system and tooling to track them automatically.

For diffusion models, the thresholds in the PR tests are loose (to make sure PR merges are not blocked), so this is an important issue that deserves serious attention.

cc @dougyster

custom op to avoid torch graph break

4c5e25b

yingluosanqian requested review from BBuf and mickqian as code owners February 6, 2026 03:46

code format

9b54503

yingluosanqian added run-ci diffusion SGLang Diffusion sgl-kernel labels Feb 6, 2026

yingluosanqian added 2 commits February 6, 2026 12:01

Merge branch 'main' into diffusion_perf_fix

1307cdb

fix

99c15cb

yingluosanqian requested review from DarkSharpness and yhyang201 as code owners February 6, 2026 04:46

BBuf approved these changes Feb 6, 2026

mickqian merged commit f798ab9 into main Feb 6, 2026
94 of 106 checks passed

mickqian deleted the diffusion_perf_fix branch February 6, 2026 06:48
