
[diffusion][hot fix] fix torch.compile graph break caused by torch._dynamo.disable #18336

Merged
mickqian merged 4 commits into main from diffusion_perf_fix
Feb 6, 2026

@yingluosanqian
Collaborator

@yingluosanqian yingluosanqian commented Feb 6, 2026

Motivation

PR14717 decorated the JIT kernel with @torch._dynamo.disable, which introduces a torch.compile graph break. The graph break increases CPU overhead and, in certain models, leaves bubbles on the GPU timeline (e.g. QwenImage goes from 263 ms to 287 ms per step), as shown in the first image below.

This PR fixes the issue by registering the kernel as a torch.library.custom_op, following the approach described in the PyTorch documentation.

image

after fix (this pr):
image

  • PR14717 did not rerun e2e benchmarks after its latest commits, so the regression was not caught in time. This PR reran the e2e benchmarks: at 4739f2e, QwenImage showed a ~9% slowdown (fixed in this PR), while no regressions were observed in other models (wan, hunyuan). Relative to 669a9bd (before PR14717), the results after the fix are:
    • QwenImage denoise stage: ~1.9% speedup;
    • Wan2.2-T2V denoise stage: ~2.0% speedup;
    • Wan2.1-T2V denoise stage: ~2.8% speedup;
    • hunyuan denoise stage: <1% speedup;
    • (detailed in Benchmarking and Profiling)
  • The ~2% speedup is expected: the time saved by the fused kernel is roughly the same fraction of a single layer’s execution time.
    • Taking QwenImage as an example, a single layer takes approximately 2200 µs and kernel fusion saves about 50–60 µs, which is roughly 2–3% of the layer execution time.
    • The previously observed ~10% speedup came from two factors: faster GPU kernels and the elimination of GPU bubbles. The bubbles now appear to have been removed by other mechanisms as well (e.g., torch.compile or other CPU-side optimizations), leaving only the kernel-level speedup.
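
The proportion argument above can be checked with a few lines of arithmetic. This is only a sketch using the approximate figures quoted in the list (about 2200 µs per layer, 50 to 60 µs saved by fusion); the variable names are illustrative.

```python
# Approximate figures quoted above for QwenImage (not re-measured here).
layer_time_us = 2200.0
saved_us_low, saved_us_high = 50.0, 60.0

# Fraction of a single layer's time removed by the fused kernel.
frac_low = saved_us_low / layer_time_us
frac_high = saved_us_high / layer_time_us

print(f"expected kernel-level speedup: {frac_low:.1%} to {frac_high:.1%}")
```

If every layer shrinks by about the same fraction and the layers dominate a denoising step, the per-step time should shrink by a similar fraction.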

Modifications

Accuracy Tests

Compare the generated images between this PR and 669a9bd (before PR14717).

Qwen

669a9bd:

image

this pr:

image

hunyuan video

669a9bd:

image

this pr:

image

Wan-AI/Wan2.2-T2V-A14B-Diffusers

669a9bd:

image

this pr:

image

Benchmarking and Profiling

Benchmark on H200

QwenImage

Command

sglang generate --model-path=Qwen/Qwen-Image-2512 --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" '--negative-prompt= ' --width=1024 --height=1024 --num-inference-steps=50 --guidance-scale=4.0 --seed=42 --save-output --enable-torch-compile --warmup --dit-cpu-offload false --text-encoder-cpu-offload false

before regression (669a9bd):

[02-06 02:04:27] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 50/50 [00:13<00:00,  3.79it/s]
[02-06 02:04:40] [DenoisingStage] average time per step: 0.2637 seconds
[02-06 02:04:40] [DenoisingStage] finished in 13.1895 seconds

after regression (4739f2e):

[02-06 02:02:08] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 50/50 [00:14<00:00,  3.48it/s]
[02-06 02:02:23] [DenoisingStage] average time per step: 0.2876 seconds
[02-06 02:02:23] [DenoisingStage] finished in 14.3843 seconds

after fix (this pr):

[02-06 01:20:15] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 50/50 [00:12<00:00,  3.87it/s]
[02-06 01:20:27] [DenoisingStage] average time per step: 0.2582 seconds
[02-06 01:20:27] [DenoisingStage] finished in 12.9134 seconds

Wan2.2

Command

sglang generate --model-path=Wan-AI/Wan2.2-T2V-A14B-Diffusers --log-level=info --prompt="A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." --negative-prompt=" " --720p --num-inference-steps=40 --num-frames=81 --guidance-scale=5.0 --seed=42 --save-output  --num-gpus=8 --enable-cfg-parallel --ulysses-degree=4 --dit-layerwise-offload true --dit-cpu-offload false --vae-cpu-offload false --text-encoder-cpu-offload true --warmup --enable-torch-compile true

before regression (669a9bd):

[02-06 01:44:49] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [03:19<00:00,  4.98s/it]
[02-06 01:48:09] [DenoisingStage] average time per step: 4.9828 seconds
[02-06 01:48:09] [DenoisingStage] finished in 199.3217 seconds

after regression (4739f2e, no regression):

[02-06 01:56:30] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [03:15<00:00,  4.89s/it]
[02-06 01:59:45] [DenoisingStage] average time per step: 4.8867 seconds
[02-06 01:59:45] [DenoisingStage] finished in 195.4734 seconds

after fix (this pr):

[02-06 02:45:31] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 40/40 [03:15<00:00,  4.89s/it]
[02-06 02:48:47] [DenoisingStage] average time per step: 4.8879 seconds
[02-06 02:48:47] [DenoisingStage] finished in 195.5214 seconds

Wan-AI/Wan2.1-T2V-1.3B-Diffusers

Command

sglang generate --model-path=Wan-AI/Wan2.1-T2V-1.3B-Diffusers --log-level=info --prompt="A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." --negative-prompt=" " --720p --num-inference-steps=40 --num-frames=81 --guidance-scale=5.0 --seed=42 --save-output  --num-gpus=8 --enable-cfg-parallel --ulysses-degree=4 --dit-layerwise-offload true --dit-cpu-offload false --vae-cpu-offload false --text-encoder-cpu-offload true --warmup --enable-torch-compile true

before regression (669a9bd):

[02-06 02:25:02] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [00:43<00:00,  1.08s/it]
[02-06 02:25:45] [DenoisingStage] average time per step: 1.0801 seconds
[02-06 02:25:45] [DenoisingStage] finished in 43.2059 seconds

after regression (4739f2e, no regression):

[02-06 02:29:26] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [00:42<00:00,  1.05s/it]
[02-06 02:30:08] [DenoisingStage] average time per step: 1.0508 seconds
[02-06 02:30:08] [DenoisingStage] finished in 42.0333 seconds

after fix (this pr):

[02-06 02:33:34] [DenoisingStage] started...
100%|██████████████████████████████████████████████████████████████| 40/40 [00:42<00:00,  1.05s/it]
[02-06 02:34:16] [DenoisingStage] average time per step: 1.0533 seconds
[02-06 02:34:16] [DenoisingStage] finished in 42.1340 seconds

hunyuan

Command

sglang generate --model-path hunyuanvideo-community/HunyuanVideo --text-encoder-cpu-offload --pin-cpu-memory --prompt "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." --save-output --num-frames 65 --width 848 --height 480 --num-inference-steps 30 --seed=42 --save-output --warmup --enable-torch-compile true --dit-layerwise-offload true --dit-cpu-offload false --vae-cpu-offload false --text-encoder-cpu-offload true --warmup --enable-torch-compile true

before regression (669a9bd):

[02-06 03:33:08] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 30/30 [00:52<00:00,  1.76s/it]
[02-06 03:34:01] [DenoisingStage] average time per step: 1.7645 seconds
[02-06 03:34:01] [DenoisingStage] finished in 52.9382 seconds

after regression (4739f2e, no regression):

[02-06 03:36:20] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 30/30 [00:52<00:00,  1.76s/it]
[02-06 03:37:13] [DenoisingStage] average time per step: 1.7637 seconds
[02-06 03:37:13] [DenoisingStage] finished in 52.9137 seconds

after fix (this pr):

[02-06 03:30:12] [DenoisingStage] started...
100%|█████████████████████████████████████████████████████████████████████| 30/30 [00:52<00:00,  1.76s/it]
[02-06 03:31:05] [DenoisingStage] average time per step: 1.7611 seconds
[02-06 03:31:05] [DenoisingStage] finished in 52.8349 seconds
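
The relative changes can be recomputed directly from the `average time per step` log lines above. The following is a small sketch, not part of sglang: the helper names `avg_step_time` and `relative_change` are hypothetical, and the sample lines are the QwenImage logs copied from this section.

```python
import re

STEP_RE = re.compile(r"average time per step: ([0-9.]+) seconds")


def avg_step_time(log_line: str) -> float:
    """Extract the average per-step time (seconds) from a DenoisingStage log line."""
    match = STEP_RE.search(log_line)
    if match is None:
        raise ValueError(f"no step time found in: {log_line!r}")
    return float(match.group(1))


def relative_change(baseline: float, candidate: float) -> float:
    """Signed change of candidate vs. baseline; negative means faster."""
    return (candidate - baseline) / baseline


before = avg_step_time("[02-06 02:04:40] [DenoisingStage] average time per step: 0.2637 seconds")
regressed = avg_step_time("[02-06 02:02:23] [DenoisingStage] average time per step: 0.2876 seconds")
fixed = avg_step_time("[02-06 01:20:27] [DenoisingStage] average time per step: 0.2582 seconds")

print(f"4739f2e vs 669a9bd: {relative_change(before, regressed):+.1%}")
print(f"this PR vs 669a9bd: {relative_change(before, fixed):+.1%}")
```

The same two helpers work for the Wan and hunyuan logs; single runs carry some noise, so small differences between commits should be read with care.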

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@yingluosanqian
Collaborator Author

/tag-and-rerun-ci

Collaborator

@BBuf BBuf left a comment

LGTM

@BBuf
Collaborator

BBuf commented Feb 6, 2026

Diffusion ci has passed.

image

@mickqian
Collaborator

mickqian commented Feb 6, 2026

only diffusion is affected, bypassing

@mickqian
Collaborator

mickqian commented Feb 6, 2026

brilliant!

We should be more careful about performance improvements and regressions, and build a more mature system and tooling to track them automatically.

For diffusion models, the thresholds in PR-test are loose (to make sure PR merges are not blocked), so this is an important issue that deserves serious attention.

cc @dougyster

@mickqian mickqian merged commit f798ab9 into main Feb 6, 2026
94 of 106 checks passed
@mickqian mickqian deleted the diffusion_perf_fix branch February 6, 2026 06:48
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 9, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026