[Diffusion] test vide coding ability by BBuf · Pull Request #19647 · sgl-project/sglang

BBuf · 2026-03-02T04:13:15Z

Motivation

Created with #19540 and Claude Code (thanks, @Lyken17).

run.sh

sglang generate \
  --model-path=Qwen/Qwen-Image-2512 \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  '--negative-prompt= ' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --seed=42 \
  --save-output \
  --enable-torch-compile \
  --warmup \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false

sglang generate \
  --model-path=black-forest-labs/FLUX.1-dev \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --seed=42 \
  --save-output \
  --warmup \
  --enable-torch-compile

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --width=1024 \
  --height=1024 \
  --dit-layerwise-offload false \
  --enable-torch-compile \
  --warmup \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload true \
  --vae-cpu-offload false

sglang generate \
  --model-path=Tongyi-MAI/Z-Image-Turbo \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=9 \
  --guidance-scale=0.0 \
  --seed=42 \
  --save-output \
  --enable-torch-compile \
  --warmup \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false

sglang generate \
  --model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --log-level info \
  --warmup \
  --dit-layerwise-offload false \
  --dit-cpu-offload false \
  --vae-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --prompt "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot." \
  --negative-prompt "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards" \
  --image-path /workspace/gen_benchmark/figs/astronaut.jpg \
  --num-frames 81 \
  --720p \
  --num-inference-steps 50 \
  --guidance-scale 5.0 \
  --seed 42 \
  --save-output

bash ../run.sh >& ../baseline.txt

bash ../run.sh >& ../codex_gemini.txt

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-03-02T04:13:39Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant optimizations and refactorings across several multimodal generation models. The core changes involve enhancing layer normalization with flexible scaling, leveraging PyTorch's compilation capabilities for critical forward passes, and implementing CUDA-optimized rotary embedding applications. These updates are designed to improve the overall performance and efficiency of the models by utilizing fused kernels and better memory management.

Highlights

Enhanced Layer Normalization and Scaling: Modified _ScaleResidualNormScaleShift and _NormScaleShift classes to accept a scale_constant parameter, allowing for more flexible scaling in fused kernels. The CUDA implementations now fall back to native if scale_constant is not 1.0, as the CuTe DSL kernel currently only supports the default value.
PyTorch Compilation for Performance: Introduced @torch.compile decorators to forward methods in FluxAttention, RotaryEmbedding, QwenImageAttention, and ZImageAttention across various models, aiming to optimize execution speed and reduce overhead.
Optimized Rotary Embeddings: Implemented a CUDA-optimized path for applying rotary embeddings using apply_flashinfer_rope_qk_inplace in hunyuanvideo.py, wanvideo.py, and zimage.py. This path is conditionally enabled when on CUDA and query/key tensors are contiguous, falling back to the standard implementation otherwise.
Refactored Residual and Normalization Blocks: Replaced manual residual addition and gating with MulAdd in FluxBlock and integrated ScaleResidualLayerNormScaleShift and LayerNormScaleShift in FluxBlock, Flux2Block, and ZImageBlock for more efficient and fused normalization and scaling operations, especially for modulated paths.
Improved Tensor Memory Layout: Ensured that query, key, and value tensors maintain a contiguous memory layout after reshaping or unflattening in qwen_image.py and zimage.py, which is crucial for performance with CUDA kernels.
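
The two fallback rules in the highlights above can be restated as plain predicates. This is only a sketch of the described conditions, not the PR's code: `TensorInfo`, `use_fused_norm_kernel`, and `use_inplace_rope` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    """Hypothetical stand-in for the one tensor property the check reads."""
    is_contiguous: bool


def use_fused_norm_kernel(on_cuda: bool, scale_constant: float) -> bool:
    # Per the summary, the fused CUDA kernel only supports the default
    # scale_constant of 1.0; any other value falls back to the native path.
    return on_cuda and scale_constant == 1.0


def use_inplace_rope(on_cuda: bool, q: TensorInfo, k: TensorInfo) -> bool:
    # Per the summary, the in-place rotary path needs CUDA and contiguous
    # query/key tensors; otherwise the standard implementation is used.
    return on_cuda and q.is_contiguous and k.is_contiguous


# Usage: a scale_constant of 0.0 would take the native path.
fused_default = use_fused_norm_kernel(on_cuda=True, scale_constant=1.0)
fused_zero = use_fused_norm_kernel(on_cuda=True, scale_constant=0.0)
rope_fast = use_inplace_rope(True, TensorInfo(True), TensorInfo(True))
rope_slow = use_inplace_rope(True, TensorInfo(True), TensorInfo(False))
```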

Changelog

python/sglang/multimodal_gen/runtime/layers/layernorm.py
- Added scale_constant parameter to _ScaleResidualNormScaleShift and _NormScaleShift classes.
- Updated forward_cuda methods to use scale_constant and fall back to native if not 1.0.
- Modified docstrings to reflect the new scaling formula.
python/sglang/multimodal_gen/runtime/models/dits/flux.py
- Imported MulAdd and ScaleResidualLayerNormScaleShift.
- Added _is_cuda flag.
- Applied @torch.compile to FluxAttention and RotaryEmbedding forward methods.
- Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format.
- Replaced manual residual and gate operations with MulAdd in FluxBlock.
- Replaced LayerNorm with ScaleResidualLayerNormScaleShift for norm2 and norm2_context.
- Adjusted scale_mlp for Nunchaku structure when using ScaleResidualLayerNormScaleShift.
python/sglang/multimodal_gen/runtime/models/dits/flux_2.py
- Imported LayerNormScaleShift and ScaleResidualLayerNormScaleShift.
- Modified Flux2Attention to use apply_qk_norm for query and key.
- Replaced nn.LayerNorm with LayerNormScaleShift in Flux2Attention and Flux2Block.
- Refactored normalization and scaling in Flux2Block to use LayerNormScaleShift and ScaleResidualLayerNormScaleShift.
python/sglang/multimodal_gen/runtime/models/dits/hunyuanvideo.py
- Added _is_cuda flag.
- Implemented CUDA-optimized rotary embedding application using apply_flashinfer_rope_qk_inplace.
- Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format for the optimized path.
python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py
- Applied @torch.compile to QwenImageAttention forward method.
- Ensured query, key, and value tensors are contiguous after unflattening.
python/sglang/multimodal_gen/runtime/models/dits/wanvideo.py
- Implemented CUDA-optimized rotary embedding application using apply_flashinfer_rope_qk_inplace.
- Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format for the optimized path.
- Changed hidden_states.flatten(2) to hidden_states.flatten(2, 3).
python/sglang/multimodal_gen/runtime/models/dits/zimage.py
- Imported RMSNormScaleShift.
- Applied @torch.compile to ZImageAttention forward method.
- Ensured query, key, and value tensors are contiguous after reshaping.
- Implemented CUDA-optimized rotary embedding application using apply_flashinfer_rope_qk_inplace.
- Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format for the optimized path.
- Changed hidden_states.flatten(2) to hidden_states.flatten(2, 3).
- Introduced RMSNormScaleShift for fused attention and FFN norms in ZImageBlock when modulation is active.
- Added zero_shift buffer for RMSNormScaleShift with scale_constant=0.0.

Activity

The changes in this pull request are based on a previous pull request ([Diffusion] diffusion profile and opt skills #19540) and potentially AI-assisted code generation from Claude, as indicated in the motivation section.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a series of optimizations and refactorings across various DiT models and their layers. Key changes include:

Refactoring layernorm and model implementations to use fused kernels for operations like (residual + gate * x) followed by normalization and modulation. This improves performance and code clarity.
Adding @torch.compile to several forward methods for just-in-time compilation, which should speed up execution.
Utilizing optimized kernels like flashinfer for rotary position embeddings where applicable, with added checks for tensor contiguity for correctness.
Improving memory layout by explicitly making tensors contiguous after reshaping operations.

My review identifies a potential bug in zimage.py where a change in logic for adaptive layer norm scaling might have been unintentionally introduced. Other changes appear to be solid improvements.
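
As a rough illustration of what the fused path computes, the unfused sequence "(residual + gate * x), then normalization, then modulation" can be written out in plain Python over lists of floats. This is a reference sketch under assumptions: the function names are hypothetical, the norm is an unweighted LayerNorm, and the modulation is assumed to be `norm * (1 + scale) + shift`.

```python
import math


def layer_norm(x, eps=1e-6):
    """Unweighted LayerNorm over a list of floats."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def scale_residual_norm_scale_shift(residual, x, gate, scale, shift):
    """Unfused reference: returns (new_residual, modulated).

    new_residual = residual + gate * x
    modulated    = layer_norm(new_residual) * (1 + scale) + shift
    """
    new_residual = [r + g * v for r, g, v in zip(residual, gate, x)]
    normed = layer_norm(new_residual)
    modulated = [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]
    return new_residual, modulated


residual = [1.0, 2.0, 3.0, 4.0]
x = [0.5, -0.5, 1.0, 2.0]
zeros = [0.0] * 4

# With a zero gate the residual passes through unchanged; with zero
# scale and shift the output is just the normalized residual.
res_out, mod_out = scale_residual_norm_scale_shift(residual, x, zeros, zeros, zeros)
```

A fused kernel would produce both outputs in one pass instead of materializing the intermediate tensors separately.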

gemini-code-assist · 2026-03-02T04:15:54Z

python/sglang/multimodal_gen/runtime/models/dits/zimage.py

+            gate_msa = gate_msa.clone().tanh()
+            gate_mlp = gate_mlp.clone().tanh()
+            scale_msa = scale_msa.clone()
+            scale_mlp = scale_mlp.clone()


The 1.0 + operation on scale_msa and scale_mlp has been removed. The RMSNormScaleShift is initialized with scale_constant=0.0, which results in the computation norm(x) * scale. The original logic was norm(x) * (1.0 + scale). This change in logic is likely a bug, as similar refactorings in other files in this PR preserve the (1 + scale) logic. To restore the original behavior, you can add 1.0 + back to the scales.

Suggested change

-            gate_msa = gate_msa.clone().tanh()
-            gate_mlp = gate_mlp.clone().tanh()
-            scale_msa = scale_msa.clone()
-            scale_mlp = scale_mlp.clone()
+            gate_msa = gate_msa.clone().tanh()
+            gate_mlp = gate_mlp.clone().tanh()
+            scale_msa = 1.0 + scale_msa.clone()
+            scale_mlp = 1.0 + scale_mlp.clone()
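
The reviewer's point can be checked numerically with a small pure-Python sketch. It assumes the layer computes `norm(x) * (scale_constant + scale) + shift`, which is inferred from the comment rather than taken from the PR's code; `rms_norm` and `norm_scale_shift` are hypothetical helper names.

```python
import math


def rms_norm(x, eps=1e-6):
    """Unweighted RMS normalization of a list of floats."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def norm_scale_shift(x, scale, shift, scale_constant=1.0):
    """Assumed formula: norm(x) * (scale_constant + scale) + shift."""
    return [n * (scale_constant + s) + b
            for n, s, b in zip(rms_norm(x), scale, shift)]


x = [1.0, -2.0, 3.0, 0.5]
scale = [0.1, -0.2, 0.3, 0.0]
zero_shift = [0.0] * len(x)

# Original behavior: norm(x) * (1 + scale).
original = norm_scale_shift(x, scale, zero_shift, scale_constant=1.0)
# As submitted: scale_constant=0.0 with the raw scale gives norm(x) * scale.
as_submitted = norm_scale_shift(x, scale, zero_shift, scale_constant=0.0)
# As suggested: adding 1.0 to the scale first restores norm(x) * (1 + scale).
as_suggested = norm_scale_shift(x, [1.0 + s for s in scale], zero_shift,
                                scale_constant=0.0)

matches_original = all(math.isclose(a, b) for a, b in zip(original, as_suggested))
differs_as_submitted = any(not math.isclose(a, b)
                           for a, b in zip(original, as_submitted))
```

Under this assumed formula, the suggested change reproduces the original output, while the submitted version does not.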

ud

bffc072

github-actions bot added the diffusion (SGLang Diffusion) label Mar 2, 2026

BBuf changed the title ~~ud~~ [Diffusion] test vide coding ablity Mar 2, 2026

BBuf changed the title ~~[Diffusion] test vide coding ablity~~ [Diffusion] test vide coding ability Mar 2, 2026

gemini-code-assist bot reviewed Mar 2, 2026

BBuf closed this Mar 2, 2026

BBuf deleted the cluade_code_opt branch March 2, 2026 05:09

BBuf wants to merge 1 commit into diffusion_skills from cluade_code_opt