[Diffusion] test vide coding ability #19647

Closed
BBuf wants to merge 1 commit into diffusion_skills from cluade_code_opt

@BBuf (Collaborator) commented Mar 2, 2026

Motivation

Created by #19540 and Claude Code (thanks, @Lyken17).

run.sh

sglang generate \
  --model-path=Qwen/Qwen-Image-2512 \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  '--negative-prompt= ' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --seed=42 \
  --save-output \
  --enable-torch-compile \
  --warmup \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false

sglang generate \
  --model-path=black-forest-labs/FLUX.1-dev \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets, highly detailed, 8k" \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=50 \
  --guidance-scale=4.0 \
  --seed=42 \
  --save-output \
  --warmup \
  --enable-torch-compile

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --width=1024 \
  --height=1024 \
  --dit-layerwise-offload false \
  --enable-torch-compile \
  --warmup \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload true \
  --vae-cpu-offload false

sglang generate \
  --model-path=Tongyi-MAI/Z-Image-Turbo \
  --log-level=info \
  --prompt='A fantasy landscape with mountains and a river, detailed, vibrant colors' \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=9 \
  --guidance-scale=0.0 \
  --seed=42 \
  --save-output \
  --enable-torch-compile \
  --warmup \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false

sglang generate \
  --model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --log-level info \
  --warmup \
  --dit-layerwise-offload false \
  --dit-cpu-offload false \
  --vae-cpu-offload false \
  --text-encoder-cpu-offload false \
  --enable-torch-compile \
  --prompt "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot." \
  --negative-prompt "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards" \
  --image-path /workspace/gen_benchmark/figs/astronaut.jpg \
  --num-frames 81 \
  --720p \
  --num-inference-steps 50 \
  --guidance-scale 5.0 \
  --seed 42 \
  --save-output

bash ../run.sh >& ../baseline.txt

bash ../run.sh >& ../codex_gemini.txt

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

github-actions bot added the "diffusion" (SGLang Diffusion) label Mar 2, 2026
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes and refactors several multimodal generation models (Flux, Flux 2, HunyuanVideo, Qwen-Image, Wan, and Z-Image). The core changes add a configurable scale_constant to the fused layer-normalization layers, apply torch.compile to performance-critical forward passes, and add a CUDA-optimized path for applying rotary embeddings. The aim is to improve performance through fused kernels and contiguous tensor layouts.

Highlights

  • Enhanced Layer Normalization and Scaling: Modified _ScaleResidualNormScaleShift and _NormScaleShift classes to accept a scale_constant parameter, allowing for more flexible scaling in fused kernels. The CUDA implementations now fall back to native if scale_constant is not 1.0, as the CuTe DSL kernel currently only supports the default value.
  • PyTorch Compilation for Performance: Introduced @torch.compile decorators to forward methods in FluxAttention, RotaryEmbedding, QwenImageAttention, and ZImageAttention across various models, aiming to optimize execution speed and reduce overhead.
  • Optimized Rotary Embeddings: Implemented a CUDA-optimized path for applying rotary embeddings using apply_flashinfer_rope_qk_inplace in hunyuanvideo.py, wanvideo.py, and zimage.py. This path is conditionally enabled when on CUDA and query/key tensors are contiguous, falling back to the standard implementation otherwise.
  • Refactored Residual and Normalization Blocks: Replaced manual residual addition and gating with MulAdd in FluxBlock and integrated ScaleResidualLayerNormScaleShift and LayerNormScaleShift in FluxBlock, Flux2Block, and ZImageBlock for more efficient and fused normalization and scaling operations, especially for modulated paths.
  • Improved Tensor Memory Layout: Ensured that query, key, and value tensors maintain a contiguous memory layout after reshaping or unflattening in qwen_image.py and zimage.py, which is crucial for performance with CUDA kernels.
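The fallback described in the first highlight can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `NormScaleShiftSketch`, the `_fused_kernel` placeholder, the use of RMS normalization, and the formula `norm(x) * (scale_constant + scale) + shift` (inferred from the review discussion rather than taken from the sglang source). Plain Python lists stand in for tensors.

```python
import math


class NormScaleShiftSketch:
    """Hypothetical stand-in for a fused norm + scale/shift layer (not the sglang class)."""

    def __init__(self, scale_constant: float = 1.0, eps: float = 1e-6):
        self.scale_constant = scale_constant
        self.eps = eps
        self.last_path = None  # records which path ran, for illustration only

    def _rms(self, x):
        # Root mean square of the input vector, with eps for numerical stability.
        return math.sqrt(sum(v * v for v in x) / len(x) + self.eps)

    def forward_native(self, x, scale, shift):
        # General path: norm(x) * (scale_constant + scale) + shift.
        rms = self._rms(x)
        self.last_path = "native"
        return [v / rms * (self.scale_constant + s) + b
                for v, s, b in zip(x, scale, shift)]

    def _fused_kernel(self, x, scale, shift):
        # Placeholder for a fused kernel that hard-codes the constant 1.0.
        rms = self._rms(x)
        self.last_path = "fused"
        return [v / rms * (1.0 + s) + b for v, s, b in zip(x, scale, shift)]

    def forward_cuda(self, x, scale, shift):
        # The fused kernel only supports the default constant, so any other
        # value is routed to the native implementation.
        if self.scale_constant != 1.0:
            return self.forward_native(x, scale, shift)
        return self._fused_kernel(x, scale, shift)


default_layer = NormScaleShiftSketch()
default_out = default_layer.forward_cuda([3.0, 4.0], [0.5, 0.5], [0.0, 0.0])

custom_layer = NormScaleShiftSketch(scale_constant=0.0)
custom_out = custom_layer.forward_cuda([3.0, 4.0], [0.5, 0.5], [0.0, 0.0])
```

With the default constant the sketch takes the fused path; with `scale_constant=0.0` it silently takes the native path, which is why a non-default constant can cost performance even when results are correct.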

Changelog
  • python/sglang/multimodal_gen/runtime/layers/layernorm.py
    • Added scale_constant parameter to _ScaleResidualNormScaleShift and _NormScaleShift classes.
    • Updated forward_cuda methods to use scale_constant and fall back to native if not 1.0.
    • Modified docstrings to reflect the new scaling formula.
  • python/sglang/multimodal_gen/runtime/models/dits/flux.py
    • Imported MulAdd and ScaleResidualLayerNormScaleShift.
    • Added _is_cuda flag.
    • Applied @torch.compile to FluxAttention and RotaryEmbedding forward methods.
    • Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format.
    • Replaced manual residual and gate operations with MulAdd in FluxBlock.
    • Replaced LayerNorm with ScaleResidualLayerNormScaleShift for norm2 and norm2_context.
    • Adjusted scale_mlp for Nunchaku structure when using ScaleResidualLayerNormScaleShift.
  • python/sglang/multimodal_gen/runtime/models/dits/flux_2.py
    • Imported LayerNormScaleShift and ScaleResidualLayerNormScaleShift.
    • Modified Flux2Attention to use apply_qk_norm for query and key.
    • Replaced nn.LayerNorm with LayerNormScaleShift in Flux2Attention and Flux2Block.
    • Refactored normalization and scaling in Flux2Block to use LayerNormScaleShift and ScaleResidualLayerNormScaleShift.
  • python/sglang/multimodal_gen/runtime/models/dits/hunyuanvideo.py
    • Added _is_cuda flag.
    • Implemented CUDA-optimized rotary embedding application using apply_flashinfer_rope_qk_inplace.
    • Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format for the optimized path.
  • python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py
    • Applied @torch.compile to QwenImageAttention forward method.
    • Ensured query, key, and value tensors are contiguous after unflattening.
  • python/sglang/multimodal_gen/runtime/models/dits/wanvideo.py
    • Implemented CUDA-optimized rotary embedding application using apply_flashinfer_rope_qk_inplace.
    • Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format for the optimized path.
    • Changed hidden_states.flatten(2) to hidden_states.flatten(2, 3).
  • python/sglang/multimodal_gen/runtime/models/dits/zimage.py
    • Imported RMSNormScaleShift.
    • Applied @torch.compile to ZImageAttention forward method.
    • Ensured query, key, and value tensors are contiguous after reshaping.
    • Implemented CUDA-optimized rotary embedding application using apply_flashinfer_rope_qk_inplace.
    • Updated cos and sin tensor conversion to use memory_format=torch.contiguous_format for the optimized path.
    • Changed hidden_states.flatten(2) to hidden_states.flatten(2, 3).
    • Introduced RMSNormScaleShift for fused attention and FFN norms in ZImageBlock when modulation is active.
    • Added zero_shift buffer for RMSNormScaleShift with scale_constant=0.0.

BBuf changed the title from "ud" to "[Diffusion] test vide coding ablity" Mar 2, 2026
BBuf changed the title from "[Diffusion] test vide coding ablity" to "[Diffusion] test vide coding ability" Mar 2, 2026
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a series of optimizations and refactorings across various DiT models and their layers. Key changes include:

  • Refactoring layernorm and model implementations to use fused kernels for operations like (residual + gate * x) followed by normalization and modulation. This improves performance and code clarity.
  • Adding @torch.compile to several forward methods for just-in-time compilation, which should speed up execution.
  • Utilizing optimized kernels like flashinfer for rotary position embeddings where applicable, with added checks for tensor contiguity for correctness.
  • Improving memory layout by explicitly making tensors contiguous after reshaping operations.
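As a rough reference for the first bullet, the sketch below spells out, unfused and step by step, the arithmetic that such a fused operation would combine. The function names, the choice of layer normalization without learned affine parameters, and the convention of returning both the modulated output and the updated residual are assumptions for illustration, not the sglang API.

```python
import math


def layer_norm(x, eps=1e-6):
    # Normalize to zero mean and unit variance (no learned affine parameters).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scale_residual_norm_scale_shift(residual, x, gate, scale, shift):
    # Step 1: gated residual update, residual + gate * x.
    new_residual = [r + g * v for r, g, v in zip(residual, gate, x)]
    # Step 2: normalize the updated residual.
    normed = layer_norm(new_residual)
    # Step 3: modulate with (1 + scale) and shift.
    modulated = [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]
    return modulated, new_residual


modulated, new_residual = scale_residual_norm_scale_shift(
    residual=[1.0, 2.0],
    x=[1.0, 1.0],
    gate=[1.0, 2.0],
    scale=[0.0, 0.0],
    shift=[0.0, 0.0],
)
```

A fused kernel would perform these three steps in one pass over memory instead of materializing each intermediate list.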

My review identifies a potential bug in zimage.py where a change in logic for adaptive layer norm scaling might have been unintentionally introduced. Other changes appear to be solid improvements.

Comment on lines +389 to +392
gate_msa = gate_msa.clone().tanh()
gate_mlp = gate_mlp.clone().tanh()
scale_msa = scale_msa.clone()
scale_mlp = scale_mlp.clone()

Priority: high

The 1.0 + operation on scale_msa and scale_mlp has been removed. The RMSNormScaleShift is initialized with scale_constant=0.0, which results in the computation norm(x) * scale. The original logic was norm(x) * (1.0 + scale). This change in logic is likely a bug, as similar refactorings in other files in this PR preserve the (1 + scale) logic. To restore the original behavior, you can add 1.0 + back to the scales.

Suggested change
- gate_msa = gate_msa.clone().tanh()
- gate_mlp = gate_mlp.clone().tanh()
- scale_msa = scale_msa.clone()
- scale_mlp = scale_mlp.clone()
+ gate_msa = gate_msa.clone().tanh()
+ gate_mlp = gate_mlp.clone().tanh()
+ scale_msa = 1.0 + scale_msa.clone()
+ scale_mlp = 1.0 + scale_mlp.clone()
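The difference the review points out can be checked numerically with a small sketch. It assumes the modulation has the form `norm(x) * (scale_constant + scale)` with a zero shift, as the comment above describes; the helper names `rms_norm` and `modulate` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def rms_norm(x, eps=1e-6):
    # Divide each element by the root mean square of the vector.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def modulate(x, scale, scale_constant):
    # norm(x) * (scale_constant + scale); the shift is omitted (treated as zero).
    return [n * (scale_constant + s) for n, s in zip(rms_norm(x), scale)]


x = [3.0, 4.0]
scale = [0.25, -0.5]

# Original behavior: norm(x) * (1 + scale).
original = modulate(x, scale, scale_constant=1.0)
# As submitted: scale_constant=0.0 with the raw scale gives norm(x) * scale.
as_submitted = modulate(x, scale, scale_constant=0.0)
# With the suggested change: add 1.0 to the scale before the call.
with_fix = modulate(x, [1.0 + s for s in scale], scale_constant=0.0)
```

Under these assumptions `with_fix` reproduces `original`, while `as_submitted` does not. Initializing the layer with `scale_constant=1.0` and passing the raw scale would be an equivalent way to restore the original result.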

@BBuf BBuf closed this Mar 2, 2026
@BBuf BBuf deleted the cluade_code_opt branch March 2, 2026 05:09