Summary of Changes
Hello @WhoisZihan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces support for regional compilation in diffusion models, specifically targeting repeated blocks like transformer layers. The primary goal is to significantly reduce the initial compilation overhead and overall inference time by compiling only the frequently reused parts of the model, rather than the entire graph. This is exposed via a new command-line argument, allowing users to easily enable this optimization and achieve faster model execution, as demonstrated by the provided benchmarks.
Code Review
This pull request introduces regional compilation to speed up the model compilation time, which is a valuable performance improvement. The changes involve adding a new server argument --regional-compile and logic to compile only specific repeated blocks of the model.
My review has identified two critical issues:
- An unused and buggy class `WanPreTransformerLayers` has been added in `wanvideo.py`. It should be removed to avoid confusion and potential errors.
- The new regional compilation feature is not activated, because the call sites for `_maybe_enable_torch_compile` in `denoising.py` have not been updated to pass the new `regional` flag.
Addressing these points will ensure the new feature works as intended and the code remains clean.
I am having trouble creating individual review comments, so my feedback is included inline below.
python/sglang/multimodal_gen/runtime/models/dits/wanvideo.py (678-697)
The new class `WanPreTransformerLayers` appears to be unused in the codebase. Additionally, its `__init__` method calls `super().__init__(config=config, hf_config=hf_config)`, but `torch.nn.Module.__init__` does not accept these arguments. This will cause a `TypeError` if the class is ever instantiated.
The layers defined within it also seem to be duplicated from `WanTransformer3DModel`. It looks like this might be leftover code from a refactoring. To avoid confusion and potential bugs, it would be best to remove this class entirely.
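To illustrate the reviewer's point, here is a minimal, self-contained sketch. A plain stand-in class is used instead of the real `torch.nn.Module` (whose `__init__` likewise does not accept `config`/`hf_config` keyword arguments), so this is an illustration of the failure mode, not the actual SGLang code:

```python
class Module:
    """Stand-in for torch.nn.Module: its __init__ takes no config arguments."""
    def __init__(self):
        pass


class WanPreTransformerLayers(Module):
    def __init__(self, config, hf_config):
        # Forwarding unknown keyword arguments to the base class raises
        # TypeError, because Module.__init__ accepts none.
        super().__init__(config=config, hf_config=hf_config)


try:
    WanPreTransformerLayers(config={}, hf_config={})
except TypeError as exc:
    print("instantiation failed:", exc)
```

This is exactly why the class, if ever instantiated, would fail at construction time.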
python/sglang/multimodal_gen/runtime/pipelines_core/stages/denoising.py (119)
The new `regional` parameter is a great addition for enabling regional compilation. However, the call sites for this function don't seem to be updated to pass this new parameter. The value of `regional` will always be its default `False`, so the regional compilation logic will never be triggered.
You should update the calls in `DenoisingStage.__init__` (line 97) and `DenoisingStage._prepare_denoising_loop` (line 523) to pass `regional=self.server_args.regional_compile`.
For example, in `__init__`:

```python
for transformer in filter(None, [self.transformer, self.transformer_2]):
    self._maybe_enable_torch_compile(transformer, regional=self.server_args.regional_compile)
```

Without this change, the new feature is not active.
Motivation
Currently, the compile integration in diffusion models only supports compiling the whole model. The traced graph includes all repeated layers, such as transformer blocks, which can lead to long compile times.
To address this, other frameworks such as vLLM and transformers leverage regional compilation to compile only the repeated blocks (non-repeated blocks can of course be added manually as well), so that each block type is compiled only once and the code cache is reused across identical blocks.
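The compile-once, reuse-everywhere idea can be sketched without PyTorch. The `Block` class and `compile_block` function below are hypothetical stand-ins (for a transformer block and `torch.compile` respectively), chosen only to show why regional compilation pays the compile cost once for N identical blocks:

```python
class Block:
    """Stand-in for a transformer block; all instances share one structure."""
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale


_code_cache = {}   # structure key -> compiled forward, mimicking dynamo's code cache
compile_count = 0  # number of actual (expensive) compilations performed


def compile_block(block):
    """Hypothetical stand-in for torch.compile applied to one block.

    The expensive compile step runs only on a cache miss; structurally
    identical blocks reuse the cached compiled code.
    """
    global compile_count
    key = type(block)  # identical block types share compiled code
    if key not in _code_cache:
        compile_count += 1  # the one-time compile cost is paid here
        _code_cache[key] = lambda b, x: b.forward(x)
    compiled = _code_cache[key]
    return lambda x, b=block: compiled(b, x)


# Regional compilation: wrap each repeated block individually; only ONE
# compilation happens, instead of tracing a graph that inlines all 8 blocks.
blocks = [Block(scale=i + 1) for i in range(8)]
compiled_blocks = [compile_block(b) for b in blocks]

x = 1.0
for forward in compiled_blocks:
    x = forward(x)
print(compile_count, x)  # → 1 40320.0
```

With whole-model compilation, by contrast, the traced graph grows with the number of blocks, so compile time scales with model depth rather than with the number of distinct block types.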
Modifications
A new server argument, `--regional-compile`, is added to control whether regional compilation is used.
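As a sketch of how such a flag might be plumbed through: the flag name matches the PR's CLI argument, but the parser setup below is illustrative and is not SGLang's actual server-args code:

```python
import argparse


def build_parser():
    # Illustrative parser only; SGLang's real server args live elsewhere.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--enable-torch-compile", action="store_true",
        help="Compile the model with torch.compile.")
    parser.add_argument(
        "--regional-compile", action="store_true",
        help="Compile only repeated blocks (e.g. transformer layers) "
             "instead of the whole model graph.")
    return parser


args = build_parser().parse_args(["--enable-torch-compile", "--regional-compile"])
print(args.regional_compile)  # → True
```

Making it a separate opt-in flag (rather than folding it into `--enable-torch-compile`) keeps the existing whole-model behavior unchanged by default.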
Accuracy Tests
Benchmarking and Profiling
I only have an A100 at hand, so I use a smaller model, Wan2.2-TI2V-5B-Diffusers, just to illustrate how it works.

```shell
SGLANG_DIFFUSION_ATTENTION_CONFIG=./mask_strategy_wan.json \
sglang generate \
    --model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers/ \
    --pin-cpu-memory \
    --num-gpus 8 \
    --ulysses-degree 8 \
    --attention-backend sage_attn \
    --enable-torch-compile \
    --regional-compile \
    --prompt "A cat walks on the grass, realistic" \
    --num-frames 81 \
    --height 480 \
    --width 832 \
    --num-inference-steps 27 \
    --guidance-scale 3.5 \
    --guidance-scale-2 4.0 \
    --perf-dump-path ./dump/wan_step_profile_cp8_main.json
```

The denoising step-wise duration is shown below.
Observations
There are a few interesting observations from the statistics:
For now, I have only added transformer blocks to the set of repeated blocks, and have not tested adding other non-repeated layers, which should help improve performance as well.
Also, there are many graph breaks inside the model right now, which could lead to unexpected timing changes too.
Plan for mitigating graph breaks and a more general compilation mechanism
From the dynamo logs, there are many graph breaks during compilation. Because of these massive graph breaks, performance has definitely not reached the peak achievable with compiled artifacts, regardless of full or regional compilation.
I have already identified a few graph-break points from the dynamo logs, and plan to mitigate these breaks in future PRs.
Also, I think the diffusion model layers are not very compile-friendly at the moment, and it would be great to have a more structured and flexible compilation mechanism, which could introduce more optimization opportunities.
There is already ongoing work in #11830 to integrate such a compilation mechanism, which would be really helpful for fully utilizing the benefits of torch.compile. Is there any plan to proceed with that work?
I am glad to help and contribute to the compile mechanism if possible.
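As a practical aside, graph breaks like the ones mentioned above can be surfaced with PyTorch's standard dynamo logging; this is stock torch tooling, not part of this PR, and the trailing arguments are elided here (a logging-config fragment, not a runnable command):

```shell
# Print every graph break (with its reason) while running the benchmark.
TORCH_LOGS="graph_breaks" sglang generate --enable-torch-compile --regional-compile ...
```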
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci