[diffusion] Add performance mode defaults #24491
Code Review
This pull request adds a --performance-mode CLI flag with presets to automate memory and performance configurations for diffusion models, supported by a new Deployment Cookbook and improved OOM diagnostics. Reviewers suggested enhancing GPU memory detection accuracy, maintaining encoder offloading in memory-constrained modes, and refactoring duplicated offload logic.
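A minimal sketch of how a mode flag with aliases could be parsed, assuming plain argparse. The canonical mode names and alias names come from the PR summary; the alias-to-mode mapping and the helper names (`normalize_mode`, `build_parser`) are hypothetical and are not taken from the PR.

```python
import argparse

# Canonical mode names as listed in the PR summary.
CANONICAL_MODES = ("auto", "throughput", "memory", "balanced")

# ASSUMPTION: which alias maps to which mode is a guess made for
# illustration; the PR text only lists the alias names.
ASSUMED_ALIASES = {
    "aggressive": "throughput",
    "conservative": "memory",
    "balance": "balanced",
}


def normalize_mode(value: str) -> str:
    """Map a user-supplied mode or alias to a canonical mode name."""
    key = value.strip().lower()
    key = ASSUMED_ALIASES.get(key, key)
    if key not in CANONICAL_MODES:
        accepted = sorted(CANONICAL_MODES + tuple(ASSUMED_ALIASES))
        raise argparse.ArgumentTypeError(
            f"unknown performance mode {value!r}; expected one of {accepted}"
        )
    return key


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts --performance-mode and its --mode alias."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--performance-mode",
        "--mode",
        dest="performance_mode",
        type=normalize_mode,
        default="auto",
    )
    return parser
```

Normalizing at parse time means downstream code only ever sees the four canonical names, so the resolver does not need to know about aliases.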
    if self.use_fsdp_inference:
        self._set_gpu_resident_defaults(use_fsdp=True)
        return
In memory mode, if use_fsdp_inference is explicitly set to True, calling _set_gpu_resident_defaults(use_fsdp=True) will disable CPU offloading for the text and image encoders (setting them to False). This is counter-intuitive for a mode intended to minimize GPU memory usage. It would be better to still enable offloading for these components even when FSDP is used for the DiT.
    if self.use_fsdp_inference:
        if self.text_encoder_cpu_offload is None:
            self.text_encoder_cpu_offload = True
        if self.image_encoder_cpu_offload is None:
            self.image_encoder_cpu_offload = True
        if self.dit_cpu_offload is None:
            self.dit_cpu_offload = False
        return

        self._is_wan_or_mova_pipeline()
        and not envs.SGLANG_CACHE_DIT_ENABLED
        and current_platform.enable_dit_layerwise_offload_for_wan_by_default()
    ):
        if self.dit_layerwise_offload is None:
            self.dit_layerwise_offload = True
        if self.dit_cpu_offload is None:
            self.dit_cpu_offload = False
        if self.text_encoder_cpu_offload is None:
            self.text_encoder_cpu_offload = True
        if self.image_encoder_cpu_offload is None:
            self.image_encoder_cpu_offload = True
/tag-and-rerun-ci
…rmance-mode-clean # Conflicts: # python/sglang/multimodal_gen/runtime/server_args.py
…rmance-mode-clean
…rmance-mode-clean # Conflicts: # python/sglang/srt/speculative/eagle_worker_v2.py
…args + deployment cookbook Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
- --performance-mode / --mode for diffusion server defaults: auto, throughput, memory, and balanced, with aggressive, conservative, and balance aliases.
- runtime/server_args_auto_tune.py; ServerArgs now only invokes the resolver.
- PipelineConfig methods instead of hard-coding Qwen/Wan/MOVA class-name checks in ServerArgs.
- --enable-cfg-parallel false now explicitly disables CFG parallelism.

TODO