
[diffusion] Add performance mode defaults #24491

Merged
mickqian merged 35 commits into sgl-project:main from mickqian:codex/diffusion-performance-mode-clean on May 13, 2026

Conversation

@mickqian
Collaborator

@mickqian mickqian commented May 6, 2026

Summary

  • Add --performance-mode / --mode for diffusion server defaults: auto, throughput, memory, and balanced, with aggressive, conservative, and balance aliases.
  • Auto-select FSDP+CFG only for high-confidence multi-GPU Qwen/Wan CFG cases, gated by the least available memory across selected GPUs.
  • Move the auto-tune decision logic into runtime/server_args_auto_tune.py; ServerArgs now only invokes the resolver.
  • Declare model-specific auto-tune hints on PipelineConfig methods instead of hard-coding Qwen/Wan/MOVA class-name checks in ServerArgs.
  • Preserve explicit FSDP, offload, and parallelism flags; --enable-cfg-parallel false now explicitly disables CFG parallelism.
  • Update the diffusion OOM guidance and add a concise deployment cookbook for CPU offload, FSDP, CFG, SP, and TP choices.
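
The mode names and aliases in the first bullet can be sketched as a small normalization step. This is a minimal illustration, not the code from this PR: `PerformanceMode`, `normalize_performance_mode`, and in particular the pairing of each alias with a canonical mode are assumptions, since the summary lists the aliases without saying what they map to.

```python
from enum import Enum


class PerformanceMode(str, Enum):
    """Canonical mode names taken from the PR summary."""

    AUTO = "auto"
    THROUGHPUT = "throughput"
    MEMORY = "memory"
    BALANCED = "balanced"


# ASSUMPTION: the summary names these aliases but not their targets;
# this pairing is a plausible guess for illustration only.
_ALIASES = {
    "aggressive": PerformanceMode.THROUGHPUT,
    "conservative": PerformanceMode.MEMORY,
    "balance": PerformanceMode.BALANCED,
}


def normalize_performance_mode(value: str) -> PerformanceMode:
    """Map a user-supplied mode string (canonical or alias) to a PerformanceMode."""
    key = value.strip().lower()
    if key in _ALIASES:
        return _ALIASES[key]
    try:
        return PerformanceMode(key)
    except ValueError:
        valid = sorted([m.value for m in PerformanceMode] + list(_ALIASES))
        raise ValueError(
            f"unknown performance mode {value!r}; expected one of {valid}"
        ) from None
```

Normalizing once at argument-parsing time means the rest of the resolver only has to branch on four canonical values.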

TODO

  • Support a performance mode that makes no adjustments.

@github-actions github-actions Bot added the documentation and diffusion labels May 6, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a --performance-mode CLI flag with presets to automate memory and performance configurations for diffusion models, supported by a new Deployment Cookbook and improved OOM diagnostics. Reviewers suggested enhancing GPU memory detection accuracy, maintaining encoder offloading in memory-constrained modes, and refactoring duplicated offload logic.
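
To make the memory gate concrete, here is a hedged sketch of gating on the least available memory across the selected GPUs. The function names, the `required_gib` parameter, and the direction of the comparison are all hypothetical; the PR text does not state the actual threshold or rule. Free-memory readings are passed in as plain integers so the sketch runs without a GPU; in practice they could come from something like `torch.cuda.mem_get_info`, which returns free and total bytes for a device.

```python
from typing import Sequence

GIB = 1024**3


def min_free_memory_gib(free_bytes_per_gpu: Sequence[int]) -> float:
    """Return the smallest free-memory reading across the selected GPUs, in GiB."""
    if not free_bytes_per_gpu:
        raise ValueError("at least one GPU reading is required")
    return min(free_bytes_per_gpu) / GIB


def passes_memory_gate(free_bytes_per_gpu: Sequence[int], required_gib: float) -> bool:
    """Hypothetical gate: every selected GPU must have at least required_gib free.

    Using the minimum (not the mean or the first device) keeps one busy GPU
    from being masked by its idle neighbours.
    """
    return min_free_memory_gib(free_bytes_per_gpu) >= required_gib
```

Taking the minimum is the conservative choice when the same configuration is applied to every rank, since the most constrained device determines whether the run fits.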

Comment thread python/sglang/multimodal_gen/runtime/server_args.py Outdated
Comment on lines +526 to +528
if self.use_fsdp_inference:
    self._set_gpu_resident_defaults(use_fsdp=True)
    return

medium

In memory mode, if use_fsdp_inference is explicitly set to True, calling _set_gpu_resident_defaults(use_fsdp=True) will disable CPU offloading for the text and image encoders (setting them to False). This is counter-intuitive for a mode intended to minimize GPU memory usage. It would be better to still enable offloading for these components even when FSDP is used for the DiT.

            if self.use_fsdp_inference:
                if self.text_encoder_cpu_offload is None:
                    self.text_encoder_cpu_offload = True
                if self.image_encoder_cpu_offload is None:
                    self.image_encoder_cpu_offload = True
                if self.dit_cpu_offload is None:
                    self.dit_cpu_offload = False
                return
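
The suggestion above depends on `None` meaning "not set by the user", so that defaults never overwrite an explicit flag. A minimal self-contained sketch of that pattern follows; `OffloadArgs`, `set_if_unset`, and `apply_memory_mode_fsdp_defaults` are hypothetical stand-ins, not the real `ServerArgs` API, and only the four field names come from the snippet above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffloadArgs:
    """Hypothetical stand-in for the offload-related fields of ServerArgs."""

    use_fsdp_inference: bool = False
    # None means "the user did not pass this flag".
    text_encoder_cpu_offload: Optional[bool] = None
    image_encoder_cpu_offload: Optional[bool] = None
    dit_cpu_offload: Optional[bool] = None

    def set_if_unset(self, name: str, value: bool) -> None:
        """Apply a default only when the user left the field unset."""
        if getattr(self, name) is None:
            setattr(self, name, value)


def apply_memory_mode_fsdp_defaults(args: OffloadArgs) -> None:
    """Memory mode with FSDP: keep encoders offloaded, keep the DiT on GPU."""
    if not args.use_fsdp_inference:
        return
    args.set_if_unset("text_encoder_cpu_offload", True)
    args.set_if_unset("image_encoder_cpu_offload", True)
    args.set_if_unset("dit_cpu_offload", False)
```

For example, a user who explicitly passes `text_encoder_cpu_offload=False` keeps that value, while the two unset fields receive the memory-mode defaults.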

Comment on lines +531 to +542
    self._is_wan_or_mova_pipeline()
    and not envs.SGLANG_CACHE_DIT_ENABLED
    and current_platform.enable_dit_layerwise_offload_for_wan_by_default()
):
    if self.dit_layerwise_offload is None:
        self.dit_layerwise_offload = True
    if self.dit_cpu_offload is None:
        self.dit_cpu_offload = False
    if self.text_encoder_cpu_offload is None:
        self.text_encoder_cpu_offload = True
    if self.image_encoder_cpu_offload is None:
        self.image_encoder_cpu_offload = True

medium

The logic for auto-enabling layerwise offload for Wan/MOVA models is duplicated here and in _adjust_platform_specific (lines 960-994). Consider refactoring it into a shared helper method so that both call sites stay consistent across performance modes and the defaults are maintained in one place.
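
One possible shape for such a helper is sketched below. It is an illustration under stated assumptions: `LayerwiseArgs` and `apply_layerwise_offload_defaults` are hypothetical names, and the three boolean inputs stand in for `_is_wan_or_mova_pipeline()`, `envs.SGLANG_CACHE_DIT_ENABLED`, and `current_platform.enable_dit_layerwise_offload_for_wan_by_default()` so the example runs without the sglang package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerwiseArgs:
    """Hypothetical subset of ServerArgs touched by the duplicated logic."""

    dit_layerwise_offload: Optional[bool] = None
    dit_cpu_offload: Optional[bool] = None
    text_encoder_cpu_offload: Optional[bool] = None
    image_encoder_cpu_offload: Optional[bool] = None


# Defaults copied from the reviewed snippet; kept in one place on purpose.
_LAYERWISE_DEFAULTS = (
    ("dit_layerwise_offload", True),
    ("dit_cpu_offload", False),
    ("text_encoder_cpu_offload", True),
    ("image_encoder_cpu_offload", True),
)


def apply_layerwise_offload_defaults(
    args: LayerwiseArgs,
    *,
    is_wan_or_mova: bool,
    cache_dit_enabled: bool,
    platform_default_on: bool,
) -> bool:
    """Fill unset offload fields when the layerwise-offload condition holds.

    Returns True if the condition matched, so callers can return early.
    """
    if not (is_wan_or_mova and not cache_dit_enabled and platform_default_on):
        return False
    for name, value in _LAYERWISE_DEFAULTS:
        if getattr(args, name) is None:  # never override an explicit flag
            setattr(args, name, value)
    return True
```

Both the memory-mode path and the platform-specific path could then call the same function and branch on its return value.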

@mickqian
Collaborator Author

mickqian commented May 7, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label May 7, 2026
@mickqian mickqian merged commit ff70aea into sgl-project:main May 13, 2026
100 of 138 checks passed
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026
zijiexia added a commit to zijiexia/sglang that referenced this pull request Jun 4, 2026
…args + deployment cookbook

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

1 participant