
[diffusion] refactor: introduce component residency manager #23771

Merged
mickqian merged 34 commits into sgl-project:main from mickqian:component-residency-manager
May 1, 2026

Conversation

@mickqian
Collaborator

@mickqian mickqian commented Apr 26, 2026

Motivation

Component management is crucial for the latency of diffusion pipeline inference, especially as modern diffusion models require larger sub-components (e.g. Mistral or Qwen as text encoders) and more components (dual-DiT for Wan and LTX), whose combined parameter sizes exceed the VRAM of most modern GPUs.

Currently, the model management code is scattered across each pipeline's pre- and post-hooks, making it hard to apply advanced coordination.

Modifications

  1. Introduce ComponentResidencyStrategy to abstract and cover all pre-existing module management techniques (including layerwise offload, snapshot, and resident mode for LTX pre-merged LoRA).
  2. Introduce ComponentResidencyManager as a global manager that coordinates component placement, minimizing latency while making full use of VRAM. The manager calls the preparation, prefetch, and release hooks defined by the aforementioned strategies to coordinate.
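A minimal sketch of the strategy/manager split described above. The class names follow the PR description, but the hook signatures, the `ResidentStrategy` example, and the manager's `register`/`before_stage`/`after_stage` methods are illustrative assumptions, not the PR's actual API.

```python
# Hypothetical sketch of the residency strategy/manager pattern.
# Hook signatures and method names are assumptions for illustration.
from abc import ABC, abstractmethod


class ComponentResidencyStrategy(ABC):
    """Per-component placement policy (layerwise offload, snapshot, etc.)."""

    @abstractmethod
    def prepare(self, module):
        """Make the component usable on-device before its stage runs."""

    @abstractmethod
    def prefetch(self, module):
        """Optionally begin an async H2D transfer ahead of use."""

    @abstractmethod
    def release(self, module):
        """Free or offload the component after its stage finishes."""


class ResidentStrategy(ComponentResidencyStrategy):
    """Keep the component resident on the GPU: every hook is a no-op."""

    def prepare(self, module):
        pass

    def prefetch(self, module):
        pass

    def release(self, module):
        pass


class ComponentResidencyManager:
    """Coordinates per-component strategies so they share VRAM globally."""

    def __init__(self):
        self._strategies = {}

    def register(self, name, module, strategy):
        self._strategies[name] = (module, strategy)

    def before_stage(self, component_names):
        # Called from the pipeline executor's pre-hook.
        for name in component_names:
            module, strategy = self._strategies[name]
            strategy.prepare(module)

    def after_stage(self, component_names):
        # Called from the pipeline executor's post-hook.
        for name in component_names:
            module, strategy = self._strategies[name]
            strategy.release(module)
```

The design point is that stages only declare which components they use; all placement decisions move out of per-pipeline hooks and into the manager.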

Accuracy Tests

Speed Tests and Profiling

Checklist

TODO

  1. Use a warmup request to build the component relationship graph, so the manager can coordinate module placement at a finer granularity.
  2. Provide an argument letting the user switch between memory modes (balanced, high-vram, low-vram).

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@github-actions github-actions Bot added diffusion SGLang Diffusion jit-kernel labels Apr 26, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a centralized PipelineResidencyManager to coordinate the loading and offloading of model components (such as VAEs, DiTs, and text encoders) across different pipeline stages. This system replaces legacy stage-local behavior with configurable residency strategies (static, dynamic, or disabled) and integrates hooks into the pipeline executors to manage component lifecycles. Additionally, the PR adds a compatibility wrapper for Flash Attention v3 kernels and updates various stages to declare their component usage. Feedback focuses on a potential type error in text encoding attention masks, the lack of prefetching in the vanilla D2H strategy, and the use of hardcoded memory buffer values.

Comment on lines +389 to +391
self._trace("prefetch_skip", use, strategy, module)
return
self._trace("prefetch", use, strategy, module)

medium

Prefetching is explicitly disabled for VanillaD2HStrategy. This strategy is used for components like text encoders, which can be large and would benefit from asynchronous H2D transfers to hide latency. Unless there is a specific reason to avoid overlapping these transfers (e.g., memory pressure concerns that aren't already handled by the manager), prefetching should be enabled for this strategy as well.
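To make the reviewer's point concrete, here is a hedged, framework-agnostic sketch of what enabling prefetch for such a strategy could look like: the transfer is kicked off on a background thread so it overlaps with the previous stage's compute, and `prepare()` only waits for the remainder. The class name and the `module_to_device` callable are stand-ins, not the PR's code; a real implementation on CUDA would more likely use a side stream with pinned memory and `tensor.to(device, non_blocking=True)`.

```python
# Illustrative sketch (not the PR's implementation) of overlapping H2D
# transfers with compute. module_to_device is a stand-in for the real
# copy routine (e.g. a non-blocking .to() on a dedicated CUDA stream).
import threading


class PrefetchingStrategy:
    def __init__(self, module_to_device):
        self._move = module_to_device
        self._thread = None

    def prefetch(self, module):
        # Start the transfer asynchronously so it runs while the
        # previous pipeline stage is still computing.
        self._thread = threading.Thread(target=self._move, args=(module,))
        self._thread.start()

    def prepare(self, module):
        # Block only for whatever transfer time remains, if any.
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```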

memory_usage = getattr(self.pipeline, "memory_usages", {}).get(component_name)
if memory_usage is None:
    return False
return memory_usage + 2.0 < current_platform.get_available_gpu_memory()

medium

The memory buffer value 2.0 (presumably GB) is hardcoded. It would be better to define this as a named constant or make it configurable via ServerArgs to improve maintainability and allow tuning for different hardware or workloads.
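A minimal sketch of the suggested refactor. The constant name, the helper, and its parameters are hypothetical; the real change would likely thread the value through ServerArgs instead of a default argument.

```python
# Hedged sketch of lifting the hardcoded buffer into a named constant.
# GPU_MEMORY_BUFFER_GB and fits_on_gpu are illustrative names only.
GPU_MEMORY_BUFFER_GB = 2.0  # headroom (GB) kept free to absorb allocation spikes


def fits_on_gpu(memory_usage_gb, available_gb, buffer_gb=GPU_MEMORY_BUFFER_GB):
    """Return True if the component fits on-device with the configured headroom."""
    if memory_usage_gb is None:
        return False
    return memory_usage_gb + buffer_gb < available_gb
```

Exposing `buffer_gb` as a server argument would let users tune the headroom for different hardware or the planned balanced/high-vram/low-vram memory modes.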

…est-20260426

# Conflicts:
#	python/sglang/multimodal_gen/runtime/managers/scheduler.py
#	python/sglang/multimodal_gen/runtime/pipelines_core/executors/parallel_executor.py
#	python/sglang/multimodal_gen/runtime/pipelines_core/executors/sync_executor.py
@mickqian mickqian force-pushed the component-residency-manager branch from 0537cf9 to 4ced751 Compare April 30, 2026 06:27
@github-actions github-actions Bot added the lora label Apr 30, 2026
@mickqian mickqian force-pushed the component-residency-manager branch from 5187e0b to 1077f35 Compare April 30, 2026 12:55
@mickqian mickqian marked this pull request as ready for review April 30, 2026 12:55
@mickqian
Collaborator Author

/tag-and-rerun-ci

@mickqian mickqian force-pushed the component-residency-manager branch from 77e3f2c to c561d4a Compare April 30, 2026 16:11
@mickqian mickqian merged commit 9d84268 into sgl-project:main May 1, 2026
69 of 78 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
