[diffusion] refactor: introduce component residency manager (#23771)
Conversation
Code Review
This pull request introduces a centralized PipelineResidencyManager to coordinate the loading and offloading of model components (such as VAEs, DiTs, and text encoders) across different pipeline stages. This system replaces legacy stage-local behavior with configurable residency strategies (static, dynamic, or disabled) and integrates hooks into the pipeline executors to manage component lifecycles. Additionally, the PR adds a compatibility wrapper for Flash Attention v3 kernels and updates various stages to declare their component usage. Feedback focuses on a potential type error in text encoding attention masks, the lack of prefetching in the vanilla D2H strategy, and the use of hardcoded memory buffer values.
```python
            self._trace("prefetch_skip", use, strategy, module)
            return
        self._trace("prefetch", use, strategy, module)
```
Prefetching is explicitly disabled for VanillaD2HStrategy. This strategy is used for components like text encoders, which can be large and would benefit from asynchronous H2D transfers to hide latency. Unless there is a specific reason to avoid overlapping these transfers (e.g., memory pressure concerns that aren't already handled by the manager), prefetching should be enabled for this strategy as well.
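One way to act on this feedback is to make prefetch support a declarative capability on each strategy instead of a hardcoded skip. The sketch below is a minimal toy model, not the PR's actual code: the class and method names (`ResidencyStrategy`, `maybe_prefetch`, the `_trace` event log) are assumptions that only mirror the shape of the diff.

```python
# Toy sketch: strategies opt in to async prefetch via a class attribute,
# so enabling it for the D2H strategy is a one-line change. All names
# here are illustrative assumptions, not the PR's real API.

class ResidencyStrategy:
    supports_prefetch = True  # default: allow overlapping H2D transfers


class VanillaD2HStrategy(ResidencyStrategy):
    # Per the review suggestion, text encoders managed by this strategy
    # benefit from prefetch, so leave the capability enabled.
    supports_prefetch = True


class DisabledStrategy(ResidencyStrategy):
    supports_prefetch = False


class Manager:
    def __init__(self):
        self.trace = []

    def _trace(self, event, module):
        self.trace.append((event, module))

    def maybe_prefetch(self, strategy, module):
        # Consult the strategy's declared capability rather than
        # special-casing strategy types here.
        if not strategy.supports_prefetch:
            self._trace("prefetch_skip", module)
            return
        self._trace("prefetch", module)
        # A real implementation would issue a non-blocking H2D copy on a
        # side CUDA stream at this point.


mgr = Manager()
mgr.maybe_prefetch(VanillaD2HStrategy(), "text_encoder")
mgr.maybe_prefetch(DisabledStrategy(), "vae")
print(mgr.trace)
```

With this shape, memory-pressure concerns can still veto an individual prefetch inside `maybe_prefetch` without disabling the capability for the whole strategy class.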
```python
        memory_usage = getattr(self.pipeline, "memory_usages", {}).get(component_name)
        if memory_usage is None:
            return False
        return memory_usage + 2.0 < current_platform.get_available_gpu_memory()
```
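The review summary above flags the hardcoded memory buffer here (the `2.0`). A minimal sketch of making it configurable follows; the default constant, the environment variable name, and the standalone `can_load` helper are all assumptions for illustration, not the PR's actual interface.

```python
# Sketch: replace the hardcoded 2.0 GiB safety buffer with a configurable
# value. SGLANG_GPU_MEMORY_BUFFER_GB is a hypothetical env var name.
import os

DEFAULT_GPU_MEMORY_BUFFER_GB = 2.0


def can_load(memory_usage_gb, available_gb, buffer_gb=None):
    """Return True if the component fits with a safety margin to spare."""
    if memory_usage_gb is None:
        # Unknown footprint: conservatively refuse to load.
        return False
    if buffer_gb is None:
        buffer_gb = float(
            os.environ.get("SGLANG_GPU_MEMORY_BUFFER_GB",
                           DEFAULT_GPU_MEMORY_BUFFER_GB)
        )
    return memory_usage_gb + buffer_gb < available_gb


print(can_load(10.0, 13.0))                 # fits under the default buffer
print(can_load(10.0, 13.0, buffer_gb=4.0))  # rejected by a stricter buffer
print(can_load(None, 13.0))                 # unknown usage is rejected
```

Exposing the buffer as a parameter (with an environment override) keeps the current default behavior while letting deployments on fragmented or shared GPUs tune the margin.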
Motivation
Component management is crucial for the latency of diffusion pipeline inference, especially as modern diffusion models require larger sub-components (e.g., Mistral or Qwen as text encoders) and more components (dual-DiT for Wan and LTX), whose combined parameter sizes exceed the memory of most modern GPUs.
Currently the model management code is scattered across each pipeline's pre- and post-hooks, making it hard to apply advanced coordination.
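To make the intended coordination concrete, here is a toy sketch of a manager that tracks which components each stage declares, loads them before the stage runs, and offloads anything no later stage needs. Every name in it (`ResidencyManager`, `run_stage`, the per-stage component lists) is an illustrative assumption; the real `PipelineResidencyManager` additionally dispatches to per-component residency strategies.

```python
# Toy model of centralized residency coordination across pipeline stages.
# Names and data shapes are assumptions, not the PR's actual API.

class ResidencyManager:
    def __init__(self, stage_components):
        # stage_components: one list of component names per pipeline stage.
        self.stage_components = stage_components
        self.resident = set()   # components currently on the GPU
        self.log = []           # (action, component) audit trail

    def run_stage(self, i):
        needed = set(self.stage_components[i])
        # Load components this stage declared but which are not resident.
        for name in sorted(needed - self.resident):
            self.resident.add(name)
            self.log.append(("load", name))
        # Offload components that no subsequent stage will use.
        still_needed = set().union(*map(set, self.stage_components[i + 1:]))
        for name in sorted(self.resident - still_needed):
            self.resident.discard(name)
            self.log.append(("offload", name))


stages = [["text_encoder"], ["dit"], ["dit", "vae"]]
mgr = ResidencyManager(stages)
for i in range(len(stages)):
    mgr.run_stage(i)
print(mgr.log)
```

Because the manager sees every stage's declared usage up front, it can keep a component resident across consecutive stages (the DiT here) instead of each stage's hooks independently offloading and reloading it.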
Modifications
- `ComponentResidencyStrategy` to abstract and cover all pre-existing module management techniques (including layerwise offload, snapshot, and resident mode for LTX pre-merged LoRA)
- `ComponentResidencyManager` to serve as a global manager coordinating component placement, minimizing latency while making full use of VRAM. The manager calls the preparation, prefetch, and release hooks defined by the aforementioned strategies.

Accuracy Tests
Speed Tests and Profiling
Checklist
TODO
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`