
[Refactor] Let diffusion pipelines declare offloadable modules via SupportsModuleOffload #2427

Merged
lishunyang12 merged 10 commits into vllm-project:main from NickCao:fix/offload-module-discovery
Apr 22, 2026

Conversation

NickCao (Contributor) commented Apr 1, 2026

Purpose

ModuleDiscovery previously hardcoded attribute names to find DiT,
encoder, and VAE modules for CPU offload. This silently failed for
pipelines using non-standard names (e.g. OmniGen2's 'mllm', Bagel's
'vit_model', MammothModa2's 'gen_transformer'/'gen_vae'), leaving
multi-GB models idle on GPU during the denoising loop.

Add a SupportsModuleOffload protocol to the pipeline interface.
Pipelines declare _dit_modules, _encoder_modules, and _vae_modules
as class variables, and ModuleDiscovery.discover() reads them
directly. Both DiT and encoder lists are needed because the offload
hooks use mutual exclusion. Pipelines without the protocol fall back
to the existing attribute-name scan.
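
For illustration, a minimal sketch of the declaration side (the protocol name, the class-variable names, and OmniGen2's 'transformer'/'mllm' attributes are from this PR; the typing.Protocol shape, the 'vae' attribute, and the discover() body are assumptions):

```python
from typing import ClassVar, Protocol, runtime_checkable


@runtime_checkable
class SupportsModuleOffload(Protocol):
    """Pipelines list the attribute names of their offloadable modules."""

    _dit_modules: ClassVar[list[str]]      # resident on GPU during denoising
    _encoder_modules: ClassVar[list[str]]  # swapped against the DiT group
    _vae_modules: ClassVar[list[str]]


class OmniGen2Pipeline:  # illustrative body, not the real class
    _dit_modules = ["transformer"]
    _encoder_modules = ["mllm"]  # the encoder the hardcoded scan missed
    _vae_modules = ["vae"]


def discover(pipeline):
    """ModuleDiscovery.discover() sketch: protocol first, scan as fallback."""
    if isinstance(pipeline, SupportsModuleOffload):
        ...  # read the three declared lists directly
    else:
        ...  # legacy hardcoded attribute-name scan
```

The hooks swap the DiT and encoder groups against each other, with only one group resident on GPU at a time, which is why both lists must be declared explicitly.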

Also rename PipelineModules.vae to PipelineModules.vaes (a list) to
support pipelines with multiple VAEs (e.g. LTX2's audio_vae,
DreamIDOmni's vae_model_audio). Both the sequential and layerwise
offload backends are updated to iterate over the list.
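
A sketch of the container change (only the vae-to-vaes rename and the list iteration are from this PR; the dataclass shape and the other field names are assumptions):

```python
from dataclasses import dataclass, field

import torch.nn as nn


@dataclass
class PipelineModules:
    dit: nn.Module | None = None                          # placeholder field name
    encoders: list[nn.Module] = field(default_factory=list)
    vaes: list[nn.Module] = field(default_factory=list)   # was: a single vae


def offload_vaes(modules: PipelineModules) -> None:
    # both offload backends now iterate instead of assuming one VAE
    for vae in modules.vaes:
        vae.to("cpu")
```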

Behavioral changes from unifying collection logic into
_collect_modules (a sketch follows the list):

  • Encoder collection now checks isinstance(nn.Module) (the original
    did not), preventing non-Module objects from reaching .to(device).
  • Encoder collection now deduplicates (the original did not), avoiding
    double hook registration when two attrs point to the same module.
  • Non-Module attributes trigger a warning when declared via the
    protocol (a pipeline authoring bug) and are silently skipped in the
    fallback path.
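
A minimal sketch of the unified collector, assuming this signature (the isinstance check, the deduplication, and the warn-versus-skip split are the behaviors listed above; the parameter names and logger are illustrative):

```python
import logging
import operator

import torch.nn as nn

logger = logging.getLogger(__name__)


def _collect_modules(pipeline, attr_names, *, declared: bool) -> list[nn.Module]:
    """Resolve attribute names into a deduplicated list of nn.Modules."""
    modules: list[nn.Module] = []
    seen: set[int] = set()
    for name in attr_names:
        try:
            obj = operator.attrgetter(name)(pipeline)  # dotted paths supported
        except AttributeError:
            continue  # attribute absent on this pipeline
        if not isinstance(obj, nn.Module):
            if declared:
                # a declared non-Module is a pipeline authoring bug: warn
                logger.warning("offload attr %r is not an nn.Module", name)
            continue  # fallback path stays silent, as before
        if id(obj) in seen:
            continue  # two attrs aliasing one module: avoid double hooks
        seen.add(id(obj))
        modules.append(obj)
    return modules
```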

Test Plan

```
vllm serve --omni --model OmniGen2/OmniGen2 --port 8091
vllm serve --omni --model OmniGen2/OmniGen2 --port 8091 --enable-cpu-offload
# send image generation requests
```

Test Result

| Config | Peak Reserved | Peak Allocated | Gen Time (steady) | Status |
| --- | --- | --- | --- | --- |
| upstream/main, no offload | 19.33 GB | 15.20 GB | 3.69s | Works |
| upstream/main, --enable-cpu-offload | n/a | n/a | n/a | CRASH: No encoder modules found, model stays on CPU |
| Our branch, no offload | 19.33 GB | 15.20 GB | 3.69s | Works (same as upstream) |
| Our branch, --enable-cpu-offload | 8.87 GB | 8.20 GB | 8.84s | Works, transformer <-> mllm mutual exclusion active |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands, or state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@NickCao NickCao requested a review from hsliuustc0106 as a code owner April 1, 2026 19:09
@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 0f85f38 to 8876037 Compare April 1, 2026 19:20
hsliuustc0106 (Collaborator) commented:

@yuanheng-zhao PTAL

gcanlin (Collaborator) commented Apr 2, 2026

Could you add a column in your table to show the current main branch memory and performance?

yuanheng-zhao (Contributor) left a comment

Thanks for contributing. It's good to have SupportsModuleOffload as an interface to adapt module-level offloading for new models more flexibly. Left some comments.

Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
Comment thread: vllm_omni/diffusion/models/interface.py
NickCao (Contributor, Author) commented Apr 2, 2026

> Could you add a column in your table to show the current main branch memory and performance?

Updated the table. On the main branch, OmniGen2 actually crashes with --enable-cpu-offload, since the modules end up on the wrong devices.

lishunyang12 (Collaborator) left a comment

Left a couple of small things; overall the approach looks good.

Comment thread: vllm_omni/diffusion/offloader/module_collector.py
Comment thread: vllm_omni/diffusion/offloader/module_collector.py
@NickCao NickCao force-pushed the fix/offload-module-discovery branch 3 times, most recently from 5409f46 to 1699909 Compare April 2, 2026 15:45
hsliuustc0106 (Collaborator) commented:

Add tests since it introduces a new class. Does it affect API and UX?

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 1699909 to b65fe3d Compare April 3, 2026 14:15
NickCao (Contributor, Author) commented Apr 3, 2026

> Add tests since it introduces a new class. Does it affect API and UX?

Unit test added. This PR alone should not affect UX or the external API, though it does affect model authors. Once all models are migrated to the explicit path, we can drop the fallback and throw an error when offload is enabled on an unsupported model rather than crashing; that's when UX would improve.

NickCao (Contributor, Author) commented Apr 3, 2026

Also: I find SupportsModuleOffload not very descriptive. What do you think would be better? SupportsSequentialOffload?

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from b65fe3d to f4ffc03 Compare April 9, 2026 14:43
NickCao (Contributor, Author) commented Apr 9, 2026

Rebased; resolved conflicts with #2339.

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from f4ffc03 to 67d39bd Compare April 15, 2026 19:11
yuanheng-zhao (Contributor) left a comment

LGTM. This PR will be helpful for other models with uncommon attr names and multiple VAE/encoder components as well.

Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
NickCao (Contributor, Author) commented Apr 20, 2026

Added support for nested modules and declared SupportsModuleOffload for Bagel and LTX2.

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 4970b73 to 37354be Compare April 20, 2026 13:58
NickCao (Contributor, Author) commented Apr 20, 2026

> can you help update the add diffusion model skill for this refactor?

I see that there are skills in both .claude/skills of this repo and https://github.com/hsliuustc0106/vllm-omni-skills; which one should be considered the authoritative source?

hsliuustc0106 (Collaborator) commented:

> can you help update the add diffusion model skill for this refactor?
>
> I see that there are skills in both .claude/skills of this repo and https://github.com/hsliuustc0106/vllm-omni-skills; which one should be considered the authoritative source?

This repo, please.

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 20, 2026
@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 0e1ecc5 to c8099c1 Compare April 20, 2026 15:08
NickCao (Contributor, Author) commented Apr 20, 2026

> can you help update the add diffusion model skill for this refactor?

Done.

yuanheng-zhao (Contributor) left a comment

LGTM

Comment thread: vllm_omni/diffusion/models/interface.py (outdated)
Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
NickCao (Contributor, Author) commented Apr 21, 2026

```
(APIServer pid=315) ERROR 04-21 13:42:57 [stage_config.py:272] Failed to import pipeline module 'vllm_omni.model_executor.models.voxcpm2.pipeline' for 'voxcpm2': No module named 'librosa'
```

Huh, why's librosa back?

NickCao (Contributor, Author) commented Apr 21, 2026

#2996

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 61cd06f to fef0586 Compare April 21, 2026 16:04
yuanheng-zhao (Contributor) commented:

CI failed; please help take a look @NickCao, @tjtanaa.

NickCao (Contributor, Author) commented Apr 22, 2026

> CI failed; please help take a look @NickCao, @tjtanaa.

It's due to the Hugging Face rate limit; could anyone restart it?

```
33948:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
34378:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json
34586:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json (Request ID: Root=1-69e7a2f7-65d3f9a20a38256813385be2;bf03e6ec-a19a-49b0-bcd0-5e1d2c80e127)
34783:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/r...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
35213:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json
35421:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json (Request ID: Root=1-69e7a2f8-14d95cdf441882464ab0fa0a;0ef66b86-7c00-4e80-bf17-5bfc0c9e7fb8)
35618:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/r...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
36048:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json
36256:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json (Request ID: Root=1-69e7a2f8-67ec549d7e332e2c4b4f789e;500fe18f-2458-4000-9393-79904fb3b5c5)
36453:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
36883:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json
37091:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json (Request ID: Root=1-69e7a2f8-01b997bc2203b8de397d87f4;4d81b8bd-be83-4c0a-959c-548d6f9e3ee8)
37288:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
37718:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json
37926:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json (Request ID: Root=1-69e7a2f9-5d606b362a8fedf818755b96;c89ff1f8-d55d-4aa8-9734-026aeceed648)
38123:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
38553:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json
38761:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json (Request ID: Root=1-69e7a2f9-2660c87f14c748352a776090;cfe64865-4109-41f2-a52e-f04f25d352d7)
```

yuanheng-zhao (Contributor) commented:

Weird, they were happening during the CPU unit tests. I checked several recent commits and didn't find the same rate-limit issue. Could you merge main to trigger CI again?

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from fef0586 to 288db29 Compare April 22, 2026 13:08
NickCao (Contributor, Author) commented Apr 22, 2026

The doc failure seems unrelated? Can't tell.

NickCao and others added 10 commits April 22, 2026 09:39
…pportsModuleOffload

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Add SupportsModuleOffload to OmniGen2Pipeline so ModuleDiscovery
can find the Qwen2.5-VL text encoder ('mllm', ~6-16 GB) for
sequential CPU offload. Previously, 'mllm' was not in the hardcoded
attribute scan list, so enable_cpu_offload silently left it on GPU
during the entire denoising loop.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Allow dotted attribute names (e.g. "pipe.transformer") in
_dit_modules, _encoder_modules, and _vae_modules to resolve
nested modules via operator.attrgetter.  This handles pipelines
like LTX2TwoStagesPipeline where the transformer lives under a
child pipeline (pipe.transformer), and Bagel where the encoder
is at language_model.model.

Flat attribute names continue to work unchanged.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
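
For illustration, a dotted declaration and the resolution step (the attribute paths are the ones these commit messages name; the class body and helper are sketches):

```python
import operator

import torch.nn as nn


class LTX2TwoStagesPipeline:  # illustrative body
    _dit_modules = ["pipe.transformer"]       # nested under a child pipeline
    _encoder_modules = ["pipe.text_encoder"]
    _vae_modules = ["pipe.vae", "pipe.audio_vae"]


def resolve(pipeline, name: str) -> nn.Module:
    # attrgetter walks dotted paths; flat names resolve exactly as before
    return operator.attrgetter(name)(pipeline)
```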
…offload

Add _resident_modules class variable to SupportsModuleOffload for
small submodules that must stay on GPU during layer-wise offloading
(e.g. embedders, connectors).  Defaults to empty list.

During layerwise offload, pipelines load everything to CPU and the
offloader selectively moves dit/encoder/vae groups to GPU.  Modules
outside these groups stay on CPU, which breaks pipelines like Bagel
where time_embedder, vae2llm, vit_model etc. are needed every
forward pass but are not children of any discovered group.

_resident_modules lets pipelines declare these modules explicitly.
The layerwise backend pins them on GPU alongside encoders and VAEs.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
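
A sketch of a declaration using it, with the Bagel module names this series lists (the 'bagel.' dotted prefixes are an assumption based on "under self.bagel" in the commit below; the class body is illustrative):

```python
class BagelPipeline:  # illustrative body
    _dit_modules = ["language_model.model"]
    _resident_modules = [  # pinned on GPU throughout layerwise offload
        "bagel.time_embedder",
        "bagel.vae2llm",
        "bagel.llm2vae",
        "bagel.latent_pos_embed",
        "bagel.vit_model",
        "bagel.connector",
        "bagel.vit_pos_embed",
    ]
```

Since _resident_modules defaults to an empty list, pipelines that need no pinning are unaffected.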
Add 'To Support a Model' section under model-level offloading showing
how to implement the SupportsModuleOffload protocol. Restore the
layerwise 'To Support a Model' section under its own parent. Update
the Module Discovery section to document both protocol-based and
fallback attribute scan discovery paths.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
LTX2 two-stage pipelines have nested module structure where the
DiT, encoders, and VAEs live under self.pipe.  The fallback
attribute scan cannot find them, causing layerwise offloading
to skip DiT discovery entirely.

Implement SupportsModuleOffload on LTX2TwoStagesPipeline and
LTX2ImageToVideoTwoStagesPipeline using dotted paths to reach
nested modules (pipe.transformer, pipe.text_encoder, pipe.vae,
pipe.audio_vae).

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
BagelPipeline has non-standard module layout: the DiT lives at
language_model.model, and several small modules under self.bagel
(time_embedder, vae2llm, llm2vae, latent_pos_embed, vit_model,
connector, vit_pos_embed) are needed every forward pass but are
not children of the DiT.

Implement SupportsModuleOffload with _resident_modules to pin
these small modules on GPU during layerwise offloading.  Without
this, they stay on CPU (offload pipelines skip self.to(device))
and forward() fails with device mismatch.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Add Step 11 (CPU Offload Support) covering SupportsModuleOffload
protocol: _dit_modules, _encoder_modules, _vae_modules,
_resident_modules, dotted path support.

Add cpu_offload_diffusion.md to Step 7 required docs list.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 288db29 to 7df49c3 Compare April 22, 2026 13:39
@lishunyang12 lishunyang12 merged commit e3b0afb into vllm-project:main Apr 22, 2026
8 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
…upportsModuleOffload (vllm-project#2427)

Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>

Labels

ready label to trigger buildkite CI
