
[Refactor] Let diffusion pipelines declare offloadable modules via SupportsModuleOffload #2427

Merged
lishunyang12 merged 10 commits into vllm-project:main from NickCao:fix/offload-module-discovery
Apr 22, 2026

Conversation

NickCao (Contributor) commented Apr 1, 2026

Purpose

ModuleDiscovery previously hardcoded attribute names to find DiT,
encoder, and VAE modules for CPU offload. This silently failed for
pipelines using non-standard names (e.g. OmniGen2's 'mllm', Bagel's
'vit_model', MammothModa2's 'gen_transformer'/'gen_vae'), leaving
multi-GB models idle on GPU during the denoising loop.

Add a SupportsModuleOffload protocol to the pipeline interface.
Pipelines declare _dit_modules, _encoder_modules, and _vae_modules
as class variables, and ModuleDiscovery.discover() reads them
directly. Both DiT and encoder lists are needed because the offload
hooks use mutual exclusion. Pipelines without the protocol fall back
to the existing attribute-name scan.
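
For illustration, a minimal sketch of the declaration side (the protocol name, the class-variable names, and OmniGen2's 'transformer'/'mllm' attributes are from this PR; the typing.Protocol shape, the 'vae' attribute, and the discover() body are assumptions):

```python
from typing import ClassVar, Protocol, runtime_checkable


@runtime_checkable
class SupportsModuleOffload(Protocol):
    """Pipelines list the attribute names of their offloadable modules."""

    _dit_modules: ClassVar[list[str]]      # resident on GPU during denoising
    _encoder_modules: ClassVar[list[str]]  # swapped against the DiT group
    _vae_modules: ClassVar[list[str]]


class OmniGen2Pipeline:  # illustrative body, not the real class
    _dit_modules = ["transformer"]
    _encoder_modules = ["mllm"]  # the encoder the hardcoded scan missed
    _vae_modules = ["vae"]


def discover(pipeline):
    """ModuleDiscovery.discover() sketch: protocol first, scan as fallback."""
    if isinstance(pipeline, SupportsModuleOffload):
        ...  # read the three declared lists directly
    else:
        ...  # legacy hardcoded attribute-name scan
```

The hooks swap the DiT and encoder groups against each other, with only one group resident on GPU at a time, which is why both lists must be declared explicitly.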

Also rename PipelineModules.vae to PipelineModules.vaes (a list) to
support pipelines with multiple VAEs (e.g. LTX2's audio_vae,
DreamIDOmni's vae_model_audio). Both the sequential and layerwise
offload backends are updated to iterate over the list.
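
A sketch of the container change (only the vae-to-vaes rename and the list iteration are from this PR; the dataclass shape and the other field names are assumptions):

```python
from dataclasses import dataclass, field

import torch.nn as nn


@dataclass
class PipelineModules:
    dit: nn.Module | None = None                          # placeholder field name
    encoders: list[nn.Module] = field(default_factory=list)
    vaes: list[nn.Module] = field(default_factory=list)   # was: a single vae


def offload_vaes(modules: PipelineModules) -> None:
    # both offload backends now iterate instead of assuming one VAE
    for vae in modules.vaes:
        vae.to("cpu")
```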

Behavioral changes from unifying collection logic into
_collect_modules (a sketch follows the list):

  • Encoder collection now checks isinstance(nn.Module) (the original
    did not), preventing non-Module objects from reaching .to(device).
  • Encoder collection now deduplicates (the original did not), avoiding
    double hook registration when two attrs point to the same module.
  • Non-Module attributes trigger a warning when declared via the
    protocol (a pipeline authoring bug) and are silently skipped in the
    fallback path.
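
A minimal sketch of the unified collector, assuming this signature (the isinstance check, the deduplication, and the warn-versus-skip split are the behaviors listed above; the parameter names and logger are illustrative):

```python
import logging
import operator

import torch.nn as nn

logger = logging.getLogger(__name__)


def _collect_modules(pipeline, attr_names, *, declared: bool) -> list[nn.Module]:
    """Resolve attribute names into a deduplicated list of nn.Modules."""
    modules: list[nn.Module] = []
    seen: set[int] = set()
    for name in attr_names:
        try:
            obj = operator.attrgetter(name)(pipeline)  # dotted paths supported
        except AttributeError:
            continue  # attribute absent on this pipeline
        if not isinstance(obj, nn.Module):
            if declared:
                # a declared non-Module is a pipeline authoring bug: warn
                logger.warning("offload attr %r is not an nn.Module", name)
            continue  # fallback path stays silent, as before
        if id(obj) in seen:
            continue  # two attrs aliasing one module: avoid double hooks
        seen.add(id(obj))
        modules.append(obj)
    return modules
```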

Test Plan

```
vllm serve --omni --model OmniGen2/OmniGen2 --port 8091
vllm serve --omni --model OmniGen2/OmniGen2 --port 8091 --enable-cpu-offload
# send image generation requests
```

Test Result

| Config | Peak Reserved | Peak Allocated | Gen Time (steady) | Status |
| --- | --- | --- | --- | --- |
| upstream/main, no offload | 19.33 GB | 15.20 GB | 3.69s | Works |
| upstream/main, --enable-cpu-offload | n/a | n/a | n/a | CRASH: No encoder modules found, model stays on CPU |
| Our branch, no offload | 19.33 GB | 15.20 GB | 3.69s | Works (same as upstream) |
| Our branch, --enable-cpu-offload | 8.87 GB | 8.20 GB | 8.84s | Works, transformer <-> mllm mutual exclusion active |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands, or state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@NickCao NickCao requested a review from hsliuustc0106 as a code owner April 1, 2026 19:09
@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 0f85f38 to 8876037 Compare April 1, 2026 19:20
hsliuustc0106 (Collaborator) commented:

@yuanheng-zhao PTAL

gcanlin (Collaborator) commented Apr 2, 2026

Could you add a column in your table to show the current main branch memory and performance?

yuanheng-zhao (Contributor) left a comment

Thanks for contributing. It's good to have SupportsModuleOffload as an interface to adapt module-level offloading for new models more flexibly. Left some comments.

Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
Comment thread: vllm_omni/diffusion/models/interface.py
NickCao (Contributor, Author) commented Apr 2, 2026

> Could you add a column in your table to show the current main branch memory and performance?

Updated the table. On the main branch, OmniGen2 actually crashes with --enable-cpu-offload, since the modules end up on the wrong devices.

lishunyang12 (Collaborator) left a comment

Left a couple of small things; overall the approach looks good.

Comment thread: vllm_omni/diffusion/offloader/module_collector.py
Comment thread: vllm_omni/diffusion/offloader/module_collector.py
@NickCao NickCao force-pushed the fix/offload-module-discovery branch 3 times, most recently from 5409f46 to 1699909 Compare April 2, 2026 15:45
hsliuustc0106 (Collaborator) commented:

Add tests since it introduces a new class. Does it affect API and UX?

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 1699909 to b65fe3d Compare April 3, 2026 14:15
NickCao (Contributor, Author) commented Apr 3, 2026

> Add tests since it introduces a new class. Does it affect API and UX?

Unit test added. This PR alone should not affect UX or the external API, though it does affect model authors. Once all models are migrated to the explicit path, we can drop the fallback and throw an error when offload is enabled on an unsupported model rather than crashing; that's when UX would improve.

NickCao (Contributor, Author) commented Apr 3, 2026

Also: I find SupportsModuleOffload not very descriptive. What do you think would be better? SupportsSequentialOffload?

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from b65fe3d to f4ffc03 Compare April 9, 2026 14:43
NickCao (Contributor, Author) commented Apr 9, 2026

Rebased; resolved conflicts with #2339.

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from f4ffc03 to 67d39bd Compare April 15, 2026 19:11
yuanheng-zhao (Contributor) left a comment

LGTM. This PR will be helpful for other models with uncommon attr names and multiple VAE/encoder components as well.

Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
NickCao (Contributor, Author) commented Apr 20, 2026

Added support for nested modules and declared SupportsModuleOffload for Bagel and LTX2.

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 4970b73 to 37354be Compare April 20, 2026 13:58
NickCao (Contributor, Author) commented Apr 20, 2026

> can you help update the add diffusion model skill for this refactor?

I see that there are skills in both .claude/skills of this repo and https://github.com/hsliuustc0106/vllm-omni-skills; which one should be considered the authoritative source?

hsliuustc0106 (Collaborator) commented:

> can you help update the add diffusion model skill for this refactor?
>
> I see that there are skills in both .claude/skills of this repo and https://github.com/hsliuustc0106/vllm-omni-skills; which one should be considered the authoritative source?

This repo, please.

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 20, 2026
@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 0e1ecc5 to c8099c1 Compare April 20, 2026 15:08
NickCao (Contributor, Author) commented Apr 20, 2026

> can you help update the add diffusion model skill for this refactor?

Done.

yuanheng-zhao (Contributor) left a comment

LGTM

Comment thread: vllm_omni/diffusion/models/interface.py (outdated)
Comment thread: vllm_omni/diffusion/offloader/module_collector.py (outdated)
NickCao (Contributor, Author) commented Apr 21, 2026

```
(APIServer pid=315) ERROR 04-21 13:42:57 [stage_config.py:272] Failed to import pipeline module 'vllm_omni.model_executor.models.voxcpm2.pipeline' for 'voxcpm2': No module named 'librosa'
```

Huh, why's librosa back?

NickCao (Contributor, Author) commented Apr 21, 2026

#2996

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 61cd06f to fef0586 Compare April 21, 2026 16:04
yuanheng-zhao (Contributor) commented:

CI failed; please help take a look @NickCao, @tjtanaa.

NickCao (Contributor, Author) commented Apr 22, 2026

> CI failed; please help take a look @NickCao, @tjtanaa.

It's due to the Hugging Face rate limit; could anyone restart it?

```
33948:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
34378:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json
34586:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json (Request ID: Root=1-69e7a2f7-65d3f9a20a38256813385be2;bf03e6ec-a19a-49b0-bcd0-5e1d2c80e127)
34783:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/r...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
35213:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json
35421:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json (Request ID: Root=1-69e7a2f8-14d95cdf441882464ab0fa0a;0ef66b86-7c00-4e80-bf17-5bfc0c9e7fb8)
35618:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/r...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
36048:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json
36256:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json (Request ID: Root=1-69e7a2f8-67ec549d7e332e2c4b4f789e;500fe18f-2458-4000-9393-79904fb3b5c5)
36453:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
36883:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json
37091:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json (Request ID: Root=1-69e7a2f8-01b997bc2203b8de397d87f4;4d81b8bd-be83-4c0a-959c-548d6f9e3ee8)
37288:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
37718:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json
37926:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/config.json (Request ID: Root=1-69e7a2f9-5d606b362a8fedf818755b96;c89ff1f8-d55d-4aa8-9734-026aeceed648)
38123:Too Many Requests for url: https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/main/...00 resolvers requests per 5 minutes period. Check with HF support to work around the issue or get even higher limits.')
38553:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json
38761:Too Many Requests for url: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/resolve/main/config.json (Request ID: Root=1-69e7a2f9-2660c87f14c748352a776090;cfe64865-4109-41f2-a52e-f04f25d352d7)
```

yuanheng-zhao (Contributor) commented:

Weird, they were happening during the CPU unit tests. I checked several recent commits and didn't find the same rate-limit issue. Could you merge main to trigger CI again?

@NickCao NickCao force-pushed the fix/offload-module-discovery branch from fef0586 to 288db29 Compare April 22, 2026 13:08
NickCao (Contributor, Author) commented Apr 22, 2026

The doc failure seems unrelated? Can't tell.

NickCao and others added 10 commits April 22, 2026 09:39
…pportsModuleOffload

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Add SupportsModuleOffload to OmniGen2Pipeline so ModuleDiscovery
can find the Qwen2.5-VL text encoder ('mllm', ~6-16 GB) for
sequential CPU offload. Previously, 'mllm' was not in the hardcoded
attribute scan list, so enable_cpu_offload silently left it on GPU
during the entire denoising loop.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Allow dotted attribute names (e.g. "pipe.transformer") in
_dit_modules, _encoder_modules, and _vae_modules to resolve
nested modules via operator.attrgetter.  This handles pipelines
like LTX2TwoStagesPipeline where the transformer lives under a
child pipeline (pipe.transformer), and Bagel where the encoder
is at language_model.model.

Flat attribute names continue to work unchanged.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
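
For illustration, a dotted declaration and the resolution step (the attribute paths are the ones these commit messages name; the class body and helper are sketches):

```python
import operator

import torch.nn as nn


class LTX2TwoStagesPipeline:  # illustrative body
    _dit_modules = ["pipe.transformer"]       # nested under a child pipeline
    _encoder_modules = ["pipe.text_encoder"]
    _vae_modules = ["pipe.vae", "pipe.audio_vae"]


def resolve(pipeline, name: str) -> nn.Module:
    # attrgetter walks dotted paths; flat names resolve exactly as before
    return operator.attrgetter(name)(pipeline)
```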
…offload

Add _resident_modules class variable to SupportsModuleOffload for
small submodules that must stay on GPU during layer-wise offloading
(e.g. embedders, connectors).  Defaults to empty list.

During layerwise offload, pipelines load everything to CPU and the
offloader selectively moves dit/encoder/vae groups to GPU.  Modules
outside these groups stay on CPU, which breaks pipelines like Bagel
where time_embedder, vae2llm, vit_model etc. are needed every
forward pass but are not children of any discovered group.

_resident_modules lets pipelines declare these modules explicitly.
The layerwise backend pins them on GPU alongside encoders and VAEs.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
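
A sketch of a declaration using it, with the Bagel module names this series lists (the 'bagel.' dotted prefixes are an assumption based on "under self.bagel" in the commit below; the class body is illustrative):

```python
class BagelPipeline:  # illustrative body
    _dit_modules = ["language_model.model"]
    _resident_modules = [  # pinned on GPU throughout layerwise offload
        "bagel.time_embedder",
        "bagel.vae2llm",
        "bagel.llm2vae",
        "bagel.latent_pos_embed",
        "bagel.vit_model",
        "bagel.connector",
        "bagel.vit_pos_embed",
    ]
```

Since _resident_modules defaults to an empty list, pipelines that need no pinning are unaffected.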
Add 'To Support a Model' section under model-level offloading showing
how to implement the SupportsModuleOffload protocol. Restore the
layerwise 'To Support a Model' section under its own parent. Update
the Module Discovery section to document both protocol-based and
fallback attribute scan discovery paths.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
LTX2 two-stage pipelines have nested module structure where the
DiT, encoders, and VAEs live under self.pipe.  The fallback
attribute scan cannot find them, causing layerwise offloading
to skip DiT discovery entirely.

Implement SupportsModuleOffload on LTX2TwoStagesPipeline and
LTX2ImageToVideoTwoStagesPipeline using dotted paths to reach
nested modules (pipe.transformer, pipe.text_encoder, pipe.vae,
pipe.audio_vae).

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
BagelPipeline has non-standard module layout: the DiT lives at
language_model.model, and several small modules under self.bagel
(time_embedder, vae2llm, llm2vae, latent_pos_embed, vit_model,
connector, vit_pos_embed) are needed every forward pass but are
not children of the DiT.

Implement SupportsModuleOffload with _resident_modules to pin
these small modules on GPU during layerwise offloading.  Without
this, they stay on CPU (offload pipelines skip self.to(device))
and forward() fails with device mismatch.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Add Step 11 (CPU Offload Support) covering SupportsModuleOffload
protocol: _dit_modules, _encoder_modules, _vae_modules,
_resident_modules, dotted path support.

Add cpu_offload_diffusion.md to Step 7 required docs list.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Cao <ncao@redhat.com>
@NickCao NickCao force-pushed the fix/offload-module-discovery branch from 288db29 to 7df49c3 Compare April 22, 2026 13:39
@lishunyang12 lishunyang12 merged commit e3b0afb into vllm-project:main Apr 22, 2026
8 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
…upportsModuleOffload (vllm-project#2427)

Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>

Labels

ready label to trigger buildkite CI
