[feat]: General diffusers adapter backend to run diffusion models#2724
🔴 Pre-commit gate failing. Fix before requesting review.
This is a substantial PR (1445 LOC, 12 files). After fixing gates, please run L3 tests locally and paste results in PR description: https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/test_guide/#l3-level--l4-level
Test Plan/Test Result sections show "Drafting" - need concrete validation evidence for this feature.
sayakpaul
left a comment
Thanks a lot for the RFC. @DN6 please take a look as well.
Do we think http://hf.co/blog/modular-diffusers would be a better base candidate for the integration?
| Any model loadable via `DiffusionPipeline.from_pretrained()` is supported, including: | ||
| - **Text-to-Image:** SD 1.5, SD 2.1, SDXL, PixArt-Σ, Kandinsky, DeepFloyd IF |
We could probably mention more recent variants such as Flux, QwenImage, etc.?
| The diffusers backend is a black-box adapter. The following features are NOT supported: | ||
| - CFG parallel execution | ||
| - Sequence parallel execution |
We should be able to depend on Diffusers' extensive CP support for this no?
https://huggingface.co/docs/diffusers/main/en/training/distributed_inference#context-parallelism
Thanks for the info! I thought it was done only externally by xdit. But for these parallelism features, I will also need to confirm whether it plays well with our architecture
| - CFG parallel execution | ||
| - Sequence parallel execution | ||
| - TeaCache / Cache-DiT acceleration |
https://huggingface.co/docs/diffusers/main/en/optimization/cache
CacheDiT is supported too: https://github.com/vipshop/cache-dit?tab=readme-ov-file#quick-start-cache-parallelism-and-quantization
TeaCache is incoming: huggingface/diffusers#12652
Cc: @DN6 we should probably prioritize that PR?
Thanks for the clarification. I also learned that it is possible to turn on these features. Apart from Cache-DIT, there seem to be also:
- dtype & quantization
- cpu offloading
- Attention backend
- VAE sliding and tiling
- Torch compile (eagerness)
Yup.
Then there's this concept of regional compilation which provides a trade-off:
https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/
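The toggles listed in this thread (CPU offloading, VAE slicing and tiling, regional compilation) could be forwarded by a small helper that runs after the pipeline is loaded. The following is only a sketch under assumptions: `apply_toggles` is a hypothetical name, not code from this PR, and the method names it probes for (`enable_model_cpu_offload`, `vae.enable_tiling`, `vae.enable_slicing`, `compile_repeated_blocks`) are the Diffusers names as I understand them and should be checked against the installed Diffusers version. It uses duck typing so that a model lacking a given feature is skipped rather than failing.

```python
from typing import Any, List


def apply_toggles(
    pipe: Any,
    *,
    cpu_offload: bool = False,
    vae_tiling: bool = False,
    vae_slicing: bool = False,
    regional_compile: bool = False,
) -> List[str]:
    """Apply optional optimizations to an already-loaded pipeline object.

    Returns the names of the toggles that were actually applied, so the
    caller can log which requested toggles were skipped for this model.
    """
    applied: List[str] = []

    if cpu_offload and hasattr(pipe, "enable_model_cpu_offload"):
        pipe.enable_model_cpu_offload()
        applied.append("cpu_offload")

    # Not every pipeline has a VAE; hasattr(None, ...) is simply False.
    vae = getattr(pipe, "vae", None)
    if vae_tiling and hasattr(vae, "enable_tiling"):
        vae.enable_tiling()
        applied.append("vae_tiling")
    if vae_slicing and hasattr(vae, "enable_slicing"):
        vae.enable_slicing()
        applied.append("vae_slicing")

    # Regional compilation targets the denoiser (transformer or UNet) and
    # compiles only its repeated blocks instead of the whole model.
    denoiser = getattr(pipe, "transformer", None)
    if denoiser is None:
        denoiser = getattr(pipe, "unet", None)
    if regional_compile and hasattr(denoiser, "compile_repeated_blocks"):
        denoiser.compile_repeated_blocks(fullgraph=True)
        applied.append("regional_compile")

    return applied
```

Returning the applied list, rather than raising on a missing method, fits a black-box adapter: the adapter cannot know in advance which components an arbitrary pipeline exposes.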
> TeaCache is incoming: huggingface/diffusers#12652
> Cc: @DN6 we should probably prioritize that PR?
You can take your time on your TeaCache support :)
After a careful study of both codebases, I think the support for caching in the adapter layer is non-trivial. It can be deferred to a later PR. Put some notes here #2403 (comment)
| # Step-wise execution — explicitly rejected | ||
| # ------------------------------------------------------------------ | ||
| def prepare_encode(self, state: Any, **kwargs: Any) -> Any: |
Well, our pipelines are implemented in such a way that we can compute the text encodings first and then use the precomputed encodings for denoising + decoding.
Thanks for the info. "Step-wise execution" is a new experimental feature on our side. Glad to know that you also have this. Do you mean the Modular Blocks https://huggingface.co/docs/diffusers/main/en/modular_diffusers/quickstart ?
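For readers following this thread, the "explicitly rejected" behaviour in the quoted diff can be pictured roughly as below. This is a minimal sketch, not the PR's implementation: `BlackBoxAdapter` and `denoise_step` are hypothetical names introduced for illustration, and only the `prepare_encode` signature is taken from the diff. The idea is that the adapter exposes one whole-request entry point and raises on any step-wise hook instead of silently ignoring it.

```python
from typing import Any, Callable


class BlackBoxAdapter:
    """Hypothetical wrapper that only supports whole-request execution."""

    def __init__(self, pipeline: Callable[..., Any]) -> None:
        # `pipeline` is any callable with a diffusers-like __call__ signature.
        self._pipeline = pipeline

    def forward(self, prompt: str, **call_kwargs: Any) -> Any:
        # Encoding, denoising, and decoding all happen inside this one call.
        return self._pipeline(prompt=prompt, **call_kwargs)

    def _reject(self, name: str) -> None:
        raise NotImplementedError(
            f"{name} is not supported: the wrapped pipeline is a black box "
            "and only whole-request execution is available."
        )

    # Step-wise hooks fail loudly so a scheduler cannot half-run a request.
    def prepare_encode(self, state: Any, **kwargs: Any) -> Any:
        self._reject("prepare_encode")

    def denoise_step(self, state: Any, **kwargs: Any) -> Any:
        # Hypothetical hook name, shown only to illustrate the pattern.
        self._reject("denoise_step")
```

If precomputed text encodings were later passed through (as suggested in the review), `forward` could accept them as ordinary call kwargs without changing the guard methods.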
Thanks @DN6 for adding to the review! Based on both of your reviews, I have added some discussion of feature support and adaptation to the companion issue page #2403 accordingly. Maybe I can also paste it here:
Doable in the first PR
To avoid unnecessary complication, only support an optimization feature if
Specifically, these optimization toggles can be added:
Optionally, I can figure out how to integrate the Modular Pipeline to check input/output modalities.
Deferred to sequel PRs
TODO list for this draft
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a general Hugging Face Diffusers adapter backend so vLLM-Omni can serve arbitrary DiffusionPipeline.from_pretrained() models via a new diffusers diffusion load format.
Changes:
- Added `DiffusersAdapterPipeline` plus loader/registry/config wiring to enable `diffusion_load_format=diffusers`.
- Exposed CLI + stage-config knobs to pass through `from_pretrained()` and `pipeline.__call__()` kwargs.
- Added unit + e2e coverage and an online serving example for the adapter workflow.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| vllm_omni/entrypoints/cli/serve.py | Adds CLI flags for selecting the diffusers load format + passing JSON kwargs. |
| vllm_omni/engine/async_omni_engine.py | Plumbs diffusers kwargs into the default diffusion stage config. |
| vllm_omni/diffusion/worker/diffusion_model_runner.py | Changes default load_format handling for diffusion runner. |
| vllm_omni/diffusion/registry.py | Registers DiffusersAdapterPipeline in the diffusion model registry. |
| vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py | Implements the black-box adapter around DiffusionPipeline. |
| vllm_omni/diffusion/models/diffusers_adapter/__init__.py | Exposes adapter pipeline symbol for imports. |
| vllm_omni/diffusion/model_loader/diffusers_loader.py | Adds loader branch to construct/load the adapter pipeline. |
| vllm_omni/diffusion/data.py | Adds config fields + validation + enrich behavior for diffusers adapter. |
| tests/e2e/online_serving/test_diffusers_adapter.py | E2E coverage for serving and calling a diffusers-backed model. |
| tests/diffusion/test_diffusers_adapter.py | Unit tests for adapter guards, kwargs mapping, and output wrapping. |
| examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml | Example stage config enabling the diffusers adapter. |
| examples/online_serving/diffusers_pipeline_adapter/README.md | Usage docs and limitations for the adapter workflow. |
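
The registry and loader rows above describe how the load format selects the adapter. One way to picture that wiring is the sketch below. Everything in it is an assumption for illustration: `REGISTRY`, `resolve_pipeline_cls`, and the two stand-in classes are hypothetical and do not mirror the actual code in `registry.py` or `diffusers_loader.py`. Only the `diffusers` load-format value and the `DiffusersAdapterPipeline` name come from the PR.

```python
from typing import Dict, Type


class NativePipeline:
    """Stand-in for a natively implemented pipeline (hypothetical)."""


class AdapterPipeline:
    """Stand-in for the black-box diffusers adapter (hypothetical)."""


# Hypothetical registry: architecture name -> pipeline class.
REGISTRY: Dict[str, Type] = {
    "NativePipeline": NativePipeline,
    "DiffusersAdapterPipeline": AdapterPipeline,
}


def resolve_pipeline_cls(arch: str, load_format: str = "default") -> Type:
    """Pick the adapter when load_format is 'diffusers', else look up arch."""
    key = "DiffusersAdapterPipeline" if load_format == "diffusers" else arch
    try:
        return REGISTRY[key]
    except KeyError:
        raise ValueError(f"No pipeline registered for {key!r}") from None
```

With this shape, an architecture that has no native implementation still resolves when the `diffusers` load format is requested, which matches the adapter's stated goal of serving arbitrary `from_pretrained()` models.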
Notes
Feature (Optimization+Parallelism) Support
This PR only enables basic backend adaptation and the simplest feature toggles. As listed in the RFC, forwarding of some features may be deferred to later PRs. The above comments on these deferred features are deliberately left unresolved for future reference.
YAML Config
#2383 and #2887 were working on refactoring the YAML config system. The "new" config system still seems to have some problems passing diffusion-specific configurations. Since that config work is still ongoing and the old config system works well, this PR still uses the old config system (i.e., "stage config" instead of "deploy config"). Once the config system work settles, the relevant content here can be updated.
Perf
Running Qwen-Image
With vllm-omni + diffusers backend:
Note that the diffusers backend is a black box. We can only get one total time. Everything is counted in
Running with native backend:
Commits (each Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>):
- unit test for diffusers pipeline argument passing
- L2 e2e test (random weight model, only e2e infer)
- BUGFIX
- bugfix
- ensure pipeline device is correctly set
- fix generator not set if seed not present
- adjust doc
- change test model
- improve type check
- fix wrong function call
- attn backend not read
- revert irrelevant changes
- typo
- add hardware mark
- optimize CLI default arg per AI comment
- revert irrelevant changes
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@hsliuustc0106 @Gaohan123 This PR is ready, PTAL and add a ready tag. Thanks
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
| @@ -0,0 +1,31 @@ | |||
| # Example stage config for diffusers backend | |||
When are we going to remove this YAML?
Previously I could not forward some diffusion engine_args under the new config system (from the deploy YAML to OmniDiffusionConfig). I planned to wait for #2987, but saw that it was closed yesterday. I can look into this further and see whether I can get the new config system working.
I didn't see any test results here: how does vllm-omni successfully serve a model using the diffusion backend? Have you compared the accuracy/perf against diffusers? How many models are supported now? Are there any doc updates covering this information? Correct me if I missed something; otherwise, I suggest reverting this PR.
| vllm serve "stable-diffusion-v1-5/stable-diffusion-v1-5" \ | ||
| --omni \ | ||
| --diffusion-load-format diffusers \ | ||
| --diffusers-load-kwargs '{"use_safetensors": true}' \ |
Can we reuse the kwargs from the existing vllm serve CLI args instead of introducing three more args? I suggest keeping only one.
--diffusion-load-format is already there. I reuse it and add a new value. --diffusers-load-kwargs and --diffusers-call-kwargs are pass-throughs so that when a specific model has any niche parameters, users have a fallback way to set them
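To make the pass-through idea concrete, here is a minimal sketch of how JSON strings from `--diffusers-load-kwargs` and `--diffusers-call-kwargs` could be parsed and layered over backend defaults. The helper names `parse_json_kwargs` and `merge_kwargs` are hypothetical and are not taken from this PR; the sketch only assumes that each flag carries a JSON object, as in the `'{"use_safetensors": true}'` example quoted above.

```python
import json
from typing import Any, Dict, Optional


def parse_json_kwargs(raw: Optional[str], flag: str) -> Dict[str, Any]:
    """Parse a CLI JSON string into a kwargs dict, failing early on bad input."""
    if not raw:
        # Flag omitted or empty: nothing to pass through.
        return {}
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{flag} must be valid JSON: {exc}") from exc
    if not isinstance(value, dict):
        raise ValueError(
            f"{flag} must be a JSON object, got {type(value).__name__}"
        )
    return value


def merge_kwargs(
    defaults: Dict[str, Any], overrides: Dict[str, Any]
) -> Dict[str, Any]:
    """User-supplied values win over backend defaults; inputs are not mutated."""
    return {**defaults, **overrides}
```

Validating at argument-parsing time means a malformed flag fails at server startup rather than on the first request, which matters when the kwargs are forwarded untouched into a third-party pipeline.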
I have a perf comparison with vllm-omni above. Indeed, there is no acc/perf comparison with diffusers yet, nor a model coverage list. I will work on them now and attach them later.
…lm-project#2724) Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@hsliuustc0106 Quick update on a test with Qwen Image:
Accuracy
Perf
Somehow, our backend is even faster than the bare-bones library 🤔
The setup is as follows:
Ours (the preprocess and postprocess are all included in the forward run)
Diffusers (following our defaults of bfloat16 dtype and use_safetensors)
I'll


Purpose
Fulfill #2403
Test Plan
L1 Unit test (no GPU):
L2 test with GPU:
Test Result
Passed on my side
Release Note
`--diffusion-load-format diffusers`. Check out the documentation here
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.
BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)