
[feat]: General diffusers adapter backend to run diffusion models #2724

Merged
Gaohan123 merged 5 commits into vllm-project:main from fhfuih:diffusers-backend on Apr 22, 2026

Conversation

@fhfuih
Contributor

@fhfuih fhfuih commented Apr 13, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Fulfill #2403

Test Plan

L1 unit tests (no GPU):

  • DiffusersAdapterPipeline forwards the underlying DiffusionPipeline's output
  • DiffusersAdapterPipeline raises when the user requests diffusion features that are not (yet) supported
  • When the underlying DiffusionPipeline returns an arbitrary data structure (typically when calling __call__ with return_dict=True), the adapter wraps that output as-is with our diffusionoutput
  • DiffusersAdapterPipeline correctly forwards call arguments

L2 test (with GPU):

  • Running a random-weight Qwen Image model with the diffusers pipeline
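
The unit-test items above can be illustrated without a GPU by swapping the real pipeline for a recording fake. The following is only a sketch of that idea: `BlackBoxAdapter`, `WrappedOutput`, `FakePipeline`, and the feature names in `UNSUPPORTED` are hypothetical stand-ins, not the classes added by this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class WrappedOutput:
    """Hypothetical stand-in for the serving layer's output container."""
    output: Any


class BlackBoxAdapter:
    """Hypothetical adapter that forwards calls to an opaque pipeline callable."""

    # Features that need access to pipeline internals, so a black box refuses them.
    UNSUPPORTED = frozenset({"cfg_parallel", "sequence_parallel", "step_wise"})

    def __init__(self, pipeline: Callable[..., Any], default_call_kwargs: Optional[dict] = None):
        self.pipeline = pipeline
        self.default_call_kwargs = dict(default_call_kwargs or {})

    def forward(self, prompt: str, requested_features=(), **call_kwargs: Any) -> WrappedOutput:
        blocked = set(requested_features) & self.UNSUPPORTED
        if blocked:
            raise NotImplementedError(f"not supported by the black-box adapter: {sorted(blocked)}")
        # Per-request kwargs override the configured defaults.
        merged = {**self.default_call_kwargs, **call_kwargs}
        # Wrap whatever the pipeline returns without inspecting or converting it.
        return WrappedOutput(output=self.pipeline(prompt, **merged))


class FakePipeline:
    """Records calls so a test can check what was forwarded."""

    def __init__(self):
        self.calls = []

    def __call__(self, prompt, **kwargs):
        self.calls.append((prompt, kwargs))
        return {"images": ["<image>"], "prompt": prompt}


fake = FakePipeline()
adapter = BlackBoxAdapter(fake, {"num_inference_steps": 20})
result = adapter.forward("a cat", num_inference_steps=8, height=512)
print(fake.calls)
print(result.output)
```

Keeping the fake as a plain callable means the same test shape works for any pipeline class, which matches the black-box framing of the adapter.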

Test Result

Passed on my side

Release Note

  • Support running diffusion models with diffusers backend. Turn this feature on with --diffusion-load-format diffusers. Check out the documentation here

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@hsliuustc0106
Collaborator

🔴 Pre-commit gate failing. Fix before requesting review.

This is a substantial PR (1445 LOC, 12 files). After fixing gates, please run L3 tests locally and paste results in PR description:

https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/test_guide/#l3-level--l4-level

Test Plan/Test Result sections show "Drafting" - need concrete validation evidence for this feature.

@fhfuih fhfuih force-pushed the diffusers-backend branch from 358340e to 2005b0f Compare April 15, 2026 03:12

@sayakpaul sayakpaul left a comment


Thanks a lot for the RFC. @DN6 please take a look as well.

Do we think http://hf.co/blog/modular-diffusers would be a better base candidate for the integration?


Any model loadable via `DiffusionPipeline.from_pretrained()` is supported, including:

- **Text-to-Image:** SD 1.5, SD 2.1, SDXL, PixArt-Σ, Kandinsky, DeepFloyd IF

We could probably mention more recent variants such as Flux, QwenImage, etc.?

The diffusers backend is a black-box adapter. The following features are NOT supported:

- CFG parallel execution
- Sequence parallel execution

We should be able to depend on Diffusers' extensive CP support for this no?
https://huggingface.co/docs/diffusers/main/en/training/distributed_inference#context-parallelism

Contributor Author


Thanks for the info! I thought it was done only externally by xdit. For these parallelism features, I will also need to confirm whether they play well with our architecture.


We do support CP natively :)


- CFG parallel execution
- Sequence parallel execution
- TeaCache / Cache-DiT acceleration

Contributor Author


Thanks for the clarification. I also learned that it is possible to turn on these features. Apart from Cache-DiT, there also seem to be:

  • dtype & quantization
  • CPU offloading
  • Attention backend
  • VAE slicing and tiling
  • Torch compile (eagerness)


Yup.

Then there's this concept of regional compilation which provides a trade-off:
https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/

Contributor Author

@fhfuih fhfuih Apr 16, 2026


TeaCache is incoming: huggingface/diffusers#12652

Cc: @DN6 we should probably prioritize that PR?

You can take your time on your TeaCache support :)

After a careful study of both codebases, I think supporting caching in the adapter layer is non-trivial. It can be deferred to a later PR. I put some notes in #2403 (comment).

Comment thread examples/online_serving/diffusers_pipeline_adapter/README.md Outdated
Comment thread examples/online_serving/diffusers_pipeline_adapter/README.md Outdated
Comment thread vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py Outdated
Comment thread vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py Outdated
Comment thread vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py Outdated
# Step-wise execution — explicitly rejected
# ------------------------------------------------------------------

def prepare_encode(self, state: Any, **kwargs: Any) -> Any:

Well, our pipelines are implemented in a way, where we could compute text encodings and then use the precomputed text encodings for denoising + decoding.

Contributor Author


Thanks for the info. The "Step-wise execution" is a new experimental feature on our side. Glad to know that you also have this. Do you mean the Modular Blocks https://huggingface.co/docs/diffusers/main/en/modular_diffusers/quickstart ?


yup

Comment thread vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py Outdated
Comment thread vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py Outdated
Comment thread vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py Outdated
@fhfuih
Contributor Author

fhfuih commented Apr 17, 2026

Thanks @DN6 for adding to the review! Based on both of your reviews, I have added some discussion of feature support and adaptation to the companion issue page #2403 accordingly.

Maybe I can also paste it here:

Doable in the first PR

To avoid unnecessary complication, only support an optimization feature if

  1. It can be turned on with a simple toggle/config of the pipeline object
  2. Its main business logic (if only) is within the boundary of a pipeline object, both for diffusers and for vllm-omni

Specifically, these optimization toggles can be added:

  • Setting model dtype: trivial pipeline load-time configuration
  • VAE slicing and tiling: trivial pipeline load-time configuration
  • Attention Backend: vllm-omni defines external attention classes, creates and calls them inside pipeline's transformer modules. A complete pipeline rewrite skips vllm-omni attention utilities, and we can forward attention backend configuration to underlying diffusers pipeline classes.
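
As a rough illustration of what "a simple toggle of the pipeline object" could look like, the sketch below applies the toggles through duck typing. `apply_load_time_toggles` and the stub classes are hypothetical. The method names `enable_slicing` and `enable_tiling` follow the diffusers VAE API, and `set_attention_backend` is the call used in the benchmark script later in this thread; whether a given pipeline exposes them is checked at run time rather than assumed.

```python
from types import SimpleNamespace


def apply_load_time_toggles(pipe, *, vae_slicing=False, vae_tiling=False, attention_backend=None):
    """Apply simple toggles to an already-loaded pipeline; return the ones that took effect."""
    applied = []
    vae = getattr(pipe, "vae", None)
    if vae_slicing and hasattr(vae, "enable_slicing"):
        vae.enable_slicing()
        applied.append("vae_slicing")
    if vae_tiling and hasattr(vae, "enable_tiling"):
        vae.enable_tiling()
        applied.append("vae_tiling")
    transformer = getattr(pipe, "transformer", None)
    if attention_backend is not None and hasattr(transformer, "set_attention_backend"):
        transformer.set_attention_backend(attention_backend)
        applied.append(f"attention_backend={attention_backend}")
    return applied


class StubVAE:
    """Stand-in for a VAE module; only records which toggles were called."""

    def __init__(self):
        self.slicing = False
        self.tiling = False

    def enable_slicing(self):
        self.slicing = True

    def enable_tiling(self):
        self.tiling = True


class StubTransformer:
    """Stand-in for a transformer module; only records the chosen backend."""

    def __init__(self):
        self.backend = None

    def set_attention_backend(self, name):
        self.backend = name


stub_pipe = SimpleNamespace(vae=StubVAE(), transformer=StubTransformer())
applied = apply_load_time_toggles(stub_pipe, vae_slicing=True, attention_backend="_native_flash")
print(applied)
```

Returning the list of applied toggles lets the caller log or warn about options that were requested but silently unavailable on a particular pipeline.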

Optionally, I can figure out how to integrate the Modular Pipeline to check input/output modalities.

Deferred to sequel PRs

  • torch.compile: requires finding and configuring transformer blocks. vllm-omni enables it in the model runner (a direct wrapper for pipelines). The logic is straightforward: look for the transformers and transformers_2 blocks and torch.compile them. We need to test whether the model data structures are the same and whether these blocks are discoverable with the current implementation.
  • CacheDiT: vllm-omni enables caching (calling cache_dit.enable_cache(pipe)) in the model runner (a wrapper for pipelines), and our CacheDiTBackend wraps the cache_dit library with extra validation routines. Theoretically, once DiffusionModelRunner::pipe is loaded as a diffusers pipeline, the logic to enable CacheDiT is already there. But testing and validation are definitely required.
  • Quant: The load-time quantization logic is different. vllm-omni loads the model weights into our customized XPipeline classes and then post-processes the quantization configuration. For diffusers, it is a bundled config kwarg passed to from_pretrained. Enabling it would require modifications to the DiffusersPipelineLoader.
  • Context Parallel ("Sequence Parallel" in vllm-omni): vllm-omni borrows the hook system and CP implementation from diffusers. The hooks are applied at model weight-loading time. To enable it, we need to either (1) check whether the current SP is compatible with the diffusers pipeline format, or (2) bypass the vllm-omni SP implementation, enable diffusers' CP, and see whether the parallelism plays well with vllm-omni's higher-level orchestration layers.
  • "Block"-wise inference ("Step"-wise execution in vllm-omni): The new diffusers Modular Pipeline also supports this. But since the step-wise execution in vllm-omni is even more experimental, this can be deferred to a future PR.
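
The torch.compile item above boils down to "find named sub-modules on the pipeline and replace them with compiled versions". A minimal sketch of that discovery step follows; `compile_named_modules` is a hypothetical helper, the attribute names are passed in by the caller because they are not verified here, and `compile_fn` stands in for `torch.compile` so the example runs without PyTorch.

```python
from types import SimpleNamespace


def compile_named_modules(pipe, compile_fn, attr_names):
    """Replace each discoverable attribute with compile_fn(attribute); return the names found."""
    compiled = []
    for name in attr_names:
        module = getattr(pipe, name, None)
        if module is None:
            # Not every pipeline has every block; skip silently.
            continue
        setattr(pipe, name, compile_fn(module))
        compiled.append(name)
    return compiled


# Stand-in objects: strings instead of real modules, a tagging function instead of torch.compile.
demo_pipe = SimpleNamespace(transformer="T", vae="V")
# The attribute names below are an assumption for illustration.
found = compile_named_modules(demo_pipe, lambda m: ("compiled", m), ("transformer", "transformer_2"))
print(found, demo_pipe.transformer, demo_pipe.vae)
```

With a real pipeline one would pass `torch.compile` as `compile_fn`; whether the resulting modules behave correctly under the adapter is exactly the testing the list item says is still needed.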

@fhfuih fhfuih force-pushed the diffusers-backend branch from 2005b0f to 292d5fa Compare April 17, 2026 02:34
@fhfuih
Contributor Author

fhfuih commented Apr 20, 2026

TODO list for this draft

  • Adapt GPU device settings
  • Remove irrelevant load-time and run-time settings
  • Add basic profiling timing
  • Go over the AI-generated tests
  • Attach image output, profiling info, and memory traces to this PR

@fhfuih fhfuih force-pushed the diffusers-backend branch from 5678c56 to 85371a3 Compare April 20, 2026 02:52
@fhfuih fhfuih force-pushed the diffusers-backend branch 2 times, most recently from 2012d67 to a36cf33 Compare April 20, 2026 13:11
@fhfuih fhfuih marked this pull request as ready for review April 21, 2026 02:06
@fhfuih fhfuih requested a review from hsliuustc0106 as a code owner April 21, 2026 02:06
Copilot AI review requested due to automatic review settings April 21, 2026 02:06

Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a general Hugging Face Diffusers adapter backend so vLLM-Omni can serve arbitrary DiffusionPipeline.from_pretrained() models via a new diffusers diffusion load format.

Changes:

  • Added DiffusersAdapterPipeline plus loader/registry/config wiring to enable diffusion_load_format=diffusers.
  • Exposed CLI + stage-config knobs to pass through from_pretrained() and pipeline.__call__() kwargs.
  • Added unit + e2e coverage and an online serving example for the adapter workflow.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| vllm_omni/entrypoints/cli/serve.py | Adds CLI flags for selecting the diffusers load format + passing JSON kwargs. |
| vllm_omni/engine/async_omni_engine.py | Plumbs diffusers kwargs into the default diffusion stage config. |
| vllm_omni/diffusion/worker/diffusion_model_runner.py | Changes default load_format handling for diffusion runner. |
| vllm_omni/diffusion/registry.py | Registers DiffusersAdapterPipeline in the diffusion model registry. |
| vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py | Implements the black-box adapter around DiffusionPipeline. |
| vllm_omni/diffusion/models/diffusers_adapter/__init__.py | Exposes adapter pipeline symbol for imports. |
| vllm_omni/diffusion/model_loader/diffusers_loader.py | Adds loader branch to construct/load the adapter pipeline. |
| vllm_omni/diffusion/data.py | Adds config fields + validation + enrich behavior for diffusers adapter. |
| tests/e2e/online_serving/test_diffusers_adapter.py | E2E coverage for serving and calling a diffusers-backed model. |
| tests/diffusion/test_diffusers_adapter.py | Unit tests for adapter guards, kwargs mapping, and output wrapping. |
| examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml | Example stage config enabling the diffusers adapter. |
| examples/online_serving/diffusers_pipeline_adapter/README.md | Usage docs and limitations for the adapter workflow. |


Comment thread vllm_omni/entrypoints/cli/serve.py
Comment thread vllm_omni/diffusion/model_loader/diffusers_loader.py
Comment thread vllm_omni/diffusion/models/diffusers_adapter/pipeline_diffusers_adapter.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_model_runner.py Outdated
Comment thread tests/e2e/online_serving/test_diffusers_adapter.py Outdated
Comment thread examples/online_serving/diffusers_pipeline_adapter/README.md
Comment thread tests/diffusion/test_diffusers_adapter.py Outdated
Comment thread tests/e2e/online_serving/test_diffusers_adapter.py
@fhfuih
Contributor Author

fhfuih commented Apr 21, 2026

Notes

Feature (Optimization+Parallelism) Support

This PR only enables basic backend adaptation and the simplest feature toggles. As listed in the RFC, forwarding of some features may be deferred to later PRs. The above comments on these deferred features are deliberately left unresolved, for future reference.

YAML Config

#2383 and #2887 were working on refactoring the YAML config system. It seems the "new" config system still has some problems with passing diffusion-specific configurations. Since that work is still ongoing and the old config system works well, this PR still uses the old config system (i.e., "stage config" instead of "deploy config"). Once the config system work settles, the relevant content here can be updated.

Perf

Running Qwen-Image With vllm-omni + diffusers backend:

vllm serve /data/models/Qwen/Qwen-Image --stage-configs-path examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml --omni --port 12345 --enable-diffusion-pipeline-profiler

Note that the diffusers backend is a black box. We can only get one total time. Everything is counted in forward.

INFO 04-21 10:07:55 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] DiffusersAdapterPipeline.forward took 2.684139s
INFO 04-21 10:07:55 [diffusion_model_runner.py:213] Peak GPU memory (this request): 55.25 GB reserved, 54.84 GB allocated, 0.41 GB pool overhead (0.7%)
(APIServer pid=741892) INFO 04-21 10:07:55 [diffusion_engine.py:126] Generation completed successfully.
(APIServer pid=741892) INFO 04-21 10:07:55 [diffusion_engine.py:173] Post-processing completed in 0.0000 seconds
(APIServer pid=741892) INFO 04-21 10:07:55 [diffusion_engine.py:176] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=2689.99 ms, postprocess=0.00 ms, total=2690.32 ms

Running with native backend:

vllm serve Qwen/Qwen-Image --omni --port 12345 --enable-diffusion-pipeline-profiler

INFO 04-21 10:31:43 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.text_encoder.forward took 0.333929s
INFO 04-21 10:31:45 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.diffuse took 1.965977s
INFO 04-21 10:31:45 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.vae.decode took 0.034542s
INFO 04-21 10:31:45 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.forward took 2.341323s
INFO 04-21 10:31:45 [diffusion_model_runner.py:213] Peak GPU memory (this request): 55.11 GB reserved, 54.79 GB allocated, 0.32 GB pool overhead (0.6%)
(APIServer pid=748505) INFO 04-21 10:31:45 [diffusion_engine.py:126] Generation completed successfully.
(APIServer pid=748505) INFO 04-21 10:31:45 [diffusion_engine.py:173] Post-processing completed in 0.0176 seconds
(APIServer pid=748505) INFO 04-21 10:31:45 [diffusion_engine.py:176] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=2369.65 ms, postprocess=17.61 ms, total=2387.85 ms
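
To put the two runs side by side, the profiler lines above can be parsed and compared. The snippet below is a small sketch: `parse_stage_times` and its regular expression are hypothetical helpers written against the log format shown here, and the comparison is between two single requests, so it says nothing about run-to-run variance.

```python
import re

# Matches lines like "[DiffusionPipelineProfiler] QwenImagePipeline.diffuse took 1.965977s".
TOOK = re.compile(r"\[DiffusionPipelineProfiler\] (\S+) took ([0-9.]+)s")


def parse_stage_times(log_text):
    """Return {stage_name: seconds} for every profiler line in log_text."""
    return {name: float(seconds) for name, seconds in TOOK.findall(log_text)}


adapter_log = """
INFO 04-21 10:07:55 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] DiffusersAdapterPipeline.forward took 2.684139s
"""
native_log = """
INFO 04-21 10:31:43 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.text_encoder.forward took 0.333929s
INFO 04-21 10:31:45 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.diffuse took 1.965977s
INFO 04-21 10:31:45 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.vae.decode took 0.034542s
INFO 04-21 10:31:45 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] QwenImagePipeline.forward took 2.341323s
"""

adapter_times = parse_stage_times(adapter_log)
native_times = parse_stage_times(native_log)

adapter_total = adapter_times["DiffusersAdapterPipeline.forward"]
native_total = native_times["QwenImagePipeline.forward"]
# Relative slowdown of the adapter's forward versus the native forward.
relative_overhead = (adapter_total - native_total) / native_total
# Native time not attributed to any of the three profiled stages.
stage_sum = sum(v for k, v in native_times.items() if k != "QwenImagePipeline.forward")
unattributed = native_total - stage_sum

print(f"adapter forward is {relative_overhead:.1%} slower than native forward")
print(f"native time outside profiled stages: {unattributed * 1000:.2f} ms")
```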

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

unit test for diffusers pipeline argument passing

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

L2 e2e test (random weight model, only e2e infer)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

BUGFIX

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

bugfix

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

ensure pipeline device is correctly set

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix generator not set if seed not present

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

adjust doc

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

change test model

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

improve type check

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

fix wrong function call

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

attn backend not read

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

revert irrelevant changes

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

typo

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

add hardware mark

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

optimize CLI deault arg per AI comment

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

revert irrelevant changes

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@fhfuih fhfuih force-pushed the diffusers-backend branch from 57d37b4 to d69ecd8 Compare April 21, 2026 03:26
fhfuih added 2 commits April 21, 2026 11:32
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@fhfuih
Contributor Author

fhfuih commented Apr 21, 2026

@hsliuustc0106 @Gaohan123 This PR is ready, PTAL and add a ready tag. Thanks

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@yenuo26 yenuo26 added the ready label to trigger buildkite CI label Apr 21, 2026
Collaborator

@SamitHuang SamitHuang left a comment


LGTM

@Gaohan123 Gaohan123 merged commit d8cc7a0 into vllm-project:main Apr 22, 2026
8 checks passed
@@ -0,0 +1,31 @@
# Example stage config for diffusers backend
Collaborator


When are we going to remove this YAML?

Contributor Author


I previously could not forward some diffusion engine_args under the new config system (from deploy yaml to OmniDiffusionConfig). I planned to wait for #2987, but saw that it was closed yesterday. I can look further into this and see if I can get the new config system working.

@hsliuustc0106
Collaborator

I didn't see any test results here: how does vllm-omni successfully serve a model using the diffusion backend? Have you compared the acc/perf with diffusers? How many models are supported now? Are there any doc updates covering this info? Correct me if I missed something; otherwise, I suggest reverting this PR.

@hsliuustc0106
Collaborator

cc @Gaohan123 @SamitHuang

vllm serve "stable-diffusion-v1-5/stable-diffusion-v1-5" \
--omni \
--diffusion-load-format diffusers \
--diffusers-load-kwargs '{"use_safetensors": true}' \
Collaborator


Can we reuse the kwargs from the vllm serve CLI args instead of introducing 3 more args? I suggest keeping only one.

Contributor Author


--diffusion-load-format is already there. I reuse it and add a new value. --diffusers-load-kwargs and --diffusers-call-kwargs are pass-throughs so that when a specific model has any niche parameters, users have a fallback way to set them
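
For readers unfamiliar with JSON pass-through flags, here is a minimal sketch of how such options can be parsed. The three flag names come from this thread; the parser itself, `json_object`, and the example call kwargs are hypothetical and are not the code in vllm_omni/entrypoints/cli/serve.py.

```python
import argparse
import json


def json_object(text):
    """argparse type: accept only a JSON object and return it as a dict."""
    value = json.loads(text)
    if not isinstance(value, dict):
        raise argparse.ArgumentTypeError("expected a JSON object, e.g. '{\"key\": 1}'")
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument("--diffusion-load-format", default=None)
    # Forwarded to the pipeline loader (from_pretrained-style kwargs).
    parser.add_argument("--diffusers-load-kwargs", type=json_object, default=None)
    # Forwarded to every pipeline call unless a request overrides them.
    parser.add_argument("--diffusers-call-kwargs", type=json_object, default=None)
    return parser


args = build_parser().parse_args([
    "--diffusion-load-format", "diffusers",
    "--diffusers-load-kwargs", '{"use_safetensors": true}',
    "--diffusers-call-kwargs", '{"guidance_scale": 4.0}',  # illustrative value only
])
load_kwargs = args.diffusers_load_kwargs or {}
call_kwargs = args.diffusers_call_kwargs or {}
print(args.diffusion_load_format, load_kwargs, call_kwargs)
```

Keeping the pass-throughs as opaque dicts means the CLI does not need a dedicated flag for every model-specific parameter, which is the fallback role described above.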

@fhfuih
Contributor Author

fhfuih commented Apr 23, 2026

I didn't see any test results here: how does vllm-omni successfully serve a model using the diffusion backend? Have you compared the acc/perf with diffusers? How many models are supported now? Are there any doc updates covering this info? Correct me if I missed something; otherwise, I suggest reverting this PR.

I have a perf comparison with native vllm-omni above. Indeed, there is no acc/perf comparison with diffusers, nor a model coverage list. I will work on them now and attach them later.

qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
…lm-project#2724)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@fhfuih
Contributor Author

fhfuih commented Apr 23, 2026

@hsliuustc0106 Quick update of a test with Qwen Image:

Accuracy

| Diffusers backend | Diffusers lib |
| --- | --- |
| (image: diffusers-backend-output) | (image: diffusers-lib-output) |
>>> a = Image.open('diffusers-lib-output.png')
>>> b = Image.open('diffusers-backend-output.png')
>>> compute_image_ssim_psnr(prediction=a, reference=b)
(1.0, inf)
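
A result of `(1.0, inf)` is what pixel-identical images produce: SSIM reaches its maximum of 1.0, and PSNR, defined as 10·log10(MAX² / MSE), diverges when the mean squared error is zero. The implementation of `compute_image_ssim_psnr` is not shown in this thread; the `psnr` function below is a hypothetical minimal version over flat pixel lists, included only to show where the infinity comes from.

```python
import math


def psnr(prediction, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(prediction) != len(reference) or not prediction:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - r) ** 2 for p, r in zip(prediction, reference)) / len(prediction)
    if mse == 0:
        # Identical inputs: the ratio MAX^2 / MSE is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


identical = psnr([0, 128, 255], [0, 128, 255])
opposite = psnr([0, 0], [255, 255])
print(identical, opposite)
```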

Perf

| | Load time | Generation time |
| --- | --- | --- |
| Diffusers backend | 177.89 s | 5.09 s (5086.76 ms) |
| Diffusers lib | 203.10 s | 14.63 s |

Somehow, our backend is even faster than the bare-bones library 🤔

The setup is as follows:

Ours (the preprocess and postprocess are all included in the forward run)

> vllm serve /data/models/Qwen/Qwen-Image --omni --port 12345 --enable-diffusion-pipeline-profiler --diffusion-load-format diffusers --diffusers-load-kwargs '{"use_safetensors": true}'

...With logging, it uses `_native_flash` in my environment
...
INFO 04-23 10:15:27 [diffusion_model_runner.py:142] Model loading took 53.7914 GiB and 177.892670 seconds
> python examples/online_serving/text_to_image/openai_chat_client.py \
    --server http://127.0.0.1:12345 \
    --negative 'angry facial expression' \
    --steps 20 \
    --height 512 \
    --width 512  \
    --seed 40 \
    --prompt 'a cat wearing furry bee costume and enjoying a cup of honey water' \
    --output 'diffusers-output.png'

INFO 04-23 10:15:28 [diffusion_worker.py:183] Worker 0: Process-scoped GPU memory after model loading: 54.49 GiB.
INFO 04-23 10:18:58 [diffusion_pipeline_profiler.py:31] [DiffusionPipelineProfiler] DiffusersAdapterPipeline.forward took 5.080019s
INFO 04-23 10:18:58 [diffusion_model_runner.py:213] Peak GPU memory (this request): 55.25 GB reserved, 54.84 GB allocated, 0.41 GB pool overhead (0.7%)
...
(APIServer pid=875076) INFO 04-23 10:18:58 [diffusion_engine.py:176] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=5086.49 ms, postprocess=0.00 ms, total=5086.76 ms

Diffusers (following our defaults of bfloat16 dtype and use_safetensors)

import time

from diffusers import QwenImagePipeline
import torch

start_time = time.perf_counter()
pipe = QwenImagePipeline.from_pretrained("/data/models/Qwen/Qwen-Image", torch_dtype=torch.bfloat16, use_safetensors=True)
pipe.to("cuda")
pipe.transformer.set_attention_backend("_native_flash")
end_time = time.perf_counter()
print(f"Diffusers pipeline loading time: {end_time - start_time:.2f} seconds")

with torch.inference_mode():
    start_time = time.perf_counter()
    image = pipe(
        'a cat wearing furry bee costume and enjoying a cup of honey water',
        num_inference_steps=20,
        negative_prompt='angry facial expression',
        height=512,
        width=512,
        generator=torch.Generator("cuda").manual_seed(40),
    ).images[0]
    end_time = time.perf_counter()
image.save("diffusers-lib-output.png")
print(f"Diffusers pipeline execution time: {end_time - start_time:.2f} seconds")
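
The script above times a single call made right after loading. Whether that accounts for any of the gap in the table is not established in this thread, but a common way to make such comparisons more robust is to discard warm-up calls and report several samples. The helper below is a hypothetical sketch of that pattern using only the standard library; for GPU work the timed callable should return only after the device has finished, for example by ending with `torch.cuda.synchronize()`.

```python
import statistics
import time


def time_call(fn, *, warmup=1, repeats=3):
    """Call fn warmup times untimed, then repeats times timed; return timing stats in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"min": min(samples), "median": statistics.median(samples), "samples": samples}


# Cheap stand-in workload so the sketch runs anywhere; replace with a pipeline call.
call_count = {"n": 0}


def fake_generation():
    call_count["n"] += 1
    sum(range(10_000))


stats = time_call(fake_generation, warmup=2, repeats=3)
print(f"median {stats['median']:.6f} s over {len(stats['samples'])} runs")
```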

I'll

@fhfuih fhfuih deleted the diffusers-backend branch April 28, 2026 02:38
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…lm-project#2724)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…lm-project#2724)

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>

Labels

ready label to trigger buildkite CI


8 participants