1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -34,6 +34,7 @@ nav:
- Online Serving:
- BAGEL-7B-MoT: user_guide/examples/online_serving/bagel.md
- vLLM-Omni Helm Chart: user_guide/examples/online_serving/chart-helm.md
- Diffusers Backend Adapter Example: user_guide/examples/online_serving/diffusers_pipeline_adapter.md
- Fish Speech S2 Pro: user_guide/examples/online_serving/fish_speech.md
- GLM-Image Online Serving: user_guide/examples/online_serving/glm_image.md
- Image-To-Image: user_guide/examples/online_serving/image_to_image.md
@@ -0,0 +1,93 @@
# Diffusers Backend Adapter Example

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/diffusers_pipeline_adapter>.


This example demonstrates how to serve any 🤗 Diffusers pipeline through vLLM-Omni
using the `diffusers` load format.

## Supported Models

Any model loadable via `DiffusionPipeline.from_pretrained()` should be supported, including text-to-image, image-to-image, text-to-video, image-to-video, and text-to-audio.

## Limitations

The diffusers backend is a black-box adapter. The following features are not yet supported,
and there is no guarantee that they will be supported in the future.

- CFG parallel execution
- Sequence parallel execution
- TeaCache / Cache-DiT acceleration
- Step-wise execution (continuous batching)

For these features, it is recommended to use natively supported pipelines instead.

## Usage

### Option 1: CLI arguments

```bash
vllm serve "stable-diffusion-v1-5/stable-diffusion-v1-5" \
--omni \
--diffusion-load-format diffusers \
--diffusers-load-kwargs '{"use_safetensors": true}' \
--diffusers-call-kwargs '{"num_inference_steps": 30, "guidance_scale": 7.5}'
```

Collaborator:

Can we reuse the kwargs from the `vllm serve` CLI args instead of introducing 3 more args? I suggest keeping only one.

Contributor Author:

`--diffusion-load-format` already exists; I reuse it and add a new value. `--diffusers-load-kwargs` and `--diffusers-call-kwargs` are pass-throughs, so that users have a fallback way to set niche parameters that a specific model may need.

`--diffusers-load-kwargs` and `--diffusers-call-kwargs` are only valid together with `--diffusion-load-format diffusers`.

### Option 2: Stage config YAML

```bash
vllm serve stable-diffusion-v1-5/stable-diffusion-v1-5 --stage-configs-path examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml --omni
```

The fields of interest are `model`, `diffusion_load_format`, `diffusers_load_kwargs`, and `diffusers_call_kwargs` under `engine_args`. They have the same meaning as the CLI arguments in Option 1.

## Send a Request

```bash
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "stable-diffusion-v1-5/stable-diffusion-v1-5",
"prompt": "a photo of an astronaut riding a horse on mars",
"n": 1,
"size": "512x512"
}'
```

For other input/output modalities, refer to the corresponding documentation pages and example clients, such as `examples/online_serving/text_to_image/openai_chat_client.py`.
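
The curl command above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names (`build_request`, `decode_images`, `main`) are hypothetical, and the response shape (`data[i].b64_json`, as in the OpenAI images API) and the PNG file extension are assumptions; check them against the response your server actually returns.

```python
import base64
import json
import urllib.request

URL = "http://localhost:8000/v1/images/generations"


def build_request(prompt, model, size="512x512", n=1, url=URL):
    """Build the same POST request as the curl command above."""
    payload = {"model": model, "prompt": prompt, "n": n, "size": size}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_images(response_body):
    """Decode images, ASSUMING an OpenAI-style body: {"data": [{"b64_json": "..."}]}."""
    return [base64.b64decode(item["b64_json"]) for item in response_body["data"]]


def main():
    """Send one request and save the results (requires a running server)."""
    req = build_request(
        "a photo of an astronaut riding a horse on mars",
        model="stable-diffusion-v1-5/stable-diffusion-v1-5",
    )
    with urllib.request.urlopen(req) as resp:
        images = decode_images(json.load(resp))
    for i, image_bytes in enumerate(images):
        # The .png extension is an assumption about the returned image format.
        with open(f"output_{i}.png", "wb") as f:
            f.write(image_bytes)
```

Call `main()` once the server from the Usage section is up. Keeping request construction and response decoding in separate functions makes it easy to adjust either one if the payload differs for your model.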

## Configuration Reference

For the diffusers adapter, set options under **`engine_args`**:

### `diffusion_load_format: "diffusers"`

This field selects the Hugging Face diffusers adapter path (see `DiffusersPipelineLoader`).

### `diffusers_load_kwargs`

Passed to `DiffusionPipeline.from_pretrained()`.

This is suitable for model-specific configurations not available through the vLLM-Omni interface (such as `Omni.__init__()`, `vllm serve` CLI arguments, and stage config YAML fields outside `diffusers_load_kwargs`).

When a parameter is already exposed by the vLLM-Omni interface, the adapter translates it into the corresponding `from_pretrained()` argument.
If the same parameter is set in both the vLLM-Omni interface and `diffusers_load_kwargs`, the value in `diffusers_load_kwargs` takes precedence.

### `diffusers_call_kwargs`

Passed to `pipeline.__call__()`.

This is suitable for sampling parameters not available through the vLLM-Omni interface (such as `Omni.generate()` and online serving payloads).

When a parameter is already exposed by the vLLM-Omni interface, the adapter translates it into the corresponding `pipeline.__call__()` argument.
If the same parameter is set in both the vLLM-Omni interface and `diffusers_call_kwargs`, the value from the vLLM-Omni interface takes precedence (because it is set at request time).
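
The two precedence rules point in opposite directions, which is easy to misread. The sketch below models them as plain dictionary merges. It illustrates the documented behavior and is not the adapter's actual code: the function names are hypothetical, `torch_dtype` is only an illustrative interface-derived argument, and treating `None` as "not set in the request" is an assumption.

```python
def resolve_load_kwargs(interface_args, diffusers_load_kwargs):
    """Load time: values in diffusers_load_kwargs override interface-derived ones."""
    return {**interface_args, **diffusers_load_kwargs}


def resolve_call_kwargs(diffusers_call_kwargs, request_params):
    """Request time: values set in the request override diffusers_call_kwargs."""
    # Assumption: None means the request did not set this parameter.
    explicit = {k: v for k, v in request_params.items() if v is not None}
    return {**diffusers_call_kwargs, **explicit}


load = resolve_load_kwargs(
    {"torch_dtype": "bfloat16"},  # illustrative value derived from the interface
    {"torch_dtype": "float16", "use_safetensors": True},
)
call = resolve_call_kwargs(
    {"num_inference_steps": 30, "guidance_scale": 7.5},  # server-side defaults
    {"num_inference_steps": 50, "guidance_scale": None},  # per-request values
)
print(load)
print(call)
```

In this model, `diffusers_call_kwargs` behaves like a set of server-side defaults that a request can override, while `diffusers_load_kwargs` behaves like a final override applied when the pipeline is loaded.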

## Example materials

??? abstract "stage_config.yaml"
    ``````yaml
    --8<-- "examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml"
    ``````
83 changes: 83 additions & 0 deletions examples/online_serving/diffusers_pipeline_adapter/README.md
@@ -0,0 +1,83 @@
# Diffusers Backend Adapter Example

This example demonstrates how to serve any 🤗 Diffusers pipeline through vLLM-Omni
using the `diffusers` load format.

## Supported Models

Any model loadable via `DiffusionPipeline.from_pretrained()` should be supported, including text-to-image, image-to-image, text-to-video, image-to-video, and text-to-audio.

## Limitations

The diffusers backend is a black-box adapter. The following features are not yet supported,
and there is no guarantee that they will be supported in the future.

- CFG parallel execution
- Sequence parallel execution

Comment:

We should be able to depend on Diffusers' extensive CP (context parallelism) support for this, no?
https://huggingface.co/docs/diffusers/main/en/training/distributed_inference#context-parallelism

Contributor Author:

Thanks for the info! I thought it was only done externally by xDiT. For these parallelism features, I will also need to confirm that they play well with our architecture.

Comment:

We do support CP natively :)

- TeaCache / Cache-DiT acceleration

Contributor Author:

Thanks for the clarification. I also learned that it is possible to turn on these features. Apart from Cache-DiT, there also seem to be:

  • dtype & quantization
  • CPU offloading
  • Attention backend
  • VAE slicing and tiling
  • Torch compile (eagerness)

Comment:

Yup.

Then there's the concept of regional compilation, which provides a trade-off:
https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/

Contributor Author (@fhfuih, Apr 16, 2026):

TeaCache is incoming: huggingface/diffusers#12652

Cc: @DN6 we should probably prioritize that PR?

You can take your time on your TeaCache support :)

After a careful study of both codebases, I think supporting caching in the adapter layer is non-trivial and can be deferred to a later PR. I put some notes here: #2403 (comment)

- Step-wise execution (continuous batching)

For these features, it is recommended to use natively supported pipelines instead.

## Usage

### Option 1: CLI arguments

```bash
vllm serve "stable-diffusion-v1-5/stable-diffusion-v1-5" \
--omni \
--diffusion-load-format diffusers \
--diffusers-load-kwargs '{"use_safetensors": true}' \
--diffusers-call-kwargs '{"num_inference_steps": 30, "guidance_scale": 7.5}'
```

`--diffusers-load-kwargs` and `--diffusers-call-kwargs` are only valid together with `--diffusion-load-format diffusers`.
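
When the server is launched from a script, building the command programmatically avoids shell-quoting mistakes in the JSON values. The sketch below uses only the flags shown above. `build_serve_command` is a hypothetical helper, not part of vLLM-Omni, and its check merely mirrors the rule stated above; the real CLI performs its own validation.

```python
import json
import shlex


def build_serve_command(model, load_format="diffusers", load_kwargs=None, call_kwargs=None):
    """Assemble a `vllm serve` argv list (hypothetical helper, not part of vLLM-Omni)."""
    if (load_kwargs or call_kwargs) and load_format != "diffusers":
        # Mirrors the rule above: the pass-through kwargs need the diffusers load format.
        raise ValueError("diffusers kwargs require --diffusion-load-format diffusers")
    cmd = ["vllm", "serve", model, "--omni", "--diffusion-load-format", load_format]
    if load_kwargs:
        # json.dumps emits valid JSON (e.g. True becomes true).
        cmd += ["--diffusers-load-kwargs", json.dumps(load_kwargs)]
    if call_kwargs:
        cmd += ["--diffusers-call-kwargs", json.dumps(call_kwargs)]
    return cmd


cmd = build_serve_command(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    load_kwargs={"use_safetensors": True},
    call_kwargs={"num_inference_steps": 30, "guidance_scale": 7.5},
)
print(shlex.join(cmd))  # shell-safe string for copy-pasting
```

To start the server from Python, pass the `cmd` list directly to `subprocess.run`; a list bypasses the shell, so no quoting is needed at all.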

### Option 2: Stage config YAML

```bash
vllm serve stable-diffusion-v1-5/stable-diffusion-v1-5 --stage-configs-path examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml --omni
```

The fields of interest are `model`, `diffusion_load_format`, `diffusers_load_kwargs`, and `diffusers_call_kwargs` under `engine_args`. They have the same meaning as the CLI arguments in Option 1.

## Send a Request

```bash
curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "stable-diffusion-v1-5/stable-diffusion-v1-5",
"prompt": "a photo of an astronaut riding a horse on mars",
"n": 1,
"size": "512x512"
}'
```

For other input/output modalities, refer to the corresponding documentation pages and example clients, such as `examples/online_serving/text_to_image/openai_chat_client.py`.

## Configuration Reference

For the diffusers adapter, set options under **`engine_args`**:

### `diffusion_load_format: "diffusers"`

This field selects the Hugging Face diffusers adapter path (see `DiffusersPipelineLoader`).

### `diffusers_load_kwargs`

Passed to `DiffusionPipeline.from_pretrained()`.

This is suitable for model-specific configurations not available through the vLLM-Omni interface (such as `Omni.__init__()`, `vllm serve` CLI arguments, and stage config YAML fields outside `diffusers_load_kwargs`).

When a parameter is already exposed by the vLLM-Omni interface, the adapter translates it into the corresponding `from_pretrained()` argument.
If the same parameter is set in both the vLLM-Omni interface and `diffusers_load_kwargs`, the value in `diffusers_load_kwargs` takes precedence.

### `diffusers_call_kwargs`

Passed to `pipeline.__call__()`.

This is suitable for sampling parameters not available through the vLLM-Omni interface (such as `Omni.generate()` and online serving payloads).

When a parameter is already exposed by the vLLM-Omni interface, the adapter translates it into the corresponding `pipeline.__call__()` argument.
If the same parameter is set in both the vLLM-Omni interface and `diffusers_call_kwargs`, the value from the vLLM-Omni interface takes precedence (because it is set at request time).
@@ -0,0 +1,31 @@
# Example stage config for diffusers backend

Collaborator:

When are we going to remove this YAML?

Contributor Author:

I previously could not forward some diffusion `engine_args` under the new config system (from the deploy YAML to `OmniDiffusionConfig`). I planned to wait for #2987, but saw that it was closed just yesterday. I can look into this further and see whether I can get the new config system working.

# This config demonstrates serving Stable Diffusion 1.5 via the diffusers adapter.
# Users should copy and modify this for their own models.

model_type: diffusion

stage_args:
  - stage_id: 0
    stage_type: diffusion
    engine_args:
      model_stage: diffusion
      model: "stable-diffusion-v1-5/stable-diffusion-v1-5"
      distributed_executor_backend: "mp"
      # gpu_memory_utilization: 0.9
      engine_output_type: image
      # Select the HF diffusers adapter
      diffusion_load_format: "diffusers"
      # model_class_name: "DiffusersAdapterPipeline" # default when diffusion_load_format is diffusers
      diffusers_load_kwargs:
        # Passed to DiffusionPipeline.from_pretrained().
        # Good for model-specific loading parameters not covered by OmniDiffusionConfig.
        # During model load time, parameters here override their counterparts in the vLLM-Omni interface.
        use_safetensors: true
      diffusers_call_kwargs:
        # Passed to pipeline.__call__().
        # Good for model-specific sampling parameters not covered by OmniDiffusionSamplingParams.
        # During request time, parameters here are overridden by the counterparts in OmniDiffusionSamplingParams.
        num_inference_steps: 30
        guidance_scale: 7.5
    final_output: true
    final_output_type: image