[feat]: General diffusers adapter backend to run diffusion models #2724
Changes from all commits
d69ecd8
93098dc
792cfd2
af9a8be
ce3f90d
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| # Diffusers Backend Adapter Example | ||
|
|
||
| Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/diffusers_pipeline_adapter>. | ||
|
|
||
|
|
||
| This example demonstrates how to serve any 🤗 Diffusers pipeline through vLLM-Omni | ||
| using the `diffusers` load format. | ||
|
|
||
| ## Supported Models | ||
|
|
||
| Any model loadable via `DiffusionPipeline.from_pretrained()` should be supported, including text-to-image, image-to-image, text-to-video, image-to-video, and text-to-audio. | ||
|
|
||
| ## Limitations | ||
|
|
||
| The diffusers backend is a black-box adapter. The following features are NOT yet supported. | ||
| There is no guarantee that they will be supported in the future. | ||
|
|
||
| - CFG parallel execution | ||
| - Sequence parallel execution | ||
| - TeaCache / Cache-DiT acceleration | ||
| - Step-wise execution (continuous batching) | ||
|
|
||
| For these features, it is recommended to use natively supported pipelines instead. | ||
|
|
||
| ## Usage | ||
|
|
||
| ### Option 1: CLI arguments | ||
|
|
||
| ```bash | ||
| vllm serve "stable-diffusion-v1-5/stable-diffusion-v1-5" \ | ||
| --omni \ | ||
| --diffusion-load-format diffusers \ | ||
| --diffusers-load-kwargs '{"use_safetensors": true}' \ | ||
| --diffusers-call-kwargs '{"num_inference_steps": 30, "guidance_scale": 7.5}' | ||
| ``` | ||
|
|
||
| `--diffusers-load-kwargs` and `--diffusers-call-kwargs` are only valid together with `--diffusion-load-format diffusers`. | ||
|
|
||
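
A minimal sketch of assembling the Option 1 command from Python, so that the JSON passed to `--diffusers-load-kwargs` and `--diffusers-call-kwargs` is always well-formed and correctly shell-quoted. The helper name `build_serve_command` is hypothetical and not part of vLLM-Omni; it only uses the flags shown above and does not launch the server.

```python
import json
import shlex


def build_serve_command(model: str, load_kwargs: dict, call_kwargs: dict) -> list[str]:
    """Return the argv for `vllm serve` with the diffusers adapter flags.

    Hypothetical helper for illustration only: it assembles the flags
    documented above and does not start the server.
    """
    return [
        "vllm", "serve", model,
        "--omni",
        "--diffusion-load-format", "diffusers",
        # json.dumps guarantees valid JSON (e.g. Python True becomes JSON true).
        "--diffusers-load-kwargs", json.dumps(load_kwargs),
        "--diffusers-call-kwargs", json.dumps(call_kwargs),
    ]


cmd = build_serve_command(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    {"use_safetensors": True},
    {"num_inference_steps": 30, "guidance_scale": 7.5},
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd)` would avoid shell quoting altogether; the printed form is mainly useful for copying into a terminal.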
| ### Option 2: Stage config YAML | ||
|
|
||
| ```bash | ||
| vllm serve stable-diffusion-v1-5/stable-diffusion-v1-5 --stage-configs-path examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml --omni | ||
| ``` | ||
|
|
||
| The particular fields of interest are `model`, `diffusion_load_format`, `diffusers_load_kwargs`, and `diffusers_call_kwargs` under `engine_args`. They are the same as the CLI arguments. | ||
|
|
||
| ## Send a Request | ||
|
|
||
| ```bash | ||
| curl http://localhost:8000/v1/images/generations \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{ | ||
| "model": "stable-diffusion-v1-5/stable-diffusion-v1-5", | ||
| "prompt": "a photo of an astronaut riding a horse on mars", | ||
| "n": 1, | ||
| "size": "512x512" | ||
| }' | ||
| ``` | ||
|
|
||
| Or refer to other documentation pages on how to request a particular input/output modality, such as `examples/online_serving/text_to_image/openai_chat_client.py`. | ||
|
|
||
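
The same request can be sketched in Python using only the standard library. The endpoint, model name, and payload fields are copied from the curl example; the helper names `build_image_request` and `send` are hypothetical, and the response is returned as parsed JSON without assuming anything about its fields.

```python
import json
import urllib.request

# Assumed local endpoint, copied from the curl example above.
API_URL = "http://localhost:8000/v1/images/generations"


def build_image_request(
    prompt: str, model: str, size: str = "512x512", n: int = 1
) -> urllib.request.Request:
    """Build the same POST request as the curl example (nothing is sent yet)."""
    payload = {"model": model, "prompt": prompt, "n": n, "size": size}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_image_request(
    "a photo of an astronaut riding a horse on mars",
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
)
print(request.get_method(), request.full_url)
# result = send(request)  # uncomment once the server from the Usage section is running
```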
| ## Configuration Reference | ||
|
|
||
| For the diffusers adapter, set options under **`engine_args`**: | ||
|
|
||
| ### `diffusion_load_format: "diffusers"` | ||
|
|
||
| This field selects the Hugging Face diffusers adapter path (see `DiffusersPipelineLoader`). | ||
|
|
||
| ### `diffusers_load_kwargs` | ||
|
|
||
| Passed to `DiffusionPipeline.from_pretrained()`. | ||
|
|
||
| This is suitable for model-specific configurations not available through the vLLM-Omni interface (such as `Omni.__init__()`, `vllm serve` CLI arguments, and stage config YAML fields outside `diffusers_load_kwargs`). | ||
|
|
||
| When a parameter is available in the vLLM-Omni interface, it will be adapted here. | ||
| But if that parameter is simultaneously set in both the vLLM-Omni interface and `diffusers_load_kwargs`, the **latter** will take precedence. | ||
|
|
||
| ### `diffusers_call_kwargs` | ||
|
|
||
| Passed to `pipeline.__call__()`. | ||
|
|
||
| This is suitable for sampling parameters not available through the vLLM-Omni interface (such as `Omni.generate()` and online serving payloads). | ||
|
|
||
| When a parameter is available in the vLLM-Omni interface, it will be adapted here. | ||
| But if that parameter is simultaneously set in both the vLLM-Omni interface and `diffusers_call_kwargs`, the **former** will take precedence (because it is set at request time). | ||
|
|
||
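
The two precedence rules in this reference can be illustrated with plain dictionary merging. This is only a sketch of the documented behavior, not the adapter's actual code: the function names are hypothetical, and the example assumes, purely for illustration, that `use_safetensors` and `num_inference_steps` are also settable through the vLLM-Omni interface.

```python
def resolve_load_kwargs(interface_kwargs: dict, diffusers_load_kwargs: dict) -> dict:
    """Load time: `diffusers_load_kwargs` overrides the vLLM-Omni interface."""
    # Later entries win in a dict merge, so the pass-through kwargs go last.
    return {**interface_kwargs, **diffusers_load_kwargs}


def resolve_call_kwargs(diffusers_call_kwargs: dict, request_kwargs: dict) -> dict:
    """Request time: values set in the request override `diffusers_call_kwargs`."""
    # The configured kwargs act as defaults; the request goes last and wins.
    return {**diffusers_call_kwargs, **request_kwargs}


# Illustrative values only (which keys the interface exposes is an assumption).
load = resolve_load_kwargs(
    {"use_safetensors": False},  # set through the vLLM-Omni interface
    {"use_safetensors": True},   # set in diffusers_load_kwargs
)
call = resolve_call_kwargs(
    {"num_inference_steps": 30, "guidance_scale": 7.5},  # diffusers_call_kwargs
    {"num_inference_steps": 50},                         # set in the request
)
print(load)  # {'use_safetensors': True}
print(call)  # {'num_inference_steps': 50, 'guidance_scale': 7.5}
```

The asymmetry follows the text above: at load time the pass-through kwargs are the more specific setting, while at request time the per-request value is the more specific one.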
| ## Example materials | ||
|
|
||
| ??? abstract "stage_config.yaml" | ||
| ``````yaml | ||
| --8<-- "examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml" | ||
| `````` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| # Diffusers Backend Adapter Example | ||
|
|
||
| This example demonstrates how to serve any 🤗 Diffusers pipeline through vLLM-Omni | ||
| using the `diffusers` load format. | ||
|
|
||
| ## Supported Models | ||
|
|
||
| Any model loadable via `DiffusionPipeline.from_pretrained()` should be supported, including text-to-image, image-to-image, text-to-video, image-to-video, and text-to-audio. | ||
|
|
||
| ## Limitations | ||
|
|
||
| The diffusers backend is a black-box adapter. The following features are NOT yet supported. | ||
| There is no guarantee that they will be supported in the future. | ||
|
|
||
| - CFG parallel execution | ||
| - Sequence parallel execution | ||
|
We should be able to depend on Diffusers' extensive CP support for this, no?
Contributor
Author
Thanks for the info! I thought it was done only externally by xdit. For these parallelism features, though, I will also need to confirm whether they play well with our architecture.
We do support CP natively :) |
||
| - TeaCache / Cache-DiT acceleration | ||
|
https://huggingface.co/docs/diffusers/main/en/optimization/cache
CacheDiT is supported too: https://github.com/vipshop/cache-dit?tab=readme-ov-file#quick-start-cache-parallelism-and-quantization
TeaCache is incoming: huggingface/diffusers#12652
Cc: @DN6, we should probably prioritize that PR?
Contributor
Author
Thanks for the clarification. I also learned that it is possible to turn on these features. Apart from Cache-DiT, there also seem to be:
Yup. Then there's this concept of regional compilation, which provides a trade-off:
Contributor
Author
You can take your time on the TeaCache support :) After a careful study of both codebases, I think supporting caching in the adapter layer is non-trivial, so it can be deferred to a later PR. I put some notes in #2403 (comment). |
||
| - Step-wise execution (continuous batching) | ||
|
|
||
| For these features, it is recommended to use natively supported pipelines instead. | ||
|
|
||
| ## Usage | ||
|
|
||
|
|
||
| ### Option 1: CLI arguments | ||
|
|
||
| ```bash | ||
| vllm serve "stable-diffusion-v1-5/stable-diffusion-v1-5" \ | ||
| --omni \ | ||
| --diffusion-load-format diffusers \ | ||
| --diffusers-load-kwargs '{"use_safetensors": true}' \ | ||
| --diffusers-call-kwargs '{"num_inference_steps": 30, "guidance_scale": 7.5}' | ||
| ``` | ||
|
|
||
| `--diffusers-load-kwargs` and `--diffusers-call-kwargs` are only valid together with `--diffusion-load-format diffusers`. | ||
|
|
||
| ### Option 2: Stage config YAML | ||
|
|
||
| ```bash | ||
| vllm serve stable-diffusion-v1-5/stable-diffusion-v1-5 --stage-configs-path examples/online_serving/diffusers_pipeline_adapter/stage_config.yaml --omni | ||
| ``` | ||
|
|
||
| The particular fields of interest are `model`, `diffusion_load_format`, `diffusers_load_kwargs`, and `diffusers_call_kwargs` under `engine_args`. They are the same as the CLI arguments. | ||
|
|
||
| ## Send a Request | ||
|
|
||
| ```bash | ||
| curl http://localhost:8000/v1/images/generations \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{ | ||
| "model": "stable-diffusion-v1-5/stable-diffusion-v1-5", | ||
| "prompt": "a photo of an astronaut riding a horse on mars", | ||
| "n": 1, | ||
| "size": "512x512" | ||
| }' | ||
| ``` | ||
|
|
||
| Or refer to other documentation pages on how to request a particular input/output modality, such as `examples/online_serving/text_to_image/openai_chat_client.py`. | ||
|
|
||
| ## Configuration Reference | ||
|
|
||
| For the diffusers adapter, set options under **`engine_args`**: | ||
|
|
||
| ### `diffusion_load_format: "diffusers"` | ||
|
|
||
| This field selects the Hugging Face diffusers adapter path (see `DiffusersPipelineLoader`). | ||
|
|
||
| ### `diffusers_load_kwargs` | ||
|
|
||
| Passed to `DiffusionPipeline.from_pretrained()`. | ||
|
|
||
| This is suitable for model-specific configurations not available through the vLLM-Omni interface (such as `Omni.__init__()`, `vllm serve` CLI arguments, and stage config YAML fields outside `diffusers_load_kwargs`). | ||
|
|
||
| When a parameter is available in the vLLM-Omni interface, it will be adapted here. | ||
| But if that parameter is simultaneously set in both the vLLM-Omni interface and `diffusers_load_kwargs`, the **latter** will take precedence. | ||
|
|
||
| ### `diffusers_call_kwargs` | ||
|
|
||
| Passed to `pipeline.__call__()`. | ||
|
|
||
| This is suitable for sampling parameters not available through the vLLM-Omni interface (such as `Omni.generate()` and online serving payloads). | ||
|
|
||
| When a parameter is available in the vLLM-Omni interface, it will be adapted here. | ||
| But if that parameter is simultaneously set in both the vLLM-Omni interface and `diffusers_call_kwargs`, the **former** will take precedence (because it is set at request time). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| # Example stage config for diffusers backend | ||
|
Collaborator
When are we going to remove this YAML?
Contributor
Author
Previously I could not forward some diffusion `engine_args` under the new config system (from the deploy YAML to `OmniDiffusionConfig`). I planned to wait for #2987, but I saw that it was closed yesterday. I can look into this further and see whether I can get the new config system working. |
||
| # This config demonstrates serving Stable Diffusion 1.5 via the diffusers adapter. | ||
| # Users should copy and modify this for their own models. | ||
|
|
||
| model_type: diffusion | ||
|
|
||
| stage_args: | ||
|   - stage_id: 0 | ||
|     stage_type: diffusion | ||
|     engine_args: | ||
|       model_stage: diffusion | ||
|       model: "stable-diffusion-v1-5/stable-diffusion-v1-5" | ||
|       distributed_executor_backend: "mp" | ||
|       # gpu_memory_utilization: 0.9 | ||
|       engine_output_type: image | ||
|       # Select the HF diffusers adapter | ||
|       diffusion_load_format: "diffusers" | ||
|       # model_class_name: "DiffusersAdapterPipeline" # default when diffusion_load_format is diffusers | ||
|       diffusers_load_kwargs: | ||
|         # Passed to DiffusionPipeline.from_pretrained(). | ||
|         # Good for model-specific loading parameters not covered by OmniDiffusionConfig. | ||
|         # During model load time, parameters here override their counterparts in the vLLM-Omni interface. | ||
|         use_safetensors: true | ||
|       diffusers_call_kwargs: | ||
|         # Passed to pipeline.__call__(). | ||
|         # Good for model-specific sampling parameters not covered by OmniDiffusionSamplingParams. | ||
|         # During request time, parameters here are overridden by the counterparts in OmniDiffusionSamplingParams. | ||
|         num_inference_steps: 30 | ||
|         guidance_scale: 7.5 | ||
|     final_output: true | ||
|     final_output_type: image | ||
Can we reuse the kwargs from the `vllm serve` CLI args instead of introducing three more args? I suggest keeping only one.
`--diffusion-load-format` is already there; I reuse it and add a new value. `--diffusers-load-kwargs` and `--diffusers-call-kwargs` are pass-throughs, so that when a specific model has niche parameters, users have a fallback way to set them.