Merged
27 changes: 0 additions & 27 deletions .github/workflows/comfyui-publish.yml

This file was deleted.

40 changes: 0 additions & 40 deletions .github/workflows/comfyui-validate.yml

This file was deleted.

41 changes: 20 additions & 21 deletions apps/ComfyUI-vLLM-Omni/README.md
@@ -47,12 +47,12 @@ If no, check your shell running the ComfyUI process. There may be some error mes
This extension offers the following nodes based on the output modalities (at **ComfyUI sidebar -> Node Library**):

- **Generate Image** for text-to-image and image-to-image tasks
- **Multimodality Comprehension** for multimodality-to-text and multimodality-to-audio tasks
- **Multimodality Understanding** for multimodality-to-text and multimodality-to-audio tasks
- **TTS** and **TTS Voice Clone** for TTS tasks

This extension also offers example workflows (at **ComfyUI sidebar -> Templates -> vLLM-Omni**)

> [!INFO]
> [!NOTE]
> The node UI and feature designs are intended to match vLLM-Omni online serving interfaces. It cannot offer more than what the interfaces support.

To build a simple workflow yourself,
@@ -65,28 +65,16 @@ To build a simple workflow yourself,
- For some multi-stage models like BAGEL, [only one stage's sampling parameters are exposed and tunable via vLLM-Omni's online serving API](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/bagel/). Thus, these models are treated as single-stage ones. Please check the vLLM-Omni documentation on how to correctly set each model's sampling parameters.
- For multi-stage models where all stages are either autoregression or diffusion, you can also connect only a single Sampling Params node, indicating that this set of sampling parameters will be used for all stages.

**The following features are tested**:

- Single-node workflows for
- Multimodal Comprehension (e.g., Qwen Omni, BAGEL)
- Text-to-Image Generation (e.g., Qwen-Image)
- Image-to-Image Generation (e.g., Qwen-Image-Edit)
- TTS (e.g., Qwen TTS, including VoiceDesign, VoiceClone, CustomVoice)

**The following features are not currently tested**. They may work or break. You are welcomed to test it out and offer comments.

- Multi-node workflow that connects multiple model services together.

## Screenshots and Examples

### Multimodal comprehension (e.g., Qwen Omni series, BAGEL)
### Multimodal understanding (e.g., Qwen Omni series, BAGEL)

(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Annotated Example**)
(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Multimodal Understanding**)

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-comprehension.jpg">
<img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-comprehension.jpg" width=55%>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-understanding.jpg">
<img alt="vLLM-Omni multimodal understanding" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-understanding.jpg" width=55%>
</picture>
</p>

@@ -98,7 +86,7 @@ You can configure per-stage sampling parameters for multi-stage models.
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg">
<img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg" width=55%>
<img alt="vLLM-Omni multiple stages" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg" width=55%>
</picture>
</p>

@@ -109,7 +97,7 @@ You can configure per-stage sampling parameters for multi-stage models.
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg">
<img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg" width=55%>
<img alt="vLLM-Omni image generation" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg" width=55%>
</picture>
</p>

@@ -123,13 +111,24 @@ You can configure per-stage sampling parameters for multi-stage models.
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg">
<img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg" width=55%>
<img alt="vLLM-Omni TTS" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg" width=55%>
</picture>
</p>

> [!TIP]
> There is a dedicated node for VoiceClone tasks with reference audio input. Other simple text-to-speech tasks should use the regular TTS node.

### Chaining multiple model services

(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Chaining Services**)

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-chaining-services.jpg">
<img alt="vLLM-Omni TTS" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-chaining-services.jpg" width=55%>
</picture>
</p>

## Develop

Follow the [development convention and rules of vLLM-Omni](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/).
10 changes: 5 additions & 5 deletions apps/ComfyUI-vLLM-Omni/__init__.py
@@ -6,26 +6,26 @@
"WEB_DIRECTORY",
]

__author__ = """Zeyu Huang"""
__email__ = "11222265+fhfuih@users.noreply.github.com"
__author__ = """vLLM-Omni Team"""
__email__ = "vllm-omni@vllm.ai"
__version__ = "0.0.1"

from .comfyui_vllm_omni.nodes import (
VLLMOmniARSampling,
VLLMOmniComprehension,
VLLMOmniDiffusionSampling,
VLLMOmniGenerateImage,
VLLMOmniQwenTTSParams,
VLLMOmniSamplingParamsList,
VLLMOmniTTS,
VLLMOmniUnderstanding,
VLLMOmniVoiceClone,
)

# A dictionary that contains all nodes you want to export with their names
NODE_CLASS_MAPPINGS = {
# === Generation ===
"VLLMOmniGenerateImage": VLLMOmniGenerateImage,
"VLLMOmniComprehension": VLLMOmniComprehension,
"VLLMOmniUnderstanding": VLLMOmniUnderstanding,
"VLLMOmniTTS": VLLMOmniTTS,
"VLLMOmniVoiceClone": VLLMOmniVoiceClone,
# === Params ===
@@ -39,7 +39,7 @@
NODE_DISPLAY_NAME_MAPPINGS = {
# === Generation ===
"VLLMOmniGenerateImage": "Generate Image",
"VLLMOmniComprehension": "Multimodality Comprehension",
"VLLMOmniUnderstanding": "Multimodality Understanding",
"VLLMOmniTTS": "TTS (Text to Speech)",
"VLLMOmniVoiceClone": "TTS Voice Cloning",
# === Params ===
47 changes: 26 additions & 21 deletions apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/nodes.py
@@ -1,12 +1,17 @@
from typing import Literal, cast
from typing import Literal

import torch
from comfy_api.input import AudioInput, VideoInput

from .utils.api_client import VLLMOmniClient
from .utils.logger import get_logger
from .utils.models import lookup_model_spec
from .utils.types import AudioFormat
from .utils.types import (
AudioFormat,
AutoregressionSamplingParams,
DiffusionSamplingParams,
QwenTTSModelSpecificParams,
)
from .utils.validators import (
add_sampling_parameters_to_stage,
validate_model_and_sampling_params_types,
@@ -86,7 +91,10 @@ async def generate(

# Prefer DALL-E compatible API for simple (one-stage) diffusion models
if (spec is None or spec["stages"] == ["diffusion"]) and not is_bagel:
sampling_params = cast(dict | None, sampling_params)
# The number of sampling parameter groups should have been validated.
# Now, simply convert single-item list to dict.
if isinstance(sampling_params, list):
sampling_params = sampling_params[0]
if audio is None and image is None and video is None:
# No multimodal input --- use DALL-E image generation
logger.info("Using DALL-E image generation endpoint")
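
The hunk above replaces `cast(dict | None, sampling_params)`, which only informs the type checker and does nothing at runtime, with an explicit unwrap of a single-item list. A minimal sketch of that normalization as a standalone helper follows. The function name `to_single_stage_params` is hypothetical and does not appear in the PR; the sketch assumes, as the diff's comment says, that the number of parameter groups was validated earlier.

```python
from typing import Optional, Union


def to_single_stage_params(
    sampling_params: Union[dict, list, None],
) -> Optional[dict]:
    """Return one parameter dict for a single-stage model.

    Hypothetical helper, not part of the PR. Assumes upstream validation
    already guaranteed that a list input holds exactly one item.
    """
    if isinstance(sampling_params, list):
        # Single-item list from a Sampling Params List node: unwrap it.
        return sampling_params[0]
    # Already a dict (single Sampling Params node) or None (nothing connected).
    return sampling_params


# Usage: all three input shapes end up as a dict or None.
print(to_single_stage_params([{"num_inference_steps": 30}]))
print(to_single_stage_params({"num_inference_steps": 30}))
print(to_single_stage_params(None))
```

Unlike `cast`, the `isinstance` branch changes the value at runtime, so downstream code can index the result as a dict without special-casing lists.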
@@ -133,7 +141,7 @@ async def generate(
return (output,)


class VLLMOmniComprehension(_VLLMOmniGenerateBase):
class VLLMOmniUnderstanding(_VLLMOmniGenerateBase):
@classmethod
def INPUT_TYPES(cls):
return {
@@ -197,7 +205,7 @@ async def generate(
(
text_response,
_,
) = await client.generate_comprehension_chat_completion(
) = await client.generate_understanding_chat_completion(
model=model,
prompt=prompt,
image=image,
@@ -221,7 +229,7 @@ async def generate(
(
text_response,
audio,
) = await client.generate_comprehension_chat_completion(
) = await client.generate_understanding_chat_completion(
model=model,
prompt=prompt,
image=image,
@@ -287,15 +295,13 @@ async def generate(
logger.info("Got extra kwargs in TTS: %s", kwargs)

is_qwen_tts = "qwen3-tts" in model.lower()
extra_params_type = None if model_specific_params is None else model_specific_params["type"]
if not is_qwen_tts and extra_params_type == "qwen-tts":
if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
raise ValueError(
"You have provided Qwen-specific TTS params."
"However, the model appears to not be a Qwen TTS model (no 'Qwen3-TTS' in model name)."
)

combined_params = {**kwargs, **(model_specific_params or {})}
combined_params.pop("type", None) # Internal fields in model_specific_params

client = VLLMOmniClient(url)

@@ -352,8 +358,7 @@ async def generate(
**kwargs,
):
is_qwen_tts = "qwen3-tts" in model.lower()
extra_params_type = None if model_specific_params is None else model_specific_params["type"]
if not is_qwen_tts and extra_params_type == "qwen-tts":
if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
raise ValueError(
"You have provided Qwen-specific TTS params."
"However, the model appears to not be a Qwen TTS model (no 'Qwen3-TTS' in model name)."
@@ -366,7 +371,6 @@ async def generate(
**kwargs,
**(model_specific_params or {}),
}
combined_params.pop("type", None) # Internal fields in model_specific_params

client = VLLMOmniClient(url)

@@ -419,10 +423,7 @@ def INPUT_TYPES(cls):
CATEGORY = "vLLM-Omni/Sampling Params"

def get_params(self, seed, **kwargs):
params = {
"type": "autoregression", # for internal use, removed before sending the request
**kwargs,
}
params = AutoregressionSamplingParams(kwargs)
if seed >= 0:
params["seed"] = seed
return (params,)
@@ -479,6 +480,13 @@ def INPUT_TYPES(cls):
"tooltip": "Enable VAE slicing for reduced memory usage (slight quality trade-off)",
},
),
"vae_use_tiling": (
"BOOLEAN",
{
"default": False,
"tooltip": "Enable VAE tiling for reduced memory usage (slight quality trade-off)",
},
),
# === Put seed at last. ===
# Whenever a field named "seed" is present, ComfyUI adds another field called "control after generate"
"seed": (
@@ -499,10 +507,7 @@ def INPUT_TYPES(cls):
CATEGORY = "vLLM-Omni/Sampling Params"

def get_params(self, seed, **kwargs):
params = {
"type": "diffusion", # for internal use, removed before sending the request
**kwargs,
}
params = DiffusionSamplingParams(kwargs)
if seed >= 0:
params["seed"] = seed
return (params,)
@@ -566,4 +571,4 @@ def INPUT_TYPES(cls):
CATEGORY = "vLLM-Omni/TTS Params"

def get_params(self, **kwargs):
return ({"type": "qwen-tts", **kwargs},)
return (QwenTTSModelSpecificParams(kwargs),)
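
The changes in `nodes.py` drop the internal `"type"` key from the parameter dicts and rely on `isinstance` checks against classes imported from `.utils.types`. Those class definitions are not shown in this diff. The sketch below assumes they are plain `dict` subclasses, which is consistent with how the diff constructs them from `kwargs`, assigns `params["seed"]`, and unpacks them with `**`. The helper `check_tts_params` and the field names `temperature` and `language` are hypothetical.

```python
from typing import Optional


# Assumption: marker subclasses of dict. The real definitions live in
# utils/types.py and may differ.
class AutoregressionSamplingParams(dict):
    """Sampling parameters for an autoregression stage."""


class DiffusionSamplingParams(dict):
    """Sampling parameters for a diffusion stage."""


class QwenTTSModelSpecificParams(dict):
    """Extra parameters that only Qwen TTS models accept."""


def check_tts_params(model: str, model_specific_params: Optional[dict]) -> None:
    """Hypothetical helper mirroring the check in the TTS nodes."""
    is_qwen_tts = "qwen3-tts" in model.lower()
    if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
        raise ValueError("Qwen-specific TTS params given for a non-Qwen TTS model.")


# The class itself carries the kind of params, so no "type" key is stored
# and nothing has to be popped before building the request payload.
params = AutoregressionSamplingParams({"temperature": 0.7})
params["seed"] = 42
payload = {**params}
print(payload)

# A Qwen TTS model name passes the check; other model names would raise.
check_tts_params("Qwen3-TTS-example", QwenTTSModelSpecificParams({"language": "en"}))
```

Under this assumption, the removed `combined_params.pop("type", None)` lines become unnecessary because the discriminator never enters the dict contents.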