vllm-project · hsliuustc0106 · Feb 26, 2026 · Feb 5, 2026 · Feb 23, 2026 · Feb 23, 2026
diff --git a/apps/ComfyUI-vLLM-Omni/README.md b/apps/ComfyUI-vLLM-Omni/README.md
@@ -47,12 +47,12 @@ If no, check your shell running the ComfyUI process. There may be some error mes
 This extension offers the following nodes based on the output modalities (at **ComfyUI sidebar -> Node Library**):
 
 - **Generate Image** for text-to-image and image-to-image tasks
-- **Multimodality Comprehension** for multimodality-to-text and multimodality-to-audio tasks
+- **Multimodality Understanding** for multimodality-to-text and multimodality-to-audio tasks
 - **TTS** and **TTS Voice Clone** for TTS tasks
 
 This extension also offers example workflows (at **ComfyUI sidebar -> Templates -> vLLM-Omni**)
 
-> [!INFO]
+> [!NOTE]
 > The node UI and feature designs are intended to match vLLM-Omni online serving interfaces. It cannot offer more than what the interfaces support.
 
 To build a simple workflow yourself,
@@ -65,28 +65,16 @@ To build a simple workflow yourself,
     - For some multi-stage models like BAGEL, [only one stage's sampling parameters are exposed and tunable via vLLM-Omni's online serving API](https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/bagel/). Thus, these models are treated as single-stage ones. Please check the vLLM-Omni documentation on how to correctly set each model's sampling parameters.
     - For multi-stage models where all stages are either autoregression or diffusion, you can also connect only a single Sampling Params node, indicating that this set of sampling parameters will be used for all stages.
 
-**The following features are tested**:
-
-- Single-node workflows for
-    - Multimodal Comprehension (e.g., Qwen Omni, BAGEL)
-    - Text-to-Image Generation (e.g., Qwen-Image)
-    - Image-to-Image Generation (e.g., Qwen-Image-Edit)
-    - TTS (e.g., Qwen TTS, including VoiceDesign, VoiceClone, CustomVoice)
-
-**The following features are not currently tested**. They may work or break. You are welcomed to test it out and offer comments.
-
-- Multi-node workflow that connects multiple model services together.
-
 ## Screenshots and Examples
 
-### Multimodal comprehension (e.g., Qwen Omni series, BAGEL)
+### Multimodal understanding (e.g., Qwen Omni series, BAGEL)
 
-(Also available at **ComfyUI sidebar->Template->vLLM-Omni->vLLM-Omni Annotated Example**)
+(Also available at **ComfyUI sidebar->Templates->vLLM-Omni->vLLM-Omni Multimodal Understanding**)
 
 <p align="center">
   <picture>
-    <source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-comprehension.jpg">
-    <img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-comprehension.jpg" width=55%>
+    <source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-understanding.jpg">
+    <img alt="vLLM-Omni multimodal understanding" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-understanding.jpg" width=55%>
   </picture>
 </p>
 
@@ -98,7 +86,7 @@ You can configure per-stage sampling parameters for multi-stage models.
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg">
-    <img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg" width=55%>
+    <img alt="vLLM-Omni multiple stages" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-multi-stage.jpg" width=55%>
   </picture>
 </p>
 
@@ -109,7 +97,7 @@ You can configure per-stage sampling parameters for multi-stage models.
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg">
-    <img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg" width=55%>
+    <img alt="vLLM-Omni image generation" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-image-generation.jpg" width=55%>
   </picture>
 </p>
 
@@ -123,13 +111,24 @@ You can configure per-stage sampling parameters for multi-stage models.
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg">
-    <img alt="vLLM-Omni Main Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg" width=55%>
+    <img alt="vLLM-Omni TTS" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-tts.jpg" width=55%>
   </picture>
 </p>
 
 > [!TIP]
 > There is a dedicated node for VoiceClone tasks with reference audio input. Other simple text-to-speech tasks should use the regular TTS node.
 
+### Chaining multiple model services
+
+(Also available at **ComfyUI sidebar->Templates->vLLM-Omni->vLLM-Omni Chaining Services**)
+
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-chaining-services.jpg">
+    <img alt="vLLM-Omni chaining services" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/apps/ComfyUI-vLLM-Omni/docs/images/comfyui-chaining-services.jpg" width=55%>
+  </picture>
+</p>
+
 ## Develop
 
 Follow the [development convention and rules of vLLM-Omni](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/).

diff --git a/apps/ComfyUI-vLLM-Omni/__init__.py b/apps/ComfyUI-vLLM-Omni/__init__.py
@@ -6,26 +6,26 @@
     "WEB_DIRECTORY",
 ]
 
-__author__ = """Zeyu Huang"""
-__email__ = "11222265+fhfuih@users.noreply.github.com"
+__author__ = """vLLM-Omni Team"""
+__email__ = "vllm-omni@vllm.ai"
 __version__ = "0.0.1"
 
 from .comfyui_vllm_omni.nodes import (
     VLLMOmniARSampling,
-    VLLMOmniComprehension,
     VLLMOmniDiffusionSampling,
     VLLMOmniGenerateImage,
     VLLMOmniQwenTTSParams,
     VLLMOmniSamplingParamsList,
     VLLMOmniTTS,
+    VLLMOmniUnderstanding,
     VLLMOmniVoiceClone,
 )
 
 # A dictionary that contains all nodes you want to export with their names
 NODE_CLASS_MAPPINGS = {
     # === Generation ===
     "VLLMOmniGenerateImage": VLLMOmniGenerateImage,
-    "VLLMOmniComprehension": VLLMOmniComprehension,
+    "VLLMOmniUnderstanding": VLLMOmniUnderstanding,
     "VLLMOmniTTS": VLLMOmniTTS,
     "VLLMOmniVoiceClone": VLLMOmniVoiceClone,
     # === Params ===
@@ -39,7 +39,7 @@
 NODE_DISPLAY_NAME_MAPPINGS = {
     # === Generation ===
     "VLLMOmniGenerateImage": "Generate Image",
-    "VLLMOmniComprehension": "Multimodality Comprehension",
+    "VLLMOmniUnderstanding": "Multimodality Understanding",
     "VLLMOmniTTS": "TTS (Text to Speech)",
     "VLLMOmniVoiceClone": "TTS Voice Cloning",
     # === Params ===

diff --git a/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/nodes.py b/apps/ComfyUI-vLLM-Omni/comfyui_vllm_omni/nodes.py
@@ -1,12 +1,17 @@
-from typing import Literal, cast
+from typing import Literal
 
 import torch
 from comfy_api.input import AudioInput, VideoInput
 
 from .utils.api_client import VLLMOmniClient
 from .utils.logger import get_logger
 from .utils.models import lookup_model_spec
-from .utils.types import AudioFormat
+from .utils.types import (
+    AudioFormat,
+    AutoregressionSamplingParams,
+    DiffusionSamplingParams,
+    QwenTTSModelSpecificParams,
+)
 from .utils.validators import (
     add_sampling_parameters_to_stage,
     validate_model_and_sampling_params_types,
@@ -86,7 +91,10 @@ async def generate(
 
         # Prefer DALL-E compatible API for simple (one-stage) diffusion models
         if (spec is None or spec["stages"] == ["diffusion"]) and not is_bagel:
-            sampling_params = cast(dict | None, sampling_params)
+            # The number of sampling parameter groups should have been validated.
+            # Now, simply convert single-item list to dict.
+            if isinstance(sampling_params, list):
+                sampling_params = sampling_params[0]
             if audio is None and image is None and video is None:
                 # No multimodal input --- use DALL-E image generation
                 logger.info("Using DALL-E image generation endpoint")
@@ -133,7 +141,7 @@ async def generate(
         return (output,)
 
 
-class VLLMOmniComprehension(_VLLMOmniGenerateBase):
+class VLLMOmniUnderstanding(_VLLMOmniGenerateBase):
     @classmethod
     def INPUT_TYPES(cls):
         return {
@@ -197,7 +205,7 @@ async def generate(
             (
                 text_response,
                 _,
-            ) = await client.generate_comprehension_chat_completion(
+            ) = await client.generate_understanding_chat_completion(
                 model=model,
                 prompt=prompt,
                 image=image,
@@ -221,7 +229,7 @@ async def generate(
             (
                 text_response,
                 audio,
-            ) = await client.generate_comprehension_chat_completion(
+            ) = await client.generate_understanding_chat_completion(
                 model=model,
                 prompt=prompt,
                 image=image,
@@ -287,15 +295,13 @@ async def generate(
         logger.info("Got extra kwargs in TTS: %s", kwargs)
 
         is_qwen_tts = "qwen3-tts" in model.lower()
-        extra_params_type = None if model_specific_params is None else model_specific_params["type"]
-        if not is_qwen_tts and extra_params_type == "qwen-tts":
+        if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
             raise ValueError(
                 "You have provided Qwen-specific TTS params. "
                 "However, the model appears to not be a Qwen TTS model (no 'Qwen3-TTS' in model name)."
             )
 
         combined_params = {**kwargs, **(model_specific_params or {})}
-        combined_params.pop("type", None)  # Internal fields in model_specific_params
 
         client = VLLMOmniClient(url)
 
@@ -352,8 +358,7 @@ async def generate(
         **kwargs,
     ):
         is_qwen_tts = "qwen3-tts" in model.lower()
-        extra_params_type = None if model_specific_params is None else model_specific_params["type"]
-        if not is_qwen_tts and extra_params_type == "qwen-tts":
+        if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
             raise ValueError(
                 "You have provided Qwen-specific TTS params. "
                 "However, the model appears to not be a Qwen TTS model (no 'Qwen3-TTS' in model name)."
@@ -366,7 +371,6 @@ async def generate(
             **kwargs,
             **(model_specific_params or {}),
         }
-        combined_params.pop("type", None)  # Internal fields in model_specific_params
 
         client = VLLMOmniClient(url)
 
@@ -419,10 +423,7 @@ def INPUT_TYPES(cls):
     CATEGORY = "vLLM-Omni/Sampling Params"
 
     def get_params(self, seed, **kwargs):
-        params = {
-            "type": "autoregression",  # for internal use, removed before sending the request
-            **kwargs,
-        }
+        params = AutoregressionSamplingParams(kwargs)
         if seed >= 0:
             params["seed"] = seed
         return (params,)
@@ -479,6 +480,13 @@ def INPUT_TYPES(cls):
                         "tooltip": "Enable VAE slicing for reduced memory usage (slight quality trade-off)",
                     },
                 ),
+                "vae_use_tiling": (
+                    "BOOLEAN",
+                    {
+                        "default": False,
+                        "tooltip": "Enable VAE tiling for reduced memory usage (slight quality trade-off)",
+                    },
+                ),
                 # === Put seed at last. ===
                 # Whenever a field named "seed" is present, ComfyUI adds another field called "control after generate"
                 "seed": (
@@ -499,10 +507,7 @@ def INPUT_TYPES(cls):
     CATEGORY = "vLLM-Omni/Sampling Params"
 
     def get_params(self, seed, **kwargs):
-        params = {
-            "type": "diffusion",  # for internal use, removed before sending the request
-            **kwargs,
-        }
+        params = DiffusionSamplingParams(kwargs)
         if seed >= 0:
             params["seed"] = seed
         return (params,)
@@ -566,4 +571,4 @@ def INPUT_TYPES(cls):
     CATEGORY = "vLLM-Omni/TTS Params"
 
     def get_params(self, **kwargs):
-        return ({"type": "qwen-tts", **kwargs},)
+        return (QwenTTSModelSpecificParams(kwargs),)
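
The `nodes.py` changes above replace the internal `"type"` key with typed parameter containers checked via `isinstance`. The definitions in `.utils.types` are not part of this diff, so the following is only a minimal sketch of how that pattern could work, assuming the three classes are plain `dict` subclasses. The helper `build_request_params` is hypothetical and exists only for illustration.

```python
from typing import Optional


# ASSUMPTION: the real classes live in .utils.types and are not shown in the diff.
# Here they are modeled as marker subclasses of dict, which is consistent with
# how the diff constructs them (Cls(kwargs)) and mutates them (params["seed"] = ...).
class AutoregressionSamplingParams(dict):
    """Sampling parameters for an autoregressive stage."""


class DiffusionSamplingParams(dict):
    """Sampling parameters for a diffusion stage."""


class QwenTTSModelSpecificParams(dict):
    """Extra parameters that only apply to Qwen TTS models."""


def build_request_params(
    model: str,
    model_specific_params: Optional[dict] = None,
    **kwargs,
) -> dict:
    """Hypothetical helper mirroring the check-then-merge logic in the TTS nodes."""
    is_qwen_tts = "qwen3-tts" in model.lower()
    # The container's class carries the type information, so no "type" key
    # has to be stored in the payload or popped before sending the request.
    if not is_qwen_tts and isinstance(model_specific_params, QwenTTSModelSpecificParams):
        raise ValueError("Qwen-specific TTS params were given for a non-Qwen TTS model.")
    # Spreading a dict subclass yields a plain dict with only the real fields.
    return {**kwargs, **(model_specific_params or {})}


if __name__ == "__main__":
    params = QwenTTSModelSpecificParams({"voice": "example"})
    print(build_request_params("Qwen/Qwen3-TTS", params, speed=1.0))
```

Under these assumptions, the merged payload never contains an internal marker field, which is why the `combined_params.pop("type", None)` lines can be dropped.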