diff --git a/examples/offline_inference/hunyuan_image3/README.md b/examples/offline_inference/hunyuan_image3/README.md
index da28a44d9e6..3cd8fa01b2e 100644
--- a/examples/offline_inference/hunyuan_image3/README.md
+++ b/examples/offline_inference/hunyuan_image3/README.md
@@ -1,25 +1,161 @@
-# HunyuanImage-3.0 Image-to-Text Inference
+# HunyuanImage-3.0-Instruct
-This example demonstrates how to run HunyuanImage-3.0 Image-to-Text with the vLLM-Omni.
+## Set up
-## Local CLI Usage
+See the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation for your hardware.
-Download the example image:
+## Run examples
+
+**Note**: These examples work with the default configuration on **8x NVIDIA L40S (48GB)**. For different GPU setups, modify the stage configuration to adjust device allocation and memory utilization.
+
+Change to the `hunyuan_image3` directory:
+
+```bash
+cd examples/offline_inference/hunyuan_image3
+```
+
+### Modality Control
+
+HunyuanImage-3.0-Instruct supports multiple modality modes. You can control the mode using the `--modality` argument:
+
+#### Text to Image (text2img)
+
+- **Pipeline**: Text → AR (CoT + latent tokens) → DiT (denoise) → VAE Decode → Image
+- **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
+- **KV Transfer**: AR sends KV cache to DiT for conditioned generation
+- **Default Config**: `hunyuan_image3_t2i.yaml`
+
+```bash
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+ --modality text2img \
+ --prompts "A cute cat sitting on a windowsill watching the sunset"
+```
+
+#### Image to Image (img2img)
+
+- **Pipeline**: Image + Text → AR (CoT + recaption + latent) → DiT → Edited Image
+- **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
+- **KV Transfer**: AR sends KV cache to DiT
+- **Default Config**: `hunyuan_image3_it2i.yaml`
+
+```bash
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+ --modality img2img \
+ --image-path /path/to/image.png \
+ --prompts "Make the petals neon pink"
+```
+
+#### Image to Text (img2text)
+
+- **Pipeline**: Image + Question → AR → Text description
+- **Stages Used**: Stage 0 (AR) only
+- **Default Config**: `hunyuan_image3_i2t.yaml`
```bash
-wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+ --modality img2text \
+ --image-path /path/to/image.jpg \
+ --prompts "Describe the content of the picture."
```
-Run example:
+#### Text to Text (text2text)
+
+- **Pipeline**: Text → AR → Text
+- **Stages Used**: Stage 0 (AR) only
+- **Default Config**: `hunyuan_image3_t2t.yaml`
```bash
-python image_to_text.py \
- --image cherry_blossom.jpg \
- --prompt "<|startoftext|>You are an assistant that understands images and outputs text.<img>Describe the content of the picture."
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+ --modality text2text \
+ --prompts "What is the capital of France?"
```
-Key arguments:
+### Inference Steps & Guidance
+
+Control generation quality for image modalities:
+
+```bash
+python end2end.py --modality text2img \
+ --steps 50 \
+ --guidance-scale 5.0 \
+ --height 1024 --width 1024 \
+ --prompts "A photo-realistic sunset over the ocean"
+```
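+
+As a rough illustration of what `--guidance-scale` controls, the sketch below applies the standard classifier-free guidance formula to two toy noise predictions. The function name `apply_cfg` and the toy values are assumptions made for this example; the DiT stage's actual implementation may differ.
+
+```python
+def apply_cfg(uncond, cond, guidance_scale):
+    """Blend unconditional and conditional predictions element-wise."""
+    # Standard CFG: start from the unconditional prediction and move
+    # toward (or past) the conditional one by guidance_scale.
+    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]
+
+
+uncond = [0.0, 1.0]  # toy values, not real model outputs
+cond = [1.0, 3.0]
+print(apply_cfg(uncond, cond, 1.0))  # scale 1.0 reproduces the conditional prediction
+print(apply_cfg(uncond, cond, 5.0))  # larger scales follow the prompt more strongly
+```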
+
+### Key Arguments
+
+#### 📌 Command Line Arguments (end2end.py)
+
+| Argument | Type | Default | Description |
+| :--------------------- | :----- | :----------------------------------- | :----------------------------------------------------------- |
+| `--model` | string | `tencent/HunyuanImage-3.0-Instruct` | Model path or name |
+| `--modality` | choice | `text2img` | Modality: `text2img`, `img2img`, `img2text`, `text2text` |
+| `--prompts` | list | `None` | Input text prompts |
+| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
+| `--output` | string | `.` | Output directory for saved images |
+| `--steps` | int | `50` | Number of inference steps |
+| `--guidance-scale` | float | `5.0` | Classifier-free guidance scale |
+| `--seed` | int | `42` | Random seed |
+| `--height` | int | `1024` | Output image height (`text2img` only) |
+| `--width` | int | `1024` | Output image width (`text2img` only) |
+| `--bot-task` | string | auto | Override prompt task (e.g. `it2i_think`, `t2i_recaption`) |
+| `--sys-type` | string | auto | Override system prompt type (e.g. `en_unified`, `en_vanilla`) |
+| `--stage-configs-path` | string | auto | Custom stage config YAML path |
+| `--enforce-eager` | flag | `False` | Disable torch.compile |
+| `--init-timeout` | int | `300` | Initialization timeout (seconds) |
+
+------
+
+#### ⚙️ Stage Configurations
+
+| Config YAML | Modality | Stages | GPUs | Description |
+| :---------------------------------- | :-------- | :----- | :----- | :------------------------------------ |
+| `hunyuan_image3_t2i.yaml` | text2img | 2 | 8 | T2I with AR→DiT, 4 GPUs each |
+| `hunyuan_image3_it2i.yaml` | img2img | 2 | 8 | IT2I with AR→DiT, 4 GPUs each |
+| `hunyuan_image3_i2t.yaml` | img2text | 1 | 4 | I2T (AR only) |
+| `hunyuan_image3_t2t.yaml` | text2text | 1 | 4 | T2T (AR only) |
+| `hunyuan_image3_t2i_2gpu.yaml` | text2img | 2 | 2 | T2I for 2-GPU setups |
+| `hunyuan_image3_moe.yaml` | text2img | 2 | 8 | T2I with MoE AR→DiT KV reuse |
+| `hunyuan_image3_moe_dit_2gpu_fp8.yaml` | text2img | 2 | 2 | T2I with FP8 quantization |
+
+------
+
+## Using MoE Config
+
+The `hunyuan_image3_moe.yaml` config enables AR→DiT KV cache reuse with 8 GPUs (4 for AR + 4 for DiT).
+
+```bash
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+ --modality text2img \
+ --stage-configs-path hunyuan_image3_moe.yaml \
+ --prompts "A cute cat"
+```
+
+------
+
+## Prompt Format
+
+HunyuanImage-3.0 uses a pretrain template format:
+
+```
+<|startoftext|>{system_prompt}{<img>}{trigger_tag}{user_prompt}
+```
+
+- `<img>`: Placeholder for each input image (auto-inserted by `build_prompt()` in `end2end.py`)
+- Trigger tags: `<think>` (CoT), `<recaption>` (recaptioning)
+- System prompt: Auto-selected based on task
+
+The `build_prompt()` function in `end2end.py` handles this formatting automatically.
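+
+A minimal, self-contained sketch of how such a template can be assembled. `assemble_prompt` is a hypothetical helper written for this README, the `<img>` placeholder and `<think>` trigger tag are assumed spellings, and `"SYSTEM"` stands in for the real system prompt; the actual logic lives in `build_prompt()` in `end2end.py`.
+
+```python
+def assemble_prompt(system_prompt, user_prompt, has_image=False, trigger_tag=None):
+    """Concatenate the template parts without role delimiters."""
+    parts = ["<|startoftext|>"]
+    if system_prompt:
+        parts.append(system_prompt.strip())
+    if has_image:
+        parts.append("<img>")  # assumed image placeholder
+    if trigger_tag:
+        parts.append(trigger_tag)  # assumed tag, e.g. "<think>"
+    parts.append(user_prompt)
+    return "".join(parts)
+
+
+print(assemble_prompt("SYSTEM", "Make the petals neon pink", has_image=True, trigger_tag="<think>"))
+```
+
+Text-only tasks would pass `has_image=False` and no trigger tag, which leaves only the start token, the system prompt, and the user prompt.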
+
+------
+
+## FAQ
+
+- **OOM errors**: Decrease `gpu_memory_utilization` in the YAML stage config, or use a smaller `max_num_batched_tokens`.
+- **Custom image sizes**: Use the `--height` and `--width` flags with `text2img` (multiples of 16 recommended). For `img2img`, the output size follows the input image's dimensions.
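+
+To follow the multiple-of-16 recommendation, a requested dimension can be rounded before it is passed to `--height` or `--width`. `round_to_multiple` is a hypothetical helper, not part of `end2end.py`, and rounding to the nearest multiple (rather than always down) is an assumption.
+
+```python
+def round_to_multiple(value, multiple=16):
+    """Round a positive integer to the nearest multiple, never below one multiple."""
+    # Integer arithmetic avoids float rounding surprises at exact halves.
+    rounded = (value + multiple // 2) // multiple * multiple
+    return max(multiple, rounded)
+
+
+print(round_to_multiple(1000))  # 1008
+print(round_to_multiple(1024))  # 1024
+```
+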
-- `--model`: Model used. Default is: tencent/HunyuanImage-3.0-Instruct (Optional).
-- `--image`: Path to input image (required).
-- `--prompt`: Text description used to guide image understanding (required).
+
+Approximate VRAM usage by stage:
+
+| Stage | VRAM (approx) |
+| :---------------- | :------------------- |
+| Stage 0 (AR) | ~15 GiB + KV Cache |
+| Stage 1 (DiT) | ~30 GiB |
+| Total (8-GPU) | ~45 GiB + KV Cache |
diff --git a/examples/offline_inference/hunyuan_image3/end2end.py b/examples/offline_inference/hunyuan_image3/end2end.py
new file mode 100644
index 00000000000..3c1ae386678
--- /dev/null
+++ b/examples/offline_inference/hunyuan_image3/end2end.py
@@ -0,0 +1,262 @@
+"""
+HunyuanImage-3.0-Instruct unified end-to-end inference script.
+
+Supports all modalities through a single entry point:
+  - text2img:  Text → AR → DiT → Image
+  - img2img:   Text+Image → AR → DiT → Edited Image (IT2I)
+  - img2text:  Image+Text → AR → Text description (I2T)
+  - text2text: Text → AR → Text (comprehension, no image)
+
+Usage:
+    python end2end.py --modality text2img --prompts "A cute cat"
+    python end2end.py --modality img2img --image-path input.png --prompts "Make it snowy"
+    python end2end.py --modality img2text --image-path input.png --prompts "Describe this image"
+"""
+
+import argparse
+import os
+
+from vllm_omni.diffusion.models.hunyuan_image3.system_prompt import (
+    get_system_prompt,
+)
+from vllm_omni.entrypoints.omni import Omni
+from vllm_omni.inputs.data import OmniPromptType
+
+# task → (sys_type, bot_task, trigger_tag)
+_TASK_PRESETS: dict[str, tuple[str, str | None, str | None]] = {
+    "t2t": ("en_unified", None, None),
+    "i2t": ("en_unified", None, None),
+    "it2i_think": ("en_unified", "think", "<think>"),
+    "it2i_recaption": ("en_unified", "recaption", "<recaption>"),
+    "t2i_think": ("en_unified", "think", "<think>"),
+    "t2i_recaption": ("en_unified", "recaption", "<recaption>"),
+    "t2i_vanilla": ("en_vanilla", "image", None),
+}
+
+# Modality → build_prompt task mapping
+_MODALITY_TASK_MAP = {
+    "text2img": "t2i_think",
+    "img2img": "it2i_think",
+    "img2text": "i2t",
+    "text2text": "t2t",
+}
+
+
+def build_prompt(
+    user_prompt: str,
+    task: str = "it2i_think",
+    sys_type: str | None = None,
+    custom_system_prompt: str | None = None,
+) -> str:
+    """Build a HunyuanImage-3.0 prompt using pretrain template format."""
+    if task not in _TASK_PRESETS:
+        raise ValueError(f"Unknown task {task!r}. Choose from: {sorted(_TASK_PRESETS)}")
+
+    preset_sys_type, preset_bot_task, trigger_tag = _TASK_PRESETS[task]
+    effective_sys_type = sys_type or preset_sys_type
+
+    system_prompt = get_system_prompt(effective_sys_type, preset_bot_task, custom_system_prompt)
+    sys_text = system_prompt.strip() if system_prompt else ""
+
+    has_image_input = task.startswith("i2t") or task.startswith("it2i")
+
+    parts = ["<|startoftext|>"]
+    if sys_text:
+        parts.append(sys_text)
+    if has_image_input:
+        parts.append("<img>")
+    if trigger_tag:
+        parts.append(trigger_tag)
+    parts.append(user_prompt)
+
+    return "".join(parts)
+
+
+# Modality → default stage config
+_MODALITY_DEFAULT_CONFIG = {
+    "text2img": "hunyuan_image3_t2i.yaml",
+    "img2img": "hunyuan_image3_it2i.yaml",
+    "img2text": "hunyuan_image3_i2t.yaml",
+    "text2text": "hunyuan_image3_t2t.yaml",
+}
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="HunyuanImage-3.0-Instruct end-to-end inference.")
+    parser.add_argument(
+        "--model",
+        default="tencent/HunyuanImage-3.0-Instruct",
+        help="Model name or local path.",
+    )
+    parser.add_argument(
+        "--modality",
+        default="text2img",
+        choices=["text2img", "img2img", "img2text", "text2text"],
+        help="Modality mode to control stage execution.",
+    )
+    parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.")
+    parser.add_argument(
+        "--image-path",
+        type=str,
+        default=None,
+        help="Path to input image (for img2img/img2text).",
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default=".",
+        help="Output directory to save results.",
+    )
+
+    # Generation parameters
+    parser.add_argument("--steps", type=int, default=50, help="Number of inference steps.")
+    parser.add_argument("--guidance-scale", type=float, default=5.0, help="Classifier-free guidance scale.")
+    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
+    parser.add_argument("--height", type=int, default=1024, help="Output image height.")
+    parser.add_argument("--width", type=int, default=1024, help="Output image width.")
+
+    # Prompt configuration
+    parser.add_argument(
+        "--bot-task",
+        type=str,
+        default=None,
+        help="Override prompt task (e.g. it2i_think, t2i_recaption). Default: auto from modality.",
+    )
+    parser.add_argument(
+        "--sys-type",
+        type=str,
+        default=None,
+        help="Override system prompt type (e.g. en_unified, en_vanilla).",
+    )
+
+    # Omni init args
+    parser.add_argument("--stage-configs-path", type=str, default=None, help="Custom stage config YAML path.")
+    parser.add_argument("--log-stats", action="store_true", default=False)
+    parser.add_argument("--init-timeout", type=int, default=300, help="Initialization timeout in seconds.")
+    parser.add_argument("--enforce-eager", action="store_true", help="Disable torch.compile.")
+
+    return parser.parse_args()
+
+
+def main():
+    args = parse_args()
+    os.makedirs(args.output, exist_ok=True)
+
+    # Determine task for prompt formatting
+    task = args.bot_task or _MODALITY_TASK_MAP[args.modality]
+
+    # Determine stage config
+    stage_configs_path = args.stage_configs_path or _MODALITY_DEFAULT_CONFIG[args.modality]
+
+    # Build Omni
+    omni_kwargs = {
+        "model": args.model,
+        "stage_configs_path": stage_configs_path,
+        "log_stats": args.log_stats,
+        "init_timeout": args.init_timeout,
+        "enforce_eager": args.enforce_eager,
+    }
+    if args.modality in ("text2img", "img2img"):
+        omni_kwargs["mode"] = "text-to-image"
+
+    omni = Omni(**omni_kwargs)
+
+    # Prepare prompts (fall back to a default when none are given)
+    prompts = args.prompts
+    if not prompts:
+        print("[Info] No prompts provided, using default.")
+        prompts = ["A cute cat"]
+
+    # Load image if needed
+    input_image = None
+    if args.modality in ("img2img", "img2text"):
+        if not args.image_path or not os.path.exists(args.image_path):
+            raise ValueError(f"--image-path required for {args.modality}, got: {args.image_path}")
+        from PIL import Image
+
+        input_image = Image.open(args.image_path).convert("RGB")
+
+    # Format prompts
+    formatted_prompts: list[OmniPromptType] = []
+    for p in prompts:
+        formatted_text = build_prompt(p, task=task, sys_type=args.sys_type)
+
+        prompt_dict: dict = {"prompt": formatted_text}
+
+        if args.modality == "text2img":
+            prompt_dict["modalities"] = ["image"]
+        elif args.modality == "img2img":
+            prompt_dict["modalities"] = ["image"]
+            prompt_dict["multi_modal_data"] = {"image": input_image}
+            prompt_dict["height"] = input_image.height
+            prompt_dict["width"] = input_image.width
+        elif args.modality == "img2text":
+            prompt_dict["modalities"] = ["text"]
+            prompt_dict["multi_modal_data"] = {"image": input_image}
+        elif args.modality == "text2text":
+            prompt_dict["modalities"] = ["text"]
+
+        formatted_prompts.append(prompt_dict)
+
+    # Build sampling params from defaults
+    params_list = list(omni.default_sampling_params_list)
+
+    # Override diffusion params if applicable
+    from vllm_omni.inputs.data import OmniDiffusionSamplingParams
+
+    for sp in params_list:
+        if isinstance(sp, OmniDiffusionSamplingParams):
+            sp.num_inference_steps = args.steps
+            sp.guidance_scale = args.guidance_scale
+            if args.seed is not None:
+                sp.seed = args.seed
+            if args.modality == "text2img":
+                sp.height = args.height
+                sp.width = args.width
+
+    # Print configuration
+    print(f"\n{'=' * 60}")
+    print("HunyuanImage-3.0 Generation Configuration:")
+    print(f" Model: {args.model}")
+    print(f" Modality: {args.modality}")
+    print(f" Stage config: {stage_configs_path}")
+    print(f" Num stages: {omni.num_stages}")
+    if args.modality in ("text2img", "img2img"):
+        print(f" Inference steps: {args.steps}")
+        print(f" Guidance scale: {args.guidance_scale}")
+        print(f" Seed: {args.seed}")
+        if args.modality == "text2img":
+            print(f" Output size: {args.width}x{args.height}")
+    if args.image_path:
+        print(f" Input image: {args.image_path}")
+    print(f" Prompts: {prompts}")
+    print(f"{'=' * 60}\n")
+
+    # Generate
+    omni_outputs = list(omni.generate(prompts=formatted_prompts, sampling_params_list=params_list))
+
+    # Process outputs
+    img_idx = 0
+    for req_output in omni_outputs:
+        # Text output (AR stage or text-only)
+        ro = getattr(req_output, "request_output", None)
+        if ro and getattr(ro, "outputs", None):
+            txt = "".join(getattr(o, "text", "") or "" for o in ro.outputs)
+            if txt:
+                print(f"[Output] Text:\n{txt}")
+
+        # Image output (DiT stage)
+        images = getattr(req_output, "images", None)
+        if not images and ro and hasattr(ro, "images"):
+            images = ro.images
+
+        if images:
+            for j, img in enumerate(images):
+                save_path = os.path.join(args.output, f"output_{img_idx}_{j}.png")
+                img.save(save_path)
+                print(f"[Output] Saved image to {save_path}")
+            img_idx += 1
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/offline_inference/hunyuan_image3/image_to_text.py b/examples/offline_inference/hunyuan_image3/image_to_text.py
deleted file mode 100644
index d40134ac0a0..00000000000
--- a/examples/offline_inference/hunyuan_image3/image_to_text.py
+++ /dev/null
@@ -1,92 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-import argparse
-import os
-
-from PIL import Image
-
-from vllm_omni.entrypoints.omni import Omni
-
-"""
-The tencent/HunyuanImage-3.0-Instruct base model uses the tencent/Hunyuan-A13B-Instruct backbone. It utilizes two tokenizer delimiter templates:
-
-1) Pretrained template (default for gen_text mode), which concatenates system, image
-   tokens, and user question WITHOUT role delimiters:
-"<|startoftext|>{system_prompt}{image_tokens}{user_question}"
-
-   Example (before image token expansion):
-"<|startoftext|>You are an assistant that understands images and outputs text.<img>Describe the content of the picture."
-
-2) Instruct template, which uses explicit role prefixes and separators.
-"""
-
-
-def parse_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Generate text from image using HunyuanImage-3.0-Instruct.")
-    parser.add_argument(
-        "--model",
-        default="tencent/HunyuanImage-3.0-Instruct",
-        help="Model name or local path.",
-    )
-    parser.add_argument(
-        "--image",
-        type=str,
-        required=True,
-        help="Path to input image file (PNG, JPG, etc.).",
-    )
-    parser.add_argument(
-        "--prompt",
-        type=str,
-        required=True,
-        help="Pretrain template prompt: <|startoftext|>{system}<img>{question}",
-    )
-    parser.add_argument(
-        "--enable-diffusion-pipeline-profiler",
-        action="store_true",
-        help="Enable diffusion pipeline profiler to display stage durations.",
-    )
-    return parser.parse_args()
-
-
-def load_image(image_path: str) -> Image.Image:
-    """Load an image from file path."""
-    if not os.path.exists(image_path):
-        raise FileNotFoundError(f"Image file not found: {image_path}")
-    return Image.open(image_path).convert("RGB")
-
-
-def main(args: argparse.Namespace) -> None:
-    omni = Omni(
-        model=args.model,
-        enable_diffusion_pipeline_profiler=args.enable_diffusion_pipeline_profiler,
-        mode="image-to-text",
-    )
-
-    prompt = "<|startoftext|>You are an assistant that understands images and outputs text.<img>" + args.prompt
-
-    prompt_dict = {
-        "prompt": prompt,
-        "modalities": ["text"],
-    }
-
-    # Add image input if provided
-    if args.image:
-        if not os.path.exists(args.image):
-            raise FileNotFoundError(f"Input image not found: {args.image}")
-
-        input_image = load_image(args.image)
-        prompt_dict["multi_modal_data"] = {"image": input_image}
-
-    prompts = [prompt_dict]
-    omni_outputs = omni.generate(prompts=prompts)
-
-    prompt_text = omni_outputs[0].request_output.prompt
-    generated_text = omni_outputs[0].request_output.outputs[0].text
-    print(f"Prompt: {prompt_text}")
-    print(f"Text: {generated_text}")
-
-
-if __name__ == "__main__":
-    args = parse_args()
-    main(args)
diff --git a/examples/offline_inference/hunyuan_image3/prompt_utils.py b/examples/offline_inference/hunyuan_image3/prompt_utils.py
deleted file mode 100644
index a5ef8e15369..00000000000
--- a/examples/offline_inference/hunyuan_image3/prompt_utils.py
+++ /dev/null
@@ -1,88 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-Prompt construction utilities for HunyuanImage-3.0-Instruct examples.
-
-Wraps system_prompt.get_system_prompt() with task-aware presets so that
-examples and tests don't need to manually concatenate system prompts,
-<img>, <think>, and <recaption> tags.
-
-Usage:
-    from prompt_utils import build_prompt
-
-    # IT2I (image editing, think+recaption mode)
-    prompt = build_prompt("Make the petals neon pink", task="it2i_think")
-
-    # I2T (image understanding)
-    prompt = build_prompt("Describe the content of the picture.", task="i2t")
-"""
-
-from __future__ import annotations
-
-from vllm_omni.diffusion.models.hunyuan_image3.system_prompt import (
-    get_system_prompt,
-)
-
-# task → (sys_type, bot_task, trigger_tag)
-# trigger_tag: "<think>", "<recaption>", or None
-_TASK_PRESETS: dict[str, tuple[str, str | None, str | None]] = {
-    # Pure text generation (text → text, no image)
-    "t2t": ("en_unified", None, None),
-    # Image understanding (image → text)
-    "i2t": ("en_unified", None, None),
-    # Image editing (image+text → image), think+recaption mode
-    "it2i_think": ("en_unified", "think", "<think>"),
-    # Image editing, recaption-only mode
-    "it2i_recaption": ("en_unified", "recaption", "<recaption>"),
-    # Text-to-image, think mode
-    "t2i_think": ("en_unified", "think", "<think>"),
-    # Text-to-image, recaption mode
-    "t2i_recaption": ("en_unified", "recaption", "<recaption>"),
-    # Text-to-image, vanilla (no CoT)
-    "t2i_vanilla": ("en_vanilla", "image", None),
-}
-
-
-def build_prompt(
-    user_prompt: str,
-    task: str = "it2i_think",
-    sys_type: str | None = None,
-    custom_system_prompt: str | None = None,
-) -> str:
-    """Build a complete HunyuanImage-3.0 prompt with auto-selected system
-    prompt and mode trigger tags.
-
-    Args:
-        user_prompt: The user's raw instruction or question.
-        task: One of the preset task keys (see _TASK_PRESETS).
-        sys_type: Override the preset's sys_type for get_system_prompt().
-        custom_system_prompt: Custom system prompt text (used when
-            sys_type="custom").
-
-    Returns:
-        Fully formatted prompt string ready for Omni.generate().
-    """
-    if task not in _TASK_PRESETS:
-        raise ValueError(f"Unknown task {task!r}. Choose from: {sorted(_TASK_PRESETS)}")
-
-    preset_sys_type, preset_bot_task, trigger_tag = _TASK_PRESETS[task]
-    effective_sys_type = sys_type or preset_sys_type
-
-    system_prompt = get_system_prompt(effective_sys_type, preset_bot_task, custom_system_prompt)
-    sys_text = system_prompt.strip() if system_prompt else ""
-
-    has_image_input = task.startswith("i2t") or task.startswith("it2i")
-
-    parts = ["<|startoftext|>"]
-    if sys_text:
-        parts.append(sys_text)
-    # Instruct conversation template: \n\nUser: ... \n\nAssistant:
-    parts.append("\n\nUser: ")
-    if has_image_input:
-        parts.append("<img>")
-    parts.append(user_prompt)
-    parts.append("\n\nAssistant: ")
-    if trigger_tag:
-        parts.append(trigger_tag)
-
-    return "".join(parts)
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
new file mode 100644
index 00000000000..f0797c63270
--- /dev/null
+++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
@@ -0,0 +1,96 @@
+# Stage config for running Hunyuan-Image3.0 with AR→DiT KV reuse.
+# Stage 0: AR Model (vLLM implementation)
+# Stage 1: DiT Model (diffusion)
+#
+# text-to-image flow: AR (stage 0) → KV transfer → DiT (stage 1)
+# image-to-text flow: AR (stage 0) only
+#
+# Compared to hunyuan_image3_t2i.yaml, this config:
+# 1. Enables both stages [0, 1] for text-to-image (AR prefill + DiT denoising)
+# 2. Adds omni_kv_config to send/receive KV cache between stages
+
+# The following config has been verified on 8x L40S-48G GPUs (4 for AR + 4 for DiT).
+stage_args:
+  - stage_id: 0
+    stage_type: llm  # Use llm stage type for AR stages
+    runtime:
+      process: true  # Run this stage in a separate process
+      devices: "0,1,2,3"  # AR stage uses GPUs 0-3
+    engine_args:
+      model_stage: AR
+      max_num_seqs: 1
+      model_arch: HunyuanImage3ForCausalMM
+      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
+      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
+      gpu_memory_utilization: 0.9
+      enforce_eager: true  # Only eager mode is currently supported
+      trust_remote_code: true
+      engine_output_type: latent
+      enable_prefix_caching: false
+      max_num_batched_tokens: 32768
+      tensor_parallel_size: 4
+      pipeline_parallel_size: 1
+      hf_overrides:
+        rope_parameters:
+          mrope_section: [0, 32, 32]
+          rope_type: default
+      omni_kv_config:
+        need_send_cache: true
+        kv_transfer_criteria:
+          type: prefill_finished  # Send KV cache after AR prefill completes
+    is_comprehension: true
+    final_output: true
+    final_output_type: text
+    default_sampling_params:
+      temperature: 0.0
+      top_p: 1.0
+      top_k: -1
+      max_tokens: 2048
+      seed: 42
+      detokenize: True
+      repetition_penalty: 1.1
+  - stage_id: 1
+    stage_type: diffusion
+    runtime:
+      process: true
+      devices: "4,5,6,7"  # DiT stage uses GPUs 4-7
+      max_batch_size: 1
+    engine_args:
+      model_stage: diffusion
+      enforce_eager: true
+      distributed_executor_backend: "mp"
+      vae_use_slicing: false
+      vae_use_tiling: false
+      cache_backend: null
+      cache_config: null
+      enable_cache_dit_summary: false
+      omni_kv_config:
+        need_recv_cache: true  # Receive AR KV cache from stage 0
+      parallel_config:
+        pipeline_parallel_size: 1
+        data_parallel_size: 1
+        tensor_parallel_size: 4
+        enable_expert_parallel: false
+        sequence_parallel_size: 1
+        ulysses_degree: 1
+        ring_degree: 1
+        cfg_parallel_size: 1
+        vae_patch_parallel_size: 1
+        use_hsdp: false
+        hsdp_shard_size: -1
+        hsdp_replicate_size: 1
+    engine_input_source: [0]  # Receive input (including KV) from stage 0
+    final_output: true
+    final_output_type: image
+
+# Top-level runtime config: windows, edges, and connectors
+runtime:
+  enabled: true
+  defaults:
+    window_size: -1  # Trigger downstream only after full upstream completion
+    max_inflight: 1  # Process serially within each stage
+
+  edges:
+    - from: 0
+      to: 1
+      window_size: -1