diff --git a/examples/offline_inference/hunyuan_image3/README.md b/examples/offline_inference/hunyuan_image3/README.md
index da28a44d9e6..3cd8fa01b2e 100644
--- a/examples/offline_inference/hunyuan_image3/README.md
+++ b/examples/offline_inference/hunyuan_image3/README.md
@@ -1,25 +1,161 @@
-# HunyuanImage-3.0 Image-to-Text Inference
+# HunyuanImage-3.0-Instruct
 
-This example demonstrates how to run HunyuanImage-3.0 Image-to-Text with the vLLM-Omni.
+## Set up
 
-## Local CLI Usage
+Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
 
-Download the example image:
+## Run examples
+
+**Note**: These examples work with the default configuration on **8x NVIDIA L40S (48GB)**. For different GPU setups, modify the stage configuration to adjust device allocation and memory utilization.
+
+Change into the `hunyuan_image3` directory:
+
+```bash
+cd examples/offline_inference/hunyuan_image3
+```
+
+### Modality Control
+
+HunyuanImage-3.0-Instruct supports multiple modality modes. You can control the mode using the `--modality` argument:
+
+#### Text to Image (text2img)
+
+- **Pipeline**: Text → AR (CoT + latent tokens) → DiT (denoise) → VAE Decode → Image
+- **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
+- **KV Transfer**: AR sends KV cache to DiT for conditioned generation
+- **Default Config**: `hunyuan_image3_t2i.yaml`
+
+```bash
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+                  --modality text2img \
+                  --prompts "A cute cat sitting on a windowsill watching the sunset"
+```
+
+#### Image to Image (img2img)
+
+- **Pipeline**: Image + Text → AR (CoT + recaption + latent) → DiT → Edited Image
+- **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
+- **KV Transfer**: AR sends KV cache to DiT
+- **Default Config**: `hunyuan_image3_it2i.yaml`
+
+```bash
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+                  --modality img2img \
+                  --image-path /path/to/image.png \
+                  --prompts "Make the petals neon pink"
+```
+
+#### Image to Text (img2text)
+
+- **Pipeline**: Image + Question → AR → Text description
+- **Stages Used**: Stage 0 (AR) only
+- **Default Config**: `hunyuan_image3_i2t.yaml`
 
 ```bash
-wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+                  --modality img2text \
+                  --image-path /path/to/image.jpg \
+                  --prompts "Describe the content of the picture."
 ```
 
-Run example:
+#### Text to Text (text2text)
+
+- **Pipeline**: Text → AR → Text
+- **Stages Used**: Stage 0 (AR) only
+- **Default Config**: `hunyuan_image3_t2t.yaml`
 
 ```bash
-python image_to_text.py \
-  --image cherry_blossom.jpg \
-  --prompt "<|startoftext|>You are an assistant that understands images and outputs text.<img>Describe the content of the picture."
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+                  --modality text2text \
+                  --prompts "What is the capital of France?"
 ```
 
-Key arguments:
+### Inference Steps & Guidance
+
+Control generation quality for image modalities:
+
+```bash
+python end2end.py --modality text2img \
+                  --steps 50 \
+                  --guidance-scale 5.0 \
+                  --height 1024 --width 1024 \
+                  --prompts "A photo-realistic sunset over the ocean"
+```
+
+### Key Arguments
+
+#### 📌 Command Line Arguments (end2end.py)
+
+| Argument               | Type   | Default                              | Description                                                  |
+| :--------------------- | :----- | :----------------------------------- | :----------------------------------------------------------- |
+| `--model`              | string | `tencent/HunyuanImage-3.0-Instruct` | Model path or name                                           |
+| `--modality`           | choice | `text2img`                           | Modality: `text2img`, `img2img`, `img2text`, `text2text`     |
+| `--prompts`            | list   | `None`                               | Input text prompts                                           |
+| `--image-path`         | string | `None`                               | Input image path (for `img2img`/`img2text`)                  |
+| `--output`             | string | `.`                                  | Output directory for saved images                            |
+| `--steps`              | int    | `50`                                 | Number of inference steps                                    |
+| `--guidance-scale`     | float  | `5.0`                                | Classifier-free guidance scale                               |
+| `--seed`               | int    | `42`                                 | Random seed                                                  |
+| `--height`             | int    | `1024`                               | Output image height                                          |
+| `--width`              | int    | `1024`                               | Output image width                                           |
+| `--bot-task`           | string | auto                                 | Override prompt task (e.g. `it2i_think`, `t2i_recaption`)    |
+| `--sys-type`           | string | auto                                 | Override system prompt type (e.g. `en_unified`, `en_vanilla`) |
+| `--stage-configs-path` | string | auto                                 | Custom stage config YAML path                                |
+| `--enforce-eager`      | flag   | `False`                              | Disable torch.compile                                        |
+| `--init-timeout`       | int    | `300`                                | Initialization timeout (seconds)                             |
+
+------
+
+#### ⚙️ Stage Configurations
+
+| Config YAML                         | Modality  | Stages | GPUs   | Description                           |
+| :---------------------------------- | :-------- | :----- | :----- | :------------------------------------ |
+| `hunyuan_image3_t2i.yaml`           | text2img  | 2      | 8      | T2I with AR→DiT, 4 GPUs each         |
+| `hunyuan_image3_it2i.yaml`          | img2img   | 2      | 8      | IT2I with AR→DiT, 4 GPUs each        |
+| `hunyuan_image3_i2t.yaml`           | img2text  | 1      | 4      | I2T (AR only)                         |
+| `hunyuan_image3_t2t.yaml`           | text2text | 1      | 4      | T2T (AR only)                         |
+| `hunyuan_image3_t2i_2gpu.yaml`      | text2img  | 2      | 2      | T2I for 2-GPU setups                  |
+| `hunyuan_image3_moe.yaml`           | text2img  | 2      | 8      | T2I with MoE AR→DiT KV reuse          |
+| `hunyuan_image3_moe_dit_2gpu_fp8.yaml` | text2img | 2   | 2      | T2I with FP8 quantization             |
+
+------
+
+## Using MoE Config
+
+The `hunyuan_image3_moe.yaml` config enables AR→DiT KV cache reuse with 8 GPUs (4 for AR + 4 for DiT).
+
+```bash
+python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
+                  --modality text2img \
+                  --stage-configs-path hunyuan_image3_moe.yaml \
+                  --prompts "A cute cat"
+```
+
+------
+
+## Prompt Format
+
+HunyuanImage-3.0 uses a pretrain template format:
+
+```
+<|startoftext|>{system_prompt}{<img>}{trigger_tag}{user_prompt}
+```
+
+- `<img>`: Placeholder for each input image (auto-inserted by `build_prompt()` in `end2end.py`)
+- Trigger tags: `<think>` (CoT), `<recaption>` (recaptioning)
+- System prompt: Auto-selected based on task
+
+The `build_prompt()` function in `end2end.py` handles this formatting automatically.
+
+------
+
+## FAQ
+
+- **OOM errors**: Decrease `gpu_memory_utilization` in the YAML stage config, or use a smaller `max_num_batched_tokens`.
+- **Custom image sizes**: Use `--height` and `--width` flags (multiples of 16 recommended).
 
-- `--model`: Model used. Default is: tencent/HunyuanImage-3.0-Instruct (Optional).
-- `--image`: Path to input image (required).
-- `--prompt`: Text description used to guide image understanding (required).
+| Stage             | VRAM (approx)        |
+| :---------------- | :------------------- |
+| Stage 0 (AR)      | ~15 GiB + KV Cache   |
+| Stage 1 (DiT)     | ~30 GiB              |
+| Total (8-GPU)     | ~45 GiB + KV Cache   |
diff --git a/examples/offline_inference/hunyuan_image3/end2end.py b/examples/offline_inference/hunyuan_image3/end2end.py
new file mode 100644
index 00000000000..3c1ae386678
--- /dev/null
+++ b/examples/offline_inference/hunyuan_image3/end2end.py
@@ -0,0 +1,262 @@
+"""
+HunyuanImage-3.0-Instruct unified end-to-end inference script.
+
+Supports all modalities through a single entry point:
+  - text2img:  Text → AR → DiT → Image
+  - img2img:   Text+Image → AR → DiT → Edited Image (IT2I)
+  - img2text:  Image+Text → AR → Text description (I2T)
+  - text2text: Text → AR → Text (comprehension, no image)
+
+Usage:
+    python end2end.py --modality text2img --prompts "A cute cat"
+    python end2end.py --modality img2img --image-path input.png --prompts "Make it snowy"
+    python end2end.py --modality img2text --image-path input.png --prompts "Describe this image"
+"""
+
+import argparse
+import os
+
+from vllm_omni.diffusion.models.hunyuan_image3.system_prompt import (
+    get_system_prompt,
+)
+from vllm_omni.entrypoints.omni import Omni
+from vllm_omni.inputs.data import OmniPromptType
+
+# task → (sys_type, bot_task, trigger_tag)
+_TASK_PRESETS: dict[str, tuple[str, str | None, str | None]] = {
+    "t2t": ("en_unified", None, None),
+    "i2t": ("en_unified", None, None),
+    "it2i_think": ("en_unified", "think", "<think>"),
+    "it2i_recaption": ("en_unified", "recaption", "<recaption>"),
+    "t2i_think": ("en_unified", "think", "<think>"),
+    "t2i_recaption": ("en_unified", "recaption", "<recaption>"),
+    "t2i_vanilla": ("en_vanilla", "image", None),
+}
+
+# Modality → default task key in _TASK_PRESETS
+_MODALITY_TASK_MAP = {
+    "text2img": "t2i_think",
+    "img2img": "it2i_think",
+    "img2text": "i2t",
+    "text2text": "t2t",
+}
+
+
+def build_prompt(
+    user_prompt: str,
+    task: str = "it2i_think",
+    sys_type: str | None = None,
+    custom_system_prompt: str | None = None,
+) -> str:
+    """Build a HunyuanImage-3.0 prompt using pretrain template format."""
+    if task not in _TASK_PRESETS:
+        raise ValueError(f"Unknown task {task!r}. Choose from: {sorted(_TASK_PRESETS)}")
+
+    preset_sys_type, preset_bot_task, trigger_tag = _TASK_PRESETS[task]
+    effective_sys_type = sys_type or preset_sys_type
+
+    system_prompt = get_system_prompt(effective_sys_type, preset_bot_task, custom_system_prompt)
+    sys_text = system_prompt.strip() if system_prompt else ""
+
+    has_image_input = task.startswith("i2t") or task.startswith("it2i")
+
+    parts = ["<|startoftext|>"]
+    if sys_text:
+        parts.append(sys_text)
+    if has_image_input:
+        parts.append("<img>")
+    if trigger_tag:
+        parts.append(trigger_tag)
+    parts.append(user_prompt)
+
+    return "".join(parts)
+
+
+# Modality → default stage config
+_MODALITY_DEFAULT_CONFIG = {
+    "text2img": "hunyuan_image3_t2i.yaml",
+    "img2img": "hunyuan_image3_it2i.yaml",
+    "img2text": "hunyuan_image3_i2t.yaml",
+    "text2text": "hunyuan_image3_t2t.yaml",
+}
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="HunyuanImage-3.0-Instruct end-to-end inference.")
+    parser.add_argument(
+        "--model",
+        default="tencent/HunyuanImage-3.0-Instruct",
+        help="Model name or local path.",
+    )
+    parser.add_argument(
+        "--modality",
+        default="text2img",
+        choices=["text2img", "img2img", "img2text", "text2text"],
+        help="Modality mode to control stage execution.",
+    )
+    parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.")
+    parser.add_argument(
+        "--image-path",
+        type=str,
+        default=None,
+        help="Path to input image (for img2img/img2text).",
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default=".",
+        help="Output directory to save results.",
+    )
+
+    # Generation parameters
+    parser.add_argument("--steps", type=int, default=50, help="Number of inference steps.")
+    parser.add_argument("--guidance-scale", type=float, default=5.0, help="Classifier-free guidance scale.")
+    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
+    parser.add_argument("--height", type=int, default=1024, help="Output image height.")
+    parser.add_argument("--width", type=int, default=1024, help="Output image width.")
+
+    # Prompt configuration
+    parser.add_argument(
+        "--bot-task",
+        type=str,
+        default=None,
+        help="Override prompt task (e.g. it2i_think, t2i_recaption). Default: auto from modality.",
+    )
+    parser.add_argument(
+        "--sys-type",
+        type=str,
+        default=None,
+        help="Override system prompt type (e.g. en_unified, en_vanilla).",
+    )
+
+    # Omni init args
+    parser.add_argument("--stage-configs-path", type=str, default=None, help="Custom stage config YAML path.")
+    parser.add_argument("--log-stats", action="store_true", default=False)
+    parser.add_argument("--init-timeout", type=int, default=300, help="Initialization timeout in seconds.")
+    parser.add_argument("--enforce-eager", action="store_true", help="Disable torch.compile.")
+
+    return parser.parse_args()
+
+
+def main():
+    args = parse_args()
+    os.makedirs(args.output, exist_ok=True)
+
+    # Determine task for prompt formatting
+    task = args.bot_task or _MODALITY_TASK_MAP[args.modality]
+
+    # Determine stage config
+    stage_configs_path = args.stage_configs_path or _MODALITY_DEFAULT_CONFIG[args.modality]
+
+    # Build Omni
+    omni_kwargs = {
+        "model": args.model,
+        "stage_configs_path": stage_configs_path,
+        "log_stats": args.log_stats,
+        "init_timeout": args.init_timeout,
+        "enforce_eager": args.enforce_eager,
+    }
+    if args.modality in ("text2img", "img2img"):
+        omni_kwargs["mode"] = "text-to-image"
+
+    omni = Omni(**omni_kwargs)
+
+    # Prepare prompts
+    prompts = args.prompts
+    if not prompts:
+        print("[Info] No prompts provided, using default.")
+        prompts = ["A cute cat"]
+
+    # Load image if needed
+    input_image = None
+    if args.modality in ("img2img", "img2text"):
+        if not args.image_path or not os.path.exists(args.image_path):
+            raise ValueError(f"--image-path required for {args.modality}, got: {args.image_path}")
+        from PIL import Image
+
+        input_image = Image.open(args.image_path).convert("RGB")
+
+    # Format prompts
+    formatted_prompts: list[OmniPromptType] = []
+    for p in prompts:
+        formatted_text = build_prompt(p, task=task, sys_type=args.sys_type)
+
+        prompt_dict: dict = {"prompt": formatted_text}
+
+        if args.modality == "text2img":
+            prompt_dict["modalities"] = ["image"]
+        elif args.modality == "img2img":
+            prompt_dict["modalities"] = ["image"]
+            prompt_dict["multi_modal_data"] = {"image": input_image}
+            prompt_dict["height"] = input_image.height
+            prompt_dict["width"] = input_image.width
+        elif args.modality == "img2text":
+            prompt_dict["modalities"] = ["text"]
+            prompt_dict["multi_modal_data"] = {"image": input_image}
+        elif args.modality == "text2text":
+            prompt_dict["modalities"] = ["text"]
+
+        formatted_prompts.append(prompt_dict)
+
+    # Build sampling params from defaults
+    params_list = list(omni.default_sampling_params_list)
+
+    # Override diffusion params if applicable
+    from vllm_omni.inputs.data import OmniDiffusionSamplingParams
+
+    for sp in params_list:
+        if isinstance(sp, OmniDiffusionSamplingParams):
+            sp.num_inference_steps = args.steps
+            sp.guidance_scale = args.guidance_scale
+            if args.seed is not None:
+                sp.seed = args.seed
+            if args.modality == "text2img":
+                sp.height = args.height
+                sp.width = args.width
+
+    # Print configuration
+    print(f"\n{'=' * 60}")
+    print("HunyuanImage-3.0 Generation Configuration:")
+    print(f"  Model: {args.model}")
+    print(f"  Modality: {args.modality}")
+    print(f"  Stage config: {stage_configs_path}")
+    print(f"  Num stages: {omni.num_stages}")
+    if args.modality in ("text2img", "img2img"):
+        print(f"  Inference steps: {args.steps}")
+        print(f"  Guidance scale: {args.guidance_scale}")
+        print(f"  Seed: {args.seed}")
+    if args.modality == "text2img":
+        print(f"  Output size: {args.width}x{args.height}")
+    if args.image_path:
+        print(f"  Input image: {args.image_path}")
+    print(f"  Prompts: {prompts}")
+    print(f"{'=' * 60}\n")
+
+    # Generate
+    omni_outputs = list(omni.generate(prompts=formatted_prompts, sampling_params_list=params_list))
+
+    # Process outputs
+    img_idx = 0
+    for req_output in omni_outputs:
+        # Text output (AR stage or text-only)
+        ro = getattr(req_output, "request_output", None)
+        if ro and getattr(ro, "outputs", None):
+            txt = "".join(getattr(o, "text", "") or "" for o in ro.outputs)
+            if txt:
+                print(f"[Output] Text:\n{txt}")
+
+        # Image output (DiT stage)
+        images = getattr(req_output, "images", None)
+        if not images and ro and hasattr(ro, "images"):
+            images = ro.images
+
+        if images:
+            for j, img in enumerate(images):
+                save_path = os.path.join(args.output, f"output_{img_idx}_{j}.png")
+                img.save(save_path)
+                print(f"[Output] Saved image to {save_path}")
+            img_idx += 1
+
+
+if __name__ == "__main__":
+    main()
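
Review note on `build_prompt()` in `end2end.py`: the assembly order can be checked in isolation with a small sketch like the one below. It copies only a subset of `_TASK_PRESETS` from the hunk above, and `fake_system_prompt` is a hypothetical stand-in for `get_system_prompt()`, whose real output is not shown in this diff.

```python
# Sketch of the pretrain-template assembly performed by build_prompt().
# TASK_PRESETS is a subset of _TASK_PRESETS from end2end.py.
TASK_PRESETS = {
    "t2t": ("en_unified", None, None),
    "i2t": ("en_unified", None, None),
    "it2i_think": ("en_unified", "think", "<think>"),
    "t2i_vanilla": ("en_vanilla", "image", None),
}


def fake_system_prompt(sys_type, bot_task):
    # Hypothetical placeholder text; the real system prompts live in vllm_omni.
    return f"[{sys_type}/{bot_task}]"


def assemble(user_prompt, task):
    sys_type, bot_task, trigger_tag = TASK_PRESETS[task]
    # Only tasks that take an image as input get the <img> placeholder.
    has_image_input = task.startswith(("i2t", "it2i"))
    parts = ["<|startoftext|>", fake_system_prompt(sys_type, bot_task)]
    if has_image_input:
        parts.append("<img>")
    if trigger_tag:
        parts.append(trigger_tag)
    parts.append(user_prompt)
    return "".join(parts)


print(assemble("Make it snowy", "it2i_think"))
print(assemble("Describe this image", "i2t"))
```

With this ordering the trigger tag lands before the user prompt, whereas the deleted `prompt_utils.py` appended it after `"\n\nAssistant: "`. It may be worth confirming that the change is intentional.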
diff --git a/examples/offline_inference/hunyuan_image3/image_to_text.py b/examples/offline_inference/hunyuan_image3/image_to_text.py
deleted file mode 100644
index d40134ac0a0..00000000000
--- a/examples/offline_inference/hunyuan_image3/image_to_text.py
+++ /dev/null
@@ -1,92 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
-import argparse
-import os
-
-from PIL import Image
-
-from vllm_omni.entrypoints.omni import Omni
-
-"""
-The tencent/HunyuanImage-3.0-Instruct base model uses the tencent/Hunyuan-A13B-Instruct backbone. It utilizes two tokenizer delimiter templates:
-
-1) Pretrained template (default for gen_text mode), which concatenates system, image
-   tokens, and user question WITHOUT role delimiters:
-"<|startoftext|>{system_prompt}{image_tokens}{user_question}"
-
-   Example (before image token expansion):
-"<|startoftext|>You are an assistant that understands images and outputs text.<img>Describe the content of the picture."
-
-2) Instruct template, which uses explicit role prefixes and separators.
-"""
-
-
-def parse_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Generate text from image using HunyuanImage-3.0-Instruct.")
-    parser.add_argument(
-        "--model",
-        default="tencent/HunyuanImage-3.0-Instruct",
-        help="Model name or local path.",
-    )
-    parser.add_argument(
-        "--image",
-        type=str,
-        required=True,
-        help="Path to input image file (PNG, JPG, etc.).",
-    )
-    parser.add_argument(
-        "--prompt",
-        type=str,
-        required=True,
-        help="Pretrain template prompt: <|startoftext|>{system}<img>{question}",
-    )
-    parser.add_argument(
-        "--enable-diffusion-pipeline-profiler",
-        action="store_true",
-        help="Enable diffusion pipeline profiler to display stage durations.",
-    )
-    return parser.parse_args()
-
-
-def load_image(image_path: str) -> Image.Image:
-    """Load an image from file path."""
-    if not os.path.exists(image_path):
-        raise FileNotFoundError(f"Image file not found: {image_path}")
-    return Image.open(image_path).convert("RGB")
-
-
-def main(args: argparse.Namespace) -> None:
-    omni = Omni(
-        model=args.model,
-        enable_diffusion_pipeline_profiler=args.enable_diffusion_pipeline_profiler,
-        mode="image-to-text",
-    )
-
-    prompt = "<|startoftext|>You are an assistant that understands images and outputs text.<img>" + args.prompt
-
-    prompt_dict = {
-        "prompt": prompt,
-        "modalities": ["text"],
-    }
-
-    # Add image input if provided
-    if args.image:
-        if not os.path.exists(args.image):
-            raise FileNotFoundError(f"Input image not found: {args.image}")
-
-        input_image = load_image(args.image)
-        prompt_dict["multi_modal_data"] = {"image": input_image}
-
-    prompts = [prompt_dict]
-    omni_outputs = omni.generate(prompts=prompts)
-
-    prompt_text = omni_outputs[0].request_output.prompt
-    generated_text = omni_outputs[0].request_output.outputs[0].text
-    print(f"Prompt: {prompt_text}")
-    print(f"Text: {generated_text}")
-
-
-if __name__ == "__main__":
-    args = parse_args()
-    main(args)
diff --git a/examples/offline_inference/hunyuan_image3/prompt_utils.py b/examples/offline_inference/hunyuan_image3/prompt_utils.py
deleted file mode 100644
index a5ef8e15369..00000000000
--- a/examples/offline_inference/hunyuan_image3/prompt_utils.py
+++ /dev/null
@@ -1,88 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-"""
-Prompt construction utilities for HunyuanImage-3.0-Instruct examples.
-
-Wraps system_prompt.get_system_prompt() with task-aware presets so that
-examples and tests don't need to manually concatenate system prompts,
-<img>, <think>, and <recaption> tags.
-
-Usage:
-    from prompt_utils import build_prompt
-
-    # IT2I (image editing, think+recaption mode)
-    prompt = build_prompt("Make the petals neon pink", task="it2i_think")
-
-    # I2T (image understanding)
-    prompt = build_prompt("Describe the content of the picture.", task="i2t")
-"""
-
-from __future__ import annotations
-
-from vllm_omni.diffusion.models.hunyuan_image3.system_prompt import (
-    get_system_prompt,
-)
-
-# task → (sys_type, bot_task, trigger_tag)
-# trigger_tag: "<think>", "<recaption>", or None
-_TASK_PRESETS: dict[str, tuple[str, str | None, str | None]] = {
-    # Pure text generation (text → text, no image)
-    "t2t": ("en_unified", None, None),
-    # Image understanding (image → text)
-    "i2t": ("en_unified", None, None),
-    # Image editing (image+text → image), think+recaption mode
-    "it2i_think": ("en_unified", "think", "<think>"),
-    # Image editing, recaption-only mode
-    "it2i_recaption": ("en_unified", "recaption", "<recaption>"),
-    # Text-to-image, think mode
-    "t2i_think": ("en_unified", "think", "<think>"),
-    # Text-to-image, recaption mode
-    "t2i_recaption": ("en_unified", "recaption", "<recaption>"),
-    # Text-to-image, vanilla (no CoT)
-    "t2i_vanilla": ("en_vanilla", "image", None),
-}
-
-
-def build_prompt(
-    user_prompt: str,
-    task: str = "it2i_think",
-    sys_type: str | None = None,
-    custom_system_prompt: str | None = None,
-) -> str:
-    """Build a complete HunyuanImage-3.0 prompt with auto-selected system
-    prompt and mode trigger tags.
-
-    Args:
-        user_prompt: The user's raw instruction or question.
-        task: One of the preset task keys (see _TASK_PRESETS).
-        sys_type: Override the preset's sys_type for get_system_prompt().
-        custom_system_prompt: Custom system prompt text (used when
-            sys_type="custom").
-
-    Returns:
-        Fully formatted prompt string ready for Omni.generate().
-    """
-    if task not in _TASK_PRESETS:
-        raise ValueError(f"Unknown task {task!r}. Choose from: {sorted(_TASK_PRESETS)}")
-
-    preset_sys_type, preset_bot_task, trigger_tag = _TASK_PRESETS[task]
-    effective_sys_type = sys_type or preset_sys_type
-
-    system_prompt = get_system_prompt(effective_sys_type, preset_bot_task, custom_system_prompt)
-    sys_text = system_prompt.strip() if system_prompt else ""
-
-    has_image_input = task.startswith("i2t") or task.startswith("it2i")
-
-    parts = ["<|startoftext|>"]
-    if sys_text:
-        parts.append(sys_text)
-    # Instruct conversation template: \n\nUser: ... \n\nAssistant:
-    parts.append("\n\nUser: ")
-    if has_image_input:
-        parts.append("<img>")
-    parts.append(user_prompt)
-    parts.append("\n\nAssistant: ")
-    if trigger_tag:
-        parts.append(trigger_tag)
-
-    return "".join(parts)
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
new file mode 100644
index 00000000000..f0797c63270
--- /dev/null
+++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
@@ -0,0 +1,96 @@
+# Stage config for running Hunyuan-Image3.0 with AR→DiT KV reuse.
+# Stage 0: AR Model (vLLM implementation)
+# Stage 1: DiT Model (diffusion)
+#
+# text-to-image flow: AR (stage 0) → KV transfer → DiT (stage 1)
+# image-to-text flow: AR (stage 0) only
+#
+# Compared to hunyuan_image3_t2i.yaml, this config:
+#   1. Enables both stages [0, 1] for text-to-image (AR prefill + DiT denoising)
+#   2. Adds omni_kv_config to send/receive KV cache between stages
+
+# This config has been verified on 8x L40S (48 GB) GPUs (4 for AR + 4 for DiT).
+stage_args:
+  - stage_id: 0
+    stage_type: llm  # Use llm stage type for AR stages
+    runtime:
+      process: true  # Run this stage in a separate process
+      devices: "0,1,2,3"  # AR stage uses GPU 0-3
+    engine_args:
+      model_stage: AR
+      max_num_seqs: 1
+      model_arch: HunyuanImage3ForCausalMM
+      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
+      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
+      gpu_memory_utilization: 0.9
+      enforce_eager: true  # Only eager mode is supported for now
+      trust_remote_code: true
+      engine_output_type: latent
+      enable_prefix_caching: false
+      max_num_batched_tokens: 32768
+      tensor_parallel_size: 4
+      pipeline_parallel_size: 1
+      hf_overrides:
+        rope_parameters:
+          mrope_section: [0, 32, 32]
+          rope_type: default
+      omni_kv_config:
+        need_send_cache: true
+        kv_transfer_criteria:
+          type: prefill_finished  # Send KV cache after AR prefill completes
+    is_comprehension: true
+    final_output: true
+    final_output_type: text
+    default_sampling_params:
+      temperature: 0.0
+      top_p: 1.0
+      top_k: -1
+      max_tokens: 2048
+      seed: 42
+      detokenize: True
+      repetition_penalty: 1.1
+  - stage_id: 1
+    stage_type: diffusion
+    runtime:
+      process: true
+      devices: "4,5,6,7"  # DiT stage uses GPU 4-7
+      max_batch_size: 1
+    engine_args:
+      model_stage: diffusion
+      enforce_eager: true
+      distributed_executor_backend: "mp"
+      vae_use_slicing: false
+      vae_use_tiling: false
+      cache_backend: null
+      cache_config: null
+      enable_cache_dit_summary: false
+      omni_kv_config:
+        need_recv_cache: true  # Receive AR KV cache from stage 0
+      parallel_config:
+        pipeline_parallel_size: 1
+        data_parallel_size: 1
+        tensor_parallel_size: 4
+        enable_expert_parallel: false
+        sequence_parallel_size: 1
+        ulysses_degree: 1
+        ring_degree: 1
+        cfg_parallel_size: 1
+        vae_patch_parallel_size: 1
+        use_hsdp: false
+        hsdp_shard_size: -1
+        hsdp_replicate_size: 1
+    engine_input_source: [0]  # Receive input (including KV) from stage 0
+    final_output: true
+    final_output_type: image
+
+# Top-level runtime config: windows, edges, and connectors
+runtime:
+  enabled: true
+  defaults:
+    window_size: -1  # Trigger downstream only after full upstream completion
+    max_inflight: 1  # Process serially within each stage
+
+  edges:
+    - from: 0
+      to: 1
+      window_size: -1
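
Review note on `hunyuan_image3_moe.yaml`: when adapting the device layout to other hardware, a quick consistency check along these lines may help. This is a sketch under two assumptions that the diff does not state: that each stage needs exactly `tensor_parallel_size` devices (the other parallel sizes are 1 in this config), and that the two stages should not share GPUs. The helper name `check_stage_devices` and the flattened dict layout are hypothetical; the values mirror the YAML above.

```python
# Values copied from the two stages in hunyuan_image3_moe.yaml, flattened
# into plain dicts so the sketch needs no YAML parser.
stage_args = [
    {"stage_id": 0, "devices": "0,1,2,3", "tensor_parallel_size": 4},
    {"stage_id": 1, "devices": "4,5,6,7", "tensor_parallel_size": 4},
]


def check_stage_devices(stages):
    """Return a list of human-readable problems; empty means consistent."""
    owner = {}  # device index -> stage_id that first claimed it
    problems = []
    for stage in stages:
        sid = stage["stage_id"]
        devices = [int(d) for d in stage["devices"].split(",")]
        tp = stage["tensor_parallel_size"]
        # Assumption: device count must equal the tensor-parallel size.
        if len(devices) != tp:
            problems.append(f"stage {sid}: {len(devices)} devices, TP size {tp}")
        for dev in devices:
            # Assumption: stages are expected to use disjoint GPUs.
            if dev in owner:
                problems.append(f"stage {sid}: GPU {dev} already used by stage {owner[dev]}")
            else:
                owner[dev] = sid
    return problems


print(check_stage_devices(stage_args))
```

Configs that deliberately co-locate stages on the same GPUs would need the overlap check relaxed.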