diff --git a/examples/offline_inference/hunyuan_image3/README.md b/examples/offline_inference/hunyuan_image3/README.md
index 3cd8fa01b2e..710ec26df9c 100644
--- a/examples/offline_inference/hunyuan_image3/README.md
+++ b/examples/offline_inference/hunyuan_image3/README.md
@@ -23,7 +23,7 @@ HunyuanImage-3.0-Instruct supports multiple modality modes. You can control the
 - **Pipeline**: Text → AR (CoT + latent tokens) → DiT (denoise) → VAE Decode → Image
 - **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
 - **KV Transfer**: AR sends KV cache to DiT for conditioned generation
-- **Default Config**: `hunyuan_image3_t2i.yaml`
+- **Default Config**: `vllm_omni/deploy/hunyuan_image3_t2i.yaml`
 
 ```bash
 python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
@@ -36,7 +36,7 @@ python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
 - **Pipeline**: Image + Text → AR (CoT + recaption + latent) → DiT → Edited Image
 - **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
 - **KV Transfer**: AR sends KV cache to DiT
-- **Default Config**: `hunyuan_image3_it2i.yaml`
+- **Default Config**: `vllm_omni/deploy/hunyuan_image3_it2i.yaml`
 
 ```bash
 python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
@@ -45,31 +45,6 @@ python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
                   --prompts "Make the petals neon pink"
 ```
 
-#### Image to Text (img2text)
-
-- **Pipeline**: Image + Question → AR → Text description
-- **Stages Used**: Stage 0 (AR) only
-- **Default Config**: `hunyuan_image3_i2t.yaml`
-
-```bash
-python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
-                  --modality img2text \
-                  --image-path /path/to/image.jpg \
-                  --prompts "Describe the content of the picture."
-```
-
-#### Text to Text (text2text)
-
-- **Pipeline**: Text → AR → Text
-- **Stages Used**: Stage 0 (AR) only
-- **Default Config**: `hunyuan_image3_t2t.yaml`
-
-```bash
-python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
-                  --modality text2text \
-                  --prompts "What is the capital of France?"
-```
-
 ### Inference Steps & Guidance
 
 Control generation quality for image modalities:
@@ -89,7 +64,7 @@ python end2end.py --modality text2img \
 | Argument               | Type   | Default                              | Description                                                  |
 | :--------------------- | :----- | :----------------------------------- | :----------------------------------------------------------- |
 | `--model`              | string | `tencent/HunyuanImage-3.0-Instruct` | Model path or name                                           |
-| `--modality`           | choice | `text2img`                           | Modality: `text2img`, `img2img`, `img2text`, `text2text`     |
+| `--modality`           | choice | `text2img`                           | Modality: `text2img`, `img2img`                              |
 | `--prompts`            | list   | `None`                               | Input text prompts                                           |
-| `--image-path`         | string | `None`                               | Input image path (for `img2img`/`img2text`)                  |
+| `--image-path`         | string | `None`                               | Input image path (for `img2img`)                             |
 | `--output`             | string | `.`                                  | Output directory for saved images                            |
@@ -108,28 +83,20 @@ python end2end.py --modality text2img \
 
 #### ⚙️ Stage Configurations
 
-| Config YAML                         | Modality  | Stages | GPUs   | Description                           |
-| :---------------------------------- | :-------- | :----- | :----- | :------------------------------------ |
-| `hunyuan_image3_t2i.yaml`           | text2img  | 2      | 8      | T2I with AR→DiT, 4 GPU each          |
-| `hunyuan_image3_it2i.yaml`          | img2img   | 2      | 8      | IT2I with AR→DiT, 4 GPU each         |
-| `hunyuan_image3_i2t.yaml`           | img2text  | 1      | 4      | I2T (AR only)                         |
-| `hunyuan_image3_t2t.yaml`           | text2text | 1      | 4      | T2T (AR only)                         |
-| `hunyuan_image3_t2i_2gpu.yaml`      | text2img  | 2      | 2      | T2I for 2-GPU setups                  |
-| `hunyuan_image3_moe.yaml`           | text2img  | 2      | 8      | T2I with MoE AR→DiT KV reuse          |
-| `hunyuan_image3_moe_dit_2gpu_fp8.yaml` | text2img | 2   | 2      | T2I with FP8 quantization             |
+All deploy YAMLs live under `vllm_omni/deploy/` in the new schema (PR #2383).
 
-------
+| Deploy YAML                              | Modality   | Stages | GPUs | Description                       |
+| :--------------------------------------- | :--------- | :----- | :--- | :-------------------------------- |
+| `hunyuan_image3_t2i.yaml`                | text2img   | 2      | 8    | AR + DiT with KV transfer         |
+| `hunyuan_image3_it2i.yaml`               | img2img    | 2      | 8    | AR + DiT (image-edit)             |
 
-## Using MoE Config
+The `hunyuan_image3_dit_only` pipeline is also registered and ships its own deploys (`hunyuan_image3_dit_only.yaml`, `hunyuan_image3_dit_only_2gpu.yaml`) for users who want to skip the AR stage.
 
-The `hunyuan_image3_moe.yaml` config enables AR→DiT KV cache reuse with 8 GPUs (4 for AR + 4 for DiT).
+------
 
-```bash
-python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
-                  --modality text2img \
-                  --stage-configs-path hunyuan_image3_moe.yaml \
-                  --prompts "A cute cat"
-```
+## AR→DiT KV cache reuse
+
+The default `hunyuan_image3_t2i.yaml` deploy already enables AR→DiT KV cache reuse on 8 GPUs (4 for AR + 4 for DiT) — the wiring lives on the pipeline (`omni_kv_config` for both stages).
 
 ------
 
diff --git a/examples/offline_inference/hunyuan_image3/end2end.py b/examples/offline_inference/hunyuan_image3/end2end.py
index 2cea303888e..08815465e45 100644
--- a/examples/offline_inference/hunyuan_image3/end2end.py
+++ b/examples/offline_inference/hunyuan_image3/end2end.py
@@ -72,12 +72,10 @@ def build_prompt(
     return "".join(parts)
 
 
-# Modality → default stage config
+# Modality → default deploy config (under vllm_omni/deploy/).
 _MODALITY_DEFAULT_CONFIG = {
-    "text2img": "hunyuan_image3_t2i.yaml",
-    "img2img": "hunyuan_image3_it2i.yaml",
-    "img2text": "hunyuan_image3_i2t.yaml",
-    "text2text": "hunyuan_image3_t2t.yaml",
+    "text2img": "vllm_omni/deploy/hunyuan_image3_t2i.yaml",
+    "img2img": "vllm_omni/deploy/hunyuan_image3_it2i.yaml",
 }
 
 
@@ -91,7 +89,7 @@ def parse_args():
     parser.add_argument(
         "--modality",
         default="text2img",
-        choices=["text2img", "img2img", "img2text", "text2text"],
+        choices=["text2img", "img2img"],
         help="Modality mode to control stage execution.",
     )
     parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.")
@@ -148,21 +146,15 @@ def main():
     # Determine task for prompt formatting
     task = args.bot_task or _MODALITY_TASK_MAP[args.modality]
 
-    # Determine stage config
-    stage_configs_path = args.stage_configs_path or _MODALITY_DEFAULT_CONFIG[args.modality]
-
-    # Build Omni
-    omni_kwargs = {
-        "model": args.model,
-        "stage_configs_path": stage_configs_path,
-        "log_stats": args.log_stats,
-        "init_timeout": args.init_timeout,
-        "enforce_eager": args.enforce_eager,
+    # Resolve modality-derived defaults. They are computed here rather than
+    # parsed directly, so forward them to ``from_cli_args`` via ``**overrides``.
+    overrides: dict[str, object] = {
+        "stage_configs_path": args.stage_configs_path or _MODALITY_DEFAULT_CONFIG[args.modality],
     }
     if args.modality in ("text2img", "img2img"):
-        omni_kwargs["mode"] = "text-to-image"
+        overrides["mode"] = "text-to-image"
 
-    omni = Omni(**omni_kwargs)
+    omni = Omni.from_cli_args(args, **overrides)
 
     # Prepare prompts
     prompts = args.prompts or ["A cute cat"]
@@ -222,7 +214,7 @@ def main():
     print("HunyuanImage-3.0 Generation Configuration:")
     print(f"  Model: {args.model}")
     print(f"  Modality: {args.modality}")
-    print(f"  Stage config: {stage_configs_path}")
+    print(f"  Stage config: {overrides['stage_configs_path']}")
     print(f"  Num stages: {omni.num_stages}")
     if args.modality in ("text2img", "img2img"):
         print(f"  Inference steps: {args.steps}")
diff --git a/tests/e2e/offline_inference/deploy/hunyuan_image3_dit_only_ci.yaml b/tests/e2e/offline_inference/deploy/hunyuan_image3_dit_only_ci.yaml
new file mode 100644
index 00000000000..cda71d9bcb1
--- /dev/null
+++ b/tests/e2e/offline_inference/deploy/hunyuan_image3_dit_only_ci.yaml
@@ -0,0 +1,27 @@
+# HunyuanImage-3.0 DiT-only (no AR). CUDA verified on 4x H20.
+pipeline: hunyuan_image3_dit_only
+trust_remote_code: true
+distributed_executor_backend: mp
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    enforce_eager: true
+    devices: "0,1,2,3"
+    parallel_config:
+      tensor_parallel_size: 4
+      enable_expert_parallel: true
+    default_sampling_params:
+      seed: 42
+
+platforms:
+  npu:
+    # Verified on 8x A3-64G NPUs.
+    stages:
+      - stage_id: 0
+        gpu_memory_utilization: 0.65
+        devices: "0,1,2,3,4,5,6,7"
+        max_num_batched_tokens: 32768
+        parallel_config:
+          tensor_parallel_size: 8
+          enable_expert_parallel: true
diff --git a/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py b/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py
index bd0d132d093..ef34feca294 100644
--- a/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py
+++ b/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py
@@ -17,7 +17,7 @@
 MODEL_NAME = "tencent/HunyuanImage-3.0"
 LOCAL_CLIP_PATH = "openai/clip-vit-base-patch32"
 REPO_ROOT = Path(__file__).resolve().parents[3]
-STAGE_CONFIG_PATH = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "hunyuan_image3_t2i.yaml"
+STAGE_CONFIG_PATH = REPO_ROOT / "tests" / "e2e" / "offline_inference" / "deploy" / "hunyuan_image3_dit_only_ci.yaml"
 
 pytestmark = [pytest.mark.advanced_model, pytest.mark.diffusion]
 
diff --git a/tests/test_config_factory.py b/tests/test_config_factory.py
index 16d49034fa1..6ad5f04b6bf 100644
--- a/tests/test_config_factory.py
+++ b/tests/test_config_factory.py
@@ -1334,3 +1334,61 @@ def test_constraints_win(self):
         assert stages[1].yaml_extras["default_sampling_params"]["stop_token_ids"] == [2150]
         # Deploy temperature still flows through
         assert stages[0].yaml_extras["default_sampling_params"]["temperature"] == 0.4
+
+
+class TestHunyuanImage3ShippedDeploys:
+    """Structural smoke tests for shipped Hunyuan-Image3 deploy yamls.
+
+    The GPU-gated e2e test (``test_hunyuanimage3_text2img.py``) runs against
+    the DiT-only CI fixture. These cheap tests need no GPU and catch schema
+    regressions in the shipped t2i / it2i (AR→DiT) and dit_only deploys,
+    which the e2e fixture does not load.
+    """
+
+    @pytest.mark.parametrize(
+        "yaml_name,expected_pipeline,expected_stage_count,expected_stages",
+        [
+            ("hunyuan_image3_t2i.yaml", "hunyuan_image3_t2i", 2, ("AR", "dit")),
+            ("hunyuan_image3_it2i.yaml", "hunyuan_image3_it2i", 2, ("AR", "dit")),
+            ("hunyuan_image3_dit_only.yaml", "hunyuan_image3_dit_only", 1, ("dit",)),
+        ],
+    )
+    def test_shipped_deploys_parse_and_resolve(
+        self, yaml_name, expected_pipeline, expected_stage_count, expected_stages
+    ):
+        import vllm_omni.model_executor.models.hunyuan_image3.pipeline  # noqa: F401
+        from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy
+
+        deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / yaml_name
+        assert deploy_path.exists(), f"Shipped deploy missing: {yaml_name}"
+
+        deploy = load_deploy_config(deploy_path)
+        assert deploy.pipeline == expected_pipeline
+        assert len(deploy.stages) == expected_stage_count
+
+        pipeline = _PIPELINE_REGISTRY[expected_pipeline]
+        assert tuple(s.model_stage for s in pipeline.stages) == expected_stages
+
+        stages = merge_pipeline_deploy(pipeline, deploy)
+        assert len(stages) == expected_stage_count
+
+    def test_t2i_ar_dit_topology(self):
+        """The AR→DiT t2i default wires stage 1 to consume stage 0's KV output."""
+        import vllm_omni.model_executor.models.hunyuan_image3.pipeline  # noqa: F401
+        from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy
+
+        deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "hunyuan_image3_t2i.yaml"
+        assert deploy_path.exists(), "Shipped deploy missing: hunyuan_image3_t2i.yaml"
+
+        pipeline = _PIPELINE_REGISTRY["hunyuan_image3_t2i"]
+        deploy = load_deploy_config(deploy_path)
+        stages = merge_pipeline_deploy(pipeline, deploy)
+
+        # Pipeline-level invariants for the KV-transfer path.
+        assert pipeline.stages[0].omni_kv_config is not None
+        assert pipeline.stages[1].input_sources == (0,)
+        assert pipeline.stages[1].omni_kv_config is not None
+
+        # Deploy-level placement: 4 AR + 4 DiT across 8 devices.
+        assert stages[0].yaml_runtime.get("devices") == "0,1,2,3"
+        assert stages[1].yaml_runtime.get("devices") == "4,5,6,7"
diff --git a/vllm_omni/config/pipeline_registry.py b/vllm_omni/config/pipeline_registry.py
index 6f5c072a353..e516280deed 100644
--- a/vllm_omni/config/pipeline_registry.py
+++ b/vllm_omni/config/pipeline_registry.py
@@ -33,6 +33,28 @@
 # --- Multi-stage omni pipelines (LLM-centric; audio / video I/O) ---
 _OMNI_PIPELINES: dict[str, tuple[str, str]] = {
     # model_type -> (module_path, variable_name)
+    "hunyuan_image3_t2i": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_T2I_PIPELINE",
+    ),
+    "hunyuan_image3_it2i": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_IT2I_PIPELINE",
+    ),
+    # ``dit_only`` ships a default deploy yaml (DiT-only path + NPU section);
+    # ``i2t`` / ``t2t`` are kept BYO.
+    "hunyuan_image3_dit_only": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_DIT_ONLY_PIPELINE",
+    ),
+    "hunyuan_image3_i2t": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_I2T_PIPELINE",
+    ),
+    "hunyuan_image3_t2t": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_T2T_PIPELINE",
+    ),
     "qwen2_5_omni": (
         "vllm_omni.model_executor.models.qwen2_5_omni.pipeline",
         "QWEN2_5_OMNI_PIPELINE",
diff --git a/vllm_omni/config/stage_config.py b/vllm_omni/config/stage_config.py
index 950685d6219..1ac12dfe749 100644
--- a/vllm_omni/config/stage_config.py
+++ b/vllm_omni/config/stage_config.py
@@ -463,6 +463,22 @@ class DeployConfig:
 _STAGE_DEPLOY_FIELDS = {f.name: f for f in fields(StageDeployConfig) if f.name not in _STAGE_NON_ENGINE_KEYS}
 
 
+_DIT_PARALLEL_FIELDS_AT_TOP_LEVEL = frozenset(
+    {
+        "enable_expert_parallel",
+        "sequence_parallel_size",
+        "ulysses_degree",
+        "ring_degree",
+        "ulysses_mode",
+        "cfg_parallel_size",
+        "vae_patch_parallel_size",
+        "use_hsdp",
+        "hsdp_shard_size",
+        "hsdp_replicate_size",
+    }
+)
+
+
 def _parse_stage_deploy(stage_data: dict[str, Any]) -> StageDeployConfig:
     """Parse a single stage entry from deploy YAML into StageDeployConfig."""
     if "engine_args" in stage_data:
@@ -477,6 +493,22 @@ def _parse_stage_deploy(stage_data: dict[str, Any]) -> StageDeployConfig:
         if name in engine_args:
             kwargs[name] = engine_args.pop(name)
 
+    # Flat schema support: hoist DiT-only parallel fields and tensor_parallel_size
+    # from top level into a parallel_config block. Lets authors write the same
+    # flat shape for AR and DiT stages — DiT engines read parallel_config.*,
+    # AR engines read top-level fields directly. Existing nested parallel_config
+    # forms keep working (we only setdefault).
+    pc = engine_args.get("parallel_config")
+    if not isinstance(pc, dict):
+        pc = {}
+    for name in _DIT_PARALLEL_FIELDS_AT_TOP_LEVEL:
+        if name in engine_args:
+            pc.setdefault(name, engine_args.pop(name))
+    if (tps := kwargs.get("tensor_parallel_size")) and tps > 1:
+        pc.setdefault("tensor_parallel_size", tps)
+    if pc:
+        engine_args["parallel_config"] = pc
+
     kwargs["output_connectors"] = stage_data.get("output_connectors")
     kwargs["input_connectors"] = stage_data.get("input_connectors")
     kwargs["default_sampling_params"] = stage_data.get("default_sampling_params")
@@ -714,6 +746,8 @@ def _build_engine_args(
         engine_args["model_subdir"] = ps.model_subdir
     if ps.tokenizer_subdir:
         engine_args["tokenizer_subdir"] = ps.tokenizer_subdir
+    if ps.omni_kv_config is not None:
+        engine_args["omni_kv_config"] = dict(ps.omni_kv_config)
 
     # Pipeline-wide top-level DeployConfig settings, applied to every stage.
     for name in _PIPELINE_WIDE_ENGINE_FIELDS:
@@ -800,6 +834,8 @@ def merge_pipeline_deploy(
         runtime: dict[str, Any] = {"process": True}
         if ds is not None:
             runtime["devices"] = ds.devices
+        if ps.requires_multimodal_data:
+            runtime["requires_multimodal_data"] = True
 
         result.append(
             StageConfig(
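Reviewer note on the flat-schema hoisting in `_parse_stage_deploy`: the behavior can be sketched in isolation as below. This is a minimal sketch, not the code in `stage_config.py`. The helper name `hoist_parallel_fields` and the local `DIT_PARALLEL_FIELDS` copy are assumptions made for illustration, and the sketch copies its input instead of mutating it.

```python
from __future__ import annotations

from typing import Any

# Local copy of the field set from the diff, kept here only for illustration.
DIT_PARALLEL_FIELDS = frozenset({
    "enable_expert_parallel", "sequence_parallel_size", "ulysses_degree",
    "ring_degree", "ulysses_mode", "cfg_parallel_size",
    "vae_patch_parallel_size", "use_hsdp", "hsdp_shard_size",
    "hsdp_replicate_size",
})


def hoist_parallel_fields(
    engine_args: dict[str, Any], tensor_parallel_size: int | None
) -> dict[str, Any]:
    """Hypothetical helper: move flat parallel fields into parallel_config."""
    out = dict(engine_args)
    pc = out.get("parallel_config")
    pc = dict(pc) if isinstance(pc, dict) else {}
    for name in DIT_PARALLEL_FIELDS:
        if name in out:
            value = out.pop(name)
            # setdefault: an explicit nested value wins over the flat one.
            pc.setdefault(name, value)
    if tensor_parallel_size and tensor_parallel_size > 1:
        pc.setdefault("tensor_parallel_size", tensor_parallel_size)
    if pc:
        out["parallel_config"] = pc
    return out


# Flat form, as written in the new deploy YAMLs.
flat = hoist_parallel_fields({"enable_expert_parallel": True}, tensor_parallel_size=4)

# Nested form with a conflicting flat tensor_parallel_size: nested value is kept.
nested = hoist_parallel_fields(
    {"parallel_config": {"tensor_parallel_size": 8}, "ulysses_degree": 2},
    tensor_parallel_size=4,
)
```

The `setdefault` calls are what keep existing nested `parallel_config` blocks (such as the CI fixture) working: a flat field only fills a slot that the nested block left empty.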
diff --git a/vllm_omni/deploy/hunyuan_image3_dit_only.yaml b/vllm_omni/deploy/hunyuan_image3_dit_only.yaml
new file mode 100644
index 00000000000..8ce6ee73835
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_dit_only.yaml
@@ -0,0 +1,29 @@
+# HunyuanImage-3.0 DiT-only (no AR). Matches the Tencent reference
+# `generate_image(prompt, bot_task="image")` path. CUDA verified on 4x H20;
+# NPU verified on 8x A3-64G.
+pipeline: hunyuan_image3_dit_only
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    enforce_eager: true
+    tensor_parallel_size: 4
+    enable_expert_parallel: true
+    devices: "0,1,2,3"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
+
+platforms:
+  npu:
+    stages:
+      - stage_id: 0
+        gpu_memory_utilization: 0.65
+        tensor_parallel_size: 8
+        enable_expert_parallel: true
+        devices: "0,1,2,3,4,5,6,7"
+        max_num_batched_tokens: 32768
diff --git a/vllm_omni/deploy/hunyuan_image3_dit_only_2gpu.yaml b/vllm_omni/deploy/hunyuan_image3_dit_only_2gpu.yaml
new file mode 100644
index 00000000000..4cc5f7a37a7
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_dit_only_2gpu.yaml
@@ -0,0 +1,20 @@
+# HunyuanImage-3.0 DiT-only on 2 GPUs (FP8). For 2x H200 / H100 / A100-80GB.
+# Single-stage variant of `hunyuan_image3_dit_only.yaml` retuned for 2-GPU
+# placement; matches Tencent's `bot_task="image"` (DiT-only, no AR rewriter).
+pipeline: hunyuan_image3_dit_only
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+quantization: fp8
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    enforce_eager: true
+    tensor_parallel_size: 2
+    enable_expert_parallel: true
+    devices: "0,1"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
diff --git a/vllm_omni/deploy/hunyuan_image3_it2i.yaml b/vllm_omni/deploy/hunyuan_image3_it2i.yaml
new file mode 100644
index 00000000000..3a5f245455a
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_it2i.yaml
@@ -0,0 +1,32 @@
+# HunyuanImage-3.0 image-edit (AR consumes image+text, DiT decodes).
+# Verified on 8x L40S-48G (4 AR + 4 DiT).
+pipeline: hunyuan_image3_it2i
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    gpu_memory_utilization: 0.95
+    enforce_eager: true
+    tensor_parallel_size: 4
+    devices: "0,1,2,3"
+    hf_overrides:
+      rope_parameters:
+        mrope_section: [0, 32, 32]
+        rope_type: default
+    default_sampling_params:
+      temperature: 0.6
+      top_p: 0.95
+      top_k: 1024
+      max_tokens: 4096
+
+  - stage_id: 1
+    max_num_seqs: 1
+    tensor_parallel_size: 4
+    enable_expert_parallel: true
+    devices: "4,5,6,7"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
diff --git a/vllm_omni/deploy/hunyuan_image3_t2i.yaml b/vllm_omni/deploy/hunyuan_image3_t2i.yaml
new file mode 100644
index 00000000000..c66eb69cbc3
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_t2i.yaml
@@ -0,0 +1,59 @@
+# HunyuanImage-3.0 text-to-image (AR + DiT with KV transfer).
+# CUDA default verified on 8x L40S-48G (4 AR + 4 DiT). Stop tokens / KV
+# transfer wiring live on the pipeline; deploy carries placement + sampling.
+# NPU deployment is DiT-only — use `--pipeline hunyuan_image3_dit_only`
+# with `vllm_omni/deploy/hunyuan_image3_dit_only.yaml`.
+# FP8 on 2x H200 ships as `vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml`.
+pipeline: hunyuan_image3_t2i
+trust_remote_code: true
+distributed_executor_backend: mp
+# AR → DiT cross-stage KV transfer is sync (DiT consumes AR's full KV after
+# prefill_finished), no per-token next-stage processor.
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    gpu_memory_utilization: 0.9
+    enforce_eager: true
+    tensor_parallel_size: 4
+    devices: "0,1,2,3"
+    hf_overrides:
+      rope_parameters:
+        mrope_section: [0, 32, 32]
+        rope_type: default
+    default_sampling_params:
+      temperature: 0.0
+      top_p: 1.0
+      top_k: -1
+      max_tokens: 2048
+      seed: 42
+      repetition_penalty: 1.1
+
+  - stage_id: 1
+    max_num_seqs: 1
+    tensor_parallel_size: 4
+    enable_expert_parallel: true
+    devices: "4,5,6,7"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
+
+platforms:
+  xpu:
+    # Verified on 8x Intel Arc Max 1550 with FP8 weights.
+    quantization: fp8
+    stages:
+      - stage_id: 0
+        gpu_memory_utilization: 0.95
+        tensor_parallel_size: 8
+        devices: "0,1,2,3,4,5,6,7"
+        max_num_batched_tokens: 32768
+        worker_cls: vllm_omni.platforms.xpu.worker.xpu_ar_worker.XPUARWorker
+        enable_expert_parallel: true
+      - stage_id: 1
+        gpu_memory_utilization: 0.9
+        tensor_parallel_size: 8
+        enable_expert_parallel: true
+        devices: "0,1,2,3,4,5,6,7"
diff --git a/vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml b/vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml
new file mode 100644
index 00000000000..625edbeb91a
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml
@@ -0,0 +1,42 @@
+# HunyuanImage-3.0 text-to-image with FP8 online quantization.
+# CUDA verified on 2x H200 (AR and DiT colocated on devices 0,1 with TP=2 each; FP8 on the DiT side).
+pipeline: hunyuan_image3_t2i
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+quantization: fp8
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    gpu_memory_utilization: 0.9
+    enforce_eager: true
+    tensor_parallel_size: 2
+    # AR colocates with DiT on devices 0,1. DiT reserves ~32 GiB per rank
+    # before AR's profile_run, which the profiling fallback misattributes as
+    # "non_torch_increase" and zeroes out the KV budget. Set kv_cache_memory_bytes
+    # explicitly to bypass profiling — AR needs ~3 GiB per rank for max_num_seqs=1
+    # at max_model_len=22800.
+    kv_cache_memory_bytes: 4294967296   # 4 GiB per rank
+    devices: "0,1"
+    hf_overrides:
+      rope_parameters:
+        mrope_section: [0, 32, 32]
+        rope_type: default
+    default_sampling_params:
+      temperature: 0.0
+      top_p: 1.0
+      top_k: -1
+      max_tokens: 2048
+      seed: 42
+      repetition_penalty: 1.1
+
+  - stage_id: 1
+    max_num_seqs: 1
+    tensor_parallel_size: 2
+    enable_expert_parallel: true
+    devices: "0,1"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
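Reviewer note on `kv_cache_memory_bytes` in the FP8 deploy: the literal can be checked with a few lines of arithmetic. The 3 GiB figure below is the estimate quoted in the YAML comment, not a measured value, and the variable names are illustrative.

```python
GIB = 1024 ** 3

# Value written in hunyuan_image3_t2i_fp8.yaml for stage 0.
kv_cache_memory_bytes = 4 * GIB

# Estimate from the YAML comment ("~3 GiB per rank"); an assumption, not a measurement.
estimated_need_bytes = 3 * GIB

# Headroom left per rank once the estimated KV cache need is covered.
headroom_gib = (kv_cache_memory_bytes - estimated_need_bytes) / GIB
```

Under that estimate the explicit budget leaves about 1 GiB of headroom per rank, which matches the intent of bypassing profiling without over-reserving memory next to the colocated DiT stage.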
diff --git a/vllm_omni/model_executor/models/hunyuan_image3/pipeline.py b/vllm_omni/model_executor/models/hunyuan_image3/pipeline.py
new file mode 100644
index 00000000000..3007337c7e7
--- /dev/null
+++ b/vllm_omni/model_executor/models/hunyuan_image3/pipeline.py
@@ -0,0 +1,161 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""HunyuanImage-3.0 pipeline topologies (frozen).
+
+Five variants share one HF model_arch (HunyuanImage3ForCausalMM) but expose
+different stage graphs:
+
+    t2i      Stage 0 AR (text → latent) + Stage 1 DiT, KV-transfer
+    it2i     Stage 0 AR (image+text → latent) + Stage 1 DiT, KV-transfer
+    dit_only Stage 0 DiT only (latent → image)
+    i2t      Stage 0 AR only (image+text → text)
+    t2t      Stage 0 AR only (text → text)
+
+Variants are surfaced as separate model_types so the orchestrator picks the
+right topology from deploy YAML alone (mirrors the qwen2_5_omni /
+qwen2_5_omni_thinker_only split). ``t2i`` / ``it2i`` / ``dit_only`` ship
+with default deploy yamls (the dit_only deploy also carries the NPU
+overlay); ``i2t`` / ``t2t`` are registered for bring-your-own deploy.
+"""
+
+from vllm_omni.config.stage_config import (
+    PipelineConfig,
+    StageExecutionType,
+    StagePipelineConfig,
+)
+
+_MODEL_ARCH = "HunyuanImage3ForCausalMM"
+_AR2DIT = "vllm_omni.model_executor.stage_input_processors.hunyuan_image3.ar2diffusion"
+
+# AR-side KV transfer config: send cache once prefill is done so the DiT stage
+# can splice it in without re-running attention over the prompt.
+_AR_KV_SEND = {
+    "need_send_cache": True,
+    "kv_transfer_criteria": {"type": "prefill_finished"},
+}
+_DIT_KV_RECV = {"need_recv_cache": True}
+
+
+# Only one variant carries the hf_architectures fallback so the deploy yaml's
+# explicit ``pipeline:`` field stays the single source of truth for variant
+# selection. T2I is the default because it's the headline modality.
+HUNYUAN_IMAGE3_T2I_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_t2i",
+    model_arch=_MODEL_ARCH,
+    hf_architectures=(_MODEL_ARCH,),
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            engine_output_type="latent",
+            sampling_constraints={"detokenize": True},
+            omni_kv_config=_AR_KV_SEND,
+        ),
+        StagePipelineConfig(
+            stage_id=1,
+            model_stage="dit",
+            execution_type=StageExecutionType.DIFFUSION,
+            input_sources=(0,),
+            final_output=True,
+            final_output_type="image",
+            engine_output_type="image",
+            custom_process_input_func=_AR2DIT,
+            omni_kv_config=_DIT_KV_RECV,
+        ),
+    ),
+)
+
+
+HUNYUAN_IMAGE3_IT2I_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_it2i",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=False,
+            owns_tokenizer=True,
+            requires_multimodal_data=True,
+            engine_output_type="latent",
+            sampling_constraints={"stop_token_ids": [127957], "detokenize": False},
+            omni_kv_config=_AR_KV_SEND,
+        ),
+        StagePipelineConfig(
+            stage_id=1,
+            model_stage="dit",
+            execution_type=StageExecutionType.DIFFUSION,
+            input_sources=(0,),
+            final_output=True,
+            final_output_type="image",
+            engine_output_type="image",
+            requires_multimodal_data=True,
+            custom_process_input_func=_AR2DIT,
+            omni_kv_config=_DIT_KV_RECV,
+        ),
+    ),
+)
+
+
+HUNYUAN_IMAGE3_DIT_ONLY_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_dit_only",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="dit",
+            execution_type=StageExecutionType.DIFFUSION,
+            input_sources=(),
+            final_output=True,
+            final_output_type="image",
+            engine_output_type="image",
+            omni_kv_config=_DIT_KV_RECV,
+        ),
+    ),
+)
+
+
+# AR-only variants (no DiT). Registered for users bringing their own deploy
+# yaml — no default deploy yaml ships because hardware sizing for I2T/T2T
+# depends on the use case.
+HUNYUAN_IMAGE3_I2T_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_i2t",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            requires_multimodal_data=True,
+            sampling_constraints={"stop_token_ids": [127957, 128026], "detokenize": True},
+        ),
+    ),
+)
+
+
+HUNYUAN_IMAGE3_T2T_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_t2t",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            sampling_constraints={"stop_token_ids": [127957, 128026], "detokenize": True},
+        ),
+    ),
+)
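Reviewer note on the stage graphs in `pipeline.py`: the invariants that `test_t2i_ar_dit_topology` asserts can be sketched as a small standalone check. The `Stage` dataclass and `check_topology` function are hypothetical stand-ins written for this note. They are not part of `vllm_omni` and model only the fields needed here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for StagePipelineConfig (illustration only)."""

    stage_id: int
    model_stage: str
    input_sources: tuple[int, ...] = ()
    sends_kv: bool = False
    receives_kv: bool = False


def check_topology(stages: tuple[Stage, ...]) -> list[str]:
    """Return a list of problems; an empty list means the graph is consistent."""
    problems: list[str] = []
    by_id = {s.stage_id: s for s in stages}
    for stage in stages:
        # Every input must point at an existing, earlier stage.
        for src in stage.input_sources:
            if src not in by_id or src >= stage.stage_id:
                problems.append(f"stage {stage.stage_id}: invalid input source {src}")
        # A KV receiver with inputs needs at least one input that sends KV.
        sources = [by_id[src] for src in stage.input_sources if src in by_id]
        if stage.receives_kv and sources and not any(s.sends_kv for s in sources):
            problems.append(
                f"stage {stage.stage_id}: expects a KV cache but no input stage sends one"
            )
    return problems


# Shape of the t2i pipeline: AR sends KV, DiT consumes stage 0 and receives KV.
T2I = (
    Stage(0, "AR", sends_kv=True),
    Stage(1, "dit", input_sources=(0,), receives_kv=True),
)

# Same graph with the AR-side send flag dropped.
BROKEN = (
    Stage(0, "AR"),
    Stage(1, "dit", input_sources=(0,), receives_kv=True),
)

t2i_problems = check_topology(T2I)
broken_problems = check_topology(BROKEN)
```

A check of this shape needs no GPU, so it sits in the same tier as the structural tests in `tests/test_config_factory.py`.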
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml
deleted file mode 100644
index b68b184ec31..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-# Stage config for HunyuanImage-3.0 Image-to-Text (I2T / image understanding).
-# Single LLM stage: AR model reads image + text prompt, generates text output.
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1,2,3"
-      max_batch_size: 1
-      requires_multimodal_data: true
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true
-      trust_remote_code: true
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 0.95
-      top_k: 1024
-      max_tokens: 2048
-      stop_token_ids: [127957, 128026]  # <|endoftext|>, </answer>
-      detokenize: True
-
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml
deleted file mode 100644
index 413e0f09cbe..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml
+++ /dev/null
@@ -1,74 +0,0 @@
-# Stage config for HunyuanImage-3.0 Image+Text-to-Image (image editing).
-# Stage 0: AR (HunyuanImage3ForConditionalGeneration) — reads (image, text), emits latent tokens
-# Stage 1: Diffusion (HunyuanImage3Pipeline / DiT + VAE) — denoise + decode latents → image
-
-stage_args:
-  # Stage 0: AR Model
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1,2,3"
-      max_batch_size: 1
-      requires_multimodal_data: true  # AR needs the original image
-    engine_args:
-      model_stage: AR
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: latent  # AR outputs latent for DiT
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: false  # Generation task, not comprehension
-    final_output: false  # AR is not the final output
-    default_sampling_params:
-      temperature: 0.6
-      top_p: 0.95
-      top_k: 1024
-      max_tokens: 4096
-      stop_token_ids: [127957]  # <|endoftext|>
-      detokenize: false
-
-  # Stage 1: Diffusion (DiT + VAE)
-  # Receives latents from AR stage, performs denoising + VAE decode
-  - stage_id: 1
-    stage_type: diffusion
-    runtime:
-      process: true
-      devices: "4,5,6,7"
-      max_batch_size: 1
-      requires_multimodal_data: true  # May need condition images
-    engine_args:
-      model_stage: dit
-      model_arch: HunyuanImage3ForCausalMM
-      enforce_eager: true
-      trust_remote_code: true
-      distributed_executor_backend: "mp"
-      parallel_config:
-        tensor_parallel_size: 4
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-    engine_input_source: [0]  # Input from AR stage
-    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.hunyuan_image3.ar2diffusion
-    final_output: true
-    final_output_type: image
-    default_sampling_params:
-      num_inference_steps: 50
-      guidance_scale: 2.5
-
-# Top-level runtime config
-runtime:
-  enabled: true
-  edges:
-    - from: 0  # AR → Diffusion
-      to: 1
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
deleted file mode 100644
index f0797c63270..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
+++ /dev/null
@@ -1,96 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 with AR→DiT KV reuse.
-# Stage 0: AR Model (vLLM implementation)
-# Stage 1: DiT Model (diffusion)
-#
-# text-to-image flow: AR (stage 0) → KV transfer → DiT (stage 1)
-# image-to-text flow: AR (stage 0) only
-#
-# Compared to hunyuan_image3_t2i.yaml, this config:
-#   1. Enables both stages [0, 1] for text-to-image (AR prefill + DiT denoising)
-#   2. Adds omni_kv_config to send/receive KV cache between stages
-
-# The following config has been verified on 8x L40S-48G GPU (4 for AR + 4 for DiT).
-stage_args:
-  - stage_id: 0
-    stage_type: llm  # Use llm stage type for AR stages
-    runtime:
-      process: true  # Run this stage in a separate process
-      devices: "0,1,2,3"  # AR stage uses GPU 0-3
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.9
-      enforce_eager: true  # Now we only support eager mode
-      trust_remote_code: true
-      engine_output_type: latent
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-      omni_kv_config:
-        need_send_cache: true
-        kv_transfer_criteria:
-          type: prefill_finished  # Send KV cache after AR prefill completes
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 1.0
-      top_k: -1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.1
-  - stage_id: 1
-    stage_type: diffusion
-    runtime:
-      process: true
-      devices: "4,5,6,7"  # DiT stage uses GPU 4-7
-      max_batch_size: 1
-    engine_args:
-      model_stage: diffusion
-      enforce_eager: true
-      distributed_executor_backend: "mp"
-      vae_use_slicing: false
-      vae_use_tiling: false
-      cache_backend: null
-      cache_config: null
-      enable_cache_dit_summary: false
-      omni_kv_config:
-        need_recv_cache: true  # Receive AR KV cache from stage 0
-      parallel_config:
-        pipeline_parallel_size: 1
-        data_parallel_size: 1
-        tensor_parallel_size: 4
-        enable_expert_parallel: false
-        sequence_parallel_size: 1
-        ulysses_degree: 1
-        ring_degree: 1
-        cfg_parallel_size: 1
-        vae_patch_parallel_size: 1
-        use_hsdp: false
-        hsdp_shard_size: -1
-        hsdp_replicate_size: 1
-    engine_input_source: [0]  # Receive input (including KV) from stage 0
-    final_output: true
-    final_output_type: image
-
-# Top-level runtime config: windows, edges, and connectors
-runtime:
-  enabled: true
-  defaults:
-    window_size: -1  # Trigger downstream only after full upstream completion
-    max_inflight: 1  # Process serially within each stage
-
-  edges:
-    - from: 0
-      to: 1
-      window_size: -1
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml
deleted file mode 100644
index 586b601bc5a..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 DiT with FP8 online quantization.
-# The following config is for 2x H200 GPU.
-
-# Stage 0:  Diffusion (DiT + VAE)
-# This stage receives noise and timesteps and performs denoising + VAE decode
-stage_args:
-  - stage_id: 0
-    stage_type: diffusion
-    runtime:
-      devices: "0,1"
-      max_batch_size: 1
-    engine_args:
-      model_stage: dit
-      enforce_eager: true
-      trust_remote_code: true
-      distributed_executor_backend: "mp"
-      quantization: "fp8"
-      parallel_config:
-        tensor_parallel_size: 2
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-
-    final_output: true
-    final_output_type: image
-    is_comprehension: false
-    default_sampling_params:
-      seed: 42
-
-# Runtime edges
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml
deleted file mode 100644
index 1d8c7f4812d..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 DiT.
-# The following config has been verified on 4x H20 GPU.
-
-# Stage 0:  Diffusion (DiT + VAE)
-# This stage receives noise and timesteps and performs denoising + VAE decode
-stage_args:
-  - stage_id: 0
-    stage_type: diffusion
-    runtime:
-      devices: "0,1,2,3"
-    engine_args:
-      max_num_seqs: 1
-      model_stage: dit
-      enforce_eager: true
-      trust_remote_code: true
-      distributed_executor_backend: "mp"
-      parallel_config:
-        tensor_parallel_size: 4
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-
-    final_output: true
-    final_output_type: image
-    is_comprehension: false
-    default_sampling_params:
-      seed: 42
-
-# Runtime edges
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml
deleted file mode 100644
index 41ed74ba62a..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 on 2 GPUs with FP8.
-# Stage 0: AR Model (vLLM implementation)
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1"
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.9
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: latent
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 2
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 1.0
-      top_k: -1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.1
-
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml
deleted file mode 100644
index a0a1a0dc1c4..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml
+++ /dev/null
@@ -1,42 +0,0 @@
-# Stage config for HunyuanImage-3.0 Text-to-Text (T2T / pure text generation).
-# Single LLM stage: AR model reads text prompt only, generates text output.
-# Sampling params aligned with official generation_config.json.
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1,2,3"
-      max_batch_size: 1
-      requires_multimodal_data: false
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true
-      trust_remote_code: true
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 0.95
-      top_k: 1024
-      max_tokens: 2048
-      stop_token_ids: [127957, 128026]  # <|endoftext|>, </answer>
-      detokenize: True
-
-runtime:
-  enabled: true
diff --git a/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml
deleted file mode 100644
index 0fd03949d11..00000000000
--- a/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml
+++ /dev/null
@@ -1,35 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 DiT on NPU.
-# The following config has been verified on 8x A3-64G NPUs.
-
-# Stage 0: Diffusion (DiT + VAE)
-# This stage receives noise and timesteps and performs denoising + VAE decode.
-stage_args:
-  - stage_id: 0
-    stage_type: diffusion
-    runtime:
-      devices: "0,1,2,3,4,5,6,7"
-    engine_args:
-      max_num_seqs: 1
-      model_stage: dit
-      gpu_memory_utilization: 0.65
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: image
-      distributed_executor_backend: "mp"
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      parallel_config:
-        tensor_parallel_size: 8
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-
-    final_output: true
-    final_output_type: image
-    is_comprehension: false
-    default_sampling_params:
-      seed: 42
-
-# Runtime defaults
-runtime:
-  enabled: true
diff --git a/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml
deleted file mode 100644
index 4e0005f82a1..00000000000
--- a/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml
+++ /dev/null
@@ -1,80 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 with architecture of OmniLLM.
-# Stage 0: AR Model (vLLM implementation)
-
-# The following config has been verified on 8x Max 1550 GPU.
-modes:
-  - mode: text-to-image
-    stages: [1]
-  - mode: image-to-text
-    stages: [0]
-stage_args:
-  - stage_id: 0
-    stage_type: llm  # Use llm stage type to launch OmniLLM
-    runtime:
-      process: true  # Run this stage in a separate process
-      devices: "0,1,2,3,4,5,6,7"  # Visible devices for this stage
-      max_batch_size: 1
-    engine_args:
-      model_stage: AR
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.platforms.xpu.worker.xpu_ar_worker.XPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true  # Now we only support eager mode
-      trust_remote_code: true
-      engine_output_type: latent
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32784
-      tensor_parallel_size: 8
-      pipeline_parallel_size: 1
-      enable_expert_parallel: true
-      quantization: "fp8"
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 1.0
-      top_k: -1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.1
-  - stage_id: 1
-    stage_type: diffusion
-    runtime:
-      process: true
-      devices: "0,1,2,3,4,5,6,7"
-      max_batch_size: 1
-    engine_args:
-      model_stage: diffusion
-      gpu_memory_utilization: 0.9
-      enforce_eager: true
-      engine_output_type: image
-      distributed_executor_backend: "mp"
-      enable_prefix_caching: false
-      vae_use_slicing: false
-      vae_use_tiling: false
-      cache_backend: null
-      cache_config: null
-      enable_cache_dit_summary: false
-      quantization: "fp8"
-      parallel_config:
-        pipeline_parallel_size: 1
-        data_parallel_size: 1
-        tensor_parallel_size: 8
-        enable_expert_parallel: true
-        sequence_parallel_size: 1
-        ulysses_degree: 1
-        ring_degree: 1
-        cfg_parallel_size: 1
-        vae_patch_parallel_size: 1
-        use_hsdp: false
-        hsdp_shard_size: -1
-        hsdp_replicate_size: 1
-    final_output: true
-    final_output_type: image
-
-# Top-level runtime config (concise): default windows and stage edges
-runtime:
-  enabled: true