diff --git a/examples/offline_inference/hunyuan_image3/README.md b/examples/offline_inference/hunyuan_image3/README.md
index 3cd8fa01b2e..710ec26df9c 100644
--- a/examples/offline_inference/hunyuan_image3/README.md
+++ b/examples/offline_inference/hunyuan_image3/README.md
@@ -23,7 +23,7 @@ HunyuanImage-3.0-Instruct supports multiple modality modes. You can control the
 - **Pipeline**: Text → AR (CoT + latent tokens) → DiT (denoise) → VAE Decode → Image
 - **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
 - **KV Transfer**: AR sends KV cache to DiT for conditioned generation
-- **Default Config**: `hunyuan_image3_t2i.yaml`
+- **Default Config**: `vllm_omni/deploy/hunyuan_image3_t2i.yaml`
 
 ```bash
 python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
@@ -36,7 +36,7 @@ python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
 - **Pipeline**: Image + Text → AR (CoT + recaption + latent) → DiT → Edited Image
 - **Stages Used**: Stage 0 (AR) + Stage 1 (DiT)
 - **KV Transfer**: AR sends KV cache to DiT
-- **Default Config**: `hunyuan_image3_it2i.yaml`
+- **Default Config**: `vllm_omni/deploy/hunyuan_image3_it2i.yaml`
 
 ```bash
 python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
@@ -45,31 +45,6 @@ python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
     --prompts "Make the petals neon pink"
 ```
 
-#### Image to Text (img2text)
-
-- **Pipeline**: Image + Question → AR → Text description
-- **Stages Used**: Stage 0 (AR) only
-- **Default Config**: `hunyuan_image3_i2t.yaml`
-
-```bash
-python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
-    --modality img2text \
-    --image-path /path/to/image.jpg \
-    --prompts "Describe the content of the picture."
-```
-
-#### Text to Text (text2text)
-
-- **Pipeline**: Text → AR → Text
-- **Stages Used**: Stage 0 (AR) only
-- **Default Config**: `hunyuan_image3_t2t.yaml`
-
-```bash
-python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
-    --modality text2text \
-    --prompts "What is the capital of France?"
-```
-
 ### Inference Steps & Guidance
 
 Control generation quality for image modalities:
@@ -89,7 +64,7 @@ python end2end.py --modality text2img \
 | Argument | Type | Default | Description |
 | :--------------------- | :----- | :----------------------------------- | :----------------------------------------------------------- |
 | `--model` | string | `tencent/HunyuanImage-3.0-Instruct` | Model path or name |
-| `--modality` | choice | `text2img` | Modality: `text2img`, `img2img`, `img2text`, `text2text` |
+| `--modality` | choice | `text2img` | Modality: `text2img`, `img2img` |
 | `--prompts` | list | `None` | Input text prompts |
 | `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) |
 | `--output` | string | `.` | Output directory for saved images |
@@ -108,28 +83,20 @@ python end2end.py --modality text2img \
 
 #### ⚙️ Stage Configurations
 
-| Config YAML | Modality | Stages | GPUs | Description |
-| :---------------------------------- | :-------- | :----- | :----- | :------------------------------------ |
-| `hunyuan_image3_t2i.yaml` | text2img | 2 | 8 | T2I with AR→DiT, 4 GPU each |
-| `hunyuan_image3_it2i.yaml` | img2img | 2 | 8 | IT2I with AR→DiT, 4 GPU each |
-| `hunyuan_image3_i2t.yaml` | img2text | 1 | 4 | I2T (AR only) |
-| `hunyuan_image3_t2t.yaml` | text2text | 1 | 4 | T2T (AR only) |
-| `hunyuan_image3_t2i_2gpu.yaml` | text2img | 2 | 2 | T2I for 2-GPU setups |
-| `hunyuan_image3_moe.yaml` | text2img | 2 | 8 | T2I with MoE AR→DiT KV reuse |
-| `hunyuan_image3_moe_dit_2gpu_fp8.yaml` | text2img | 2 | 2 | T2I with FP8 quantization |
+All deploy YAMLs live under `vllm_omni/deploy/` in the new schema (PR #2383).
 
-------
+| Deploy YAML | Modality | Stages | GPUs | Description |
+| :--------------------------------------- | :--------- | :----- | :--- | :-------------------------------- |
+| `hunyuan_image3_t2i.yaml` | text2img | 2 | 8 | AR + DiT with KV transfer |
+| `hunyuan_image3_it2i.yaml` | img2img | 2 | 8 | AR + DiT (image-edit) |
 
-## Using MoE Config
+The `hunyuan_image3_dit_only` pipeline is also registered (no shipped deploy yaml) for users who want to skip the AR stage with a custom deploy.
 
-The `hunyuan_image3_moe.yaml` config enables AR→DiT KV cache reuse with 8 GPUs (4 for AR + 4 for DiT).
+------
 
-```bash
-python end2end.py --model tencent/HunyuanImage-3.0-Instruct \
-    --modality text2img \
-    --stage-configs-path hunyuan_image3_moe.yaml \
-    --prompts "A cute cat"
-```
+## AR→DiT KV cache reuse
+
+The default `hunyuan_image3_t2i.yaml` deploy already enables AR→DiT KV cache reuse on 8 GPUs (4 for AR + 4 for DiT) — the wiring lives on the pipeline (`omni_kv_config` for both stages).
 
 ------
 
diff --git a/examples/offline_inference/hunyuan_image3/end2end.py b/examples/offline_inference/hunyuan_image3/end2end.py
index 2cea303888e..08815465e45 100644
--- a/examples/offline_inference/hunyuan_image3/end2end.py
+++ b/examples/offline_inference/hunyuan_image3/end2end.py
@@ -72,12 +72,10 @@ def build_prompt(
     return "".join(parts)
 
 
-# Modality → default stage config
+# Modality → default deploy config (under vllm_omni/deploy/).
 _MODALITY_DEFAULT_CONFIG = {
-    "text2img": "hunyuan_image3_t2i.yaml",
-    "img2img": "hunyuan_image3_it2i.yaml",
-    "img2text": "hunyuan_image3_i2t.yaml",
-    "text2text": "hunyuan_image3_t2t.yaml",
+    "text2img": "vllm_omni/deploy/hunyuan_image3_t2i.yaml",
+    "img2img": "vllm_omni/deploy/hunyuan_image3_it2i.yaml",
 }
 
 
@@ -91,7 +89,7 @@ def parse_args():
     parser.add_argument(
         "--modality",
         default="text2img",
-        choices=["text2img", "img2img", "img2text", "text2text"],
+        choices=["text2img", "img2img"],
         help="Modality mode to control stage execution.",
     )
     parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.")
@@ -148,21 +146,15 @@ def main():
     # Determine task for prompt formatting
     task = args.bot_task or _MODALITY_TASK_MAP[args.modality]
 
-    # Determine stage config
-    stage_configs_path = args.stage_configs_path or _MODALITY_DEFAULT_CONFIG[args.modality]
-
-    # Build Omni
-    omni_kwargs = {
-        "model": args.model,
-        "stage_configs_path": stage_configs_path,
-        "log_stats": args.log_stats,
-        "init_timeout": args.init_timeout,
-        "enforce_eager": args.enforce_eager,
+    # Resolve modality-derived overrides — these are not direct CLI flags so
+    # forward them to ``from_cli_args`` via ``**overrides``.
+    overrides: dict[str, object] = {
+        "stage_configs_path": args.stage_configs_path or _MODALITY_DEFAULT_CONFIG[args.modality],
     }
     if args.modality in ("text2img", "img2img"):
-        omni_kwargs["mode"] = "text-to-image"
+        overrides["mode"] = "text-to-image"
 
-    omni = Omni(**omni_kwargs)
+    omni = Omni.from_cli_args(args, **overrides)
 
     # Prepare prompts
     prompts = args.prompts or ["A cute cat"]
@@ -222,7 +214,7 @@ def main():
     print("HunyuanImage-3.0 Generation Configuration:")
     print(f" Model: {args.model}")
     print(f" Modality: {args.modality}")
-    print(f" Stage config: {stage_configs_path}")
+    print(f" Stage config: {overrides['stage_configs_path']}")
     print(f" Num stages: {omni.num_stages}")
     if args.modality in ("text2img", "img2img"):
         print(f" Inference steps: {args.steps}")
diff --git a/tests/e2e/offline_inference/deploy/hunyuan_image3_dit_only_ci.yaml b/tests/e2e/offline_inference/deploy/hunyuan_image3_dit_only_ci.yaml
new file mode 100644
index 00000000000..cda71d9bcb1
--- /dev/null
+++ b/tests/e2e/offline_inference/deploy/hunyuan_image3_dit_only_ci.yaml
@@ -0,0 +1,27 @@
+# HunyuanImage-3.0 DiT-only (no AR). CUDA verified on 4x H20.
+pipeline: hunyuan_image3_dit_only
+trust_remote_code: true
+distributed_executor_backend: mp
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    enforce_eager: true
+    devices: "0,1,2,3"
+    parallel_config:
+      tensor_parallel_size: 4
+      enable_expert_parallel: true
+    default_sampling_params:
+      seed: 42
+
+platforms:
+  npu:
+    # Verified on 8x A3-64G NPUs.
+    stages:
+      - stage_id: 0
+        gpu_memory_utilization: 0.65
+        devices: "0,1,2,3,4,5,6,7"
+        max_num_batched_tokens: 32768
+        parallel_config:
+          tensor_parallel_size: 8
+          enable_expert_parallel: true
diff --git a/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py b/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py
index bd0d132d093..ef34feca294 100644
--- a/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py
+++ b/tests/e2e/offline_inference/test_hunyuanimage3_text2img.py
@@ -17,7 +17,7 @@ MODEL_NAME = "tencent/HunyuanImage-3.0"
 
 LOCAL_CLIP_PATH = "openai/clip-vit-base-patch32"
 REPO_ROOT = Path(__file__).resolve().parents[3]
-STAGE_CONFIG_PATH = REPO_ROOT / "vllm_omni" / "model_executor" / "stage_configs" / "hunyuan_image3_t2i.yaml"
+STAGE_CONFIG_PATH = REPO_ROOT / "tests" / "e2e" / "offline_inference" / "deploy" / "hunyuan_image3_dit_only_ci.yaml"
 
 pytestmark = [pytest.mark.advanced_model, pytest.mark.diffusion]
 
diff --git a/tests/test_config_factory.py b/tests/test_config_factory.py
index 16d49034fa1..6ad5f04b6bf 100644
--- a/tests/test_config_factory.py
+++ b/tests/test_config_factory.py
@@ -1334,3 +1334,61 @@ def test_constraints_win(self):
         assert stages[1].yaml_extras["default_sampling_params"]["stop_token_ids"] == [2150]
         # Deploy temperature still flows through
         assert stages[0].yaml_extras["default_sampling_params"]["temperature"] == 0.4
+
+
+class TestHunyuanImage3ShippedDeploys:
+    """Structural smoke tests for shipped Hunyuan-Image3 deploy yamls.
+
+    The GPU-gated e2e test (``test_hunyuanimage3_text2img.py``) runs against
+    the DiT-only CI fixture; these cheap tests catch schema regressions in
+    the shipped AR→DiT t2i / it2i / dit_only deploys that no GPU is needed
+    to see.
+    """
+
+    @pytest.mark.parametrize(
+        "yaml_name,expected_pipeline,expected_stage_count,expected_stages",
+        [
+            ("hunyuan_image3_t2i.yaml", "hunyuan_image3_t2i", 2, ("AR", "dit")),
+            ("hunyuan_image3_it2i.yaml", "hunyuan_image3_it2i", 2, ("AR", "dit")),
+            ("hunyuan_image3_dit_only.yaml", "hunyuan_image3_dit_only", 1, ("dit",)),
+        ],
+    )
+    def test_shipped_deploys_parse_and_resolve(
+        self, yaml_name, expected_pipeline, expected_stage_count, expected_stages
+    ):
+        import vllm_omni.model_executor.models.hunyuan_image3.pipeline  # noqa: F401
+        from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy
+
+        deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / yaml_name
+        assert deploy_path.exists(), f"Shipped deploy missing: {yaml_name}"
+
+        deploy = load_deploy_config(deploy_path)
+        assert deploy.pipeline == expected_pipeline
+        assert len(deploy.stages) == expected_stage_count
+
+        pipeline = _PIPELINE_REGISTRY[expected_pipeline]
+        assert tuple(s.model_stage for s in pipeline.stages) == expected_stages
+
+        stages = merge_pipeline_deploy(pipeline, deploy)
+        assert len(stages) == expected_stage_count
+
+    def test_t2i_ar_dit_topology(self):
+        """The AR→DiT t2i default wires stage 1 to consume stage 0's KV output."""
+        import vllm_omni.model_executor.models.hunyuan_image3.pipeline  # noqa: F401
+        from vllm_omni.config.stage_config import load_deploy_config, merge_pipeline_deploy
+
+        deploy_path = Path(__file__).parent.parent / "vllm_omni" / "deploy" / "hunyuan_image3_t2i.yaml"
+        assert deploy_path.exists(), "Shipped deploy missing: hunyuan_image3_t2i.yaml"
+
+        pipeline = _PIPELINE_REGISTRY["hunyuan_image3_t2i"]
+        deploy = load_deploy_config(deploy_path)
+        stages = merge_pipeline_deploy(pipeline, deploy)
+
+        # Pipeline-level invariants for the KV-transfer path.
+        assert pipeline.stages[0].omni_kv_config is not None
+        assert pipeline.stages[1].input_sources == (0,)
+        assert pipeline.stages[1].omni_kv_config is not None
+
+        # Deploy-level placement: 4 AR + 4 DiT across 8 devices.
+        assert stages[0].yaml_runtime.get("devices") == "0,1,2,3"
+        assert stages[1].yaml_runtime.get("devices") == "4,5,6,7"
diff --git a/vllm_omni/config/pipeline_registry.py b/vllm_omni/config/pipeline_registry.py
index 6f5c072a353..e516280deed 100644
--- a/vllm_omni/config/pipeline_registry.py
+++ b/vllm_omni/config/pipeline_registry.py
@@ -33,6 +33,28 @@
 # --- Multi-stage omni pipelines (LLM-centric; audio / video I/O) ---
 _OMNI_PIPELINES: dict[str, tuple[str, str]] = {
     # model_type -> (module_path, variable_name)
+    "hunyuan_image3_t2i": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_T2I_PIPELINE",
+    ),
+    "hunyuan_image3_it2i": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_IT2I_PIPELINE",
+    ),
+    # ``dit_only`` ships a default deploy yaml (DiT-only path + NPU section);
+    # ``i2t`` / ``t2t`` are kept BYO.
+    "hunyuan_image3_dit_only": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_DIT_ONLY_PIPELINE",
+    ),
+    "hunyuan_image3_i2t": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_I2T_PIPELINE",
+    ),
+    "hunyuan_image3_t2t": (
+        "vllm_omni.model_executor.models.hunyuan_image3.pipeline",
+        "HUNYUAN_IMAGE3_T2T_PIPELINE",
+    ),
     "qwen2_5_omni": (
         "vllm_omni.model_executor.models.qwen2_5_omni.pipeline",
         "QWEN2_5_OMNI_PIPELINE",
diff --git a/vllm_omni/config/stage_config.py b/vllm_omni/config/stage_config.py
index 950685d6219..1ac12dfe749 100644
--- a/vllm_omni/config/stage_config.py
+++ b/vllm_omni/config/stage_config.py
@@ -463,6 +463,22 @@ class DeployConfig:
 _STAGE_DEPLOY_FIELDS = {f.name: f for f in fields(StageDeployConfig) if f.name not in _STAGE_NON_ENGINE_KEYS}
 
 
+_DIT_PARALLEL_FIELDS_AT_TOP_LEVEL = frozenset(
+    {
+        "enable_expert_parallel",
+        "sequence_parallel_size",
+        "ulysses_degree",
+        "ring_degree",
+        "ulysses_mode",
+        "cfg_parallel_size",
+        "vae_patch_parallel_size",
+        "use_hsdp",
+        "hsdp_shard_size",
+        "hsdp_replicate_size",
+    }
+)
+
+
 def _parse_stage_deploy(stage_data: dict[str, Any]) -> StageDeployConfig:
     """Parse a single stage entry from deploy YAML into StageDeployConfig."""
     if "engine_args" in stage_data:
@@ -477,6 +493,22 @@ def _parse_stage_deploy(stage_data: dict[str, Any]) -> StageDeployConfig:
         if name in engine_args:
             kwargs[name] = engine_args.pop(name)
 
+    # Flat schema support: hoist DiT-only parallel fields and tensor_parallel_size
+    # from top level into a parallel_config block. Lets authors write the same
+    # flat shape for AR and DiT stages — DiT engines read parallel_config.*,
+    # AR engines read top-level fields directly. Existing nested parallel_config
+    # forms keep working (we only setdefault).
+    pc = engine_args.get("parallel_config")
+    if not isinstance(pc, dict):
+        pc = {}
+    for name in _DIT_PARALLEL_FIELDS_AT_TOP_LEVEL:
+        if name in engine_args:
+            pc.setdefault(name, engine_args.pop(name))
+    if (tps := kwargs.get("tensor_parallel_size")) and tps > 1:
+        pc.setdefault("tensor_parallel_size", tps)
+    if pc:
+        engine_args["parallel_config"] = pc
+
     kwargs["output_connectors"] = stage_data.get("output_connectors")
     kwargs["input_connectors"] = stage_data.get("input_connectors")
     kwargs["default_sampling_params"] = stage_data.get("default_sampling_params")
@@ -714,6 +746,8 @@ def _build_engine_args(
         engine_args["model_subdir"] = ps.model_subdir
     if ps.tokenizer_subdir:
         engine_args["tokenizer_subdir"] = ps.tokenizer_subdir
+    if ps.omni_kv_config is not None:
+        engine_args["omni_kv_config"] = dict(ps.omni_kv_config)
 
     # Pipeline-wide top-level DeployConfig settings, applied to every stage.
     for name in _PIPELINE_WIDE_ENGINE_FIELDS:
@@ -800,6 +834,8 @@ def merge_pipeline_deploy(
         runtime: dict[str, Any] = {"process": True}
         if ds is not None:
             runtime["devices"] = ds.devices
+        if ps.requires_multimodal_data:
+            runtime["requires_multimodal_data"] = True
 
         result.append(
             StageConfig(
diff --git a/vllm_omni/deploy/hunyuan_image3_dit_only.yaml b/vllm_omni/deploy/hunyuan_image3_dit_only.yaml
new file mode 100644
index 00000000000..8ce6ee73835
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_dit_only.yaml
@@ -0,0 +1,29 @@
+# HunyuanImage-3.0 DiT-only (no AR). Matches the Tencent reference
+# `generate_image(prompt, bot_task="image")` path. CUDA verified on 4x H20;
+# NPU verified on 8x A3-64G.
+pipeline: hunyuan_image3_dit_only
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    enforce_eager: true
+    tensor_parallel_size: 4
+    enable_expert_parallel: true
+    devices: "0,1,2,3"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
+
+platforms:
+  npu:
+    stages:
+      - stage_id: 0
+        gpu_memory_utilization: 0.65
+        tensor_parallel_size: 8
+        enable_expert_parallel: true
+        devices: "0,1,2,3,4,5,6,7"
+        max_num_batched_tokens: 32768
diff --git a/vllm_omni/deploy/hunyuan_image3_dit_only_2gpu.yaml b/vllm_omni/deploy/hunyuan_image3_dit_only_2gpu.yaml
new file mode 100644
index 00000000000..4cc5f7a37a7
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_dit_only_2gpu.yaml
@@ -0,0 +1,20 @@
+# HunyuanImage-3.0 DiT-only on 2 GPUs (FP8). For 2x H200 / H100 / A100-80GB.
+# Single-stage variant of `hunyuan_image3_dit_only.yaml` retuned for 2-GPU
+# placement; matches Tencent's `bot_task="image"` (DiT-only, no AR rewriter).
+pipeline: hunyuan_image3_dit_only
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+quantization: fp8
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    enforce_eager: true
+    tensor_parallel_size: 2
+    enable_expert_parallel: true
+    devices: "0,1"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
diff --git a/vllm_omni/deploy/hunyuan_image3_it2i.yaml b/vllm_omni/deploy/hunyuan_image3_it2i.yaml
new file mode 100644
index 00000000000..3a5f245455a
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_it2i.yaml
@@ -0,0 +1,32 @@
+# HunyuanImage-3.0 image-edit (AR consumes image+text, DiT decodes).
+# Verified on 8x L40S-48G (4 AR + 4 DiT).
+pipeline: hunyuan_image3_it2i
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    gpu_memory_utilization: 0.95
+    enforce_eager: true
+    tensor_parallel_size: 4
+    devices: "0,1,2,3"
+    hf_overrides:
+      rope_parameters:
+        mrope_section: [0, 32, 32]
+        rope_type: default
+    default_sampling_params:
+      temperature: 0.6
+      top_p: 0.95
+      top_k: 1024
+      max_tokens: 4096
+
+  - stage_id: 1
+    max_num_seqs: 1
+    tensor_parallel_size: 4
+    enable_expert_parallel: true
+    devices: "4,5,6,7"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
diff --git a/vllm_omni/deploy/hunyuan_image3_t2i.yaml b/vllm_omni/deploy/hunyuan_image3_t2i.yaml
new file mode 100644
index 00000000000..c66eb69cbc3
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_t2i.yaml
@@ -0,0 +1,59 @@
+# HunyuanImage-3.0 text-to-image (AR + DiT with KV transfer).
+# CUDA default verified on 8x L40S-48G (4 AR + 4 DiT). Stop tokens / KV
+# transfer wiring live on the pipeline; deploy carries placement + sampling.
+# NPU deployment is DiT-only — use `--pipeline hunyuan_image3_dit_only`
+# with `vllm_omni/deploy/hunyuan_image3_dit_only.yaml`.
+# FP8 on 2x H200 ships as `vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml`.
+pipeline: hunyuan_image3_t2i
+trust_remote_code: true
+distributed_executor_backend: mp
+# AR → DiT cross-stage KV transfer is sync (DiT consumes AR's full KV after
+# prefill_finished), no per-token next-stage processor.
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    gpu_memory_utilization: 0.9
+    enforce_eager: true
+    tensor_parallel_size: 4
+    devices: "0,1,2,3"
+    hf_overrides:
+      rope_parameters:
+        mrope_section: [0, 32, 32]
+        rope_type: default
+    default_sampling_params:
+      temperature: 0.0
+      top_p: 1.0
+      top_k: -1
+      max_tokens: 2048
+      seed: 42
+      repetition_penalty: 1.1
+
+  - stage_id: 1
+    max_num_seqs: 1
+    tensor_parallel_size: 4
+    enable_expert_parallel: true
+    devices: "4,5,6,7"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
+
+platforms:
+  xpu:
+    # Verified on 8x Intel Arc Max 1550 with FP8 weights.
+    quantization: fp8
+    stages:
+      - stage_id: 0
+        gpu_memory_utilization: 0.95
+        tensor_parallel_size: 8
+        devices: "0,1,2,3,4,5,6,7"
+        max_num_batched_tokens: 32784
+        worker_cls: vllm_omni.platforms.xpu.worker.xpu_ar_worker.XPUARWorker
+        enable_expert_parallel: true
+      - stage_id: 1
+        gpu_memory_utilization: 0.9
+        tensor_parallel_size: 8
+        enable_expert_parallel: true
+        devices: "0,1,2,3,4,5,6,7"
diff --git a/vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml b/vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml
new file mode 100644
index 00000000000..625edbeb91a
--- /dev/null
+++ b/vllm_omni/deploy/hunyuan_image3_t2i_fp8.yaml
@@ -0,0 +1,42 @@
+# HunyuanImage-3.0 text-to-image with FP8 online quantization.
+# CUDA verified on 2x H200 (1 AR + 1 DiT, FP8 on the DiT side).
+pipeline: hunyuan_image3_t2i
+trust_remote_code: true
+distributed_executor_backend: mp
+async_chunk: false
+quantization: fp8
+
+stages:
+  - stage_id: 0
+    max_num_seqs: 1
+    gpu_memory_utilization: 0.9
+    enforce_eager: true
+    tensor_parallel_size: 2
+    # AR colocates with DiT on devices 0,1. DiT reserves ~32 GiB per rank
+    # before AR's profile_run, which the profiling fallback misattributes as
+    # "non_torch_increase" and zeroes out the KV budget. Set kv_cache_memory_bytes
+    # explicitly to bypass profiling — AR needs ~3 GiB per rank for max_num_seqs=1
+    # at max_model_len=22800.
+    kv_cache_memory_bytes: 4294967296  # 4 GiB per rank
+    devices: "0,1"
+    hf_overrides:
+      rope_parameters:
+        mrope_section: [0, 32, 32]
+        rope_type: default
+    default_sampling_params:
+      temperature: 0.0
+      top_p: 1.0
+      top_k: -1
+      max_tokens: 2048
+      seed: 42
+      repetition_penalty: 1.1
+
+  - stage_id: 1
+    max_num_seqs: 1
+    tensor_parallel_size: 2
+    enable_expert_parallel: true
+    devices: "0,1"
+    default_sampling_params:
+      num_inference_steps: 50
+      guidance_scale: 2.5
+      seed: 42
diff --git a/vllm_omni/model_executor/models/hunyuan_image3/pipeline.py b/vllm_omni/model_executor/models/hunyuan_image3/pipeline.py
new file mode 100644
index 00000000000..3007337c7e7
--- /dev/null
+++ b/vllm_omni/model_executor/models/hunyuan_image3/pipeline.py
@@ -0,0 +1,161 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""HunyuanImage-3.0 pipeline topologies (frozen).
+
+Five variants share one HF model_arch (HunyuanImage3ForCausalMM) but expose
+different stage graphs:
+
+    t2i       Stage 0 AR (text → latent) + Stage 1 DiT, KV-transfer
+    it2i      Stage 0 AR (image+text → latent) + Stage 1 DiT, KV-transfer
+    dit_only  Stage 0 DiT only (latent → image)
+    i2t       Stage 0 AR only (image+text → text)
+    t2t       Stage 0 AR only (text → text)
+
+Variants are surfaced as separate model_types so the orchestrator picks the
+right topology from deploy YAML alone (mirrors the qwen2_5_omni /
+qwen2_5_omni_thinker_only split). ``t2i`` / ``it2i`` / ``dit_only`` ship
+with default deploy yamls (the dit_only deploy also carries the NPU
+overlay); ``i2t`` / ``t2t`` are registered for bring-your-own deploy.
+"""
+
+from vllm_omni.config.stage_config import (
+    PipelineConfig,
+    StageExecutionType,
+    StagePipelineConfig,
+)
+
+_MODEL_ARCH = "HunyuanImage3ForCausalMM"
+_AR2DIT = "vllm_omni.model_executor.stage_input_processors.hunyuan_image3.ar2diffusion"
+
+# AR-side KV transfer config: send cache once prefill is done so the DiT stage
+# can splice it in without re-running attention over the prompt.
+_AR_KV_SEND = {
+    "need_send_cache": True,
+    "kv_transfer_criteria": {"type": "prefill_finished"},
+}
+_DIT_KV_RECV = {"need_recv_cache": True}
+
+
+# Only one variant carries the hf_architectures fallback so the deploy yaml's
+# explicit ``pipeline:`` field stays the single source of truth for variant
+# selection. T2I is the default because it's the headline modality.
+HUNYUAN_IMAGE3_T2I_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_t2i",
+    model_arch=_MODEL_ARCH,
+    hf_architectures=("HunyuanImage3ForCausalMM",),
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            engine_output_type="latent",
+            sampling_constraints={"detokenize": True},
+            omni_kv_config=_AR_KV_SEND,
+        ),
+        StagePipelineConfig(
+            stage_id=1,
+            model_stage="dit",
+            execution_type=StageExecutionType.DIFFUSION,
+            input_sources=(0,),
+            final_output=True,
+            final_output_type="image",
+            engine_output_type="image",
+            custom_process_input_func=_AR2DIT,
+            omni_kv_config=_DIT_KV_RECV,
+        ),
+    ),
+)
+
+
+HUNYUAN_IMAGE3_IT2I_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_it2i",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=False,
+            owns_tokenizer=True,
+            requires_multimodal_data=True,
+            engine_output_type="latent",
+            sampling_constraints={"stop_token_ids": [127957], "detokenize": False},
+            omni_kv_config=_AR_KV_SEND,
+        ),
+        StagePipelineConfig(
+            stage_id=1,
+            model_stage="dit",
+            execution_type=StageExecutionType.DIFFUSION,
+            input_sources=(0,),
+            final_output=True,
+            final_output_type="image",
+            engine_output_type="image",
+            requires_multimodal_data=True,
+            custom_process_input_func=_AR2DIT,
+            omni_kv_config=_DIT_KV_RECV,
+        ),
+    ),
+)
+
+
+HUNYUAN_IMAGE3_DIT_ONLY_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_dit_only",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="dit",
+            execution_type=StageExecutionType.DIFFUSION,
+            input_sources=(),
+            final_output=True,
+            final_output_type="image",
+            engine_output_type="image",
+            omni_kv_config=_DIT_KV_RECV,
+        ),
+    ),
+)
+
+
+# AR-only variants (no DiT). Registered for users bringing their own deploy
+# yaml — no default deploy yaml ships because hardware sizing for I2T/T2T
+# depends on the use case.
+HUNYUAN_IMAGE3_I2T_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_i2t",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            requires_multimodal_data=True,
+            sampling_constraints={"stop_token_ids": [127957, 128026], "detokenize": True},
+        ),
+    ),
+)
+
+
+HUNYUAN_IMAGE3_T2T_PIPELINE = PipelineConfig(
+    model_type="hunyuan_image3_t2t",
+    model_arch=_MODEL_ARCH,
+    stages=(
+        StagePipelineConfig(
+            stage_id=0,
+            model_stage="AR",
+            execution_type=StageExecutionType.LLM_AR,
+            input_sources=(),
+            final_output=True,
+            final_output_type="text",
+            owns_tokenizer=True,
+            sampling_constraints={"stop_token_ids": [127957, 128026], "detokenize": True},
+        ),
+    ),
+)
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml
deleted file mode 100644
index b68b184ec31..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_i2t.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-# Stage config for HunyuanImage-3.0 Image-to-Text (I2T / image understanding).
-# Single LLM stage: AR model reads image + text prompt, generates text output.
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1,2,3"
-      max_batch_size: 1
-      requires_multimodal_data: true
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true
-      trust_remote_code: true
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 0.95
-      top_k: 1024
-      max_tokens: 2048
-      stop_token_ids: [127957, 128026]  # <|endoftext|>,
-      detokenize: True
-
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml
deleted file mode 100644
index 413e0f09cbe..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_it2i.yaml
+++ /dev/null
@@ -1,74 +0,0 @@
-# Stage config for HunyuanImage-3.0 Image+Text-to-Image (image editing).
-# Stage 0: AR (HunyuanImage3ForConditionalGeneration) — reads (image, text), emits latent tokens
-# Stage 1: Diffusion (HunyuanImage3Pipeline / DiT + VAE) — denoise + decode latents → image
-
-stage_args:
-  # Stage 0: AR Model
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1,2,3"
-      max_batch_size: 1
-      requires_multimodal_data: true  # AR needs the original image
-    engine_args:
-      model_stage: AR
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: latent  # AR outputs latent for DiT
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: false  # Generation task, not comprehension
-    final_output: false  # AR is not the final output
-    default_sampling_params:
-      temperature: 0.6
-      top_p: 0.95
-      top_k: 1024
-      max_tokens: 4096
-      stop_token_ids: [127957]  # <|endoftext|>
-      detokenize: false
-
-  # Stage 1: Diffusion (DiT + VAE)
-  # Receives latents from AR stage, performs denoising + VAE decode
-  - stage_id: 1
-    stage_type: diffusion
-    runtime:
-      process: true
-      devices: "4,5,6,7"
-      max_batch_size: 1
-      requires_multimodal_data: true  # May need condition images
-    engine_args:
-      model_stage: dit
-      model_arch: HunyuanImage3ForCausalMM
-      enforce_eager: true
-      trust_remote_code: true
-      distributed_executor_backend: "mp"
-      parallel_config:
-        tensor_parallel_size: 4
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-    engine_input_source: [0]  # Input from AR stage
-    custom_process_input_func: vllm_omni.model_executor.stage_input_processors.hunyuan_image3.ar2diffusion
-    final_output: true
-    final_output_type: image
-    default_sampling_params:
-      num_inference_steps: 50
-      guidance_scale: 2.5
-
-# Top-level runtime config
-runtime:
-  enabled: true
-  edges:
-    - from: 0  # AR → Diffusion
-      to: 1
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
deleted file mode 100644
index f0797c63270..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml
+++ /dev/null
@@ -1,96 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 with AR→DiT KV reuse.
-# Stage 0: AR Model (vLLM implementation)
-# Stage 1: DiT Model (diffusion)
-#
-# text-to-image flow: AR (stage 0) → KV transfer → DiT (stage 1)
-# image-to-text flow: AR (stage 0) only
-#
-# Compared to hunyuan_image3_t2i.yaml, this config:
-#   1. Enables both stages [0, 1] for text-to-image (AR prefill + DiT denoising)
-#   2. Adds omni_kv_config to send/receive KV cache between stages
-
-# The following config has been verified on 8x L40S-48G GPU (4 for AR + 4 for DiT).
-stage_args:
-  - stage_id: 0
-    stage_type: llm  # Use llm stage type for AR stages
-    runtime:
-      process: true  # Run this stage in a separate process
-      devices: "0,1,2,3"  # AR stage uses GPU 0-3
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.9
-      enforce_eager: true  # Now we only support eager mode
-      trust_remote_code: true
-      engine_output_type: latent
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-      omni_kv_config:
-        need_send_cache: true
-        kv_transfer_criteria:
-          type: prefill_finished  # Send KV cache after AR prefill completes
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 1.0
-      top_k: -1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.1
-  - stage_id: 1
-    stage_type: diffusion
-    runtime:
-      process: true
-      devices: "4,5,6,7"  # DiT stage uses GPU 4-7
-      max_batch_size: 1
-    engine_args:
-      model_stage: diffusion
-      enforce_eager: true
-      distributed_executor_backend: "mp"
-      vae_use_slicing: false
-      vae_use_tiling: false
-      cache_backend: null
-      cache_config: null
-      enable_cache_dit_summary: false
-      omni_kv_config:
-        need_recv_cache: true  # Receive AR KV cache from stage 0
-      parallel_config:
-        pipeline_parallel_size: 1
-        data_parallel_size: 1
-        tensor_parallel_size: 4
-        enable_expert_parallel: false
-        sequence_parallel_size: 1
-        ulysses_degree: 1
-        ring_degree: 1
-        cfg_parallel_size: 1
-        vae_patch_parallel_size: 1
-        use_hsdp: false
-        hsdp_shard_size: -1
-        hsdp_replicate_size: 1
-    engine_input_source: [0]  # Receive input (including KV) from stage 0
-    final_output: true
-    final_output_type: image
-
-# Top-level runtime config: windows, edges, and connectors
-runtime:
-  enabled: true
-  defaults:
-    window_size: -1  # Trigger downstream only after full upstream completion
-    max_inflight: 1  # Process serially within each stage
-
-  edges:
-    - from: 0
-      to: 1
-      window_size: -1
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml
deleted file mode 100644
index 586b601bc5a..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 DiT with FP8 online quantization.
-# The following config is for 2x H200 GPU.
-
-# Stage 0: Diffusion (DiT + VAE)
-# This stage receives noise and timesteps and performs denoising + VAE decode
-stage_args:
-  - stage_id: 0
-    stage_type: diffusion
-    runtime:
-      devices: "0,1"
-      max_batch_size: 1
-    engine_args:
-      model_stage: dit
-      enforce_eager: true
-      trust_remote_code: true
-      distributed_executor_backend: "mp"
-      quantization: "fp8"
-      parallel_config:
-        tensor_parallel_size: 2
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-
-    final_output: true
-    final_output_type: image
-    is_comprehension: false
-    default_sampling_params:
-      seed: 42
-
-# Runtime edges
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml
deleted file mode 100644
index 1d8c7f4812d..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 DiT.
-# The following config has been verified on 4x H20 GPU.
-
-# Stage 0: Diffusion (DiT + VAE)
-# This stage receives noise and timesteps and performs denoising + VAE decode
-stage_args:
-  - stage_id: 0
-    stage_type: diffusion
-    runtime:
-      devices: "0,1,2,3"
-    engine_args:
-      max_num_seqs: 1
-      model_stage: dit
-      enforce_eager: true
-      trust_remote_code: true
-      distributed_executor_backend: "mp"
-      parallel_config:
-        tensor_parallel_size: 4
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-
-    final_output: true
-    final_output_type: image
-    is_comprehension: false
-    default_sampling_params:
-      seed: 42
-
-# Runtime edges
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml
deleted file mode 100644
index 41ed74ba62a..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2i_2gpu.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 on 2 GPUs with FP8.
-# Stage 0: AR Model (vLLM implementation)
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1"
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.9
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: latent
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 2
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 1.0
-      top_k: -1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.1
-
-runtime:
-  enabled: true
diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml
deleted file mode 100644
index a0a1a0dc1c4..00000000000
--- a/vllm_omni/model_executor/stage_configs/hunyuan_image3_t2t.yaml
+++ /dev/null
@@ -1,42 +0,0 @@
-# Stage config for HunyuanImage-3.0 Text-to-Text (T2T / pure text generation).
-# Single LLM stage: AR model reads text prompt only, generates text output.
-# Sampling params aligned with official generation_config.json.
-
-stage_args:
-  - stage_id: 0
-    stage_type: llm
-    runtime:
-      process: true
-      devices: "0,1,2,3"
-      max_batch_size: 1
-      requires_multimodal_data: false
-    engine_args:
-      model_stage: AR
-      max_num_seqs: 1
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true
-      trust_remote_code: true
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      tensor_parallel_size: 4
-      pipeline_parallel_size: 1
-      hf_overrides:
-        rope_parameters:
-          mrope_section: [0, 32, 32]
-          rope_type: default
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 0.95
-      top_k: 1024
-      max_tokens: 2048
-      stop_token_ids: [127957, 128026]  # <|endoftext|>,
-      detokenize: True
-
-runtime:
-  enabled: true
diff --git a/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml
deleted file mode 100644
index 0fd03949d11..00000000000
--- a/vllm_omni/platforms/npu/stage_configs/hunyuan_image3_t2i.yaml
+++ /dev/null
@@ -1,35 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 DiT on NPU.
-# The following config has been verified on 8x A3-64G NPUs.
-
-# Stage 0: Diffusion (DiT + VAE)
-# This stage receives noise and timesteps and performs denoising + VAE decode.
-stage_args:
-  - stage_id: 0
-    stage_type: diffusion
-    runtime:
-      devices: "0,1,2,3,4,5,6,7"
-    engine_args:
-      max_num_seqs: 1
-      model_stage: dit
-      gpu_memory_utilization: 0.65
-      enforce_eager: true
-      trust_remote_code: true
-      engine_output_type: image
-      distributed_executor_backend: "mp"
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32768
-      parallel_config:
-        tensor_parallel_size: 8
-        enable_expert_parallel: true
-      omni_kv_config:
-        need_recv_cache: true
-
-    final_output: true
-    final_output_type: image
-    is_comprehension: false
-    default_sampling_params:
-      seed: 42
-
-# Runtime defaults
-runtime:
-  enabled: true
diff --git a/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml b/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml
deleted file mode 100644
index 4e0005f82a1..00000000000
--- a/vllm_omni/platforms/xpu/stage_configs/hunyuan_image3_t2i.yaml
+++ /dev/null
@@ -1,80 +0,0 @@
-# Stage config for running Hunyuan-Image3.0 with architecture of OmniLLM.
-# Stage 0: AR Model (vLLM implementation)
-
-# The following config has been verified on 8x Max 1550 GPU.
-modes:
-  - mode: text-to-image
-    stages: [1]
-  - mode: image-to-text
-    stages: [0]
-stage_args:
-  - stage_id: 0
-    stage_type: llm  # Use llm stage type to launch OmniLLM
-    runtime:
-      process: true  # Run this stage in a separate process
-      devices: "0,1,2,3,4,5,6,7"  # Visible devices for this stage
-      max_batch_size: 1
-    engine_args:
-      model_stage: AR
-      model_arch: HunyuanImage3ForCausalMM
-      worker_cls: vllm_omni.platforms.xpu.worker.xpu_ar_worker.XPUARWorker
-      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
-      gpu_memory_utilization: 0.95
-      enforce_eager: true  # Now we only support eager mode
-      trust_remote_code: true
-      engine_output_type: latent
-      enable_prefix_caching: false
-      max_num_batched_tokens: 32784
-      tensor_parallel_size: 8
-      pipeline_parallel_size: 1
-      enable_expert_parallel: true
-      quantization: "fp8"
-    is_comprehension: true
-    final_output: true
-    final_output_type: text
-    default_sampling_params:
-      temperature: 0.0
-      top_p: 1.0
-      top_k: -1
-      max_tokens: 2048
-      seed: 42
-      detokenize: True
-      repetition_penalty: 1.1
-  - stage_id: 1
-    stage_type: diffusion
-    runtime:
-      process: true
-      devices: "0,1,2,3,4,5,6,7"
-      max_batch_size: 1
-    engine_args:
-      model_stage: diffusion
-      gpu_memory_utilization: 0.9
-      enforce_eager: true
-      engine_output_type: image
-      distributed_executor_backend: "mp"
-      enable_prefix_caching: false
-      vae_use_slicing: false
-      vae_use_tiling: false
-      cache_backend: null
-      cache_config: null
-      enable_cache_dit_summary: false
-      quantization: "fp8"
-      parallel_config:
-        pipeline_parallel_size: 1
-        data_parallel_size: 1
-        tensor_parallel_size: 8
-        enable_expert_parallel: true
-        sequence_parallel_size: 1
-        ulysses_degree: 1
-        ring_degree: 1
-        cfg_parallel_size: 1
-        vae_patch_parallel_size: 1
-        use_hsdp: false
-        hsdp_shard_size: -1
-        hsdp_replicate_size: 1
-    final_output: true
-    final_output_type: image
-
-# Top-level runtime config (concise): default windows and stage edges
-runtime:
-  enabled: true
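Reviewer note: the removed stage configs share a small set of conventions (a `devices` string per stage, a tensor-parallel size set either directly in `engine_args` or under `engine_args.parallel_config`, and `runtime.edges` linking stage IDs). When porting these to the new deploy schema, a sanity check along the following lines may help. This is a minimal sketch, not part of vllm-omni: `check_stage_config`, `tensor_parallel_size`, and `EXAMPLE_CONFIG` are hypothetical names, the dict is a hand-written stand-in for an already-parsed YAML file, and the rule "device count equals tensor-parallel size" is an assumption that merely holds for every config deleted in this diff.

```python
from typing import Any, Dict, List


def tensor_parallel_size(stage: Dict[str, Any]) -> int:
    """Return a stage's tensor-parallel size, defaulting to 1."""
    engine_args = stage.get("engine_args", {})
    # Diffusion stages in the removed configs nest the value under
    # parallel_config; llm stages set it directly on engine_args.
    parallel_config = engine_args.get("parallel_config", {})
    if "tensor_parallel_size" in parallel_config:
        return int(parallel_config["tensor_parallel_size"])
    return int(engine_args.get("tensor_parallel_size", 1))


def check_stage_config(config: Dict[str, Any]) -> List[str]:
    """Collect human-readable problems found in a parsed stage config."""
    problems: List[str] = []
    stages = config.get("stage_args", [])
    stage_ids = {stage["stage_id"] for stage in stages}

    for stage in stages:
        sid = stage["stage_id"]
        raw_devices = str(stage.get("runtime", {}).get("devices", ""))
        devices = [d.strip() for d in raw_devices.split(",") if d.strip()]
        tp = tensor_parallel_size(stage)
        # Assumption (not enforced by vllm-omni as far as this diff shows):
        # each stage lists exactly one device per tensor-parallel rank.
        if devices and len(devices) != tp:
            problems.append(
                f"stage {sid}: {len(devices)} devices but tensor_parallel_size={tp}"
            )
        for source in stage.get("engine_input_source", []):
            if source not in stage_ids:
                problems.append(f"stage {sid}: unknown input source {source}")

    for edge in config.get("runtime", {}).get("edges", []):
        for end in ("from", "to"):
            if edge[end] not in stage_ids:
                problems.append(f"edge {edge}: unknown stage id {edge[end]}")
    return problems


# Hand-written stand-in mirroring the two-stage AR -> DiT layout above.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "stage_args": [
        {
            "stage_id": 0,
            "stage_type": "llm",
            "runtime": {"devices": "0,1,2,3"},
            "engine_args": {"tensor_parallel_size": 4},
        },
        {
            "stage_id": 1,
            "stage_type": "diffusion",
            "runtime": {"devices": "4,5,6,7"},
            "engine_args": {"parallel_config": {"tensor_parallel_size": 4}},
            "engine_input_source": [0],
        },
    ],
    "runtime": {"enabled": True, "edges": [{"from": 0, "to": 1}]},
}

print(check_stage_config(EXAMPLE_CONFIG))  # → []
```

Returning a list of messages rather than raising on the first mismatch lets a reviewer see every inconsistency in a migrated config in one pass.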