diff --git a/examples/offline_inference/hunyuan_image3/README.md b/examples/offline_inference/hunyuan_image3/README.md index da28a44d9e6..3cd8fa01b2e 100644 --- a/examples/offline_inference/hunyuan_image3/README.md +++ b/examples/offline_inference/hunyuan_image3/README.md @@ -1,25 +1,161 @@ -# HunyuanImage-3.0 Image-to-Text Inference +# HunyuanImage-3.0-Instruct -This example demonstrates how to run HunyuanImage-3.0 Image-to-Text with the vLLM-Omni. +## Set up -## Local CLI Usage +Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup. -Download the example image: +## Run examples + +**Note**: These examples work with the default configuration on **8x NVIDIA L40S (48GB)**. For different GPU setups, modify the stage configuration to adjust device allocation and memory utilization. + +Get into the hunyuan_image3 folder: + +```bash +cd examples/offline_inference/hunyuan_image3 +``` + +### Modality Control + +HunyuanImage-3.0-Instruct supports multiple modality modes. 
You can control the mode using the `--modality` argument: + +#### Text to Image (text2img) + +- **Pipeline**: Text → AR (CoT + latent tokens) → DiT (denoise) → VAE Decode → Image +- **Stages Used**: Stage 0 (AR) + Stage 1 (DiT) +- **KV Transfer**: AR sends KV cache to DiT for conditioned generation +- **Default Config**: `hunyuan_image3_t2i.yaml` + +```bash +python end2end.py --model tencent/HunyuanImage-3.0-Instruct \ + --modality text2img \ + --prompts "A cute cat sitting on a windowsill watching the sunset" +``` + +#### Image to Image (img2img) + +- **Pipeline**: Image + Text → AR (CoT + recaption + latent) → DiT → Edited Image +- **Stages Used**: Stage 0 (AR) + Stage 1 (DiT) +- **KV Transfer**: AR sends KV cache to DiT +- **Default Config**: `hunyuan_image3_it2i.yaml` + +```bash +python end2end.py --model tencent/HunyuanImage-3.0-Instruct \ + --modality img2img \ + --image-path /path/to/image.png \ + --prompts "Make the petals neon pink" +``` + +#### Image to Text (img2text) + +- **Pipeline**: Image + Question → AR → Text description +- **Stages Used**: Stage 0 (AR) only +- **Default Config**: `hunyuan_image3_i2t.yaml` ```bash -wget https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg +python end2end.py --model tencent/HunyuanImage-3.0-Instruct \ + --modality img2text \ + --image-path /path/to/image.jpg \ + --prompts "Describe the content of the picture." ``` -Run example: +#### Text to Text (text2text) + +- **Pipeline**: Text → AR → Text +- **Stages Used**: Stage 0 (AR) only +- **Default Config**: `hunyuan_image3_t2t.yaml` ```bash -python image_to_text.py \ - --image cherry_blossom.jpg \ - --prompt "<|startoftext|>You are an assistant that understands images and outputs text.Describe the content of the picture." +python end2end.py --model tencent/HunyuanImage-3.0-Instruct \ + --modality text2text \ + --prompts "What is the capital of France?" 
``` -Key arguments: +### Inference Steps & Guidance + +Control generation quality for image modalities: + +```bash +python end2end.py --modality text2img \ + --steps 50 \ + --guidance-scale 5.0 \ + --height 1024 --width 1024 \ + --prompts "A photo-realistic sunset over the ocean" +``` + +### Key Arguments + +#### 📌 Command Line Arguments (end2end.py) + +| Argument | Type | Default | Description | +| :--------------------- | :----- | :----------------------------------- | :----------------------------------------------------------- | +| `--model` | string | `tencent/HunyuanImage-3.0-Instruct` | Model path or name | +| `--modality` | choice | `text2img` | Modality: `text2img`, `img2img`, `img2text`, `text2text` | +| `--prompts` | list | `None` | Input text prompts | +| `--image-path` | string | `None` | Input image path (for `img2img`/`img2text`) | +| `--output` | string | `.` | Output directory for saved images | +| `--steps` | int | `50` | Number of inference steps | +| `--guidance-scale` | float | `5.0` | Classifier-free guidance scale | +| `--seed` | int | `42` | Random seed | +| `--height` | int | `1024` | Output image height | +| `--width` | int | `1024` | Output image width | +| `--bot-task` | string | auto | Override prompt task (e.g. `it2i_think`, `t2i_recaption`) | +| `--sys-type` | string | auto | Override system prompt type (e.g. 
`en_unified`, `en_vanilla`) | +| `--stage-configs-path` | string | auto | Custom stage config YAML path | +| `--enforce-eager` | flag | `False` | Disable torch.compile | +| `--init-timeout` | int | `300` | Initialization timeout (seconds) | + +------ + +#### ⚙️ Stage Configurations + +| Config YAML | Modality | Stages | GPUs | Description | +| :---------------------------------- | :-------- | :----- | :----- | :------------------------------------ | +| `hunyuan_image3_t2i.yaml` | text2img | 2 | 8 | T2I with AR→DiT, 4 GPUs each | +| `hunyuan_image3_it2i.yaml` | img2img | 2 | 8 | IT2I with AR→DiT, 4 GPUs each | +| `hunyuan_image3_i2t.yaml` | img2text | 1 | 4 | I2T (AR only) | +| `hunyuan_image3_t2t.yaml` | text2text | 1 | 4 | T2T (AR only) | +| `hunyuan_image3_t2i_2gpu.yaml` | text2img | 2 | 2 | T2I for 2-GPU setups | +| `hunyuan_image3_moe.yaml` | text2img | 2 | 8 | T2I with MoE AR→DiT KV reuse | +| `hunyuan_image3_moe_dit_2gpu_fp8.yaml` | text2img | 2 | 2 | T2I with FP8 quantization | + +------ + +## Using MoE Config + +The `hunyuan_image3_moe.yaml` config enables AR→DiT KV cache reuse with 8 GPUs (4 for AR + 4 for DiT). + +```bash +python end2end.py --model tencent/HunyuanImage-3.0-Instruct \ + --modality text2img \ + --stage-configs-path hunyuan_image3_moe.yaml \ + --prompts "A cute cat" +``` + +------ + +## Prompt Format + +HunyuanImage-3.0 uses a pretrain template format: + +``` +<|startoftext|>{system_prompt}{<img>}{trigger_tag}{user_prompt} +``` + +- `<img>`: Placeholder for each input image (auto-inserted by `build_prompt()` in `end2end.py`) +- Trigger tags: `<think>` (CoT), `<recaption>` (recaptioning) +- System prompt: Auto-selected based on task + +The `build_prompt()` helper in `end2end.py` handles this formatting automatically. + +------ + +## FAQ + +- **OOM errors**: Decrease `gpu_memory_utilization` in the YAML stage config, or use a smaller `max_num_batched_tokens`. +- **Custom image sizes**: Use the `--height` and `--width` flags with `--modality text2img` (multiples of 16 recommended); `img2img` takes its output size from the input image. -- `--model`: Model used. 
Default is: tencent/HunyuanImage-3.0-Instruct (Optional). -- `--image`: Path to input image (required). -- `--prompt`: Text description used to guide image understanding (required). +| Stage | VRAM (approx) | +| :---------------- | :------------------- | +| Stage 0 (AR) | ~15 GiB + KV Cache | +| Stage 1 (DiT) | ~30 GiB | +| Total (8-GPU) | ~45 GiB + KV Cache | diff --git a/examples/offline_inference/hunyuan_image3/end2end.py b/examples/offline_inference/hunyuan_image3/end2end.py new file mode 100644 index 00000000000..3c1ae386678 --- /dev/null +++ b/examples/offline_inference/hunyuan_image3/end2end.py @@ -0,0 +1,262 @@ +""" +HunyuanImage-3.0-Instruct unified end-to-end inference script. + +Supports all modalities through a single entry point: + - text2img: Text → AR → DiT → Image + - img2img: Text+Image → AR → DiT → Edited Image (IT2I) + - img2text: Image+Text → AR → Text description (I2T) + - text2text: Text → AR → Text (comprehension, no image) + +Usage: + python end2end.py --modality text2img --prompts "A cute cat" + python end2end.py --modality img2img --image-path input.png --prompts "Make it snowy" + python end2end.py --modality img2text --image-path input.png --prompts "Describe this image" +""" + +import argparse +import os + +from vllm_omni.diffusion.models.hunyuan_image3.system_prompt import ( + get_system_prompt, +) +from vllm_omni.entrypoints.omni import Omni +from vllm_omni.inputs.data import OmniPromptType + +# task → (sys_type, bot_task, trigger_tag) +_TASK_PRESETS: dict[str, tuple[str, str | None, str | None]] = { + "t2t": ("en_unified", None, None), + "i2t": ("en_unified", None, None), + "it2i_think": ("en_unified", "think", "<think>"), + "it2i_recaption": ("en_unified", "recaption", "<recaption>"), + "t2i_think": ("en_unified", "think", "<think>"), + "t2i_recaption": ("en_unified", "recaption", "<recaption>"), + "t2i_vanilla": ("en_vanilla", "image", None), +} + +# Modality → default task preset (key into _TASK_PRESETS) +_MODALITY_TASK_MAP = { + "text2img": "t2i_think", + "img2img": "it2i_think", + 
"img2text": "i2t", + "text2text": "t2t", +} + + +def build_prompt( + user_prompt: str, + task: str = "it2i_think", + sys_type: str | None = None, + custom_system_prompt: str | None = None, +) -> str: + """Build a HunyuanImage-3.0 prompt using pretrain template format.""" + if task not in _TASK_PRESETS: + raise ValueError(f"Unknown task {task!r}. Choose from: {sorted(_TASK_PRESETS)}") + + preset_sys_type, preset_bot_task, trigger_tag = _TASK_PRESETS[task] + effective_sys_type = sys_type or preset_sys_type + + system_prompt = get_system_prompt(effective_sys_type, preset_bot_task, custom_system_prompt) + sys_text = system_prompt.strip() if system_prompt else "" + + has_image_input = task.startswith("i2t") or task.startswith("it2i") + + parts = ["<|startoftext|>"] + if sys_text: + parts.append(sys_text) + if has_image_input: + parts.append("<img>") + if trigger_tag: + parts.append(trigger_tag) + parts.append(user_prompt) + + return "".join(parts) + + +# Modality → default stage config +_MODALITY_DEFAULT_CONFIG = { + "text2img": "hunyuan_image3_t2i.yaml", + "img2img": "hunyuan_image3_it2i.yaml", + "img2text": "hunyuan_image3_i2t.yaml", + "text2text": "hunyuan_image3_t2t.yaml", +} + + +def parse_args(): + parser = argparse.ArgumentParser(description="HunyuanImage-3.0-Instruct end-to-end inference.") + parser.add_argument( + "--model", + default="tencent/HunyuanImage-3.0-Instruct", + help="Model name or local path.", + ) + parser.add_argument( + "--modality", + default="text2img", + choices=["text2img", "img2img", "img2text", "text2text"], + help="Modality mode to control stage execution.", + ) + parser.add_argument("--prompts", nargs="+", default=None, help="Input text prompts.") + parser.add_argument( + "--image-path", + type=str, + default=None, + help="Path to input image (for img2img/img2text).", + ) + parser.add_argument( + "--output", + type=str, + default=".", + help="Output directory to save results.", + ) + + # Generation parameters + parser.add_argument("--steps", 
type=int, default=50, help="Number of inference steps.") + parser.add_argument("--guidance-scale", type=float, default=5.0, help="Classifier-free guidance scale.") + parser.add_argument("--seed", type=int, default=42, help="Random seed.") + parser.add_argument("--height", type=int, default=1024, help="Output image height.") + parser.add_argument("--width", type=int, default=1024, help="Output image width.") + + # Prompt configuration + parser.add_argument( + "--bot-task", + type=str, + default=None, + help="Override prompt task (e.g. it2i_think, t2i_recaption). Default: auto from modality.", + ) + parser.add_argument( + "--sys-type", + type=str, + default=None, + help="Override system prompt type (e.g. en_unified, en_vanilla).", + ) + + # Omni init args + parser.add_argument("--stage-configs-path", type=str, default=None, help="Custom stage config YAML path.") + parser.add_argument("--log-stats", action="store_true", default=False) + parser.add_argument("--init-timeout", type=int, default=300, help="Initialization timeout in seconds.") + parser.add_argument("--enforce-eager", action="store_true", help="Disable torch.compile.") + + return parser.parse_args() + + +def main(): + args = parse_args() + os.makedirs(args.output, exist_ok=True) + + # Determine task for prompt formatting + task = args.bot_task or _MODALITY_TASK_MAP[args.modality] + + # Determine stage config + stage_configs_path = args.stage_configs_path or _MODALITY_DEFAULT_CONFIG[args.modality] + + # Build Omni + omni_kwargs = { + "model": args.model, + "stage_configs_path": stage_configs_path, + "log_stats": args.log_stats, + "init_timeout": args.init_timeout, + "enforce_eager": args.enforce_eager, + } + if args.modality in ("text2img", "img2img"): + omni_kwargs["mode"] = "text-to-image" + + omni = Omni(**omni_kwargs) + + # Prepare prompts (fall back to a default when none are given) + prompts = args.prompts + if not prompts: + print("[Info] No prompts provided, using default.") + prompts = ["A cute cat"] + + # Load image if 
needed + input_image = None + if args.modality in ("img2img", "img2text"): + if not args.image_path or not os.path.exists(args.image_path): + raise ValueError(f"--image-path required for {args.modality}, got: {args.image_path}") + from PIL import Image + + input_image = Image.open(args.image_path).convert("RGB") + + # Format prompts + formatted_prompts: list[OmniPromptType] = [] + for p in prompts: + formatted_text = build_prompt(p, task=task, sys_type=args.sys_type) + + prompt_dict: dict = {"prompt": formatted_text} + + if args.modality == "text2img": + prompt_dict["modalities"] = ["image"] + elif args.modality == "img2img": + prompt_dict["modalities"] = ["image"] + prompt_dict["multi_modal_data"] = {"image": input_image} + prompt_dict["height"] = input_image.height + prompt_dict["width"] = input_image.width + elif args.modality == "img2text": + prompt_dict["modalities"] = ["text"] + prompt_dict["multi_modal_data"] = {"image": input_image} + elif args.modality == "text2text": + prompt_dict["modalities"] = ["text"] + + formatted_prompts.append(prompt_dict) + + # Build sampling params from defaults + params_list = list(omni.default_sampling_params_list) + + # Override diffusion params if applicable + from vllm_omni.inputs.data import OmniDiffusionSamplingParams + + for i, sp in enumerate(params_list): + if isinstance(sp, OmniDiffusionSamplingParams): + sp.num_inference_steps = args.steps + sp.guidance_scale = args.guidance_scale + if args.seed is not None: + sp.seed = args.seed + if args.modality in ("text2img",): + sp.height = args.height + sp.width = args.width + + # Print configuration + print(f"\n{'=' * 60}") + print("HunyuanImage-3.0 Generation Configuration:") + print(f" Model: {args.model}") + print(f" Modality: {args.modality}") + print(f" Stage config: {stage_configs_path}") + print(f" Num stages: {omni.num_stages}") + if args.modality in ("text2img", "img2img"): + print(f" Inference steps: {args.steps}") + print(f" Guidance scale: {args.guidance_scale}") + 
print(f" Seed: {args.seed}") + if args.modality == "text2img": + print(f" Output size: {args.width}x{args.height}") + if args.image_path: + print(f" Input image: {args.image_path}") + print(f" Prompts: {prompts}") + print(f"{'=' * 60}\n") + + # Generate + omni_outputs = list(omni.generate(prompts=formatted_prompts, sampling_params_list=params_list)) + + # Process outputs + img_idx = 0 + for req_output in omni_outputs: + # Text output (AR stage or text-only) + ro = getattr(req_output, "request_output", None) + if ro and getattr(ro, "outputs", None): + txt = "".join(getattr(o, "text", "") or "" for o in ro.outputs) + if txt: + print(f"[Output] Text:\n{txt}") + + # Image output (DiT stage) + images = getattr(req_output, "images", None) + if not images and ro and hasattr(ro, "images"): + images = ro.images + + if images: + for j, img in enumerate(images): + save_path = os.path.join(args.output, f"output_{img_idx}_{j}.png") + img.save(save_path) + print(f"[Output] Saved image to {save_path}") + img_idx += 1 + + +if __name__ == "__main__": + main() diff --git a/examples/offline_inference/hunyuan_image3/image_to_text.py b/examples/offline_inference/hunyuan_image3/image_to_text.py deleted file mode 100644 index d40134ac0a0..00000000000 --- a/examples/offline_inference/hunyuan_image3/image_to_text.py +++ /dev/null @@ -1,92 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - -import argparse -import os - -from PIL import Image - -from vllm_omni.entrypoints.omni import Omni - -""" -The tencent/HunyuanImage-3.0-Instruct base model uses the tencent/Hunyuan-A13B-Instruct backbone. 
It utilizes two tokenizer delimiter templates: - -1) Pretrained template (default for gen_text mode), which concatenates system, image - tokens, and user question WITHOUT role delimiters: -"<|startoftext|>{system_prompt}{image_tokens}{user_question}" - - Example (before image token expansion): -"<|startoftext|>You are an assistant that understands images and outputs text.Describe the content of the picture." - -2) Instruct template, which uses explicit role prefixes and separators. -""" - - -def parse_args() -> argparse.Namespace: - parser = argparse.ArgumentParser(description="Generate text from image using HunyuanImage-3.0-Instruct.") - parser.add_argument( - "--model", - default="tencent/HunyuanImage-3.0-Instruct", - help="Model name or local path.", - ) - parser.add_argument( - "--image", - type=str, - required=True, - help="Path to input image file (PNG, JPG, etc.).", - ) - parser.add_argument( - "--prompt", - type=str, - required=True, - help="Pretrain template prompt: <|startoftext|>{system}{question}", - ) - parser.add_argument( - "--enable-diffusion-pipeline-profiler", - action="store_true", - help="Enable diffusion pipeline profiler to display stage durations.", - ) - return parser.parse_args() - - -def load_image(image_path: str) -> Image.Image: - """Load an image from file path.""" - if not os.path.exists(image_path): - raise FileNotFoundError(f"Image file not found: {image_path}") - return Image.open(image_path).convert("RGB") - - -def main(args: argparse.Namespace) -> None: - omni = Omni( - model=args.model, - enable_diffusion_pipeline_profiler=args.enable_diffusion_pipeline_profiler, - mode="image-to-text", - ) - - prompt = "<|startoftext|>You are an assistant that understands images and outputs text." 
+ args.prompt - - prompt_dict = { - "prompt": prompt, - "modalities": ["text"], - } - - # Add image input if provided - if args.image: - if not os.path.exists(args.image): - raise FileNotFoundError(f"Input image not found: {args.image}") - - input_image = load_image(args.image) - prompt_dict["multi_modal_data"] = {"image": input_image} - - prompts = [prompt_dict] - omni_outputs = omni.generate(prompts=prompts) - - prompt_text = omni_outputs[0].request_output.prompt - generated_text = omni_outputs[0].request_output.outputs[0].text - print(f"Prompt: {prompt_text}") - print(f"Text: {generated_text}") - - -if __name__ == "__main__": - args = parse_args() - main(args) diff --git a/examples/offline_inference/hunyuan_image3/prompt_utils.py b/examples/offline_inference/hunyuan_image3/prompt_utils.py deleted file mode 100644 index a5ef8e15369..00000000000 --- a/examples/offline_inference/hunyuan_image3/prompt_utils.py +++ /dev/null @@ -1,88 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project -""" -Prompt construction utilities for HunyuanImage-3.0-Instruct examples. - -Wraps system_prompt.get_system_prompt() with task-aware presets so that -examples and tests don't need to manually concatenate system prompts, -, , and tags. 
- -Usage: - from prompt_utils import build_prompt - - # IT2I (image editing, think+recaption mode) - prompt = build_prompt("Make the petals neon pink", task="it2i_think") - - # I2T (image understanding) - prompt = build_prompt("Describe the content of the picture.", task="i2t") -""" - -from __future__ import annotations - -from vllm_omni.diffusion.models.hunyuan_image3.system_prompt import ( - get_system_prompt, -) - -# task → (sys_type, bot_task, trigger_tag) -# trigger_tag: "", "", or None -_TASK_PRESETS: dict[str, tuple[str, str | None, str | None]] = { - # Pure text generation (text → text, no image) - "t2t": ("en_unified", None, None), - # Image understanding (image → text) - "i2t": ("en_unified", None, None), - # Image editing (image+text → image), think+recaption mode - "it2i_think": ("en_unified", "think", ""), - # Image editing, recaption-only mode - "it2i_recaption": ("en_unified", "recaption", ""), - # Text-to-image, think mode - "t2i_think": ("en_unified", "think", ""), - # Text-to-image, recaption mode - "t2i_recaption": ("en_unified", "recaption", ""), - # Text-to-image, vanilla (no CoT) - "t2i_vanilla": ("en_vanilla", "image", None), -} - - -def build_prompt( - user_prompt: str, - task: str = "it2i_think", - sys_type: str | None = None, - custom_system_prompt: str | None = None, -) -> str: - """Build a complete HunyuanImage-3.0 prompt with auto-selected system - prompt and mode trigger tags. - - Args: - user_prompt: The user's raw instruction or question. - task: One of the preset task keys (see _TASK_PRESETS). - sys_type: Override the preset's sys_type for get_system_prompt(). - custom_system_prompt: Custom system prompt text (used when - sys_type="custom"). - - Returns: - Fully formatted prompt string ready for Omni.generate(). - """ - if task not in _TASK_PRESETS: - raise ValueError(f"Unknown task {task!r}. 
Choose from: {sorted(_TASK_PRESETS)}") - - preset_sys_type, preset_bot_task, trigger_tag = _TASK_PRESETS[task] - effective_sys_type = sys_type or preset_sys_type - - system_prompt = get_system_prompt(effective_sys_type, preset_bot_task, custom_system_prompt) - sys_text = system_prompt.strip() if system_prompt else "" - - has_image_input = task.startswith("i2t") or task.startswith("it2i") - - parts = ["<|startoftext|>"] - if sys_text: - parts.append(sys_text) - # Instruct conversation template: \n\nUser: ... \n\nAssistant: - parts.append("\n\nUser: ") - if has_image_input: - parts.append("") - parts.append(user_prompt) - parts.append("\n\nAssistant: ") - if trigger_tag: - parts.append(trigger_tag) - - return "".join(parts) diff --git a/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml new file mode 100644 index 00000000000..f0797c63270 --- /dev/null +++ b/vllm_omni/model_executor/stage_configs/hunyuan_image3_moe.yaml @@ -0,0 +1,96 @@ +# Stage config for running Hunyuan-Image3.0 with AR→DiT KV reuse. +# Stage 0: AR Model (vLLM implementation) +# Stage 1: DiT Model (diffusion) +# +# text-to-image flow: AR (stage 0) → KV transfer → DiT (stage 1) +# image-to-text flow: AR (stage 0) only +# +# Compared to hunyuan_image3_t2i.yaml, this config: +# 1. Enables both stages [0, 1] for text-to-image (AR prefill + DiT denoising) +# 2. Adds omni_kv_config to send/receive KV cache between stages + +# The following config has been verified on 8x L40S-48G GPU (4 for AR + 4 for DiT). 
+stage_args: + - stage_id: 0 + stage_type: llm # Use llm stage type for AR stages + runtime: + process: true # Run this stage in a separate process + devices: "0,1,2,3" # AR stage uses GPU 0-3 + engine_args: + model_stage: AR + max_num_seqs: 1 + model_arch: HunyuanImage3ForCausalMM + worker_cls: vllm_omni.worker.gpu_ar_worker.GPUARWorker + scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler + gpu_memory_utilization: 0.9 + enforce_eager: true # Now we only support eager mode + trust_remote_code: true + engine_output_type: latent + enable_prefix_caching: false + max_num_batched_tokens: 32768 + tensor_parallel_size: 4 + pipeline_parallel_size: 1 + hf_overrides: + rope_parameters: + mrope_section: [0, 32, 32] + rope_type: default + omni_kv_config: + need_send_cache: true + kv_transfer_criteria: + type: prefill_finished # Send KV cache after AR prefill completes + is_comprehension: true + final_output: true + final_output_type: text + default_sampling_params: + temperature: 0.0 + top_p: 1.0 + top_k: -1 + max_tokens: 2048 + seed: 42 + detokenize: True + repetition_penalty: 1.1 + - stage_id: 1 + stage_type: diffusion + runtime: + process: true + devices: "4,5,6,7" # DiT stage uses GPU 4-7 + max_batch_size: 1 + engine_args: + model_stage: diffusion + enforce_eager: true + distributed_executor_backend: "mp" + vae_use_slicing: false + vae_use_tiling: false + cache_backend: null + cache_config: null + enable_cache_dit_summary: false + omni_kv_config: + need_recv_cache: true # Receive AR KV cache from stage 0 + parallel_config: + pipeline_parallel_size: 1 + data_parallel_size: 1 + tensor_parallel_size: 4 + enable_expert_parallel: false + sequence_parallel_size: 1 + ulysses_degree: 1 + ring_degree: 1 + cfg_parallel_size: 1 + vae_patch_parallel_size: 1 + use_hsdp: false + hsdp_shard_size: -1 + hsdp_replicate_size: 1 + engine_input_source: [0] # Receive input (including KV) from stage 0 + final_output: true + final_output_type: image + +# Top-level runtime 
config: windows, edges, and connectors +runtime: + enabled: true + defaults: + window_size: -1 # Trigger downstream only after full upstream completion + max_inflight: 1 # Process serially within each stage + + edges: + - from: 0 + to: 1 + window_size: -1
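
Review note: the device split in `hunyuan_image3_moe.yaml` (GPUs 0-3 for AR, 4-7 for DiT, tensor parallel size 4 on each stage) is easy to break when adapting the config to other hardware. Below is a minimal sketch of a pre-flight check. It is not part of vLLM-Omni: `STAGES` is a hand-copied subset of the config, `check_device_layout` is a hypothetical helper, and the check assumes that tensor parallelism is the only parallel dimension in use (every other size in this config is 1) and that stages are not meant to share GPUs.

```python
# Hand-copied subset of hunyuan_image3_moe.yaml (assumption: a real check
# would load these values from the YAML file instead).
STAGES = [
    {"stage_id": 0, "devices": "0,1,2,3", "tensor_parallel_size": 4},
    {"stage_id": 1, "devices": "4,5,6,7", "tensor_parallel_size": 4},
]


def check_device_layout(stages):
    """Return a list of problems; an empty list means the layout is consistent.

    Hypothetical helper. Two conditions are checked:
      1. each stage lists as many devices as its tensor_parallel_size;
      2. no GPU id is claimed by more than one stage.
    """
    problems = []
    owner = {}  # device id -> stage_id that claimed it first
    for stage in stages:
        stage_id = stage["stage_id"]
        devices = [d.strip() for d in stage["devices"].split(",") if d.strip()]
        tp_size = stage["tensor_parallel_size"]
        if len(devices) != tp_size:
            problems.append(
                f"stage {stage_id}: {len(devices)} device(s) listed, "
                f"tensor_parallel_size is {tp_size}"
            )
        for dev in devices:
            if dev in owner:
                problems.append(
                    f"stage {stage_id}: GPU {dev} already used by stage {owner[dev]}"
                )
            else:
                owner[dev] = stage_id
    return problems


if __name__ == "__main__":
    issues = check_device_layout(STAGES)
    print("layout OK" if not issues else "\n".join(issues))
```

Returning a list of messages rather than raising on the first mismatch lets a user see every problem in one pass before paying the model initialization cost.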