
[Bugfix] Fix precedence between caller runtime args and default stage configs#2076

Merged
princepride merged 11 commits into vllm-project:main from xiaohajiayou:bugfix/hunyuan-ep-test-a100
Apr 9, 2026

Conversation

@xiaohajiayou
Contributor

@xiaohajiayou xiaohajiayou commented Mar 22, 2026

Purpose

Fixes #2075.

This PR fixes config precedence for stage-config-based HunyuanImage3 runs so that a caller-provided parallel_config remains authoritative over the stage-yaml parallel_config.

Without this change, the stage yaml can override runtime EP settings, which makes it unreliable to toggle between:

  • baseline: enable_expert_parallel=False
  • EP: enable_expert_parallel=True

This work came from investigating the HunyuanImage3 EP performance regression tracked in #2015.
This PR does not include that vllm-side kernel config change. Its scope is to make the HunyuanImage3 stage-config path reliably toggle baseline vs EP, so the regression can be reproduced and validated cleanly.
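For concreteness, here is a minimal sketch of the two configurations the test toggles between, using the caller-side parallel_config path discussed later in this thread. The DiffusionParallelConfig stand-in below is illustrative only, not the actual vllm-omni class definition:

```python
from dataclasses import dataclass

# Illustrative stand-in for vllm-omni's parallel config; field names follow
# the discussion in this PR, not a verified class definition.
@dataclass
class DiffusionParallelConfig:
    tensor_parallel_size: int = 1
    enable_expert_parallel: bool = False

# Baseline run: EP disabled.
baseline_cfg = DiffusionParallelConfig(tensor_parallel_size=4,
                                       enable_expert_parallel=False)

# EP run: EP enabled. This caller-side setting is what the default stage
# yaml silently overrode before this fix.
ep_cfg = DiffusionParallelConfig(tensor_parallel_size=4,
                                 enable_expert_parallel=True)
```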

Test command:

export CUDA_VISIBLE_DEVICES=3,5,6,7
pytest tests/e2e/offline_inference/test_expert_parallel.py -v -s

Test Result

The existing expert-parallel e2e test now runs successfully on 4x A100 with an explicit HunyuanImage3 stage config.

[enable_ep: True] 4 GPUs | baseline: 27912ms, ep: 28001ms, speedup: 1.00x
[enable_ep: True] diff: [mean=3.831369e-02, max=7.254902e-01], cos_sim: [mean=9.900795e-01, max=9.900795e-01], mse: 5.543310e-03

==========================================================================================
SUMMARY
==========================================================================================
Mode            GPUs   Size       Baseline     EP           Speedup    Status
------------------------------------------------------------------------------------------
A brown an      1      1024x1024  27912ms      28001ms        1.00x      PASS
==========================================================================================
PASSED


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2e6c3e14c


Collaborator

@lishunyang12 lishunyang12 left a comment


Thanks for the investigation on the EP regression — the benchmarking results are useful context.

However, I think the per-field override approach here works against the stage-config design. The stage yaml is meant to be the single source of truth for stage-level topology. Punching through enable_expert_parallel as a special case creates an inconsistent precedence model — callers would reasonably expect the same to work for tensor_parallel_size or other parallel fields, but it won't.

For toggling EP in testing/benchmarking, a separate stage yaml (e.g. hunyuan_image3_moe_dit_ep.yaml) would keep the config model clean and explicit.

The env-var overrides in the test also add a lot of surface area for what's essentially a single-model validation. If A100 support is needed, a dedicated test or fixture would be more maintainable than parameterizing the existing one with 5+ env vars.

Happy to discuss if you see a reason the yaml-per-config approach doesn't work for your use case.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Gates

Check        Status
DCO          ACTION_REQUIRED
pre-commit   ✅ SUCCESS
mergeable    ✅ MERGEABLE

BLOCKER: DCO check failing. Author needs to sign commits with git commit -s.


Mandatory Blocker Triage

Category   Status    Evidence/Gap
Tests      ✅ PASS   Test commands + results in PR description; existing test updated with env-based overrides
Docs       N/A       Internal test/utility changes only — no docs required
Perf       ✅ PASS   Benchmark data: baseline 27912ms vs EP 28001ms (1.00x speedup)
Accuracy   ✅ PASS   cos_sim=0.99, mse=5.5e-3, diff metrics provided
API        N/A       No API changes

Code Review

The fix is narrow and well-scoped:

  • vllm_omni/entrypoints/utils.py: Extracts enable_expert_parallel from caller config and overrides stage-yaml value
  • tests/e2e/offline_inference/test_expert_parallel.py: Adds env-based test configuration
  • tests/utils.py: Adds A100 marker support
  • pyproject.toml: Registers A100 marker

No inline blockers beyond DCO.


Action Required

  • Author: sign commits with git commit -s to fix DCO

Once DCO passes, this is ready for APPROVE.

@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from d732404 to 0e8a6ae on March 23, 2026 00:45
@xiaohajiayou
Contributor Author

Thanks for the investigation on the EP regression — the benchmarking results are useful context.

However, I think the per-field override approach here works against the stage-config design. The stage yaml is meant to be the single source of truth for stage-level topology. Punching through enable_expert_parallel as a special case creates an inconsistent precedence model — callers would reasonably expect the same to work for tensor_parallel_size or other parallel fields, but it won't.

For toggling EP in testing/benchmarking, a separate stage yaml (e.g. hunyuan_image3_moe_dit_ep.yaml) would keep the config model clean and explicit.

The env-var overrides in the test also add a lot of surface area for what's essentially a single-model validation. If A100 support is needed, a dedicated test or fixture would be more maintainable than parameterizing the existing one with 5+ env vars.

Happy to discuss if you see a reason the yaml-per-config approach doesn't work for your use case.

You’re right that making enable_expert_parallel a special-case override would make the overall precedence model inconsistent.

What I’d like to clarify is the expected relationship between the two configuration paths that already exist today:

  1. Specifying parallelism via runtime parallel_config, for example:
parallel_config = DiffusionParallelConfig(tensor_parallel_size=2)
omni = Omni(model="your-model-name", parallel_config=parallel_config)
  2. Specifying it via a stage config yaml, for example:
omni = Omni(model="your-model-name", stage_configs_path="/path/to/your/custom_bagel.yaml")

In test_expert_parallel.py, the test currently uses the first approach (passing parallel_config via Omni(...)), but in practice it gets overridden by the default stage config. This is the ambiguity we’re running into here.

So I’d like to confirm the intended design:

When the same topology-related fields (e.g. tensor_parallel_size, enable_expert_parallel, etc.) are provided through both paths—

  • via parallel_config at Omni(...) initialization, and
  • via stage config yaml (engine_args / runtime.devices)—

is there a defined precedence or merge rule that determines the final effective value?

Or is the recommended approach to treat stage config as the single source of truth for stage-level topology, and avoid configuring these fields through multiple paths at the same time?

If the latter is the intended model, I’m happy to update this test to use two separate yaml configs (EP vs non-EP), instead of overriding EP from the caller side at runtime.

@yenuo26
Collaborator

yenuo26 commented Mar 23, 2026

@congw729 do we need to add an A100 mark?

@congw729
Collaborator

@congw729 do we need to add an A100 mark?

We don't have A100 machines in our CI. Right now, the hardware mark is designed to mark which machine this test needs to run on.

@hsliuustc0106 hsliuustc0106 added the ready label (to trigger buildkite CI) on Mar 23, 2026
@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from e7b3064 to 811efb3 on March 23, 2026 15:15
@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch 2 times, most recently from 4decc32 to 2614afc on March 23, 2026 16:23
Collaborator

@lishunyang12 lishunyang12 left a comment


Now that the entrypoints/utils.py override is removed, what's left is a test refactor plus two new YAMLs for a test that isn't wired into any CI pipeline. The rename also drops the generic test_expert_parallel.py in favor of a HunyuanImage3-specific one.

Would it make more sense to fold the investigation findings into #2015 directly, and only open a PR when there's either a concrete fix or CI coverage for this test?

@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from 0509b8e to 919c35a on March 25, 2026 15:00
@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 25, 2026

Now that the entrypoints/utils.py override is removed, what's left is a test refactor plus two new YAMLs for a test that isn't wired into any CI pipeline. The rename also drops the generic test_expert_parallel.py in favor of a HunyuanImage3-specific one.

Would it make more sense to fold the investigation findings into #2015 directly, and only open a PR when there's either a concrete fix or CI coverage for this test?

I think the main issue here is not the Hunyuan-specific test itself, but the semantic conflict between bundled default stage configs and caller-provided runtime args. There should be a clear precedence model.

My reasoning is:

  1. We already have many offline / online examples and docs where runtime behavior is controlled explicitly via caller-provided arguments, such as parallel_config, TP-related settings, gpu_memory_utilization, enforce_eager, etc. However, with the current behavior, if a model has a bundled default stage config (such as the Hunyuan image model), overlapping fields in the YAML can silently override those caller-provided runtime args.
    For example:

    omni_kwargs = {
        "model": args.model,
        "enable_layerwise_offload": args.enable_layerwise_offload,
        "vae_use_slicing": args.vae_use_slicing,
        "vae_use_tiling": args.vae_use_tiling,
        "cache_backend": args.cache_backend,
        "cache_config": cache_config,
        "enable_cache_dit_summary": args.enable_cache_dit_summary,
        "parallel_config": parallel_config,
        "enforce_eager": args.enforce_eager,
        "enable_cpu_offload": args.enable_cpu_offload,
        "mode": "text-to-image",
        "enable_diffusion_pipeline_profiler": args.enable_diffusion_pipeline_profiler,
        **lora_args,
        **quant_kwargs,
    }
    if args.stage_configs_path:
        omni_kwargs["stage_configs_path"] = args.stage_configs_path
    if use_nextstep:
        # NextStep-1.1 requires explicit pipeline class
        omni_kwargs["model_class_name"] = "NextStep11Pipeline"
    omni = Omni(**omni_kwargs)

  2. test_expert_parallel.py is simply the first place where this issue becomes visible, because it combines both conditions: explicitly passing runtime args and resolving to a model with a bundled default stage config. As more bundled stage configs are introduced, this will not remain a one-off issue.

  3. This also seems inconsistent with the configuration model described in stage_config.py, where:

    • pipeline structure is defined by YAML
    • runtime parameters are expected to come from CLI / caller-provided args
      """
      Stage Configuration System for vLLM-Omni.
      Pipeline structure (stages, types, data-flow) is defined in per-model YAML
      files and is set by model developers at integration time.
      Runtime parameters (gpu_memory_utilization, tp_size, etc.) come from CLI.
      """

To make this precedence explicit, I introduced a prefer_stage_engine_args flag in load_stage_configs_from_yaml() (default True, meaning stage_arg.engine_args overrides caller-provided runtime args).

def load_stage_configs_from_yaml(
    config_path: str,
    base_engine_args: dict | None = None,
    prefer_stage_engine_args: bool = True,
) -> list:

With that, the current logic becomes:

  1. No caller-provided runtime args and no explicit stage_configs_path

    • prefer_stage_engine_args=True
    • resolve and use the bundled default stage config based on model name
  2. Caller explicitly provides runtime args, but does not explicitly provide stage_configs_path

    • prefer_stage_engine_args=False
    • caller-provided runtime args override overlapping fields from the bundled default stage config
  3. User explicitly provides stage_configs_path

    • prefer_stage_engine_args=True
    • the explicitly provided stage config is treated as the source of truth and overrides overlapping caller-provided runtime args

This precedence seems more consistent with the documented configuration model and makes the behavior easier to reason about. That said, I’m happy to adjust if there is a preferred precedence model.

@xiaohajiayou xiaohajiayou changed the title [Bugfix] Fix EP test override precedence for stage-config HunyuanImage3 runs and add A100 test support [Bugfix] Fix precedence between caller runtime args and default stage configs Mar 25, 2026
@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 25, 2026

is there a defined precedence or merge rule that determines the final effective value?

Or is the recommended approach to treat stage config as the single source of truth for stage-level topology, and avoid configuring these fields through multiple paths at the same time?

If the latter is the intended model, I’m happy to update this test to use two separate yaml configs (EP vs non-EP), instead of overriding EP from the caller side at runtime.

test_expert_parallel.py is actually a separate question.
In the earlier version, default stage-config resolution always picked the AR path instead of the diffusion path.

#1826 may have fixed that problem, but the precedence issue between caller-provided runtime args and the bundled default stage config remains.

This is also why my initial version relied on env-based switching, and the subsequent revision introduced two dedicated test YAML configs. At that stage, I was effectively working around two separate issues simultaneously: runtime argument precedence and the default stage-config resolution for Hunyuan.

What #1826 changed is that load_and_resolve_stage_configs() now goes through filter_stages(). As a result, when stage_configs_path is not explicitly provided, the Hunyuan default path first resolves to hunyuan_image_3_moe.yaml, and then, under the default mode=text-to-image, selects stage_id: 1, i.e. the stage_type: diffusion stage in that file.

- stage_id: 1
  stage_type: diffusion
  runtime:
    process: true
    devices: "0,1,2,3,4,5,6,7"
    max_batch_size: 1
  engine_args:
    model_stage: diffusion
    gpu_memory_utilization: 0.9
    enforce_eager: true
    engine_output_type: image
    distributed_executor_backend: "mp"
    enable_prefix_caching: false
    max_num_batched_tokens: 32768
    vae_use_slicing: false
    vae_use_tiling: false
    cache_backend: null
    cache_config: null
    enable_cache_dit_summary: false
    parallel_config:
      pipeline_parallel_size: 1
      data_parallel_size: 1
      tensor_parallel_size: 8
      enable_expert_parallel: false
      sequence_parallel_size: 1
      ulysses_degree: 1
      ring_degree: 1
      cfg_parallel_size: 1
      vae_patch_parallel_size: 1
      use_hsdp: false
      hsdp_shard_size: -1
      hsdp_replicate_size: 1

The problem is that #1826 still placed a substantial amount of runtime configuration directly into that default diffusion stage, including parallel_config. I checked other model stage configs, and they generally do not embed this kind of runtime parallel topology into the bundled default stage config. Because of that, although test_expert_parallel.py passes caller-side runtime args (such as the EP toggle / parallel config), under the current precedence model those values get overridden by the parallel_config baked into the default Hunyuan stage config. That is why this test exposes the issue and why --enable-expert-parallel stops taking effect on this path.

@xiaohajiayou
Contributor Author

@hsliuustc0106 @lishunyang12
Could you take a look?

@Bounty-hunter
Contributor

I also tried to fix this in #2289; we can discuss it.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 29, 2026

I also tried to fix this in #2289; we can discuss it.

I do not think we need to split omni and diffusion into two separate configuration flows, and I also do not think a separate diffusion_only flag is necessary.

After #1826, once a stage config is resolved, mode together with stage-id mapping is already sufficient to select the corresponding pipeline branch. Under the precedence model in this PR, both omni and diffusion behaviors can still be explained with the same four cases:

  1. No bundled default stage config can be resolved, and the caller does not explicitly provide stage_configs_path

    • Fall back to the CLI/runtime-based diffusion stage-config constructor
    • Behaviorally, this is equivalent to introducing a separate diffusion_only flag
      if not stage_configs:
          if default_stage_cfg_factory is not None:
              default_stage_cfg = default_stage_cfg_factory()
              stage_configs = create_config(default_stage_cfg)
  2. A bundled default stage config exists, no explicit stage_configs_path is provided, and no global runtime overrides are given

    • Resolve and use the bundled default stage config based on model identity
    • The config is used as-is
  3. A bundled default stage config exists, no explicit stage_configs_path is provided, and runtime overrides are given

    • Resolve and use the bundled default stage config
    • Caller-provided runtime args act as coarse-grained global overrides for overlapping fields in engine_args (e.g., enable_prefix_caching, enforce_eager)
    • This path is intended only for coarse-grained adjustments; if per-stage customization is needed, users should modify the default stage config or explicitly provide their own
  4. stage_configs_path is explicitly provided

    • The explicitly provided stage config is treated as the source of truth
    • Caller runtime args only supplement fields that are not explicitly defined in the stage config

Given this, introducing diffusion_only would have several drawbacks:

  1. Redundant semantics

    • If users want to run the diffusion branch, they can already select it via mode
    • Or explicitly provide a stage config via stage_configs_path
    • If neither is available, the current fallback already constructs a diffusion stage config
      → Therefore, diffusion_only is functionally overlapping with existing mechanisms
  2. Incomplete default diffusion construction

    • _create_default_diffusion_stage_cfg() is not a complete configuration (e.g., it does not cover fields like gpu_memory_utilization) and is currently only intended as a fallback
    • Overusing this path may lead to caller-provided runtime args not being applied as expected
  3. Reduced configuration reuse and increased user burden

    • It prevents reuse of existing model/pipeline configurations
    • Users would need to manually provide more diffusion-related runtime parameters
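The four cases above can be summarized as a small decision function. This is a sketch of the precedence model, not existing code; the return values just name the winning side and the implied prefer_stage_engine_args value:

```python
def resolve_precedence(has_bundled_default: bool,
                       caller_runtime_args: bool,
                       explicit_stage_path: bool) -> tuple:
    """Map the three inputs to (source of truth, prefer_stage_engine_args)."""
    if explicit_stage_path:
        # Case 4: the explicitly provided stage config is the source of truth.
        return ("stage_yaml", True)
    if not has_bundled_default:
        # Case 1: fall back to the runtime-based diffusion stage constructor.
        return ("runtime_fallback", None)
    if caller_runtime_args:
        # Case 3: caller runtime args override overlapping yaml fields.
        return ("caller_args", False)
    # Case 2: bundled default stage config used as-is.
    return ("stage_yaml", True)
```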

@Bounty-hunter
Contributor


Default config + CLI-args overwrite makes sense. However, some default config fields cannot be overridden correctly, e.g. devices = "0,1,2,3,4,5,6,7", which leads to errors. Can you test the #2282 case? Perhaps we need to remove the strictly constrained configurations that cannot be overridden from the default YAML.
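One way to realize the "remove strictly constrained configurations" idea is to strip such fields from the bundled default before any merge. A sketch, with the non-overridable key set and helper name chosen here purely for illustration:

```python
# Fields the runtime must own; "devices" is the example from this thread.
NON_OVERRIDABLE = {"devices"}

def sanitize_default_stage(stage_cfg: dict) -> dict:
    """Drop runtime fields that must never win over caller settings."""
    runtime = {k: v for k, v in stage_cfg.get("runtime", {}).items()
               if k not in NON_OVERRIDABLE}
    return {**stage_cfg, "runtime": runtime}

cfg = {"stage_id": 1,
       "runtime": {"devices": "0,1,2,3,4,5,6,7", "max_batch_size": 1}}
clean = sanitize_default_stage(cfg)
# clean["runtime"] == {"max_batch_size": 1}
```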

@Bounty-hunter Bounty-hunter mentioned this pull request Mar 30, 2026
@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from 41a7a0f to d9cf2be on March 31, 2026 06:45
@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 31, 2026

The fix has been completed and verified locally.

The current CI failure does not appear to be directly related to this change. Based on the logs, it seems to come from an AMD (ROCm) Qwen3-Omni test pipeline timeout, which may be related to the known CI issues tracked in #2340.

Let me know if I should further investigate from this PR side. @princepride @Bounty-hunter @hsliuustc0106

@skf-1999
Contributor

skf-1999 commented Apr 1, 2026

How did you get HunyuanImage 3.0 working on 4x A100? Both of my 4-GPU A100 setups hit OOM, so I'm forced to run on 8 cards.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 1, 2026

How did you get HunyuanImage 3.0 working on 4x A100? Both of my 4-GPU A100 setups hit OOM, so I'm forced to run on 8 cards.

Could you share more details about your A100 setup? Specifically:

  • What is the GPU memory size (40GB or 80GB)?
  • Do you see full memory utilization on each card before the OOM happens?

In my case, I’m running on 4× NVIDIA A100-SXM4-80GB, and the inference proceeds normally without OOM using the following command:

python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

@skf-1999
Contributor

skf-1999 commented Apr 1, 2026

How did you get HunyuanImage 3.0 working on 4x A100? Both of my 4-GPU A100 setups hit OOM, so I'm forced to run on 8 cards.

could you share more details about your A100 setup? Specifically:

  • What is the GPU memory size (40GB or 80GB)?
  • Do you see full memory utilization on each card before the OOM happens?

In my case, I’m running on 4× NVIDIA A100-SXM4-80GB, and the inference proceeds normally without OOM using the following command:

python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

I'm on two 8x NVIDIA A100-SXM4-80GB machines, both hit OOM. Right before crashing, memory usage was 62914MiB / 81920MiB per GPU (~77%).

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 1, 2026

I'm on two 8x NVIDIA A100-SXM4-80GB machines, both hit OOM. Right before crashing, memory usage was 62914MiB / 81920MiB per GPU (~77%).

Just to clarify, are you seeing this OOM issue on top of my PR, or does it also happen on the current main branch?

On my side, the same setup (4× A100-SXM4-80GB, TP=4, 50 steps) runs normally without OOM. During local testing, each GPU typically starts from around ~5GB usage and ramps up to roughly ~60GB per card during inference.

Given your logs show ~62GB before OOM, this feels a bit unexpected.

One thing you might want to try is explicitly selecting 4 GPUs with more available memory, for example:

CUDA_VISIBLE_DEVICES=1,3,6,7 python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

@skf-1999
Contributor

skf-1999 commented Apr 1, 2026

I'm on two 8x NVIDIA A100-SXM4-80GB machines, both hit OOM. Right before crashing, memory usage was 62914MiB / 81920MiB per GPU (~77%).

Just to clarify, are you seeing this OOM issue on top of my PR, or does it also happen on the current main branch?

On my side, the same setup (4× A100-SXM4-80GB, TP=4, 50 steps) runs normally without OOM. During local testing, each GPU typically starts from around ~5GB usage and ramps up to roughly ~50GB per card during inference.

Given your logs show ~62GB before OOM, this feels a bit unexpected.

One thing you might want to try is explicitly selecting 4 GPUs with more available memory, for example:

CUDA_VISIBLE_DEVICES=1,3,6,7 python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

I encountered this issue in this PR, but it's not specific to this PR—the main branch also exhibits this behavior. All GPUs have 81920MiB of VRAM and are completely idle (no other processes running) when not executing the program. In the 8-GPU setup, each card's VRAM reaches 69886MiB.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 1, 2026

I encountered this issue in this PR, but it's not specific to this PR—the main branch also exhibits this behavior. All GPUs have 81920MiB of VRAM and are completely idle (no other processes running) when not executing the program. In the 8-GPU setup, each card's VRAM reaches 69886MiB.

Thanks, this is helpful. If the same OOM also happens on main, then it’s likely not introduced by this PR, but rather something related to the environment or configuration.

For context, HunyuanImage-3.0 is around 83B parameters (BF16/F32). On my side, we’re able to run the standard offline inference setup on 4× A100-SXM4-80GB without hitting OOM, so in principle this setup should be sufficient for inference.

If your goal is to validate the fix in this PR, one possible approach is to explicitly set the default stage config to use 4 GPUs (TP=4), and then launch the job on an 8-GPU node with --tensor-parallel-size 8. This way, you can more directly verify whether the behavior addressed in this PR is working as expected.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 3, 2026

I managed to reproduce the setup on another 4× A800 (80GB) server today to isolate the variable. The inference ran successfully without triggering OOM.
It might be an issue with your environment. @skf-1999

Here is the full log and the nvidia-smi snapshot right before the process finished:

full log
(vllm-omni) root@autodl-container-2201459dc8-f8f44d5f:~/vllm-omni# cd /root/vllm-omni
source /root/vllm-omni/.venv/bin/activate
(vllm-omni) root@autodl-container-2201459dc8-f8f44d5f:~/vllm-omni# nvidia-smi
Fri Apr  3 18:04:34 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800 80GB PCIe          On  |   00000000:4F:00.0 Off |                  Off |
| N/A   48C    P0            132W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          On  |   00000000:56:00.0 Off |                  Off |
| N/A   47C    P0            199W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A800 80GB PCIe          On  |   00000000:57:00.0 Off |                  Off |
| N/A   48C    P0            251W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A800 80GB PCIe          On  |   00000000:D5:00.0 Off |                  Off |
| N/A   52C    P0            122W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+


python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0 \
  --enforce-eager
INFO 04-03 18:02:53 [omni_base.py:93] [Omni] Initializing with model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0
INFO 04-03 18:02:53 [async_omni_engine.py:216] [AsyncOmniEngine] Initializing with model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-03 18:02:53 [config.py:437] Replacing legacy 'type' key with 'rope_type'
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-03 18:02:53 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:02:53 [async_omni_engine.py:248] [AsyncOmniEngine] Launching Orchestrator thread with 1 stages
INFO 04-03 18:02:53 [stage_init_utils.py:207] [stage_init] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
INFO 04-03 18:02:53 [initialization.py:270] Loaded OmniTransferConfig with 0 connector configurations
INFO 04-03 18:02:53 [async_omni_engine.py:466] [AsyncOmniEngine] Initializing stage 0
INFO 04-03 18:02:53 [stage_init_utils.py:222] [stage_init] Stage-0 set runtime devices: 0,1,2,3,4,5,6,7
INFO 04-03 18:02:54 [multiproc_executor.py:99] Starting server...
INFO 04-03 18:03:01 [diffusion_worker.py:396] Worker 0 created result MessageQueue
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 0: Initialized device and distributed environment.
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 1: Initialized device and distributed environment.
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 2: Initialized device and distributed environment.
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 3: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 1: sp_group=[1], ulysses_group=[1], ring_group=[1]
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 2: sp_group=[2], ulysses_group=[2], ring_group=[2]
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 3: sp_group=[3], ulysses_group=[3], ring_group=[3]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [unquantized.py:186] Using TRITON backend for Unquantized MoE
Multi-thread loading shards:   0% Completed | 0/32 [00:00<?, ?it/s]
Multi-thread loading shards:   3% Completed | 1/32 [00:00<00:12,  2.46it/s]
Multi-thread loading shards:   6% Completed | 2/32 [00:00<00:14,  2.11it/s]
Multi-thread loading shards:   9% Completed | 3/32 [00:01<00:12,  2.24it/s]
Multi-thread loading shards:  12% Completed | 4/32 [00:01<00:12,  2.27it/s]
Multi-thread loading shards:  16% Completed | 5/32 [00:02<00:11,  2.39it/s]
Multi-thread loading shards:  19% Completed | 6/32 [00:02<00:10,  2.48it/s]
Multi-thread loading shards:  22% Completed | 7/32 [00:02<00:09,  2.54it/s]
Multi-thread loading shards:  25% Completed | 8/32 [00:03<00:09,  2.55it/s]
Multi-thread loading shards:  28% Completed | 9/32 [00:03<00:08,  2.58it/s]
Multi-thread loading shards:  31% Completed | 10/32 [00:04<00:08,  2.62it/s]
Multi-thread loading shards:  34% Completed | 11/32 [00:04<00:08,  2.62it/s]
Multi-thread loading shards:  38% Completed | 12/32 [00:04<00:07,  2.66it/s]
Multi-thread loading shards:  41% Completed | 13/32 [00:05<00:06,  2.73it/s]
Multi-thread loading shards:  44% Completed | 14/32 [00:05<00:06,  2.78it/s]
Multi-thread loading shards:  47% Completed | 15/32 [00:05<00:06,  2.82it/s]
Multi-thread loading shards:  50% Completed | 16/32 [00:06<00:05,  2.87it/s]
Multi-thread loading shards:  53% Completed | 17/32 [00:06<00:05,  2.92it/s]
Multi-thread loading shards:  56% Completed | 18/32 [00:06<00:04,  2.98it/s]
Multi-thread loading shards:  59% Completed | 19/32 [00:07<00:04,  2.93it/s]
Multi-thread loading shards:  62% Completed | 20/32 [00:07<00:03,  3.01it/s]
Multi-thread loading shards:  66% Completed | 21/32 [00:07<00:03,  3.05it/s]
Multi-thread loading shards:  69% Completed | 22/32 [00:08<00:03,  3.09it/s]
Multi-thread loading shards:  72% Completed | 23/32 [00:08<00:02,  3.10it/s]
Multi-thread loading shards:  75% Completed | 24/32 [00:08<00:02,  3.00it/s]
Multi-thread loading shards:  78% Completed | 25/32 [00:09<00:02,  2.95it/s]
Multi-thread loading shards:  81% Completed | 26/32 [00:09<00:02,  2.93it/s]
Multi-thread loading shards:  84% Completed | 27/32 [00:09<00:01,  2.89it/s]
Multi-thread loading shards:  88% Completed | 28/32 [00:10<00:01,  2.87it/s]
Multi-thread loading shards:  91% Completed | 29/32 [00:10<00:01,  2.91it/s]
Multi-thread loading shards:  94% Completed | 30/32 [00:11<00:00,  2.17it/s]
Multi-thread loading shards:  97% Completed | 31/32 [00:11<00:00,  2.18it/s]
Multi-thread loading shards: 100% Completed | 32/32 [00:12<00:00,  2.04it/s]
Multi-thread loading shards: 100% Completed | 32/32 [00:12<00:00,  2.61it/s]

INFO 04-03 18:03:16 [diffusers_loader.py:321] Loading weights took 12.36 seconds
INFO 04-03 18:03:17 [diffusers_loader.py:321] Loading weights took 12.60 seconds
INFO 04-03 18:03:17 [diffusers_loader.py:321] Loading weights took 12.84 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 14.999656 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:17 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:159] Worker 0: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:17 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:17 [diffusion_worker.py:98] Worker 0: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:534] Worker 0: Scheduler loop started.
INFO 04-03 18:03:17 [diffusion_worker.py:457] Worker 0 ready to receive requests via shared memory
INFO 04-03 18:03:17 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 15.283239 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:17 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:159] Worker 3: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:17 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:3, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:17 [diffusion_worker.py:98] Worker 3: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:534] Worker 3: Scheduler loop started.
INFO 04-03 18:03:17 [diffusion_worker.py:457] Worker 3 ready to receive requests via shared memory
INFO 04-03 18:03:17 [diffusers_loader.py:321] Loading weights took 13.34 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 15.439166 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:17 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:159] Worker 1: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:17 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:1, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:17 [diffusion_worker.py:98] Worker 1: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:534] Worker 1: Scheduler loop started.
INFO 04-03 18:03:17 [diffusion_worker.py:457] Worker 1 ready to receive requests via shared memory
INFO 04-03 18:03:18 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 15.899292 seconds
INFO 04-03 18:03:18 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:18 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:18 [diffusion_worker.py:159] Worker 2: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:18 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:2, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:18 [diffusion_worker.py:98] Worker 2: Initialization complete.
INFO 04-03 18:03:18 [diffusion_worker.py:534] Worker 2: Scheduler loop started.
INFO 04-03 18:03:18 [diffusion_worker.py:457] Worker 2 ready to receive requests via shared memory
INFO 04-03 18:03:18 [diffusion_engine.py:378] dummy run to warm up the model
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 04-03 18:03:18 [kv_transfer_manager.py:381] No connector available for receiving KV cache
  0%|                                                                                    | 0/1 [00:00<?, ?it/s]WARNING 04-03 18:03:20 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /root/vllm-omni/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_A800_80GB_PCIe.json
100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.10s/it]
INFO 04-03 18:03:23 [diffusion_model_runner.py:212] Peak GPU memory (this request): 63.09 GB reserved, 51.37 GB allocated, 11.72 GB pool overhead (18.6%)
INFO 04-03 18:03:23 [async_omni_diffusion.py:154] AsyncOmniDiffusion initialized with model: /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0, batch_size: 1
INFO 04-03 18:03:23 [stage_diffusion_client.py:54] [StageDiffusionClient] Stage-1 initialized (batch_size=1)
INFO 04-03 18:03:23 [async_omni_engine.py:496] [AsyncOmniEngine] Stage 0 initialized (diffusion, batch_size=1)
INFO 04-03 18:03:23 [orchestrator.py:158] [Orchestrator] Starting event loop
INFO 04-03 18:03:23 [async_omni_engine.py:290] [AsyncOmniEngine] Orchestrator ready with 1 stages
INFO 04-03 18:03:23 [omni_base.py:106] [Omni] AsyncOmniEngine initialized in 30.07 seconds
INFO 04-03 18:03:23 [omni_base.py:121] [Omni] Initialized with 1 stages for model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0

============================================================
Generation Configuration:
  Model: /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0
  Inference steps: 50
  Cache backend: None (no acceleration)
  Quantization: None (BF16)
  Parallel configuration: tensor_parallel_size=4, ulysses_degree=1, ulysses_mode=strict, ring_degree=1, cfg_parallel_size=1, vae_patch_parallel_size=1, enable_expert_parallel=False.
  CPU offload: False
  Image size: 1024x1024
============================================================

INFO 04-03 18:03:23 [orchestrator.py:584] [Orchestrator] _handle_add_request: stage=0 req=0_deb5b338-3b60-48f7-a1f1-666e186c320f prompt_type=dict original_prompt_type=dict final_stage=0 num_sampling_params=1
Processed prompts:   0%|                                                                 | 0/1 [00:00<?, ?it/s]INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 04-03 18:03:23 [kv_transfer_manager.py:381] No connector available for receiving KV cache
INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
 80%|███████████████████████████████████████████████████████████▏              | 40/50 [01:00<00:14,  1.50s/it]INFO 04-03 18:04:24 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
100%|██████████████████████████████████████████████████████████████████████████| 50/50 [01:15<00:00,  1.50s/it]
INFO 04-03 18:04:39 [diffusion_model_runner.py:212] Peak GPU memory (this request): 63.10 GB reserved, 51.41 GB allocated, 11.69 GB pool overhead (18.5%)
INFO 04-03 18:04:39 [diffusion_engine.py:103] Generation completed successfully.
INFO 04-03 18:04:39 [diffusion_engine.py:136] Post-processing completed in 0.0000 seconds
INFO 04-03 18:04:39 [diffusion_engine.py:139] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=76300.51 ms, postprocess=0.00 ms, total=76300.69 ms
Processed prompts: 100%|█████████████████████████████████████████████████████████| 1/1 [01:16<00:00, 76.30s/it]INFO 04-03 18:04:39 [omni_base.py:162] [Summary] {}
Processed prompts: 100%|█████████████████████████████████████████████████████████| 1/1 [01:16<00:00, 76.30s/it]
Total generation time: 76.3049 seconds (76304.87 ms)
INFO 04-03 18:04:39 [text_to_image.py:440] Outputs: [OmniRequestOutput(request_id='0_deb5b338-3b60-48f7-a1f1-666e186c320f', finished=True, stage_id=0, final_output_type='image', request_output=OmniRequestOutput(request_id='0_deb5b338-3b60-48f7-a1f1-666e186c320f', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt={'prompt': 'A brown and white dog is running on the grass', 'negative_prompt': None}, latents=None, metrics={'preprocess_time_ms': 0.0, 'diffusion_engine_exec_time_ms': 76300.72746798396, 'diffusion_engine_total_time_ms': 76300.50529167056, 'image_num': 1, 'resolution': 640, 'postprocess_time_ms': 0.0013150274753570557}, multimodal_output={}, custom_output={}, stage_durations={}, peak_memory_mb=64610.0), images=[1 PIL Images], prompt=None, latents=None, metrics={}, multimodal_output={}, custom_output={}, stage_durations={}, peak_memory_mb=64610.0)]
Saved generated image to output_image_latest.png
INFO 04-03 18:04:40 [async_omni_engine.py:1133] [AsyncOmniEngine] Shutting down Orchestrator
INFO 04-03 18:04:40 [orchestrator.py:210] [Orchestrator] Received shutdown signal
INFO 04-03 18:04:40 [orchestrator.py:820] [Orchestrator] Shutting down all stages
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 0: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 2: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 3: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 1: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 0: Shutdown complete.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 3: Shutdown complete.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 2: Shutdown complete.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 1: Shutdown complete.
INFO 04-03 18:04:42 [async_omni_diffusion.py:365] AsyncOmniDiffusion closed
INFO 04-03 18:04:42 [orchestrator.py:824] [Orchestrator] Stage 0 shut down
(vllm-omni) root@autodl-container-2201459dc8-f8f44d5f:~/vllm-omni# 
...

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 3, 2026

  • This PR fixes the override-priority issue between engine args and YAML config.
  • The runtime devices field in the YAML has also been refined to support logical device mapping.

Could you please take a look? If everything looks good, could you merge it first? @princepride @hsliuustc0106 @lishunyang12
It seems that the work in #2264 and #2185 both depend on this PR being merged.

Collaborator

@lishunyang12 lishunyang12 left a comment


LGTM

@lishunyang12 lishunyang12 enabled auto-merge (squash) April 6, 2026 02:19
@xiaohajiayou
Contributor Author

Hi @hsliuustc0106, hope you're having a good week. I've addressed all your comments and the PR is ready for re-review. The merge is currently blocked pending your approval.
Is there anything else you'd like me to adjust or refine? I'm happy to make further changes if needed.

@princepride
Collaborator

@hsliuustc0106 PTAL

@xiaohajiayou
Contributor Author

If everything looks good, could we merge this PR? @lishunyang12

@princepride princepride merged commit 2d98013 into vllm-project:main Apr 9, 2026
8 checks passed
stage_configs = load_stage_configs_from_yaml(
config_path=stage_config_path,
base_engine_args=base_engine_args,
prefer_stage_engine_args=False,
Contributor


The override behavior from CLI args to the YAML is failing some nightly tests: https://buildkite.com/vllm/vllm-omni/builds/6216/steps/canvas?sid=019d71f7-fe1d-4c22-a5df-abb9425a9d81
Maybe we could change the False to True to align with the old behavior.
@xiaohajiayou @hsliuustc0106 @lishunyang12 @princepride

Contributor Author


For models with a built-in default stage config, we merge caller-provided engine args into the stage config.

The problem is that parser-based entrypoints were previously passing the full parsed CLI namespace into stage config resolution. That namespace contains both:

  • explicitly provided CLI args
  • parser default values

As a result, fields the user never explicitly passed could still participate in the config merge.

One concrete example is distributed_executor_backend:

  • the model's default stage config may set:
    • distributed_executor_backend: "mp"
  • but the upstream vLLM parser can provide:
    • distributed_executor_backend = None
  • if this full parsed args object is treated as override input, the YAML value "mp" can be overwritten by None

That is why the executor eventually sees:

Unknown distributed executor backend: None

I attempt to fix this in #2655.
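The precedence rule described above can be sketched as follows (a minimal illustration, not the actual vllm-omni API: the helper name `merge_engine_args` and the config dicts are assumptions). The key point is that a parser default of `None` must not clobber a value set in the stage YAML:

```python
# Hypothetical sketch of the intended merge precedence: explicitly provided
# caller args override stage-yaml values, but parser defaults that resolved
# to None are treated as "not set" and are skipped.

def merge_engine_args(yaml_config: dict, caller_args: dict) -> dict:
    """Overlay caller args on the yaml config, ignoring unset (None) values."""
    merged = dict(yaml_config)
    for key, value in caller_args.items():
        if value is not None:  # None means the user never passed this flag
            merged[key] = value
    return merged

yaml_config = {"distributed_executor_backend": "mp", "tensor_parallel_size": 2}
# Full parsed namespace: the user set TP=4 but never touched the executor
# backend, so the parser default (None) leaks into the dict.
caller_args = {"distributed_executor_backend": None, "tensor_parallel_size": 4}

merged = merge_engine_args(yaml_config, caller_args)
print(merged)  # {'distributed_executor_backend': 'mp', 'tensor_parallel_size': 4}
```

Without the `None` check, the merge would produce `distributed_executor_backend=None`, which is exactly the "Unknown distributed executor backend: None" failure mode above.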

Contributor


Thank you for your response! I believe one effective solution would be to manage the argument parser’s default values within the vllm-omni diffusion engine/worker. Alternatively, we could create our own argument parser from scratch.
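One standard way to realize "manage the parser's default values ourselves" (a sketch of the idea, not the vllm-omni implementation) is to use `argparse.SUPPRESS` as the default: flags the user never passed then simply do not appear in the parsed namespace, so they cannot override stage-yaml values during the merge. The flag names below are illustrative:

```python
# With default=argparse.SUPPRESS, unspecified flags are absent from the
# namespace entirely, instead of showing up as None.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--tensor-parallel-size", type=int, default=argparse.SUPPRESS)
parser.add_argument("--distributed-executor-backend", default=argparse.SUPPRESS)

args = parser.parse_args(["--tensor-parallel-size", "4"])
explicit = vars(args)
print(explicit)  # {'tensor_parallel_size': 4} -- no executor-backend key at all
```

The merge step can then safely treat every key in `explicit` as user intent.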

vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
… configs (vllm-project#2076)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Sy0307 pushed a commit to Sy0307/vllm-omni that referenced this pull request Apr 10, 2026
… configs (vllm-project#2076)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026
… configs (vllm-project#2076)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: HunyuanImage3 EP regression

9 participants