
[Bugfix] Fix precedence between caller runtime args and default stage configs#2076

Merged
princepride merged 11 commits into vllm-project:main from xiaohajiayou:bugfix/hunyuan-ep-test-a100
Apr 9, 2026

Conversation

@xiaohajiayou
Contributor

@xiaohajiayou xiaohajiayou commented Mar 22, 2026

Purpose

Fixes #2075.

This PR fixes config precedence for stage-config-based HunyuanImage3 runs so that a caller-provided parallel_config remains authoritative over the stage-yaml parallel_config.

Without this change, the stage yaml can override runtime EP settings, which makes it unreliable to toggle between:

  • baseline: enable_expert_parallel=False
  • EP: enable_expert_parallel=True

This work came from investigating the HunyuanImage3 EP performance regression tracked in #2015.
This PR does not include that vllm-side kernel config change. Its scope is to make the HunyuanImage3 stage-config path reliably toggle baseline vs EP, so the regression can be reproduced and validated cleanly.
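For concreteness, here is a minimal sketch of the two configurations the test toggles between, using the caller-side parallel_config path discussed later in this thread. The DiffusionParallelConfig stand-in below is illustrative only, not the actual vllm-omni class definition:

```python
from dataclasses import dataclass

# Illustrative stand-in for vllm-omni's parallel config; field names follow
# the discussion in this PR, not a verified class definition.
@dataclass
class DiffusionParallelConfig:
    tensor_parallel_size: int = 1
    enable_expert_parallel: bool = False

# Baseline run: EP disabled.
baseline_cfg = DiffusionParallelConfig(tensor_parallel_size=4,
                                       enable_expert_parallel=False)

# EP run: EP enabled. This caller-side setting is what the default stage
# yaml silently overrode before this fix.
ep_cfg = DiffusionParallelConfig(tensor_parallel_size=4,
                                 enable_expert_parallel=True)
```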

Test command:

export CUDA_VISIBLE_DEVICES=3,5,6,7
pytest tests/e2e/offline_inference/test_expert_parallel.py -v -s

Test Result

The existing expert-parallel e2e test now runs successfully on 4x A100 with an explicit HunyuanImage3 stage config.

[enable_ep: True] 4 GPUs | baseline: 27912ms, ep: 28001ms, speedup: 1.00x
[enable_ep: True] diff: [mean=3.831369e-02, max=7.254902e-01], cos_sim: [mean=9.900795e-01, max=9.900795e-01], mse: 5.543310e-03

==========================================================================================
SUMMARY
==========================================================================================
Mode            GPUs   Size       Baseline     EP           Speedup    Status
------------------------------------------------------------------------------------------
A brown an      1      1024x1024  27912ms      28001ms        1.00x      PASS
==========================================================================================
PASSED


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2e6c3e14c


Collaborator

@lishunyang12 lishunyang12 left a comment


Thanks for the investigation on the EP regression — the benchmarking results are useful context.

However, I think the per-field override approach here works against the stage-config design. The stage yaml is meant to be the single source of truth for stage-level topology. Punching through enable_expert_parallel as a special case creates an inconsistent precedence model — callers would reasonably expect the same to work for tensor_parallel_size or other parallel fields, but it won't.

For toggling EP in testing/benchmarking, a separate stage yaml (e.g. hunyuan_image3_moe_dit_ep.yaml) would keep the config model clean and explicit.

The env-var overrides in the test also add a lot of surface area for what's essentially a single-model validation. If A100 support is needed, a dedicated test or fixture would be more maintainable than parameterizing the existing one with 5+ env vars.

Happy to discuss if you see a reason the yaml-per-config approach doesn't work for your use case.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Gates

Check        Status
DCO          ACTION_REQUIRED
pre-commit   ✅ SUCCESS
mergeable    ✅ MERGEABLE

BLOCKER: DCO check failing. Author needs to sign commits with git commit -s.


Mandatory Blocker Triage

Category   Status    Evidence/Gap
Tests      ✅ PASS   Test commands + results in PR description; existing test updated with env-based overrides
Docs       N/A       Internal test/utility changes only — no docs required
Perf       ✅ PASS   Benchmark data: baseline 27912ms vs EP 28001ms (1.00x speedup)
Accuracy   ✅ PASS   cos_sim=0.99, mse=5.5e-3, diff metrics provided
API        N/A       No API changes

Code Review

The fix is narrow and well-scoped:

  • vllm_omni/entrypoints/utils.py: Extracts enable_expert_parallel from caller config and overrides stage-yaml value
  • tests/e2e/offline_inference/test_expert_parallel.py: Adds env-based test configuration
  • tests/utils.py: Adds A100 marker support
  • pyproject.toml: Registers A100 marker

No inline blockers beyond DCO.


Action Required

  • Author: sign commits with git commit -s to fix DCO

Once DCO passes, this is ready for APPROVE.

@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from d732404 to 0e8a6ae on March 23, 2026 00:45
@xiaohajiayou
Contributor Author

Thanks for the investigation on the EP regression — the benchmarking results are useful context.

However, I think the per-field override approach here works against the stage-config design. The stage yaml is meant to be the single source of truth for stage-level topology. Punching through enable_expert_parallel as a special case creates an inconsistent precedence model — callers would reasonably expect the same to work for tensor_parallel_size or other parallel fields, but it won't.

For toggling EP in testing/benchmarking, a separate stage yaml (e.g. hunyuan_image3_moe_dit_ep.yaml) would keep the config model clean and explicit.

The env-var overrides in the test also add a lot of surface area for what's essentially a single-model validation. If A100 support is needed, a dedicated test or fixture would be more maintainable than parameterizing the existing one with 5+ env vars.

Happy to discuss if you see a reason the yaml-per-config approach doesn't work for your use case.

You’re right that making enable_expert_parallel a special-case override would make the overall precedence model inconsistent.

What I’d like to clarify is the expected relationship between the two configuration paths that already exist today:

  1. Specifying parallelism via runtime parallel_config, for example:
parallel_config = DiffusionParallelConfig(tensor_parallel_size=2)
omni = Omni(model="your-model-name", parallel_config=parallel_config)
  2. Specifying it via a stage config yaml, for example:
omni = Omni(model="your-model-name", stage_configs_path="/path/to/your/custom_bagel.yaml")

In test_expert_parallel.py, the test currently uses the first approach (passing parallel_config via Omni(...)), but in practice it gets overridden by the default stage config. This is the ambiguity we’re running into here.

So I’d like to confirm the intended design:

When the same topology-related fields (e.g. tensor_parallel_size, enable_expert_parallel, etc.) are provided through both paths—

  • via parallel_config at Omni(...) initialization, and
  • via stage config yaml (engine_args / runtime.devices)—

is there a defined precedence or merge rule that determines the final effective value?

Or is the recommended approach to treat stage config as the single source of truth for stage-level topology, and avoid configuring these fields through multiple paths at the same time?

If the latter is the intended model, I’m happy to update this test to use two separate yaml configs (EP vs non-EP), instead of overriding EP from the caller side at runtime.

@yenuo26
Collaborator

yenuo26 commented Mar 23, 2026

@congw729 do we need to add an A100 mark?

@congw729
Collaborator

@congw729 do we need to add an A100 mark?

We don't have A100 machines in our CI. Right now, the hardware mark is designed to mark which machine this test needs to run on.

@hsliuustc0106 hsliuustc0106 added the ready label (to trigger buildkite CI) on Mar 23, 2026
@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from e7b3064 to 811efb3 on March 23, 2026 15:15
@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch 2 times, most recently from 4decc32 to 2614afc on March 23, 2026 16:23
Collaborator

@lishunyang12 lishunyang12 left a comment


Now that the entrypoints/utils.py override is removed, what's left is a test refactor plus two new YAMLs for a test that isn't wired into any CI pipeline. The rename also drops the generic test_expert_parallel.py in favor of a HunyuanImage3-specific one.

Would it make more sense to fold the investigation findings into #2015 directly, and only open a PR when there's either a concrete fix or CI coverage for this test?

@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from 0509b8e to 919c35a on March 25, 2026 15:00
@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 25, 2026

Now that the entrypoints/utils.py override is removed, what's left is a test refactor plus two new YAMLs for a test that isn't wired into any CI pipeline. The rename also drops the generic test_expert_parallel.py in favor of a HunyuanImage3-specific one.

Would it make more sense to fold the investigation findings into #2015 directly, and only open a PR when there's either a concrete fix or CI coverage for this test?

I think the main issue here is not the Hunyuan-specific test itself, but the semantic conflict between bundled default stage configs and caller-provided runtime args. There should be a clear precedence model.

My reasoning is:

  1. We already have many offline / online examples and docs where runtime behavior is controlled explicitly via caller-provided arguments, such as parallel_config, TP-related settings, gpu_memory_utilization, enforce_eager, etc. However, with the current behavior, if a model has a bundled default stage config (such as the Hunyuan image model), overlapping fields in the YAML can silently override those caller-provided runtime args.
    For example:

    omni_kwargs = {
        "model": args.model,
        "enable_layerwise_offload": args.enable_layerwise_offload,
        "vae_use_slicing": args.vae_use_slicing,
        "vae_use_tiling": args.vae_use_tiling,
        "cache_backend": args.cache_backend,
        "cache_config": cache_config,
        "enable_cache_dit_summary": args.enable_cache_dit_summary,
        "parallel_config": parallel_config,
        "enforce_eager": args.enforce_eager,
        "enable_cpu_offload": args.enable_cpu_offload,
        "mode": "text-to-image",
        "enable_diffusion_pipeline_profiler": args.enable_diffusion_pipeline_profiler,
        **lora_args,
        **quant_kwargs,
    }
    if args.stage_configs_path:
        omni_kwargs["stage_configs_path"] = args.stage_configs_path
    if use_nextstep:
        # NextStep-1.1 requires explicit pipeline class
        omni_kwargs["model_class_name"] = "NextStep11Pipeline"
    omni = Omni(**omni_kwargs)

  2. test_expert_parallel.py is simply the first place where this issue becomes visible, because it combines both conditions: explicitly passing runtime args and resolving to a model with a bundled default stage config. As more bundled stage configs are introduced, this will not remain a one-off issue.

  3. This also seems inconsistent with the configuration model described in stage_config.py, where:

    • pipeline structure is defined by YAML
    • runtime parameters are expected to come from CLI / caller-provided args
      """
      Stage Configuration System for vLLM-Omni.
      Pipeline structure (stages, types, data-flow) is defined in per-model YAML
      files and is set by model developers at integration time.
      Runtime parameters (gpu_memory_utilization, tp_size, etc.) come from CLI.
      """

To make this precedence explicit, I introduced a prefer_stage_engine_args flag in load_stage_configs_from_yaml() (default True, meaning stage_arg.engine_args overrides caller-provided runtime args).

def load_stage_configs_from_yaml(
    config_path: str,
    base_engine_args: dict | None = None,
    prefer_stage_engine_args: bool = True,
) -> list:

With that, the current logic becomes:

  1. No caller-provided runtime args and no explicit stage_configs_path

    • prefer_stage_engine_args=True
    • resolve and use the bundled default stage config based on model name
  2. Caller explicitly provides runtime args, but does not explicitly provide stage_configs_path

    • prefer_stage_engine_args=False
    • caller-provided runtime args override overlapping fields from the bundled default stage config
  3. User explicitly provides stage_configs_path

    • prefer_stage_engine_args=True
    • the explicitly provided stage config is treated as the source of truth and overrides overlapping caller-provided runtime args

This precedence seems more consistent with the documented configuration model and makes the behavior easier to reason about. That said, I’m happy to adjust if there is a preferred precedence model.

@xiaohajiayou xiaohajiayou changed the title [Bugfix] Fix EP test override precedence for stage-config HunyuanImage3 runs and add A100 test support [Bugfix] Fix precedence between caller runtime args and default stage configs Mar 25, 2026
@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 25, 2026

is there a defined precedence or merge rule that determines the final effective value?

Or is the recommended approach to treat stage config as the single source of truth for stage-level topology, and avoid configuring these fields through multiple paths at the same time?

If the latter is the intended model, I’m happy to update this test to use two separate yaml configs (EP vs non-EP), instead of overriding EP from the caller side at runtime.

test_expert_parallel.py is actually a separate question.
In the earlier version, default stage-config resolution always picked the AR path instead of the diffusion path.

#1826 may have fixed that problem, but the precedence issue between caller-provided runtime args and the bundled default stage config remains.

This is also why my initial version relied on env-based switching, and the subsequent revision introduced two dedicated test YAML configs. At that stage, I was effectively working around two separate issues simultaneously: runtime argument precedence and the default stage-config resolution for Hunyuan.

What #1826 changed is that load_and_resolve_stage_configs() now goes through filter_stages(). As a result, when stage_configs_path is not explicitly provided, the Hunyuan default path first resolves to hunyuan_image_3_moe.yaml, and then, under the default mode=text-to-image, selects stage_id: 1, i.e. the stage_type: diffusion stage in that file.

- stage_id: 1
  stage_type: diffusion
  runtime:
    process: true
    devices: "0,1,2,3,4,5,6,7"
    max_batch_size: 1
  engine_args:
    model_stage: diffusion
    gpu_memory_utilization: 0.9
    enforce_eager: true
    engine_output_type: image
    distributed_executor_backend: "mp"
    enable_prefix_caching: false
    max_num_batched_tokens: 32768
    vae_use_slicing: false
    vae_use_tiling: false
    cache_backend: null
    cache_config: null
    enable_cache_dit_summary: false
    parallel_config:
      pipeline_parallel_size: 1
      data_parallel_size: 1
      tensor_parallel_size: 8
      enable_expert_parallel: false
      sequence_parallel_size: 1
      ulysses_degree: 1
      ring_degree: 1
      cfg_parallel_size: 1
      vae_patch_parallel_size: 1
      use_hsdp: false
      hsdp_shard_size: -1
      hsdp_replicate_size: 1

The problem is that #1826 still placed a substantial amount of runtime configuration directly into that default diffusion stage, including parallel_config. I checked other model stage configs, and they generally do not embed this kind of runtime parallel topology into the bundled default stage config. Because of that, although test_expert_parallel.py passes caller-side runtime args (such as the EP toggle / parallel config), under the current precedence model those values get overridden by the parallel_config baked into the default Hunyuan stage config. That is why this test exposes the issue and why --enable-expert-parallel stops taking effect on this path.

@xiaohajiayou
Contributor Author

@hsliuustc0106 @lishunyang12
Could you take a look?

@Bounty-hunter
Contributor

I also tried to fix this in #2289; we can discuss it.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 29, 2026

I also tried to fix this in #2289; we can discuss it.

I do not think we need to split omni and diffusion into two separate configuration flows, and I also do not think a separate diffusion_only flag is necessary.

After #1826, once a stage config is resolved, mode together with stage-id mapping is already sufficient to select the corresponding pipeline branch. Under the precedence model in this PR, both omni and diffusion behaviors can still be explained with the same four cases:

  1. No bundled default stage config can be resolved, and the caller does not explicitly provide stage_configs_path

    • Fall back to the CLI/runtime-based diffusion stage-config constructor
    • Behaviorally, this is equivalent to introducing a separate diffusion_only flag
      if not stage_configs:
          if default_stage_cfg_factory is not None:
              default_stage_cfg = default_stage_cfg_factory()
              stage_configs = create_config(default_stage_cfg)
  2. A bundled default stage config exists, no explicit stage_configs_path is provided, and no global runtime overrides are given

    • Resolve and use the bundled default stage config based on model identity
    • The config is used as-is
  3. A bundled default stage config exists, no explicit stage_configs_path is provided, and runtime overrides are given

    • Resolve and use the bundled default stage config
    • Caller-provided runtime args act as coarse-grained global overrides for overlapping fields in engine_args (e.g., enable_prefix_caching, enforce_eager)
    • This path is intended only for coarse-grained adjustments; if per-stage customization is needed, users should modify the default stage config or explicitly provide their own
  4. stage_configs_path is explicitly provided

    • The explicitly provided stage config is treated as the source of truth
    • Caller runtime args only supplement fields that are not explicitly defined in the stage config

Given this, introducing diffusion_only would have several drawbacks:

  1. Redundant semantics

    • If users want to run the diffusion branch, they can already select it via mode
    • Or explicitly provide a stage config via stage_configs_path
    • If neither is available, the current fallback already constructs a diffusion stage config
      → Therefore, diffusion_only is functionally overlapping with existing mechanisms
  2. Incomplete default diffusion construction

    • _create_default_diffusion_stage_cfg() is not a complete configuration (e.g., it does not cover fields like gpu_memory_utilization) and is currently only intended as a fallback
    • Overusing this path may lead to caller-provided runtime args not being applied as expected
  3. Reduced configuration reuse and increased user burden

    • It prevents reuse of existing model/pipeline configurations
    • Users would need to manually provide more diffusion-related runtime parameters
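The four cases above can be summarized as a small decision function. This is a sketch of the precedence model, not existing code; the return values just name the winning side and the implied prefer_stage_engine_args value:

```python
def resolve_precedence(has_bundled_default: bool,
                       caller_runtime_args: bool,
                       explicit_stage_path: bool) -> tuple:
    """Map the three inputs to (source of truth, prefer_stage_engine_args)."""
    if explicit_stage_path:
        # Case 4: the explicitly provided stage config is the source of truth.
        return ("stage_yaml", True)
    if not has_bundled_default:
        # Case 1: fall back to the runtime-based diffusion stage constructor.
        return ("runtime_fallback", None)
    if caller_runtime_args:
        # Case 3: caller runtime args override overlapping yaml fields.
        return ("caller_args", False)
    # Case 2: bundled default stage config used as-is.
    return ("stage_yaml", True)
```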

@Bounty-hunter
Contributor


Default config + CLI-args overwrite makes sense. However, some default config fields cannot be overridden correctly, e.g. devices = "0,1,2,3,4,5,6,7", which leads to errors. Can you test the #2282 case? Perhaps we need to remove the strictly constrained configurations that cannot be overridden from the default YAML.
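One way to realize the "remove strictly constrained configurations" idea is to strip such fields from the bundled default before any merge. A sketch, with the non-overridable key set and helper name chosen here purely for illustration:

```python
# Fields the runtime must own; "devices" is the example from this thread.
NON_OVERRIDABLE = {"devices"}

def sanitize_default_stage(stage_cfg: dict) -> dict:
    """Drop runtime fields that must never win over caller settings."""
    runtime = {k: v for k, v in stage_cfg.get("runtime", {}).items()
               if k not in NON_OVERRIDABLE}
    return {**stage_cfg, "runtime": runtime}

cfg = {"stage_id": 1,
       "runtime": {"devices": "0,1,2,3,4,5,6,7", "max_batch_size": 1}}
clean = sanitize_default_stage(cfg)
# clean["runtime"] == {"max_batch_size": 1}
```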

@Bounty-hunter Bounty-hunter mentioned this pull request Mar 30, 2026
@xiaohajiayou xiaohajiayou force-pushed the bugfix/hunyuan-ep-test-a100 branch from 41a7a0f to d9cf2be on March 31, 2026 06:45
@xiaohajiayou
Contributor Author

xiaohajiayou commented Mar 31, 2026

The fix has been completed and verified locally.

The current CI failure does not appear to be directly related to this change. Based on the logs, it seems to come from an AMD (ROCm) Qwen3-Omni test pipeline timeout, which may be related to the known CI issues tracked in #2340.

Let me know if I should further investigate from this PR side. @princepride @Bounty-hunter @hsliuustc0106

@skf-1999
Contributor

skf-1999 commented Apr 1, 2026

How did you get HunyuanImage 3.0 working on 4x A100? Both of my 4-GPU A100 setups hit OOM, so I'm forced to run on 8 cards.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 1, 2026

How did you get HunyuanImage 3.0 working on 4x A100? Both of my 4-GPU A100 setups hit OOM, so I'm forced to run on 8 cards.

Could you share more details about your A100 setup? Specifically:

  • What is the GPU memory size (40GB or 80GB)?
  • Do you see full memory utilization on each card before the OOM happens?

In my case, I’m running on 4× NVIDIA A100-SXM4-80GB, and the inference proceeds normally without OOM using the following command:

python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

@skf-1999
Contributor

skf-1999 commented Apr 1, 2026

How did you get HunyuanImage 3.0 working on 4x A100? Both of my 4-GPU A100 setups hit OOM, so I'm forced to run on 8 cards.

could you share more details about your A100 setup? Specifically:

  • What is the GPU memory size (40GB or 80GB)?
  • Do you see full memory utilization on each card before the OOM happens?

In my case, I’m running on 4× NVIDIA A100-SXM4-80GB, and the inference proceeds normally without OOM using the following command:

python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

I'm on two 8x NVIDIA A100-SXM4-80GB machines, both hit OOM. Right before crashing, memory usage was 62914MiB / 81920MiB per GPU (~77%).

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 1, 2026

I'm on two 8x NVIDIA A100-SXM4-80GB machines, both hit OOM. Right before crashing, memory usage was 62914MiB / 81920MiB per GPU (~77%).

Just to clarify, are you seeing this OOM issue on top of my PR, or does it also happen on the current main branch?

On my side, the same setup (4× A100-SXM4-80GB, TP=4, 50 steps) runs normally without OOM. During local testing, each GPU typically starts from around ~5GB usage and ramps up to roughly ~60GB per card during inference.

Given your logs show ~62GB before OOM, this feels a bit unexpected.

One thing you might want to try is explicitly selecting 4 GPUs with more available memory, for example:

CUDA_VISIBLE_DEVICES=1,3,6,7 python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

@skf-1999
Contributor

skf-1999 commented Apr 1, 2026

I'm on two 8x NVIDIA A100-SXM4-80GB machines, both hit OOM. Right before crashing, memory usage was 62914MiB / 81920MiB per GPU (~77%).

Just to clarify, are you seeing this OOM issue on top of my PR, or does it also happen on the current main branch?

On my side, the same setup (4× A100-SXM4-80GB, TP=4, 50 steps) runs normally without OOM. During local testing, each GPU typically starts from around ~5GB usage and ramps up to roughly ~50GB per card during inference.

Given your logs show ~62GB before OOM, this feels a bit unexpected.

One thing you might want to try is explicitly selecting 4 GPUs with more available memory, for example:

CUDA_VISIBLE_DEVICES=1,3,6,7 python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/models/tencent/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0

I encountered this issue in this PR, but it's not specific to this PR—the main branch also exhibits this behavior. All GPUs have 81920MiB of VRAM and are completely idle (no other processes running) when not executing the program. In the 8-GPU setup, each card's VRAM reaches 69886MiB.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 1, 2026

I encountered this issue in this PR, but it's not specific to this PR—the main branch also exhibits this behavior. All GPUs have 81920MiB of VRAM and are completely idle (no other processes running) when not executing the program. In the 8-GPU setup, each card's VRAM reaches 69886MiB.

Thanks, this is helpful. If the same OOM also happens on main, then it’s likely not introduced by this PR, but rather something related to the environment or configuration.

For context, HunyuanImage-3.0 is around 83B parameters (BF16/F32). On my side, we’re able to run the standard offline inference setup on 4× A100-SXM4-80GB without hitting OOM, so in principle this setup should be sufficient for inference.

If your goal is to validate the fix in this PR, one possible approach is to explicitly set the default stage config to use 4 GPUs (TP=4), and then launch the job on an 8-GPU node with --tensor-parallel-size 8. This way, you can more directly verify whether the behavior addressed in this PR is working as expected.

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 3, 2026

I managed to reproduce the setup on another 4× A800 (80GB) server today to isolate the variable. The inference ran successfully without triggering OOM.
It might be an issue with your environment. @skf-1999

Here is the full log and the nvidia-smi snapshot right before the process finished:

full log
(vllm-omni) root@autodl-container-2201459dc8-f8f44d5f:~/vllm-omni# cd /root/vllm-omni
source /root/vllm-omni/.venv/bin/activate
(vllm-omni) root@autodl-container-2201459dc8-f8f44d5f:~/vllm-omni# nvidia-smi
Fri Apr  3 18:04:34 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800 80GB PCIe          On  |   00000000:4F:00.0 Off |                  Off |
| N/A   48C    P0            132W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          On  |   00000000:56:00.0 Off |                  Off |
| N/A   47C    P0            199W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A800 80GB PCIe          On  |   00000000:57:00.0 Off |                  Off |
| N/A   48C    P0            251W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A800 80GB PCIe          On  |   00000000:D5:00.0 Off |                  Off |
| N/A   52C    P0            122W /  300W |   65277MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+


python -u examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0 \
  --prompt "A brown and white dog is running on the grass" \
  --output output_image_latest.png \
  --num-inference-steps 50 \
  --tensor-parallel-size 4 \
  --cfg-scale 4.0 \
  --enforce-eager
INFO 04-03 18:02:53 [omni_base.py:93] [Omni] Initializing with model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0
INFO 04-03 18:02:53 [async_omni_engine.py:216] [AsyncOmniEngine] Initializing with model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-03 18:02:53 [config.py:437] Replacing legacy 'type' key with 'rope_type'
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-03 18:02:53 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:02:53 [async_omni_engine.py:248] [AsyncOmniEngine] Launching Orchestrator thread with 1 stages
INFO 04-03 18:02:53 [stage_init_utils.py:207] [stage_init] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
INFO 04-03 18:02:53 [initialization.py:270] Loaded OmniTransferConfig with 0 connector configurations
INFO 04-03 18:02:53 [async_omni_engine.py:466] [AsyncOmniEngine] Initializing stage 0
INFO 04-03 18:02:53 [stage_init_utils.py:222] [stage_init] Stage-0 set runtime devices: 0,1,2,3,4,5,6,7
INFO 04-03 18:02:54 [multiproc_executor.py:99] Starting server...
INFO 04-03 18:03:01 [diffusion_worker.py:396] Worker 0 created result MessageQueue
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 04-03 18:03:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-03 18:03:01 [vllm.py:754] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 0: Initialized device and distributed environment.
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 1: Initialized device and distributed environment.
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 2: Initialized device and distributed environment.
INFO 04-03 18:03:02 [diffusion_worker.py:129] Worker 3: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 1: sp_group=[1], ulysses_group=[1], ring_group=[1]
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 2: sp_group=[2], ulysses_group=[2], ring_group=[2]
INFO 04-03 18:03:02 [parallel_state.py:630] SP group details for rank 3: sp_group=[3], ulysses_group=[3], ring_group=[3]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
You are using a model of type hunyuan_image_3_moe to instantiate a model of type Hunyuan. This is not supported for all configurations of models and can yield errors.
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [config.py:437] Replacing legacy 'type' key with 'rope_type'
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [pipeline_hunyuan_image_3.py:93] Setting attention backend to TORCH_SDPA. HunyuanImage3Pipeline only supports TORCH_SDPA to handle mixed causal and full attention.
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [platform.py:73] Using diffusion attention backend 'TORCH_SDPA'
INFO 04-03 18:03:03 [unquantized.py:186] Using TRITON backend for Unquantized MoE
Multi-thread loading shards:   0% Completed | 0/32 [00:00<?, ?it/s]
Multi-thread loading shards:   3% Completed | 1/32 [00:00<00:12,  2.46it/s]
Multi-thread loading shards:   6% Completed | 2/32 [00:00<00:14,  2.11it/s]
Multi-thread loading shards:   9% Completed | 3/32 [00:01<00:12,  2.24it/s]
Multi-thread loading shards:  12% Completed | 4/32 [00:01<00:12,  2.27it/s]
Multi-thread loading shards:  16% Completed | 5/32 [00:02<00:11,  2.39it/s]
Multi-thread loading shards:  19% Completed | 6/32 [00:02<00:10,  2.48it/s]
Multi-thread loading shards:  22% Completed | 7/32 [00:02<00:09,  2.54it/s]
Multi-thread loading shards:  25% Completed | 8/32 [00:03<00:09,  2.55it/s]
Multi-thread loading shards:  28% Completed | 9/32 [00:03<00:08,  2.58it/s]
Multi-thread loading shards:  31% Completed | 10/32 [00:04<00:08,  2.62it/s]
Multi-thread loading shards:  34% Completed | 11/32 [00:04<00:08,  2.62it/s]
Multi-thread loading shards:  38% Completed | 12/32 [00:04<00:07,  2.66it/s]
Multi-thread loading shards:  41% Completed | 13/32 [00:05<00:06,  2.73it/s]
Multi-thread loading shards:  44% Completed | 14/32 [00:05<00:06,  2.78it/s]
Multi-thread loading shards:  47% Completed | 15/32 [00:05<00:06,  2.82it/s]
Multi-thread loading shards:  50% Completed | 16/32 [00:06<00:05,  2.87it/s]
Multi-thread loading shards:  53% Completed | 17/32 [00:06<00:05,  2.92it/s]
Multi-thread loading shards:  56% Completed | 18/32 [00:06<00:04,  2.98it/s]
Multi-thread loading shards:  59% Completed | 19/32 [00:07<00:04,  2.93it/s]
Multi-thread loading shards:  62% Completed | 20/32 [00:07<00:03,  3.01it/s]
Multi-thread loading shards:  66% Completed | 21/32 [00:07<00:03,  3.05it/s]
Multi-thread loading shards:  69% Completed | 22/32 [00:08<00:03,  3.09it/s]
Multi-thread loading shards:  72% Completed | 23/32 [00:08<00:02,  3.10it/s]
Multi-thread loading shards:  75% Completed | 24/32 [00:08<00:02,  3.00it/s]
Multi-thread loading shards:  78% Completed | 25/32 [00:09<00:02,  2.95it/s]
Multi-thread loading shards:  81% Completed | 26/32 [00:09<00:02,  2.93it/s]
Multi-thread loading shards:  84% Completed | 27/32 [00:09<00:01,  2.89it/s]
Multi-thread loading shards:  88% Completed | 28/32 [00:10<00:01,  2.87it/s]
Multi-thread loading shards:  91% Completed | 29/32 [00:10<00:01,  2.91it/s]
Multi-thread loading shards:  94% Completed | 30/32 [00:11<00:00,  2.17it/s]
Multi-thread loading shards:  97% Completed | 31/32 [00:11<00:00,  2.18it/s]
Multi-thread loading shards: 100% Completed | 32/32 [00:12<00:00,  2.04it/s]
Multi-thread loading shards: 100% Completed | 32/32 [00:12<00:00,  2.61it/s]

INFO 04-03 18:03:16 [diffusers_loader.py:321] Loading weights took 12.36 seconds
INFO 04-03 18:03:17 [diffusers_loader.py:321] Loading weights took 12.60 seconds
INFO 04-03 18:03:17 [diffusers_loader.py:321] Loading weights took 12.84 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 14.999656 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:17 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:159] Worker 0: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:17 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:17 [diffusion_worker.py:98] Worker 0: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:534] Worker 0: Scheduler loop started.
INFO 04-03 18:03:17 [diffusion_worker.py:457] Worker 0 ready to receive requests via shared memory
INFO 04-03 18:03:17 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 15.283239 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:17 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:159] Worker 3: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:17 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:3, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:17 [diffusion_worker.py:98] Worker 3: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:534] Worker 3: Scheduler loop started.
INFO 04-03 18:03:17 [diffusion_worker.py:457] Worker 3 ready to receive requests via shared memory
INFO 04-03 18:03:17 [diffusers_loader.py:321] Loading weights took 13.34 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 15.439166 seconds
INFO 04-03 18:03:17 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:17 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:159] Worker 1: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:17 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:1, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:17 [diffusion_worker.py:98] Worker 1: Initialization complete.
INFO 04-03 18:03:17 [diffusion_worker.py:534] Worker 1: Scheduler loop started.
INFO 04-03 18:03:17 [diffusion_worker.py:457] Worker 1 ready to receive requests via shared memory
INFO 04-03 18:03:18 [diffusion_model_runner.py:141] Model loading took 42.3180 GiB and 15.899292 seconds
INFO 04-03 18:03:18 [diffusion_model_runner.py:146] Model runner: Model loaded successfully.
INFO 04-03 18:03:18 [diffusion_model_runner.py:187] Model runner: Initialization complete.
INFO 04-03 18:03:18 [diffusion_worker.py:159] Worker 2: Process-scoped GPU memory after model loading: 42.88 GiB.
INFO 04-03 18:03:18 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:2, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-03 18:03:18 [diffusion_worker.py:98] Worker 2: Initialization complete.
INFO 04-03 18:03:18 [diffusion_worker.py:534] Worker 2: Scheduler loop started.
INFO 04-03 18:03:18 [diffusion_worker.py:457] Worker 2 ready to receive requests via shared memory
INFO 04-03 18:03:18 [diffusion_engine.py:378] dummy run to warm up the model
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:18 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 04-03 18:03:18 [kv_transfer_manager.py:381] No connector available for receiving KV cache
  0%|                                                                                    | 0/1 [00:00<?, ?it/s]WARNING 04-03 18:03:20 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /root/vllm-omni/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_A800_80GB_PCIe.json
100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.10s/it]
INFO 04-03 18:03:23 [diffusion_model_runner.py:212] Peak GPU memory (this request): 63.09 GB reserved, 51.37 GB allocated, 11.72 GB pool overhead (18.6%)
INFO 04-03 18:03:23 [async_omni_diffusion.py:154] AsyncOmniDiffusion initialized with model: /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0, batch_size: 1
INFO 04-03 18:03:23 [stage_diffusion_client.py:54] [StageDiffusionClient] Stage-1 initialized (batch_size=1)
INFO 04-03 18:03:23 [async_omni_engine.py:496] [AsyncOmniEngine] Stage 0 initialized (diffusion, batch_size=1)
INFO 04-03 18:03:23 [orchestrator.py:158] [Orchestrator] Starting event loop
INFO 04-03 18:03:23 [async_omni_engine.py:290] [AsyncOmniEngine] Orchestrator ready with 1 stages
INFO 04-03 18:03:23 [omni_base.py:106] [Omni] AsyncOmniEngine initialized in 30.07 seconds
INFO 04-03 18:03:23 [omni_base.py:121] [Omni] Initialized with 1 stages for model /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0

============================================================
Generation Configuration:
  Model: /root/autodl-fs/models/Tencent-Hunyuan/HunyuanImage-3.0
  Inference steps: 50
  Cache backend: None (no acceleration)
  Quantization: None (BF16)
  Parallel configuration: tensor_parallel_size=4, ulysses_degree=1, ulysses_mode=strict, ring_degree=1, cfg_parallel_size=1, vae_patch_parallel_size=1, enable_expert_parallel=False.
  CPU offload: False
  Image size: 1024x1024
============================================================

INFO 04-03 18:03:23 [orchestrator.py:584] [Orchestrator] _handle_add_request: stage=0 req=0_deb5b338-3b60-48f7-a1f1-666e186c320f prompt_type=dict original_prompt_type=dict final_stage=0 num_sampling_params=1
Processed prompts:   0%|                                                                 | 0/1 [00:00<?, ?it/s]INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 04-03 18:03:23 [kv_transfer_manager.py:381] No connector available for receiving KV cache
INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
INFO 04-03 18:03:23 [manager.py:608] Deactivating all adapters: 0 layers
 80%|███████████████████████████████████████████████████████████▏              | 40/50 [01:00<00:14,  1.50s/it]INFO 04-03 18:04:24 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
100%|██████████████████████████████████████████████████████████████████████████| 50/50 [01:15<00:00,  1.50s/it]
INFO 04-03 18:04:39 [diffusion_model_runner.py:212] Peak GPU memory (this request): 63.10 GB reserved, 51.41 GB allocated, 11.69 GB pool overhead (18.5%)
INFO 04-03 18:04:39 [diffusion_engine.py:103] Generation completed successfully.
INFO 04-03 18:04:39 [diffusion_engine.py:136] Post-processing completed in 0.0000 seconds
INFO 04-03 18:04:39 [diffusion_engine.py:139] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=76300.51 ms, postprocess=0.00 ms, total=76300.69 ms
Processed prompts: 100%|█████████████████████████████████████████████████████████| 1/1 [01:16<00:00, 76.30s/it]INFO 04-03 18:04:39 [omni_base.py:162] [Summary] {}
Processed prompts: 100%|█████████████████████████████████████████████████████████| 1/1 [01:16<00:00, 76.30s/it]
Total generation time: 76.3049 seconds (76304.87 ms)
INFO 04-03 18:04:39 [text_to_image.py:440] Outputs: [OmniRequestOutput(request_id='0_deb5b338-3b60-48f7-a1f1-666e186c320f', finished=True, stage_id=0, final_output_type='image', request_output=OmniRequestOutput(request_id='0_deb5b338-3b60-48f7-a1f1-666e186c320f', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt={'prompt': 'A brown and white dog is running on the grass', 'negative_prompt': None}, latents=None, metrics={'preprocess_time_ms': 0.0, 'diffusion_engine_exec_time_ms': 76300.72746798396, 'diffusion_engine_total_time_ms': 76300.50529167056, 'image_num': 1, 'resolution': 640, 'postprocess_time_ms': 0.0013150274753570557}, multimodal_output={}, custom_output={}, stage_durations={}, peak_memory_mb=64610.0), images=[1 PIL Images], prompt=None, latents=None, metrics={}, multimodal_output={}, custom_output={}, stage_durations={}, peak_memory_mb=64610.0)]
Saved generated image to output_image_latest.png
INFO 04-03 18:04:40 [async_omni_engine.py:1133] [AsyncOmniEngine] Shutting down Orchestrator
INFO 04-03 18:04:40 [orchestrator.py:210] [Orchestrator] Received shutdown signal
INFO 04-03 18:04:40 [orchestrator.py:820] [Orchestrator] Shutting down all stages
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 0: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 2: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 3: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:486] Worker 1: Received shutdown message
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:507] event loop terminated.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 0: Shutdown complete.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 3: Shutdown complete.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 2: Shutdown complete.
INFO 04-03 18:04:40 [diffusion_worker.py:542] Worker 1: Shutdown complete.
INFO 04-03 18:04:42 [async_omni_diffusion.py:365] AsyncOmniDiffusion closed
INFO 04-03 18:04:42 [orchestrator.py:824] [Orchestrator] Stage 0 shut down
(vllm-omni) root@autodl-container-2201459dc8-f8f44d5f:~/vllm-omni# 
...

@xiaohajiayou
Contributor Author

xiaohajiayou commented Apr 3, 2026

  • This PR fixes the override-priority issue between engine args and YAML config.
  • The runtime devices field in the YAML has also been refined to support logical device mapping.

Could you please take a look? If everything looks good, could you merge it first? @princepride @hsliuustc0106 @lishunyang12
It seems that the work in #2264 and #2185 both depend on this PR being merged.

Collaborator

@lishunyang12 lishunyang12 left a comment


LGTM

@lishunyang12 lishunyang12 enabled auto-merge (squash) April 6, 2026 02:19
@xiaohajiayou
Contributor Author

Hi @hsliuustc0106, hope you're having a good week. I've addressed all your comments and the PR is ready for re-review. The merge is currently blocked pending your approval.
Is there anything else you'd like me to adjust or refine? I'm happy to make further changes if needed.

@princepride
Collaborator

@hsliuustc0106 PTAL

@xiaohajiayou
Contributor Author

If everything looks good, could we merge this PR? @lishunyang12

@princepride princepride merged commit 2d98013 into vllm-project:main Apr 9, 2026
8 checks passed
stage_configs = load_stage_configs_from_yaml(
config_path=stage_config_path,
base_engine_args=base_engine_args,
prefer_stage_engine_args=False,
Contributor


The override behavior from CLI args to the YAML is failing some nightly tests: https://buildkite.com/vllm/vllm-omni/builds/6216/steps/canvas?sid=019d71f7-fe1d-4c22-a5df-abb9425a9d81
Maybe we could change the False to True to align with the old behavior.
@xiaohajiayou @hsliuustc0106 @lishunyang12 @princepride

Contributor Author


For models with a built-in default stage config, we merge caller-provided engine args into the stage config.

The problem is that parser-based entrypoints were previously passing the full parsed CLI namespace into stage config resolution. That namespace contains both:

  • explicitly provided CLI args
  • parser default values

As a result, fields the user never explicitly passed could still participate in the config merge.

One concrete example is distributed_executor_backend:

  • the model's default stage config may set:
    • distributed_executor_backend: "mp"
  • but the upstream vLLM parser can provide:
    • distributed_executor_backend = None
  • if this full parsed args object is treated as override input, the YAML value "mp" can be overwritten by None

That is why the executor eventually sees:

Unknown distributed executor backend: None

I attempt to fix this in #2655.
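The precedence rule described above can be sketched as follows (a minimal illustration, not the actual vllm-omni API: the helper name `merge_engine_args` and the config dicts are assumptions). The key point is that a parser default of `None` must not clobber a value set in the stage YAML:

```python
# Hypothetical sketch of the intended merge precedence: explicitly provided
# caller args override stage-yaml values, but parser defaults that resolved
# to None are treated as "not set" and are skipped.

def merge_engine_args(yaml_config: dict, caller_args: dict) -> dict:
    """Overlay caller args on the yaml config, ignoring unset (None) values."""
    merged = dict(yaml_config)
    for key, value in caller_args.items():
        if value is not None:  # None means the user never passed this flag
            merged[key] = value
    return merged

yaml_config = {"distributed_executor_backend": "mp", "tensor_parallel_size": 2}
# Full parsed namespace: the user set TP=4 but never touched the executor
# backend, so the parser default (None) leaks into the dict.
caller_args = {"distributed_executor_backend": None, "tensor_parallel_size": 4}

merged = merge_engine_args(yaml_config, caller_args)
print(merged)  # {'distributed_executor_backend': 'mp', 'tensor_parallel_size': 4}
```

Without the `None` check, the merge would produce `distributed_executor_backend=None`, which is exactly the "Unknown distributed executor backend: None" failure mode above.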

Contributor


Thank you for your response! I believe one effective solution would be to manage the argument parser’s default values within the vllm-omni diffusion engine/worker. Alternatively, we could create our own argument parser from scratch.
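One standard way to realize "manage the parser's default values ourselves" (a sketch of the idea, not the vllm-omni implementation) is to use `argparse.SUPPRESS` as the default: flags the user never passed then simply do not appear in the parsed namespace, so they cannot override stage-yaml values during the merge. The flag names below are illustrative:

```python
# With default=argparse.SUPPRESS, unspecified flags are absent from the
# namespace entirely, instead of showing up as None.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--tensor-parallel-size", type=int, default=argparse.SUPPRESS)
parser.add_argument("--distributed-executor-backend", default=argparse.SUPPRESS)

args = parser.parse_args(["--tensor-parallel-size", "4"])
explicit = vars(args)
print(explicit)  # {'tensor_parallel_size': 4} -- no executor-backend key at all
```

The merge step can then safely treat every key in `explicit` as user intent.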

vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
… configs (vllm-project#2076)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Sy0307 pushed a commit to Sy0307/vllm-omni that referenced this pull request Apr 10, 2026
… configs (vllm-project#2076)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026
… configs (vllm-project#2076)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: HunyuanImage3 EP regression

9 participants