[Bugfix] Fix precedence between caller runtime args and default stage configs #2076
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2e6c3e14c
Thanks for the investigation on the EP regression — the benchmarking results are useful context.
However, I think the per-field override approach here works against the stage-config design. The stage yaml is meant to be the single source of truth for stage-level topology. Punching through enable_expert_parallel as a special case creates an inconsistent precedence model — callers would reasonably expect the same to work for tensor_parallel_size or other parallel fields, but it won't.
For toggling EP in testing/benchmarking, a separate stage yaml (e.g. hunyuan_image3_moe_dit_ep.yaml) would keep the config model clean and explicit.
The env-var overrides in the test also add a lot of surface area for what's essentially a single-model validation. If A100 support is needed, a dedicated test or fixture would be more maintainable than parameterizing the existing one with 5+ env vars.
Happy to discuss if you see a reason the yaml-per-config approach doesn't work for your use case.
hsliuustc0106 left a comment
Review Gates
| Check | Status |
|---|---|
| DCO | ❌ ACTION_REQUIRED |
| pre-commit | ✅ SUCCESS |
| mergeable | ✅ MERGEABLE |
BLOCKER: DCO check failing. Author needs to sign commits with `git commit -s`.
Mandatory Blocker Triage
| Category | Status | Evidence/Gap |
|---|---|---|
| Tests | ✅ PASS | Test commands + results in PR description; existing test updated with env-based overrides |
| Docs | N/A | Internal test/utility changes only — no docs required |
| Perf | ✅ PASS | Benchmark data: baseline 27912ms vs EP 28001ms (1.00x speedup) |
| Accuracy | ✅ PASS | cos_sim=0.99, mse=5.5e-3, diff metrics provided |
| API | N/A | No API changes |
Code Review
The fix is narrow and well-scoped:
- `vllm_omni/entrypoints/utils.py`: Extracts `enable_expert_parallel` from the caller config and overrides the stage-yaml value
- `tests/e2e/offline_inference/test_expert_parallel.py`: Adds env-based test configuration
- `tests/utils.py`: Adds A100 marker support
- `pyproject.toml`: Registers the A100 marker
No inline blockers beyond DCO.
Action Required
- Author: sign commits with `git commit -s` to fix DCO
Once DCO passes, this is ready for APPROVE.
You're right that this deserves a consistent model. What I'd like to clarify is the expected relationship between the two configuration paths that already exist today:

```python
parallel_config = DiffusionParallelConfig(tensor_parallel_size=2)
omni = Omni(model="your-model-name", parallel_config=parallel_config)
```

```python
omni = Omni(model="your-model-name", stage_configs_path="/path/to/your/custom_bagel.yaml")
```

So I'd like to confirm the intended design: when the same topology-related fields (e.g. `tensor_parallel_size`, `enable_expert_parallel`) are set through both paths, is there a defined precedence or merge rule that determines the final effective value? Or is the recommended approach to treat the stage config as the single source of truth for stage-level topology, and avoid configuring these fields through multiple paths at the same time? If the latter is the intended model, I'm happy to update this test to use two separate yaml configs (EP vs non-EP), instead of overriding EP from the caller side at runtime.
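To make the question concrete, the two candidate precedence models can be expressed as a single merge helper. This is only an illustration — `merge_engine_args` and its `prefer_stage` switch are hypothetical names, not the actual vllm-omni API:

```python
def merge_engine_args(stage_yaml: dict, caller: dict, prefer_stage: bool) -> dict:
    """Hypothetical merge rule: on conflicting keys, `prefer_stage`
    decides whether the stage yaml or the caller-provided args win."""
    if prefer_stage:
        return {**caller, **stage_yaml}   # yaml is the source of truth
    return {**stage_yaml, **caller}       # caller args are authoritative

caller = {"tensor_parallel_size": 2, "enable_expert_parallel": True}
stage_yaml = {"enable_expert_parallel": False}

# Stage yaml wins on conflict:
print(merge_engine_args(stage_yaml, caller, prefer_stage=True)["enable_expert_parallel"])   # False
# Caller wins on conflict:
print(merge_engine_args(stage_yaml, caller, prefer_stage=False)["enable_expert_parallel"])  # True
```

Either rule is coherent on its own; the problem arises only when neither is documented and different entrypoints behave differently.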
@congw729 do we need to add an A100 mark?
We don't have A100 machines in our CI. Right now, the hardware mark is designed to mark which machine this test needs to run on. |
Now that the `entrypoints/utils.py` override is removed, what's left is a test refactor plus two new YAMLs for a test that isn't wired into any CI pipeline. The rename also drops the generic `test_expert_parallel.py` in favor of a HunyuanImage3-specific one.
Would it make more sense to fold the investigation findings into #2015 directly, and only open a PR when there's either a concrete fix or CI coverage for this test?
I think the main issue here is not the Hunyuan-specific test itself, but the semantic conflict between bundled default stage configs and caller-provided runtime args; there should be a clear precedence model. To make this precedence explicit, I introduced a `prefer_stage_engine_args` flag. With that, setting `prefer_stage_engine_args=False` lets caller-provided engine args take precedence over the bundled default stage config. This precedence seems more consistent with the documented configuration model and makes the behavior easier to reason about. That said, I'm happy to adjust if there is a preferred precedence model.
#1826 might have fixed this problem.
What #1826 changed is how the default stage config is resolved. The problem is that #1826 still placed a substantial amount of runtime configuration directly into that default diffusion stage, including fields like `device` and `distributed_executor_backend`.
@hsliuustc0106 @lishunyang12
I also tried to fix this in #2289; we can discuss it there.
I do not think we need to split omni and diffusion into two separate configuration flows, and I also do not think a separate override flag is needed. After #1826, once a stage config is resolved, it already serves as the effective stage-level configuration. Given this, introducing an additional precedence mechanism seems redundant.
Default config + CLI args override makes sense. However, some default configs cannot be overridden correctly, e.g. `device = "0,1,2,3,4,5,6,7"`, which leads to errors. Can you test the #2282 case? Perhaps we need to remove the strictly constrained configurations that cannot be overridden from the default YAML.
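One way to realize "remove the strictly constrained configurations from the default YAML" is to filter hardware-pinned keys out of the bundled config before any merge. This is a sketch with a hypothetical `NON_PORTABLE_KEYS` set, not the actual vllm-omni code:

```python
# Assumption: keys like `device` pin specific hardware and should never
# ship in a bundled default stage config, because callers cannot
# reliably override them at runtime.
NON_PORTABLE_KEYS = {"device"}

def strip_non_portable(default_stage_cfg: dict) -> dict:
    """Drop hardware-pinned keys from a bundled default stage config
    so caller/CLI values can always take effect."""
    return {k: v for k, v in default_stage_cfg.items()
            if k not in NON_PORTABLE_KEYS}

default_cfg = {"device": "0,1,2,3,4,5,6,7", "tensor_parallel_size": 8}
print(strip_non_portable(default_cfg))  # {'tensor_parallel_size': 8}
```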
The fix has been completed and verified locally. The CI failure does not appear to be directly related to this change; based on the logs, it comes from a timeout in the AMD (ROCm) Qwen3-Omni test pipeline, which may be related to the known CI issues tracked in #2340. Let me know if I should investigate further from this PR's side. @princepride @Bounty-hunter @hsliuustc0106
How did you get HunyuanImage 3.0 working on 4x A100? Both of my 4-GPU A100 setups hit OOM, so I'm forced to run on 8 cards.
Could you share more details about your A100 setup? In my case, I'm running on 4× NVIDIA A100-SXM4-80GB, and inference proceeds normally without OOM using the following command:

```shell
python -u examples/offline_inference/text_to_image/text_to_image.py \
    --model /home/models/tencent/HunyuanImage-3.0 \
    --prompt "A brown and white dog is running on the grass" \
    --output output_image_latest.png \
    --num-inference-steps 50 \
    --tensor-parallel-size 4 \
    --cfg-scale 4.0
```
I'm on two 8x NVIDIA A100-SXM4-80GB machines, both hit OOM. Right before crashing, memory usage was 62914MiB / 81920MiB per GPU (~77%).
Just to clarify, are you seeing this OOM issue on top of my PR, or does it also happen on the current main branch? On my side, the same setup (4× A100-SXM4-80GB, TP=4, 50 steps) runs normally without OOM. During local testing, each GPU typically starts from around ~5GB usage and ramps up to roughly ~60GB per card during inference. Given your logs show ~62GB before OOM, this feels a bit unexpected. One thing you might want to try is explicitly selecting 4 GPUs with more available memory, for example:

```shell
CUDA_VISIBLE_DEVICES=1,3,6,7 python -u examples/offline_inference/text_to_image/text_to_image.py \
    --model /home/models/tencent/HunyuanImage-3.0 \
    --prompt "A brown and white dog is running on the grass" \
    --output output_image_latest.png \
    --num-inference-steps 50 \
    --tensor-parallel-size 4 \
    --cfg-scale 4.0
```
I encountered this issue in this PR, but it's not specific to this PR; the main branch also exhibits this behavior. All GPUs have 81920MiB of VRAM and are completely idle (no other processes running) when not executing the program. In the 8-GPU setup, each card's VRAM reaches 69886MiB.
Thanks, this is helpful. If the same OOM also happens on main, then it's likely not introduced by this PR, but rather something related to the environment or configuration. For context, HunyuanImage-3.0 is around 83B parameters (BF16/F32). On my side, we're able to run the standard offline inference setup on 4× A100-SXM4-80GB without hitting OOM, so in principle this setup should be sufficient for inference. If your goal is to validate the fix in this PR, one possible approach is to explicitly set the default stage config to use 4 GPUs (TP=4), and then launch the job on an 8-GPU node with `CUDA_VISIBLE_DEVICES` restricted to 4 cards.
I managed to reproduce the setup on another 4× A800 (80GB) server today to isolate the variable. The inference ran successfully without triggering OOM. Here is the full log and the nvidia-smi snapshot right before the process finished: full log
Could you please take a look? If everything looks good, could you merge it first? @princepride @hsliuustc0106 @lishunyang12 |
Hi @hsliuustc0106, hope you're having a good week. I've addressed all your comments and the PR is ready for re-review. The merge is currently blocked pending your approval. |
@hsliuustc0106 PTAL |
If everything looks good, could we merge this PR? @lishunyang12
```python
stage_configs = load_stage_configs_from_yaml(
    config_path=stage_config_path,
    base_engine_args=base_engine_args,
    prefer_stage_engine_args=False,
)
```
The override behavior from CLI args onto the yaml's values is causing some nightly tests to fail: https://buildkite.com/vllm/vllm-omni/builds/6216/steps/canvas?sid=019d71f7-fe1d-4c22-a5df-abb9425a9d81
Maybe we could change the `False` to `True` to align with the old behavior.
@xiaohajiayou @hsliuustc0106 @lishunyang12 @princepride
For models with a built-in default stage config, we merge caller-provided engine args into the stage config.
The problem is that parser-based entrypoints were previously passing the full parsed CLI namespace into stage config resolution. That namespace contains both:
- explicitly provided CLI args
- parser default values
As a result, fields that were not explicitly passed by the user could still participate in config merge.
One concrete example is `distributed_executor_backend`:
- the model's default stage config may set `distributed_executor_backend: "mp"`
- but the upstream vLLM parser can provide `distributed_executor_backend = None`
- if this full parsed args object is treated as override input, the YAML value `"mp"` can be overwritten by `None`

That is why the executor eventually sees:

```
Unknown distributed executor backend: None
```
I'm trying to fix this in #2655.
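The failure mode described above can be reproduced with plain `argparse` and a dict merge. This is a sketch, not the actual stage-config resolution code:

```python
import argparse

parser = argparse.ArgumentParser()
# Mirrors the upstream parser: the option defaults to None when unset.
parser.add_argument("--distributed-executor-backend", default=None)
args = parser.parse_args([])  # user passed nothing explicitly

stage_yaml = {"distributed_executor_backend": "mp"}

# Bug: merging the full parsed namespace lets the parser default (None)
# clobber the yaml value.
merged = {**stage_yaml, **vars(args)}
print(merged["distributed_executor_backend"])  # None

# Safer: only merge keys the caller actually set (non-None here).
explicit = {k: v for k, v in vars(args).items() if v is not None}
merged = {**stage_yaml, **explicit}
print(merged["distributed_executor_backend"])  # mp
```

Filtering on `None` is only an approximation; it breaks for options where `None` is a legitimate explicit value, which is why tracking "explicitly provided" separately (as discussed below the parser) is more robust.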
Thank you for your response! I believe one effective solution would be to manage the argument parser’s default values within the vllm-omni diffusion engine/worker. Alternatively, we could create our own argument parser from scratch.
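One concrete way to manage the parser's defaults is `argparse.SUPPRESS`, which keeps options the user never passed out of the namespace entirely, so they cannot shadow stage-yaml values during a merge. A sketch:

```python
import argparse

parser = argparse.ArgumentParser()
# With SUPPRESS as the default, options the user does not pass never
# appear in the namespace at all.
parser.add_argument("--distributed-executor-backend", default=argparse.SUPPRESS)

args = vars(parser.parse_args([]))
print(args)  # {} — nothing to accidentally merge over the yaml

args = vars(parser.parse_args(["--distributed-executor-backend", "ray"]))
print(args)  # only the explicitly provided value survives
```

The engine/worker can then apply its own defaults after the yaml merge, instead of letting the parser bake them into every parsed namespace.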
… configs (vllm-project#2076) Signed-off-by: xiaohajiayou <923390377@qq.com> Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com> Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Purpose
Fixes #2075.
This PR fixes config precedence for stage-config based HunyuanImage3 runs so that the caller-provided `parallel_config` remains authoritative over the stage-yaml `parallel_config`. Without this change, the stage yaml can override runtime EP settings, which makes it unreliable to toggle between:
- `enable_expert_parallel=False`
- `enable_expert_parallel=True`

Test command:

```shell
export CUDA_VISIBLE_DEVICES=3,5,6,7
pytest tests/e2e/offline_inference/test_expert_parallel.py -v -s
```

Test Result
The existing expert-parallel e2e test now runs successfully on 4x A100 with an explicit HunyuanImage3 stage config.