[Test] Fix HunyuanImage3 nightly perf startup by TaffyOfficial · Pull Request #3819 · vllm-project/vllm-omni

TaffyOfficial · 2026-05-22T09:49:51Z

Purpose

Fix the HunyuanImage3 nightly perf path so the run-nightly script executes every selected pytest command, starts HunyuanImage3 perf with the intended DiT-only deploy topology, and survives an offline cache that lacks generation_config.json.

What Was Broken

tools/nightly/run_nightly_jobs.sh generated one shell job from a Buildkite step, but some Buildkite steps contain multiple pytest commands. With set -e, the first failing pytest stopped the job and later Hunyuan variants could be skipped.
The HunyuanImage3 perf JSONs launched tencent/HunyuanImage-3.0-Instruct without the DiT-only deploy config. In the single-server perf topology, async chunking is invalid because there is no next-stage input processor.
HunyuanImage3 requires Hugging Face remote code, so the perf server args need --trust-remote-code.
In offline cache, HunyuanImage3Pipeline.__init__ could fail before loading weights when the snapshot had config.json and tokenizer files but no generation_config.json.

How This PR Fixes It

Preserve all pytest commands generated from one Buildkite step and return the combined failure status after every command has run.
Use the existing hunyuan_image3_dit.yaml deploy config for all three HunyuanImage3 perf cases, then keep only the per-case CLI overrides for TP/SP/CFG/quantization/profiler/trust-remote-code.
Resolve perf deploy configs from the repository vllm_omni/deploy directory so stage_config_name can point at the shared Hunyuan deploy YAML.
Add a HunyuanImage3 generation-config loader that preserves real hub/snapshot values when available, but logs a warning and fills the bundled Hunyuan defaults when loading generation config fails in offline cache.

This PR does not change perf baselines or model execution logic. It only fixes nightly job execution and startup metadata/configuration needed for HunyuanImage3 perf startup.
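
The "run every command, then report a combined status" behavior described above can be sketched as follows. This is a minimal Python illustration, not the actual Bash logic in tools/nightly/run_nightly_jobs.sh; the function name `run_all` and the sample commands are hypothetical.

```python
import subprocess
import sys


def run_all(commands):
    """Run every command, even after a failure, and return a combined status."""
    overall_status = 0
    for cmd in commands:
        result = subprocess.run(cmd)
        # OR the exit codes together so any non-zero exit marks the job as failed,
        # while later commands still get their chance to run.
        overall_status |= result.returncode
    return overall_status


# Hypothetical stand-ins for two pytest commands from one Buildkite step.
failing = [sys.executable, "-c", "raise SystemExit(1)"]
passing = [sys.executable, "-c", "print('second command still runs')"]

status = run_all([failing, passing])
print("combined status is non-zero:", status != 0)
```

Unlike a script running under `set -e`, the loop never exits early, so a failure in the first variant cannot hide the result of a later one.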

Test Plan

git diff --check
python -m json.tool tests/dfx/perf/tests/test_hunyuan_image_tp4_fp8.json
python -m json.tool tests/dfx/perf/tests/test_hunyuan_image_tp2_fp8_sp2.json
python -m json.tool tests/dfx/perf/tests/test_hunyuan_image_tp2_fp8_cfgp2.json
python -m ruff check vllm_omni/diffusion/models/hunyuan_image3/pipeline_hunyuan_image3.py tests/diffusion/models/hunyuan_image3/test_generation_config.py tests/dfx/perf/scripts/run_benchmark.py
python -m ruff format --check --diff vllm_omni/diffusion/models/hunyuan_image3/pipeline_hunyuan_image3.py tests/diffusion/models/hunyuan_image3/test_generation_config.py tests/dfx/perf/scripts/run_benchmark.py
bash -n tools/nightly/run_nightly_jobs.sh
PYTHONPATH=/data/wzr/wt-nightly-pr3582-tests-codex python3 -m pytest tests/diffusion/models/hunyuan_image3/test_generation_config.py -q
bash tools/nightly/run_nightly_jobs.sh --test-type perf --model-type diffusion --label-substr HunyuanImage3

Remote validation used only /data/wzr for persistent files, with:

HF_HUB_OFFLINE=1
TRANSFORMERS_OFFLINE=1
CUDA_VISIBLE_DEVICES=2,3,4,5

Test Result

Local/static checks:

git diff --check: passed
json.tool for all three Hunyuan perf JSONs: passed
Hunyuan deploy path resolution check: passed
ruff check: All checks passed!
ruff format --check --diff: 3 files already formatted
bash -n tools/nightly/run_nightly_jobs.sh: passed

Local Windows pytest collection is blocked because the full vllm package is not installed in this environment:

ModuleNotFoundError: No module named 'vllm'

Remote unit test before the latest local-path fallback case:

PYTHONPATH=/data/wzr/wt-nightly-pr3582-tests-codex python3 -m pytest tests/diffusion/models/hunyuan_image3/test_generation_config.py -q
..                                                                       [100%]
2 passed, 18 warnings

Remote Hunyuan nightly startup evidence:

Before this PR:
1. failed on async_chunk=True with no next-stage input processor
2. after disabling async chunk, failed because trust_remote_code=True was missing
3. after adding trust_remote_code, failed because generation_config.json was missing in offline cache

After this PR:
1. server args include trust_remote_code=True
2. all three Hunyuan perf configs use the DiT-only deploy config with async_chunk=false
3. the missing generation_config.json failure is gone
4. TP4 proceeds to diffusion weight loading

The full Hunyuan nightly still cannot finish on the tested server because the model cache itself is incomplete there. /data/wzr has no HunyuanImage3 weight files, and the available cached snapshot contains only config/tokenizer files. After this PR, the remaining blocker is:

RuntimeError: Cannot find any model weights with `tencent/HunyuanImage-3.0-Instruct`

That final error is expected for this server state: the code now reaches weight loading, but the required model weights are not present in the allowed /data/wzr path.

TaffyOfficial · 2026-05-22T09:57:00Z

@Bounty-hunter

hsliuustc0106

The nightly script change to preserve all pytest commands per Buildkite step is correct. Using bitwise OR for overall_status ensures any failure propagates. The generation_config fallback with warning is a sensible offline cache workaround. The test coverage for both paths is sufficient.

Bounty-hunter · 2026-05-25T01:28:53Z

The changes are too complicated. Could we just specify the deploy config hunyuan_image3_dit.yaml and overwrite fields with the CLI?

TaffyOfficial · 2026-05-26T02:31:08Z

@Bounty-hunter Thanks, updated in that direction.

The three HunyuanImage3 perf JSONs now use hunyuan_image3_dit.yaml as the deploy base, and only keep the per-case CLI overrides for TP/SP/CFG/quantization/profiler/trust-remote-code.

I kept two pieces separate because they are outside the deploy-config simplification:

run_nightly_jobs.sh still needs to execute every pytest command from one Buildkite step before returning the combined status.
The generation-config fallback still handles the offline cache case where config/tokenizer files exist but generation_config.json is missing. I also narrowed the fallback path so local snapshots can use _name_or_path from hf_config instead of relying only on the model path string.
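
The fallback idea described above can be sketched roughly as follows. This is a simplified illustration using plain JSON files rather than the Hugging Face loading path the PR actually touches; `load_generation_config` and `FALLBACK_GENERATION_CONFIG` are hypothetical names, and the fallback values are placeholders, not the bundled Hunyuan defaults.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Placeholder values only; the real bundled defaults are defined in the PR.
FALLBACK_GENERATION_CONFIG = {"example_key": "example_value"}


def load_generation_config(snapshot_dir, defaults=FALLBACK_GENERATION_CONFIG):
    """Prefer the snapshot's generation_config.json; fall back with a warning."""
    path = Path(snapshot_dir) / "generation_config.json"
    try:
        with path.open(encoding="utf-8") as f:
            # Real snapshot values win whenever the file is present and valid.
            return json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        logger.warning("Could not load %s (%s); using fallback defaults", path, exc)
        # Return a copy so callers cannot mutate the shared defaults.
        return dict(defaults)
```

The point of this shape is that a snapshot containing only config/tokenizer files no longer aborts startup, while a complete snapshot behaves exactly as before.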

Signed-off-by: TaffyOfficial <2324465096@qq.com>

Bounty-hunter · 2026-05-27T07:55:27Z

Paste the execution results and compare their execution times.

Bounty-hunter · 2026-05-27T07:57:05Z

        "server_type": "vllm-omni",
        "server_params": {
            "model": "tencent/HunyuanImage-3.0-Instruct",
+            "stage_overrides": {


There is a bug (#3483) where stage_overrides does not overwrite fields correctly; please check it.

TaffyOfficial · 2026-05-27

Thanks, pasted the comparable execution numbers below.

For the current HunyuanImage3 perf JSON baselines, all use 1024x1024, 50 steps, 10 prompts, max_concurrency=1:

config          throughput_qps  latency_p99  peak_memory_mb_max
tp4_fp8         0.0800          13.1227s     46838
tp2_fp8_sp2     0.0800          12.0731s     66314
tp2_fp8_cfgp2   0.1035          9.9057s      66470

Compared with tp4_fp8:

tp2_fp8_sp2 has 8.0% lower p99 latency, same recorded throughput, and about 41.6% higher peak memory.

tp2_fp8_cfgp2 has 24.5% lower p99 latency, 29.4% higher throughput, and about 41.9% higher peak memory.
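
For anyone who wants to re-derive the relative figures, the arithmetic is a plain percentage change against the tp4_fp8 row. The snippet below is a small sketch using only the numbers reported in this thread; the helper name `pct_change` is made up for illustration.

```python
# Reported baselines from this thread (1024x1024, 50 steps, 10 prompts).
results = {
    "tp4_fp8": {"qps": 0.0800, "p99_s": 13.1227, "peak_mb": 46838},
    "tp2_fp8_sp2": {"qps": 0.0800, "p99_s": 12.0731, "peak_mb": 66314},
    "tp2_fp8_cfgp2": {"qps": 0.1035, "p99_s": 9.9057, "peak_mb": 66470},
}


def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


baseline = results["tp4_fp8"]
for name, row in results.items():
    if name == "tp4_fp8":
        continue
    # Negative latency change means lower (better) p99 than the baseline.
    print(
        f"{name}: "
        f"p99 {pct_change(row['p99_s'], baseline['p99_s']):+.1f}%, "
        f"qps {pct_change(row['qps'], baseline['qps']):+.1f}%, "
        f"peak memory {pct_change(row['peak_mb'], baseline['peak_mb']):+.1f}%"
    )
```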

I also checked the #3483 override issue. The broken path there is flat diffusion parallel overrides being left as top-level engine_args fields while diffusion reads engine_args.parallel_config. This PR is not using that flat path for the TP2 cases: it passes a full nested parallel_config through stage_overrides, and a local materialization check gives parallel_config.tensor_parallel_size=2 / parallel_config.cfg_parallel_size=2 with no flat tensor_parallel_size left in engine_args.

If we want to use flat CLI flags instead, then this PR should wait for or rebase on #3483.
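
The difference between the nested and flat override paths can be illustrated with a generic recursive merge. This is only a sketch under stated assumptions: `deep_merge` is a hypothetical helper, not vllm-omni's actual override code, and the base engine_args values are invented for the example.

```python
import copy


def deep_merge(base, overrides):
    """Recursively merge overrides into a copy of base (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Invented base values standing in for a deploy YAML's engine_args.
base_engine_args = {
    "parallel_config": {"tensor_parallel_size": 1, "cfg_parallel_size": 1},
}

# Nested override: lands inside parallel_config, where diffusion reads it.
nested = {"parallel_config": {"tensor_parallel_size": 2, "cfg_parallel_size": 2}}
nested_result = deep_merge(base_engine_args, nested)

# Flat override: stays a top-level field and leaves parallel_config untouched.
flat = {"tensor_parallel_size": 2}
flat_result = deep_merge(base_engine_args, flat)

print(nested_result)
print(flat_result)
```

With a merge of this shape, only the nested form changes the value the consumer actually reads, which mirrors the distinction drawn above between the nested stage_overrides path and the flat path tracked in #3483.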

Bounty-hunter · 2026-05-28T06:52:54Z

LGTM

TaffyOfficial · 2026-05-28T07:17:11Z

@hsliuustc0106 @Gaohan123

Bounty-hunter · 2026-05-28T15:30:03Z

Running it directly results in an error. It seems like the config path needs to be changed to an absolute path.
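
One way to make the config path independent of the working directory is to resolve it against the repository's vllm_omni/deploy directory and return an absolute path. The sketch below assumes that layout; `resolve_deploy_config` and the `repo_root` argument are hypothetical and do not reflect the actual code in run_benchmark.py.

```python
from pathlib import Path


def resolve_deploy_config(stage_config_name, repo_root):
    """Return an absolute path for a deploy config name (hypothetical helper)."""
    candidate = Path(stage_config_name)
    if candidate.is_absolute():
        # Already absolute: use it unchanged.
        return candidate
    # Otherwise anchor the name at <repo_root>/vllm_omni/deploy.
    return (Path(repo_root) / "vllm_omni" / "deploy" / candidate).resolve()


# Example with a made-up checkout location.
print(resolve_deploy_config("hunyuan_image3_dit.yaml", "/tmp/example-repo"))
```

Resolving once, up front, means the server launch no longer depends on which directory the nightly script happens to be started from.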

Signed-off-by: TaffyOfficial <2324465096@qq.com>

TaffyOfficial · 2026-05-29T03:17:54Z

@Bounty-hunter @congw729 fix

congw729 · 2026-05-29T07:36:48Z

> @Bounty-hunter @congw729 fix

Thanks. I will try this tonight.

Signed-off-by: TaffyOfficial <2324465096@qq.com>

Bounty-hunter · 2026-05-30T03:07:18Z

This change is not correct; try with #3996.

TaffyOfficial marked this pull request as ready for review May 22, 2026 09:53

TaffyOfficial requested review from Gaohan123, Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, hsliuustc0106, princepride, wtomin, yenuo26 and ywang96 as code owners May 22, 2026 09:53

hsliuustc0106 reviewed May 22, 2026

TaffyOfficial force-pushed the codex/nightly-pr3582-tests branch from 67d8407 to 48ed2d8 Compare May 26, 2026 02:29

TaffyOfficial force-pushed the codex/nightly-pr3582-tests branch from 48ed2d8 to f51761a Compare May 27, 2026 07:26

[Test] Use HunyuanImage3 DiT deploy config in perf tests

a8e4bb9

Signed-off-by: TaffyOfficial <2324465096@qq.com>

TaffyOfficial force-pushed the codex/nightly-pr3582-tests branch from f51761a to a8e4bb9 Compare May 27, 2026 07:39

Bounty-hunter reviewed May 27, 2026

xiaohajiayou mentioned this pull request May 27, 2026

[BugFix] Fix diffusion parallel_config YAML override and add deploy config field allowlist #3483

Merged

Bounty-hunter approved these changes May 28, 2026

[Test] Resolve HunyuanImage3 perf deploy config path

e387a54

Signed-off-by: TaffyOfficial <2324465096@qq.com>

[Test] Forward trust remote code to Hunyuan tokenizer

d5aac3e

Signed-off-by: TaffyOfficial <2324465096@qq.com>

TaffyOfficial closed this May 30, 2026
