[Test] Fix HunyuanImage3 nightly perf startup #3819
hsliuustc0106
left a comment
The nightly script change to preserve all pytest commands per Buildkite step is correct. Using bitwise OR for overall_status ensures any failure propagates. The generation_config fallback with warning is a sensible offline cache workaround. The test coverage for both paths is sufficient.
|
The changes are too complicated; could we just specify the deploy config? |
67d8407 to 48ed2d8
|
@Bounty-hunter Thanks, updated in that direction. The three HunyuanImage3 perf JSONs now use I kept two pieces separate because they are outside the deploy-config simplification:
|
48ed2d8 to f51761a
Signed-off-by: TaffyOfficial <2324465096@qq.com>
f51761a to a8e4bb9
|
Paste the execution results and compare their execution times. |
"server_type": "vllm-omni",
"server_params": {
  "model": "tencent/HunyuanImage-3.0-Instruct",
  "stage_overrides": {
There is a bug (#3483) where stage_overrides can't override values correctly; please check it.
Thanks, pasted the comparable execution numbers below.
For the current HunyuanImage3 perf JSON baselines, all use 1024x1024, 50 steps, 10 prompts, max_concurrency=1:
| config | throughput_qps | latency_p99 | peak_memory_mb_max |
|---|---|---|---|
| tp4_fp8 | 0.0800 | 13.1227s | 46838 |
| tp2_fp8_sp2 | 0.0800 | 12.0731s | 66314 |
| tp2_fp8_cfgp2 | 0.1035 | 9.9057s | 66470 |
Compared with tp4_fp8:
- tp2_fp8_sp2 has 8.0% lower p99 latency, same recorded throughput, and about 41.6% higher peak memory.
- tp2_fp8_cfgp2 has 24.5% lower p99 latency, 29.4% higher throughput, and about 41.9% higher peak memory.
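For anyone who wants to re-derive the percentages above, here is a minimal Python sketch that recomputes them from the table. The dictionary layout and the helper names `pct_change` and `compare` are illustrative assumptions, not part of the perf tooling.

```python
# Baseline numbers copied from the table above.
BASELINES = {
    "tp4_fp8": {"throughput_qps": 0.0800, "latency_p99_s": 13.1227, "peak_memory_mb": 46838},
    "tp2_fp8_sp2": {"throughput_qps": 0.0800, "latency_p99_s": 12.0731, "peak_memory_mb": 66314},
    "tp2_fp8_cfgp2": {"throughput_qps": 0.1035, "latency_p99_s": 9.9057, "peak_memory_mb": 66470},
}


def pct_change(new: float, ref: float) -> float:
    """Relative change of `new` versus `ref`, in percent."""
    return (new - ref) / ref * 100.0


def compare(name: str, ref: str = "tp4_fp8") -> dict:
    """Percent change of every metric of `name` relative to `ref`, rounded to 0.1."""
    return {
        metric: round(pct_change(BASELINES[name][metric], BASELINES[ref][metric]), 1)
        for metric in BASELINES[ref]
    }


if __name__ == "__main__":
    for config in ("tp2_fp8_sp2", "tp2_fp8_cfgp2"):
        print(config, compare(config))
```

A negative latency value means lower p99 latency than tp4_fp8; positive throughput and memory values mean higher than tp4_fp8.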
I also checked the #3483 override issue. The broken path there is flat diffusion parallel overrides being left as top-level engine_args fields while diffusion reads engine_args.parallel_config. This PR is not using that flat path for the TP2 cases: it passes a full nested parallel_config through stage_overrides, and a local materialization check gives parallel_config.tensor_parallel_size=2 / parallel_config.cfg_parallel_size=2 with no flat tensor_parallel_size left in engine_args.
If we want to use flat CLI flags instead, then this PR should wait for or rebase on #3483.
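To illustrate the difference between the two override shapes, here is a small Python sketch. It is not the vllm-omni materialization code: `deep_merge` is a hypothetical helper and the default values (TP 4, CFG 1) are made up for the example. Only the key names `parallel_config`, `tensor_parallel_size`, and `cfg_parallel_size` come from the discussion above.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` applied, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults; the real deploy YAML values may differ.
default_engine_args = {
    "parallel_config": {"tensor_parallel_size": 4, "cfg_parallel_size": 1},
}

# Nested override: lands inside parallel_config, which is where diffusion reads it.
nested = deep_merge(
    default_engine_args,
    {"parallel_config": {"tensor_parallel_size": 2, "cfg_parallel_size": 2}},
)

# Flat override: stays a top-level field, so parallel_config keeps its old value.
flat = deep_merge(default_engine_args, {"tensor_parallel_size": 2})

if __name__ == "__main__":
    print("nested:", nested)
    print("flat:  ", flat)
```

Under these assumptions, the nested form changes `parallel_config.tensor_parallel_size`, while the flat form leaves a stray top-level `tensor_parallel_size` that a reader of `parallel_config` would never see.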
|
LGTM |
Signed-off-by: TaffyOfficial <2324465096@qq.com>
|
Thanks. I will try this tonight. |
Signed-off-by: TaffyOfficial <2324465096@qq.com>
|
This change is not correct; try with #3996 |

Purpose
Fix the HunyuanImage3 nightly perf path so the run-nightly script executes every selected pytest command, starts HunyuanImage3 perf with the intended DiT-only deploy topology, and survives an offline cache that lacks generation_config.json.

What Was Broken
- tools/nightly/run_nightly_jobs.sh generated one shell job from a Buildkite step, but some Buildkite steps contain multiple pytest commands. With set -e, the first failing pytest stopped the job and later Hunyuan variants could be skipped.
- The HunyuanImage3 perf JSONs launched tencent/HunyuanImage-3.0-Instruct without the DiT-only deploy config. In the single-server perf topology, async chunking is invalid because there is no next-stage input processor.
- HunyuanImage3 requires Hugging Face remote code, so the perf server args need --trust-remote-code.
- In offline cache, HunyuanImage3Pipeline.__init__ could fail before loading weights when the snapshot had config.json and tokenizer files but no generation_config.json.

How This PR Fixes It
- hunyuan_image3_dit.yaml deploy config for all three HunyuanImage3 perf cases, then keep only the per-case CLI overrides for TP/SP/CFG/quantization/profiler/trust-remote-code.
- vllm_omni/deploy directory so stage_config_name can point at the shared Hunyuan deploy YAML.

This PR does not change perf baselines or model execution logic. It only fixes nightly job execution and startup metadata/configuration needed for HunyuanImage3 perf startup.
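The "run every pytest command, still fail the job if any of them failed" behaviour can be sketched as follows. This is a Python analogue of the shell pattern, not the code in tools/nightly/run_nightly_jobs.sh; `run_all` is a hypothetical helper, and the demo commands are stand-ins for real pytest invocations.

```python
import subprocess
import sys


def run_all(commands):
    """Run every command in order, never stopping early on failure.

    Returns (overall_status, statuses). overall_status is the bitwise OR of
    all return codes, so it is zero only when every command returned zero.
    """
    overall_status = 0
    statuses = []
    for cmd in commands:
        result = subprocess.run(cmd)  # no check=True: a failure must not abort the loop
        statuses.append(result.returncode)
        overall_status |= result.returncode
    return overall_status, statuses


if __name__ == "__main__":
    # Stand-ins for the pytest commands of one Buildkite step.
    demo_commands = [
        [sys.executable, "-c", "raise SystemExit(1)"],  # first variant fails
        [sys.executable, "-c", "pass"],                 # later variant still runs
    ]
    overall, per_command = run_all(demo_commands)
    print("per-command:", per_command, "overall:", overall)
    sys.exit(1 if overall else 0)
```

The OR-ed value is only meaningful as zero versus nonzero; the per-command list is what tells you which variant failed.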
Test Plan
Remote validation used only /data/wzr for persistent files, with:

Test Result
Local/static checks:

Local Windows pytest collection is blocked by missing full vllm package in this environment:

Remote unit test before the latest local-path fallback case:

Remote Hunyuan nightly startup evidence:

The full Hunyuan nightly still cannot finish on the tested server because the model cache itself is incomplete there. /data/wzr has no HunyuanImage3 weight files, and the available cached snapshot only contains config/tokenizer files. After this PR, the next and current blocker is:

That final error is expected for this server state: the code now reaches weight loading, but the required model weights are not present in the allowed /data/wzr path.
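For context on how startup can get past a snapshot without generation_config.json and reach weight loading, here is a minimal sketch of a "fall back with a warning" loader. It is an assumption-laden illustration, not the change made to HunyuanImage3Pipeline.__init__: the function name `load_generation_config` and the empty-dict default are hypothetical, and the real code may build its defaults differently.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_generation_config(snapshot_dir) -> dict:
    """Load generation_config.json from a snapshot, or warn and use defaults.

    Hypothetical helper: returns an empty dict when the file is absent so
    that initialization can continue to the weight-loading step.
    """
    path = Path(snapshot_dir) / "generation_config.json"
    if not path.is_file():
        logger.warning("%s not found; falling back to default generation settings", path)
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

With this shape, an offline cache that only holds config.json and tokenizer files produces a warning instead of an exception at this step.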