[Test][HunyuanImage3] GEBench T2I accuracy pytest harness#3055
[Test][HunyuanImage3] GEBench T2I accuracy pytest harness#3055TaffyOfficial wants to merge 12 commits into
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
7f08b94 to
b406553
Compare
b406553 to
fec875f
Compare
|
do you need to load the true weights or just dummy weights |
Our nightly CI runs precision tests (assert overall_mean >= 0.45), which require real weights. Using dummy weights produces random noise, resulting in scores close to 0 and direct assertion failures. |
ad2b1c1 to
8ee36c4
Compare
5c5ee08 to
ef3adc0
Compare
the model loading will take a lot of time, we can upload the CI to main but I do not think L4 will cover it nightly. We may run it every week locally. |
@yenuo26 那还是改成 advanced_model 么 还是怎么说 |
hsliuustc0106
left a comment
There was a problem hiding this comment.
Observations:
- Pipeline registry uses ***=_HF_ARCHS syntax - verify this works
- Multi-GPU fixture derives TP size from device count - good design
- trust_remote_code=True in tokenizer fallback is intentional
- Quantization config normalization is useful
VERDICT: COMMENT
|
Diffusion X2I(&A&T) · GEBench Accuracy Test (HunyuanImage-3.0) Waited 9h 28m |
| ) | ||
|
|
||
| def _build_t2i_scoring_prompt(self, task_prompt: str) -> str: | ||
| return ( |
There was a problem hiding this comment.
It seems the method is useless
There was a problem hiding this comment.
def evaluate(self, *, prompt: str, images: list[Image.Image], t2i_mode: bool = False) -> dict[str, Any]:
build = self._build_t2i_scoring_prompt if t2i_mode else self._build_scoring_prompt
primary_prompt = build(prompt)
So _build_t2i_scoring_prompt IS used when t2i_mode=True. And in GEBenchEvaluator._evaluate_one, for type3/type4 with
self.t2i_only:
raw_scores = self.judge.evaluate(
prompt=(...),
images=[generated],
t2i_mode=True,
)
So it is used. you have missed the t2i_mode=True parameter in the evaluate call.
hsliuustc0106
left a comment
There was a problem hiding this comment.
PR #3055 Review: HunyuanImage-3.0 GEBench accuracy smoke test + supporting infra changes
Summary: This PR adds a nightly Buildkite CI job for GEBench accuracy testing with HunyuanImage-3.0 (4-GPU, TP=4+EP MoE), plus supporting changes across the benchmark harness, test fixtures, pipeline registry, diffusion config, tokenizer, and quantization factory.
Verdict: COMMENT -- request a few small changes before merge
Positives:
- Clean
t2i_onlyflag design -- generation skips non-T2I types early (_generate_onereturnsNone), evaluation uses a dedicated_build_t2i_scoring_promptinstead of trajectory scoring. Good separation. from_kwargsparallel_config auto-construction is clever -- extracting fields that belong toDiffusionParallelConfigfrom flat kwargs prevents silent drops.- Pipeline topology docstring is thorough -- explains why DIT_ONLY is the default (AR->DiT bridge gaps) and what each synthetic suffix means.
extra_generate_argscompletely replacing["--num-gpus", "1"]when provided is the right design -- avoids conflicts between TP flags.
Issues to address:
-
quantization/factory.py:
inspect.signatureis fragile and expensiveUsing
inspect.signatureat instantiation time to filter kwargs couples factory logic to implementation details. If a config class uses**kwargsin__init__, thevalidset will exclude everything. Consider usingdataclasses.fields()if the config classes are dataclasses (which they typically are in vllm), or at minimum add a comment noting this limitation. Thebits->weight_bitsnormalization is fine and useful. -
pipeline.py: verify file is importable
The
gh pr diffoutput shows***=_HF_ARCHSwhich is invalid Python. The actual file content confirmed via API ishf_architectures=_HF_ARCHS(valid), so this is a diff rendering artifact. But worth a quick smoke test (python -c "import vllm_omni.model_executor.models.hunyuan_image3.pipeline") to ensure the file is syntactically importable. -
conftest.py:
extra_generate_argsreplaces--num-gpusentirely -- no safety netWhen
extra_generate_argsis provided,--num-gpusis no longer automatically set. If a caller forgets to include--tensor-parallel-sizeor equivalent GPU flags inextra_args_json, the server could silently start single-GPU. The current code derives TP from device count (len(devices.split(","))) which is correct, but consider adding a defensive comment or assertion that at least one GPU-related flag is present inextra_args. -
tokenizer.py:
trust_remote_code=Trueas first attemptAdding
trust_remote_code=Truebefore the fallback path is a security concern for untrusted models. For an internal CI model this is acceptable, but this file is not gated behind any check. Consider either: (a) catching a narrower exception set and only falling through for the specific case, or (b) documenting whytrust_remote_code=Trueis needed here (custom tokenizer class in the HF checkpoint?). -
Buildkite step: 4-GPU request for a smoke test
Requesting 4x H100 GPUs for a smoke test (4 samples/type) is expensive. Is there a way to run a reduced version (e.g., 2-GPU without expert parallel) as a gate, and leave the 4-GPU EP test as periodic? This is a cost concern, not a correctness one -- maintainer call.
-
gbench.py:
expectedlogic changeThe diff changes
find_first_imagefallback to only trigger whenexpecteddoes not exist (previously it also setexpected = Nonewhenframe5.pngwas missing). This is a behavior change: previously, missingframe5.pngwould skip the sample; now it falls back to another image. This is likely intentional fort2i_onlymode (where onlyframe0.pngexists), but could change behavior for non-t2i runs too. Consider adding a comment explaining the rationale.
| kwargs["weight_bits"] = kwargs.pop("bits") | ||
|
|
||
| # Filter to only params the config class accepts | ||
| valid = set(inspect.signature(config_cls.__init__).parameters) - {"self"} |
There was a problem hiding this comment.
Nit: inspect.signature reflects the runtime callable signature, which means any config class that uses **kwargs in __init__ will have an empty valid set — every kwarg gets silently dropped. If the quant config classes are dataclasses, dataclasses.fields() would be more robust. At minimum, add a comment noting this limitation so future contributors do not get tripped up.
| if isinstance(tokenizer, str): | ||
| self.tokenizer = AutoTokenizer.from_pretrained(tokenizer) | ||
| try: | ||
| self.tokenizer = AutoTokenizer.from_pretrained(tokenizer, trust_remote_code=True) |
There was a problem hiding this comment.
Security note: trust_remote_code=True is tried first for all models, even untrusted ones. Could you either:
- Narrow the exception set to only the specific failure you see with HunyuanImage-3.0, or
- Add a comment explaining why this is needed (e.g., "HunyuanImage-3.0 ships a custom tokenizer class that requires trust_remote_code")
The fallback to PreTrainedTokenizerFast is a nice safety net, but running arbitrary code from the Hub on the first attempt is a broad trust gate.
| if torch.cuda.device_count() < num_devices: | ||
| pytest.skip(f"Need at least {num_devices} CUDA GPUs for this accuracy benchmark.") | ||
|
|
||
| generate_server_args = extra_generate_args if extra_generate_args is not None else ["--num-gpus", "1"] |
There was a problem hiding this comment.
This means when extra_generate_args is provided, --num-gpus is no longer set at all. The caller (in gebench_accuracy_servers) correctly derives --tensor-parallel-size from device count, so this works today. But if a future caller provides extra_generate_args without GPU flags, the server silently starts single-GPU.
Consider a defensive assertion like:
assert any("--tensor-parallel-size" in a or "--num-gpus" in a for a in (extra_generate_args or [])), \
"extra_generate_args must include a GPU allocation flag"Or at minimum a comment warning callers.
| for sample_dir in sorted(path for path in lang_dir.iterdir() if path.is_dir()): | ||
| expected = sample_dir / "frame5.png" if data_type in {"type2", "type3", "type4"} else None | ||
| if expected is None: | ||
| if expected is None or not expected.exists(): |
There was a problem hiding this comment.
Behavior change: previously, a missing frame5.png would set expected = None and skip the sample entirely. Now it falls back to find_first_image(). This is correct for t2i_only mode (where only frame0.png exists), but also changes the behavior for non-t2i runs — samples with missing frame5 but other frames present will now be included instead of skipped.
Could you add a comment here explaining the rationale? Something like:
# t2i_only generates only frame0, so fall back to any available image
# instead of requiring frame5 (which only exists for trajectory tasks).8cab680 to
c61ccd3
Compare
HunyuanImage-3.0-Instruct is a T2I model that cannot do IT2I editing. Without --t2i-only, the test generates a full 6-frame trajectory where frames 1-5 are garbage, causing the judge to score 0.04 instead of 0.45+. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: TaffyOfficial <2324465096@qq.com>
collect_gebench_generation_summary hardcoded frame5.png for type3/type4, but t2i-only mode only generates frame0.png. Fall back to find_first_image when the expected frame doesn't exist. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
6aa1284 to
2fb40bc
Compare
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
9f2d7b1 to
c8fab1f
Compare
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
|
@TaffyOfficial Could the error in CI be reproduced locally? If could, I will remove the ready tag until the error has been fixed locally. Please fix https://buildkite.com/vllm/vllm-omni/builds/9200/canvas first. |
qwen-image 故障,和我这没关系 |
|
这个pr根据会议要求,由于需要4卡,所以从ci下掉,只保留用例 |
|
This error seems to be from Hunyuan-Image. |
Revert .buildkite/test-nightly.yml to origin/main to remove: - New HunyuanImage-3.0 GEBench accuracy job (TP=4+EP, 80B MoE) - "and not resource_heavy" filter on the H100 / L4 diffusion sweeps - Threshold args on the existing Qwen GEBench step Also revert the resource_heavy marker scaffolding (pyproject.toml registration + flux2 / flux_2_dev / sd3 expansion test tags) since it only existed to keep the dropped HunyuanImage-3.0 job from competing with the broad nightly diffusion sweeps. Test cases stay: tests/e2e/accuracy/conftest.py fixture additions, tests/e2e/accuracy/test_gebench_h100_smoke.py CLI options, and benchmarks/accuracy/text_to_image/gbench.py logic. They can be invoked manually until CI is re-enabled. Signed-off-by: TaffyOfficial <2324465096@qq.com>
The previous 8-step default targeted distilled checkpoints (HunyuanImage-3.0-Instruct-Distil); on the full Instruct model 28 steps (the prior buildkite override) was already producing mode-collapse samples (e.g. near-blank frames scoring 5/5 from the judge, masked by the overall mean). HF official default for HunyuanImage-3.0 is 50 steps; align defaults with that. Distilled / fast-sampling models that want fewer steps must now opt in explicitly via --gebench-num-inference-steps / --num-inference-steps. Sites updated: - tests/e2e/accuracy/conftest.py: pytest --gebench-num-inference-steps default - benchmarks/accuracy/text_to_image/gbench.py: GEBenchRunner.__init__ + CLI default Signed-off-by: TaffyOfficial <2324465096@qq.com>
|
@gcanlin update |
Summary
Ship the GEBench T2I accuracy harness for HunyuanImage-3.0-Instruct as a
manually invocable pytest, without wiring it into the nightly buildkite
pipeline. Test cases land in this PR; a follow-up will gate on resource
budget and decide nightly inclusion.
What's in scope
tests/e2e/accuracy/test_gebench_h100_smoke.py: extend the existingGEBench smoke test with HunyuanImage-3.0 fixture params (per-type sample
count, dynamic inference steps, t2i-only gate, multi-GPU stage overrides,
extra server args)
tests/e2e/accuracy/conftest.py: add CLI options--gebench-devices,--gebench-stage-overrides,--gebench-extra-server-args,--gebench-num-inference-steps,--gebench-samples-per-type,--gebench-t2i-only; build multi-GPU OmniServer params from thembenchmarks/accuracy/text_to_image/gbench.py: add--t2i-onlyflag(skips IT2I edits in generate+evaluate; type1/2/5 remain out of scope
until the AR→DiT bridge lands), thread
--num-inference-stepsthroughgenerate / evaluate
What's NOT in scope (reverted vs earlier revisions of this PR)
.buildkite/test-nightly.yml: dropped the new HunyuanImage-3.0 GEBenchstep plus the
not resource_heavyfilter on existing diffusion sweeps.Nightly inclusion will be decided in a follow-up after we agree on
resource budget (4×H100 / 4×L20X / etc.) and quality gates.
pyproject.toml+tests/e2e/online_serving/test_flux*.py+test_sd3_expansion.py: dropped theresource_heavymarker — it onlyexisted to fence flux/sd3 sweeps off from the new HunyuanImage-3.0 step;
with that step gone, the marker has no consumer.
Default step count: 8 → 50
The previous 8-step default targeted the distilled checkpoint
(
HunyuanImage-3.0-Instruct-Distil). Earlier revisions of this PRoverrode it to 28 in the buildkite command, which produced visibly
mode-collapsed samples on the full Instruct model — a near-blank frame
(black background + small white rectangle) was scoring 5/5/5/5/5 from
the Qwen3-VL-30B-A3B-Instruct-AWQ judge because the judge fell back to
"image is well-formed" reasoning when the prompt was abstract.
HF official default for HunyuanImage-3.0 (non-distil) is 50 steps
(
hunyuan_image_3_pipeline.py:692, README L211/254). This PR aligns thefixture/CLI defaults with HF; distil users opt in via
--gebench-num-inference-steps 8/--num-inference-steps 8.Empirical validation (4×L20X 143GB, samples-per-type=2, t2i-only)
50 steps eliminates the two mode-collapse samples and the judge-fooled
near-blank, leaving residual judge variance instead of artifact-driven
floor.
Reproducing locally
4× H100 / L20X (or any 4-GPU node with ≥80GB/card):
HF_HOME=/path/to/hf-cache pytest -s -v tests/e2e/accuracy/test_gebench_h100_smoke.py \ --run-level full_model \ --gebench-model tencent/HunyuanImage-3.0-Instruct \ --accuracy-judge-model QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \ --gebench-devices 0,1,2,3 --accuracy-gpu 0 --gebench-port 8094 \ --gebench-samples-per-type 4 --accuracy-workers 1 --gebench-t2i-only \ --gebench-stage-overrides '{"0":{"devices":"0,1,2,3","enable_expert_parallel":true,"max_num_seqs":1}}' \ --gebench-extra-server-args '["--dtype","bfloat16","--gpu-memory-utilization","0.95","--enforce- eager","--trust-remote-code","--distributed-executor-backend","mp","--no-async-chunk"]' Artifacts (PNG frames + per-sample raw_scores + judge reasoning) land in tests/e2e/accuracy/artifacts/gebench_hunyuanimage-3_0-instruct/. Test plan - Local pytest on 4×L20X 143GB: 1 passed in 7m53s, overall 0.94, no mode-collapse frames - Confirmed --gebench-t2i-only matches PR's DIT_ONLY pipeline topology (no AR stage launched server-side) - Reviewer confirm step-default change (8→50) is acceptable for Qwen-Image-2512 step on main, which currently relies on default 8