[Test][HunyuanImage3] GEBench T2I accuracy pytest harness#3055

Open
TaffyOfficial wants to merge 12 commits into
vllm-project:mainfrom
TaffyOfficial:feat/hunyuan-image3-accuracy-ci

Conversation

@TaffyOfficial
Contributor

@TaffyOfficial TaffyOfficial commented Apr 23, 2026

Summary

Ship the GEBench T2I accuracy harness for HunyuanImage-3.0-Instruct as a
manually invocable pytest, without wiring it into the nightly buildkite
pipeline. Test cases land in this PR; a follow-up will gate on resource
budget and decide nightly inclusion.

What's in scope

  • tests/e2e/accuracy/test_gebench_h100_smoke.py: extend the existing
    GEBench smoke test with HunyuanImage-3.0 fixture params (per-type sample
    count, dynamic inference steps, t2i-only gate, multi-GPU stage overrides,
    extra server args)
  • tests/e2e/accuracy/conftest.py: add CLI options --gebench-devices,
    --gebench-stage-overrides, --gebench-extra-server-args,
    --gebench-num-inference-steps, --gebench-samples-per-type,
    --gebench-t2i-only; build multi-GPU OmniServer params from them
  • benchmarks/accuracy/text_to_image/gbench.py: add --t2i-only flag
    (skips IT2I edits in generate+evaluate; type1/2/5 remain out of scope
    until the AR→DiT bridge lands), thread --num-inference-steps through
    generate / evaluate
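
A minimal sketch of how the new pytest options could be combined into multi-GPU server parameters, assuming the tensor-parallel size is derived from the device list with `len(devices.split(","))` as mentioned in the review further down. The helper name `build_multi_gpu_server_params` and the returned dict layout are hypothetical and are not the PR's actual fixture code.

```python
import json


def build_multi_gpu_server_params(devices, stage_overrides_json=None, extra_server_args_json=None):
    """Hypothetical helper: combine the --gebench-* option values into one dict."""
    device_ids = [d.strip() for d in devices.split(",") if d.strip()]
    if not device_ids:
        raise ValueError("--gebench-devices must list at least one GPU id")

    # Both options carry JSON on the command line; fall back to empty values.
    stage_overrides = json.loads(stage_overrides_json) if stage_overrides_json else {}
    extra_args = json.loads(extra_server_args_json) if extra_server_args_json else []
    if not isinstance(stage_overrides, dict) or not isinstance(extra_args, list):
        raise TypeError("stage overrides must be a JSON object and extra args a JSON list")

    return {
        "num_devices": len(device_ids),
        "stage_overrides": stage_overrides,
        # Tensor-parallel size follows the device count (assumption of this sketch).
        "server_args": ["--tensor-parallel-size", str(len(device_ids)), *extra_args],
    }
```

Rejecting malformed JSON early keeps a typo in a long command line from surfacing only after the server has started loading weights.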

What's NOT in scope (reverted vs earlier revisions of this PR)

  • .buildkite/test-nightly.yml: dropped the new HunyuanImage-3.0 GEBench
    step plus the not resource_heavy filter on existing diffusion sweeps.
    Nightly inclusion will be decided in a follow-up after we agree on
    resource budget (4×H100 / 4×L20X / etc.) and quality gates.
  • pyproject.toml + tests/e2e/online_serving/test_flux*.py +
    test_sd3_expansion.py: dropped the resource_heavy marker — it only
    existed to fence flux/sd3 sweeps off from the new HunyuanImage-3.0 step;
    with that step gone, the marker has no consumer.

Default step count: 8 → 50

The previous 8-step default targeted the distilled checkpoint
(HunyuanImage-3.0-Instruct-Distil). Earlier revisions of this PR
overrode it to 28 in the buildkite command, which produced visibly
mode-collapsed samples on the full Instruct model — a near-blank frame
(black background + small white rectangle) was scoring 5/5/5/5/5 from
the Qwen3-VL-30B-A3B-Instruct-AWQ judge because the judge fell back to
"image is well-formed" reasoning when the prompt was abstract.

HF official default for HunyuanImage-3.0 (non-distil) is 50 steps
(hunyuan_image_3_pipeline.py:692, README L211/254). This PR aligns the
fixture/CLI defaults with HF; distil users opt in via
--gebench-num-inference-steps 8 / --num-inference-steps 8.
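
As a rough illustration of the opt-in behaviour described above, the following sketch uses plain `argparse`. The flag names come from this PR; the parser itself (`build_parser`, the `gbench-sketch` program name) is a stand-in and not the real gbench.py CLI.

```python
import argparse

# Default aligned with the non-distilled model, as described above.
DEFAULT_NUM_INFERENCE_STEPS = 50


def build_parser():
    """Stand-in parser showing the 50-step default and the explicit opt-in."""
    parser = argparse.ArgumentParser(prog="gbench-sketch")
    parser.add_argument(
        "--num-inference-steps",
        type=int,
        default=DEFAULT_NUM_INFERENCE_STEPS,
        help="Distilled checkpoints pass a smaller value explicitly, e.g. 8.",
    )
    parser.add_argument(
        "--t2i-only",
        action="store_true",
        help="Skip IT2I edits in generate and evaluate.",
    )
    return parser
```

With no arguments the parsed step count is 50; passing `--num-inference-steps 8` reproduces the old distilled default.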

Empirical validation (4×L20X 143GB, samples-per-type=2, t2i-only)

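|              | 28 steps                       | 50 steps                     |
| ------------ | ------------------------------ | ---------------------------- |
| overall_mean | 0.72                           | 0.94                         |
| type3_0001   | 0.44 (blurry, illegible)       | 1.00                         |
| type3_0002   | 1.00                           | 1.00                         |
| type4_0001   | 0.44 (blurry)                  | 1.00                         |
| type4_0002   | 1.00 (judge-fooled near-blank) | 0.76 (legible UI, mild blur) |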

Running 50 steps eliminates the two mode-collapse samples and the judge-fooled
near-blank frame, leaving only residual judge variance instead of an
artifact-driven score floor.
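
The overall_mean values can be cross-checked from the per-sample scores. This sketch assumes overall_mean is the unweighted mean of the four per-sample scores, which is consistent with the reported numbers but is not confirmed against the harness code.

```python
from statistics import mean

# Per-sample scores copied from the validation results above.
SCORES = {
    "28 steps": {"type3_0001": 0.44, "type3_0002": 1.00, "type4_0001": 0.44, "type4_0002": 1.00},
    "50 steps": {"type3_0001": 1.00, "type3_0002": 1.00, "type4_0001": 1.00, "type4_0002": 0.76},
}


def overall_mean(per_sample):
    """Unweighted mean over samples, rounded to two decimals (assumption)."""
    return round(mean(per_sample.values()), 2)


for label, per_sample in SCORES.items():
    print(label, overall_mean(per_sample))
```

Note that the 28-step mean of 0.72 includes the judge-fooled 1.00 on type4_0002, so the real quality gap between the two settings is larger than the means suggest.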

Reproducing locally

4× H100 / L20X (or any 4-GPU node with ≥80GB/card):

HF_HOME=/path/to/hf-cache pytest -s -v tests/e2e/accuracy/test_gebench_h100_smoke.py \
    --run-level full_model \
    --gebench-model tencent/HunyuanImage-3.0-Instruct \
    --accuracy-judge-model QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
    --gebench-devices 0,1,2,3 --accuracy-gpu 0 --gebench-port 8094 \
    --gebench-samples-per-type 4 --accuracy-workers 1 --gebench-t2i-only \
    --gebench-stage-overrides '{"0":{"devices":"0,1,2,3","enable_expert_parallel":true,"max_num_seqs":1}}' \
    --gebench-extra-server-args '["--dtype","bfloat16","--gpu-memory-utilization","0.95","--enforce-eager","--trust-remote-code","--distributed-executor-backend","mp","--no-async-chunk"]'

Artifacts (PNG frames + per-sample raw_scores + judge reasoning) land
in tests/e2e/accuracy/artifacts/gebench_hunyuanimage-3_0-instruct/.

Test plan

- Local pytest on 4×L20X 143GB: 1 passed in 7m53s, overall 0.94, no
mode-collapse frames
- Confirmed --gebench-t2i-only matches PR's DIT_ONLY pipeline
topology (no AR stage launched server-side)
- Reviewer to confirm that the step-default change (8→50) is acceptable for the
Qwen-Image-2512 step on main, which currently relies on the default of 8

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@TaffyOfficial TaffyOfficial changed the title ci: add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline [ci]add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026
@TaffyOfficial TaffyOfficial changed the title [ci]add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026
@TaffyOfficial TaffyOfficial changed the title [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline [wip] [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 7f08b94 to b406553 Compare April 23, 2026 06:15
@TaffyOfficial TaffyOfficial changed the title [wip] [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from b406553 to fec875f Compare April 23, 2026 07:06
@hsliuustc0106
Collaborator

Do you need to load the real weights, or just dummy weights?

@TaffyOfficial
Contributor Author

> Do you need to load the real weights, or just dummy weights?

Our nightly CI runs precision tests (assert overall_mean >= 0.45), which require real weights. Using dummy weights produces random noise, resulting in scores close to 0 and direct assertion failures.

@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from ad2b1c1 to 8ee36c4 Compare April 23, 2026 07:53
Comment thread .buildkite/test-nightly.yml Outdated
@yenuo26 yenuo26 added the diffusion-x2iat-test label (label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI) Apr 23, 2026
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 5c5ee08 to ef3adc0 Compare April 23, 2026 08:12
@TaffyOfficial TaffyOfficial requested a review from yenuo26 April 23, 2026 08:15
@hsliuustc0106
Collaborator

> > Do you need to load the real weights, or just dummy weights?
>
> Our nightly CI runs precision tests (assert overall_mean >= 0.45), which require real weights. Using dummy weights produces random noise, resulting in scores close to 0 and direct assertion failures.

Model loading will take a lot of time. We can upload the CI to main, but I do not think L4 will cover it nightly; we may run it locally every week.

@TaffyOfficial
Contributor Author

> > > Do you need to load the real weights, or just dummy weights?
> >
> > Our nightly CI runs precision tests (assert overall_mean >= 0.45), which require real weights. Using dummy weights produces random noise, resulting in scores close to 0 and direct assertion failures.
>
> Model loading will take a lot of time. We can upload the CI to main, but I do not think L4 will cover it nightly; we may run it locally every week.

@yenuo26 So should we change it to advanced_model after all, or what do you suggest?

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Observations:

  • Pipeline registry uses ***=_HF_ARCHS syntax - verify this works
  • Multi-GPU fixture derives TP size from device count - good design
  • trust_remote_code=True in tokenizer fallback is intentional
  • Quantization config normalization is useful

VERDICT: COMMENT

@TaffyOfficial
Contributor Author

TaffyOfficial commented Apr 24, 2026

Diffusion X2I(&A&T) · GEBench Accuracy Test (HunyuanImage-3.0)

export VLLM_TEST_CLEAN_GPU_MEMORY="1" && pytest -s -v tests/e2e/accuracy/test_gebench_h100_smoke.py --run-level full_model --gebench-model tencent/HunyuanImage-3.0-Instruct --accuracy-judge-model QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ --gebench-devices 0,1,2,3 --gebench-port 8094 --accuracy-gpu 0 --gebench-samples-per-type 4 --gebench-num-inference-steps 28 --accuracy-workers 1 --gebench-t2i-only --gebench-stage-overrides '{"0":{"devices":"0,1,2,3","enable_expert_parallel":true,"max_num_seqs":1}}' --gebench-extra-server-args '["--dtype","bfloat16","--gpu-memory-utilization","0.95","--enforce-eager","--trust-remote-code","--distributed-executor-backend","mp","--no-async-chunk"]' && buildkite-agent artifact upload "tests/e2e/accuracy/artifacts/gebench_hunyuanimage-3_0-instruct/summary*.json"

Waited 9h 28m
·
Ran in 7m 57s

@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 24, 2026
@yenuo26 yenuo26 added the ready label (label to trigger buildkite CI) and removed the diffusion-x2iat-test label (label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI) Apr 24, 2026
    )

    def _build_t2i_scoring_prompt(self, task_prompt: str) -> str:
        return (
Collaborator

It seems this method is unused.

Contributor Author

@TaffyOfficial TaffyOfficial Apr 24, 2026

  def evaluate(self, *, prompt: str, images: list[Image.Image], t2i_mode: bool = False) -> dict[str, Any]:
      build = self._build_t2i_scoring_prompt if t2i_mode else self._build_scoring_prompt
      primary_prompt = build(prompt)

So _build_t2i_scoring_prompt IS used when t2i_mode=True. And in GEBenchEvaluator._evaluate_one, for type3/type4 with
self.t2i_only:

  raw_scores = self.judge.evaluate(
      prompt=(...),
      images=[generated],
      t2i_mode=True,
  )

So it is used; you missed the t2i_mode=True argument in the evaluate call.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

PR #3055 Review: HunyuanImage-3.0 GEBench accuracy smoke test + supporting infra changes

Summary: This PR adds a nightly Buildkite CI job for GEBench accuracy testing with HunyuanImage-3.0 (4-GPU, TP=4+EP MoE), plus supporting changes across the benchmark harness, test fixtures, pipeline registry, diffusion config, tokenizer, and quantization factory.

Verdict: COMMENT -- request a few small changes before merge


Positives:

  • Clean t2i_only flag design -- generation skips non-T2I types early (_generate_one returns None), evaluation uses a dedicated _build_t2i_scoring_prompt instead of trajectory scoring. Good separation.
  • from_kwargs parallel_config auto-construction is clever -- extracting fields that belong to DiffusionParallelConfig from flat kwargs prevents silent drops.
  • Pipeline topology docstring is thorough -- explains why DIT_ONLY is the default (AR->DiT bridge gaps) and what each synthetic suffix means.
  • extra_generate_args completely replacing ["--num-gpus", "1"] when provided is the right design -- avoids conflicts between TP flags.
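
The t2i_only gate praised in the first point can be sketched roughly as follows. It assumes type3 and type4 are the T2I types (the two types the smoke test exercises), that t2i-only mode writes only frame0.png, and that a full trajectory has six frames (frame0 through frame5). `plan_frames` is a hypothetical name; returning `None` mirrors how `_generate_one` is described as skipping non-T2I types.

```python
# Assumption: the two types exercised by the smoke test are the T2I types.
T2I_TYPES = frozenset({"type3", "type4"})


def plan_frames(data_type, t2i_only, trajectory_length=6):
    """Return the frame files to generate for one sample, or None to skip it."""
    if t2i_only:
        if data_type not in T2I_TYPES:
            # Non-T2I types are skipped early in t2i-only mode.
            return None
        # T2I-only generation produces a single image.
        return ["frame0.png"]
    # Full trajectory: frame0.png through frame5.png.
    return [f"frame{i}.png" for i in range(trajectory_length)]
```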

Issues to address:

  1. quantization/factory.py: inspect.signature is fragile and expensive

    Using inspect.signature at instantiation time to filter kwargs couples factory logic to implementation details. If a config class uses **kwargs in __init__, the valid set will exclude everything. Consider using dataclasses.fields() if the config classes are dataclasses (which they typically are in vllm), or at minimum add a comment noting this limitation. The bits -> weight_bits normalization is fine and useful.

  2. pipeline.py: verify file is importable

    The gh pr diff output shows ***=_HF_ARCHS which is invalid Python. The actual file content confirmed via API is hf_architectures=_HF_ARCHS (valid), so this is a diff rendering artifact. But worth a quick smoke test (python -c "import vllm_omni.model_executor.models.hunyuan_image3.pipeline") to ensure the file is syntactically importable.

  3. conftest.py: extra_generate_args replaces --num-gpus entirely -- no safety net

    When extra_generate_args is provided, --num-gpus is no longer automatically set. If a caller forgets to include --tensor-parallel-size or equivalent GPU flags in extra_args_json, the server could silently start single-GPU. The current code derives TP from device count (len(devices.split(","))) which is correct, but consider adding a defensive comment or assertion that at least one GPU-related flag is present in extra_args.

  4. tokenizer.py: trust_remote_code=True as first attempt

    Adding trust_remote_code=True before the fallback path is a security concern for untrusted models. For an internal CI model this is acceptable, but this file is not gated behind any check. Consider either: (a) catching a narrower exception set and only falling through for the specific case, or (b) documenting why trust_remote_code=True is needed here (custom tokenizer class in the HF checkpoint?).

  5. Buildkite step: 4-GPU request for a smoke test

    Requesting 4x H100 GPUs for a smoke test (4 samples/type) is expensive. Is there a way to run a reduced version (e.g., 2-GPU without expert parallel) as a gate, and leave the 4-GPU EP test as periodic? This is a cost concern, not a correctness one -- maintainer call.

  6. gbench.py: expected logic change

    The diff changes find_first_image fallback to only trigger when expected does not exist (previously it also set expected = None when frame5.png was missing). This is a behavior change: previously, missing frame5.png would skip the sample; now it falls back to another image. This is likely intentional for t2i_only mode (where only frame0.png exists), but could change behavior for non-t2i runs too. Consider adding a comment explaining the rationale.

Comment thread vllm_omni/quantization/factory.py Outdated
kwargs["weight_bits"] = kwargs.pop("bits")

# Filter to only params the config class accepts
valid = set(inspect.signature(config_cls.__init__).parameters) - {"self"}
Collaborator

Nit: inspect.signature reflects the runtime callable signature, so for any config class that uses **kwargs in __init__ the valid set contains only the catch-all parameter name (e.g. kwargs), and every real kwarg gets silently dropped. If the quant config classes are dataclasses, dataclasses.fields() would be more robust. At minimum, add a comment noting this limitation so future contributors do not get tripped up.
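
A small sketch of the pitfall and one possible mitigation. Both config classes are made-up stand-ins rather than real vllm quantization configs, and `filter_kwargs` is a hypothetical helper, not the factory code in this PR.

```python
import dataclasses
import inspect


@dataclasses.dataclass
class DataclassQuantConfig:  # hypothetical stand-in
    weight_bits: int = 8
    group_size: int = 128


class KwargsQuantConfig:  # hypothetical stand-in with a catch-all __init__
    def __init__(self, **kwargs):
        self.options = dict(kwargs)


def accepted_by_signature(config_cls):
    """The filtering approach under review: parameter names minus self."""
    return set(inspect.signature(config_cls.__init__).parameters) - {"self"}


def filter_kwargs(config_cls, kwargs):
    """Filter kwargs while tolerating catch-all constructors."""
    kwargs = dict(kwargs)
    if "bits" in kwargs:
        kwargs["weight_bits"] = kwargs.pop("bits")
    params = inspect.signature(config_cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        # A **kwargs constructor accepts anything, so do not filter.
        return kwargs
    if dataclasses.is_dataclass(config_cls):
        valid = {f.name for f in dataclasses.fields(config_cls) if f.init}
    else:
        valid = set(params) - {"self"}
    return {k: v for k, v in kwargs.items() if k in valid}
```

For `KwargsQuantConfig`, `accepted_by_signature` yields only the name `kwargs`, which is why a naive filter drops every real option.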

  if isinstance(tokenizer, str):
-     self.tokenizer = AutoTokenizer.from_pretrained(tokenizer)
+     try:
+         self.tokenizer = AutoTokenizer.from_pretrained(tokenizer, trust_remote_code=True)
Collaborator

Security note: trust_remote_code=True is tried first for all models, even untrusted ones. Could you either:

  1. Narrow the exception set to only the specific failure you see with HunyuanImage-3.0, or
  2. Add a comment explaining why this is needed (e.g., "HunyuanImage-3.0 ships a custom tokenizer class that requires trust_remote_code")

The fallback to PreTrainedTokenizerFast is a nice safety net, but running arbitrary code from the Hub on the first attempt is a broad trust gate.
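
One way to narrow the trust gate is to enable remote code only for an explicit allowlist. The sketch below is illustrative only: `TRUSTED_REMOTE_CODE_PREFIXES` and `load_tokenizer` are hypothetical names, and the loader is injected so the example runs without transformers installed.

```python
# Hypothetical allowlist of repo-id prefixes whose custom tokenizer code is trusted.
TRUSTED_REMOTE_CODE_PREFIXES = ("tencent/HunyuanImage-3.0",)


def load_tokenizer(name, loader, trusted_prefixes=TRUSTED_REMOTE_CODE_PREFIXES):
    """Call loader(name, trust_remote_code=...) with remote code enabled only for allowlisted repos."""
    trust = name.startswith(tuple(trusted_prefixes))
    return loader(name, trust_remote_code=trust)
```

In real use, `loader` would be `AutoTokenizer.from_pretrained` from transformers, which accepts a `trust_remote_code` keyword.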

if torch.cuda.device_count() < num_devices:
    pytest.skip(f"Need at least {num_devices} CUDA GPUs for this accuracy benchmark.")

generate_server_args = extra_generate_args if extra_generate_args is not None else ["--num-gpus", "1"]
Collaborator

This means when extra_generate_args is provided, --num-gpus is no longer set at all. The caller (in gebench_accuracy_servers) correctly derives --tensor-parallel-size from device count, so this works today. But if a future caller provides extra_generate_args without GPU flags, the server silently starts single-GPU.

Consider a defensive assertion like:

assert any("--tensor-parallel-size" in a or "--num-gpus" in a for a in (extra_generate_args or [])), \
    "extra_generate_args must include a GPU allocation flag"

Or at minimum a comment warning callers.
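
A variant of that guard, sketched under the assumption that `--tensor-parallel-size` and `--num-gpus` are the only GPU allocation flags of interest. It matches flags exactly (or in `--flag=value` form) instead of by substring, and raises `ValueError` so the check still runs under `python -O`, where assert statements are stripped. The function names are hypothetical.

```python
GPU_ALLOCATION_FLAGS = ("--tensor-parallel-size", "--num-gpus")


def has_gpu_allocation_flag(args):
    """True if any argument is a GPU allocation flag, as '--flag' or '--flag=value'."""
    return any(
        arg == flag or arg.startswith(flag + "=")
        for arg in args
        for flag in GPU_ALLOCATION_FLAGS
    )


def resolve_generate_server_args(extra_generate_args=None):
    """Keep the single-GPU default, but reject overrides that omit GPU flags."""
    if extra_generate_args is None:
        return ["--num-gpus", "1"]
    if not has_gpu_allocation_flag(extra_generate_args):
        raise ValueError("extra_generate_args must include a GPU allocation flag")
    return list(extra_generate_args)
```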

  for sample_dir in sorted(path for path in lang_dir.iterdir() if path.is_dir()):
      expected = sample_dir / "frame5.png" if data_type in {"type2", "type3", "type4"} else None
-     if expected is None:
+     if expected is None or not expected.exists():
Collaborator

Behavior change: previously, a missing frame5.png would set expected = None and skip the sample entirely. Now it falls back to find_first_image(). This is correct for t2i_only mode (where only frame0.png exists), but also changes the behavior for non-t2i runs — samples with missing frame5 but other frames present will now be included instead of skipped.

Could you add a comment here explaining the rationale? Something like:

# t2i_only generates only frame0, so fall back to any available image
# instead of requiring frame5 (which only exists for trajectory tasks).
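
The new fallback can be exercised in isolation with a sketch like this one. `first_image_in` is a simplified stand-in for `find_first_image`, whose real signature and search order are not shown in this thread, and `resolve_expected_frame` is a hypothetical wrapper around the changed condition.

```python
from pathlib import Path

TRAJECTORY_TYPES = {"type2", "type3", "type4"}


def first_image_in(sample_dir):
    """Stand-in for find_first_image: first PNG by file name, or None."""
    images = sorted(Path(sample_dir).glob("*.png"))
    return images[0] if images else None


def resolve_expected_frame(sample_dir, data_type):
    """Prefer frame5.png for trajectory types, else fall back to any image."""
    sample_dir = Path(sample_dir)
    expected = sample_dir / "frame5.png" if data_type in TRAJECTORY_TYPES else None
    # t2i_only generates only frame0, so fall back when frame5 is absent.
    if expected is None or not expected.exists():
        expected = first_image_in(sample_dir)
    return expected
```

A sample directory containing only frame0.png resolves to that file, while a directory with frame5.png still resolves to frame5.png for trajectory types.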

Comment thread .buildkite/test-nightly.yml Outdated
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 8cab680 to c61ccd3 Compare April 25, 2026 11:38
TaffyOfficial and others added 3 commits May 7, 2026 14:46
HunyuanImage-3.0-Instruct is a T2I model that cannot do IT2I editing.
Without --t2i-only, the test generates a full 6-frame trajectory where
frames 1-5 are garbage, causing the judge to score 0.04 instead of 0.45+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
collect_gebench_generation_summary hardcoded frame5.png for type3/type4,
but t2i-only mode only generates frame0.png. Fall back to find_first_image
when the expected frame doesn't exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 6aa1284 to 2fb40bc Compare May 7, 2026 06:46
Signed-off-by: TaffyOfficial <2324465096@qq.com>
@gcanlin gcanlin self-assigned this May 7, 2026
@gcanlin gcanlin added the ready (label to trigger buildkite CI) and diffusion-x2iat-test (label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI) labels May 7, 2026
Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 9f2d7b1 to c8fab1f Compare May 8, 2026 02:01
TaffyOfficial added 2 commits May 8, 2026 10:05
Signed-off-by: TaffyOfficial <2324465096@qq.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>
@gcanlin
Collaborator

gcanlin commented May 8, 2026

@TaffyOfficial Can the CI error be reproduced locally? If so, I will remove the ready label until it has been fixed locally.

Please fix https://buildkite.com/vllm/vllm-omni/builds/9200/canvas first.

@gcanlin gcanlin removed the ready (label to trigger buildkite CI) and diffusion-x2iat-test (label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI) labels May 8, 2026
@TaffyOfficial
Contributor Author

> @TaffyOfficial Can the CI error be reproduced locally? If so, I will remove the ready label until it has been fixed locally.
>
> Please fix https://buildkite.com/vllm/vllm-omni/builds/9200/canvas first.

That failure is in qwen-image; it is unrelated to my changes.

@TaffyOfficial
Contributor Author

Per the meeting decision, this PR is being pulled from CI because it needs 4 GPUs; only the test cases are kept.

@gcanlin gcanlin added the ready label (label to trigger buildkite CI) May 8, 2026
@TaffyOfficial TaffyOfficial changed the title [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline [example] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline May 8, 2026
@gcanlin
Collaborator

gcanlin commented May 8, 2026

This error seems to be from Hunyuan-Image.

=========================================================================== FAILURES ===========================================================================
___________________________________________________________________ test_gebench_h100_smoke ____________________________________________________________________
gebench_accuracy_servers = AccuracyServerConfig(generate_params=OmniServerParams(model='tencent/HunyuanImage-3.0-Instruct', port=8094, stage_conf..._omni=False, use_stage_cli=False, init_timeout=None, stage_init_timeout=None), run_level='full_model', model_prefix='')
accuracy_artifact_root = PosixPath('/workspace/build/buildkite/tests/e2e/accuracy/artifacts')
gebench_dataset_root = PosixPath('/root/.cache/huggingface/hub/datasets--stepfun-ai--GEBench/snapshots/063ea6ff286b8916fed97b45d8a43dfb4364f116')
gebench_samples_per_type = 4, gebench_num_inference_steps = 28, accuracy_workers = 1, gebench_t2i_only = True
gebench_min_scores = {'overall': 0.45, 'type3': 0.45, 'type4': 0.45}
    @pytest.mark.benchmark
    @hardware_test(res={"cuda": "H100"}, num_cards=1)
    def test_gebench_h100_smoke(
        gebench_accuracy_servers,
        accuracy_artifact_root: Path,
        gebench_dataset_root: Path,
        gebench_samples_per_type: int,
        gebench_num_inference_steps: int,
        accuracy_workers: int,
        gebench_t2i_only: bool,
        gebench_min_scores: dict[str, float],
    ) -> None:
        model_label = infer_model_label(gebench_accuracy_servers.generate_params.model).lower()
        output_root = reset_artifact_dir(accuracy_artifact_root / f"gebench_{model_label}")
        t2i_flag = ["--t2i-only"] if gebench_t2i_only else []
        with gebench_accuracy_servers.generate_server() as generate_server:
            for data_type in ("type3", "type4"):
                assert (
                    gbench_main(
                        [
                            "generate",
                            "--dataset-root",
                            str(gebench_dataset_root),
                            "--output-root",
                            str(output_root),
                            "--base-url",
                            f"http://{generate_server.host}:{generate_server.port}",
                            "--model",
                            generate_server.model,
                            "--data-type",
                            data_type,
                            "--width",
                            "768",
                            "--height",
                            "576",
                            "--output-compression",
                            "98",
                            "--num-inference-steps",
                            str(gebench_num_inference_steps),
                            "--workers",
                            str(accuracy_workers),
                            "--samples-per-type",
                            str(gebench_samples_per_type),
                            *t2i_flag,
                        ]
                    )
                    == 0
                )
        with gebench_accuracy_servers.judge_server() as judge_server:
            for data_type in ("type3", "type4"):
                assert (
                    gbench_main(
                        [
                            "evaluate",
                            "--dataset-root",
                            str(gebench_dataset_root),
                            "--output-root",
                            str(output_root),
                            "--data-type",
                            data_type,
                            "--judge-base-url",
                            f"http://{judge_server.host}:{judge_server.port}",
                            "--judge-model",
                            judge_server.model,
                            "--judge-api-key",
                            "EMPTY",
                            "--workers",
                            str(accuracy_workers),
                            *t2i_flag,
                        ]
                    )
                    == 0
                )
        assert gbench_main(["summarize", "--output-root", str(output_root)]) == 0
        summary = json.loads((output_root / "summary.json").read_text(encoding="utf-8"))
        assert "generation" in summary
        assert "evaluation" in summary
        for data_type in ("type3", "type4"):
            assert data_type in summary["generation"]["by_type"]
            assert summary["generation"]["by_type"][data_type]["count"] > 0
            assert data_type in summary["evaluation"]["by_type"]
            assert summary["evaluation"]["by_type"][data_type]["count"] > 0
        assert summary["evaluation"]["overall_mean"] >= gebench_min_scores["overall"]
>       assert summary["evaluation"]["by_type"]["type3"]["overall_mean"] >= gebench_min_scores["type3"]
E       assert 0.36 >= 0.45
tests/e2e/accuracy/test_gebench_h100_smoke.py:105: AssertionError
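
Because the test stops at the first failing assert, the log shows only the type3 miss. A sketch of a helper that collects every threshold miss in one pass is given below; `failing_thresholds` is a hypothetical name, and only the summary keys visible in the traceback (`evaluation`, `by_type`, `overall_mean`) are assumed.

```python
def failing_thresholds(summary, min_scores):
    """Return {key: (score, minimum)} for every score below its minimum."""
    evaluation = summary["evaluation"]
    failures = {}
    for key, minimum in min_scores.items():
        if key == "overall":
            score = evaluation["overall_mean"]
        else:
            score = evaluation["by_type"][key]["overall_mean"]
        if score < minimum:
            failures[key] = (score, minimum)
    return failures


# Illustrative values: only type3 = 0.36 comes from the log above;
# the overall and type4 numbers are made up for the example.
example_summary = {
    "evaluation": {
        "overall_mean": 0.50,
        "by_type": {"type3": {"overall_mean": 0.36}, "type4": {"overall_mean": 0.64}},
    }
}
print(failing_thresholds(example_summary, {"overall": 0.45, "type3": 0.45, "type4": 0.45}))
```

A single `assert not failures, failures` at the end of the test would then report all misses together.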

TaffyOfficial added 2 commits May 8, 2026 11:19
Revert .buildkite/test-nightly.yml to origin/main to remove:
- New HunyuanImage-3.0 GEBench accuracy job (TP=4+EP, 80B MoE)
- "and not resource_heavy" filter on the H100 / L4 diffusion sweeps
- Threshold args on the existing Qwen GEBench step

Also revert the resource_heavy marker scaffolding (pyproject.toml
registration + flux2 / flux_2_dev / sd3 expansion test tags) since it
only existed to keep the dropped HunyuanImage-3.0 job from competing
with the broad nightly diffusion sweeps.

Test cases stay: tests/e2e/accuracy/conftest.py fixture additions,
tests/e2e/accuracy/test_gebench_h100_smoke.py CLI options, and
benchmarks/accuracy/text_to_image/gbench.py logic. They can be
invoked manually until CI is re-enabled.

Signed-off-by: TaffyOfficial <2324465096@qq.com>
The previous 8-step default targeted distilled checkpoints
(HunyuanImage-3.0-Instruct-Distil); on the full Instruct model
28 steps (the prior buildkite override) was already producing
mode-collapse samples (e.g. near-blank frames scoring 5/5 from
the judge, masked by the overall mean). HF official default for
HunyuanImage-3.0 is 50 steps; align defaults with that.

Distilled / fast-sampling models that want fewer steps must now
opt in explicitly via --gebench-num-inference-steps / --num-inference-steps.

Sites updated:
- tests/e2e/accuracy/conftest.py: pytest --gebench-num-inference-steps default
- benchmarks/accuracy/text_to_image/gbench.py: GEBenchRunner.__init__ + CLI default

Signed-off-by: TaffyOfficial <2324465096@qq.com>
@TaffyOfficial TaffyOfficial changed the title [example] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline [Test][HunyuanImage3] GEBench T2I accuracy pytest harness May 8, 2026
@TaffyOfficial
Contributor Author

@gcanlin update


Labels

high priority (high priority issue, needs to be done asap), ready (label to trigger buildkite CI)

7 participants