[Test][HunyuanImage3] GEBench T2I accuracy pytest harness by TaffyOfficial · Pull Request #3055 · vllm-project/vllm-omni

TaffyOfficial · 2026-04-23T05:04:29Z

Summary

Ship the GEBench T2I accuracy harness for HunyuanImage-3.0-Instruct as a
manually invocable pytest, without wiring it into the nightly buildkite
pipeline. Test cases land in this PR; a follow-up will gate on resource
budget and decide nightly inclusion.

What's in scope

- tests/e2e/accuracy/test_gebench_h100_smoke.py: extend the existing
  GEBench smoke test with HunyuanImage-3.0 fixture params (per-type sample
  count, dynamic inference steps, t2i-only gate, multi-GPU stage overrides,
  extra server args)
- tests/e2e/accuracy/conftest.py: add CLI options --gebench-devices,
  --gebench-stage-overrides, --gebench-extra-server-args,
  --gebench-num-inference-steps, --gebench-samples-per-type,
  --gebench-t2i-only; build multi-GPU OmniServer params from them
- benchmarks/accuracy/text_to_image/gbench.py: add --t2i-only flag
  (skips IT2I edits in generate+evaluate; type1/2/5 remain out of scope
  until the AR→DiT bridge lands), thread --num-inference-steps through
  generate / evaluate
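
How the new options might combine into server parameters can be sketched as follows. This is a minimal illustration, not the PR's conftest.py: the helper name build_generate_params and the shape of the returned dict are assumptions. Only the option semantics come from this thread (comma-separated device list, JSON object of stage overrides, JSON list of extra server args, tensor-parallel size derived from the device count).

```python
import json
from typing import Optional


def build_generate_params(
    devices: str,
    stage_overrides_json: Optional[str] = None,
    extra_args_json: Optional[str] = None,
) -> dict:
    """Turn raw CLI option strings into a plain dict (hypothetical shape)."""
    device_ids = [d.strip() for d in devices.split(",") if d.strip()]
    if not device_ids:
        raise ValueError("--gebench-devices must name at least one device")

    # Both JSON options are optional; an empty value means "no overrides".
    stage_overrides = json.loads(stage_overrides_json) if stage_overrides_json else {}
    extra_args = json.loads(extra_args_json) if extra_args_json else []
    if not isinstance(stage_overrides, dict) or not isinstance(extra_args, list):
        raise ValueError("stage overrides must be a JSON object, extra args a JSON list")

    # Tensor-parallel size follows the device count, as noted in the review.
    server_args = ["--tensor-parallel-size", str(len(device_ids)), *extra_args]
    return {
        "num_devices": len(device_ids),
        "stage_overrides": stage_overrides,
        "server_args": server_args,
    }


params = build_generate_params(
    "0,1,2,3",
    '{"0":{"devices":"0,1,2,3","enable_expert_parallel":true,"max_num_seqs":1}}',
    '["--dtype","bfloat16"]',
)
print(params["server_args"])
```

Validating the JSON types up front keeps a malformed option from surfacing later as an opaque server start-up failure.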

What's NOT in scope (reverted vs earlier revisions of this PR)

- .buildkite/test-nightly.yml: dropped the new HunyuanImage-3.0 GEBench
  step plus the "not resource_heavy" filter on existing diffusion sweeps.
  Nightly inclusion will be decided in a follow-up after we agree on
  resource budget (4×H100 / 4×L20X / etc.) and quality gates.
- pyproject.toml + tests/e2e/online_serving/test_flux*.py +
  test_sd3_expansion.py: dropped the resource_heavy marker; it only
  existed to fence flux/sd3 sweeps off from the new HunyuanImage-3.0 step,
  and with that step gone the marker has no consumer.

Default step count: 8 → 50

The previous 8-step default targeted the distilled checkpoint
(HunyuanImage-3.0-Instruct-Distil). Earlier revisions of this PR
overrode it to 28 in the buildkite command, which produced visibly
mode-collapsed samples on the full Instruct model — a near-blank frame
(black background + small white rectangle) was scoring 5/5/5/5/5 from
the Qwen3-VL-30B-A3B-Instruct-AWQ judge because the judge fell back to
"image is well-formed" reasoning when the prompt was abstract.

HF official default for HunyuanImage-3.0 (non-distil) is 50 steps
(hunyuan_image_3_pipeline.py:692, README L211/254). This PR aligns the
fixture/CLI defaults with HF; distil users opt in via
--gebench-num-inference-steps 8 / --num-inference-steps 8.

Empirical validation (4×L20X 143GB, samples-per-type=2, t2i-only)

Metric / sample	28 steps	50 steps
overall_mean	0.72	0.94
type3_0001	0.44 (blurry, illegible)	1.00
type3_0002	1.00	1.00
type4_0001	0.44 (blurry)	1.00
type4_0002	1.00 (judge-fooled near-blank)	0.76 (legible UI, mild blur)

50 steps eliminates the two mode-collapse samples and the judge-fooled
near-blank, leaving residual judge variance instead of artifact-driven
floor.
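
As a quick consistency check, the overall_mean row can be recomputed from the four per-sample scores. The sketch below assumes overall_mean is the unweighted mean of the per-sample scores; the thread does not state the formula, but the table's numbers are consistent with it.

```python
from statistics import mean

# Per-sample scores copied from the table above.
scores = {
    "28 steps": {"type3_0001": 0.44, "type3_0002": 1.00, "type4_0001": 0.44, "type4_0002": 1.00},
    "50 steps": {"type3_0001": 1.00, "type3_0002": 1.00, "type4_0001": 1.00, "type4_0002": 0.76},
}

# Unweighted mean per run, rounded to the two decimals the table reports.
overall = {run: round(mean(per_sample.values()), 2) for run, per_sample in scores.items()}
print(overall)  # {'28 steps': 0.72, '50 steps': 0.94}
```

Note that the 28-step mean of 0.72 includes the 1.00 awarded to the near-blank type4_0002 frame, which is why the aggregate alone did not expose the problem.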

Reproducing locally

4× H100 / L20X (or any 4-GPU node with ≥80GB/card):

HF_HOME=/path/to/hf-cache pytest -s -v tests/e2e/accuracy/test_gebench_h100_smoke.py \
    --run-level full_model \
    --gebench-model tencent/HunyuanImage-3.0-Instruct \
    --accuracy-judge-model QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
    --gebench-devices 0,1,2,3 --accuracy-gpu 0 --gebench-port 8094 \
    --gebench-samples-per-type 4 --accuracy-workers 1 --gebench-t2i-only \
    --gebench-stage-overrides '{"0":{"devices":"0,1,2,3","enable_expert_parallel":true,"max_num_seqs":1}}' \
    --gebench-extra-server-args '["--dtype","bfloat16","--gpu-memory-utilization","0.95","--enforce-eager","--trust-remote-code","--distributed-executor-backend","mp","--no-async-chunk"]'

Artifacts (PNG frames + per-sample raw_scores + judge reasoning) land
in tests/e2e/accuracy/artifacts/gebench_hunyuanimage-3_0-instruct/.
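
The two JSON-valued options in the command above are easy to break when copied by hand (a stray line wrap inside the quotes is enough). One way to avoid that, shown here as a sketch rather than as part of the PR, is to generate them with json.dumps and quote them with shlex.quote. The flag names and values are taken from the command above; everything else is illustrative.

```python
import json
import shlex

stage_overrides = {"0": {"devices": "0,1,2,3", "enable_expert_parallel": True, "max_num_seqs": 1}}
extra_server_args = [
    "--dtype", "bfloat16",
    "--gpu-memory-utilization", "0.95",
    "--enforce-eager",
    "--trust-remote-code",
    "--distributed-executor-backend", "mp",
    "--no-async-chunk",
]

# Compact separators reproduce the whitespace-free JSON used in the command.
stage_json = json.dumps(stage_overrides, separators=(",", ":"))
extra_json = json.dumps(extra_server_args, separators=(",", ":"))

flags = ["--gebench-stage-overrides", stage_json, "--gebench-extra-server-args", extra_json]
# shlex.quote wraps each JSON value in single quotes so the shell passes it through intact.
print(" ".join(shlex.quote(part) for part in flags))
```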

Test plan

- Local pytest on 4×L20X 143GB: 1 passed in 7m53s, overall 0.94, no
mode-collapse frames
- Confirmed --gebench-t2i-only matches PR's DIT_ONLY pipeline
topology (no AR stage launched server-side)
- Reviewer to confirm that the step-default change (8→50) is acceptable
for the Qwen-Image-2512 step on main, which currently relies on the default of 8

chatgpt-codex-connector · 2026-04-23T05:04:34Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

hsliuustc0106 · 2026-04-23T07:19:55Z

do you need to load the true weights or just dummy weights

TaffyOfficial · 2026-04-23T07:27:40Z

> do you need to load the true weights or just dummy weights

Our nightly CI runs precision tests (assert overall_mean >= 0.45), which require real weights. Using dummy weights produces random noise, resulting in scores close to 0 and direct assertion failures.

hsliuustc0106 · 2026-04-23T08:26:58Z

> > do you need to load the true weights or just dummy weights
>
> Our nightly CI runs precision tests (assert overall_mean >= 0.45), which require real weights. Using dummy weights produces random noise, resulting in scores close to 0 and direct assertion failures.

the model loading will take a lot of time, we can upload the CI to main but I do not think L4 will cover it nightly. We may run it every week locally.

TaffyOfficial · 2026-04-23T08:46:30Z

> > > do you need to load the true weights or just dummy weights
>
> > Our nightly CI runs precision tests (assert overall_mean >= 0.45), which require real weights. Using dummy weights produces random noise, resulting in scores close to 0 and direct assertion failures.
>
> the model loading will take a lot of time, we can upload the CI to main but I do not think L4 will cover it nightly. We may run it every week locally.

@yenuo26 So should we change it to advanced_model after all, or what do you suggest?

hsliuustc0106

Observations:

Pipeline registry uses ***=_HF_ARCHS syntax - verify this works
Multi-GPU fixture derives TP size from device count - good design
trust_remote_code=True in tokenizer fallback is intentional
Quantization config normalization is useful

VERDICT: COMMENT

TaffyOfficial · 2026-04-24T01:44:04Z

Diffusion X2I(&A&T) · GEBench Accuracy Test (HunyuanImage-3.0)

export VLLM_TEST_CLEAN_GPU_MEMORY="1" && pytest -s -v tests/e2e/accuracy/test_gebench_h100_smoke.py --run-level full_model --gebench-model tencent/HunyuanImage-3.0-Instruct --accuracy-judge-model QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ --gebench-devices 0,1,2,3 --gebench-port 8094 --accuracy-gpu 0 --gebench-samples-per-type 4 --gebench-num-inference-steps 28 --accuracy-workers 1 --gebench-t2i-only --gebench-stage-overrides '{"0":{"devices":"0,1,2,3","enable_expert_parallel":true,"max_num_seqs":1}}' --gebench-extra-server-args '["--dtype","bfloat16","--gpu-memory-utilization","0.95","--enforce-eager","--trust-remote-code","--distributed-executor-backend","mp","--no-async-chunk"]' && buildkite-agent artifact upload "tests/e2e/accuracy/artifacts/gebench_hunyuanimage-3_0-instruct/summary*.json"

Waited 9h 28m · Ran in 7m 57s

Gaohan123 · 2026-04-24T08:14:50Z

        )

+    def _build_t2i_scoring_prompt(self, task_prompt: str) -> str:
+        return (


It seems the method is useless

TaffyOfficial · 2026-04-24

    def evaluate(self, *, prompt: str, images: list[Image.Image], t2i_mode: bool = False) -> dict[str, Any]:
        build = self._build_t2i_scoring_prompt if t2i_mode else self._build_scoring_prompt
        primary_prompt = build(prompt)

So _build_t2i_scoring_prompt is used when t2i_mode=True. And in GEBenchEvaluator._evaluate_one, for type3/type4 with self.t2i_only:

    raw_scores = self.judge.evaluate(
        prompt=(...),
        images=[generated],
        t2i_mode=True,
    )

So the method is used. You missed the t2i_mode=True argument in the evaluate call.

hsliuustc0106

PR #3055 Review: HunyuanImage-3.0 GEBench accuracy smoke test + supporting infra changes

Summary: This PR adds a nightly Buildkite CI job for GEBench accuracy testing with HunyuanImage-3.0 (4-GPU, TP=4+EP MoE), plus supporting changes across the benchmark harness, test fixtures, pipeline registry, diffusion config, tokenizer, and quantization factory.

Verdict: COMMENT -- request a few small changes before merge

Positives:

Clean t2i_only flag design -- generation skips non-T2I types early (_generate_one returns None), evaluation uses a dedicated _build_t2i_scoring_prompt instead of trajectory scoring. Good separation.
from_kwargs parallel_config auto-construction is clever -- extracting fields that belong to DiffusionParallelConfig from flat kwargs prevents silent drops.
Pipeline topology docstring is thorough -- explains why DIT_ONLY is the default (AR->DiT bridge gaps) and what each synthetic suffix means.
extra_generate_args completely replacing ["--num-gpus", "1"] when provided is the right design -- avoids conflicts between TP flags.

Issues to address:

1. quantization/factory.py: inspect.signature is fragile and expensive

Using inspect.signature at instantiation time to filter kwargs couples factory logic to implementation details. If a config class uses **kwargs in __init__, the valid set will exclude everything. Consider using dataclasses.fields() if the config classes are dataclasses (which they typically are in vllm), or at minimum add a comment noting this limitation. The bits -> weight_bits normalization is fine and useful.

2. pipeline.py: verify file is importable

The gh pr diff output shows ***=_HF_ARCHS which is invalid Python. The actual file content confirmed via API is hf_architectures=_HF_ARCHS (valid), so this is a diff rendering artifact. But worth a quick smoke test (python -c "import vllm_omni.model_executor.models.hunyuan_image3.pipeline") to ensure the file is syntactically importable.

3. conftest.py: extra_generate_args replaces --num-gpus entirely -- no safety net

When extra_generate_args is provided, --num-gpus is no longer automatically set. If a caller forgets to include --tensor-parallel-size or equivalent GPU flags in extra_args_json, the server could silently start single-GPU. The current code derives TP from device count (len(devices.split(","))) which is correct, but consider adding a defensive comment or assertion that at least one GPU-related flag is present in extra_args.

4. tokenizer.py: trust_remote_code=True as first attempt

Adding trust_remote_code=True before the fallback path is a security concern for untrusted models. For an internal CI model this is acceptable, but this file is not gated behind any check. Consider either: (a) catching a narrower exception set and only falling through for the specific case, or (b) documenting why trust_remote_code=True is needed here (custom tokenizer class in the HF checkpoint?).

5. Buildkite step: 4-GPU request for a smoke test

Requesting 4x H100 GPUs for a smoke test (4 samples/type) is expensive. Is there a way to run a reduced version (e.g., 2-GPU without expert parallel) as a gate, and leave the 4-GPU EP test as periodic? This is a cost concern, not a correctness one -- maintainer call.

6. gbench.py: expected logic change

The diff changes find_first_image fallback to only trigger when expected does not exist (previously it also set expected = None when frame5.png was missing). This is a behavior change: previously, missing frame5.png would skip the sample; now it falls back to another image. This is likely intentional for t2i_only mode (where only frame0.png exists), but could change behavior for non-t2i runs too. Consider adding a comment explaining the rationale.

hsliuustc0106 · 2026-04-24T10:48:23Z

+        kwargs["weight_bits"] = kwargs.pop("bits")
+
+    # Filter to only params the config class accepts
+    valid = set(inspect.signature(config_cls.__init__).parameters) - {"self"}


Nit: inspect.signature reflects the runtime callable signature, which means any config class that uses **kwargs in __init__ will have an empty valid set — every kwarg gets silently dropped. If the quant config classes are dataclasses, dataclasses.fields() would be more robust. At minimum, add a comment noting this limitation so future contributors do not get tripped up.
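
The limitation is easy to reproduce in isolation. The sketch below uses two hypothetical config classes (neither is from vllm or this PR) to compare the signature-based filter from the diff with the dataclasses.fields() alternative. Strictly speaking the signature-based set is not empty for a **kwargs-only __init__: it contains the catch-all name itself, which has the same practical effect because no real option matches it.

```python
import dataclasses
import inspect


class KwargsConfig:
    """Hypothetical config whose __init__ only declares **kwargs."""

    def __init__(self, **kwargs):
        self.options = kwargs


@dataclasses.dataclass
class DataclassConfig:
    """Hypothetical dataclass-style config."""

    weight_bits: int = 8
    group_size: int = 128


def accepted_by_signature(config_cls) -> set:
    # Mirrors the filter in the diff: __init__ parameter names, minus self.
    return set(inspect.signature(config_cls.__init__).parameters) - {"self"}


def accepted_by_fields(config_cls) -> set:
    # Dataclass-based alternative suggested in the review.
    return {field.name for field in dataclasses.fields(config_cls)}


kwargs = {"weight_bits": 4, "unknown": 1}

sig_valid = accepted_by_signature(KwargsConfig)
sig_filtered = {k: v for k, v in kwargs.items() if k in sig_valid}
print(sig_valid, sig_filtered)  # {'kwargs'} {}

field_valid = accepted_by_fields(DataclassConfig)
field_filtered = {k: v for k, v in kwargs.items() if k in field_valid}
print(field_filtered)  # {'weight_bits': 4}
```

A signature-based filter could also detect a VAR_KEYWORD parameter and skip filtering in that case; which behavior is preferable depends on how the real config classes are defined.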

hsliuustc0106 · 2026-04-24T10:48:27Z

        if isinstance(tokenizer, str):
-            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer)
+            try:
+                self.tokenizer = AutoTokenizer.from_pretrained(tokenizer, trust_remote_code=True)


Security note: trust_remote_code=True is tried first for all models, even untrusted ones. Could you either:

Narrow the exception set to only the specific failure you see with HunyuanImage-3.0, or

Add a comment explaining why this is needed (e.g., "HunyuanImage-3.0 ships a custom tokenizer class that requires trust_remote_code")

The fallback to PreTrainedTokenizerFast is a nice safety net, but running arbitrary code from the Hub on the first attempt is a broad trust gate.
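
One possible way to narrow the trust gate is to try the load without remote code first and retry with it only when the caller has opted in. The sketch below is an alternative ordering for illustration, not what the PR does: load_tokenizer, the loader parameter, the allow_remote_code switch, and the caught exception types are all assumptions. The loader is injected so the logic can be exercised without network access.

```python
from typing import Any, Callable


def load_tokenizer(name: str, loader: Callable[..., Any], *, allow_remote_code: bool = False) -> Any:
    """Try the safe path first; run Hub-provided code only on explicit opt-in."""
    try:
        return loader(name, trust_remote_code=False)
    except (ValueError, OSError):
        # Assumed failure modes when a checkpoint needs a custom tokenizer class.
        if not allow_remote_code:
            raise
        return loader(name, trust_remote_code=True)


# Stand-in loader that behaves like a checkpoint requiring custom code.
calls = []


def fake_loader(name: str, trust_remote_code: bool = False) -> str:
    calls.append(trust_remote_code)
    if not trust_remote_code:
        raise ValueError(f"{name} requires custom tokenizer code")
    return f"tokenizer:{name}"


print(load_tokenizer("some/model", fake_loader, allow_remote_code=True))
print(calls)  # [False, True]
```

With transformers installed, AutoTokenizer.from_pretrained could be passed as the loader, since it accepts a trust_remote_code keyword.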

hsliuustc0106 · 2026-04-24T10:48:31Z

+    if torch.cuda.device_count() < num_devices:
+        pytest.skip(f"Need at least {num_devices} CUDA GPUs for this accuracy benchmark.")
+
+    generate_server_args = extra_generate_args if extra_generate_args is not None else ["--num-gpus", "1"]


This means when extra_generate_args is provided, --num-gpus is no longer set at all. The caller (in gebench_accuracy_servers) correctly derives --tensor-parallel-size from device count, so this works today. But if a future caller provides extra_generate_args without GPU flags, the server silently starts single-GPU.

Consider a defensive assertion like:

    assert any("--tensor-parallel-size" in a or "--num-gpus" in a for a in (extra_generate_args or [])), \
        "extra_generate_args must include a GPU allocation flag"

Or at minimum a comment warning callers.
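
A slightly stricter variant of that check is sketched below, under stated assumptions: resolve_generate_args is a hypothetical helper, not a function in the PR. It keeps the single-GPU default from the diff when no extra args are given, matches flag names exactly (accepting both "--flag value" and "--flag=value"), and raises ValueError instead of using assert so the check survives python -O.

```python
from typing import Optional, Sequence

GPU_FLAGS = ("--tensor-parallel-size", "--num-gpus")


def resolve_generate_args(extra_generate_args: Optional[Sequence[str]]) -> list:
    """Return server args, insisting that caller-supplied args allocate GPUs explicitly."""
    if extra_generate_args is None:
        # Same default as the diff: single-GPU server.
        return ["--num-gpus", "1"]
    args = list(extra_generate_args)
    # Compare the flag name only, so "--num-gpus=4" and "--num-gpus 4" both pass.
    if not any(arg.split("=", 1)[0] in GPU_FLAGS for arg in args):
        raise ValueError("extra_generate_args must include a GPU allocation flag")
    return args


print(resolve_generate_args(None))
print(resolve_generate_args(["--tensor-parallel-size", "4", "--dtype", "bfloat16"]))
```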

hsliuustc0106 · 2026-04-24T10:48:34Z

            for sample_dir in sorted(path for path in lang_dir.iterdir() if path.is_dir()):
                expected = sample_dir / "frame5.png" if data_type in {"type2", "type3", "type4"} else None
-                if expected is None:
+                if expected is None or not expected.exists():


Behavior change: previously, a missing frame5.png would set expected = None and skip the sample entirely. Now it falls back to find_first_image(). This is correct for t2i_only mode (where only frame0.png exists), but also changes the behavior for non-t2i runs — samples with missing frame5 but other frames present will now be included instead of skipped.

Could you add a comment here explaining the rationale? Something like:

    # t2i_only generates only frame0, so fall back to any available image
    # instead of requiring frame5 (which only exists for trajectory tasks).
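
The behavior change can be illustrated with a self-contained sketch. The find_first_image below is a hypothetical stand-in (first PNG in name order), and pick_summary_image is an illustrative wrapper around the condition from the diff; the real helpers in gbench.py may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional

TRAJECTORY_TYPES = {"type2", "type3", "type4"}


def find_first_image(sample_dir: Path) -> Optional[Path]:
    """Hypothetical stand-in: first PNG in name order, or None."""
    return next(iter(sorted(sample_dir.glob("*.png"))), None)


def pick_summary_image(sample_dir: Path, data_type: str) -> Optional[Path]:
    expected = sample_dir / "frame5.png" if data_type in TRAJECTORY_TYPES else None
    # Condition from the diff: also fall back when frame5.png is absent.
    if expected is None or not expected.exists():
        expected = find_first_image(sample_dir)
    return expected


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp)
    (sample / "frame0.png").write_bytes(b"")  # t2i-only output: a single frame
    t2i_only_pick = pick_summary_image(sample, "type3").name
    (sample / "frame5.png").write_bytes(b"")  # full trajectory: final frame present
    trajectory_pick = pick_summary_image(sample, "type3").name

print(t2i_only_pick, trajectory_pick)  # frame0.png frame5.png
```

When frame5.png exists it is still preferred, so only samples that previously would have been skipped are affected.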

HunyuanImage-3.0-Instruct is a T2I model that cannot do IT2I editing. Without --t2i-only, the test generates a full 6-frame trajectory where frames 1-5 are garbage, causing the judge to score 0.04 instead of 0.45+.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>

collect_gebench_generation_summary hardcoded frame5.png for type3/type4, but t2i-only mode only generates frame0.png. Fall back to find_first_image when the expected frame doesn't exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: TaffyOfficial <2324465096@qq.com>

Signed-off-by: TaffyOfficial <2324465096@qq.com>

gcanlin · 2026-05-08T03:07:24Z

@TaffyOfficial Could the error in CI be reproduced locally? If so, I will remove the ready tag until the error has been fixed locally.

Please fix https://buildkite.com/vllm/vllm-omni/builds/9200/canvas first.

TaffyOfficial · 2026-05-08T03:11:59Z

> @TaffyOfficial Could the error in CI be reproduced locally? If so, I will remove the ready tag until the error has been fixed locally.
>
> Please fix https://buildkite.com/vllm/vllm-omni/builds/9200/canvas first.

That is a qwen-image failure; it has nothing to do with my changes.

TaffyOfficial · 2026-05-08T03:12:36Z

As agreed in the meeting, because this PR needs 4 GPUs it is being dropped from CI; only the test cases are kept.

gcanlin · 2026-05-08T03:16:18Z

This error seems to be from Hunyuan-Image.

=========================================================================== FAILURES ===========================================================================
___________________________________________________________________ test_gebench_h100_smoke ____________________________________________________________________
gebench_accuracy_servers = AccuracyServerConfig(generate_params=OmniServerParams(model='tencent/HunyuanImage-3.0-Instruct', port=8094, stage_conf..._omni=False, use_stage_cli=False, init_timeout=None, stage_init_timeout=None), run_level='full_model', model_prefix='')
accuracy_artifact_root = PosixPath('/workspace/build/buildkite/tests/e2e/accuracy/artifacts')
gebench_dataset_root = PosixPath('/root/.cache/huggingface/hub/datasets--stepfun-ai--GEBench/snapshots/063ea6ff286b8916fed97b45d8a43dfb4364f116')
gebench_samples_per_type = 4, gebench_num_inference_steps = 28, accuracy_workers = 1, gebench_t2i_only = True
gebench_min_scores = {'overall': 0.45, 'type3': 0.45, 'type4': 0.45}
    @pytest.mark.benchmark
    @hardware_test(res={"cuda": "H100"}, num_cards=1)
    def test_gebench_h100_smoke(
        gebench_accuracy_servers,
        accuracy_artifact_root: Path,
        gebench_dataset_root: Path,
        gebench_samples_per_type: int,
        gebench_num_inference_steps: int,
        accuracy_workers: int,
        gebench_t2i_only: bool,
        gebench_min_scores: dict[str, float],
    ) -> None:
        model_label = infer_model_label(gebench_accuracy_servers.generate_params.model).lower()
        output_root = reset_artifact_dir(accuracy_artifact_root / f"gebench_{model_label}")
        t2i_flag = ["--t2i-only"] if gebench_t2i_only else []
        with gebench_accuracy_servers.generate_server() as generate_server:
            for data_type in ("type3", "type4"):
                assert (
                    gbench_main(
                        [
                            "generate",
                            "--dataset-root",
                            str(gebench_dataset_root),
                            "--output-root",
                            str(output_root),
                            "--base-url",
                            f"http://{generate_server.host}:{generate_server.port}",
                            "--model",
                            generate_server.model,
                            "--data-type",
                            data_type,
                            "--width",
                            "768",
                            "--height",
                            "576",
                            "--output-compression",
                            "98",
                            "--num-inference-steps",
                            str(gebench_num_inference_steps),
                            "--workers",
                            str(accuracy_workers),
                            "--samples-per-type",
                            str(gebench_samples_per_type),
                            *t2i_flag,
                        ]
                    )
                    == 0
                )
        with gebench_accuracy_servers.judge_server() as judge_server:
            for data_type in ("type3", "type4"):
                assert (
                    gbench_main(
                        [
                            "evaluate",
                            "--dataset-root",
                            str(gebench_dataset_root),
                            "--output-root",
                            str(output_root),
                            "--data-type",
                            data_type,
                            "--judge-base-url",
                            f"http://{judge_server.host}:{judge_server.port}",
                            "--judge-model",
                            judge_server.model,
                            "--judge-api-key",
                            "EMPTY",
                            "--workers",
                            str(accuracy_workers),
                            *t2i_flag,
                        ]
                    )
                    == 0
                )
        assert gbench_main(["summarize", "--output-root", str(output_root)]) == 0
        summary = json.loads((output_root / "summary.json").read_text(encoding="utf-8"))
        assert "generation" in summary
        assert "evaluation" in summary
        for data_type in ("type3", "type4"):
            assert data_type in summary["generation"]["by_type"]
            assert summary["generation"]["by_type"][data_type]["count"] > 0
            assert data_type in summary["evaluation"]["by_type"]
            assert summary["evaluation"]["by_type"][data_type]["count"] > 0
        assert summary["evaluation"]["overall_mean"] >= gebench_min_scores["overall"]
>       assert summary["evaluation"]["by_type"]["type3"]["overall_mean"] >= gebench_min_scores["type3"]
E       assert 0.36 >= 0.45
tests/e2e/accuracy/test_gebench_h100_smoke.py:105: AssertionError
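
Because the test uses sequential asserts, the trace stops at the first failing threshold (type3) and says nothing about type4. A sketch of a helper that reports every threshold miss at once is shown below. The function name threshold_failures is an assumption; the summary layout and the gebench_min_scores keys follow the trace above. Only the type3 score of 0.36 comes from the trace; the other two scores are placeholders for illustration.

```python
def threshold_failures(summary: dict, min_scores: dict) -> list:
    """List every score below its threshold instead of stopping at the first one."""
    evaluation = summary["evaluation"]
    failures = []
    for key, minimum in min_scores.items():
        if key == "overall":
            score = evaluation["overall_mean"]
        else:
            score = evaluation["by_type"][key]["overall_mean"]
        if score < minimum:
            failures.append(f"{key}: {score:.2f} < {minimum:.2f}")
    return failures


# type3 = 0.36 is from the trace; overall and type4 are placeholder values.
summary = {
    "evaluation": {
        "overall_mean": 0.50,
        "by_type": {"type3": {"overall_mean": 0.36}, "type4": {"overall_mean": 0.64}},
    }
}
min_scores = {"overall": 0.45, "type3": 0.45, "type4": 0.45}

failures = threshold_failures(summary, min_scores)
print(failures)  # ['type3: 0.36 < 0.45']
```

In a test, a single `assert not failures, failures` would then surface all misses in one run.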

Revert .buildkite/test-nightly.yml to origin/main to remove:
- New HunyuanImage-3.0 GEBench accuracy job (TP=4+EP, 80B MoE)
- "and not resource_heavy" filter on the H100 / L4 diffusion sweeps
- Threshold args on the existing Qwen GEBench step

Also revert the resource_heavy marker scaffolding (pyproject.toml registration + flux2 / flux_2_dev / sd3 expansion test tags) since it only existed to keep the dropped HunyuanImage-3.0 job from competing with the broad nightly diffusion sweeps.

Test cases stay: tests/e2e/accuracy/conftest.py fixture additions, tests/e2e/accuracy/test_gebench_h100_smoke.py CLI options, and benchmarks/accuracy/text_to_image/gbench.py logic. They can be invoked manually until CI is re-enabled.

Signed-off-by: TaffyOfficial <2324465096@qq.com>

The previous 8-step default targeted distilled checkpoints (HunyuanImage-3.0-Instruct-Distil); on the full Instruct model 28 steps (the prior buildkite override) was already producing mode-collapse samples (e.g. near-blank frames scoring 5/5 from the judge, masked by the overall mean). HF official default for HunyuanImage-3.0 is 50 steps; align defaults with that.

Distilled / fast-sampling models that want fewer steps must now opt in explicitly via --gebench-num-inference-steps / --num-inference-steps.

Sites updated:
- tests/e2e/accuracy/conftest.py: pytest --gebench-num-inference-steps default
- benchmarks/accuracy/text_to_image/gbench.py: GEBenchRunner.__init__ + CLI default

Signed-off-by: TaffyOfficial <2324465096@qq.com>

TaffyOfficial · 2026-05-08T05:36:08Z

@gcanlin update

TaffyOfficial requested a review from hsliuustc0106 as a code owner April 23, 2026 05:04

TaffyOfficial changed the title ~~ci: add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline~~ [ci]add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026

TaffyOfficial changed the title ~~[ci]add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline~~ [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026

TaffyOfficial changed the title ~~[ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline~~ [wip] [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026

TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 7f08b94 to b406553 Compare April 23, 2026 06:15

TaffyOfficial changed the title ~~[wip] [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline~~ [ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline Apr 23, 2026

TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from b406553 to fec875f Compare April 23, 2026 07:06

TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from ad2b1c1 to 8ee36c4 Compare April 23, 2026 07:53

yenuo26 reviewed Apr 23, 2026

yenuo26 added the diffusion-x2iat-test label (triggers the buildkite x2image + x2audio + x2text series of diffusion model tests in nightly CI) Apr 23, 2026

TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 5c5ee08 to ef3adc0 Compare April 23, 2026 08:12

TaffyOfficial requested a review from yenuo26 April 23, 2026 08:15

hsliuustc0106 reviewed Apr 23, 2026

Gaohan123 added this to the v0.20.0 milestone Apr 24, 2026

yenuo26 added the ready label (triggers buildkite CI) and removed the diffusion-x2iat-test label Apr 24, 2026

Gaohan123 reviewed Apr 24, 2026

hsliuustc0106 reviewed Apr 24, 2026

TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 8cab680 to c61ccd3 Compare April 25, 2026 11:38

TaffyOfficial and others added 3 commits May 7, 2026 14:46

Address HunyuanImage GEBench review comments

2fb40bc

Signed-off-by: TaffyOfficial <2324465096@qq.com>

TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 6aa1284 to 2fb40bc Compare May 7, 2026 06:46

chore: trigger CI re-run

a0d8f79

Signed-off-by: TaffyOfficial <2324465096@qq.com>

gcanlin self-assigned this May 7, 2026

gcanlin added the ready and diffusion-x2iat-test labels May 7, 2026

Bounty-hunter mentioned this pull request May 7, 2026

[Feature]: [Hunyuanimage]Support DIT reuse kv from AR stage JiusiServe/vllm-omni#216

Open

1 task

[CI] stabilize Hunyuan Image3 accuracy nightly

c8fab1f

Signed-off-by: TaffyOfficial <2324465096@qq.com>

TaffyOfficial force-pushed the feat/hunyuan-image3-accuracy-ci branch from 9f2d7b1 to c8fab1f Compare May 8, 2026 02:01

TaffyOfficial added 2 commits May 8, 2026 10:05

Merge origin/main into Hunyuan Image3 accuracy CI

d203909

Signed-off-by: TaffyOfficial <2324465096@qq.com>

[CI] fix accuracy fixture accelerator lint

ae878a8

Signed-off-by: TaffyOfficial <2324465096@qq.com>

gcanlin removed the ready and diffusion-x2iat-test labels May 8, 2026

gcanlin added the ready label May 8, 2026

TaffyOfficial changed the title ~~[ci] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline~~ [example] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline May 8, 2026

TaffyOfficial added 2 commits May 8, 2026 11:19

TaffyOfficial changed the title ~~[example] add HunyuanImage-3.0 GEBench accuracy test to nightly pipeline~~ [Test][HunyuanImage3] GEBench T2I accuracy pytest harness May 8, 2026

Merge branch 'main' into feat/hunyuan-image3-accuracy-ci

2a8148a

TaffyOfficial requested a review from david6666666 as a code owner May 8, 2026 05:34

Gaohan123 modified the milestones: v0.20.0, v0.22.0 May 9, 2026

Bounty-hunter mentioned this pull request May 10, 2026

[RFC]: HunyuanImage Model deployment optimization #2015

Open
