[CI] Qwen image edit performance benchmark #2216
Conversation
Force-pushed from 303b0cc to aff539a
Is this PR ready for review? @fhfuih
Force-pushed from 945fc1e to 7b9836a
Ready now! And it also needs a nightly CI tag.
Pull request overview
Adds nightly diffusion performance benchmark coverage for the Qwen-Image-Edit model family (including the multi-image 2509 variant) and extends the existing diffusion benchmark runner/serving scripts to better support/report these runs.
Changes:
- Add new perf config JSONs for Qwen/Qwen-Image-Edit and Qwen/Qwen-Image-Edit-2509, and wire them into the nightly Buildkite diffusion perf step.
- Extend the diffusion perf runner to emit richer, flattened reporting fields (resolution/parallelism/cache/etc.) and provenance metadata in the aggregated JSON output.
- Update the diffusion serving benchmark to support generating multiple synthetic input images via `--num-input-images` for random i2i-style tasks.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `tests/dfx/perf/tests/test_qwen_image_vllm_omni.json` | Minor formatting cleanup in the existing Qwen-Image perf config. |
| `tests/dfx/perf/tests/test_qwen_image_edit_vllm_omni.json` | New Qwen-Image-Edit perf config (single device / ulysses+cfg+vae / cache_dit). |
| `tests/dfx/perf/tests/test_qwen_image_edit_2509_vllm_omni.json` | New Qwen-Image-Edit-2509 perf config with 2-input-image benchmarks. |
| `tests/dfx/perf/scripts/run_diffusion_benchmark.py` | Add flattened reporting fields and commit/build provenance; refactor server config handling. |
| `benchmarks/diffusion/diffusion_benchmark_serving.py` | Add `--num-input-images` and generate multiple synthetic input images for the random dataset. |
| `tests/dfx/benchmark_results_to_excel.py` | New local utility to convert benchmark JSONs into an Excel summary. |
| `pyproject.toml` | Add pandas to dev dependencies. |
| `.gitignore` | Unignore perf config JSONs under tests/dfx/perf/tests/. |
| `.buildkite/test-nightly.yml` | Run the new Qwen-Image-Edit benchmarks in the nightly diffusion perf step; set env vars; mark the step soft-fail. |
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0"
    os.environ.setdefault("DIFFUSION_ATTENTION_BACKEND", "FLASH_ATTN")
DIFFUSION_ATTENTION_BACKEND is being defaulted to FLASH_ATTN at import time. This can force the FlashAttention backend even in environments where flash-attn isn’t installed (the backend raises ImportError and suggests using TORCH_SDPA), causing local runs to fail unexpectedly. Prefer leaving this env var unset by default (let platform selection decide), or set it conditionally only when FlashAttention is available / in CI where it’s guaranteed.
    os.environ.setdefault("DIFFUSION_ATTENTION_BACKEND", "FLASH_ATTN")
This is intended: the benchmark is meant to use flash attention.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b9836a790
Can't we reuse the original scripts?
Force-pushed from 8aebd40 to c45a3bb
Do you also need to modify the …
Yes, this is done. The latest CI result is at https://buildkite.com/vllm/vllm-omni/builds/5593/steps/canvas

Extra manual test: download all the JSON files in the Qwen Image Series Perf Test -> Artifacts and run the generate_nightly_perf_excel.py script. It produced the following output: nightly_perf_20260401-082845.xlsx. The commit_sha is empty because it is read from the Buildkite environment; the major fields were read successfully.
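Conceptually, the aggregation such a script performs can be sketched as follows. This is a hedged sketch, not the repository's actual implementation; the helper names and record fields (e.g. `throughput_qps`, the flattened reporting fields, the `commit_sha` provenance) are assumptions about what the runner emits.

```python
import json
from pathlib import Path

import pandas as pd


def collect_results(result_dir: str) -> pd.DataFrame:
    """Load every per-benchmark JSON in result_dir into one flat table."""
    rows = []
    for path in sorted(Path(result_dir).glob("*.json")):
        record = json.loads(path.read_text())
        # Flattened reporting fields land directly in each record,
        # so one JSON file becomes one spreadsheet row.
        rows.append({"file": path.name, **record})
    return pd.DataFrame(rows)


def write_excel(df: pd.DataFrame, out_path: str) -> None:
    """Write the aggregated table to a single-sheet Excel file."""
    df.to_excel(out_path, index=False)  # requires openpyxl
```

Run locally on downloaded artifacts, a field read from the Buildkite environment (such as `commit_sha`) would naturally come out empty, matching the observation above.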
Is this ready to merge?
    "enable-negative-prompt": true,
    "baseline": {
        "throughput_qps": 0.008,
        "latency_p99": 150.0,
latency_mean is a better metric than latency_p99
For the baseline metrics, maybe keep mean, median, and p99 together?
> For the baseline metrics, maybe keep mean, median, and p99 together?

Once we have a performance kanban, asserting on too many metrics here may no longer be needed. They fluctuate and sometimes block CI unexpectedly.
> latency_mean is a better metric than latency_p99

Agreed, it is less strict. I will change that.
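For illustration, the baseline block quoted above might then become something like this sketch. The `latency_mean` key name is an assumption mirroring the existing `latency_p99` convention, and the 150.0 threshold is carried over for illustration, not measured:

```json
{
  "enable-negative-prompt": true,
  "baseline": {
    "throughput_qps": 0.008,
    "latency_mean": 150.0
  }
}
```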
Pending #2415 yesterday. Should be able to merge today; I'll run CI again.
I resolved the conflicts. I think the previous CI passed nicely; let's wait until this CI run ends.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Add benchmark_results_to_excel.py for aggregating benchmark JSON into Excel. Adapt the Excel generator to the new diffusion JSON format.
Made-with: Cursor
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from b225d4e to 109713b
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from aabbde3 to 25b8f51
Rebased on the main branch; here is the CI result up to the above action. Note that there is one assertion error due to my updated threshold. I will loosen the threshold once again (appearing in more commits below soon). The logic remains intact and this is ready to merge.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from cdc360f to a4647bd
Note: all tests have passed in https://buildkite.com/vllm/vllm-omni/builds/6482/steps/canvas except for two unrelated ones. Below I will submit a minor fix for a pre-commit issue. I think we can ignore the latest CI run.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from 410376a to 9edda19
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Purpose
Add benchmark test for Qwen Image Edit and Qwen Image Edit 2509 (multi-image input), similar to #1805 and #2111 .
The second commit of this PR also incorporates the utility script from https://github.com/wtomin/vllm-omni/tree/read-benchmark and extends the report for the recent CI kanban template.
Note:
- This PR is based on #2179; pending that one to merge first.
- Also pending at least one run on a CI machine to finalize all the thresholds before merging.

Test Plan
The same benchmark config as Qwen Image:
- 2 sampling parameter groups, combined with
- 3 diffusion feature groups (same as Qwen Image).
Test Result
Passed on my side. Benchmark figures on 4×A100 are as follows: