[CI] Qwen image edit performance benchmark #2216
Conversation
Force-pushed from 303b0cc to aff539a
Is this PR ready for review? @fhfuih
Force-pushed from 945fc1e to 7b9836a
Ready now! And it also needs a nightly CI tag.
Pull request overview
Adds nightly diffusion performance benchmark coverage for the Qwen-Image-Edit model family (including the multi-image 2509 variant) and extends the existing diffusion benchmark runner/serving scripts to better support/report these runs.
Changes:
- Add new perf config JSONs for Qwen/Qwen-Image-Edit and Qwen/Qwen-Image-Edit-2509, and wire them into the nightly Buildkite diffusion perf step.
- Extend the diffusion perf runner to emit richer, flattened reporting fields (resolution/parallelism/cache/etc.) and provenance metadata in the aggregated JSON output.
- Update the diffusion serving benchmark to support generating multiple synthetic input images via `--num-input-images` for random i2i-style tasks.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `tests/dfx/perf/tests/test_qwen_image_vllm_omni.json` | Minor formatting cleanup in the existing Qwen-Image perf config. |
| `tests/dfx/perf/tests/test_qwen_image_edit_vllm_omni.json` | New Qwen-Image-Edit perf config (single device / ulysses+cfg+vae / cache_dit). |
| `tests/dfx/perf/tests/test_qwen_image_edit_2509_vllm_omni.json` | New Qwen-Image-Edit-2509 perf config with 2-input-image benchmarks. |
| `tests/dfx/perf/scripts/run_diffusion_benchmark.py` | Add flattened reporting fields and commit/build provenance; refactor server config handling. |
| `benchmarks/diffusion/diffusion_benchmark_serving.py` | Add `--num-input-images` and generate multiple synthetic input images for the random dataset. |
| `tests/dfx/benchmark_results_to_excel.py` | New local utility to convert benchmark JSONs into an Excel summary. |
| `pyproject.toml` | Add pandas to dev dependencies. |
| `.gitignore` | Unignore perf config JSONs under tests/dfx/perf/tests/. |
| `.buildkite/test-nightly.yml` | Run the new Qwen-Image-Edit benchmarks in the nightly diffusion perf step; set env vars; mark the step soft-fail. |
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    os.environ["VLLM_TEST_CLEAN_GPU_MEMORY"] = "0"
    os.environ.setdefault("DIFFUSION_ATTENTION_BACKEND", "FLASH_ATTN")
DIFFUSION_ATTENTION_BACKEND is being defaulted to FLASH_ATTN at import time. This can force the FlashAttention backend even in environments where flash-attn isn’t installed (the backend raises ImportError and suggests using TORCH_SDPA), causing local runs to fail unexpectedly. Prefer leaving this env var unset by default (let platform selection decide), or set it conditionally only when FlashAttention is available / in CI where it’s guaranteed.
    os.environ.setdefault("DIFFUSION_ATTENTION_BACKEND", "FLASH_ATTN")
This is intended: the benchmark is meant to use flash attention.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b9836a790
Can't we reuse the original scripts?
Force-pushed from 8aebd40 to c45a3bb
Do you also need to modify the …
Yes, this is done. The latest CI result is at https://buildkite.com/vllm/vllm-omni/builds/5593/steps/canvas

Extra manual test: download all the JSON files in the Qwen Image Series Perf Test -> Artifacts and run the generate_nightly_perf_excel.py script. It produced the following output: nightly_perf_20260401-082845.xlsx. The commit_sha is empty because it is read from the Buildkite environment; the major fields were read successfully.
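Conceptually, the aggregation such a script performs can be sketched as follows. This is a hedged sketch, not the repository's actual implementation; the helper names and record fields (e.g. `throughput_qps`, the flattened reporting fields, the `commit_sha` provenance) are assumptions about what the runner emits.

```python
import json
from pathlib import Path

import pandas as pd


def collect_results(result_dir: str) -> pd.DataFrame:
    """Load every per-benchmark JSON in result_dir into one flat table."""
    rows = []
    for path in sorted(Path(result_dir).glob("*.json")):
        record = json.loads(path.read_text())
        # Flattened reporting fields land directly in each record,
        # so one JSON file becomes one spreadsheet row.
        rows.append({"file": path.name, **record})
    return pd.DataFrame(rows)


def write_excel(df: pd.DataFrame, out_path: str) -> None:
    """Write the aggregated table to a single-sheet Excel file."""
    df.to_excel(out_path, index=False)  # requires openpyxl
```

Run locally on downloaded artifacts, a field read from the Buildkite environment (such as `commit_sha`) would naturally come out empty, matching the observation above.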
Is this ready to merge?
    "enable-negative-prompt": true,
    "baseline": {
        "throughput_qps": 0.008,
        "latency_p99": 150.0,
latency_mean is a better metric than latency_p99
For the baseline metrics, maybe keep mean, median, and p99 together?
> For the baseline metrics, maybe keep mean, median, and p99 together?

Once we have a performance kanban, asserting on too many metrics here may no longer be needed. They fluctuate and sometimes block CI unexpectedly.
> latency_mean is a better metric than latency_p99

Agreed, it is less strict. I will change that.
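For illustration, the baseline block quoted above might then become something like this sketch. The `latency_mean` key name is an assumption mirroring the existing `latency_p99` convention, and the 150.0 threshold is carried over for illustration, not measured:

```json
{
  "enable-negative-prompt": true,
  "baseline": {
    "throughput_qps": 0.008,
    "latency_mean": 150.0
  }
}
```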
Pending #2415 yesterday. Should be able to merge today; I'll run CI again.
I resolved the conflicts. I think the previous CI passed nicely; let's wait until this CI run ends.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Add benchmark_results_to_excel.py for aggregating benchmark JSON into Excel. Adapt the Excel generator to the new diffusion JSON format.
Made-with: Cursor
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from b225d4e to 109713b
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from aabbde3 to 25b8f51
Rebased on the main branch; here is the CI result up to the above action. Note that there is one assertion error due to my updated threshold. I will loosen the threshold once again (appearing in more commits below soon). The logic remains intact and this is ready to merge.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from cdc360f to a4647bd
Note: all tests have passed in https://buildkite.com/vllm/vllm-omni/builds/6482/steps/canvas except for two unrelated ones. Below I will submit a minor fix for a pre-commit issue. I think we can ignore the latest CI run.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Force-pushed from 410376a to 9edda19
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Purpose
Add benchmark test for Qwen Image Edit and Qwen Image Edit 2509 (multi-image input), similar to #1805 and #2111 .
The second commit of this PR also incorporates the utility script from https://github.com/wtomin/vllm-omni/tree/read-benchmark and extends the report for the recent CI kanban template.
Note:
- This PR is based on #2179; pending that one to merge first.
- Also pending at least one run on a CI machine to finalize all the thresholds before merging.

Test Plan
The same benchmark config as Qwen Image:
- 2 sampling parameter groups, combined with
- 3 diffusion feature groups (same as Qwen Image).
Test Result
Passed on my side. Benchmark figures on 4×A100 are as follows: