[CI] Add on-demand performance test trigger via PR comments by inaniloquentee · Pull Request #2506 · vllm-project/vllm-omni

inaniloquentee · 2026-04-05T15:25:36Z

Purpose

This PR introduces a lightweight, on-demand GitHub Action workflow to trigger model-specific performance tests via PR comments. This prevents running heavy benchmarks on all models for every PR, saving significant CI/GPU resources and developer waiting time.

Key Features:

Flexible Triggering: Uses regex (grep -m 1 -o '/test-perf.*') to extract the command, allowing users to include natural text before the command (e.g., LGTM! /test-perf qwen3-omni).
Auto-Discovery: Automatically locates the .sh script under benchmarks/<model_name>/vllm_omni/ if no specific script is provided.
Specific Script Support: Allows passing a specific script name (e.g., /test-perf qwen3-tts bench_tts_serve.py) to handle models with .py benchmark entry points.
Visual Feedback: Acknowledges the command instantly with a 👀 reaction.
Security & Runner Match: Restricts trigger permissions to trusted roles (OWNER, MEMBER, COLLABORATOR) and targets self-hosted runners to ensure access to the necessary vLLM/GPU environment.
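
For readers who want to reason about the parsing and auto-discovery rules outside the workflow, here is a minimal Python sketch of the same idea. It is not the shell/grep logic used in this PR. `COMMAND_RE`, `PerfCommand`, `parse_comment`, and `resolve_script` are hypothetical names, and the rule "pick the first .sh file in sorted order" is an assumption about how auto-discovery might break ties.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

# Matches "/test-perf <model> [script]" anywhere in a comment. Each name must
# start with a word character, so ".." or "../x" is never accepted as a name.
COMMAND_RE = re.compile(
    r"/test-perf[ \t]+(\w[\w.-]*)(?:[ \t]+(\w[\w.-]*))?", re.ASCII
)


class PerfCommand(NamedTuple):
    model: str
    script: Optional[str]  # None means "auto-discover a .sh script"


def parse_comment(body: str) -> Optional[PerfCommand]:
    """Return the first /test-perf command found in a comment body, if any."""
    match = COMMAND_RE.search(body)
    if match is None:
        return None
    return PerfCommand(model=match.group(1), script=match.group(2))


def resolve_script(root: Path, cmd: PerfCommand) -> Optional[Path]:
    """Map a parsed command to a file under benchmarks/<model>/vllm_omni/."""
    bench_dir = root / "benchmarks" / cmd.model / "vllm_omni"
    if cmd.script is not None:
        candidate = bench_dir / cmd.script
        return candidate if candidate.is_file() else None
    # Assumed tie-break: first .sh file in sorted order.
    scripts = sorted(bench_dir.glob("*.sh"))
    return scripts[0] if scripts else None
```

With this sketch, `parse_comment("LGTM! /test-perf qwen3-omni")` yields a command with model `qwen3-omni` and no explicit script, so `resolve_script` falls back to auto-discovery. Note that any word following the model name is treated as a script name, so trailing prose on the same line would need extra handling.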

Test Plan

I have fully tested the workflow's trigger and routing logic in my own forked repository using a dummy PR.

Commented /test-perf <model_name> and verified successful model name extraction and auto-discovery of .sh files.
Commented with prefix text and a specific Python script, verifying the fallback logic and specific execution path.
Verified the GitHub API successfully adds the 👀 reaction.
Verified that unauthorized users cannot trigger the workflow.

Test Result

The CI workflow successfully intercepts the comment, verifies the author's association, parses the arguments safely, and dynamically constructs the correct execution path according to the benchmarks/README.md architecture.
(Note: Successful execution of the benchmark scripts themselves depends on the upstream self-hosted runners having the necessary GPU environment; only the routing and trigger mechanisms were verified in the fork.)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6248b969be

chatgpt-codex-connector · 2026-04-05T15:28:35Z

+      - name: Checkout PR Code
+        uses: actions/checkout@v4
+        with:
+          ref: refs/pull/${{ github.event.issue.number }}/head


Avoid executing PR head code on privileged self-hosted runners

This workflow checks out refs/pull/.../head and later executes benchmark scripts from that checkout, so a forked PR can inject arbitrary commands that run once any OWNER/MEMBER/COLLABORATOR comments /test-perf. Because issue_comment runs in the base-repo context, this creates a pwn-request path against your self-hosted runner (and any credentials/context it carries). Restrict execution to trusted PR sources or harden the job isolation/credentials before running PR-provided scripts.
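
One way to express the "trusted PR sources" restriction is to require both a trusted commenter and a same-repository head branch before any PR-provided script runs. The sketch below is an illustration, not part of this PR. It assumes `comment` and `pull_request` are dictionaries shaped like the GitHub REST API issue-comment and pull-request objects (the `issue_comment` event payload does not carry the pull-request object, so it would have to be fetched separately). `may_run_benchmarks` and `TRUSTED_ASSOCIATIONS` are hypothetical names.

```python
TRUSTED_ASSOCIATIONS = frozenset({"OWNER", "MEMBER", "COLLABORATOR"})


def may_run_benchmarks(comment: dict, pull_request: dict) -> bool:
    """Decide whether a /test-perf comment may run code on a privileged runner.

    Both conditions must hold:
      1. the commenter has a trusted association with the repository, and
      2. the PR branch lives in the base repository, not in a fork.
    """
    if comment.get("author_association") not in TRUSTED_ASSOCIATIONS:
        return False
    head_repo = (pull_request.get("head") or {}).get("repo") or {}
    base_repo = (pull_request.get("base") or {}).get("repo") or {}
    head_name = head_repo.get("full_name")
    # A deleted fork leaves head.repo as null; treat that as untrusted.
    return head_name is not None and head_name == base_repo.get("full_name")
```

This check only narrows who can trigger execution and from where. It does not replace isolating the runner or limiting the credentials available to the job.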

chatgpt-codex-connector · 2026-04-05T15:28:35Z

+            github.rest.reactions.createForIssueComment({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              comment_id: context.payload.comment.id,
+              content: 'eyes'


Declare GITHUB_TOKEN permissions needed for reaction write

The workflow posts a comment reaction via reactions.createForIssueComment but does not define a permissions block. In repositories/orgs with read-only default workflow token permissions, this call gets a 403 and the job stops before any benchmark runs, so the trigger silently fails for valid /test-perf comments.

gcanlin · 2026-04-09T14:34:01Z

/test-perf qwen3-omni

inaniloquentee · 2026-04-14T08:56:58Z

@yenuo26 and @gcanlin, it looks like the Omni · Doc Test with H100 failed, but after inspecting the logs, this is also unrelated to the CI pipeline configuration.

The failure is happening in tests/examples/offline_inference/test_qwen3_omni.py during the validate_python_snippets step. The extracted documentation snippet is trying to download a .flac audio file from HuggingFace using urllib, but it is hitting a 401 Unauthorized error:

urllib.error.HTTPError: HTTP Error 401: Unauthorized
(URL: https://huggingface.co/datasets/fixie-ai/librispeech_asr_dummy/resolve/main/1995/1836/1995-1836-0001.flac)

This indicates that the dataset snippet in the documentation either requires an explicit HF_TOKEN passed to the downloader, or the dataset has been gated/made private recently.

Since the label-based routing is working flawlessly and catching these actual codebase issues, I will leave this documentation fix to you. Let me know if we are good to merge.

yenuo26 · 2026-04-14T09:03:43Z

The previous nightly-CI issue has been fixed, so please ignore the earlier errors. Additionally, I have finished revising this PR based on your previous work, and it is now ready to be merged.

gcanlin

LGTM. Very helpful! cc @hsliuustc0106 @ywang96

yenuo26 · 2026-04-15T01:54:16Z

@inaniloquentee Hello, regarding the label configuration, we plan to adjust it to: omni-test, tts-test, diffusion-x2i-test, diffusion-x2v-test. Could you please grant me temporary push access to the patch-2 branch?

inaniloquentee · 2026-04-15T08:20:15Z

Awesome, thanks for the quick fix! Looking forward to the merge once the tests go green. 🎉

gcanlin · 2026-04-15T08:39:31Z

@Gaohan123 @yenuo26 If we don't want to introduce too many labels, I think we can consider the comment-based approach for selecting a single model's tests in the future. As more and more models are integrated into vllm-omni, this will become necessary.

yenuo26 · 2026-04-15T09:29:45Z

After the morning discussion, they want to change the labels to: omni-test, tts-test, diffusion-x2v-test, diffusion-x2i-test. This involves some script changes. Since I don't have permission to push to the inaniloquentee:patch-2 branch and cannot resolve the conflicts there, I have opened a new PR #2816 to proceed with the follow-up work.

yenuo26 · 2026-04-16T03:21:07Z

@inaniloquentee Thank you very much for your contribution. It is very helpful for subsequent development. However, I'm sorry that due to changes in the follow-up plan, there is additional development work. Since I don't have permission to push to your inaniloquentee:patch-2 branch and cannot resolve the conflicts for further development, I have opened a new PR #2816 and added you as a co-author.

inaniloquentee · 2026-04-16T06:35:19Z

Got it, that makes perfect sense! Thank you for handling the conflicts and getting this integrated alongside the pipeline refactoring.

Glad I could help with the trigger mechanism. Looking forward to the new release! 🎉

lishunyang12

Review: [CI] Add on-demand performance test trigger via PR comments

I reviewed the full diff (470 additions, 449 deletions, 9 files). Here are my findings:

Summary

This PR does three things:

Consolidates test-nightly-diffusion.yml into test-nightly.yml, eliminating the dynamic pipeline upload indirection.
Introduces label-based selective CI triggering (omni-test, tts-test, diffusion-x2iat-test, diffusion-x2v-test) so individual test groups can run without triggering the full nightly suite.
Splits the perf benchmark config into test_omni.json and test_tts.json, adding --config-file support to run_benchmark.py.

The direction is sound and the consolidation is a clear improvement. I have a few specific concerns below.

Issues

1. Bug: TTS Function Test runs Omni tests, not TTS tests (high severity)

In .buildkite/test-nightly.yml, the "TTS · Function Test" step uses:

pytest -s -v tests/e2e/online_serving/test_*_expansion.py -m "advanced_model and L4 and omni" --run-level "advanced_model"

The pytest marker filter is -m "advanced_model and L4 and omni" -- this selects omni tests, not TTS tests. This appears to be a copy-paste from the original "Omni · Function Test with L4" step. It should presumably use a tts marker or target TTS-specific test files.

2. PR title/description is misleading (low severity)

The title says "on-demand performance test trigger via PR comments" and the description details a GitHub Actions issue_comment workflow with regex parsing, reaction emojis, and author-association checks. However, the actual implementation uses Buildkite label-based triggering -- none of the described GitHub Actions workflow exists in the diff. The description should be updated to reflect what the PR actually implements.

3. _get_config_file_from_argv() runs at import time (minor / style)

In tests/dfx/perf/scripts/run_benchmark.py, _get_config_file_from_argv() parses sys.argv at module level, before pytest processes its own options. While pytest_addoption is also defined, the actual config loading happens via the sys.argv scan, making the pytest_addoption hook effectively decorative -- the value pytest parses for --config-file is never used for the config loading. This dual approach is confusing. Consider either:

Using only pytest_addoption + a fixture/hook to get the value at the right time, or
Dropping pytest_addoption and documenting that sys.argv parsing is intentional (matching the run_diffusion_benchmark.py pattern).
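
The first option could look roughly like the following conftest.py-style sketch, in which pytest's own option parsing is the single source of truth and the file is read only after argv has been processed. It assumes the JSON file holds a list of cases. `DEFAULT_CONFIG_FILE`, `load_benchmark_cases`, and the `benchmark_case` parameter name are hypothetical and are not taken from run_benchmark.py.

```python
import json
from pathlib import Path

# Hypothetical default; the real default path is not shown in this thread.
DEFAULT_CONFIG_FILE = "test_omni.json"


def pytest_addoption(parser):
    """Register --config-file so pytest itself owns the option."""
    parser.addoption(
        "--config-file",
        action="store",
        default=DEFAULT_CONFIG_FILE,
        help="Path to the JSON benchmark configuration.",
    )


def load_benchmark_cases(config):
    """Read the JSON file named by --config-file, after pytest parsed argv."""
    path = Path(config.getoption("--config-file"))
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def pytest_generate_tests(metafunc):
    """Parametrize tests that request `benchmark_case` from the loaded file."""
    if "benchmark_case" in metafunc.fixturenames:
        cases = load_benchmark_cases(metafunc.config)
        metafunc.parametrize("benchmark_case", cases)
```

Because nothing is read at import time, a test module can be imported without a config file present, and `config.getoption("--config-file")` is the only place the path comes from.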

4. YAML anchor *nightly_or_pr_label removed, conditions duplicated

The old &nightly_or_pr_label / *nightly_or_pr_label YAML anchor pattern kept the condition in one place. The new approach duplicates the label conditions across multiple groups. If a label name changes, multiple places need updating. Consider restoring an anchor or using Buildkite's env-based approach.

Positive aspects

Consolidating the diffusion pipeline into a single file reduces operational complexity.
Label-based triggering is more secure than comment-triggered CI (labels require write access).
Splitting TTS perf config into its own file is clean and enables independent testing.
Documentation updates in CI_5levels.md, test_guide.md, and l4_performance_tests.inc.md are thorough.

Note on PR status

Per the discussion thread, this PR has been superseded by PR #2816 (opened by @yenuo26 with additional label renaming and conflict resolution). The issues identified above should be addressed in that follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

lishunyang12

Review: [CI] Add on-demand performance test trigger via PR comments

Summary

This PR refactors the nightly Buildkite CI pipeline to split monolithic test runs into label-gated groups (omni, tts, diffusion-x2iat, diffusion-x2v), inlines the deleted test-nightly-diffusion.yml, and separates TTS benchmark config from the omni config file. The changes are generally sensible for enabling selective nightly test execution per PR label.

However, there are several issues that should be addressed before merging.

Critical Issues

1. PR title/description does not match the actual diff

The title says "Add on-demand performance test trigger via PR comments" and the description extensively describes a GitHub Action workflow triggered by /test-perf PR comments. However, the diff contains no GitHub Actions workflow file at all. The actual changes are a Buildkite pipeline refactor with label-based gating. The description appears to have been written for an entirely different implementation. Please update the title and description to accurately reflect the Buildkite label-gating approach.

2. TTS Function Test runs the wrong pytest markers

In test-nightly.yml, the "TTS Function Test" step runs:

pytest -s -v tests/e2e/online_serving/test_*_expansion.py -m "advanced_model and L4 and omni" --run-level "advanced_model"

This uses the omni marker, not a TTS marker. This means the TTS test group actually re-runs omni L4 tests instead of TTS-specific tests. Is there a TTS-specific marker or test pattern that should be used here instead?

3. run_benchmark.py: fragile import-time sys.argv parsing

The _get_config_file_from_argv() function parses sys.argv at module import time (top-level), then pytest_addoption registers the same --config-file option later. This creates a confusing dual-path: pytest's own option parsing is never actually used for determining CONFIG_FILE_PATH since that is resolved before pytest processes options. If pytest ever changes its argv handling or another conftest registers --config-file, this will break silently. Consider either:

Using only pytest_addoption + a fixture/hook to load configs lazily, or
Using only sys.argv parsing and dropping the pytest_addoption registration.

Having both is misleading -- a reader would expect request.config.getoption("--config-file") to be the source of truth, but it is not.

Minor Issues

4. Trailing whitespace in artifact download commands

In test-nightly.yml, these lines have trailing whitespace:

- buildkite-agent artifact download "tests/dfx/perf/results/*.json" . --step nightly-tts-performance   
- buildkite-agent artifact download "tests/dfx/perf/results/*.json" . --step nightly-omni-performance

Not functional but messy -- please trim.

5. Nightly perf distribution step no longer waits for diffusion perf

The depends_on for the distribution step was updated to depend on nightly-diffusion-x2iat-performance but the old nightly-qwen-image-performance key from the deleted diffusion file is simply replaced. The new nightly-diffusion-x2iat-performance key exists, so that is correct. However, note that X2V has no perf test step, so there is no artifact gap -- just confirming this is intentional.

6. Blank line

There is a double blank line after the Omni Perf Test block (before "TTS Model Test") in test-nightly.yml. Minor style nit.

Positive Aspects

Splitting tests into independently-triggerable groups via labels is a good architectural improvement for CI resource efficiency.
Separating TTS benchmark config from omni config is clean.
Doc updates are thorough and consistent.
The --config-file CLI interface for run_benchmark.py is a good user-facing improvement.

Please address the critical issues (especially #1 and #2) before this can be approved.

re-review

inaniloquentee added 14 commits April 5, 2026 21:43

Create perf_test_trigger.yml

54e14b6

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update perf_test_trigger.yml

42f41ea

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update perf_test_trigger.yml

aa6c576

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update perf_test_trigger.yml

5079b98

Signed-off-by: inaniloquentee <3051000145@qq.com>

Fix alignment of the center paragraph tag in README

4fcd542

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update perf_test_trigger.yml

3a9fbf9

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update perf_test_trigger.yml

d305196

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update perf_test_trigger.yml

f83bbf1

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update README with latest news on version 0.18.0

93ce7fa

Added latest news section with release details for version 0.18.0. Signed-off-by: inaniloquentee <3051000145@qq.com>

Update README.md

0177193

Signed-off-by: inaniloquentee <3051000145@qq.com>

Merge branch 'main' into test-trigger

ffe2d57

Signed-off-by: inaniloquentee <3051000145@qq.com>

Merge pull request #1 from inaniloquentee/test-trigger

877a665

Fix alignment of the center paragraph tag in README

Update perf_test_trigger.yml

1c37290

Signed-off-by: inaniloquentee <3051000145@qq.com>

Security Fix & Runner Fix

6248b96

Signed-off-by: inaniloquentee <3051000145@qq.com>

chatgpt-codex-connector Bot reviewed Apr 5, 2026

inaniloquentee added 12 commits April 7, 2026 16:18

Merge branch 'main' into patch-2

88041a4

Merge branch 'main' into patch-2

eb59eeb

Create test-perf.yml

a132ac8

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update pipeline.yml

523f463

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update test-perf.yml

704d073

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update test-perf.yml

e154499

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update test-perf.yml

f2dccd9

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update pipeline.yml

752dcca

Signed-off-by: inaniloquentee <3051000145@qq.com>

Delete .github/workflows/perf_test_trigger.yml

0cd9391

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update pipeline.yml

293c4c0

Signed-off-by: inaniloquentee <3051000145@qq.com>

Update README.md

a28d374

Signed-off-by: inaniloquentee <3051000145@qq.com>

Merge pull request #3 from inaniloquentee/main

0b4b6e6

sync main to patch-2

inaniloquentee requested a review from hsliuustc0106 as a code owner April 9, 2026 14:30

Delete .github/workflows/perf_test_trigger.yml

1da6c0a

Signed-off-by: inaniloquentee <3051000145@qq.com>

Gaohan123 reviewed Apr 15, 2026

Comment thread .buildkite/pipeline.yml

gcanlin approved these changes Apr 15, 2026

Add new test labels for nightly builds

47a7e94

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

yenuo26 removed the "ready" label (label to trigger buildkite CI) on Apr 15, 2026

yenuo26 added 9 commits April 15, 2026 11:40

Delete .buildkite/test-nightly-diffusion.yml

30e9326

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

Implement command line config file option

3fb5a0c

Add support for custom config file via command line argument. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

Update performance testing configuration in CI documentation

1de6a10

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

Fix performance test command paths in test guide

0dd8209

Updated paths for performance test commands in the test guide. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

Clarify test case addition format for L4 performance tests

c68c7c6

Updated the reference for L4-level performance test case addition to include specific test files. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

Delete 'test_qwen3_tts' test case from test_omni.json

759d06a

Removed test case for 'test_qwen3_tts' from the JSON file. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

Add test configuration for Qwen3 TTS

6507c2d

Added a JSON test configuration for Qwen3 TTS performance benchmarking. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

Rename Diffusion X2I performance test step

713f05d

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>

yenuo26 removed the "tts-test" label (label to trigger buildkite tts models test in nightly CI) on Apr 16, 2026

lishunyang12 reviewed Apr 16, 2026

lishunyang12 previously requested changes Apr 16, 2026
