Skip to content

[CI] Add on-demand performance test trigger via PR comments#2506

Open
inaniloquentee wants to merge 58 commits into
vllm-project:mainfrom
inaniloquentee:patch-2
Open

[CI] Add on-demand performance test trigger via PR comments#2506
inaniloquentee wants to merge 58 commits into
vllm-project:mainfrom
inaniloquentee:patch-2

Conversation

@inaniloquentee
Copy link
Copy Markdown
Contributor

Purpose

Fixes #2410

This PR introduces a lightweight, on-demand GitHub Action workflow to trigger model-specific performance tests via PR comments. This prevents running heavy benchmarks on all models for every PR, saving significant CI/GPU resources and developer waiting time.

Key Features:

  1. Flexible Triggering: Uses regex (grep -m 1 -o '/test-perf.*') to extract the command, allowing users to include natural text before the command (e.g., LGTM! /test-perf qwen3-omni).
  2. Auto-Discovery: Automatically locates the .sh script under benchmarks/<model_name>/vllm_omni/ if no specific script is provided.
  3. Specific Script Support: Allows passing a specific script name (e.g., /test-perf qwen3-tts bench_tts_serve.py) to handle models with .py benchmark entry points.
  4. Visual Feedback: Acknowledges the command instantly with a 👀 reaction.
  5. Security & Runner Match: Restricts trigger permissions to trusted roles (OWNER, MEMBER, COLLABORATOR) and targets self-hosted runners to ensure access to the necessary vLLM/GPU environment.

Test Plan

I have fully tested the workflow's trigger and routing logic in my own forked repository using a dummy PR.

  1. Commented /test-perf <model_name> and verified successful model name extraction and auto-discovery of .sh files.
  2. Commented with prefix text and a specific python script, verifying the fallback logic and specific execution path.
  3. Verified the GitHub API successfully adds the 👀 reaction.
  4. Verified that unauthorized users cannot trigger the workflow.

Test Result

The CI workflow successfully intercepts the comment, verifies the author's association, parses the arguments safely, and dynamically constructs the correct execution path according to the benchmarks/README.md architecture.
(Note: The actual successful execution of the benchmark scripts will rely on the upstream self-hosted runners having the necessary GPU environment, but the routing and trigger mechanisms work perfectly).

Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Added latest news section with release details for version 0.18.0.

Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Fix alignment of the center paragraph tag in README
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6248b969be

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread .github/workflows/perf_test_trigger.yml Outdated
- name: Checkout PR Code
uses: actions/checkout@v4
with:
ref: refs/pull/${{ github.event.issue.number }}/head
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid executing PR head code on privileged self-hosted runners

This workflow checks out refs/pull/.../head and later executes benchmark scripts from that checkout, so a forked PR can inject arbitrary commands that run once any OWNER/MEMBER/COLLABORATOR comments /test-perf. Because issue_comment runs in the base-repo context, this creates a pwn-request path against your self-hosted runner (and any credentials/context it carries). Restrict execution to trusted PR sources or harden the job isolation/credentials before running PR-provided scripts.

Useful? React with 👍 / 👎.

Comment thread .github/workflows/perf_test_trigger.yml Outdated
Comment on lines +33 to +37
github.rest.reactions.createForIssueComment({
owner: context.repo.owner,
repo: context.repo.repo,
comment_id: context.payload.comment.id,
content: 'eyes'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Declare GITHUB_TOKEN permissions needed for reaction write

The workflow posts a comment reaction via reactions.createForIssueComment but does not define a permissions block. In repositories/orgs with read-only default workflow token permissions, this call gets a 403 and the job stops before any benchmark runs, so the trigger silently fails for valid /test-perf comments.

Useful? React with 👍 / 👎.

Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
@gcanlin
Copy link
Copy Markdown
Collaborator

gcanlin commented Apr 9, 2026

/test-perf qwen3-omni

Signed-off-by: inaniloquentee <3051000145@qq.com>
@inaniloquentee
Copy link
Copy Markdown
Contributor Author

@yenuo26 and @gcanlin, It looks like the Omni · Doc Test with H100 failed, but upon inspecting the logs, this is also unrelated to the CI pipeline configuration.

The failure is happening in tests/examples/offline_inference/test_qwen3_omni.py during the validate_python_snippets step. The extracted documentation snippet is trying to download a .flac audio file from HuggingFace using urllib, but it is hitting a 401 Unauthorized error:

urllib.error.HTTPError: HTTP Error 401: Unauthorized
(URL: https://huggingface.co/datasets/fixie-ai/librispeech_asr_dummy/resolve/main/1995/1836/1995-1836-0001.flac)

This indicates that the dataset snippet in the documentation either requires an explicit HF_TOKEN passed to the downloader, or the dataset has been gated/made private recently.

Since the label-based routing is working flawlessly and catching these actual codebase issues, I will leave this documentation fix to you. Let me know if we are good to merge.

@yenuo26
Copy link
Copy Markdown
Collaborator

yenuo26 commented Apr 14, 2026

@yenuo26 and @gcanlin, It looks like the Omni · Doc Test with H100 failed, but upon inspecting the logs, this is also unrelated to the CI pipeline configuration.

The failure is happening in tests/examples/offline_inference/test_qwen3_omni.py during the validate_python_snippets step. The extracted documentation snippet is trying to download a .flac audio file from HuggingFace using urllib, but it is hitting a 401 Unauthorized error:

urllib.error.HTTPError: HTTP Error 401: Unauthorized (URL: https://huggingface.co/datasets/fixie-ai/librispeech_asr_dummy/resolve/main/1995/1836/1995-1836-0001.flac)

This indicates that the dataset snippet in the documentation either requires an explicit HF_TOKEN passed to the downloader, or the dataset has been gated/made private recently.

Since the label-based routing is working flawlessly and catching these actual codebase issues, I will leave this documentation fix to you. Let me know if we are good to merge.

The previous nightly-CI issue has been fixed, so please ignore the earlier errors. Additionally, I have finished revising this PR based on your previous work, and it is now ready to be merged.

Comment thread .buildkite/pipeline.yml
Copy link
Copy Markdown
Collaborator

@gcanlin gcanlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Very helpful! cc @hsliuustc0106 @ywang96

@yenuo26
Copy link
Copy Markdown
Collaborator

yenuo26 commented Apr 15, 2026

@inaniloquentee hello, Regarding the label configuration, we plan to adjust it to: omni-test, tts-test, diffusion-x2i-test, diffusion-x2v-test. Could you please grant me temporary push access to the patch-2 branch?

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
@yenuo26 yenuo26 removed the ready label to trigger buildkite CI label Apr 15, 2026
yenuo26 added 9 commits April 15, 2026 11:40
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Updated test labels and commands for nightly tests, including changes to the Omni and TTS function tests, and added new performance tests for Diffusion X2I and X2V.

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Add support for custom config file via command line argument.

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Updated paths for performance test commands in the test guide.

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Updated the reference for L4-level performance test case addition to include specific test files.

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Removed test case for 'test_qwen3_tts' from the JSON file.

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Added a JSON test configuration for Qwen3 TTS performance benchmarking.

Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
@yenuo26 yenuo26 added diffusion-x2v-test label to trigger buildkite x2video series of diffusion models test in nightly CI diffusion-x2iat-test label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI tts-test label to trigger buildkite tts models test in nightly CI and removed diffusion-x2v-test label to trigger buildkite x2video series of diffusion models test in nightly CI diffusion-x2iat-test label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI labels Apr 15, 2026
@inaniloquentee
Copy link
Copy Markdown
Contributor Author

@inaniloquentee hello, Regarding the label configuration, we plan to adjust it to: omni-test, tts-test, diffusion-x2i-test, diffusion-x2v-test. Could you please grant me temporary push access to the patch-2 branch?

Awesome, thanks for the quick fix! looking forward to the merge once the tests go green. 🎉

@gcanlin
Copy link
Copy Markdown
Collaborator

gcanlin commented Apr 15, 2026

@Gaohan123 @yenuo26 If we don't wanna introduce too many tags, I think we can consider the comment way to specify one model tests in the future. When more and more models are integrated to vllm-omni, this request will be necessary.

@yenuo26
Copy link
Copy Markdown
Collaborator

yenuo26 commented Apr 15, 2026

@Gaohan123 @yenuo26 If we don't wanna introduce too many tags, I think we can consider the comment way to specify one model tests in the future. When more and more models are integrated to vllm-omni, this request will be necessary.

After the morning discussion, they want to change the labels to: omni-test, tts-test, diffusion-x2v-test, diffusion-x2i-test. This involves some script changes. Since I don't have permission to push to the inaniloquentee:patch-2 branch and cannot resolve the conflicts there, I have opened a new PR #2816 to proceed with the follow-up work.

@yenuo26 yenuo26 removed the tts-test label to trigger buildkite tts models test in nightly CI label Apr 16, 2026
@yenuo26
Copy link
Copy Markdown
Collaborator

yenuo26 commented Apr 16, 2026

@inaniloquentee Thank you very much for your contribution. It is very helpful for subsequent development. However, I'm sorry that due to changes in the follow-up plan, there is additional development work. Since I don't have permission to push to your inaniloquentee:patch-2 branch and cannot resolve the conflicts for further development, I have opened a new PR #2816 and added you as a co-author.

@inaniloquentee
Copy link
Copy Markdown
Contributor Author

@inaniloquentee Thank you very much for your contribution. It is very helpful for subsequent development. However, I'm sorry that due to changes in the follow-up plan, there is additional development work. Since I don't have permission to push to your inaniloquentee:patch-2 branch and cannot resolve the conflicts for further development, I have opened a new PR #2816 and added you as a co-author.

Got it, that makes perfect sense! Thank you for handling the conflicts and getting this integrated alongside the pipeline refactoring.

Glad I could help with the trigger mechanism. Looking forward to the new release! 🎉

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [CI] Add on-demand performance test trigger via PR comments

I reviewed the full diff (470 additions, 449 deletions, 9 files). Here are my findings:


Summary

This PR does three things:

  1. Consolidates test-nightly-diffusion.yml into test-nightly.yml, eliminating the dynamic pipeline upload indirection.
  2. Introduces label-based selective CI triggering (omni-test, tts-test, diffusion-x2iat-test, diffusion-x2v-test) so individual test groups can run without triggering the full nightly suite.
  3. Splits the perf benchmark config into test_omni.json and test_tts.json, adding --config-file support to run_benchmark.py.

The direction is sound and the consolidation is a clear improvement. I have a few specific concerns below.


Issues

1. Bug: TTS Function Test runs Omni tests, not TTS tests (high severity)

In .buildkite/test-nightly.yml, the "TTS · Function Test" step uses:

pytest -s -v tests/e2e/online_serving/test_*_expansion.py -m "advanced_model and L4 and omni" --run-level "advanced_model"

The pytest marker filter is -m "advanced_model and L4 and omni" -- this selects omni tests, not TTS tests. This appears to be a copy-paste from the original "Omni · Function Test with L4" step. It should presumably use a tts marker or target TTS-specific test files.

2. PR title/description is misleading (low severity)

The title says "on-demand performance test trigger via PR comments" and the description details a GitHub Actions issue_comment workflow with regex parsing, reaction emojis, and author-association checks. However, the actual implementation uses Buildkite label-based triggering -- none of the described GitHub Actions workflow exists in the diff. The description should be updated to reflect what the PR actually implements.

3. _get_config_file_from_argv() runs at import time (minor / style)

In tests/dfx/perf/scripts/run_benchmark.py, _get_config_file_from_argv() parses sys.argv at module-level before pytest processes its own options. While pytest_addoption is also defined, the actual config loading happens via the sys.argv scan, making the pytest_addoption hook effectively decorative -- pytest never uses its return value for the config loading. This dual approach is confusing. Consider either:

  • Using only pytest_addoption + a fixture/hook to get the value at the right time, or
  • Dropping pytest_addoption and documenting that sys.argv parsing is intentional (matching the run_diffusion_benchmark.py pattern).

4. YAML anchor *nightly_or_pr_label removed, conditions duplicated

The old &nightly_or_pr_label / *nightly_or_pr_label YAML anchor pattern kept the condition in one place. The new approach duplicates the label conditions across multiple groups. If a label name changes, multiple places need updating. Consider restoring an anchor or using Buildkite's env-based approach.


Positive aspects

  • Consolidating the diffusion pipeline into a single file reduces operational complexity.
  • Label-based triggering is more secure than comment-triggered CI (labels require write access).
  • Splitting TTS perf config into its own file is clean and enables independent testing.
  • Documentation updates in CI_5levels.md, test_guide.md, and l4_performance_tests.inc.md are thorough.

Note on PR status

Per the discussion thread, this PR has been superseded by PR #2816 (opened by @yenuo26 with additional label renaming and conflict resolution). The issues identified above should be addressed in that follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [CI] Add on-demand performance test trigger via PR comments

Summary

This PR refactors the nightly Buildkite CI pipeline to split monolithic test runs into label-gated groups (omni, tts, diffusion-x2iat, diffusion-x2v), inlines the deleted test-nightly-diffusion.yml, and separates TTS benchmark config from the omni config file. The changes are generally sensible for enabling selective nightly test execution per PR label.

However, there are several issues that should be addressed before merging.


Critical Issues

1. PR title/description does not match the actual diff

The title says "Add on-demand performance test trigger via PR comments" and the description extensively describes a GitHub Action workflow triggered by /test-perf PR comments. However, the diff contains no GitHub Actions workflow file at all. The actual changes are a Buildkite pipeline refactor with label-based gating. The description appears to have been written for an entirely different implementation. Please update the title and description to accurately reflect the Buildkite label-gating approach.

2. TTS Function Test runs the wrong pytest markers

In test-nightly.yml, the "TTS Function Test" step runs:

pytest -s -v tests/e2e/online_serving/test_*_expansion.py -m "advanced_model and L4 and omni" --run-level "advanced_model"

This uses the omni marker, not a TTS marker. This means the TTS test group actually re-runs omni L4 tests instead of TTS-specific tests. Is there a TTS-specific marker or test pattern that should be used here instead?

3. run_benchmark.py: fragile import-time sys.argv parsing

The _get_config_file_from_argv() function parses sys.argv at module import time (top-level), then pytest_addoption registers the same --config-file option later. This creates a confusing dual-path: pytest's own option parsing is never actually used for determining CONFIG_FILE_PATH since that is resolved before pytest processes options. If pytest ever changes its argv handling or another conftest registers --config-file, this will break silently. Consider either:

  • Using only pytest_addoption + a fixture/hook to load configs lazily, or
  • Using only sys.argv parsing and dropping the pytest_addoption registration.

Having both is misleading -- a reader would expect request.config.getoption("--config-file") to be the source of truth, but it is not.


Minor Issues

4. Trailing whitespace in artifact download commands

In test-nightly.yml, these lines have trailing whitespace:

- buildkite-agent artifact download "tests/dfx/perf/results/*.json" . --step nightly-tts-performance   
- buildkite-agent artifact download "tests/dfx/perf/results/*.json" . --step nightly-omni-performance                     

Not functional but messy -- please trim.

5. Nightly perf distribution step no longer waits for diffusion perf

The depends_on for the distribution step was updated to depend on nightly-diffusion-x2iat-performance but the old nightly-qwen-image-performance key from the deleted diffusion file is simply replaced. The new nightly-diffusion-x2iat-performance key exists, so that is correct. However, note that X2V has no perf test step, so there is no artifact gap -- just confirming this is intentional.

6. Blank line

There is a double blank line after the Omni Perf Test block (before "TTS Model Test") in test-nightly.yml. Minor style nit.


Positive Aspects

  • Splitting tests into independently-triggerable groups via labels is a good architectural improvement for CI resource efficiency.
  • Separating TTS benchmark config from omni config is clean.
  • Doc updates are thorough and consistent.
  • The --config-file CLI interface for run_benchmark.py is a good user-facing improvement.

Please address the critical issues (especially #1 and #2) before this can be approved.

@lishunyang12 lishunyang12 dismissed their stale review April 16, 2026 14:56

re-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Trigger Model-Specific Performance Tests via Tags in vLLM-Omni

5 participants