[CI] Add on-demand performance test trigger via PR comments#2506
[CI] Add on-demand performance test trigger via PR comments#2506inaniloquentee wants to merge 58 commits into
Conversation
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Added latest news section with release details for version 0.18.0. Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Fix alignment of the center paragraph tag in README
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6248b969be
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| - name: Checkout PR Code | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| ref: refs/pull/${{ github.event.issue.number }}/head |
There was a problem hiding this comment.
Avoid executing PR head code on privileged self-hosted runners
This workflow checks out refs/pull/.../head and later executes benchmark scripts from that checkout, so a forked PR can inject arbitrary commands that run once any OWNER/MEMBER/COLLABORATOR comments /test-perf. Because issue_comment runs in the base-repo context, this creates a pwn-request path against your self-hosted runner (and any credentials/context it carries). Restrict execution to trusted PR sources or harden the job isolation/credentials before running PR-provided scripts.
Useful? React with 👍 / 👎.
| github.rest.reactions.createForIssueComment({ | ||
| owner: context.repo.owner, | ||
| repo: context.repo.repo, | ||
| comment_id: context.payload.comment.id, | ||
| content: 'eyes' |
There was a problem hiding this comment.
Declare GITHUB_TOKEN permissions needed for reaction write
The workflow posts a comment reaction via reactions.createForIssueComment but does not define a permissions block. In repositories/orgs with read-only default workflow token permissions, this call gets a 403 and the job stops before any benchmark runs, so the trigger silently fails for valid /test-perf comments.
Useful? React with 👍 / 👎.
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
Signed-off-by: inaniloquentee <3051000145@qq.com>
sync main to patch-2
|
/test-perf qwen3-omni |
Signed-off-by: inaniloquentee <3051000145@qq.com>
|
@yenuo26 and @gcanlin, It looks like the The failure is happening in
This indicates that the dataset snippet in the documentation either requires an explicit Since the label-based routing is working flawlessly and catching these actual codebase issues, I will leave this documentation fix to you. Let me know if we are good to merge. |
The previous nightly-CI issue has been fixed, so please ignore the earlier errors. Additionally, I have finished revising this PR based on your previous work, and it is now ready to be merged. |
There was a problem hiding this comment.
LGTM. Very helpful! cc @hsliuustc0106 @ywang96
|
@inaniloquentee hello, Regarding the label configuration, we plan to adjust it to: omni-test, tts-test, diffusion-x2i-test, diffusion-x2v-test. Could you please grant me temporary push access to the patch-2 branch? |
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Updated test labels and commands for nightly tests, including changes to the Omni and TTS function tests, and added new performance tests for Diffusion X2I and X2V. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Add support for custom config file via command line argument. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Updated paths for performance test commands in the test guide. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Updated the reference for L4-level performance test case addition to include specific test files. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Removed test case for 'test_qwen3_tts' from the JSON file. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Added a JSON test configuration for Qwen3 TTS performance benchmarking. Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Awesome, thanks for the quick fix! looking forward to the merge once the tests go green. 🎉 |
|
@Gaohan123 @yenuo26 If we don't wanna introduce too many tags, I think we can consider the comment way to specify one model tests in the future. When more and more models are integrated to vllm-omni, this request will be necessary. |
After the morning discussion, they want to change the labels to: omni-test, tts-test, diffusion-x2v-test, diffusion-x2i-test. This involves some script changes. Since I don't have permission to push to the inaniloquentee:patch-2 branch and cannot resolve the conflicts there, I have opened a new PR #2816 to proceed with the follow-up work. |
|
@inaniloquentee Thank you very much for your contribution. It is very helpful for subsequent development. However, I'm sorry that due to changes in the follow-up plan, there is additional development work. Since I don't have permission to push to your inaniloquentee:patch-2 branch and cannot resolve the conflicts for further development, I have opened a new PR #2816 and added you as a co-author. |
Got it, that makes perfect sense! Thank you for handling the conflicts and getting this integrated alongside the pipeline refactoring. Glad I could help with the trigger mechanism. Looking forward to the new release! 🎉 |
lishunyang12
left a comment
There was a problem hiding this comment.
Review: [CI] Add on-demand performance test trigger via PR comments
I reviewed the full diff (470 additions, 449 deletions, 9 files). Here are my findings:
Summary
This PR does three things:
- Consolidates
test-nightly-diffusion.ymlintotest-nightly.yml, eliminating the dynamic pipeline upload indirection. - Introduces label-based selective CI triggering (
omni-test,tts-test,diffusion-x2iat-test,diffusion-x2v-test) so individual test groups can run without triggering the full nightly suite. - Splits the perf benchmark config into
test_omni.jsonandtest_tts.json, adding--config-filesupport torun_benchmark.py.
The direction is sound and the consolidation is a clear improvement. I have a few specific concerns below.
Issues
1. Bug: TTS Function Test runs Omni tests, not TTS tests (high severity)
In .buildkite/test-nightly.yml, the "TTS · Function Test" step uses:
pytest -s -v tests/e2e/online_serving/test_*_expansion.py -m "advanced_model and L4 and omni" --run-level "advanced_model"
The pytest marker filter is -m "advanced_model and L4 and omni" -- this selects omni tests, not TTS tests. This appears to be a copy-paste from the original "Omni · Function Test with L4" step. It should presumably use a tts marker or target TTS-specific test files.
2. PR title/description is misleading (low severity)
The title says "on-demand performance test trigger via PR comments" and the description details a GitHub Actions issue_comment workflow with regex parsing, reaction emojis, and author-association checks. However, the actual implementation uses Buildkite label-based triggering -- none of the described GitHub Actions workflow exists in the diff. The description should be updated to reflect what the PR actually implements.
3. _get_config_file_from_argv() runs at import time (minor / style)
In tests/dfx/perf/scripts/run_benchmark.py, _get_config_file_from_argv() parses sys.argv at module-level before pytest processes its own options. While pytest_addoption is also defined, the actual config loading happens via the sys.argv scan, making the pytest_addoption hook effectively decorative -- pytest never uses its return value for the config loading. This dual approach is confusing. Consider either:
- Using only
pytest_addoption+ a fixture/hook to get the value at the right time, or - Dropping
pytest_addoptionand documenting thatsys.argvparsing is intentional (matching therun_diffusion_benchmark.pypattern).
4. YAML anchor *nightly_or_pr_label removed, conditions duplicated
The old &nightly_or_pr_label / *nightly_or_pr_label YAML anchor pattern kept the condition in one place. The new approach duplicates the label conditions across multiple groups. If a label name changes, multiple places need updating. Consider restoring an anchor or using Buildkite's env-based approach.
Positive aspects
- Consolidating the diffusion pipeline into a single file reduces operational complexity.
- Label-based triggering is more secure than comment-triggered CI (labels require write access).
- Splitting TTS perf config into its own file is clean and enables independent testing.
- Documentation updates in
CI_5levels.md,test_guide.md, andl4_performance_tests.inc.mdare thorough.
Note on PR status
Per the discussion thread, this PR has been superseded by PR #2816 (opened by @yenuo26 with additional label renaming and conflict resolution). The issues identified above should be addressed in that follow-up PR.
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
lishunyang12
left a comment
There was a problem hiding this comment.
Review: [CI] Add on-demand performance test trigger via PR comments
Summary
This PR refactors the nightly Buildkite CI pipeline to split monolithic test runs into label-gated groups (omni, tts, diffusion-x2iat, diffusion-x2v), inlines the deleted test-nightly-diffusion.yml, and separates TTS benchmark config from the omni config file. The changes are generally sensible for enabling selective nightly test execution per PR label.
However, there are several issues that should be addressed before merging.
Critical Issues
1. PR title/description does not match the actual diff
The title says "Add on-demand performance test trigger via PR comments" and the description extensively describes a GitHub Action workflow triggered by /test-perf PR comments. However, the diff contains no GitHub Actions workflow file at all. The actual changes are a Buildkite pipeline refactor with label-based gating. The description appears to have been written for an entirely different implementation. Please update the title and description to accurately reflect the Buildkite label-gating approach.
2. TTS Function Test runs the wrong pytest markers
In test-nightly.yml, the "TTS Function Test" step runs:
pytest -s -v tests/e2e/online_serving/test_*_expansion.py -m "advanced_model and L4 and omni" --run-level "advanced_model"
This uses the omni marker, not a TTS marker. This means the TTS test group actually re-runs omni L4 tests instead of TTS-specific tests. Is there a TTS-specific marker or test pattern that should be used here instead?
3. run_benchmark.py: fragile import-time sys.argv parsing
The _get_config_file_from_argv() function parses sys.argv at module import time (top-level), then pytest_addoption registers the same --config-file option later. This creates a confusing dual-path: pytest's own option parsing is never actually used for determining CONFIG_FILE_PATH since that is resolved before pytest processes options. If pytest ever changes its argv handling or another conftest registers --config-file, this will break silently. Consider either:
- Using only
pytest_addoption+ a fixture/hook to load configs lazily, or - Using only
sys.argvparsing and dropping thepytest_addoptionregistration.
Having both is misleading -- a reader would expect request.config.getoption("--config-file") to be the source of truth, but it is not.
Minor Issues
4. Trailing whitespace in artifact download commands
In test-nightly.yml, these lines have trailing whitespace:
- buildkite-agent artifact download "tests/dfx/perf/results/*.json" . --step nightly-tts-performance
- buildkite-agent artifact download "tests/dfx/perf/results/*.json" . --step nightly-omni-performance Not functional but messy -- please trim.
5. Nightly perf distribution step no longer waits for diffusion perf
The depends_on for the distribution step was updated to depend on nightly-diffusion-x2iat-performance but the old nightly-qwen-image-performance key from the deleted diffusion file is simply replaced. The new nightly-diffusion-x2iat-performance key exists, so that is correct. However, note that X2V has no perf test step, so there is no artifact gap -- just confirming this is intentional.
6. Blank line
There is a double blank line after the Omni Perf Test block (before "TTS Model Test") in test-nightly.yml. Minor style nit.
Positive Aspects
- Splitting tests into independently-triggerable groups via labels is a good architectural improvement for CI resource efficiency.
- Separating TTS benchmark config from omni config is clean.
- Doc updates are thorough and consistent.
- The
--config-fileCLI interface forrun_benchmark.pyis a good user-facing improvement.
Please address the critical issues (especially #1 and #2) before this can be approved.
Purpose
Fixes #2410
This PR introduces a lightweight, on-demand GitHub Action workflow to trigger model-specific performance tests via PR comments. This prevents running heavy benchmarks on all models for every PR, saving significant CI/GPU resources and developer waiting time.
Key Features:
grep -m 1 -o '/test-perf.*') to extract the command, allowing users to include natural text before the command (e.g.,LGTM! /test-perf qwen3-omni)..shscript underbenchmarks/<model_name>/vllm_omni/if no specific script is provided./test-perf qwen3-tts bench_tts_serve.py) to handle models with.pybenchmark entry points.OWNER,MEMBER,COLLABORATOR) and targetsself-hostedrunners to ensure access to the necessary vLLM/GPU environment.Test Plan
I have fully tested the workflow's trigger and routing logic in my own forked repository using a dummy PR.
/test-perf <model_name>and verified successful model name extraction and auto-discovery of.shfiles.Test Result
The CI workflow successfully intercepts the comment, verifies the author's association, parses the arguments safely, and dynamically constructs the correct execution path according to the
benchmarks/README.mdarchitecture.(Note: The actual successful execution of the benchmark scripts will rely on the upstream self-hosted runners having the necessary GPU environment, but the routing and trigger mechanisms work perfectly).