ci: refactor PR tests to hide failed spot jobs from PR status#2500
ci: refactor PR tests to hide failed spot jobs from PR status#2500
Conversation
|
Note Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported. |
📝 WalkthroughWalkthroughReplaces a prior matrixed in-repo test flow with a two-workflow design: an orchestrator ( Changes
Sequence Diagram(s)sequenceDiagram
participant GH as GitHub (PR / Manual)
participant Orch as Orchestrator\n.github/workflows/pr-test.yml
participant App as GitHub App Service
participant Exec as Executor\n.github/workflows/pr-test-runner.yml
participant Checks as GitHub Checks API
GH->>Orch: PR event or manual dispatch
Orch->>Orch: Validate PR authorization, Get HEAD SHA
Orch->>App: Generate token (flashinfer)
App-->>Orch: token A
Orch->>App: Generate token (ci-infra)
App-->>Orch: token B
Orch->>Checks: Create check runs (AOT, A10G, T4, Summary)
Checks-->>Orch: check IDs
Orch->>Exec: Dispatch executor (head_sha, docker_tag, check IDs, skip flags)
Exec->>Exec: Run AOT matrix & GPU shards (spot monitor → analyze → possible rerun)
Exec->>Checks: Update check runs & post aggregated summary
Checks-->>GH: Reflect final statuses on PR
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
02322b8 to
b517b7a
Compare
1dfb89f to
194f10e
Compare
|
@flashinfer-bot rerun |
194f10e to
e085d9d
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/workflows/pr-test-runner.yml:
- Around line 56-67: The workflow uses fail-fast: true for the aot-build-import
matrix job which causes sibling spot jobs to be cancelled and produce "context
canceled" logs that analyze-aot-failure misclassifies as spot termination;
change the strategy of the aot-build-import job to fail-fast: false so each
matrix permutation runs to completion and yields independent logs (or
alternatively update the analyze-aot-failure logic to distinguish
GitHub-initiated cancellations from spot preemptions), referencing the
aot-build-import job and the analyze-aot-failure matcher in your change.
- Line 658: The use of printf with SUMMARY_LINES as the unquoted format string
is fragile and can misinterpret percent signs; change the call to pass a literal
format and the variable as an argument (e.g., use printf '%s\n' "$SUMMARY_LINES"
or printf '%s' "$SUMMARY_LINES") so printf treats the content as data, not a
format string; update the line that sets SUMMARY (the printf invocation that
references SUMMARY_LINES) accordingly.
In @.github/workflows/pr-test.yml:
- Around line 198-200: The RUNNER_URL is hardcoded to the ci-infra repo causing
"view test runs" links to point to the wrong repository; update the RUNNER_URL
assignment to build the URL from the current repository variable (use
github.repository) so it points to /${{ github.repository
}}/actions/workflows/pr-test-runner.yml; locate the RUNNER_URL definition in the
same block that sets SHA and REPO and replace the hardcoded path so
workflow_dispatch and pr-test-runner.yml references remain consistent across
repositories.
- Around line 192-240: The Create Check Runs step (id: create-checks) only runs
when github.event_name == 'pull_request', leaving AOT_CHECK, A10G_CHECK,
T4_CHECK and SUMMARY_CHECK empty for push and workflow_dispatch and causing
downstream gates to skip tests; remove or broaden that if so the step always
runs (or runs for push and workflow_dispatch as well), ensure it uses the
correct SHA variable (SHA="${{ needs.setup.outputs.head_sha }}" or fallback to
github.sha) so the gh api calls that populate AOT_CHECK, A10G_CHECK, T4_CHECK
and SUMMARY_CHECK always create check-runs and write their ids to GITHUB_OUTPUT
for the runner jobs to consume.
🧹 Nitpick comments (5)
.github/workflows/pr-test-runner.yml (3)
150-161: Rerun matrix always includes all combinations, even if only a subset failed.The rerun matrix is hardcoded to all 8 arch × cuda combos regardless of which specific jobs were spot-terminated. If only one combination was preempted, all 8 are re-executed on the more expensive on-demand instances.
Consider building the matrix from only the actually-failed/cancelled job names to avoid unnecessary reruns.
108-148: Spot-termination analysis logic is duplicated three times.The log-analysis block (download logs → check for spot markers) is near-identical across
analyze-aot-failure(Lines 117-148),analyze-gpu-a10g-failure(Lines 270-301), andanalyze-gpu-t4-failure(Lines 407-438), differing only in the jq job-name filter. Consider extracting this into a shared shell script (e.g.,scripts/analyze_spot_failure.sh) that takes the filter pattern as an argument.
74-81: Cleanup steps are duplicated across six jobs.The docker/workspace cleanup block is copy-pasted across
aot-build-import,aot-build-import-rerun,gpu-tests-a10g,gpu-tests-a10g-rerun,gpu-tests-t4, andgpu-tests-t4-rerun. Consider extracting this into a composite action (e.g.,.github/actions/cleanup/action.yml) to centralize maintenance..github/workflows/pr-test.yml (2)
150-151: TODO: Restore^\.github/to skip patterns before merging.Line 150 has a TODO from the author to add back the
.github/skip pattern. This is currently removed (likely to allow testing CI changes), but should be restored before merging to main to avoid running the full test suite on CI-only changes.
293-300: Step name "Post Comment" is misleading — it writes toGITHUB_STEP_SUMMARY, not a PR comment.The unauthorized user won't see this unless they navigate to the workflow run. If the intent is to notify on the PR itself, use
gh pr comment. Otherwise, rename the step to match its behavior (e.g., "Post Step Summary").
| SHA="${{ needs.setup.outputs.head_sha }}" | ||
| REPO="${{ github.repository }}" | ||
| RUNNER_URL="https://github.com/flashinfer-ai/ci-infra/actions/workflows/pr-test-runner.yml" |
There was a problem hiding this comment.
RUNNER_URL points to ci-infra but the primary dispatch targets the flashinfer repo.
Line 200 hardcodes the URL to flashinfer-ai/ci-infra/actions/workflows/pr-test-runner.yml, but the primary workflow_dispatch (Line 258-259) triggers pr-test-runner.yml in ${{ github.repository }} (flashinfer). Once the workflow is established on main, the "view test runs" link in check-run summaries will point to the wrong repo.
🤖 Prompt for AI Agents
In @.github/workflows/pr-test.yml around lines 198 - 200, The RUNNER_URL is
hardcoded to the ci-infra repo causing "view test runs" links to point to the
wrong repository; update the RUNNER_URL assignment to build the URL from the
current repository variable (use github.repository) so it points to /${{
github.repository }}/actions/workflows/pr-test-runner.yml; locate the RUNNER_URL
definition in the same block that sets SHA and REPO and replace the hardcoded
path so workflow_dispatch and pr-test-runner.yml references remain consistent
across repositories.
There was a problem hiding this comment.
The link is only visible briefly during "in progress" and gets overwritten when tests complete.
There was a problem hiding this comment.
@yongwww, understood! Since the check runs get updated with the correct information once the tests complete, having the temporary link point to ci-infra during the brief "in progress" phase is acceptable. Thanks for clarifying the design intent.
✏️ Learnings added
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test.yml:198-200
Timestamp: 2026-02-06T06:07:10.364Z
Learning: In `.github/workflows/pr-test.yml`, the RUNNER_URL in the "Create Check Runs" step intentionally points to `flashinfer-ai/ci-infra/actions/workflows/pr-test-runner.yml` even though the primary dispatch may target the flashinfer repo. This is acceptable because the link is only visible during the "in progress" phase and gets overwritten when the pr-test-runner.yml workflow updates the check runs with final results.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T06:06:03.934Z
Learning: In `.github/workflows/pr-test-runner.yml`, the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns.
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T05:51:00.326Z
Learning: In `.github/workflows/pr-test-runner.yml`, the team prefers to keep `fail-fast: true` for matrix jobs to reduce overall test time, even with the potential for false-positive spot-termination reruns.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/pr-test-runner.yml:
- Around line 530-534: The gh api command is vulnerable to script injection
because inputs like ${{ inputs.aot_check_id }} are expanded directly into the
run: shell; change the invocation that builds the gh api PATCH (the block using
gh api -X PATCH repos/${{ github.repository }}/check-runs/${{
inputs.aot_check_id }} -f status="completed" -f conclusion="$CONCLUSION" -F
output[title]="$TITLE" -F output[summary]="$SUMMARY") to instead pass the inputs
through environment variables (e.g., set AOT_CHECK_ID from inputs in env and
reference $AOT_CHECK_ID inside the shell) so untrusted characters are not
interpreted by the shell; apply the same fix to the other similar commands that
reference inputs.gpu_a10g_check_id, inputs.gpu_t4_check_id, and
inputs.summary_check_id.
- Around line 48-50: The orchestrator job in pr-test-runner.yml must proactively
clean up stale in_progress check runs before creating/dispatching new ones: add
a step in the orchestrator job (before dispatching the runner) that queries the
GitHub Checks API for check runs belonging to the same PR/commit/ref (the ones
created by pr-test.yml), and for any with status "in_progress" call the Checks
API to update their conclusion/status (e.g., "cancelled" or "neutral") so they
are not left orphaned; reference the orchestrator job and the update-check-runs
logic (and the check-run creation in pr-test.yml) when implementing this API
call and ensure the step runs under a token with write access to checks
(GITHUB_TOKEN).
🧹 Nitpick comments (2)
.github/workflows/pr-test-runner.yml (2)
150-161: Hardcoded rerun matrix duplicates the original job matrix.The arch/CUDA values on Lines 155-156 are manually duplicated from Lines 69-70. If the original matrix is updated (e.g., adding
cu131), this rerun matrix must be updated in lockstep or the rerun will miss combinations silently. Consider generating the rerun matrix programmatically from the run's job names, or extracting the combinations into a reusable definition.
108-148: Near-identical spot-termination analysis logic repeated three times.The analyze-aot-failure, analyze-gpu-a10g-failure, and analyze-gpu-t4-failure jobs share the same log-download-and-grep logic, differing only in the jq job-name filter. Consider extracting this into a composite action that accepts the name filter as an input, reducing ~90 lines of duplication.
Also applies to: 261-301, 399-438
| concurrency: | ||
| group: ${{ inputs.concurrency_key || github.run_id }} | ||
| cancel-in-progress: true |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd 'pr-test.yml' .github/workflows --exec cat {}Repository: flashinfer-ai/flashinfer
Length of output: 11588
Orphaned check runs when concurrency cancels the runner workflow.
When a new push arrives while a previous pr-test-runner.yml is in flight, concurrency (group pr-test-${{ github.ref }}) cancels both the old pr-test.yml and old pr-test-runner.yml before the update-check-runs job can complete. The old check runs (created by pr-test.yml) remain stuck in "in_progress" indefinitely, while new check runs are created for the incoming push.
Clean up stale in-progress check runs before creating new ones in the orchestrator job, or mark them as cancelled/neutral before dispatching the new runner.
🤖 Prompt for AI Agents
In @.github/workflows/pr-test-runner.yml around lines 48 - 50, The orchestrator
job in pr-test-runner.yml must proactively clean up stale in_progress check runs
before creating/dispatching new ones: add a step in the orchestrator job (before
dispatching the runner) that queries the GitHub Checks API for check runs
belonging to the same PR/commit/ref (the ones created by pr-test.yml), and for
any with status "in_progress" call the Checks API to update their
conclusion/status (e.g., "cancelled" or "neutral") so they are not left
orphaned; reference the orchestrator job and the update-check-runs logic (and
the check-run creation in pr-test.yml) when implementing this API call and
ensure the step runs under a token with write access to checks (GITHUB_TOKEN).
There was a problem hiding this comment.
Check runs are per-SHA. When a new commit is pushed, old check runs are no longer visible on the PR status panel, so orphaned checks don't affect the workflow.
There was a problem hiding this comment.
@yongwww, you're absolutely right! I appreciate the clarification. Since check runs are tied to specific commit SHAs, GitHub automatically hides the old check runs from the PR status panel when new commits are pushed. The orphaned in-progress check runs from cancelled workflows won't clutter the UI or affect the workflow, as they're only associated with superseded commits.
My concern about needing cleanup logic doesn't apply here. Thank you for explaining this!
✏️ Learnings added
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:48-50
Timestamp: 2026-02-06T06:23:16.701Z
Learning: In GitHub Actions workflows with check runs, check runs are per-SHA. When a new commit is pushed to a PR, old check runs associated with previous commit SHAs are automatically hidden from the PR status panel and only the latest commit's check runs are displayed. Therefore, orphaned in-progress check runs from cancelled workflows do not clutter the PR status or affect the workflow.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test.yml:198-200
Timestamp: 2026-02-06T06:07:10.364Z
Learning: In `.github/workflows/pr-test.yml`, the RUNNER_URL in the "Create Check Runs" step intentionally points to `flashinfer-ai/ci-infra/actions/workflows/pr-test-runner.yml` even though the primary dispatch may target the flashinfer repo. This is acceptable because the link is only visible during the "in progress" phase and gets overwritten when the pr-test-runner.yml workflow updates the check runs with final results.
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T05:51:00.326Z
Learning: In `.github/workflows/pr-test-runner.yml`, the team prefers to keep `fail-fast: true` for matrix jobs to reduce overall test time, even with the potential for false-positive spot-termination reruns.
Learnt from: yongwww
Repo: flashinfer-ai/flashinfer PR: 2500
File: .github/workflows/pr-test-runner.yml:56-67
Timestamp: 2026-02-06T06:06:03.934Z
Learning: In `.github/workflows/pr-test-runner.yml`, the analyze-aot-failure and analyze-gpu-*-failure steps look for specific network error patterns that indicate actual spot termination, not generic cancellation messages. fail-fast cancellation does not produce these patterns, so there is no risk of false-positive spot-termination reruns.
dierksen
left a comment
There was a problem hiding this comment.
LGTM.
I'd love to see about cleaning up some of the duplication at some point, as it made the review a little hard to read through with so much duplicated code moving to a new file (GitHub also doesn't handle this particularly well 😞 )
Thanks, Jonathan! I will find time to improve in follow-up pr (reuse code as much as possible) |
#2524) … (#2500)" This reverts commit d5eaa42. Revert PR #2500. The workflow_dispatch architecture requires actions:write on GITHUB_TOKEN, which is read-only for fork PRs. This breaks CI for all fork PR contributors even after run-ci approval. <!-- .github/pull_request_template.md --> ## 📌 Description <!-- What does this PR do? Briefly describe the changes and why they’re needed. --> ## 🔍 Related Issues <!-- Link any related issues here --> ## 🚀 Pull Request Checklist Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete. ### ✅ Pre-commit Checks - [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method). - [ ] I have installed the hooks with `pre-commit install`. - [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues. > If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/). ## 🧪 Tests - [ ] Tests have been added or updated as needed. - [ ] All tests are passing (`unittest`, etc.). ## Reviewer Notes <!-- Optional: anything you'd like reviewers to focus on, concerns, etc. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Restructured the PR validation pipeline with enhanced permission gating and staging. * Improved automated failure handling for infrastructure interruptions with dynamic rerun mechanisms. * Streamlined test result aggregation and reporting across multiple test environments. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
📌 Description
Refactor PR tests to hide failed spot job checks from the PR status panel. Instead of running test jobs inline (where cancelled/failed spot attempts are visible), this PR introduces a two-workflow architecture:
workflow_dispatchrequires the target workflow to exist on the default branch (GitHub docs). Sincepr-test-runner.ymlis new and not yet onmain, this PR usesrepository_dispatchto a temporary copy inci-infrafor testing. An auto-switch mechanism triesworkflow_dispatchfirst and falls back toci-infra:workflow_dispatchfails (file not on main) -> tests run via ci-infraworkflow_dispatchsucceeds -> tests run in flashinfer directly🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
unittest, etc.).Reviewer Notes
Summary by CodeRabbit