QVAC-18612 infra: gate every secret-bearing workflow with label-gate (re-land) #2023
Re-land of the label-gate fan-out after PR #1997 was reverted on 2026-05-13 (commit 919850c). Re-architected to fix the caller-cap permissions violation that broke 30+ on-pr-* workflows the moment a verified label was applied.

Architecture: caller-gates-callee

- Reusable workflows (workflow_call invokees) are NOT modified. PR #1997 embedded a label-gate job inside each reusable callee with `pull-requests: write`, which violates the caller-cap rule for any caller that scopes the call to `pull-requests: read|none`. GitHub enforces this at parse time; the affected workflow files won't even load.
- Callers get a label-gate job at the top of `jobs:` with `pull-requests: write` (which never crosses a caller-cap boundary). Each `uses:` invocation that targets a secret-bearing reusable, plus every standalone secret-bearing job in the same workflow, gains `needs: [..., label-gate]` and an `if:` prepended with `needs.label-gate.outputs.authorised == 'true'`.
- When the gate denies on a `uses:` job, the entire reusable invocation is skipped: the callee runner never starts, no secrets are exposed, and no caller-cap validation can fire because the workflow_call payload is never sent.

The label-gate action checks out from the default branch via sparse checkout, which is the same Tanstack-class supply-chain mitigation landed in the canary fix on PR #1971 / #1973.

Workflow-by-workflow stats:

- 59 caller workflows migrated (label-gate + needs/if updates)
- 56 reusable callees, exempt workflows, and no-secret workflows intentionally left UNCHANGED on disk
- Pre-existing `authorize-pr` peer jobs preserved (belt-and-suspenders; removal is a follow-up after a soak period)
- approval-worker.yml and approval-check-worker.yml exempt (gating them creates a deadlock; we explicitly do not touch them)

Pre-flight verification before push:

- `python3 .github/scripts/audit-workflow-permissions.py` -> 0 hard violations across 162 caller-callee edges (vs. 21 hard violations after the naive PR #1997-style migration; the audit was added in the previous commit precisely to catch this regression class)
- `actionlint .github/workflows/*.{yml,yaml}` reports identical issue counts before and after the migration: 1832 shellcheck (pre-existing), 9 expression (pre-existing), 5 action (down from 7 pre-existing)

End-to-end validated in the qvac-internal sandbox with real org teams:

- tetherto/qvac-internal#12 (caller-gates-callee + standalone gating against the actual qvac-internal-{dev,merge,release} teams)
- Olutest/qvac-tests (public mirror; same harness, single-user allowlist)
- Validation matrix: 9/9 scenarios pass, including the strip-on-non-trusted-apply case

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Proletter <40578159+Proletter@users.noreply.github.com>
Closes a release-env exposure surfaced when auditing #2023: public-delete-npm-versions.yml (environment: release, packages: write) is invoked by 12 on-pr-close-* workflows, but only embed-llamacpp had label-gate. The other 10 fire on `pull_request: types: [closed]` and reach the release env without authorisation.

This is currently held back only by the manual approval on the release environment. Once that approval is dropped (the goal of QVAC-18612), the label-gate becomes the sole control. This commit makes label-gate that control everywhere.

Pattern is identical to on-pr-close-embed-llamacpp.yml (already on this branch): inline label-gate job (caller side) + needs/if on the delete-npm-versions-trigger reusable call. Reusable callee (public-delete-npm-versions.yml) is unchanged.

on-pr-close-translation-nmtcpp.yml deliberately not modified - it has only workflow_dispatch (no pull_request trigger) and is intrinsically gated by repo-write access.

Co-authored-by: Cursor <cursoragent@cursor.com>
❌ E2E Mobile Test Results - iOS
Overall Status: FAILED
Test Summary
Links
Automated E2E mobile testing powered by AWS Device Farm |
❌ E2E Mobile Test Results - Android
Overall Status: FAILED
Test Summary
Links
Automated E2E mobile testing powered by AWS Device Farm |
What problem does this PR solve?
The QVAC monorepo has ~60 workflows that consume secrets from PR-triggered events. Without an explicit label gate, anyone able to open a PR can trigger them. This re-lands the `label-gate` fan-out from QVAC-18612 to require a `verified` label applied by a trusted team member before any secret-bearing job runs.

The first attempt (PR #1997, reverted in #2019 on 2026-05-13) embedded the `label-gate` job inside reusable callees with `pull-requests: write`. GitHub's caller-cap rule caps a reusable's permissions at what the caller granted; almost every `on-pr-*.yml` caller scopes the call to `pull-requests: read|none`, so the workflow files failed validation before any secrets could be touched.

This PR re-architects the gating so the same validation failure cannot happen again.
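To make the caller-cap rule concrete, here is a minimal Python sketch of the check it implies: a called reusable workflow may not request a higher permission level than the calling job grants. This is an illustration only and not the repository's `audit-workflow-permissions.py`; the function name `caller_cap_violations` and the dictionary shapes are hypothetical, and the sketch assumes the caller declares an explicit `permissions:` block.

```python
# Illustrative sketch of the caller-cap rule; NOT the repository's audit script.
# Access levels are ordered: none < read < write.
LEVELS = {"none": 0, "read": 1, "write": 2}


def caller_cap_violations(caller_grants, callee_requests):
    """Return the scopes where the callee asks for more than the caller grants.

    Both arguments map a permission scope (e.g. "pull-requests") to a level.
    Assumption: the caller has an explicit `permissions:` block, so any scope
    it omits is treated as "none".
    """
    violations = []
    for scope, requested in sorted(callee_requests.items()):
        granted = caller_grants.get(scope, "none")
        if LEVELS[requested] > LEVELS[granted]:
            violations.append((scope, granted, requested))
    return violations


# PR #1997 shape: the gate embedded in the callee requests write,
# while the caller scopes the call to read.
embedded = caller_cap_violations(
    {"contents": "read", "pull-requests": "read"},
    {"contents": "read", "pull-requests": "write"},
)
print(embedded)  # [('pull-requests', 'read', 'write')]

# Caller-gates-callee shape: the callee no longer requests write.
caller_side = caller_cap_violations(
    {"contents": "read", "pull-requests": "read"},
    {"contents": "read", "pull-requests": "read"},
)
print(caller_side)  # []
```

In the second call the gate's `pull-requests: write` lives in a caller-side job, so it never appears in the callee's request and there is nothing for the cap to reject.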
How does it solve it?
- Reusable workflows (`workflow_call` invokees: `integration-test-*.yml`, `cpp-lint.yaml`, `create-github-release-*.yml`, etc.) are NOT modified. The `label-gate` job lives in the caller, where it has the `pull-requests: write` scope it needs and where the caller-cap rule never applies (in-workflow jobs aren't caller-capped). The caller adds an `if:` to the `uses:` invocation of every reusable that consumes secrets: when the gate denies, the entire reusable call is skipped, the callee runner never starts, and no secrets are exposed.
- Standalone secret-bearing jobs get the same `needs: [..., label-gate]` + `if: needs.label-gate.outputs.authorised == 'true' && (existing condition)` treatment in the same workflow.
- The `label-gate` job pins `actions/checkout` to the default branch and sparse-checks-out only `.github/actions/label-gate/`, so a malicious PR cannot rewrite `gate.mjs` in its own branch to bypass the gate.
- `approval-worker.yml` and `approval-check-worker.yml` are unchanged: gating them would create a deadlock (they ARE the approval mechanism). Composite actions (`.github/actions/`) and `packages/` are also untouched. Pre-existing `authorize-pr` jobs are preserved alongside `label-gate` as belt-and-suspenders; their removal is a follow-up after a soak period.

Release-environment audit (covers full release-env surface)
77 workflows reference `environment: release`. Verified breakdown:

- `label-gate` job
- `workflow_call`-only callees, called by gated callers | `if:`
- `workflow_call` + `workflow_dispatch`/`push`/etc., no PR trigger) | `workflow_dispatch` requires repo-write; `workflow_call` gated by caller `label-gate`
- `on-pr-close-*` → `public-delete-npm-versions.yml` | `96031000` (this commit)

Commit `96031000` adds `label-gate` to the 10 remaining `on-pr-close-*.yml` workflows that fire on `pull_request: types: [closed]` and call `public-delete-npm-versions.yml`. (`on-pr-close-translation-nmtcpp.yml` is intentionally untouched: it has only `workflow_dispatch`, no `pull_request` trigger.) Without this gating, dropping the release-env approval would have left an unauthenticated path from any internal PR-close to npm version deletion in the release env.

How was it tested?
Local invariants verified post-rewrite:
End-to-end against real org teams in `tetherto/qvac-internal#12` (`qvac-internal-{dev,merge,release}` teams) and the public mirror at `Olutest/qvac-tests`. The validation harness mirrors every pattern in this PR, most importantly the row that broke #1997:

| Scenario | Result |
| --- | --- |
| `pull_request` opened, no `verified` label (pattern A, self-contained) | `authorised=false`; gated job skipped |
| `pull_request` opened, no `verified` label (pattern B, caller-gates-callee) | `authorised=false`, both caller AND callee skipped, no caller-cap error |
| `verified` added by trusted team member (pattern A) | `authorised=true (member of 'tetherto/qvac-internal-dev')`, marker emitted |
| `verified` added by trusted team member (pattern B, the #1997 failure mode) | `if:` ran with no `pull-requests: write`; NO caller-cap validation error; marker emitted from inside the reusable |
| `verified` added by untrusted user (strip behaviour) | `authorised=false (non-trusted 'Proletter' applied 'verified' — label stripped)`; PR labels = `[]` post-run |
| `workflow_dispatch` (trusted event), pattern A | `authorised=true (trusted event source)` |
| `workflow_dispatch`, pattern B | |
| `push` to branch | `authorised=true (push)` |

What's in the diff
- `ec975f81` QVAC-18612 infra: gate every secret-bearing workflow with label-gate (59 files)
- `96031000` QVAC-18612 infra: gate on-pr-close-* workflows with label-gate (10 files)

- `.github/workflows/`
- `.github/actions/`
- `packages/`
- `approval-worker.yml` + `approval-check-worker.yml` unchanged

Known CI red
`sanity-checks` (invoked by ONNX/whispercpp on-pr workflows) runs `yamlfmt v0.17.0` across the whole repo and fails the build if anything is dirty. `main` itself currently has yamlfmt drift across 21 composite-action files and 4 `packages/transcription-whispercpp/` config files (never tripped on main because `main`'s `sanity-checks` only runs when a PR touches an ONNX path). That drift is not introduced by this PR, but the check sees it on every PR that triggers ONNX/whispercpp `sanity-checks`. Cleaning it up requires either a separate "yamlfmt main" PR or a `.yamlfmt` config to scope the formatter; both are deliberately out of scope for this PR per review feedback.

Refs: QVAC-18612.