fix: integrate verdict_extract.py into verify-to-new-pr workflow#1434
fix: integrate verdict_extract.py into verify-to-new-pr workflow#1434
Conversation
…te.js PR #1426 resolved custom glob removal in bot-comment-dismiss.js but introduced minimatch-lite.js (a copy of the old custom code) used by merge_manager.js and pr-context-graphql.js. This created inconsistency (one script using vendored minimatch, two using custom lite version) and a consumer repo breakage risk since minimatch-lite.js was not in the templates directory or sync manifest. Changes: - merge_manager.js: require('./minimatch-lite.js') → require('minimatch') - pr-context-graphql.js: require('./minimatch-lite.js') → require('minimatch') - Delete .github/scripts/minimatch-lite.js (176-line custom reimplementation) - Update template copies to match All 767 JS tests pass. Addresses verify:compare FAIL from issue #1425.
Wire the deterministic verdict policy into the agents-verify-to-new-pr workflow so the workflow and followup_issue_generator.py share the same verdict_policy.py API. Changes: - Add 'extract-verdict' step that runs verdict_extract.py with --policy worst --emit github to produce structured step outputs (verdict, needs_human, needs_human_reason, etc.) - Add 'verdict-needs-human' step that labels the PR and linked issue with needs-human when the policy flags a split verdict with low confidence concerns - Update fallback step to prefer EXTRACTED_VERDICT over regex parsing - Gate follow-up issue generation on verdict-needs-human not having triggered (avoids auto-creating issues for PRs that need human review) - Sync template copy Closes #1427
Automated Status SummaryHead SHA: 8c3fc94
Coverage Overview
Coverage Trend
Top Coverage Hotspots (lowest coverage)
Updated automatically; will refresh on subsequent CI/Docker completions. Keepalive checklistScopePR #1419 addressed issue #1416, but verification raised concerns (CONCERNS) due to non-deterministic verdict selection and inconsistent split-verdict handling across the workflow and follow-up generation. This follow-up closes the remaining gaps by routing all verdict extraction/classification through the shared deterministic policy in Context for AgentRelated Issues/PRsContext for AgentRelated Issues/PRsTasks
Acceptance criteria
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cd372895a3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
🤖 Keepalive Loop StatusPR #1434 | Agent: Codex | Iteration 0/5 Current State
🔍 Failure Classification| Error type | infrastructure | |
There was a problem hiding this comment.
Pull request overview
Integrates the deterministic verdict extraction/policy into the verify-to-new-PR workflow so verdict selection and needs_human gating are produced by scripts.langchain.verdict_extract rather than ad-hoc regex parsing, and keeps the consumer template workflow in sync.
Changes:
- Adds a workflow step to run
python -m scripts.langchain.verdict_extract ... --emit githuband consume its structured outputs (verdict +needs_humanmetadata). - Adds a
needs-humanlabeling/commenting step and gates follow-up issue creation paths based on that signal. - Replaces the lightweight minimatch implementation with the vendored
minimatchpackage in a couple of shared scripts and removes the old implementation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/agents-verify-to-new-pr.yml |
Adds verdict extraction step + needs_human handling and updates fallback/gating logic. |
templates/consumer-repo/.github/workflows/agents-verify-to-new-pr.yml |
Mirrors the workflow changes for consumer repos. |
.github/scripts/pr-context-graphql.js |
Switches glob matching dependency to minimatch. |
templates/consumer-repo/.github/scripts/pr-context-graphql.js |
Same dependency switch in the consumer template copy. |
.github/scripts/merge_manager.js |
Switches glob matching dependency to minimatch. |
templates/consumer-repo/.github/scripts/merge_manager.js |
Same dependency switch in the consumer template copy. |
.github/scripts/minimatch-lite.js |
Removes the now-unused custom minimatch-lite implementation. |
templates/consumer-repo/.github/workflows/agents-verify-to-new-pr.yml
Outdated
Show resolved
Hide resolved
templates/consumer-repo/.github/workflows/agents-verify-to-new-pr.yml
Outdated
Show resolved
Hide resolved
templates/consumer-repo/.github/workflows/agents-verify-to-new-pr.yml
Outdated
Show resolved
Hide resolved
| const { minimatch } = require('minimatch'); | ||
|
|
There was a problem hiding this comment.
This change introduces a runtime dependency on the vendored minimatch package in consumer repos. However, the template’s vendored templates/consumer-repo/.github/scripts/node_modules/ currently contains minimatch/ but not its required transitive dependency brace-expansion/, so require('minimatch') will throw at runtime. Either vendor brace-expansion (and any other transitive deps) alongside minimatch in the template (matching .github/scripts/node_modules), or keep using the dependency-free implementation.
| const { minimatch } = require('minimatch'); | |
| // Optional dependency on `minimatch`. In consumer templates the vendored | |
| // `minimatch` may be present without its transitive dependencies | |
| // (e.g. `brace-expansion`), so requiring it can throw at runtime. We try to | |
| // load it when available and fall back to a minimal, dependency-free matcher. | |
| let _minimatchImpl; | |
| try { | |
| const maybeMinimatch = require('minimatch'); | |
| _minimatchImpl = maybeMinimatch.minimatch || maybeMinimatch; | |
| } catch { | |
| _minimatchImpl = null; | |
| } | |
| function minimatch(path, pattern, options) { | |
| if (_minimatchImpl) { | |
| return _minimatchImpl(path, pattern, options); | |
| } | |
| // Very small glob implementation that supports `*` wildcards. | |
| const escapedPattern = String(pattern) | |
| .replace(/[.+^${}()|[\]\\]/g, '\\$&') | |
| .replace(/\*/g, '.*'); | |
| const regex = new RegExp('^' + escapedPattern + '$'); | |
| return regex.test(path); | |
| } |
| (client || github).rest.issues.addLabels({ | ||
| owner: context.repo.owner, | ||
| repo: context.repo.repo, | ||
| issue_number: parseInt(linkedIssue), |
There was a problem hiding this comment.
parseInt(linkedIssue) should specify a radix (base 10) to avoid edge-case parsing differences and to match other parsing in this workflow (which uses parseInt(..., 10)).
| issue_number: parseInt(linkedIssue), | |
| issue_number: parseInt(linkedIssue, 10), |
| steps.check-merged.outputs.merged == 'true' && | ||
| steps.chain-check.outputs.exceeded != 'true' | ||
| steps.chain-check.outputs.exceeded != 'true' && | ||
| steps.verdict-needs-human.outcome != 'success' |
There was a problem hiding this comment.
The follow-up generation is gated on steps.verdict-needs-human.outcome != 'success'. If the needs-human labeling/comment step runs (because needs_human == 'true') but fails, its outcome becomes failure, which will incorrectly allow follow-up issue generation to proceed. Gate on steps.extract-verdict.outputs.needs_human != 'true' (and/or set continue-on-error: true on the labeling step) so needs_human reliably suppresses automation even when the labeling step errors.
| steps.verdict-needs-human.outcome != 'success' | |
| steps.extract-verdict.outputs.needs_human != 'true' |
When verdict-needs-human triggers, the create-issue step is skipped but downstream steps (persist metadata, upload artifact, comment on PR) were still running with empty issue fields. This caused broken artifact JSON and blank PR comments, plus the bridge workflow would fail on missing issue_number. Add steps.verdict-needs-human.outcome != 'success' condition to: - Persist follow-up issue metadata - Upload follow-up issue metadata - Comment on original PR Addresses review feedback on PR #1434.
…tcome Using steps.verdict-needs-human.outcome != 'success' is fragile: if the needs-human labeling step runs but fails (API error, rate limit), outcome becomes 'failure' not 'success', so the gate passes and follow-up generation proceeds despite needs_human being true. Switch all 6 gate conditions to steps.extract-verdict.outputs.needs_human != 'true' which is derived directly from the deterministic verdict policy and is unaffected by whether the labeling/commenting step succeeded. Also add radix 10 to both parseInt(linkedIssue) calls for consistency with the rest of the workflow. Addresses Copilot review feedback on PR #1434.
Automated Status Summary
Scope
PR #1419 addressed issue #1416, but verification raised concerns (CONCERNS) due to non-deterministic verdict selection and inconsistent split-verdict handling across the workflow and follow-up generation. This follow-up closes the remaining gaps by routing all verdict extraction/classification through the shared deterministic policy in
verdict_policy.py, adding a machine-readableneeds_humangate for low-confidence split verdicts, and validating end-to-end behavior with integration tests.Context for Agent
Related Issues/PRs
Context for Agent
Related Issues/PRs
Tasks
.github/workflows/agents-verify-to-new-pr.ymlto run a Python entrypoint/module that importsverdict_policy.pyand emits structured step outputs (at minimum:verdict,needs_human, and any required metadata) instead of using first-table-row regexmatch()parsing.verdict_policy.pythat deterministically selects the verdict and returns machine-readable metadata includingneeds_human, enforcing: for split verdicts (PASS+CONCERNS), setneeds_human = Trueonly when CONCERNS confidence is strictly< 0.85andFalseotherwise.verdict_policy.pyAPI output (verdict + metadata) for both (a) verdict extraction in the workflow step and (b) follow-up classification, removing any secondary parsing/decision logic after policy evaluation.followup_issue_generator.pyto remove duplicated/parallel verdict policy logic (including split-verdict resolution and confidence gating) and instead import and use the sharedverdict_policy.pyAPI, propagatingneeds_humanunchanged into follow-up output.verdictandneeds_humanfor: (1) CONCERNS confidence>= 0.85, (2) CONCERNS confidence< 0.85, and (3) swapped provider table row order.Acceptance criteria
.github/workflows/agents-verify-to-new-pr.ymlinvokes the deterministic verdict extraction API fromverdict_policy.py(not a regexmatch()/first-row selection) and passesverdictand metadata (includingneeds_human) to downstream steps via step outputs or environment variables.verdict_policy.pyexposes a single importable function (or small API surface) that returns a deterministic verdict and metadata including booleanneeds_human, whereneeds_human == trueonly for split verdicts (PASS+CONCERNS) with CONCERNS confidence strictly< 0.85, andfalseotherwise.verdict_policy.pyAPI for both (a) extracting verdict/metadata from provider output and (b) driving follow-up classification logic, with no secondary parsing/decision logic applied after policy evaluation.followup_issue_generator.pyimports and uses the shared verdict policy fromverdict_policy.pyand contains no duplicated/parallel logic for split-verdict resolution or the 0.85 CONCERNS confidence gating; follow-up output includes the policy-providedneeds_humanflag (or equivalent field) unchanged.verdictandneeds_human).>= 0.85) assertsneeds_human == falseand asserts workflow-level extraction and follow-up generation produce identical deterministicverdictand metadata (including identicalneeds_human).< 0.85) assertsneeds_human == trueand asserts workflow-level extraction and follow-up generation produce identical deterministicverdictand metadata (including identicalneeds_human).0.85CONCERNS confidence in a split verdict yieldsneeds_human == false, and any value strictly below0.85yieldsneeds_human == true.verdict_policy.pycontains hard-coded references to the split-verdict confidence threshold (0.85) for gating decisions (excluding tests/fixtures); the threshold is defined/owned byverdict_policy.pyand consumed via its API.Head SHA: 7ab056d
Latest Runs: ✅ success — Gate
Required: gate: ✅ success