fix: integrate verdict_extract.py into verify-to-new-pr workflow by stranske · Pull Request #1434 · stranske/Workflows

stranske · 2026-02-10T07:11:54Z

Source: Issue #1429

Automated Status Summary

Scope

PR #1419 addressed issue #1416, but verification raised concerns (CONCERNS) due to non-deterministic verdict selection and inconsistent split-verdict handling across the workflow and follow-up generation. This follow-up closes the remaining gaps by routing all verdict extraction/classification through the shared deterministic policy in verdict_policy.py, adding a machine-readable needs_human gate for low-confidence split verdicts, and validating end-to-end behavior with integration tests.

Context for Agent

Related Issues/PRs

#1419
#1416

Acceptance criteria

.github/workflows/agents-verify-to-new-pr.yml invokes the deterministic verdict extraction API from verdict_policy.py (not a regex match()/first-row selection) and passes verdict and metadata (including needs_human) to downstream steps via step outputs or environment variables.
verdict_policy.py exposes a single importable function (or small API surface) that returns a deterministic verdict and metadata including boolean needs_human, where needs_human == true only for split verdicts (PASS+CONCERNS) with CONCERNS confidence strictly < 0.85, and false otherwise.
The verification-to-PR pipeline uses the exact same verdict_policy.py API for both (a) extracting verdict/metadata from provider output and (b) driving follow-up classification logic, with no secondary parsing/decision logic applied after policy evaluation.
followup_issue_generator.py imports and uses the shared verdict policy from verdict_policy.py and contains no duplicated/parallel logic for split-verdict resolution or the 0.85 CONCERNS confidence gating; follow-up output includes the policy-provided needs_human flag (or equivalent field) unchanged.
Integration tests simulate provider markdown/table inputs that yield split verdicts (PASS + CONCERNS) and verify deterministic verdict extraction is invariant to provider table row order (swapping row order produces identical verdict and needs_human).
Integration test case A (PASS+CONCERNS with CONCERNS confidence >= 0.85) asserts needs_human == false and asserts workflow-level extraction and follow-up generation produce identical deterministic verdict and metadata (including identical needs_human).
Integration test case B (PASS+CONCERNS with CONCERNS confidence < 0.85) asserts needs_human == true and asserts workflow-level extraction and follow-up generation produce identical deterministic verdict and metadata (including identical needs_human).
Integration tests enforce the strict boundary: exactly 0.85 CONCERNS confidence in a split verdict yields needs_human == false, and any value strictly below 0.85 yields needs_human == true.
No file outside verdict_policy.py contains hard-coded references to the split-verdict confidence threshold (0.85) for gating decisions (excluding tests/fixtures); the threshold is defined/owned by verdict_policy.py and consumed via its API.
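
The gating rule in the criteria above can be sketched as follows. This is a minimal illustration, not the contents of verdict_policy.py: the names `ProviderVerdict`, `PolicyResult`, `evaluate`, and the severity ordering are hypothetical, and using the highest CONCERNS confidence when several providers report CONCERNS is an assumption. Only the 0.85 threshold, the strict `<` boundary, and the row-order invariance come from the acceptance criteria.

```python
from dataclasses import dataclass

# Single owner of the threshold, as the acceptance criteria require.
CONCERNS_NEEDS_HUMAN_THRESHOLD = 0.85

# Hypothetical severity ordering used to select the "worst" verdict.
_SEVERITY = {"PASS": 0, "CONCERNS": 1, "FAIL": 2}


@dataclass(frozen=True)
class ProviderVerdict:
    provider: str
    verdict: str
    confidence: float


@dataclass(frozen=True)
class PolicyResult:
    verdict: str
    needs_human: bool
    split: bool


def evaluate(rows):
    """Return a deterministic verdict and needs_human flag for provider rows."""
    if not rows:
        raise ValueError("at least one provider verdict is required")
    # Sort on a total key so the order of the input rows cannot matter.
    ordered = sorted(
        rows, key=lambda r: (-_SEVERITY[r.verdict], -r.confidence, r.provider)
    )
    worst = ordered[0]
    split = {r.verdict for r in rows} == {"PASS", "CONCERNS"}
    needs_human = False
    if split:
        # Assumption: gate on the highest CONCERNS confidence reported.
        top_concerns = max(r.confidence for r in rows if r.verdict == "CONCERNS")
        # Strict boundary: exactly 0.85 does not require a human.
        needs_human = top_concerns < CONCERNS_NEEDS_HUMAN_THRESHOLD
    return PolicyResult(verdict=worst.verdict, needs_human=needs_human, split=split)


if __name__ == "__main__":
    rows = [
        ProviderVerdict("provider-a", "PASS", 0.92),
        ProviderVerdict("provider-b", "CONCERNS", 0.84),
    ]
    # Swapping the rows yields the same result.
    print(evaluate(rows) == evaluate(list(reversed(rows))))
    print(evaluate(rows))
```

Because both the workflow and the follow-up generator would call the same function, neither needs its own copy of the 0.85 constant.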

Head SHA: 7ab056d
Latest Runs: ✅ success — Gate
Required: gate: ✅ success

Workflow / Job	Result	Logs
Agents PR meta manager	❔ in progress	View run
CI Autofix Loop	✅ success	View run
Gate	✅ success	View run
Health 40 Sweep	✅ success	View run
Health 44 Gate Branch Protection	❔ in progress	View run
Health 45 Agents Guard	✅ success	View run
Health 50 Security Scan	✅ success	View run
Health 72 Template Sync	✅ success	View run
Health 73 Template Completeness	✅ success	View run
Health 74 Template Drift	✅ success	View run
Maint 52 Validate Workflows	✅ success	View run
PR 11 - Minimal invariant CI	✅ success	View run
Selftest CI	✅ success	View run
Validate Sync Manifest	✅ success	View run

…te.js

PR #1426 resolved custom glob removal in bot-comment-dismiss.js but introduced minimatch-lite.js (a copy of the old custom code) used by merge_manager.js and pr-context-graphql.js. This created inconsistency (one script using vendored minimatch, two using custom lite version) and a consumer repo breakage risk since minimatch-lite.js was not in the templates directory or sync manifest.

Changes:
- merge_manager.js: require('./minimatch-lite.js') → require('minimatch')
- pr-context-graphql.js: require('./minimatch-lite.js') → require('minimatch')
- Delete .github/scripts/minimatch-lite.js (176-line custom reimplementation)
- Update template copies to match

All 767 JS tests pass. Addresses verify:compare FAIL from issue #1425.

Wire the deterministic verdict policy into the agents-verify-to-new-pr workflow so the workflow and followup_issue_generator.py share the same verdict_policy.py API.

Changes:
- Add 'extract-verdict' step that runs verdict_extract.py with --policy worst --emit github to produce structured step outputs (verdict, needs_human, needs_human_reason, etc.)
- Add 'verdict-needs-human' step that labels the PR and linked issue with needs-human when the policy flags a split verdict with low confidence concerns
- Update fallback step to prefer EXTRACTED_VERDICT over regex parsing
- Gate follow-up issue generation on verdict-needs-human not having triggered (avoids auto-creating issues for PRs that need human review)
- Sync template copy

Closes #1427
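
For readers unfamiliar with how a script produces step outputs, the sketch below shows one way a Python entry point can write `name=value` pairs to the file named by the `GITHUB_OUTPUT` environment variable, which is the standard GitHub Actions mechanism. The helper names `format_github_outputs` and `emit_github_outputs` are hypothetical; the source does not show what verdict_extract.py does internally for `--emit github`.

```python
import os


def format_github_outputs(outputs):
    """Render a mapping as name=value lines for the GITHUB_OUTPUT file."""
    lines = []
    for name, value in outputs.items():
        # Workflow expressions compare strings, so booleans become 'true'/'false'.
        if isinstance(value, bool):
            value = "true" if value else "false"
        text = str(value)
        if "\n" in text:
            # Multiline values need the heredoc-style delimiter syntax instead.
            raise ValueError(f"multiline value for {name!r} is not supported here")
        lines.append(f"{name}={text}")
    return "\n".join(lines) + "\n"


def emit_github_outputs(outputs, path=None):
    """Append outputs to GITHUB_OUTPUT (or an explicit path) and return the text."""
    path = path or os.environ.get("GITHUB_OUTPUT")
    rendered = format_github_outputs(outputs)
    if path:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(rendered)
    return rendered


if __name__ == "__main__":
    # Outside of Actions there is no GITHUB_OUTPUT, so this only prints.
    print(emit_github_outputs({"verdict": "CONCERNS", "needs_human": True}), end="")
```

A later step could then read the value as `steps.extract-verdict.outputs.needs_human` and compare it against the string `'true'`.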

stranske-keepalive · 2026-02-10T07:14:51Z

Automated Status Summary

Head SHA: 8c3fc94
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / guard
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job	Result	Logs
(no jobs reported)	⏳ pending	—

Coverage Overview

Coverage history entries: 1

Coverage Trend

Metric	Value
Current	93.12%
Baseline	85.00%
Delta	+8.12%
Minimum	70.00%
Status	✅ Pass

Top Coverage Hotspots (lowest coverage)

File	Coverage	Missing
`src/cli_parser.py`	81.8%	4
`src/percentile_calculator.py`	95.0%	1
`src/aggregator.py`	95.0%	2
`src/__init__.py`	100.0%	0
`src/ndjson_parser.py`	100.0%	0

Updated automatically; will refresh on subsequent CI/Docker completions.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd372895a3

.github/workflows/agents-verify-to-new-pr.yml

.github/scripts/pr-context-graphql.js

stranske-keepalive · 2026-02-10T07:15:30Z

🤖 Keepalive Loop Status

PR #1434 | Agent: Codex | Iteration 0/5

Current State

Metric	Value
Iteration progress	[----------] 0/5
Action	wait (missing-agent-label)
Disposition	skipped (transient)
Gate	success
Tasks	14/57 complete
Timeout	45 min (default)
Timeout usage	3m elapsed (8%, 42m remaining)
Keepalive	❌ disabled
Autofix	❌ disabled

🔍 Failure Classification

Copilot

Pull request overview

Integrates the deterministic verdict extraction/policy into the verify-to-new-PR workflow so verdict selection and needs_human gating are produced by scripts.langchain.verdict_extract rather than ad-hoc regex parsing, and keeps the consumer template workflow in sync.

Changes:

Adds a workflow step to run python -m scripts.langchain.verdict_extract ... --emit github and consume its structured outputs (verdict + needs_human metadata).
Adds a needs-human labeling/commenting step and gates follow-up issue creation paths based on that signal.
Replaces the lightweight minimatch implementation with the vendored minimatch package in a couple of shared scripts and removes the old implementation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

File	Description
`.github/workflows/agents-verify-to-new-pr.yml`	Adds verdict extraction step + `needs_human` handling and updates fallback/gating logic.
`templates/consumer-repo/.github/workflows/agents-verify-to-new-pr.yml`	Mirrors the workflow changes for consumer repos.
`.github/scripts/pr-context-graphql.js`	Switches glob matching dependency to `minimatch`.
`templates/consumer-repo/.github/scripts/pr-context-graphql.js`	Same dependency switch in the consumer template copy.
`.github/scripts/merge_manager.js`	Switches glob matching dependency to `minimatch`.
`templates/consumer-repo/.github/scripts/merge_manager.js`	Same dependency switch in the consumer template copy.
`.github/scripts/minimatch-lite.js`	Removes the now-unused custom minimatch-lite implementation.

.github/workflows/agents-verify-to-new-pr.yml

templates/consumer-repo/.github/workflows/agents-verify-to-new-pr.yml

Copilot · 2026-02-10T07:21:38Z

templates/consumer-repo/.github/scripts/pr-context-graphql.js

+const { minimatch } = require('minimatch');



This change introduces a runtime dependency on the vendored minimatch package in consumer repos. However, the template’s vendored templates/consumer-repo/.github/scripts/node_modules/ currently contains minimatch/ but not its required transitive dependency brace-expansion/, so require('minimatch') will throw at runtime. Either vendor brace-expansion (and any other transitive deps) alongside minimatch in the template (matching .github/scripts/node_modules), or keep using the dependency-free implementation.

Suggested change

-const { minimatch } = require('minimatch');
+// Optional dependency on `minimatch`. In consumer templates the vendored
+// `minimatch` may be present without its transitive dependencies
+// (e.g. `brace-expansion`), so requiring it can throw at runtime. We try to
+// load it when available and fall back to a minimal, dependency-free matcher.
+let _minimatchImpl;
+try {
+  const maybeMinimatch = require('minimatch');
+  _minimatchImpl = maybeMinimatch.minimatch || maybeMinimatch;
+} catch {
+  _minimatchImpl = null;
+}
+function minimatch(path, pattern, options) {
+  if (_minimatchImpl) {
+    return _minimatchImpl(path, pattern, options);
+  }
+  // Very small glob implementation that supports `*` wildcards.
+  const escapedPattern = String(pattern)
+    .replace(/[.+^${}()|[\]\\]/g, '\\$&')
+    .replace(/\*/g, '.*');
+  const regex = new RegExp('^' + escapedPattern + '$');
+  return regex.test(path);
+}
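
The missing transitive dependency described in this comment is the kind of problem a small pre-merge check could catch. The sketch below is one possible approach, not something present in the PR: the function name `missing_vendored_deps` is hypothetical, and it assumes a flat vendored `node_modules` layout in which every package sits directly under the root with its own `package.json`.

```python
import json
from pathlib import Path


def missing_vendored_deps(node_modules, package):
    """List runtime dependencies of `package` (transitively) absent from node_modules."""
    root = Path(node_modules)
    missing, seen, queue = [], set(), [package]
    while queue:
        name = queue.pop()
        if name in seen:
            continue
        seen.add(name)
        manifest = root / name / "package.json"
        if not manifest.is_file():
            # The package directory (or its manifest) was not vendored.
            missing.append(name)
            continue
        data = json.loads(manifest.read_text(encoding="utf-8"))
        # Only runtime dependencies matter for require() at run time.
        queue.extend(data.get("dependencies", {}))
    return sorted(missing)


if __name__ == "__main__":
    # Path taken from the review comment above.
    root = "templates/consumer-repo/.github/scripts/node_modules"
    print(missing_vendored_deps(root, "minimatch"))
```

An empty list would mean every package reachable from `minimatch` is vendored; any name in the list is a package that `require` would fail to resolve.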

Copilot · 2026-02-10T07:21:38Z

.github/workflows/agents-verify-to-new-pr.yml

+                    (client || github).rest.issues.addLabels({
+                      owner: context.repo.owner,
+                      repo: context.repo.repo,
+                      issue_number: parseInt(linkedIssue),


parseInt(linkedIssue) should specify a radix (base 10) to avoid edge-case parsing differences and to match other parsing in this workflow (which uses parseInt(..., 10)).

Suggested change

-                      issue_number: parseInt(linkedIssue),
+                      issue_number: parseInt(linkedIssue, 10),

Copilot · 2026-02-10T07:21:39Z

templates/consumer-repo/.github/workflows/agents-verify-to-new-pr.yml

          steps.check-merged.outputs.merged == 'true' &&
-          steps.chain-check.outputs.exceeded != 'true'
+          steps.chain-check.outputs.exceeded != 'true' &&
+          steps.verdict-needs-human.outcome != 'success'


The follow-up generation is gated on steps.verdict-needs-human.outcome != 'success'. If the needs-human labeling/comment step runs (because needs_human == 'true') but fails, its outcome becomes failure, which will incorrectly allow follow-up issue generation to proceed. Gate on steps.extract-verdict.outputs.needs_human != 'true' (and/or set continue-on-error: true on the labeling step) so needs_human reliably suppresses automation even when the labeling step errors.

Suggested change

-          steps.verdict-needs-human.outcome != 'success'
+          steps.extract-verdict.outputs.needs_human != 'true'

templates/consumer-repo/.github/scripts/merge_manager.js

When verdict-needs-human triggers, the create-issue step is skipped but downstream steps (persist metadata, upload artifact, comment on PR) were still running with empty issue fields. This caused broken artifact JSON and blank PR comments, plus the bridge workflow would fail on missing issue_number.

Add steps.verdict-needs-human.outcome != 'success' condition to:
- Persist follow-up issue metadata
- Upload follow-up issue metadata
- Comment on original PR

Addresses review feedback on PR #1434.

…tcome

Using steps.verdict-needs-human.outcome != 'success' is fragile: if the needs-human labeling step runs but fails (API error, rate limit), outcome becomes 'failure' not 'success', so the gate passes and follow-up generation proceeds despite needs_human being true.

Switch all 6 gate conditions to steps.extract-verdict.outputs.needs_human != 'true' which is derived directly from the deterministic verdict policy and is unaffected by whether the labeling/commenting step succeeded.

Also add radix 10 to both parseInt(linkedIssue) calls for consistency with the rest of the workflow.

Addresses Copilot review feedback on PR #1434.
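
The difference between the two gate conditions can be checked with a small simulation. This is only a model of the workflow expressions written as Python predicates; the function names are hypothetical. The outcome strings are the values GitHub Actions assigns to `steps.<id>.outcome`.

```python
def gate_on_outcome(labeling_outcome):
    """Old gate: steps.verdict-needs-human.outcome != 'success'."""
    return labeling_outcome != "success"


def gate_on_output(needs_human_output):
    """New gate: steps.extract-verdict.outputs.needs_human != 'true'."""
    return needs_human_output != "true"


# Each scenario: (needs_human step output, outcome of the labeling step).
SCENARIOS = [
    ("true", "success"),   # human needed, labeling worked
    ("true", "failure"),   # human needed, labeling hit an API error
    ("false", "skipped"),  # no human needed, labeling step never ran
]

if __name__ == "__main__":
    for needs_human, outcome in SCENARIOS:
        old = gate_on_outcome(outcome)
        new = gate_on_output(needs_human)
        print(f"needs_human={needs_human} outcome={outcome} old_gate={old} new_gate={new}")
```

In the second scenario the old gate lets follow-up generation proceed even though the policy asked for a human, while the new gate blocks it. The two gates agree in the other scenarios.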

stranske added 2 commits February 10, 2026 06:53

Copilot AI review requested due to automatic review settings February 10, 2026 07:11

Copilot started reviewing on behalf of stranske February 10, 2026 07:12

stranske temporarily deployed to agent-high-privilege February 10, 2026 07:12 — with GitHub Actions Inactive

chatgpt-codex-connector bot reviewed Feb 10, 2026

.github/workflows/agents-verify-to-new-pr.yml (outdated)

.github/scripts/pr-context-graphql.js

Copilot AI reviewed Feb 10, 2026

stranske temporarily deployed to agent-high-privilege February 10, 2026 07:27 — with GitHub Actions Inactive

stranske temporarily deployed to agent-high-privilege February 10, 2026 07:33 — with GitHub Actions Inactive

stranske merged commit 3582350 into main Feb 10, 2026
137 checks passed

stranske deleted the fix/verdict-policy-workflow-integration branch February 10, 2026 07:37

stranske merged 4 commits into main from fix/verdict-policy-workflow-integration
