
feat: Enable comparing two different models in verify:compare mode#629

Merged
stranske merged 3 commits into main from feat/compare-two-models on Jan 7, 2026
Conversation

stranske (Owner) commented on Jan 7, 2026

Source: Issue #123

Automated Status Summary

Scope

After merging PR #103 (multi-agent routing infrastructure), we need to:

  1. Validate the CLI agent pipeline works end-to-end with the new task-focused prompts
  2. Add GITHUB_STEP_SUMMARY output so iteration results are visible in the Actions UI
  3. Streamline the Automated Status Summary to reduce clutter when using CLI agents
  4. Clean up comment patterns to avoid a mix of old UI-agent and new CLI-agent comments

Context for Agent

Design Decisions & Constraints

    1. Clean up comment patterns to avoid a mix of old UI-agent and new CLI-agent comments
  • The keepalive loop currently produces several comment patterns:

    | Marker | Author | Purpose | Disposition |
    | --- | --- | --- | --- |
    | <!-- keepalive-loop-summary --> | github-actions[bot] | NEW: CLI agent iteration tracking | ✅ Keep for CLI agents |
    | <!-- keepalive-state:v1 --> | agents-workflows-bot[bot] | State tracking | ⚠️ Multiple copies accumulate |
    | <!-- keepalive-round: N --> | stranske | OLD: Instruction comment | ❌ CLI agents don't need this |

  • The goal: for CLI agents (agent:* label), there should be exactly one updating comment (<!-- keepalive-loop-summary -->) instead of 10+ comments accumulating per PR.
  • Requires PR #103 to be merged first.
  • This round you MUST: review the Scope/Tasks/Acceptance below, identify the next incomplete task that requires code, implement it, then post a reply comment listing the completed items using their exact original text.

Related Issues/PRs

References

Blockers & Dependencies

  • Requires PR #103 (multi-agent routing infrastructure) to be merged first.
  • Mark a task checkbox complete ONLY after verifying the implementation works.

Tasks

Pipeline Validation

  • After PR #103 (chore(codex): bootstrap PR for issue #101) merges, create a test PR with the agent:codex label
  • Verify task appendix appears in Codex prompt (check workflow logs)
  • Verify Codex works on actual tasks (not random infrastructure work)
  • Verify keepalive comment updates with iteration progress

GITHUB_STEP_SUMMARY

  • Add step summary output to agents-keepalive-loop.yml after agent run
  • Include: iteration number, tasks completed, files changed, outcome
  • Ensure summary is visible in workflow run UI
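The step-summary task above amounts to appending Markdown to the file named by the GITHUB_STEP_SUMMARY environment variable, which the Actions runner renders in the run UI. A minimal sketch; the helper name and field set are illustrative, not the workflow's actual code:

```python
import os

def write_step_summary(iteration, tasks_done, tasks_total, files_changed, outcome):
    """Append a Markdown iteration report to the GitHub Actions step summary.

    GITHUB_STEP_SUMMARY points at a file the runner renders in the
    workflow run UI; appending Markdown table rows is sufficient.
    """
    lines = [
        "## Keepalive iteration report",
        "",
        "| Metric | Value |",
        "| --- | --- |",
        f"| Iteration | {iteration} |",
        f"| Tasks completed | {tasks_done}/{tasks_total} |",
        f"| Files changed | {files_changed} |",
        f"| Outcome | {outcome} |",
    ]
    report = "\n".join(lines) + "\n"
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:  # only set inside an Actions runner
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(report)
    return report
```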

Conditional Status Summary

  • Modify buildStatusBlock() in agents_pr_meta_update_body.js to accept agentType parameter
  • When agentType is set (CLI agent): hide workflow table, hide head SHA/required checks
  • Keep Scope/Tasks/Acceptance checkboxes for all cases
  • Pass agent type from workflow to the update_body job

Comment Pattern Cleanup

  • For CLI agents (agent:* label):
  • Suppress <!-- gate-summary: --> comment posting (use step summary instead)
  • Suppress <!-- keepalive-round: N --> instruction comments (task appendix replaces this)
  • Update <!-- keepalive-loop-summary --> to be the single source of truth
  • Ensure state marker is embedded in the summary comment (not separate)
  • For UI Codex (no agent:* label):
  • Keep existing comment patterns (instruction comments, connector bot reports)
  • Keep <!-- gate-summary: --> comment
  • Add agent_type output to detect job so downstream workflows know the mode
  • Update agents-pr-meta.yml to conditionally skip gate summary for CLI agent PRs
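The single-source-of-truth upsert described above can be sketched as pure logic over the comment list. The marker strings follow the PR; `upsert_summary` and the comment-dict shape are hypothetical stand-ins for the real GitHub API calls:

```python
SUMMARY_MARKER = "<!-- keepalive-loop-summary -->"
STATE_MARKER = "<!-- keepalive-state:v1"  # state embedded in the same comment

def upsert_summary(comments, body, state):
    """Decide whether to update the existing summary comment or create one.

    Returns (action, new_body). The state marker is embedded in the
    summary body rather than posted separately, so CLI-agent PRs keep
    exactly one bot-maintained comment.
    """
    payload = f"{SUMMARY_MARKER}\n{body}\n{STATE_MARKER} {state} -->"
    for comment in comments:
        if SUMMARY_MARKER in comment.get("body", ""):
            return ("update", payload)
    return ("create", payload)
```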

Acceptance criteria

  • CLI agent receives explicit tasks in prompt and works on them
  • Iteration results visible in Actions workflow run summary
  • PR body shows checkboxes but not workflow clutter when using CLI agents
  • UI Codex path (no agent label) continues to show full status summary
  • CLI agent PRs have ≤3 bot comments total (summary, one per iteration update) instead of 10+
  • State tracking is consolidated in the summary comment, not scattered

Dependencies

Head SHA: 6588db1
Latest Runs: ❔ in progress — Gate
Required: gate: ❔ in progress

| Workflow / Job | Result |
| --- | --- |
| Agents PR meta manager | ❔ in progress |
| CI Autofix Loop | ✅ success |
| Gate | ❔ in progress |
| Health 40 Sweep | ✅ success |
| Health 44 Gate Branch Protection | ✅ success |
| Health 45 Agents Guard | ✅ success |
| Health 50 Security Scan | ✅ success |
| Maint 52 Validate Workflows | ✅ success |
| PR 11 - Minimal invariant CI | ✅ success |
| Selftest CI | ✅ success |
| Validate Sync Manifest | ✅ success |

- Add model2 input to consumer workflow template and reusable workflow
- Update pr_verifier.py to accept --model2 CLI argument
- Modify _get_llm_clients() to use model1 for GitHub Models, model2 for OpenAI
- Update ComparisonRunner and evaluate_pr_multiple() to pass models through
- Enable side-by-side comparison of different models (e.g., gpt-5 vs gpt-4o)

This enhancement allows users to compare evaluations from two different
models instead of just comparing the same model across providers.
Copilot AI review requested due to automatic review settings January 7, 2026 03:44
@stranske stranske temporarily deployed to agent-high-privilege January 7, 2026 03:44 — with GitHub Actions Inactive
@github-actions github-actions bot added the autofix Opt-in automated formatting & lint remediation label Jan 7, 2026

github-actions bot commented Jan 7, 2026

| Field | Value |
| --- | --- |
| Status | ✅ no new diagnostics |
| History points | 1 |
| Timestamp | 2026-01-07 04:09:13 UTC |
| Report artifact | autofix-report-pr-629 |
| Remaining | 0 |
| New | 0 |

No additional artifacts.


github-actions bot commented Jan 7, 2026

Automated Status Summary

Head SHA: c646f61
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job Result Logs
(no jobs reported) ⏳ pending

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
| --- | --- |
| Current | 92.21% |
| Baseline | 85.00% |
| Delta | +7.21% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
| --- | --- | --- |
| scripts/workflow_health_check.py | 62.6% | 28 |
| scripts/classify_test_failures.py | 62.9% | 37 |
| scripts/ledger_validate.py | 65.3% | 63 |
| scripts/mypy_return_autofix.py | 82.6% | 11 |
| scripts/ledger_migrate_base.py | 85.5% | 13 |
| scripts/fix_cosmetic_aggregate.py | 92.3% | 1 |
| scripts/coverage_history_append.py | 92.8% | 2 |
| scripts/workflow_validator.py | 93.3% | 4 |
| scripts/update_autofix_expectations.py | 93.9% | 1 |
| scripts/pr_metrics_tracker.py | 95.7% | 3 |
| scripts/generate_residual_trend.py | 96.6% | 1 |
| scripts/build_autofix_pr_comment.py | 97.0% | 2 |
| scripts/aggregate_agent_metrics.py | 97.2% | 0 |
| scripts/fix_numpy_asserts.py | 98.1% | 0 |
| scripts/sync_test_dependencies.py | 98.3% | 1 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

(Scope, Tasks, and Acceptance criteria as listed in the PR body above.)


github-actions bot commented Jan 7, 2026

🤖 Keepalive Loop Status

PR #629 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
| --- | --- |
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/28 complete |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Field | Value |
| --- | --- |
| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

Copilot AI left a comment

Pull request overview

This PR enhances the verify:compare mode to support comparing two different LLM models side-by-side instead of only comparing the same model across different providers.

Key Changes:

  • Added model2 input parameter throughout the workflow chain (consumer template → reusable workflow → Python script)
  • Modified _get_llm_clients() to accept model1/model2 parameters and assign them to GitHub Models and OpenAI providers respectively
  • Updated comparison logic to pass model parameters through the entire evaluation pipeline

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| templates/consumer-repo/.github/workflows/agents-verifier.yml | Added model2 workflow input, output declaration, and parameter passing through the workflow chain |
| .github/workflows/reusable-agents-verifier.yml | Added model2 input parameter and command-line argument passing to pr_verifier.py |
| scripts/langchain/pr_verifier.py | Added --model2 CLI argument and updated evaluation functions to accept and use both model parameters in comparison mode |
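The --model2 flag added to pr_verifier.py might look roughly like this; only --model2 and its defaulting behavior are confirmed by the PR, while the other flags and defaults are illustrative:

```python
import argparse

def build_parser():
    """Sketch of the pr_verifier.py CLI with the new --model2 argument."""
    parser = argparse.ArgumentParser(prog="pr_verifier.py")
    parser.add_argument("--mode", choices=["checkbox", "evaluate", "compare"],
                        default="evaluate")
    parser.add_argument("--model", default="gpt-4o-mini",
                        help="Primary model (GitHub Models provider in compare mode)")
    parser.add_argument("--model2", default=None,
                        help="Second model for compare mode (OpenAI provider); "
                             "defaults to --model when omitted")
    return parser
```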


Comment on lines 209 to 217:

```diff
 clients: list[tuple[object, str]] = []
 if github_token:
     clients.append(
         (
             ChatOpenAI(
-                model=DEFAULT_MODEL,
+                model=first_model,
                 base_url=GITHUB_MODELS_BASE_URL,
                 api_key=github_token,
                 temperature=0.1,
```

Copilot AI commented on Jan 7, 2026

The provider label should include the model name for clarity in comparison reports. Currently it only shows "github-models" but should be formatted as "github-models/{first_model}" to match the pattern used in _get_llm_client() (lines 149, 162, 174, 186). This will help users understand which specific models are being compared in the comparison report.

Comment on lines 223 to 229:

```diff
 clients.append(
     (
         ChatOpenAI(
-            model=DEFAULT_MODEL,
+            model=second_model,
             api_key=openai_token,
             temperature=0.1,
         ),
```

Copilot AI commented on Jan 7, 2026

The provider label should include the model name for clarity in comparison reports. Currently it only shows "openai" but should be formatted as "openai/{second_model}" to match the pattern used in _get_llm_client() (lines 149, 162, 174, 186). This will help users understand which specific models are being compared in the comparison report.

Comment on the model2 input:

```yaml
  type: string
  default: 'gpt-4o-mini'
model2:
  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
```

Copilot AI commented on Jan 7, 2026

The example model "gpt-4.1" in the description is not a valid OpenAI model name. Consider using a valid example like "gpt-4-turbo" or "gpt-4o" instead.

Suggested change:

```diff
-  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
+  description: 'Second model for compare mode (e.g., gpt-5, gpt-4o)'
```

Comment on the model2 input:

```yaml
  type: string
  default: 'gpt-4o-mini'
model2:
  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
```

Copilot AI commented on Jan 7, 2026

The description should clarify that in compare mode, the first model uses GitHub Models provider and the second model uses OpenAI provider. This is important information for users to understand how the comparison will work. Consider updating the description to something like: "Second model for compare mode (uses OpenAI provider). Leave empty to compare same model across providers."

Suggested change:

```diff
-  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
+  description: 'Second model for compare mode (uses OpenAI provider). Leave empty to compare same model across providers.'
```

Comment on the model2 input:

```yaml
  type: string
  default: 'github-models'
model2:
  description: 'Second model for compare mode (defaults to model if not specified)'
```

Copilot AI commented on Jan 7, 2026

The description should clarify that in compare mode, the first model uses GitHub Models provider and the second model uses OpenAI provider. Consider updating to: "Second model for compare mode (uses OpenAI provider). Defaults to model if not specified for same-model cross-provider comparison."

Suggested change:

```diff
-  description: 'Second model for compare mode (defaults to model if not specified)'
+  description: 'Second model for compare mode (uses OpenAI provider). Defaults to model if not specified for same-model cross-provider comparison.'
```

Comment on the detect job outputs:

```yaml
should_run: ${{ steps.check.outputs.should_run }}
mode: ${{ steps.check.outputs.mode }}
model: ${{ steps.check.outputs.model }}
model2: ${{ steps.check.outputs.model2 }}
```

Copilot AI commented on Jan 7, 2026

The model2 output is being declared here, but it's not being set in the label trigger path (lines 146-151 of the check step script). When the workflow is triggered via label (verify:checkbox, verify:evaluate, or verify:compare), the model2 output will be undefined. You need to add "core.setOutput('model2', '');" in the label trigger path alongside the other setOutput calls for model and provider.

The from_environment classmethod now accepts model1 and model2 parameters,
so the test mocks need to accept them too (with defaults).
@stranske stranske temporarily deployed to agent-high-privilege January 7, 2026 03:52 — with GitHub Actions Inactive
@stranske stranske enabled auto-merge (squash) January 7, 2026 03:54
@stranske stranske temporarily deployed to agent-high-privilege January 7, 2026 04:08 — with GitHub Actions Inactive
@stranske stranske merged commit 6b1ffc0 into main Jan 7, 2026
93 checks passed
@stranske stranske deleted the feat/compare-two-models branch January 7, 2026 04:09

Labels

autofix Opt-in automated formatting & lint remediation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants