
feat: Enable comparing two different models in verify:compare mode#629

Merged
stranske merged 3 commits into main from feat/compare-two-models on Jan 7, 2026
Conversation

stranske (Owner) commented on Jan 7, 2026

Source: Issue #123

Automated Status Summary

Scope

After merging PR #103 (multi-agent routing infrastructure), we need to:

  1. Validate the CLI agent pipeline works end-to-end with the new task-focused prompts
  2. Add GITHUB_STEP_SUMMARY output so iteration results are visible in the Actions UI
  3. Streamline the Automated Status Summary to reduce clutter when using CLI agents
  4. Clean up comment patterns to avoid a mix of old UI-agent and new CLI-agent comments

Context for Agent

Design Decisions & Constraints

    1. Clean up comment patterns to avoid a mix of old UI-agent and new CLI-agent comments
  • The keepalive loop currently produces several comment patterns:

    | Marker | Author | Purpose | Disposition |
    | --- | --- | --- | --- |
    | <!-- keepalive-loop-summary --> | github-actions[bot] | NEW: CLI agent iteration tracking | ✅ Keep for CLI agents |
    | <!-- keepalive-state:v1 --> | agents-workflows-bot[bot] | State tracking | ⚠️ Multiple copies accumulate |
    | <!-- keepalive-round: N --> | stranske | OLD: Instruction comment | ❌ CLI agents don't need this |

  • The goal: for CLI agents (agent:* label), there should be exactly one updating comment (<!-- keepalive-loop-summary -->) instead of 10+ comments accumulating per PR.
  • Requires PR #103 to be merged first.
  • This round you MUST: review the Scope/Tasks/Acceptance below, identify the next incomplete task that requires code, implement it, then post a reply comment listing the completed items using their exact original text.

Related Issues/PRs

References

Blockers & Dependencies

  • Requires PR #103 (multi-agent routing infrastructure) to be merged first.
  • Mark a task checkbox complete ONLY after verifying the implementation works.

Tasks

Pipeline Validation

  • After PR #103 (chore(codex): bootstrap PR for issue #101) merges, create a test PR with the agent:codex label
  • Verify task appendix appears in Codex prompt (check workflow logs)
  • Verify Codex works on actual tasks (not random infrastructure work)
  • Verify keepalive comment updates with iteration progress

GITHUB_STEP_SUMMARY

  • Add step summary output to agents-keepalive-loop.yml after agent run
  • Include: iteration number, tasks completed, files changed, outcome
  • Ensure summary is visible in workflow run UI
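The step-summary task above amounts to appending Markdown to the file named by the GITHUB_STEP_SUMMARY environment variable, which the Actions runner renders in the run UI. A minimal sketch; the helper name and field set are illustrative, not the workflow's actual code:

```python
import os

def write_step_summary(iteration, tasks_done, tasks_total, files_changed, outcome):
    """Append a Markdown iteration report to the GitHub Actions step summary.

    GITHUB_STEP_SUMMARY points at a file the runner renders in the
    workflow run UI; appending Markdown table rows is sufficient.
    """
    lines = [
        "## Keepalive iteration report",
        "",
        "| Metric | Value |",
        "| --- | --- |",
        f"| Iteration | {iteration} |",
        f"| Tasks completed | {tasks_done}/{tasks_total} |",
        f"| Files changed | {files_changed} |",
        f"| Outcome | {outcome} |",
    ]
    report = "\n".join(lines) + "\n"
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:  # only set inside an Actions runner
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(report)
    return report
```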

Conditional Status Summary

  • Modify buildStatusBlock() in agents_pr_meta_update_body.js to accept agentType parameter
  • When agentType is set (CLI agent): hide workflow table, hide head SHA/required checks
  • Keep Scope/Tasks/Acceptance checkboxes for all cases
  • Pass agent type from workflow to the update_body job

Comment Pattern Cleanup

  • For CLI agents (agent:* label):
  • Suppress <!-- gate-summary: --> comment posting (use step summary instead)
  • Suppress <!-- keepalive-round: N --> instruction comments (task appendix replaces this)
  • Update <!-- keepalive-loop-summary --> to be the single source of truth
  • Ensure state marker is embedded in the summary comment (not separate)
  • For UI Codex (no agent:* label):
  • Keep existing comment patterns (instruction comments, connector bot reports)
  • Keep <!-- gate-summary: --> comment
  • Add agent_type output to detect job so downstream workflows know the mode
  • Update agents-pr-meta.yml to conditionally skip gate summary for CLI agent PRs
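The single-source-of-truth upsert described above can be sketched as pure logic over the comment list. The marker strings follow the PR; `upsert_summary` and the comment-dict shape are hypothetical stand-ins for the real GitHub API calls:

```python
SUMMARY_MARKER = "<!-- keepalive-loop-summary -->"
STATE_MARKER = "<!-- keepalive-state:v1"  # state embedded in the same comment

def upsert_summary(comments, body, state):
    """Decide whether to update the existing summary comment or create one.

    Returns (action, new_body). The state marker is embedded in the
    summary body rather than posted separately, so CLI-agent PRs keep
    exactly one bot-maintained comment.
    """
    payload = f"{SUMMARY_MARKER}\n{body}\n{STATE_MARKER} {state} -->"
    for comment in comments:
        if SUMMARY_MARKER in comment.get("body", ""):
            return ("update", payload)
    return ("create", payload)
```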

Acceptance criteria

  • CLI agent receives explicit tasks in prompt and works on them
  • Iteration results visible in Actions workflow run summary
  • PR body shows checkboxes but not workflow clutter when using CLI agents
  • UI Codex path (no agent label) continues to show full status summary
  • CLI agent PRs have ≤3 bot comments total (summary, one per iteration update) instead of 10+
  • State tracking is consolidated in the summary comment, not scattered

Dependencies

Head SHA: 6588db1
Latest Runs: ❔ in progress — Gate
Required: gate: ❔ in progress

| Workflow / Job | Result |
| --- | --- |
| Agents PR meta manager | ❔ in progress |
| CI Autofix Loop | ✅ success |
| Gate | ❔ in progress |
| Health 40 Sweep | ✅ success |
| Health 44 Gate Branch Protection | ✅ success |
| Health 45 Agents Guard | ✅ success |
| Health 50 Security Scan | ✅ success |
| Maint 52 Validate Workflows | ✅ success |
| PR 11 - Minimal invariant CI | ✅ success |
| Selftest CI | ✅ success |
| Validate Sync Manifest | ✅ success |

- Add model2 input to consumer workflow template and reusable workflow
- Update pr_verifier.py to accept --model2 CLI argument
- Modify _get_llm_clients() to use model1 for GitHub Models, model2 for OpenAI
- Update ComparisonRunner and evaluate_pr_multiple() to pass models through
- Enable side-by-side comparison of different models (e.g., gpt-5 vs gpt-4o)

This enhancement allows users to compare evaluations from two different
models instead of just comparing the same model across providers.
Copilot AI review requested due to automatic review settings January 7, 2026 03:44
@stranske stranske temporarily deployed to agent-high-privilege January 7, 2026 03:44 — with GitHub Actions Inactive
@github-actions github-actions bot added the autofix Opt-in automated formatting & lint remediation label Jan 7, 2026

github-actions bot commented Jan 7, 2026

| Field | Value |
| --- | --- |
| Status | ✅ no new diagnostics |
| History points | 1 |
| Timestamp | 2026-01-07 04:09:13 UTC |
| Report artifact | autofix-report-pr-629 |
| Remaining | 0 |
| New | 0 |

No additional artifacts.


github-actions bot commented Jan 7, 2026

Automated Status Summary

Head SHA: c646f61
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job Result Logs
(no jobs reported) ⏳ pending

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
| --- | --- |
| Current | 92.21% |
| Baseline | 85.00% |
| Delta | +7.21% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
| --- | --- | --- |
| scripts/workflow_health_check.py | 62.6% | 28 |
| scripts/classify_test_failures.py | 62.9% | 37 |
| scripts/ledger_validate.py | 65.3% | 63 |
| scripts/mypy_return_autofix.py | 82.6% | 11 |
| scripts/ledger_migrate_base.py | 85.5% | 13 |
| scripts/fix_cosmetic_aggregate.py | 92.3% | 1 |
| scripts/coverage_history_append.py | 92.8% | 2 |
| scripts/workflow_validator.py | 93.3% | 4 |
| scripts/update_autofix_expectations.py | 93.9% | 1 |
| scripts/pr_metrics_tracker.py | 95.7% | 3 |
| scripts/generate_residual_trend.py | 96.6% | 1 |
| scripts/build_autofix_pr_comment.py | 97.0% | 2 |
| scripts/aggregate_agent_metrics.py | 97.2% | 0 |
| scripts/fix_numpy_asserts.py | 98.1% | 0 |
| scripts/sync_test_dependencies.py | 98.3% | 1 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

(Scope, Tasks, and Acceptance criteria as listed in the PR body above.)


github-actions bot commented Jan 7, 2026

🤖 Keepalive Loop Status

PR #629 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
| --- | --- |
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/28 complete |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Field | Value |
| --- | --- |
| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

Copilot AI left a comment

Pull request overview

This PR enhances the verify:compare mode to support comparing two different LLM models side-by-side instead of only comparing the same model across different providers.

Key Changes:

  • Added model2 input parameter throughout the workflow chain (consumer template → reusable workflow → Python script)
  • Modified _get_llm_clients() to accept model1/model2 parameters and assign them to GitHub Models and OpenAI providers respectively
  • Updated comparison logic to pass model parameters through the entire evaluation pipeline

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| templates/consumer-repo/.github/workflows/agents-verifier.yml | Added model2 workflow input, output declaration, and parameter passing through the workflow chain |
| .github/workflows/reusable-agents-verifier.yml | Added model2 input parameter and command-line argument passing to pr_verifier.py |
| scripts/langchain/pr_verifier.py | Added --model2 CLI argument and updated evaluation functions to accept and use both model parameters in comparison mode |
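The --model2 flag added to pr_verifier.py might look roughly like this; only --model2 and its defaulting behavior are confirmed by the PR, while the other flags and defaults are illustrative:

```python
import argparse

def build_parser():
    """Sketch of the pr_verifier.py CLI with the new --model2 argument."""
    parser = argparse.ArgumentParser(prog="pr_verifier.py")
    parser.add_argument("--mode", choices=["checkbox", "evaluate", "compare"],
                        default="evaluate")
    parser.add_argument("--model", default="gpt-4o-mini",
                        help="Primary model (GitHub Models provider in compare mode)")
    parser.add_argument("--model2", default=None,
                        help="Second model for compare mode (OpenAI provider); "
                             "defaults to --model when omitted")
    return parser
```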


Comment on lines 209 to 217:

```diff
 clients: list[tuple[object, str]] = []
 if github_token:
     clients.append(
         (
             ChatOpenAI(
-                model=DEFAULT_MODEL,
+                model=first_model,
                 base_url=GITHUB_MODELS_BASE_URL,
                 api_key=github_token,
                 temperature=0.1,
```

Copilot AI commented on Jan 7, 2026

The provider label should include the model name for clarity in comparison reports. Currently it only shows "github-models" but should be formatted as "github-models/{first_model}" to match the pattern used in _get_llm_client() (lines 149, 162, 174, 186). This will help users understand which specific models are being compared in the comparison report.

Comment on lines 223 to 229:

```diff
 clients.append(
     (
         ChatOpenAI(
-            model=DEFAULT_MODEL,
+            model=second_model,
             api_key=openai_token,
             temperature=0.1,
         ),
```

Copilot AI commented on Jan 7, 2026

The provider label should include the model name for clarity in comparison reports. Currently it only shows "openai" but should be formatted as "openai/{second_model}" to match the pattern used in _get_llm_client() (lines 149, 162, 174, 186). This will help users understand which specific models are being compared in the comparison report.

Comment on the model2 input:

```yaml
  type: string
  default: 'gpt-4o-mini'
model2:
  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
```

Copilot AI commented on Jan 7, 2026

The example model "gpt-4.1" in the description is not a valid OpenAI model name. Consider using a valid example like "gpt-4-turbo" or "gpt-4o" instead.

Suggested change:

```diff
-  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
+  description: 'Second model for compare mode (e.g., gpt-5, gpt-4o)'
```

Comment on the model2 input:

```yaml
  type: string
  default: 'gpt-4o-mini'
model2:
  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
```

Copilot AI commented on Jan 7, 2026

The description should clarify that in compare mode, the first model uses GitHub Models provider and the second model uses OpenAI provider. This is important information for users to understand how the comparison will work. Consider updating the description to something like: "Second model for compare mode (uses OpenAI provider). Leave empty to compare same model across providers."

Suggested change:

```diff
-  description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
+  description: 'Second model for compare mode (uses OpenAI provider). Leave empty to compare same model across providers.'
```

Comment on the model2 input:

```yaml
  type: string
  default: 'github-models'
model2:
  description: 'Second model for compare mode (defaults to model if not specified)'
```

Copilot AI commented on Jan 7, 2026

The description should clarify that in compare mode, the first model uses GitHub Models provider and the second model uses OpenAI provider. Consider updating to: "Second model for compare mode (uses OpenAI provider). Defaults to model if not specified for same-model cross-provider comparison."

Suggested change:

```diff
-  description: 'Second model for compare mode (defaults to model if not specified)'
+  description: 'Second model for compare mode (uses OpenAI provider). Defaults to model if not specified for same-model cross-provider comparison.'
```

Comment on the detect job outputs:

```yaml
should_run: ${{ steps.check.outputs.should_run }}
mode: ${{ steps.check.outputs.mode }}
model: ${{ steps.check.outputs.model }}
model2: ${{ steps.check.outputs.model2 }}
```

Copilot AI commented on Jan 7, 2026

The model2 output is being declared here, but it's not being set in the label trigger path (lines 146-151 of the check step script). When the workflow is triggered via label (verify:checkbox, verify:evaluate, or verify:compare), the model2 output will be undefined. You need to add "core.setOutput('model2', '');" in the label trigger path alongside the other setOutput calls for model and provider.

The from_environment classmethod now accepts model1 and model2 parameters,
so the test mocks need to accept them too (with defaults).
@stranske stranske temporarily deployed to agent-high-privilege January 7, 2026 03:52 — with GitHub Actions Inactive
@stranske stranske enabled auto-merge (squash) January 7, 2026 03:54
@stranske stranske temporarily deployed to agent-high-privilege January 7, 2026 04:08 — with GitHub Actions Inactive
@stranske stranske merged commit 6b1ffc0 into main Jan 7, 2026
93 checks passed
@stranske stranske deleted the feat/compare-two-models branch January 7, 2026 04:09

Labels

autofix Opt-in automated formatting & lint remediation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants