feat: Enable comparing two different models in verify:compare mode #629
Conversation
- Add model2 input to consumer workflow template and reusable workflow
- Update pr_verifier.py to accept --model2 CLI argument
- Modify _get_llm_clients() to use model1 for GitHub Models, model2 for OpenAI
- Update ComparisonRunner and evaluate_pr_multiple() to pass models through
- Enable side-by-side comparison of different models (e.g., gpt-5 vs gpt-4o)

This enhancement allows users to compare evaluations from two different models instead of just comparing the same model across providers.
Status: ✅ no new diagnostics

Automated Status Summary
Head SHA: c646f61
Coverage Overview
Coverage Trend
Top Coverage Hotspots (lowest coverage)
Updated automatically; will refresh on subsequent CI/Docker completions.
🤖 Keepalive Loop Status

PR #629 | Agent: Codex | Iteration 0/5

Current State

🔍 Failure Classification

| Error type | infrastructure |
Pull request overview
This PR enhances the verify:compare mode to support comparing two different LLM models side-by-side instead of only comparing the same model across different providers.
Key Changes:
- Added `model2` input parameter throughout the workflow chain (consumer template → reusable workflow → Python script)
- Modified `_get_llm_clients()` to accept model1/model2 parameters and assign them to the GitHub Models and OpenAI providers respectively
- Updated comparison logic to pass model parameters through the entire evaluation pipeline
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| templates/consumer-repo/.github/workflows/agents-verifier.yml | Added model2 workflow input, output declaration, and parameter passing through the workflow chain |
| .github/workflows/reusable-agents-verifier.yml | Added model2 input parameter and command-line argument passing to pr_verifier.py |
| scripts/langchain/pr_verifier.py | Added --model2 CLI argument and updated evaluation functions to accept and use both model parameters in comparison mode |
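The CLI side of the change might look roughly like this. Only `--model2` is confirmed by the PR; the other flags and defaults are illustrative assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Sketch of the pr_verifier.py argument parser; --model2 is the addition
    # described in this PR, the surrounding flags are assumptions.
    parser = argparse.ArgumentParser(description="PR verifier")
    parser.add_argument("--mode", default="evaluate")
    parser.add_argument("--model", default="gpt-4o-mini")
    parser.add_argument(
        "--model2",
        default=None,
        help="Second model for compare mode (OpenAI provider); "
        "falls back to --model when omitted",
    )
    return parser


args = build_parser().parse_args(["--mode", "compare", "--model2", "gpt-4o"])
```

Leaving `--model2` unset keeps the previous behavior of comparing the same model across providers.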
```diff
 clients: list[tuple[object, str]] = []
 if github_token:
     clients.append(
         (
             ChatOpenAI(
-                model=DEFAULT_MODEL,
+                model=first_model,
                 base_url=GITHUB_MODELS_BASE_URL,
                 api_key=github_token,
                 temperature=0.1,
```
The provider label should include the model name for clarity in comparison reports. Currently it only shows "github-models" but should be formatted as "github-models/{first_model}" to match the pattern used in _get_llm_client() (lines 149, 162, 174, 186). This will help users understand which specific models are being compared in the comparison report.
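A label helper along the lines the reviewer suggests could look like this. The helper name is hypothetical; the source only confirms the `provider/model` format used by `_get_llm_client()`.

```python
def provider_label(provider: str, model: str) -> str:
    # Hypothetical helper: format "provider/model" so comparison reports
    # show which concrete model ran under each provider.
    return f"{provider}/{model}"
```

For example, `provider_label("github-models", "gpt-5")` would label the first client `github-models/gpt-5` in the comparison report.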
```diff
 clients.append(
     (
         ChatOpenAI(
-            model=DEFAULT_MODEL,
+            model=second_model,
             api_key=openai_token,
             temperature=0.1,
         ),
```
The provider label should include the model name for clarity in comparison reports. Currently it only shows "openai" but should be formatted as "openai/{second_model}" to match the pattern used in _get_llm_client() (lines 149, 162, 174, 186). This will help users understand which specific models are being compared in the comparison report.
```yaml
        type: string
        default: 'gpt-4o-mini'
      model2:
        description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
```
The example model "gpt-4.1" in the description is not a valid OpenAI model name. Consider using a valid example like "gpt-4-turbo" or "gpt-4o" instead.
```diff
- description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
+ description: 'Second model for compare mode (e.g., gpt-5, gpt-4o)'
```
```yaml
        type: string
        default: 'gpt-4o-mini'
      model2:
        description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
```
The description should clarify that in compare mode, the first model uses GitHub Models provider and the second model uses OpenAI provider. This is important information for users to understand how the comparison will work. Consider updating the description to something like: "Second model for compare mode (uses OpenAI provider). Leave empty to compare same model across providers."
```diff
- description: 'Second model for compare mode (e.g., gpt-5, gpt-4.1)'
+ description: 'Second model for compare mode (uses OpenAI provider). Leave empty to compare same model across providers.'
```
```yaml
        type: string
        default: 'github-models'
      model2:
        description: 'Second model for compare mode (defaults to model if not specified)'
```
The description should clarify that in compare mode, the first model uses GitHub Models provider and the second model uses OpenAI provider. Consider updating to: "Second model for compare mode (uses OpenAI provider). Defaults to model if not specified for same-model cross-provider comparison."
```diff
- description: 'Second model for compare mode (defaults to model if not specified)'
+ description: 'Second model for compare mode (uses OpenAI provider). Defaults to model if not specified for same-model cross-provider comparison.'
```
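The default-fallback behavior described here reduces to something like the following sketch; the function name is hypothetical.

```python
def resolve_models(model: str, model2: str | None) -> tuple[str, str]:
    # If model2 is empty or None, compare the same model across providers;
    # otherwise compare the two distinct models.
    return model, (model2 or model)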
```yaml
should_run: ${{ steps.check.outputs.should_run }}
mode: ${{ steps.check.outputs.mode }}
model: ${{ steps.check.outputs.model }}
model2: ${{ steps.check.outputs.model2 }}
```
The model2 output is being declared here, but it's not being set in the label trigger path (lines 146-151 of the check step script). When the workflow is triggered via label (verify:checkbox, verify:evaluate, or verify:compare), the model2 output will be undefined. You need to add "core.setOutput('model2', '');" in the label trigger path alongside the other setOutput calls for model and provider.
The from_environment classmethod now accepts model1 and model2 parameters, so the test mocks need to accept them too (with defaults).
Automated Status Summary

Scope

After merging PR #103 (multi-agent routing infrastructure), we need to:

- `GITHUB_STEP_SUMMARY` output so iteration results are visible in the Actions UI

Context for Agent

Design Decisions & Constraints

| Marker | Source | Purpose | Decision |
|---|---|---|---|
| `<!-- keepalive-loop-summary -->` | github-actions[bot] | NEW: CLI agent iteration tracking | ✅ Keep for CLI agents |
| `<!-- keepalive-state:v1 -->` | agents-workflows-bot[bot] | State tracking | |
| `<!-- keepalive-round: N -->` | stranske | OLD: Instruction comment | ❌ CLI agents don't need this |

For CLI agent PRs (`agent:*` label), we should have exactly one updating comment (`<!-- keepalive-loop-summary -->`) instead of accumulating 10+ comments per PR.

Related Issues/PRs

References

Blockers & Dependencies

Tasks

Pipeline Validation

- `agent:codex` label

GITHUB_STEP_SUMMARY

- `agents-keepalive-loop.yml` after agent run

Conditional Status Summary

- `buildStatusBlock()` in `agents_pr_meta_update_body.js` to accept an `agentType` parameter
- When `agentType` is set (CLI agent): hide workflow table, hide head SHA/required checks

Comment Pattern Cleanup

For CLI agent PRs (`agent:*` label):

- `<!-- gate-summary: -->` comment posting (use step summary instead)
- `<!-- keepalive-round: N -->` instruction comments (task appendix replaces this)
- `<!-- keepalive-loop-summary -->` to be the single source of truth

For PRs with the `agent:*` label:

- `<!-- gate-summary: -->` comment
- `agent_type` output to detect the job so downstream workflows know the mode
- `agents-pr-meta.yml` to conditionally skip gate summary for CLI agent PRs

Acceptance criteria

Dependencies
Head SHA: 6588db1
Latest Runs: ❔ in progress — Gate
Required: gate: ❔ in progress