Skip to content

feat: add model name to comparison report output#643

Merged
stranske merged 2 commits intomainfrom
add-model-name-to-comparison
Jan 7, 2026
Merged

feat: add model name to comparison report output#643
stranske merged 2 commits intomainfrom
add-model-name-to-comparison

Conversation

@stranske
Copy link
Copy Markdown
Owner

@stranske stranske commented Jan 7, 2026

Enhancement

Display the model name used by each provider in verify:compare reports. This makes it clear which models were used for evaluation, especially when using different models via the model1/model2 parameters.

Changes

1. Data Model

  • Added model: str | None field to EvaluationResult to track the model used
  • Updated _get_llm_clients to return list[tuple[object, str, str]] (client, provider, model)
  • Updated ComparisonRunner.clients type annotation

2. Report Display

Provider Summary Table:

| Provider | Model | Verdict | Confidence | Summary |
| --- | --- | --- | --- | --- |
| github-models | gpt-4o | PASS | 85% | ... |
| openai | gpt-5.2 | CONCERNS | 72% | ... |

Expandable Details:

#### github-models
- **Model:** gpt-4o
- **Verdict:** PASS
- **Confidence:** 85%
...

Testing

  • ✅ All 8 existing tests pass
  • ✅ Pre-commit hooks pass (syntax, formatting, type check, lint)

Use Cases

  • Comparing different models: gpt-4o (GitHub) vs gpt-5.2 (OpenAI)
  • Understanding which model version produced each evaluation
  • Debugging model-specific differences in verdicts
  • Tracking model usage for cost/performance analysis

Display the model name used by each provider in verify:compare reports.

Changes:
- Add 'model' field to EvaluationResult to track which model was used
- Update _get_llm_clients to return tuples with (client, provider, model)
- Add 'Model' column to Provider Summary table
- Display model name in expandable Full Provider Details section
- Update _fallback_evaluation to accept and store model parameter

Example output:
| Provider | Model | Verdict | Confidence | Summary |
| github-models | gpt-4o | PASS | 85% | ... |
| openai | gpt-5.2 | CONCERNS | 72% | ... |

This helps users understand which models were used for evaluation,
especially when using model1/model2 parameters in compare mode.
Copilot AI review requested due to automatic review settings January 7, 2026 13:22
@stranske stranske enabled auto-merge (squash) January 7, 2026 13:22
@agents-workflows-bot
Copy link
Copy Markdown
Contributor

⚠️ Action Required: Unable to determine source issue for PR #643. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

@github-actions github-actions bot added the autofix Opt-in automated formatting & lint remediation label Jan 7, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 7, 2026

Status | ✅ no new diagnostics
History points | 1
Timestamp | 2026-01-07 15:56:24 UTC
Report artifact | autofix-report-pr-643
Remaining | 0
New | 0
No additional artifacts

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 7, 2026

Automated Status Summary

Head SHA: 11b5c01
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job Result Logs
(no jobs reported) ⏳ pending

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

Metric Value
Current 92.21%
Baseline 85.00%
Delta +7.21%
Minimum 70.00%
Status ✅ Pass

Top Coverage Hotspots (lowest coverage)

File Coverage Missing
scripts/workflow_health_check.py 62.6% 28
scripts/classify_test_failures.py 62.9% 37
scripts/ledger_validate.py 65.3% 63
scripts/mypy_return_autofix.py 82.6% 11
scripts/ledger_migrate_base.py 85.5% 13
scripts/fix_cosmetic_aggregate.py 92.3% 1
scripts/coverage_history_append.py 92.8% 2
scripts/workflow_validator.py 93.3% 4
scripts/update_autofix_expectations.py 93.9% 1
scripts/pr_metrics_tracker.py 95.7% 3
scripts/generate_residual_trend.py 96.6% 1
scripts/build_autofix_pr_comment.py 97.0% 2
scripts/aggregate_agent_metrics.py 97.2% 0
scripts/fix_numpy_asserts.py 98.1% 0
scripts/sync_test_dependencies.py 98.3% 1

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

No scope information available

Tasks

  • No tasks defined

Acceptance criteria

  • No acceptance criteria defined

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 7, 2026

🤖 Keepalive Loop Status

PR #643 | Agent: Codex | Iteration 0/5

Current State

Metric Value
Iteration progress [----------] 0/5
Action wait (missing-agent-label)
Disposition skipped (transient)
Gate success
Tasks 0/6 complete
Keepalive ❌ disabled
Autofix ❌ disabled

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the comparison report feature by displaying the model name used by each provider, making it easier to identify which specific models (e.g., gpt-4o vs gpt-4-turbo) were used during evaluation. This is particularly useful when comparing results from different models via the model1/model2 parameters.

Key Changes:

  • Added model field to EvaluationResult data model to track the model used
  • Updated _get_llm_clients to return model names alongside clients and providers
  • Enhanced comparison report tables to include a "Model" column

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Add model parameter to test client tuples in test_pr_verifier_compare.py
- Update table header assertion to include Model column in test_pr_verifier_comparison_report.py
- Addresses failing checks and bot review comments in PR #643
@stranske stranske merged commit 6121bea into main Jan 7, 2026
36 checks passed
@stranske stranske deleted the add-model-name-to-comparison branch January 7, 2026 15:57
stranske added a commit that referenced this pull request Jan 7, 2026
* feat: add model name to comparison report output

Display the model name used by each provider in verify:compare reports.

Changes:
- Add 'model' field to EvaluationResult to track which model was used
- Update _get_llm_clients to return tuples with (client, provider, model)
- Add 'Model' column to Provider Summary table
- Display model name in expandable Full Provider Details section
- Update _fallback_evaluation to accept and store model parameter

Example output:
| Provider | Model | Verdict | Confidence | Summary |
| github-models | gpt-4o | PASS | 85% | ... |
| openai | gpt-5.2 | CONCERNS | 72% | ... |

This helps users understand which models were used for evaluation,
especially when using model1/model2 parameters in compare mode.

* fix: update tests to match new 3-tuple format with model name

- Add model parameter to test client tuples in test_pr_verifier_compare.py
- Update table header assertion to include Model column in test_pr_verifier_comparison_report.py
- Addresses failing checks and bot review comments in PR #643

* feat: disable automatic follow-up issue creation by agent verifier

The agent verifier will no longer automatically create issues after PR evaluations,
regardless of CONCERNS or FAIL verdicts. This addresses user feedback that
automatic issue creation was creating unwanted noise.

Changes:
- Modified _should_create_issue() to always return False
- Updated test to verify that issue creation is disabled
- Workflow will still have --create-issue flag but it will have no effect

* fix: add models permission to verifier workflows for GitHub Models API access

- Add 'models: read' permission to reusable-agents-verifier.yml
- Add 'models: read' permission to agents-verifier.yml
- Fixes 401 authentication errors when using GitHub Models provider
- Templates already have this permission configured

Resolves GitHub Models authentication issue identified in Travel-Plan-Permission PR #318 test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autofix Opt-in automated formatting & lint remediation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants