Add model name to comparison #644

Merged
stranske merged 5 commits into main from add-model-name-to-comparison
Jan 7, 2026

Conversation

@stranske
Owner

@stranske stranske commented Jan 7, 2026

No description provided.

Display the model name used by each provider in verify:compare reports.

Changes:
- Add 'model' field to EvaluationResult to track which model was used
- Update _get_llm_clients to return tuples with (client, provider, model)
- Add 'Model' column to Provider Summary table
- Display model name in expandable Full Provider Details section
- Update _fallback_evaluation to accept and store model parameter

Example output:
| Provider | Model | Verdict | Confidence | Summary |
| github-models | gpt-4o | PASS | 85% | ... |
| openai | gpt-5.2 | CONCERNS | 72% | ... |

This helps users understand which models were used for evaluation,
especially when using model1/model2 parameters in compare mode.
- Add model parameter to test client tuples in test_pr_verifier_compare.py
- Update table header assertion to include Model column in test_pr_verifier_comparison_report.py
- Addresses failing checks and bot review comments in PR #643

The agent verifier will no longer automatically create issues after PR evaluations,
regardless of CONCERNS or FAIL verdicts. This addresses user feedback that
automatic issue creation was creating unwanted noise.

Changes:
- Modified _should_create_issue() to always return False
- Updated test to verify that issue creation is disabled
- Workflow will still have --create-issue flag but it will have no effect
…I access

- Add 'models: read' permission to reusable-agents-verifier.yml
- Add 'models: read' permission to agents-verifier.yml
- Fixes 401 authentication errors when using GitHub Models provider
- Templates already have this permission configured

Resolves GitHub Models authentication issue identified in Travel-Plan-Permission PR #318 test
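Concretely, the fix amounts to adding one entry to each workflow's permissions block. The surrounding entries below are illustrative, not the actual workflow contents:

```yaml
permissions:
  contents: read   # illustrative; existing permissions vary per workflow
  models: read     # grants GitHub Models API access; avoids 401 errors
```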
Copilot AI review requested due to automatic review settings January 7, 2026 16:57
@agents-workflows-bot
Contributor

⚠️ Action Required: Unable to determine source issue for PR #644. The PR title, branch name, or body must contain the issue number (e.g. #123, branch: issue-123, or the hidden marker ).

@github-actions github-actions bot added the autofix Opt-in automated formatting & lint remediation label Jan 7, 2026
@github-actions
Contributor

github-actions bot commented Jan 7, 2026

Automated Status Summary

Head SHA: 0bc7819
Latest Runs: ⏳ pending — Gate
Required contexts: Gate / gate, Health 45 Agents Guard / Enforce agents workflow protections
Required: core tests (3.11): ⏳ pending, core tests (3.12): ⏳ pending, docker smoke: ⏳ pending, gate: ⏳ pending

Workflow / Job Result Logs
(no jobs reported) ⏳ pending

Coverage Overview

  • Coverage history entries: 1

Coverage Trend

| Metric | Value |
| Current | 92.21% |
| Baseline | 85.00% |
| Delta | +7.21% |
| Minimum | 70.00% |
| Status | ✅ Pass |

Top Coverage Hotspots (lowest coverage)

| File | Coverage | Missing |
| scripts/workflow_health_check.py | 62.6% | 28 |
| scripts/classify_test_failures.py | 62.9% | 37 |
| scripts/ledger_validate.py | 65.3% | 63 |
| scripts/mypy_return_autofix.py | 82.6% | 11 |
| scripts/ledger_migrate_base.py | 85.5% | 13 |
| scripts/fix_cosmetic_aggregate.py | 92.3% | 1 |
| scripts/coverage_history_append.py | 92.8% | 2 |
| scripts/workflow_validator.py | 93.3% | 4 |
| scripts/update_autofix_expectations.py | 93.9% | 1 |
| scripts/pr_metrics_tracker.py | 95.7% | 3 |
| scripts/generate_residual_trend.py | 96.6% | 1 |
| scripts/build_autofix_pr_comment.py | 97.0% | 2 |
| scripts/aggregate_agent_metrics.py | 97.2% | 0 |
| scripts/fix_numpy_asserts.py | 98.1% | 0 |
| scripts/sync_test_dependencies.py | 98.3% | 1 |

Updated automatically; will refresh on subsequent CI/Docker completions.


Keepalive checklist

Scope

No scope information available

Tasks

  • No tasks defined

Acceptance criteria

  • No acceptance criteria defined

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

🤖 Keepalive Loop Status

PR #644 | Agent: Codex | Iteration 0/5

Current State

| Metric | Value |
| Iteration progress | [----------] 0/5 |
| Action | wait (missing-agent-label) |
| Disposition | skipped (transient) |
| Gate | success |
| Tasks | 0/0 complete |
| Keepalive | ❌ disabled |
| Autofix | ❌ disabled |

🔍 Failure Classification

| Error type | infrastructure |
| Error category | resource |
| Suggested recovery | Confirm the referenced resource exists (repo, PR, branch, workflow, or file). |

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

| Status | ✅ no new diagnostics |
| History points | 1 |
| Timestamp | 2026-01-07 16:59:56 UTC |
| Report artifact | autofix-report-pr-644 |
| Remaining | 0 |
| New | 0 |
No additional artifacts

@stranske stranske merged commit a1254d1 into main Jan 7, 2026
37 checks passed
@stranske stranske deleted the add-model-name-to-comparison branch January 7, 2026 17:00
Contributor

Copilot AI left a comment


Pull request overview

This PR disables automatic follow-up issue creation for LLM evaluations and adds support for displaying model names in comparison reports, along with adding the necessary GitHub Models API permissions to workflows.

Key Changes

  • Disabled automatic issue creation for PR evaluation failures/concerns by modifying _should_create_issue to always return False
  • Added models: read permission to workflow files to enable GitHub Models API access
  • Updated documentation to reflect completed sync operations and the disabled issue creation feature

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| tests/scripts/test_pr_verifier_issue_creation.py | Updated test to verify that issue creation is now disabled, removing mock HTTP requests and asserting None return value |
| scripts/langchain/pr_verifier.py | Modified _should_create_issue function to always return False with explanatory comment |
| docs/plans/langchain-post-code-rollout.md | Documented PR #643 merge, updated sync status for consumer repos, added test results from Travel-Plan-Permission #318, and marked issue creation as disabled |
| .github/workflows/reusable-agents-verifier.yml | Added models: read permission for GitHub Models API access |
| .github/workflows/agents-verifier.yml | Added models: read permission and cleaned up trailing whitespace in JavaScript code |


Comment on lines 51 to +66
```python
def test_create_followup_issue_posts(monkeypatch) -> None:
    # Since automatic issue creation is disabled, this test verifies
    # that _create_followup_issue returns None without creating an issue
    # (the previous mock HTTP request plumbing has been removed).
    result = pr_verifier.EvaluationResult(verdict="FAIL", concerns=["Issue found."])
    monkeypatch.setenv("GITHUB_TOKEN", "token")
    monkeypatch.setenv("GITHUB_REPOSITORY", "org/repo")

    issue_number = pr_verifier._create_followup_issue(
        result,
        "- Pull request: [#99](https://example.com/pr/99)",
        labels=["agent:codex"],
        run_url="https://example.com/run/99",
    )

    # Automatic issue creation is disabled, so this should return None
    assert issue_number is None
```

Copilot AI Jan 7, 2026


There is no test coverage for the _should_create_issue function itself. While the existing tests verify that _create_followup_issue returns None when issue creation is disabled, there should be a direct test for _should_create_issue to ensure it returns False for all verdict types. This would make the disabled behavior more explicit and prevent accidental re-enabling without proper testing.


Labels

autofix Opt-in automated formatting & lint remediation
