
feat: add comparison scoring framework to compare-models-helper (t168.3) #783

Merged

alex-solovyev merged 1 commit into main from feature/t168.3-comparison-scoring on Feb 9, 2026

feat: add comparison scoring framework to compare-models-helper (t168.3)#783
alex-solovyev merged 1 commit intomainfrom
feature/t168.3-comparison-scoring

Conversation

alex-solovyev (Collaborator) commented on Feb 9, 2026

Summary

  • Add score command to record model comparison results with 5 scoring criteria
  • Add results command to view past comparisons and aggregate model rankings
  • SQLite storage in ~/.aidevops/.agent-workspace/memory/model-comparisons.db

Scoring Criteria (1-10 scale)

| Criterion    | Description                   |
|--------------|-------------------------------|
| correctness  | Factual accuracy              |
| completeness | Coverage of requirements      |
| quality      | Code quality (for code tasks) |
| clarity      | Response readability          |
| adherence    | Following instructions        |

Usage

# Record a comparison
compare-models-helper.sh score --task "fix React bug" --type code \
  --model claude-sonnet-4 --correctness 9 --completeness 8 --quality 8 --clarity 9 --adherence 9 \
  --model gpt-4.1 --correctness 8 --completeness 7 --quality 7 --clarity 8 --adherence 8 \
  --winner claude-sonnet-4

# View results
compare-models-helper.sh results
compare-models-helper.sh results --model sonnet --limit 5
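
# Ad-hoc inspection (illustrative sketch, not part of this PR; assumes sqlite3 is on PATH
# and at least one comparison has already been recorded)
sqlite3 -header -column ~/.aidevops/.agent-workspace/memory/model-comparisons.db \
  "SELECT model_id, COUNT(*) AS comparisons, ROUND(AVG(overall), 1) AS avg_overall
   FROM comparison_scores
   GROUP BY model_id
   ORDER BY avg_overall DESC;"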

Testing

  • bash -n syntax check passes
  • Score recording with 2 models produces correct table output
  • Results aggregation shows win rates and average scores
  • Multiple comparisons aggregate correctly

Fixes: t168.3 ref:GH#723

Summary by CodeRabbit

  • New Features
    • Added ability to record and persist model comparison scoring sessions with aggregated results
    • Added command to view historical comparisons with filtering options (limit, model, type) and aggregate model performance rankings

Add cmd_score() and cmd_results() for recording and viewing model
comparison results. SQLite storage tracks per-model scores across
5 criteria (correctness, completeness, quality, clarity, adherence)
plus latency and token usage. Aggregate rankings show win rates
and average scores across comparisons.

Fixes: t168.3 ref:GH#723
@gemini-code-assist

Summary of Changes

Hello @alex-solovyev, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the compare-models-helper.sh script by integrating a robust system for evaluating and tracking the performance of different AI models. It provides a structured way to capture qualitative and quantitative scores for model responses, storing this data persistently in an SQLite database. This allows users to systematically compare models over time, analyze their strengths and weaknesses across various tasks, and derive aggregate rankings, moving beyond ad-hoc comparisons to a more data-driven evaluation process.

Highlights

  • New Scoring Framework: Introduced a new framework for scoring and comparing AI models, allowing users to record evaluation results based on five distinct criteria: correctness, completeness, quality, clarity, and adherence.
  • Persistent Storage: Implemented SQLite database storage for model comparison results, enabling cross-session insights and historical tracking of model performance.
  • New Commands: Added two new commands to compare-models-helper.sh: score to record detailed comparison data, and results to view past comparisons, aggregate model rankings, and win rates.
Changelog
  • .agents/scripts/compare-models-helper.sh
    • Added score and results commands to the script's help output.
    • Implemented cmd_score function to handle recording model comparison data, including task details, model-specific scores (correctness, completeness, quality, clarity, adherence, overall), latency, tokens, strengths, weaknesses, and response file paths.
    • Implemented cmd_results function to display recent comparison entries and aggregate model rankings, with filtering options for model and task type.
    • Created init_results_db function to initialize and manage the SQLite database schema for comparisons and comparison_scores tables.
    • Integrated the new score and results commands into the main command parsing logic of the script.
    • Defined RESULTS_DB variable to specify the path for the SQLite database, defaulting to ~/.aidevops/.agent-workspace/memory/model-comparisons.db.
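
For illustration, that default resolves as follows (a sketch mirroring the RESULTS_DB definition quoted later in this review; AIDEVOPS_WORKSPACE_DIR is an optional override):

RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db"
# unset AIDEVOPS_WORKSPACE_DIR    -> $HOME/.aidevops/.agent-workspace/memory/model-comparisons.db
# AIDEVOPS_WORKSPACE_DIR=/tmp/ws  -> /tmp/ws/memory/model-comparisons.db
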
Activity
  • The pull request introduces a new feature to add a comparison scoring framework.
  • The author alex-solovyev has provided a comprehensive summary of the changes, including detailed scoring criteria and usage examples for the new commands.
  • Testing notes indicate that the new functionalities for score recording and results aggregation have been verified.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature              | Command             | Description |
|----------------------|---------------------|-------------|
| Code Review          | /gemini review      | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state. |
| Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help                 | /gemini help        | Displays a list of available commands. |

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Feb 9, 2026

Walkthrough

A self-contained comparison scoring framework was introduced using SQLite persistence. The changes add database infrastructure with two tables (comparisons and comparison_scores), a cmd_score handler to record multi-model scoring sessions with aggregation, a cmd_results handler to query and filter past comparisons, and integration into the main command dispatcher.
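
As a rough sketch of the dispatcher integration described above (the real code lives in .agents/scripts/compare-models-helper.sh; cmd_help and the exact case layout here are assumptions, only cmd_score and cmd_results are named in the change):

case "${1:-help}" in
    score)   shift; cmd_score "$@" ;;
    results) shift; cmd_results "$@" ;;
    *)       cmd_help ;;
esac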

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Database Infrastructure (.agents/scripts/compare-models-helper.sh) | Introduced RESULTS_DB path and init_results_db function creating comparisons and comparison_scores tables with indexes for efficient querying. |
| Scoring Command Handler (.agents/scripts/compare-models-helper.sh) | Added cmd_score to record multi-model scoring sessions, performing score aggregation, required metadata validation, database insertion, and formatted results summary output. |
| Results Query Command (.agents/scripts/compare-models-helper.sh) | Added cmd_results to retrieve and display past comparisons with filtering (limit, model, type), per-comparison score display, and aggregate model rankings. |
| Command Dispatcher (.agents/scripts/compare-models-helper.sh) | Extended main function to route new score and results commands alongside existing functionality. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Poem

🗄️ SQLite steps in, a database of dreams,
Scoring sessions flow in persistent streams,
Models compared, their rankings take shape,
Results remembered—no data escape! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and specifically describes the main change: adding a comparison scoring framework to the compare-models-helper script, with a concise reference to the task identifier. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable framework for scoring and comparing model performance. The implementation is well-structured with new score and results commands and SQLite for persistence. My review focuses on improving security and adherence to the repository's style guide. I've identified several critical SQL injection vulnerabilities where user input is not properly sanitized before being used in database queries. Addressing these is essential for security. I've also included suggestions to fix violations of the shell style guide regarding error suppression and variable declaration, and a recommendation to improve the precision of the overall score calculation. The suggested fixes align with repository-specific rules for shell scripting practices.


# Insert comparison record
local comp_id
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")


security-critical | critical

This SQL query is vulnerable to SQL injection. The variables $task_type, $evaluator, and $winner are inserted directly into the query string without escaping. An attacker could provide malicious input to these parameters, compromising the database. To fix this, you should escape single quotes in these variables, similar to how you've handled the $task variable.

Suggested change
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$(echo "$task_type" | sed "s/'/''/g")', '$(echo "$evaluator" | sed "s/'/''/g")', '$(echo "$winner" | sed "s/'/''/g")'); SELECT last_insert_rowid();")
References
  1. The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
  2. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform sed wrapper) instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.
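
One way to apply that escaping consistently without repeating the sed pipeline inline is a tiny self-contained helper, in line with reference 2 above (a sketch only; sql_escape is a hypothetical name, not part of this PR):

sql_escape() {
    # Double single quotes so the value is safe inside a '...' SQL string literal.
    printf '%s' "$1" | sed "s/'/''/g"
    return 0
}

sql_escape "it's a tie"   # prints: it''s a tie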

# Insert scores for each model
for entry in "${model_entries[@]}"; do
IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"


security-critical | critical

This query is also vulnerable to SQL injection. The variable $m_id is not escaped before being inserted into the SQL string. While other text variables in this query are escaped, this one was missed. All user-controllable string literals in an SQL query must be properly escaped to prevent security vulnerabilities.

Suggested change
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$(echo "$m_id" | sed "s/'/''/g")', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
References
  1. The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
  2. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform sed wrapper) instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.

Comment on lines +1052 to +1061
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi


security-critical | critical

The construction of where_clause is vulnerable to SQL injection. The variables $model_filter and $type_filter are taken from user input and inserted directly into the SQL query without escaping. This could allow an attacker to manipulate the query. You should sanitize these variables before using them in the query.

Suggested change
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi
if [[ -n "$model_filter" ]]; then
local sanitized_model_filter
sanitized_model_filter=$(echo "$model_filter" | sed "s/'/''/g")
where_clause="WHERE cs.model_id LIKE '%${sanitized_model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
local sanitized_type_filter
sanitized_type_filter=$(echo "$type_filter" | sed "s/'/''/g")
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$sanitized_type_filter'"
else
where_clause="WHERE c.task_type = '$sanitized_type_filter'"
fi
fi
References
  1. The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
  2. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform sed wrapper) instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.
  3. In shell scripts, capture a command's exit code in a variable instead of using $? directly in conditionals. This aligns with ShellCheck SC2181 and improves clarity for multi-way branches.


flush_model() {
if [[ -n "$current_model" ]]; then
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))


medium

This line has two issues:

  1. Style Guide Violation: It combines local with an assignment from a command substitution (arithmetic expansion). The repository style guide (line 11) requires separating declaration and assignment for exit code safety.
  2. Precision Loss: It uses integer division, which truncates the average score (e.g., 8.8 becomes 8), leading to inaccurate overall scores.

To fix both, you can separate the declaration and use bc for floating-point math. This would also require changing the overall column type to REAL in the init_results_db function.

Suggested change
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
local overall
overall=$(echo "scale=2; ($current_correct + $current_complete + $current_quality + $current_clarity + $current_adherence) / 5" | bc)
References
  1. The style guide requires declaring local variables and assigning them from command substitutions in separate steps to ensure exit code safety. (link)
  2. In shell scripts, capture a command's exit code in a variable instead of using $? directly in conditionals. This aligns with ShellCheck SC2181 and improves clarity for multi-way branches.
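
For a concrete sense of the difference (illustrative only; assumes bc is installed):

echo $(( (9 + 9 + 9 + 8 + 9) / 5 ))            # 8    (integer division truncates 8.8)
echo "scale=2; (9 + 9 + 9 + 8 + 9) / 5" | bc   # 8.80 (bc keeps two decimal places)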

echo ""

local count
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>/dev/null || echo "0")


medium

This sqlite3 command suppresses errors using 2>/dev/null. This pattern is repeated on lines 1084, 1096, and 1116, and it violates the repository style guide (line 50), which disallows blanket error suppression. Please remove 2>/dev/null from these calls to allow database errors to be visible for debugging. The existing || guards and pipes will still handle command failures gracefully.

Suggested change
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>/dev/null || echo "0")
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" || echo "0")
References
  1. The style guide prohibits blanket error suppression with 2>/dev/null. Errors should be visible for debugging or redirected to a log file. (link)
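
If console output must stay clean, one alternative consistent with that guideline is to append errors to a log file instead of discarding them (a sketch; the log path is a hypothetical choice, not from the PR):

ERROR_LOG="${TMPDIR:-/tmp}/compare-models-helper.err"
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>>"$ERROR_LOG" || echo "0")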


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Fix all issues with AI agents
In @.agents/scripts/compare-models-helper.sh:
- Around line 937-945: The flush_model function should declare and assign the
local variable separately and end with an explicit return; change "local
overall=$((...))" to "local overall" followed by "overall=$(( (current_correct +
current_complete + current_quality + current_clarity + current_adherence) / 5
))", and break the combined post-if assignments (current_model="",
current_correct=0 ...) into separate statements for clarity, then append an
explicit "return 0" at the end of flush_model; keep references to model_entries
and current_* variables intact.
- Around line 1051-1061: The WHERE construction interpolates untrusted vars
(model_filter, type_filter) and limit directly into SQL, enabling injection; fix
by sanitizing/validating inputs before building where_clause and limit usage:
escape single quotes and percent/wildcard characters in model_filter (used in
LIKE) and escape single quotes in type_filter, or better use parameterized
queries if the DB client supports them, and validate limit is an integer (reject
or default if not); update the code paths that set where_clause (variables
model_filter, type_filter, where_clause) and wherever limit is used to perform
these checks/escapes before concatenation.
- Around line 947-966: Add validation for the numeric score flags
(current_correct, current_complete, current_quality, current_clarity,
current_adherence, current_latency, current_tokens) right after argument
parsing: implement a helper function (e.g., validate_score) that checks each
variable is an integer (or numeric as appropriate) and within expected ranges,
logs a clear error and exits on failure; call this helper before any arithmetic
or SQL insertion (used later around the arithmetic and DB insert logic) and
ensure flush_model/current_model parsing still occurs unaffected.
- Around line 999-1006: The sqlite3 INSERT that sets comp_id can fail and leave
comp_id empty causing invalid downstream inserts into comparison_scores; update
the block around comp_id and the for-loop to check the sqlite3 command exit
status and validate comp_id is a non-empty numeric value before proceeding:
after running the sqlite3 INSERT into comparisons capture both stdout and stderr
(or check $?), verify comp_id is not empty and consists only of digits, log a
clear error to stderr including $RESULTS_DB and the sqlite3 error output if it
failed, and exit or skip the subsequent loop if validation fails so
comparison_scores inserts only run when comp_id is valid; reference symbols:
comp_id, RESULTS_DB, model_entries, comparison_scores, and the sqlite3 INSERT
commands.
- Around line 882-920: The init_results_db function currently assumes sqlite3 is
present; add a pre-check using command availability (e.g., verify sqlite3 with
command -v or type) before attempting to create the DB so you can print a clear
actionable error and exit if missing; reference init_results_db, RESULTS_DB, and
sqlite3 when implementing the check and ensure the script emits a readable error
message (including that sqlite3 is required) and returns a non-zero status
instead of failing with a cryptic "command not found".
- Around line 1079-1101: The recent comparisons query is missing the filter
stored in where_clause so --model/--type aren't applied; modify the sqlite3
SELECT that reads from the comparisons table (the block that SELECTs c.id,
c.created_at, c.task_type, c.task_description, c.winner_model) to include the
same where_clause used elsewhere by adding the variable (e.g. append
"$where_clause" or ${where_clause}) after "FROM comparisons c" and before "ORDER
BY", ensuring proper spacing/quoting so the shell expands it into the SQL
statement.
- Line 964: The wildcard case in the argument parsing (the pattern '*) shift
;;') silently swallows unknown flags; replace it with a handler that prints a
clear error to stderr including the offending token (e.g., "Unknown option: $1")
and exits non-zero so typos fail fast; update the case block that contains '*)
shift ;;' to use a warning/error message directed to stderr (using >&2) and call
exit 1 (or, if intended to continue, at minimum print a warning and shift) so
unrecognized arguments are not silently discarded.
- Around line 998-1006: The SQL insertion uses unescaped, user-controlled
variables (comp_id creation and the comparison_scores INSERT inside the
model_entries loop); fix by consistently escaping all string interpolations (not
just task) before embedding into SQL—e.g., sanitize task_type, evaluator,
winner, and m_id and any string columns (m_str, m_wea, m_res) with the same sed
"s/'/''/g" pattern or, better, switch the INSERT logic in the comp_id and
comparison_scores blocks to a parameterized/ prepared statement approach; update
the code around comp_id assignment and the for-loop that reads model_entries to
use the sanitized variables (or bound parameters) when calling sqlite3 so
apostrophes and injection vectors are handled safely.
🧹 Nitpick comments (1)
.agents/scripts/compare-models-helper.sh (1)

880-880: Consider declaring RESULTS_DB as readonly for consistency.

All other global constants in this file (MODEL_DATA, TIER_MAP, TASK_RECOMMENDATIONS, PROVIDER_ENV_KEYS) are declared readonly. RESULTS_DB is assigned once and never modified — making it readonly would be consistent and prevent accidental reassignment.

-RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db"
+readonly RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db"

Comment on lines +882 to +920
init_results_db() {
local db_dir
db_dir="$(dirname "$RESULTS_DB")"
mkdir -p "$db_dir"

sqlite3 "$RESULTS_DB" <<'SQL'
CREATE TABLE IF NOT EXISTS comparisons (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_description TEXT NOT NULL,
task_type TEXT DEFAULT 'general',
created_at TEXT DEFAULT (datetime('now')),
evaluator_model TEXT,
winner_model TEXT
);

CREATE TABLE IF NOT EXISTS comparison_scores (
id INTEGER PRIMARY KEY AUTOINCREMENT,
comparison_id INTEGER NOT NULL,
model_id TEXT NOT NULL,
correctness INTEGER DEFAULT 0,
completeness INTEGER DEFAULT 0,
code_quality INTEGER DEFAULT 0,
clarity INTEGER DEFAULT 0,
adherence INTEGER DEFAULT 0,
overall INTEGER DEFAULT 0,
latency_ms INTEGER DEFAULT 0,
tokens_used INTEGER DEFAULT 0,
strengths TEXT DEFAULT '',
weaknesses TEXT DEFAULT '',
response_file TEXT DEFAULT '',
FOREIGN KEY (comparison_id) REFERENCES comparisons(id)
);

CREATE INDEX IF NOT EXISTS idx_comparisons_task ON comparisons(task_type);
CREATE INDEX IF NOT EXISTS idx_comparisons_winner ON comparisons(winner_model);
CREATE INDEX IF NOT EXISTS idx_scores_model ON comparison_scores(model_id);
SQL
return 0
}


🛠️ Refactor suggestion | 🟠 Major

Add a sqlite3 availability check for clear error feedback.

If sqlite3 isn't installed, the script will fail with a cryptic "command not found" error under set -e. Per the coding guidelines on reliability and clear logging, a pre-check with an actionable error message would be much friendlier.

🛡️ Proposed fix
 init_results_db() {
+    if ! command -v sqlite3 &>/dev/null; then
+        print_error "sqlite3 is required for comparison scoring but not found in PATH"
+        return 1
+    fi
     local db_dir
     db_dir="$(dirname "$RESULTS_DB")"
     mkdir -p "$db_dir"

As per coding guidelines, "Automation scripts - focus on: Clear logging and feedback, Error recovery mechanisms".

🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 882 - 920, The
init_results_db function currently assumes sqlite3 is present; add a pre-check
using command availability (e.g., verify sqlite3 with command -v or type) before
attempting to create the DB so you can print a clear actionable error and exit
if missing; reference init_results_db, RESULTS_DB, and sqlite3 when implementing
the check and ensure the script emits a readable error message (including that
sqlite3 is required) and returns a non-zero status instead of failing with a
cryptic "command not found".

Comment on lines +937 to +945
flush_model() {
if [[ -n "$current_model" ]]; then
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
fi
current_model="" current_correct=0 current_complete=0 current_quality=0
current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
current_strengths="" current_weaknesses="" current_response=""
}


🛠️ Refactor suggestion | 🟠 Major

flush_model missing explicit return and has SC2155.

Per coding guidelines, all functions should include explicit returns. Also, line 939 triggers ShellCheck SC2155 (declare and assign separately to avoid masking return values).

♻️ Proposed fix
     flush_model() {
         if [[ -n "$current_model" ]]; then
-            local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
+            local overall
+            overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
             model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
         fi
         current_model="" current_correct=0 current_complete=0 current_quality=0
         current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
         current_strengths="" current_weaknesses="" current_response=""
+        return 0
     }

As per coding guidelines, "Run ShellCheck with zero violations on all scripts" and "Include explicit returns in shell scripts".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
flush_model() {
if [[ -n "$current_model" ]]; then
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
fi
current_model="" current_correct=0 current_complete=0 current_quality=0
current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
current_strengths="" current_weaknesses="" current_response=""
}
flush_model() {
if [[ -n "$current_model" ]]; then
local overall
overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
fi
current_model="" current_correct=0 current_complete=0 current_quality=0
current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
current_strengths="" current_weaknesses="" current_response=""
return 0
}
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 937 - 945, The
flush_model function should declare and assign the local variable separately and
end with an explicit return; change "local overall=$((...))" to "local overall"
followed by "overall=$(( (current_correct + current_complete + current_quality +
current_clarity + current_adherence) / 5 ))", and break the combined post-if
assignments (current_model="", current_correct=0 ...) into separate statements
for clarity, then append an explicit "return 0" at the end of flush_model; keep
references to model_entries and current_* variables intact.

Comment on lines +947 to +966
while [[ $# -gt 0 ]]; do
case "$1" in
--task) task="$2"; shift 2 ;;
--type) task_type="$2"; shift 2 ;;
--evaluator) evaluator="$2"; shift 2 ;;
--winner) winner="$2"; shift 2 ;;
--model) flush_model; current_model="$2"; shift 2 ;;
--correctness) current_correct="$2"; shift 2 ;;
--completeness) current_complete="$2"; shift 2 ;;
--quality) current_quality="$2"; shift 2 ;;
--clarity) current_clarity="$2"; shift 2 ;;
--adherence) current_adherence="$2"; shift 2 ;;
--latency) current_latency="$2"; shift 2 ;;
--tokens) current_tokens="$2"; shift 2 ;;
--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;
esac
done


⚠️ Potential issue | 🟠 Major

Validate numeric inputs to prevent arithmetic failures and corrupt data.

Score values (--correctness, --completeness, etc.) are accepted raw and used directly in integer arithmetic (line 939) and SQL inserts. Non-numeric or out-of-range input will cause a silent arithmetic error or insert garbage into the database. A simple validation helper would catch this early with a clear message.

🛡️ Proposed validation helper
+    # Validate integer in range
+    validate_score() {
+        local name="$1" val="$2"
+        if ! [[ "$val" =~ ^[0-9]+$ ]] || [[ "$val" -lt 1 || "$val" -gt 10 ]]; then
+            print_error "--$name must be an integer between 1 and 10 (got: $val)"
+            return 1
+        fi
+        return 0
+    }
+
     while [[ $# -gt 0 ]]; do
         case "$1" in
             --task) task="$2"; shift 2 ;;
             --type) task_type="$2"; shift 2 ;;
             --evaluator) evaluator="$2"; shift 2 ;;
             --winner) winner="$2"; shift 2 ;;
             --model) flush_model; current_model="$2"; shift 2 ;;
-            --correctness) current_correct="$2"; shift 2 ;;
-            --completeness) current_complete="$2"; shift 2 ;;
-            --quality) current_quality="$2"; shift 2 ;;
-            --clarity) current_clarity="$2"; shift 2 ;;
-            --adherence) current_adherence="$2"; shift 2 ;;
+            --correctness) validate_score correctness "$2" || return 1; current_correct="$2"; shift 2 ;;
+            --completeness) validate_score completeness "$2" || return 1; current_complete="$2"; shift 2 ;;
+            --quality) validate_score quality "$2" || return 1; current_quality="$2"; shift 2 ;;
+            --clarity) validate_score clarity "$2" || return 1; current_clarity="$2"; shift 2 ;;
+            --adherence) validate_score adherence "$2" || return 1; current_adherence="$2"; shift 2 ;;

As per coding guidelines, "Automation scripts - focus on: Reliability and robustness, Clear logging and feedback".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
while [[ $# -gt 0 ]]; do
case "$1" in
--task) task="$2"; shift 2 ;;
--type) task_type="$2"; shift 2 ;;
--evaluator) evaluator="$2"; shift 2 ;;
--winner) winner="$2"; shift 2 ;;
--model) flush_model; current_model="$2"; shift 2 ;;
--correctness) current_correct="$2"; shift 2 ;;
--completeness) current_complete="$2"; shift 2 ;;
--quality) current_quality="$2"; shift 2 ;;
--clarity) current_clarity="$2"; shift 2 ;;
--adherence) current_adherence="$2"; shift 2 ;;
--latency) current_latency="$2"; shift 2 ;;
--tokens) current_tokens="$2"; shift 2 ;;
--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;
esac
done
# Validate integer in range
validate_score() {
local name="$1" val="$2"
if ! [[ "$val" =~ ^[0-9]+$ ]] || [[ "$val" -lt 1 || "$val" -gt 10 ]]; then
print_error "--$name must be an integer between 1 and 10 (got: $val)"
return 1
fi
return 0
}
while [[ $# -gt 0 ]]; do
case "$1" in
--task) task="$2"; shift 2 ;;
--type) task_type="$2"; shift 2 ;;
--evaluator) evaluator="$2"; shift 2 ;;
--winner) winner="$2"; shift 2 ;;
--model) flush_model; current_model="$2"; shift 2 ;;
--correctness) validate_score correctness "$2" || return 1; current_correct="$2"; shift 2 ;;
--completeness) validate_score completeness "$2" || return 1; current_complete="$2"; shift 2 ;;
--quality) validate_score quality "$2" || return 1; current_quality="$2"; shift 2 ;;
--clarity) validate_score clarity "$2" || return 1; current_clarity="$2"; shift 2 ;;
--adherence) validate_score adherence "$2" || return 1; current_adherence="$2"; shift 2 ;;
--latency) current_latency="$2"; shift 2 ;;
--tokens) current_tokens="$2"; shift 2 ;;
--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;
esac
done
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 947 - 966, Add
validation for the numeric score flags (current_correct, current_complete,
current_quality, current_clarity, current_adherence, current_latency,
current_tokens) right after argument parsing: implement a helper function (e.g.,
validate_score) that checks each variable is an integer (or numeric as
appropriate) and within expected ranges, logs a clear error and exits on
failure; call this helper before any arithmetic or SQL insertion (used later
around the arithmetic and DB insert logic) and ensure flush_model/current_model
parsing still occurs unaffected.

--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;


⚠️ Potential issue | 🟠 Major

Silent discard of unrecognized flags can mask typos.

*) shift ;; silently ignores unknown arguments. A typo like --corectness 9 would be eaten without warning, resulting in a default score of 0 for that criterion — a data-integrity trap that's hard to debug.

🛡️ Proposed fix — warn on unknown flags
-            *) shift ;;
+            *) print_warning "Unknown option ignored: $1"; shift ;;

As per coding guidelines, "Automation scripts - focus on: Clear logging and feedback".

🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh at line 964, The wildcard case in
the argument parsing (the pattern '*) shift ;;') silently swallows unknown
flags; replace it with a handler that prints a clear error to stderr including
the offending token (e.g., "Unknown option: $1") and exits non-zero so typos
fail fast; update the case block that contains '*) shift ;;' to use a
warning/error message directed to stderr (using >&2) and call exit 1 (or, if
intended to continue, at minimum print a warning and shift) so unrecognized
arguments are not silently discarded.

Comment on lines +998 to +1006
# Insert comparison record
local comp_id
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")

# Insert scores for each model
for entry in "${model_entries[@]}"; do
IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
done


⚠️ Potential issue | 🔴 Critical

SQL injection — user-controlled values interpolated without escaping.

On line 1000, only $task is escaped via sed "s/'/''/g", but $task_type, $evaluator, and $winner are interpolated raw into the SQL string. On line 1005, $m_id is also unescaped. A malicious or simply apostrophe-containing value (e.g., --winner "it's a tie") breaks the query or allows injection.

Apply the same sed "s/'/''/g" escaping consistently to all string-interpolated values, or better yet, use a parameterized approach.

🔒 Proposed fix — consistent escaping
     # Insert comparison record
     local comp_id
-    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")
+    local esc_task esc_type esc_eval esc_winner
+    esc_task="${task//\'/\'\'}"
+    esc_type="${task_type//\'/\'\'}"
+    esc_eval="${evaluator//\'/\'\'}"
+    esc_winner="${winner//\'/\'\'}"
+    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$esc_task', '$esc_type', '$esc_eval', '$esc_winner'); SELECT last_insert_rowid();")
 
     # Insert scores for each model
     for entry in "${model_entries[@]}"; do
         IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
-        sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
+        local esc_mid="${m_id//\'/\'\'}"
+        local esc_str="${m_str//\'/\'\'}"
+        local esc_wea="${m_wea//\'/\'\'}"
+        local esc_res="${m_res//\'/\'\'}"
+        sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$esc_mid', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$esc_str', '$esc_wea', '$esc_res');"
     done
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 998 - 1006, The SQL
insertion uses unescaped, user-controlled variables (comp_id creation and the
comparison_scores INSERT inside the model_entries loop); fix by consistently
escaping all string interpolations (not just task) before embedding into
SQL—e.g., sanitize task_type, evaluator, winner, and m_id and any string columns
(m_str, m_wea, m_res) with the same sed "s/'/''/g" pattern or, better, switch
the INSERT logic in the comp_id and comparison_scores blocks to a parameterized/
prepared statement approach; update the code around comp_id assignment and the
for-loop that reads model_entries to use the sanitized variables (or bound
parameters) when calling sqlite3 so apostrophes and injection vectors are
handled safely.

Comment on lines +999 to +1006
local comp_id
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")

# Insert scores for each model
for entry in "${model_entries[@]}"; do
IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
done


⚠️ Potential issue | 🟠 Major

No error handling on SQLite insert operations.

If the sqlite3 call on line 1000 fails (permissions, disk full, corrupted DB), comp_id will be empty and the subsequent inserts on line 1005 will either fail or insert with an invalid comparison_id. Under set -e the script exits abruptly with no user-facing diagnostic.

🛡️ Proposed fix
     local comp_id
-    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons ...")
+    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons ...") || {
+        print_error "Failed to write comparison to database: $RESULTS_DB"
+        return 1
+    }
+    if [[ -z "$comp_id" ]]; then
+        print_error "Failed to obtain comparison ID from database"
+        return 1
+    fi

As per coding guidelines, "Automation scripts - focus on: Error recovery mechanisms, Clear logging and feedback".

🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 999 - 1006, The
sqlite3 INSERT that sets comp_id can fail and leave comp_id empty causing
invalid downstream inserts into comparison_scores; update the block around
comp_id and the for-loop to check the sqlite3 command exit status and validate
comp_id is a non-empty numeric value before proceeding: after running the
sqlite3 INSERT into comparisons capture both stdout and stderr (or check $?),
verify comp_id is not empty and consists only of digits, log a clear error to
stderr including $RESULTS_DB and the sqlite3 error output if it failed, and exit
or skip the subsequent loop if validation fails so comparison_scores inserts
only run when comp_id is valid; reference symbols: comp_id, RESULTS_DB,
model_entries, comparison_scores, and the sqlite3 INSERT commands.

Comment on lines +1051 to +1061
local where_clause=""
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi


⚠️ Potential issue | 🔴 Critical

SQL injection in filter construction — model_filter, type_filter, and limit interpolated raw.

model_filter and type_filter are placed directly into the SQL WHERE clause without escaping. limit on line 1083 is also unvalidated. All three accept arbitrary user input via --model, --type, and --limit.

🔒 Proposed fix
+    # Validate limit is a positive integer
+    if ! [[ "$limit" =~ ^[0-9]+$ ]]; then
+        print_error "--limit must be a positive integer"
+        return 1
+    fi
+
     local where_clause=""
     if [[ -n "$model_filter" ]]; then
-        where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
+        local esc_model="${model_filter//\'/\'\'}"
+        where_clause="WHERE cs.model_id LIKE '%${esc_model}%'"
     fi
     if [[ -n "$type_filter" ]]; then
+        local esc_type="${type_filter//\'/\'\'}"
         if [[ -n "$where_clause" ]]; then
-            where_clause="$where_clause AND c.task_type = '$type_filter'"
+            where_clause="$where_clause AND c.task_type = '$esc_type'"
         else
-            where_clause="WHERE c.task_type = '$type_filter'"
+            where_clause="WHERE c.task_type = '$esc_type'"
         fi
     fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local where_clause=""
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi
# Validate limit is a positive integer
if ! [[ "$limit" =~ ^[0-9]+$ ]]; then
print_error "--limit must be a positive integer"
return 1
fi
local where_clause=""
if [[ -n "$model_filter" ]]; then
local esc_model="${model_filter//\'/\'\'}"
where_clause="WHERE cs.model_id LIKE '%${esc_model}%'"
fi
if [[ -n "$type_filter" ]]; then
local esc_type="${type_filter//\'/\'\'}"
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$esc_type'"
else
where_clause="WHERE c.task_type = '$esc_type'"
fi
fi
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 1051 - 1061, The WHERE
construction interpolates untrusted vars (model_filter, type_filter) and limit
directly into SQL, enabling injection; fix by sanitizing/validating inputs
before building where_clause and limit usage: escape single quotes and
percent/wildcard characters in model_filter (used in LIKE) and escape single
quotes in type_filter, or better use parameterized queries if the DB client
supports them, and validate limit is an integer (reject or default if not);
update the code paths that set where_clause (variables model_filter,
type_filter, where_clause) and wherever limit is used to perform these
checks/escapes before concatenation.

Comment on lines +1079 to +1101
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
FROM comparisons c
ORDER BY c.created_at DESC
LIMIT $limit;
" 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do
echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)"
if [[ -n "$cwinner" ]]; then
echo " Winner: $cwinner"
fi

# Show scores for this comparison
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence
FROM comparison_scores
WHERE comparison_id = $cid
ORDER BY overall DESC;
" 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do
printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \
"$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca"
done
echo ""
done


⚠️ Potential issue | 🟠 Major

Bug: --model / --type filters not applied to the recent comparisons listing.

The where_clause is applied to the count query (line 1069) and the aggregate rankings query (line 1106), but the recent comparisons listing here queries comparisons without any filter. Running results --model sonnet will show all comparisons in this section while only showing filtered aggregates below — inconsistent and confusing.

🐛 Proposed fix — apply filter to recent comparisons query
     # Show recent comparisons
     sqlite3 -separator '|' "$RESULTS_DB" "
-        SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
-        FROM comparisons c
-        ORDER BY c.created_at DESC
+        SELECT DISTINCT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
+        FROM comparisons c
+        LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id
+        $where_clause
+        ORDER BY c.created_at DESC
         LIMIT $limit;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
FROM comparisons c
ORDER BY c.created_at DESC
LIMIT $limit;
" 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do
echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)"
if [[ -n "$cwinner" ]]; then
echo " Winner: $cwinner"
fi
# Show scores for this comparison
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence
FROM comparison_scores
WHERE comparison_id = $cid
ORDER BY overall DESC;
" 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do
printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \
"$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca"
done
echo ""
done
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT DISTINCT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
FROM comparisons c
LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id
$where_clause
ORDER BY c.created_at DESC
LIMIT $limit;
" 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do
echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)"
if [[ -n "$cwinner" ]]; then
echo " Winner: $cwinner"
fi
# Show scores for this comparison
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence
FROM comparison_scores
WHERE comparison_id = $cid
ORDER BY overall DESC;
" 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do
printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \
"$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca"
done
echo ""
done
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 1079 - 1101, The
recent comparisons query is missing the filter stored in where_clause so
--model/--type aren't applied; modify the sqlite3 SELECT that reads from the
comparisons table (the block that SELECTs c.id, c.created_at, c.task_type,
c.task_description, c.winner_model) to include the same where_clause used
elsewhere by adding the variable (e.g. append "$where_clause" or
${where_clause}) after "FROM comparisons c" and before "ORDER BY", ensuring
proper spacing/quoting so the shell expands it into the SQL statement.

alex-solovyev merged commit d219f66 into main on Feb 9, 2026
19 of 25 checks passed