
feat: add comparison scoring framework to compare-models-helper (t168.3) #783

Merged

alex-solovyev merged 1 commit into main from feature/t168.3-comparison-scoring on Feb 9, 2026

feat: add comparison scoring framework to compare-models-helper (t168.3)#783
alex-solovyev merged 1 commit intomainfrom
feature/t168.3-comparison-scoring

Conversation

alex-solovyev (Collaborator) commented on Feb 9, 2026

Summary

  • Add score command to record model comparison results with 5 scoring criteria
  • Add results command to view past comparisons and aggregate model rankings
  • SQLite storage in ~/.aidevops/.agent-workspace/memory/model-comparisons.db

Scoring Criteria (1-10 scale)

| Criterion    | Description                   |
|--------------|-------------------------------|
| correctness  | Factual accuracy              |
| completeness | Coverage of requirements      |
| quality      | Code quality (for code tasks) |
| clarity      | Response readability          |
| adherence    | Following instructions        |

Usage

# Record a comparison
compare-models-helper.sh score --task "fix React bug" --type code \
  --model claude-sonnet-4 --correctness 9 --completeness 8 --quality 8 --clarity 9 --adherence 9 \
  --model gpt-4.1 --correctness 8 --completeness 7 --quality 7 --clarity 8 --adherence 8 \
  --winner claude-sonnet-4

# View results
compare-models-helper.sh results
compare-models-helper.sh results --model sonnet --limit 5
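
# Ad-hoc inspection (illustrative sketch, not part of this PR; assumes sqlite3 is on PATH
# and at least one comparison has already been recorded)
sqlite3 -header -column ~/.aidevops/.agent-workspace/memory/model-comparisons.db \
  "SELECT model_id, COUNT(*) AS comparisons, ROUND(AVG(overall), 1) AS avg_overall
   FROM comparison_scores
   GROUP BY model_id
   ORDER BY avg_overall DESC;"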

Testing

  • bash -n syntax check passes
  • Score recording with 2 models produces correct table output
  • Results aggregation shows win rates and average scores
  • Multiple comparisons aggregate correctly

Fixes: t168.3 ref:GH#723

Summary by CodeRabbit

  • New Features
    • Added ability to record and persist model comparison scoring sessions with aggregated results
    • Added command to view historical comparisons with filtering options (limit, model, type) and aggregate model performance rankings

Add cmd_score() and cmd_results() for recording and viewing model
comparison results. SQLite storage tracks per-model scores across
5 criteria (correctness, completeness, quality, clarity, adherence)
plus latency and token usage. Aggregate rankings show win rates
and average scores across comparisons.

Fixes: t168.3 ref:GH#723
@gemini-code-assist

Summary of Changes

Hello @alex-solovyev, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the compare-models-helper.sh script by integrating a robust system for evaluating and tracking the performance of different AI models. It provides a structured way to capture qualitative and quantitative scores for model responses, storing this data persistently in an SQLite database. This allows users to systematically compare models over time, analyze their strengths and weaknesses across various tasks, and derive aggregate rankings, moving beyond ad-hoc comparisons to a more data-driven evaluation process.

Highlights

  • New Scoring Framework: Introduced a new framework for scoring and comparing AI models, allowing users to record evaluation results based on five distinct criteria: correctness, completeness, quality, clarity, and adherence.
  • Persistent Storage: Implemented SQLite database storage for model comparison results, enabling cross-session insights and historical tracking of model performance.
  • New Commands: Added two new commands to compare-models-helper.sh: score to record detailed comparison data, and results to view past comparisons, aggregate model rankings, and win rates.
Changelog
  • .agents/scripts/compare-models-helper.sh
    • Added score and results commands to the script's help output.
    • Implemented cmd_score function to handle recording model comparison data, including task details, model-specific scores (correctness, completeness, quality, clarity, adherence, overall), latency, tokens, strengths, weaknesses, and response file paths.
    • Implemented cmd_results function to display recent comparison entries and aggregate model rankings, with filtering options for model and task type.
    • Created init_results_db function to initialize and manage the SQLite database schema for comparisons and comparison_scores tables.
    • Integrated the new score and results commands into the main command parsing logic of the script.
    • Defined RESULTS_DB variable to specify the path for the SQLite database, defaulting to ~/.aidevops/.agent-workspace/memory/model-comparisons.db.
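
For illustration, that default resolves as follows (a sketch mirroring the RESULTS_DB definition quoted later in this review; AIDEVOPS_WORKSPACE_DIR is an optional override):

RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db"
# unset AIDEVOPS_WORKSPACE_DIR    -> $HOME/.aidevops/.agent-workspace/memory/model-comparisons.db
# AIDEVOPS_WORKSPACE_DIR=/tmp/ws  -> /tmp/ws/memory/model-comparisons.db
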
Activity
  • The pull request introduces a new feature to add a comparison scoring framework.
  • The author alex-solovyev has provided a comprehensive summary of the changes, including detailed scoring criteria and usage examples for the new commands.
  • Testing notes indicate that the new functionalities for score recording and results aggregation have been verified.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature              | Command             | Description |
|----------------------|---------------------|-------------|
| Code Review          | /gemini review      | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state. |
| Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help                 | /gemini help        | Displays a list of available commands. |

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Feb 9, 2026

Walkthrough

A self-contained comparison scoring framework was introduced using SQLite persistence. The changes add database infrastructure with two tables (comparisons and comparison_scores), a cmd_score handler to record multi-model scoring sessions with aggregation, a cmd_results handler to query and filter past comparisons, and integration into the main command dispatcher.
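
As a rough sketch of the dispatcher integration described above (the real code lives in .agents/scripts/compare-models-helper.sh; cmd_help and the exact case layout here are assumptions, only cmd_score and cmd_results are named in the change):

case "${1:-help}" in
    score)   shift; cmd_score "$@" ;;
    results) shift; cmd_results "$@" ;;
    *)       cmd_help ;;
esac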

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Database Infrastructure (.agents/scripts/compare-models-helper.sh) | Introduced RESULTS_DB path and init_results_db function creating comparisons and comparison_scores tables with indexes for efficient querying. |
| Scoring Command Handler (.agents/scripts/compare-models-helper.sh) | Added cmd_score to record multi-model scoring sessions, performing score aggregation, required metadata validation, database insertion, and formatted results summary output. |
| Results Query Command (.agents/scripts/compare-models-helper.sh) | Added cmd_results to retrieve and display past comparisons with filtering (limit, model, type), per-comparison score display, and aggregate model rankings. |
| Command Dispatcher (.agents/scripts/compare-models-helper.sh) | Extended main function to route new score and results commands alongside existing functionality. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Poem

🗄️ SQLite steps in, a database of dreams,
Scoring sessions flow in persistent streams,
Models compared, their rankings take shape,
Results remembered—no data escape! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and specifically describes the main change: adding a comparison scoring framework to the compare-models-helper script, with a concise reference to the task identifier. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable framework for scoring and comparing model performance. The implementation is well-structured with new score and results commands and SQLite for persistence. My review focuses on improving security and adherence to the repository's style guide. I've identified several critical SQL injection vulnerabilities where user input is not properly sanitized before being used in database queries. Addressing these is essential for security. I've also included suggestions to fix violations of the shell style guide regarding error suppression and variable declaration, and a recommendation to improve the precision of the overall score calculation. The suggested fixes align with repository-specific rules for shell scripting practices.


# Insert comparison record
local comp_id
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")


security-critical | critical

This SQL query is vulnerable to SQL injection. The variables $task_type, $evaluator, and $winner are inserted directly into the query string without escaping. An attacker could provide malicious input to these parameters, compromising the database. To fix this, you should escape single quotes in these variables, similar to how you've handled the $task variable.

Suggested change
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$(echo "$task_type" | sed "s/'/''/g")', '$(echo "$evaluator" | sed "s/'/''/g")', '$(echo "$winner" | sed "s/'/''/g")'); SELECT last_insert_rowid();")
References
  1. The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
  2. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform sed wrapper) instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.
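
One way to apply that escaping consistently without repeating the sed pipeline inline is a tiny self-contained helper, in line with reference 2 above (a sketch only; sql_escape is a hypothetical name, not part of this PR):

sql_escape() {
    # Double single quotes so the value is safe inside a '...' SQL string literal.
    printf '%s' "$1" | sed "s/'/''/g"
    return 0
}

sql_escape "it's a tie"   # prints: it''s a tie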

# Insert scores for each model
for entry in "${model_entries[@]}"; do
IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"


security-critical | critical

This query is also vulnerable to SQL injection. The variable $m_id is not escaped before being inserted into the SQL string. While other text variables in this query are escaped, this one was missed. All user-controllable string literals in an SQL query must be properly escaped to prevent security vulnerabilities.

Suggested change
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$(echo "$m_id" | sed "s/'/''/g")', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
References
  1. The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
  2. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform sed wrapper) instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.

Comment on lines +1052 to +1061
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi


security-critical | critical

The construction of where_clause is vulnerable to SQL injection. The variables $model_filter and $type_filter are taken from user input and inserted directly into the SQL query without escaping. This could allow an attacker to manipulate the query. You should sanitize these variables before using them in the query.

Suggested change
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi
if [[ -n "$model_filter" ]]; then
local sanitized_model_filter
sanitized_model_filter=$(echo "$model_filter" | sed "s/'/''/g")
where_clause="WHERE cs.model_id LIKE '%${sanitized_model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
local sanitized_type_filter
sanitized_type_filter=$(echo "$type_filter" | sed "s/'/''/g")
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$sanitized_type_filter'"
else
where_clause="WHERE c.task_type = '$sanitized_type_filter'"
fi
fi
References
  1. The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
  2. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform sed wrapper) instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.
  3. In shell scripts, capture a command's exit code in a variable instead of using $? directly in conditionals. This aligns with ShellCheck SC2181 and improves clarity for multi-way branches.


flush_model() {
if [[ -n "$current_model" ]]; then
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))


medium

This line has two issues:

  1. Style Guide Violation: It combines local with an assignment from a command substitution (arithmetic expansion). The repository style guide (line 11) requires separating declaration and assignment for exit code safety.
  2. Precision Loss: It uses integer division, which truncates the average score (e.g., 8.8 becomes 8), leading to inaccurate overall scores.

To fix both, you can separate the declaration and use bc for floating-point math. This would also require changing the overall column type to REAL in the init_results_db function.

Suggested change
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
local overall
overall=$(echo "scale=2; ($current_correct + $current_complete + $current_quality + $current_clarity + $current_adherence) / 5" | bc)
References
  1. The style guide requires declaring local variables and assigning them from command substitutions in separate steps to ensure exit code safety. (link)
  2. In shell scripts, capture a command's exit code in a variable instead of using $? directly in conditionals. This aligns with ShellCheck SC2181 and improves clarity for multi-way branches.
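
For a concrete sense of the difference (illustrative only; assumes bc is installed):

echo $(( (9 + 9 + 9 + 8 + 9) / 5 ))            # 8    (integer division truncates 8.8)
echo "scale=2; (9 + 9 + 9 + 8 + 9) / 5" | bc   # 8.80 (bc keeps two decimal places)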

echo ""

local count
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>/dev/null || echo "0")


medium

This sqlite3 command suppresses errors using 2>/dev/null. This pattern is repeated on lines 1084, 1096, and 1116, and it violates the repository style guide (line 50), which disallows blanket error suppression. Please remove 2>/dev/null from these calls to allow database errors to be visible for debugging. The existing || guards and pipes will still handle command failures gracefully.

Suggested change
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>/dev/null || echo "0")
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" || echo "0")
References
  1. The style guide prohibits blanket error suppression with 2>/dev/null. Errors should be visible for debugging or redirected to a log file. (link)
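
If console output must stay clean, one alternative consistent with that guideline is to append errors to a log file instead of discarding them (a sketch; the log path is a hypothetical choice, not from the PR):

ERROR_LOG="${TMPDIR:-/tmp}/compare-models-helper.err"
count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>>"$ERROR_LOG" || echo "0")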


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Fix all issues with AI agents
In @.agents/scripts/compare-models-helper.sh:
- Around line 937-945: The flush_model function should declare and assign the
local variable separately and end with an explicit return; change "local
overall=$((...))" to "local overall" followed by "overall=$(( (current_correct +
current_complete + current_quality + current_clarity + current_adherence) / 5
))", and break the combined post-if assignments (current_model="",
current_correct=0 ...) into separate statements for clarity, then append an
explicit "return 0" at the end of flush_model; keep references to model_entries
and current_* variables intact.
- Around line 1051-1061: The WHERE construction interpolates untrusted vars
(model_filter, type_filter) and limit directly into SQL, enabling injection; fix
by sanitizing/validating inputs before building where_clause and limit usage:
escape single quotes and percent/wildcard characters in model_filter (used in
LIKE) and escape single quotes in type_filter, or better use parameterized
queries if the DB client supports them, and validate limit is an integer (reject
or default if not); update the code paths that set where_clause (variables
model_filter, type_filter, where_clause) and wherever limit is used to perform
these checks/escapes before concatenation.
- Around line 947-966: Add validation for the numeric score flags
(current_correct, current_complete, current_quality, current_clarity,
current_adherence, current_latency, current_tokens) right after argument
parsing: implement a helper function (e.g., validate_score) that checks each
variable is an integer (or numeric as appropriate) and within expected ranges,
logs a clear error and exits on failure; call this helper before any arithmetic
or SQL insertion (used later around the arithmetic and DB insert logic) and
ensure flush_model/current_model parsing still occurs unaffected.
- Around line 999-1006: The sqlite3 INSERT that sets comp_id can fail and leave
comp_id empty causing invalid downstream inserts into comparison_scores; update
the block around comp_id and the for-loop to check the sqlite3 command exit
status and validate comp_id is a non-empty numeric value before proceeding:
after running the sqlite3 INSERT into comparisons capture both stdout and stderr
(or check $?), verify comp_id is not empty and consists only of digits, log a
clear error to stderr including $RESULTS_DB and the sqlite3 error output if it
failed, and exit or skip the subsequent loop if validation fails so
comparison_scores inserts only run when comp_id is valid; reference symbols:
comp_id, RESULTS_DB, model_entries, comparison_scores, and the sqlite3 INSERT
commands.
- Around line 882-920: The init_results_db function currently assumes sqlite3 is
present; add a pre-check using command availability (e.g., verify sqlite3 with
command -v or type) before attempting to create the DB so you can print a clear
actionable error and exit if missing; reference init_results_db, RESULTS_DB, and
sqlite3 when implementing the check and ensure the script emits a readable error
message (including that sqlite3 is required) and returns a non-zero status
instead of failing with a cryptic "command not found".
- Around line 1079-1101: The recent comparisons query is missing the filter
stored in where_clause so --model/--type aren't applied; modify the sqlite3
SELECT that reads from the comparisons table (the block that SELECTs c.id,
c.created_at, c.task_type, c.task_description, c.winner_model) to include the
same where_clause used elsewhere by adding the variable (e.g. append
"$where_clause" or ${where_clause}) after "FROM comparisons c" and before "ORDER
BY", ensuring proper spacing/quoting so the shell expands it into the SQL
statement.
- Line 964: The wildcard case in the argument parsing (the pattern '*) shift
;;') silently swallows unknown flags; replace it with a handler that prints a
clear error to stderr including the offending token (e.g., "Unknown option: $1")
and exits non-zero so typos fail fast; update the case block that contains '*)
shift ;;' to use a warning/error message directed to stderr (using >&2) and call
exit 1 (or, if intended to continue, at minimum print a warning and shift) so
unrecognized arguments are not silently discarded.
- Around line 998-1006: The SQL insertion uses unescaped, user-controlled
variables (comp_id creation and the comparison_scores INSERT inside the
model_entries loop); fix by consistently escaping all string interpolations (not
just task) before embedding into SQL—e.g., sanitize task_type, evaluator,
winner, and m_id and any string columns (m_str, m_wea, m_res) with the same sed
"s/'/''/g" pattern or, better, switch the INSERT logic in the comp_id and
comparison_scores blocks to a parameterized/ prepared statement approach; update
the code around comp_id assignment and the for-loop that reads model_entries to
use the sanitized variables (or bound parameters) when calling sqlite3 so
apostrophes and injection vectors are handled safely.
🧹 Nitpick comments (1)
.agents/scripts/compare-models-helper.sh (1)

880-880: Consider declaring RESULTS_DB as readonly for consistency.

All other global constants in this file (MODEL_DATA, TIER_MAP, TASK_RECOMMENDATIONS, PROVIDER_ENV_KEYS) are declared readonly. RESULTS_DB is assigned once and never modified — making it readonly would be consistent and prevent accidental reassignment.

-RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db"
+readonly RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db"

Comment on lines +882 to +920
init_results_db() {
local db_dir
db_dir="$(dirname "$RESULTS_DB")"
mkdir -p "$db_dir"

sqlite3 "$RESULTS_DB" <<'SQL'
CREATE TABLE IF NOT EXISTS comparisons (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_description TEXT NOT NULL,
task_type TEXT DEFAULT 'general',
created_at TEXT DEFAULT (datetime('now')),
evaluator_model TEXT,
winner_model TEXT
);

CREATE TABLE IF NOT EXISTS comparison_scores (
id INTEGER PRIMARY KEY AUTOINCREMENT,
comparison_id INTEGER NOT NULL,
model_id TEXT NOT NULL,
correctness INTEGER DEFAULT 0,
completeness INTEGER DEFAULT 0,
code_quality INTEGER DEFAULT 0,
clarity INTEGER DEFAULT 0,
adherence INTEGER DEFAULT 0,
overall INTEGER DEFAULT 0,
latency_ms INTEGER DEFAULT 0,
tokens_used INTEGER DEFAULT 0,
strengths TEXT DEFAULT '',
weaknesses TEXT DEFAULT '',
response_file TEXT DEFAULT '',
FOREIGN KEY (comparison_id) REFERENCES comparisons(id)
);

CREATE INDEX IF NOT EXISTS idx_comparisons_task ON comparisons(task_type);
CREATE INDEX IF NOT EXISTS idx_comparisons_winner ON comparisons(winner_model);
CREATE INDEX IF NOT EXISTS idx_scores_model ON comparison_scores(model_id);
SQL
return 0
}


🛠️ Refactor suggestion | 🟠 Major

Add a sqlite3 availability check for clear error feedback.

If sqlite3 isn't installed, the script will fail with a cryptic "command not found" error under set -e. Per the coding guidelines on reliability and clear logging, a pre-check with an actionable error message would be much friendlier.

🛡️ Proposed fix
 init_results_db() {
+    if ! command -v sqlite3 &>/dev/null; then
+        print_error "sqlite3 is required for comparison scoring but not found in PATH"
+        return 1
+    fi
     local db_dir
     db_dir="$(dirname "$RESULTS_DB")"
     mkdir -p "$db_dir"

As per coding guidelines, "Automation scripts - focus on: Clear logging and feedback, Error recovery mechanisms".

🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 882 - 920, The
init_results_db function currently assumes sqlite3 is present; add a pre-check
using command availability (e.g., verify sqlite3 with command -v or type) before
attempting to create the DB so you can print a clear actionable error and exit
if missing; reference init_results_db, RESULTS_DB, and sqlite3 when implementing
the check and ensure the script emits a readable error message (including that
sqlite3 is required) and returns a non-zero status instead of failing with a
cryptic "command not found".

Comment on lines +937 to +945
flush_model() {
if [[ -n "$current_model" ]]; then
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
fi
current_model="" current_correct=0 current_complete=0 current_quality=0
current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
current_strengths="" current_weaknesses="" current_response=""
}


🛠️ Refactor suggestion | 🟠 Major

flush_model missing explicit return and has SC2155.

Per coding guidelines, all functions should include explicit returns. Also, line 939 triggers ShellCheck SC2155 (declare and assign separately to avoid masking return values).

♻️ Proposed fix
     flush_model() {
         if [[ -n "$current_model" ]]; then
-            local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
+            local overall
+            overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
             model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
         fi
         current_model="" current_correct=0 current_complete=0 current_quality=0
         current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
         current_strengths="" current_weaknesses="" current_response=""
+        return 0
     }

As per coding guidelines, "Run ShellCheck with zero violations on all scripts" and "Include explicit returns in shell scripts".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
flush_model() {
if [[ -n "$current_model" ]]; then
local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
fi
current_model="" current_correct=0 current_complete=0 current_quality=0
current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
current_strengths="" current_weaknesses="" current_response=""
}
flush_model() {
if [[ -n "$current_model" ]]; then
local overall
overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
fi
current_model="" current_correct=0 current_complete=0 current_quality=0
current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
current_strengths="" current_weaknesses="" current_response=""
return 0
}
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 937 - 945, The
flush_model function should declare and assign the local variable separately and
end with an explicit return; change "local overall=$((...))" to "local overall"
followed by "overall=$(( (current_correct + current_complete + current_quality +
current_clarity + current_adherence) / 5 ))", and break the combined post-if
assignments (current_model="", current_correct=0 ...) into separate statements
for clarity, then append an explicit "return 0" at the end of flush_model; keep
references to model_entries and current_* variables intact.

Comment on lines +947 to +966
while [[ $# -gt 0 ]]; do
case "$1" in
--task) task="$2"; shift 2 ;;
--type) task_type="$2"; shift 2 ;;
--evaluator) evaluator="$2"; shift 2 ;;
--winner) winner="$2"; shift 2 ;;
--model) flush_model; current_model="$2"; shift 2 ;;
--correctness) current_correct="$2"; shift 2 ;;
--completeness) current_complete="$2"; shift 2 ;;
--quality) current_quality="$2"; shift 2 ;;
--clarity) current_clarity="$2"; shift 2 ;;
--adherence) current_adherence="$2"; shift 2 ;;
--latency) current_latency="$2"; shift 2 ;;
--tokens) current_tokens="$2"; shift 2 ;;
--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;
esac
done


⚠️ Potential issue | 🟠 Major

Validate numeric inputs to prevent arithmetic failures and corrupt data.

Score values (--correctness, --completeness, etc.) are accepted raw and used directly in integer arithmetic (line 939) and SQL inserts. Non-numeric or out-of-range input will cause a silent arithmetic error or insert garbage into the database. A simple validation helper would catch this early with a clear message.

🛡️ Proposed validation helper
+    # Validate integer in range
+    validate_score() {
+        local name="$1" val="$2"
+        if ! [[ "$val" =~ ^[0-9]+$ ]] || [[ "$val" -lt 1 || "$val" -gt 10 ]]; then
+            print_error "--$name must be an integer between 1 and 10 (got: $val)"
+            return 1
+        fi
+        return 0
+    }
+
     while [[ $# -gt 0 ]]; do
         case "$1" in
             --task) task="$2"; shift 2 ;;
             --type) task_type="$2"; shift 2 ;;
             --evaluator) evaluator="$2"; shift 2 ;;
             --winner) winner="$2"; shift 2 ;;
             --model) flush_model; current_model="$2"; shift 2 ;;
-            --correctness) current_correct="$2"; shift 2 ;;
-            --completeness) current_complete="$2"; shift 2 ;;
-            --quality) current_quality="$2"; shift 2 ;;
-            --clarity) current_clarity="$2"; shift 2 ;;
-            --adherence) current_adherence="$2"; shift 2 ;;
+            --correctness) validate_score correctness "$2" || return 1; current_correct="$2"; shift 2 ;;
+            --completeness) validate_score completeness "$2" || return 1; current_complete="$2"; shift 2 ;;
+            --quality) validate_score quality "$2" || return 1; current_quality="$2"; shift 2 ;;
+            --clarity) validate_score clarity "$2" || return 1; current_clarity="$2"; shift 2 ;;
+            --adherence) validate_score adherence "$2" || return 1; current_adherence="$2"; shift 2 ;;

As per coding guidelines, "Automation scripts - focus on: Reliability and robustness, Clear logging and feedback".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
while [[ $# -gt 0 ]]; do
case "$1" in
--task) task="$2"; shift 2 ;;
--type) task_type="$2"; shift 2 ;;
--evaluator) evaluator="$2"; shift 2 ;;
--winner) winner="$2"; shift 2 ;;
--model) flush_model; current_model="$2"; shift 2 ;;
--correctness) current_correct="$2"; shift 2 ;;
--completeness) current_complete="$2"; shift 2 ;;
--quality) current_quality="$2"; shift 2 ;;
--clarity) current_clarity="$2"; shift 2 ;;
--adherence) current_adherence="$2"; shift 2 ;;
--latency) current_latency="$2"; shift 2 ;;
--tokens) current_tokens="$2"; shift 2 ;;
--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;
esac
done
# Validate integer in range
validate_score() {
local name="$1" val="$2"
if ! [[ "$val" =~ ^[0-9]+$ ]] || [[ "$val" -lt 1 || "$val" -gt 10 ]]; then
print_error "--$name must be an integer between 1 and 10 (got: $val)"
return 1
fi
return 0
}
while [[ $# -gt 0 ]]; do
case "$1" in
--task) task="$2"; shift 2 ;;
--type) task_type="$2"; shift 2 ;;
--evaluator) evaluator="$2"; shift 2 ;;
--winner) winner="$2"; shift 2 ;;
--model) flush_model; current_model="$2"; shift 2 ;;
--correctness) validate_score correctness "$2" || return 1; current_correct="$2"; shift 2 ;;
--completeness) validate_score completeness "$2" || return 1; current_complete="$2"; shift 2 ;;
--quality) validate_score quality "$2" || return 1; current_quality="$2"; shift 2 ;;
--clarity) validate_score clarity "$2" || return 1; current_clarity="$2"; shift 2 ;;
--adherence) validate_score adherence "$2" || return 1; current_adherence="$2"; shift 2 ;;
--latency) current_latency="$2"; shift 2 ;;
--tokens) current_tokens="$2"; shift 2 ;;
--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;
esac
done
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 947 - 966, Add
validation for the numeric score flags (current_correct, current_complete,
current_quality, current_clarity, current_adherence, current_latency,
current_tokens) right after argument parsing: implement a helper function (e.g.,
validate_score) that checks each variable is an integer (or numeric as
appropriate) and within expected ranges, logs a clear error and exits on
failure; call this helper before any arithmetic or SQL insertion (used later
around the arithmetic and DB insert logic) and ensure flush_model/current_model
parsing still occurs unaffected.

--strengths) current_strengths="$2"; shift 2 ;;
--weaknesses) current_weaknesses="$2"; shift 2 ;;
--response) current_response="$2"; shift 2 ;;
*) shift ;;


⚠️ Potential issue | 🟠 Major

Silent discard of unrecognized flags can mask typos.

*) shift ;; silently ignores unknown arguments. A typo like --corectness 9 would be eaten without warning, resulting in a default score of 0 for that criterion — a data-integrity trap that's hard to debug.

🛡️ Proposed fix — warn on unknown flags
-            *) shift ;;
+            *) print_warning "Unknown option ignored: $1"; shift ;;

As per coding guidelines, "Automation scripts - focus on: Clear logging and feedback".

🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh at line 964, The wildcard case in
the argument parsing (the pattern '*) shift ;;') silently swallows unknown
flags; replace it with a handler that prints a clear error to stderr including
the offending token (e.g., "Unknown option: $1") and exits non-zero so typos
fail fast; update the case block that contains '*) shift ;;' to use a
warning/error message directed to stderr (using >&2) and call exit 1 (or, if
intended to continue, at minimum print a warning and shift) so unrecognized
arguments are not silently discarded.

Comment on lines +998 to +1006
# Insert comparison record
local comp_id
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")

# Insert scores for each model
for entry in "${model_entries[@]}"; do
IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
done


⚠️ Potential issue | 🔴 Critical

SQL injection — user-controlled values interpolated without escaping.

On line 1000, only $task is escaped via sed "s/'/''/g", but $task_type, $evaluator, and $winner are interpolated raw into the SQL string. On line 1005, $m_id is also unescaped. A malicious or simply apostrophe-containing value (e.g., --winner "it's a tie") breaks the query or allows injection.

Apply the same sed "s/'/''/g" escaping consistently to all string-interpolated values, or better yet, use a parameterized approach.

🔒 Proposed fix — consistent escaping
     # Insert comparison record
     local comp_id
-    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")
+    local esc_task esc_type esc_eval esc_winner
+    esc_task="${task//\'/\'\'}"
+    esc_type="${task_type//\'/\'\'}"
+    esc_eval="${evaluator//\'/\'\'}"
+    esc_winner="${winner//\'/\'\'}"
+    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$esc_task', '$esc_type', '$esc_eval', '$esc_winner'); SELECT last_insert_rowid();")
 
     # Insert scores for each model
     for entry in "${model_entries[@]}"; do
         IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
-        sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
+        local esc_mid="${m_id//\'/\'\'}"
+        local esc_str="${m_str//\'/\'\'}"
+        local esc_wea="${m_wea//\'/\'\'}"
+        local esc_res="${m_res//\'/\'\'}"
+        sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$esc_mid', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$esc_str', '$esc_wea', '$esc_res');"
     done
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 998 - 1006, The SQL
insertion uses unescaped, user-controlled variables (comp_id creation and the
comparison_scores INSERT inside the model_entries loop); fix by consistently
escaping all string interpolations (not just task) before embedding into
SQL—e.g., sanitize task_type, evaluator, winner, and m_id and any string columns
(m_str, m_wea, m_res) with the same sed "s/'/''/g" pattern or, better, switch
the INSERT logic in the comp_id and comparison_scores blocks to a parameterized/
prepared statement approach; update the code around comp_id assignment and the
for-loop that reads model_entries to use the sanitized variables (or bound
parameters) when calling sqlite3 so apostrophes and injection vectors are
handled safely.

Comment on lines +999 to +1006
local comp_id
comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")

# Insert scores for each model
for entry in "${model_entries[@]}"; do
IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
done


⚠️ Potential issue | 🟠 Major

No error handling on SQLite insert operations.

If the sqlite3 call on line 1000 fails (permissions, disk full, corrupted DB), comp_id will be empty and the subsequent inserts on line 1005 will either fail or insert with an invalid comparison_id. Under set -e the script exits abruptly with no user-facing diagnostic.

🛡️ Proposed fix
     local comp_id
-    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons ...")
+    comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons ...") || {
+        print_error "Failed to write comparison to database: $RESULTS_DB"
+        return 1
+    }
+    if [[ -z "$comp_id" ]]; then
+        print_error "Failed to obtain comparison ID from database"
+        return 1
+    fi

As per coding guidelines, "Automation scripts - focus on: Error recovery mechanisms, Clear logging and feedback".

🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 999 - 1006, The
sqlite3 INSERT that sets comp_id can fail and leave comp_id empty causing
invalid downstream inserts into comparison_scores; update the block around
comp_id and the for-loop to check the sqlite3 command exit status and validate
comp_id is a non-empty numeric value before proceeding: after running the
sqlite3 INSERT into comparisons capture both stdout and stderr (or check $?),
verify comp_id is not empty and consists only of digits, log a clear error to
stderr including $RESULTS_DB and the sqlite3 error output if it failed, and exit
or skip the subsequent loop if validation fails so comparison_scores inserts
only run when comp_id is valid; reference symbols: comp_id, RESULTS_DB,
model_entries, comparison_scores, and the sqlite3 INSERT commands.

Comment on lines +1051 to +1061
local where_clause=""
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi


⚠️ Potential issue | 🔴 Critical

SQL injection in filter construction — model_filter, type_filter, and limit interpolated raw.

model_filter and type_filter are placed directly into the SQL WHERE clause without escaping. limit on line 1083 is also unvalidated. All three accept arbitrary user input via --model, --type, and --limit.

🔒 Proposed fix
+    # Validate limit is a positive integer
+    if ! [[ "$limit" =~ ^[0-9]+$ ]]; then
+        print_error "--limit must be a positive integer"
+        return 1
+    fi
+
     local where_clause=""
     if [[ -n "$model_filter" ]]; then
-        where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
+        local esc_model="${model_filter//\'/\'\'}"
+        where_clause="WHERE cs.model_id LIKE '%${esc_model}%'"
     fi
     if [[ -n "$type_filter" ]]; then
+        local esc_type="${type_filter//\'/\'\'}"
         if [[ -n "$where_clause" ]]; then
-            where_clause="$where_clause AND c.task_type = '$type_filter'"
+            where_clause="$where_clause AND c.task_type = '$esc_type'"
         else
-            where_clause="WHERE c.task_type = '$type_filter'"
+            where_clause="WHERE c.task_type = '$esc_type'"
         fi
     fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local where_clause=""
if [[ -n "$model_filter" ]]; then
where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
fi
if [[ -n "$type_filter" ]]; then
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$type_filter'"
else
where_clause="WHERE c.task_type = '$type_filter'"
fi
fi
# Validate limit is a positive integer
if ! [[ "$limit" =~ ^[0-9]+$ ]]; then
print_error "--limit must be a positive integer"
return 1
fi
local where_clause=""
if [[ -n "$model_filter" ]]; then
local esc_model="${model_filter//\'/\'\'}"
where_clause="WHERE cs.model_id LIKE '%${esc_model}%'"
fi
if [[ -n "$type_filter" ]]; then
local esc_type="${type_filter//\'/\'\'}"
if [[ -n "$where_clause" ]]; then
where_clause="$where_clause AND c.task_type = '$esc_type'"
else
where_clause="WHERE c.task_type = '$esc_type'"
fi
fi
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 1051 - 1061, The WHERE
construction interpolates untrusted vars (model_filter, type_filter) and limit
directly into SQL, enabling injection; fix by sanitizing/validating inputs
before building where_clause and limit usage: escape single quotes and
percent/wildcard characters in model_filter (used in LIKE) and escape single
quotes in type_filter, or better use parameterized queries if the DB client
supports them, and validate limit is an integer (reject or default if not);
update the code paths that set where_clause (variables model_filter,
type_filter, where_clause) and wherever limit is used to perform these
checks/escapes before concatenation.

Comment on lines +1079 to +1101
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
FROM comparisons c
ORDER BY c.created_at DESC
LIMIT $limit;
" 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do
echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)"
if [[ -n "$cwinner" ]]; then
echo " Winner: $cwinner"
fi

# Show scores for this comparison
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence
FROM comparison_scores
WHERE comparison_id = $cid
ORDER BY overall DESC;
" 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do
printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \
"$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca"
done
echo ""
done


⚠️ Potential issue | 🟠 Major

Bug: --model / --type filters not applied to the recent comparisons listing.

The where_clause is applied to the count query (line 1069) and the aggregate rankings query (line 1106), but the recent comparisons listing here queries comparisons without any filter. Running results --model sonnet will show all comparisons in this section while only showing filtered aggregates below — inconsistent and confusing.

🐛 Proposed fix — apply filter to recent comparisons query
     # Show recent comparisons
     sqlite3 -separator '|' "$RESULTS_DB" "
-        SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
-        FROM comparisons c
-        ORDER BY c.created_at DESC
+        SELECT DISTINCT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
+        FROM comparisons c
+        LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id
+        $where_clause
+        ORDER BY c.created_at DESC
         LIMIT $limit;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
FROM comparisons c
ORDER BY c.created_at DESC
LIMIT $limit;
" 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do
echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)"
if [[ -n "$cwinner" ]]; then
echo " Winner: $cwinner"
fi
# Show scores for this comparison
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence
FROM comparison_scores
WHERE comparison_id = $cid
ORDER BY overall DESC;
" 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do
printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \
"$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca"
done
echo ""
done
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT DISTINCT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
FROM comparisons c
LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id
$where_clause
ORDER BY c.created_at DESC
LIMIT $limit;
" 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do
echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)"
if [[ -n "$cwinner" ]]; then
echo " Winner: $cwinner"
fi
# Show scores for this comparison
sqlite3 -separator '|' "$RESULTS_DB" "
SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence
FROM comparison_scores
WHERE comparison_id = $cid
ORDER BY overall DESC;
" 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do
printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \
"$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca"
done
echo ""
done
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 1079 - 1101, The
recent comparisons query is missing the filter stored in where_clause so
--model/--type aren't applied; modify the sqlite3 SELECT that reads from the
comparisons table (the block that SELECTs c.id, c.created_at, c.task_type,
c.task_description, c.winner_model) to include the same where_clause used
elsewhere by adding the variable (e.g. append "$where_clause" or
${where_clause}) after "FROM comparisons c" and before "ORDER BY", ensuring
proper spacing/quoting so the shell expands it into the SQL statement.

alex-solovyev merged commit d219f66 into main on Feb 9, 2026
19 of 25 checks passed