feat: add comparison scoring framework to compare-models-helper (t168.3)#783
Conversation
Add cmd_score() and cmd_results() for recording and viewing model comparison results. SQLite storage tracks per-model scores across 5 criteria (correctness, completeness, quality, clarity, adherence) plus latency and token usage. Aggregate rankings show win rates and average scores across comparisons. Fixes: t168.3 ref:GH#723
Summary of ChangesHello @alex-solovyev, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughA self-contained comparison scoring framework was introduced using SQLite persistence. The changes add database infrastructure with two tables (comparisons and comparison_scores), a cmd_score handler to record multi-model scoring sessions with aggregation, a cmd_results handler to query and filter past comparisons, and integration into the main command dispatcher. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable framework for scoring and comparing model performance. The implementation is well-structured with new score and results commands and SQLite for persistence. My review focuses on improving security and adherence to the repository's style guide. I've identified several critical SQL injection vulnerabilities where user input is not properly sanitized before being used in database queries. Addressing these is essential for security. I've also included suggestions to fix violations of the shell style guide regarding error suppression and variable declaration, and a recommendation to improve the precision of the overall score calculation. The suggested fixes align with repository-specific rules for shell scripting practices.
|
|
||
| # Insert comparison record | ||
| local comp_id | ||
| comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();") |
There was a problem hiding this comment.
This SQL query is vulnerable to SQL injection. The variables $task_type, $evaluator, and $winner are inserted directly into the query string without escaping. An attacker could provide malicious input to these parameters, compromising the database. To fix this, you should escape single quotes in these variables, similar to how you've handled the $task variable.
| comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();") | |
| comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$(echo "$task_type" | sed "s/'/''/g")', '$(echo "$evaluator" | sed "s/'/''/g")', '$(echo "$winner" | sed "s/'/''/g")'); SELECT last_insert_rowid();") |
References
- The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
- For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform
sedwrapper) instead of introducingsourcedependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.
| # Insert scores for each model | ||
| for entry in "${model_entries[@]}"; do | ||
| IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry" | ||
| sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');" |
There was a problem hiding this comment.
This query is also vulnerable to SQL injection. The variable $m_id is not escaped before being inserted into the SQL string. While other text variables in this query are escaped, this one was missed. All user-controllable string literals in an SQL query must be properly escaped to prevent security vulnerabilities.
| sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');" | |
| sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$(echo "$m_id" | sed "s/'/''/g")', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');" |
References
- The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
- For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform
sedwrapper) instead of introducingsourcedependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests.
| if [[ -n "$model_filter" ]]; then | ||
| where_clause="WHERE cs.model_id LIKE '%${model_filter}%'" | ||
| fi | ||
| if [[ -n "$type_filter" ]]; then | ||
| if [[ -n "$where_clause" ]]; then | ||
| where_clause="$where_clause AND c.task_type = '$type_filter'" | ||
| else | ||
| where_clause="WHERE c.task_type = '$type_filter'" | ||
| fi | ||
| fi |
There was a problem hiding this comment.
The construction of where_clause is vulnerable to SQL injection. The variables $model_filter and $type_filter are taken from user input and inserted directly into the SQL query without escaping. This could allow an attacker to manipulate the query. You should sanitize these variables before using them in the query.
| if [[ -n "$model_filter" ]]; then | |
| where_clause="WHERE cs.model_id LIKE '%${model_filter}%'" | |
| fi | |
| if [[ -n "$type_filter" ]]; then | |
| if [[ -n "$where_clause" ]]; then | |
| where_clause="$where_clause AND c.task_type = '$type_filter'" | |
| else | |
| where_clause="WHERE c.task_type = '$type_filter'" | |
| fi | |
| fi | |
| if [[ -n "$model_filter" ]]; then | |
| local sanitized_model_filter | |
| sanitized_model_filter=$(echo "$model_filter" | sed "s/'/''/g") | |
| where_clause="WHERE cs.model_id LIKE '%${sanitized_model_filter}%'" | |
| fi | |
| if [[ -n "$type_filter" ]]; then | |
| local sanitized_type_filter | |
| sanitized_type_filter=$(echo "$type_filter" | sed "s/'/''/g") | |
| if [[ -n "$where_clause" ]]; then | |
| where_clause="$where_clause AND c.task_type = '$sanitized_type_filter'" | |
| else | |
| where_clause="WHERE c.task_type = '$sanitized_type_filter'" | |
| fi | |
| fi |
References
- The style guide recommends using parameterized queries where possible to prevent SQL injection. When not possible, all user-supplied data must be properly escaped. (link)
- For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions (e.g., a cross-platform
sedwrapper) instead of introducingsourcedependencies. This maintains script independence and avoids risks like path resolution issues, which is particularly important in focused bugfix pull requests. - In shell scripts, capture a command's exit code in a variable instead of using
$?directly in conditionals. This aligns with ShellCheck SC2181 and improves clarity for multi-way branches.
|
|
||
| flush_model() { | ||
| if [[ -n "$current_model" ]]; then | ||
| local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 )) |
There was a problem hiding this comment.
This line has two issues:
- Style Guide Violation: It combines
localwith an assignment from a command substitution (arithmetic expansion). The repository style guide (line 11) requires separating declaration and assignment for exit code safety. - Precision Loss: It uses integer division, which truncates the average score (e.g., 8.8 becomes 8), leading to inaccurate
overallscores.
To fix both, you can separate the declaration and use bc for floating-point math. This would also require changing the overall column type to REAL in the init_results_db function.
| local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 )) | |
| local overall | |
| overall=$(echo "scale=2; ($current_correct + $current_complete + $current_quality + $current_clarity + $current_adherence) / 5" | bc) |
References
- The style guide requires declaring local variables and assigning them from command substitutions in separate steps to ensure exit code safety. (link)
- In shell scripts, capture a command's exit code in a variable instead of using
$?directly in conditionals. This aligns with ShellCheck SC2181 and improves clarity for multi-way branches.
| echo "" | ||
|
|
||
| local count | ||
| count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>/dev/null || echo "0") |
There was a problem hiding this comment.
This sqlite3 command suppresses errors using 2>/dev/null. This pattern is repeated on lines 1084, 1096, and 1116, and it violates the repository style guide (line 50), which disallows blanket error suppression. Please remove 2>/dev/null from these calls to allow database errors to be visible for debugging. The existing || guards and pipes will still handle command failures gracefully.
| count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" 2>/dev/null || echo "0") | |
| count=$(sqlite3 "$RESULTS_DB" "SELECT COUNT(DISTINCT c.id) FROM comparisons c LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id $where_clause;" || echo "0") |
References
- The style guide prohibits blanket error suppression with
2>/dev/null. Errors should be visible for debugging or redirected to a log file. (link)
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In @.agents/scripts/compare-models-helper.sh:
- Around line 937-945: The flush_model function should declare and assign the
local variable separately and end with an explicit return; change "local
overall=$((...))" to "local overall" followed by "overall=$(( (current_correct +
current_complete + current_quality + current_clarity + current_adherence) / 5
))", and break the combined post-if assignments (current_model="",
current_correct=0 ...) into separate statements for clarity, then append an
explicit "return 0" at the end of flush_model; keep references to model_entries
and current_* variables intact.
- Around line 1051-1061: The WHERE construction interpolates untrusted vars
(model_filter, type_filter) and limit directly into SQL, enabling injection; fix
by sanitizing/validating inputs before building where_clause and limit usage:
escape single quotes and percent/wildcard characters in model_filter (used in
LIKE) and escape single quotes in type_filter, or better use parameterized
queries if the DB client supports them, and validate limit is an integer (reject
or default if not); update the code paths that set where_clause (variables
model_filter, type_filter, where_clause) and wherever limit is used to perform
these checks/escapes before concatenation.
- Around line 947-966: Add validation for the numeric score flags
(current_correct, current_complete, current_quality, current_clarity,
current_adherence, current_latency, current_tokens) right after argument
parsing: implement a helper function (e.g., validate_score) that checks each
variable is an integer (or numeric as appropriate) and within expected ranges,
logs a clear error and exits on failure; call this helper before any arithmetic
or SQL insertion (used later around the arithmetic and DB insert logic) and
ensure flush_model/current_model parsing still occurs unaffected.
- Around line 999-1006: The sqlite3 INSERT that sets comp_id can fail and leave
comp_id empty causing invalid downstream inserts into comparison_scores; update
the block around comp_id and the for-loop to check the sqlite3 command exit
status and validate comp_id is a non-empty numeric value before proceeding:
after running the sqlite3 INSERT into comparisons capture both stdout and stderr
(or check $?), verify comp_id is not empty and consists only of digits, log a
clear error to stderr including $RESULTS_DB and the sqlite3 error output if it
failed, and exit or skip the subsequent loop if validation fails so
comparison_scores inserts only run when comp_id is valid; reference symbols:
comp_id, RESULTS_DB, model_entries, comparison_scores, and the sqlite3 INSERT
commands.
- Around line 882-920: The init_results_db function currently assumes sqlite3 is
present; add a pre-check using command availability (e.g., verify sqlite3 with
command -v or type) before attempting to create the DB so you can print a clear
actionable error and exit if missing; reference init_results_db, RESULTS_DB, and
sqlite3 when implementing the check and ensure the script emits a readable error
message (including that sqlite3 is required) and returns a non-zero status
instead of failing with a cryptic "command not found".
- Around line 1079-1101: The recent comparisons query is missing the filter
stored in where_clause so --model/--type aren't applied; modify the sqlite3
SELECT that reads from the comparisons table (the block that SELECTs c.id,
c.created_at, c.task_type, c.task_description, c.winner_model) to include the
same where_clause used elsewhere by adding the variable (e.g. append
"$where_clause" or ${where_clause}) after "FROM comparisons c" and before "ORDER
BY", ensuring proper spacing/quoting so the shell expands it into the SQL
statement.
- Line 964: The wildcard case in the argument parsing (the pattern '*) shift
;;') silently swallows unknown flags; replace it with a handler that prints a
clear error to stderr including the offending token (e.g., "Unknown option: $1")
and exits non-zero so typos fail fast; update the case block that contains '*)
shift ;;' to use a warning/error message directed to stderr (using >&2) and call
exit 1 (or, if intended to continue, at minimum print a warning and shift) so
unrecognized arguments are not silently discarded.
- Around line 998-1006: The SQL insertion uses unescaped, user-controlled
variables (comp_id creation and the comparison_scores INSERT inside the
model_entries loop); fix by consistently escaping all string interpolations (not
just task) before embedding into SQL—e.g., sanitize task_type, evaluator,
winner, and m_id and any string columns (m_str, m_wea, m_res) with the same sed
"s/'/''/g" pattern or, better, switch the INSERT logic in the comp_id and
comparison_scores blocks to a parameterized/ prepared statement approach; update
the code around comp_id assignment and the for-loop that reads model_entries to
use the sanitized variables (or bound parameters) when calling sqlite3 so
apostrophes and injection vectors are handled safely.
🧹 Nitpick comments (1)
.agents/scripts/compare-models-helper.sh (1)
880-880: Consider declaringRESULTS_DBasreadonlyfor consistency.All other global constants in this file (
MODEL_DATA,TIER_MAP,TASK_RECOMMENDATIONS,PROVIDER_ENV_KEYS) are declaredreadonly.RESULTS_DBis assigned once and never modified — making itreadonlywould be consistent and prevent accidental reassignment.-RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db" +readonly RESULTS_DB="${AIDEVOPS_WORKSPACE_DIR:-$HOME/.aidevops/.agent-workspace}/memory/model-comparisons.db"
| init_results_db() { | ||
| local db_dir | ||
| db_dir="$(dirname "$RESULTS_DB")" | ||
| mkdir -p "$db_dir" | ||
|
|
||
| sqlite3 "$RESULTS_DB" <<'SQL' | ||
| CREATE TABLE IF NOT EXISTS comparisons ( | ||
| id INTEGER PRIMARY KEY AUTOINCREMENT, | ||
| task_description TEXT NOT NULL, | ||
| task_type TEXT DEFAULT 'general', | ||
| created_at TEXT DEFAULT (datetime('now')), | ||
| evaluator_model TEXT, | ||
| winner_model TEXT | ||
| ); | ||
|
|
||
| CREATE TABLE IF NOT EXISTS comparison_scores ( | ||
| id INTEGER PRIMARY KEY AUTOINCREMENT, | ||
| comparison_id INTEGER NOT NULL, | ||
| model_id TEXT NOT NULL, | ||
| correctness INTEGER DEFAULT 0, | ||
| completeness INTEGER DEFAULT 0, | ||
| code_quality INTEGER DEFAULT 0, | ||
| clarity INTEGER DEFAULT 0, | ||
| adherence INTEGER DEFAULT 0, | ||
| overall INTEGER DEFAULT 0, | ||
| latency_ms INTEGER DEFAULT 0, | ||
| tokens_used INTEGER DEFAULT 0, | ||
| strengths TEXT DEFAULT '', | ||
| weaknesses TEXT DEFAULT '', | ||
| response_file TEXT DEFAULT '', | ||
| FOREIGN KEY (comparison_id) REFERENCES comparisons(id) | ||
| ); | ||
|
|
||
| CREATE INDEX IF NOT EXISTS idx_comparisons_task ON comparisons(task_type); | ||
| CREATE INDEX IF NOT EXISTS idx_comparisons_winner ON comparisons(winner_model); | ||
| CREATE INDEX IF NOT EXISTS idx_scores_model ON comparison_scores(model_id); | ||
| SQL | ||
| return 0 | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add a sqlite3 availability check for clear error feedback.
If sqlite3 isn't installed, the script will fail with a cryptic "command not found" error under set -e. Per the coding guidelines on reliability and clear logging, a pre-check with an actionable error message would be much friendlier.
🛡️ Proposed fix
init_results_db() {
+ if ! command -v sqlite3 &>/dev/null; then
+ print_error "sqlite3 is required for comparison scoring but not found in PATH"
+ return 1
+ fi
local db_dir
db_dir="$(dirname "$RESULTS_DB")"
mkdir -p "$db_dir"As per coding guidelines, "Automation scripts - focus on: Clear logging and feedback, Error recovery mechanisms".
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 882 - 920, The
init_results_db function currently assumes sqlite3 is present; add a pre-check
using command availability (e.g., verify sqlite3 with command -v or type) before
attempting to create the DB so you can print a clear actionable error and exit
if missing; reference init_results_db, RESULTS_DB, and sqlite3 when implementing
the check and ensure the script emits a readable error message (including that
sqlite3 is required) and returns a non-zero status instead of failing with a
cryptic "command not found".
| flush_model() { | ||
| if [[ -n "$current_model" ]]; then | ||
| local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 )) | ||
| model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}") | ||
| fi | ||
| current_model="" current_correct=0 current_complete=0 current_quality=0 | ||
| current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0 | ||
| current_strengths="" current_weaknesses="" current_response="" | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
flush_model missing explicit return and has SC2155.
Per coding guidelines, all functions should include explicit returns. Also, line 939 triggers ShellCheck SC2155 (declare and assign separately to avoid masking return values).
♻️ Proposed fix
flush_model() {
if [[ -n "$current_model" ]]; then
- local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
+ local overall
+ overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 ))
model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}")
fi
current_model="" current_correct=0 current_complete=0 current_quality=0
current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0
current_strengths="" current_weaknesses="" current_response=""
+ return 0
}As per coding guidelines, "Run ShellCheck with zero violations on all scripts" and "Include explicit returns in shell scripts".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| flush_model() { | |
| if [[ -n "$current_model" ]]; then | |
| local overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 )) | |
| model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}") | |
| fi | |
| current_model="" current_correct=0 current_complete=0 current_quality=0 | |
| current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0 | |
| current_strengths="" current_weaknesses="" current_response="" | |
| } | |
| flush_model() { | |
| if [[ -n "$current_model" ]]; then | |
| local overall | |
| overall=$(( (current_correct + current_complete + current_quality + current_clarity + current_adherence) / 5 )) | |
| model_entries+=("${current_model}|${current_correct}|${current_complete}|${current_quality}|${current_clarity}|${current_adherence}|${overall}|${current_latency}|${current_tokens}|${current_strengths}|${current_weaknesses}|${current_response}") | |
| fi | |
| current_model="" current_correct=0 current_complete=0 current_quality=0 | |
| current_clarity=0 current_adherence=0 current_latency=0 current_tokens=0 | |
| current_strengths="" current_weaknesses="" current_response="" | |
| return 0 | |
| } |
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 937 - 945, The
flush_model function should declare and assign the local variable separately and
end with an explicit return; change "local overall=$((...))" to "local overall"
followed by "overall=$(( (current_correct + current_complete + current_quality +
current_clarity + current_adherence) / 5 ))", and break the combined post-if
assignments (current_model="", current_correct=0 ...) into separate statements
for clarity, then append an explicit "return 0" at the end of flush_model; keep
references to model_entries and current_* variables intact.
| while [[ $# -gt 0 ]]; do | ||
| case "$1" in | ||
| --task) task="$2"; shift 2 ;; | ||
| --type) task_type="$2"; shift 2 ;; | ||
| --evaluator) evaluator="$2"; shift 2 ;; | ||
| --winner) winner="$2"; shift 2 ;; | ||
| --model) flush_model; current_model="$2"; shift 2 ;; | ||
| --correctness) current_correct="$2"; shift 2 ;; | ||
| --completeness) current_complete="$2"; shift 2 ;; | ||
| --quality) current_quality="$2"; shift 2 ;; | ||
| --clarity) current_clarity="$2"; shift 2 ;; | ||
| --adherence) current_adherence="$2"; shift 2 ;; | ||
| --latency) current_latency="$2"; shift 2 ;; | ||
| --tokens) current_tokens="$2"; shift 2 ;; | ||
| --strengths) current_strengths="$2"; shift 2 ;; | ||
| --weaknesses) current_weaknesses="$2"; shift 2 ;; | ||
| --response) current_response="$2"; shift 2 ;; | ||
| *) shift ;; | ||
| esac | ||
| done |
There was a problem hiding this comment.
Validate numeric inputs to prevent arithmetic failures and corrupt data.
Score values (--correctness, --completeness, etc.) are accepted raw and used directly in integer arithmetic (line 939) and SQL inserts. Non-numeric or out-of-range input will cause a silent arithmetic error or insert garbage into the database. A simple validation helper would catch this early with a clear message.
🛡️ Proposed validation helper
+ # Validate integer in range
+ validate_score() {
+ local name="$1" val="$2"
+ if ! [[ "$val" =~ ^[0-9]+$ ]] || [[ "$val" -lt 1 || "$val" -gt 10 ]]; then
+ print_error "--$name must be an integer between 1 and 10 (got: $val)"
+ return 1
+ fi
+ return 0
+ }
+
while [[ $# -gt 0 ]]; do
case "$1" in
--task) task="$2"; shift 2 ;;
--type) task_type="$2"; shift 2 ;;
--evaluator) evaluator="$2"; shift 2 ;;
--winner) winner="$2"; shift 2 ;;
--model) flush_model; current_model="$2"; shift 2 ;;
- --correctness) current_correct="$2"; shift 2 ;;
- --completeness) current_complete="$2"; shift 2 ;;
- --quality) current_quality="$2"; shift 2 ;;
- --clarity) current_clarity="$2"; shift 2 ;;
- --adherence) current_adherence="$2"; shift 2 ;;
+ --correctness) validate_score correctness "$2" || return 1; current_correct="$2"; shift 2 ;;
+ --completeness) validate_score completeness "$2" || return 1; current_complete="$2"; shift 2 ;;
+ --quality) validate_score quality "$2" || return 1; current_quality="$2"; shift 2 ;;
+ --clarity) validate_score clarity "$2" || return 1; current_clarity="$2"; shift 2 ;;
+ --adherence) validate_score adherence "$2" || return 1; current_adherence="$2"; shift 2 ;;As per coding guidelines, "Automation scripts - focus on: Reliability and robustness, Clear logging and feedback".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| while [[ $# -gt 0 ]]; do | |
| case "$1" in | |
| --task) task="$2"; shift 2 ;; | |
| --type) task_type="$2"; shift 2 ;; | |
| --evaluator) evaluator="$2"; shift 2 ;; | |
| --winner) winner="$2"; shift 2 ;; | |
| --model) flush_model; current_model="$2"; shift 2 ;; | |
| --correctness) current_correct="$2"; shift 2 ;; | |
| --completeness) current_complete="$2"; shift 2 ;; | |
| --quality) current_quality="$2"; shift 2 ;; | |
| --clarity) current_clarity="$2"; shift 2 ;; | |
| --adherence) current_adherence="$2"; shift 2 ;; | |
| --latency) current_latency="$2"; shift 2 ;; | |
| --tokens) current_tokens="$2"; shift 2 ;; | |
| --strengths) current_strengths="$2"; shift 2 ;; | |
| --weaknesses) current_weaknesses="$2"; shift 2 ;; | |
| --response) current_response="$2"; shift 2 ;; | |
| *) shift ;; | |
| esac | |
| done | |
| # Validate integer in range | |
| validate_score() { | |
| local name="$1" val="$2" | |
| if ! [[ "$val" =~ ^[0-9]+$ ]] || [[ "$val" -lt 1 || "$val" -gt 10 ]]; then | |
| print_error "--$name must be an integer between 1 and 10 (got: $val)" | |
| return 1 | |
| fi | |
| return 0 | |
| } | |
| while [[ $# -gt 0 ]]; do | |
| case "$1" in | |
| --task) task="$2"; shift 2 ;; | |
| --type) task_type="$2"; shift 2 ;; | |
| --evaluator) evaluator="$2"; shift 2 ;; | |
| --winner) winner="$2"; shift 2 ;; | |
| --model) flush_model; current_model="$2"; shift 2 ;; | |
| --correctness) validate_score correctness "$2" || return 1; current_correct="$2"; shift 2 ;; | |
| --completeness) validate_score completeness "$2" || return 1; current_complete="$2"; shift 2 ;; | |
| --quality) validate_score quality "$2" || return 1; current_quality="$2"; shift 2 ;; | |
| --clarity) validate_score clarity "$2" || return 1; current_clarity="$2"; shift 2 ;; | |
| --adherence) validate_score adherence "$2" || return 1; current_adherence="$2"; shift 2 ;; | |
| --latency) current_latency="$2"; shift 2 ;; | |
| --tokens) current_tokens="$2"; shift 2 ;; | |
| --strengths) current_strengths="$2"; shift 2 ;; | |
| --weaknesses) current_weaknesses="$2"; shift 2 ;; | |
| --response) current_response="$2"; shift 2 ;; | |
| *) shift ;; | |
| esac | |
| done |
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 947 - 966, Add
validation for the numeric score flags (current_correct, current_complete,
current_quality, current_clarity, current_adherence, current_latency,
current_tokens) right after argument parsing: implement a helper function (e.g.,
validate_score) that checks each variable is an integer (or numeric as
appropriate) and within expected ranges, logs a clear error and exits on
failure; call this helper before any arithmetic or SQL insertion (used later
around the arithmetic and DB insert logic) and ensure flush_model/current_model
parsing still occurs unaffected.
| --strengths) current_strengths="$2"; shift 2 ;; | ||
| --weaknesses) current_weaknesses="$2"; shift 2 ;; | ||
| --response) current_response="$2"; shift 2 ;; | ||
| *) shift ;; |
There was a problem hiding this comment.
Silent discard of unrecognized flags can mask typos.
*) shift ;; silently ignores unknown arguments. A typo like --corectness 9 would be eaten without warning, resulting in a default score of 0 for that criterion — a data-integrity trap that's hard to debug.
🛡️ Proposed fix — warn on unknown flags
- *) shift ;;
+ *) print_warning "Unknown option ignored: $1"; shift ;;As per coding guidelines, "Automation scripts - focus on: Clear logging and feedback".
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh at line 964, The wildcard case in
the argument parsing (the pattern '*) shift ;;') silently swallows unknown
flags; replace it with a handler that prints a clear error to stderr including
the offending token (e.g., "Unknown option: $1") and exits non-zero so typos
fail fast; update the case block that contains '*) shift ;;' to use a
warning/error message directed to stderr (using >&2) and call exit 1 (or, if
intended to continue, at minimum print a warning and shift) so unrecognized
arguments are not silently discarded.
| # Insert comparison record | ||
| local comp_id | ||
| comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();") | ||
|
|
||
| # Insert scores for each model | ||
| for entry in "${model_entries[@]}"; do | ||
| IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry" | ||
| sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');" | ||
| done |
There was a problem hiding this comment.
SQL injection — user-controlled values interpolated without escaping.
On line 1000, only $task is escaped via sed "s/'/''/g", but $task_type, $evaluator, and $winner are interpolated raw into the SQL string. On line 1005, $m_id is also unescaped. A malicious or simply apostrophe-containing value (e.g., --winner "it's a tie") breaks the query or allows injection.
Apply the same sed "s/'/''/g" escaping consistently to all string-interpolated values, or better yet, use a parameterized approach.
🔒 Proposed fix — consistent escaping
# Insert comparison record
local comp_id
- comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();")
+ local esc_task esc_type esc_eval esc_winner
+ esc_task="${task//\'/\'\'}"
+ esc_type="${task_type//\'/\'\'}"
+ esc_eval="${evaluator//\'/\'\'}"
+ esc_winner="${winner//\'/\'\'}"
+ comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$esc_task', '$esc_type', '$esc_eval', '$esc_winner'); SELECT last_insert_rowid();")
# Insert scores for each model
for entry in "${model_entries[@]}"; do
IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry"
- sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');"
+ local esc_mid="${m_id//\'/\'\'}"
+ local esc_str="${m_str//\'/\'\'}"
+ local esc_wea="${m_wea//\'/\'\'}"
+ local esc_res="${m_res//\'/\'\'}"
+ sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$esc_mid', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$esc_str', '$esc_wea', '$esc_res');"
done🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 998 - 1006, The SQL
insertion uses unescaped, user-controlled variables (comp_id creation and the
comparison_scores INSERT inside the model_entries loop); fix by consistently
escaping all string interpolations (not just task) before embedding into
SQL—e.g., sanitize task_type, evaluator, winner, and m_id and any string columns
(m_str, m_wea, m_res) with the same sed "s/'/''/g" pattern or, better, switch
the INSERT logic in the comp_id and comparison_scores blocks to a parameterized/
prepared statement approach; update the code around comp_id assignment and the
for-loop that reads model_entries to use the sanitized variables (or bound
parameters) when calling sqlite3 so apostrophes and injection vectors are
handled safely.
| local comp_id | ||
| comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons (task_description, task_type, evaluator_model, winner_model) VALUES ('$(echo "$task" | sed "s/'/''/g")', '$task_type', '$evaluator', '$winner'); SELECT last_insert_rowid();") | ||
|
|
||
| # Insert scores for each model | ||
| for entry in "${model_entries[@]}"; do | ||
| IFS='|' read -r m_id m_cor m_com m_qua m_cla m_adh m_ove m_lat m_tok m_str m_wea m_res <<< "$entry" | ||
| sqlite3 "$RESULTS_DB" "INSERT INTO comparison_scores (comparison_id, model_id, correctness, completeness, code_quality, clarity, adherence, overall, latency_ms, tokens_used, strengths, weaknesses, response_file) VALUES ($comp_id, '$m_id', $m_cor, $m_com, $m_qua, $m_cla, $m_adh, $m_ove, $m_lat, $m_tok, '$(echo "$m_str" | sed "s/'/''/g")', '$(echo "$m_wea" | sed "s/'/''/g")', '$(echo "$m_res" | sed "s/'/''/g")');" | ||
| done |
There was a problem hiding this comment.
No error handling on SQLite insert operations.
If the sqlite3 call on line 1000 fails (permissions, disk full, corrupted DB), comp_id will be empty and the subsequent inserts on line 1005 will either fail or insert with an invalid comparison_id. Under set -e the script exits abruptly with no user-facing diagnostic.
🛡️ Proposed fix
local comp_id
- comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons ...")
+ comp_id=$(sqlite3 "$RESULTS_DB" "INSERT INTO comparisons ...") || {
+ print_error "Failed to write comparison to database: $RESULTS_DB"
+ return 1
+ }
+ if [[ -z "$comp_id" ]]; then
+ print_error "Failed to obtain comparison ID from database"
+ return 1
+ fiAs per coding guidelines, "Automation scripts - focus on: Error recovery mechanisms, Clear logging and feedback".
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 999 - 1006, The
sqlite3 INSERT that sets comp_id can fail and leave comp_id empty causing
invalid downstream inserts into comparison_scores; update the block around
comp_id and the for-loop to check the sqlite3 command exit status and validate
comp_id is a non-empty numeric value before proceeding: after running the
sqlite3 INSERT into comparisons capture both stdout and stderr (or check $?),
verify comp_id is not empty and consists only of digits, log a clear error to
stderr including $RESULTS_DB and the sqlite3 error output if it failed, and exit
or skip the subsequent loop if validation fails so comparison_scores inserts
only run when comp_id is valid; reference symbols: comp_id, RESULTS_DB,
model_entries, comparison_scores, and the sqlite3 INSERT commands.
| local where_clause="" | ||
| if [[ -n "$model_filter" ]]; then | ||
| where_clause="WHERE cs.model_id LIKE '%${model_filter}%'" | ||
| fi | ||
| if [[ -n "$type_filter" ]]; then | ||
| if [[ -n "$where_clause" ]]; then | ||
| where_clause="$where_clause AND c.task_type = '$type_filter'" | ||
| else | ||
| where_clause="WHERE c.task_type = '$type_filter'" | ||
| fi | ||
| fi |
There was a problem hiding this comment.
SQL injection in filter construction — model_filter, type_filter, and limit interpolated raw.
model_filter and type_filter are placed directly into the SQL WHERE clause without escaping. limit on line 1083 is also unvalidated. All three accept arbitrary user input via --model, --type, and --limit.
🔒 Proposed fix
+ # Validate limit is a positive integer
+ if ! [[ "$limit" =~ ^[0-9]+$ ]]; then
+ print_error "--limit must be a positive integer"
+ return 1
+ fi
+
local where_clause=""
if [[ -n "$model_filter" ]]; then
- where_clause="WHERE cs.model_id LIKE '%${model_filter}%'"
+ local esc_model="${model_filter//\'/\'\'}"
+ where_clause="WHERE cs.model_id LIKE '%${esc_model}%'"
fi
if [[ -n "$type_filter" ]]; then
+ local esc_type="${type_filter//\'/\'\'}"
if [[ -n "$where_clause" ]]; then
- where_clause="$where_clause AND c.task_type = '$type_filter'"
+ where_clause="$where_clause AND c.task_type = '$esc_type'"
else
- where_clause="WHERE c.task_type = '$type_filter'"
+ where_clause="WHERE c.task_type = '$esc_type'"
fi
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| local where_clause="" | |
| if [[ -n "$model_filter" ]]; then | |
| where_clause="WHERE cs.model_id LIKE '%${model_filter}%'" | |
| fi | |
| if [[ -n "$type_filter" ]]; then | |
| if [[ -n "$where_clause" ]]; then | |
| where_clause="$where_clause AND c.task_type = '$type_filter'" | |
| else | |
| where_clause="WHERE c.task_type = '$type_filter'" | |
| fi | |
| fi | |
| # Validate limit is a positive integer | |
| if ! [[ "$limit" =~ ^[0-9]+$ ]]; then | |
| print_error "--limit must be a positive integer" | |
| return 1 | |
| fi | |
| local where_clause="" | |
| if [[ -n "$model_filter" ]]; then | |
| local esc_model="${model_filter//\'/\'\'}" | |
| where_clause="WHERE cs.model_id LIKE '%${esc_model}%'" | |
| fi | |
| if [[ -n "$type_filter" ]]; then | |
| local esc_type="${type_filter//\'/\'\'}" | |
| if [[ -n "$where_clause" ]]; then | |
| where_clause="$where_clause AND c.task_type = '$esc_type'" | |
| else | |
| where_clause="WHERE c.task_type = '$esc_type'" | |
| fi | |
| fi |
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 1051 - 1061, The WHERE
construction interpolates untrusted vars (model_filter, type_filter) and limit
directly into SQL, enabling injection; fix by sanitizing/validating inputs
before building where_clause and limit usage: escape single quotes and
percent/wildcard characters in model_filter (used in LIKE) and escape single
quotes in type_filter, or better use parameterized queries if the DB client
supports them, and validate limit is an integer (reject or default if not);
update the code paths that set where_clause (variables model_filter,
type_filter, where_clause) and wherever limit is used to perform these
checks/escapes before concatenation.
| sqlite3 -separator '|' "$RESULTS_DB" " | ||
| SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model | ||
| FROM comparisons c | ||
| ORDER BY c.created_at DESC | ||
| LIMIT $limit; | ||
| " 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do | ||
| echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)" | ||
| if [[ -n "$cwinner" ]]; then | ||
| echo " Winner: $cwinner" | ||
| fi | ||
|
|
||
| # Show scores for this comparison | ||
| sqlite3 -separator '|' "$RESULTS_DB" " | ||
| SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence | ||
| FROM comparison_scores | ||
| WHERE comparison_id = $cid | ||
| ORDER BY overall DESC; | ||
| " 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do | ||
| printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \ | ||
| "$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca" | ||
| done | ||
| echo "" | ||
| done |
There was a problem hiding this comment.
Bug: --model / --type filters not applied to the recent comparisons listing.
The where_clause is applied to the count query (line 1069) and the aggregate rankings query (line 1106), but the recent comparisons listing here queries comparisons without any filter. Running results --model sonnet will show all comparisons in this section while only showing filtered aggregates below — inconsistent and confusing.
🐛 Proposed fix — apply filter to recent comparisons query
# Show recent comparisons
sqlite3 -separator '|' "$RESULTS_DB" "
- SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
- FROM comparisons c
- ORDER BY c.created_at DESC
+ SELECT DISTINCT c.id, c.created_at, c.task_type, c.task_description, c.winner_model
+ FROM comparisons c
+ LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id
+ $where_clause
+ ORDER BY c.created_at DESC
LIMIT $limit;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sqlite3 -separator '|' "$RESULTS_DB" " | |
| SELECT c.id, c.created_at, c.task_type, c.task_description, c.winner_model | |
| FROM comparisons c | |
| ORDER BY c.created_at DESC | |
| LIMIT $limit; | |
| " 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do | |
| echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)" | |
| if [[ -n "$cwinner" ]]; then | |
| echo " Winner: $cwinner" | |
| fi | |
| # Show scores for this comparison | |
| sqlite3 -separator '|' "$RESULTS_DB" " | |
| SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence | |
| FROM comparison_scores | |
| WHERE comparison_id = $cid | |
| ORDER BY overall DESC; | |
| " 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do | |
| printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \ | |
| "$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca" | |
| done | |
| echo "" | |
| done | |
| sqlite3 -separator '|' "$RESULTS_DB" " | |
| SELECT DISTINCT c.id, c.created_at, c.task_type, c.task_description, c.winner_model | |
| FROM comparisons c | |
| LEFT JOIN comparison_scores cs ON c.id = cs.comparison_id | |
| $where_clause | |
| ORDER BY c.created_at DESC | |
| LIMIT $limit; | |
| " 2>/dev/null | while IFS='|' read -r cid cdate ctype cdesc cwinner; do | |
| echo " #$cid [$ctype] $(echo "$cdesc" | head -c 60) ($cdate)" | |
| if [[ -n "$cwinner" ]]; then | |
| echo " Winner: $cwinner" | |
| fi | |
| # Show scores for this comparison | |
| sqlite3 -separator '|' "$RESULTS_DB" " | |
| SELECT model_id, overall, correctness, completeness, code_quality, clarity, adherence | |
| FROM comparison_scores | |
| WHERE comparison_id = $cid | |
| ORDER BY overall DESC; | |
| " 2>/dev/null | while IFS='|' read -r mid ov co cm cq cl ca; do | |
| printf " %-20s overall:%d (corr:%d comp:%d qual:%d clar:%d adhr:%d)\n" \ | |
| "$mid" "$ov" "$co" "$cm" "$cq" "$cl" "$ca" | |
| done | |
| echo "" | |
| done |
🤖 Prompt for AI Agents
In @.agents/scripts/compare-models-helper.sh around lines 1079 - 1101, The
recent comparisons query is missing the filter stored in where_clause so
--model/--type aren't applied; modify the sqlite3 SELECT that reads from the
comparisons table (the block that SELECTs c.id, c.created_at, c.task_type,
c.task_description, c.winner_model) to include the same where_clause used
elsewhere by adding the variable (e.g. append "$where_clause" or
${where_clause}) after "FROM comparisons c" and before "ORDER BY", ensuring
proper spacing/quoting so the shell expands it into the SQL statement.
Summary
scorecommand to record model comparison results with 5 scoring criteriaresultscommand to view past comparisons and aggregate model rankings~/.aidevops/.agent-workspace/memory/model-comparisons.dbScoring Criteria (1-10 scale)
Usage
Testing
bash -nsyntax check passesFixes: t168.3 ref:GH#723
Summary by CodeRabbit