feat: response comparison and scoring framework for model evaluation (t168.3) #773
Conversation
Walkthrough

Adds a response scoring framework: documentation, a SQLite-backed Bash CLI helper script, and an end-to-end test suite enabling prompt management, response recording, multi-criterion scoring, comparisons, leaderboards, and data export.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant CLI as Response Scoring CLI
    participant DB as SQLite Database
    participant Scorer as Scoring Logic
    participant Renderer as Output Renderer
    User->>CLI: init
    CLI->>DB: create schema (prompts, responses, scores, comparisons)
    DB-->>CLI: ready
    User->>CLI: prompt add "<text>"
    CLI->>DB: insert prompt
    DB-->>CLI: prompt_id
    User->>CLI: record <prompt_id> <model> <response_file>
    CLI->>DB: insert response + metadata
    DB-->>CLI: response_id
    User->>CLI: score <response_id> correctness=5 completeness=4 code_quality=5 clarity=5
    CLI->>Scorer: validate & compute weighted total (30/25/25/20)
    Scorer->>DB: persist per-criterion and aggregate score
    DB-->>CLI: score stored
    User->>CLI: compare <prompt_id>
    CLI->>DB: fetch responses + scores
    DB-->>CLI: result set
    CLI->>Renderer: format table or JSON
    Renderer-->>User: comparison output
    User->>CLI: leaderboard [--json]
    CLI->>DB: aggregate by model
    DB-->>CLI: rankings
    CLI->>Renderer: render leaderboard
    Renderer-->>User: ranked results
```
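As a sketch of the weighted total referenced in the score step (weights taken from this PR's criteria; the normalized denominator mirrors the NULLIF pattern used in the compare query reviewed below), a fully scored response works out as:

```
weighted_avg = (correctness*0.30 + completeness*0.25 + code_quality*0.25 + clarity*0.20)
             / (sum of weights for the criteria actually scored)

e.g. correctness=5, completeness=4, code_quality=5, clarity=5:
     (1.50 + 1.00 + 1.25 + 1.00) / 1.00 = 4.75
```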
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status
[INFO] Recent monitoring activity
📈 Current Quality Metrics

Generated on: Mon Feb 9 15:15:25 UTC 2026 by AI DevOps Framework Code Review Monitoring
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new response scoring framework to evaluate AI model responses based on correctness, completeness, code quality, and clarity. It provides tools for managing prompts, recording responses, scoring, comparing, and generating leaderboards, enabling informed model routing and selection.
Code Review
This pull request introduces a comprehensive framework for scoring and comparing AI model responses, which is a great addition to the model evaluation toolchain. It includes a new helper script, subagent definition, documentation, and a thorough test suite. The code is well-structured and follows most of the project's conventions.
My review focuses on security, maintainability, and adherence to the repository's style guide. I've identified a critical SQL injection risk and suggest a mitigation. I've also pointed out several places where hardcoded values could be centralized, plus a few violations of the shell scripting style guide around error handling and linting; the suggested changes keep the script self-contained, in line with the repository's rules on script independence.
```bash
escaped_title=$(echo "$title" | sed "s/'/''/g")
escaped_text=$(echo "$text" | sed "s/'/''/g")
```
The script constructs SQL queries by concatenating user-provided strings, which is a significant security risk leading to SQL Injection vulnerabilities. For example, title and text are directly embedded into the SQL string. A malicious input could break out of the string literal and execute arbitrary SQL commands. The repository style guide requires using parameterized queries where possible.
While the sqlite3 CLI makes true parameterization difficult, using printf '%s' is safer than echo for preparing the values, since printf doesn't interpret backslashes or option-like strings such as -n.
```diff
-escaped_title=$(echo "$title" | sed "s/'/''/g")
-escaped_text=$(echo "$text" | sed "s/'/''/g")
+escaped_title=$(printf '%s' "$title" | sed "s/'/''/g")
+escaped_text=$(printf '%s' "$text" | sed "s/'/''/g")
```
```bash
_get_criterion_weight() {
  local criterion="$1"
  case "$criterion" in
    correctness) echo "0.30" ;;
    completeness) echo "0.25" ;;
    code_quality) echo "0.25" ;;
    clarity) echo "0.20" ;;
    *) echo "0.25" ;;
  esac
  return 0
}
```
The weights for scoring criteria are hardcoded in this function, and also in SQL queries within cmd_compare, cmd_leaderboard, and cmd_export. This violates the DRY (Don't Repeat Yourself) principle and makes the script hard to maintain, as the weights are already defined in the SCORING_CRITERIA constant but are not used here. Additionally, the default case *) echo "0.25" ;; can hide bugs by assigning an arbitrary weight to unknown criteria.
This function should parse the weight from the SCORING_CRITERIA constant. For unknown criteria, it should return an error. This will centralize the logic and make the system more robust. You should then consider refactoring the SQL queries to use this centralized logic instead of hardcoded weights.
```diff
 _get_criterion_weight() {
   local criterion="$1"
-  case "$criterion" in
-    correctness) echo "0.30" ;;
-    completeness) echo "0.25" ;;
-    code_quality) echo "0.25" ;;
-    clarity) echo "0.20" ;;
-    *) echo "0.25" ;;
-  esac
-  return 0
+  local weight
+  # Parse weight from the central SCORING_CRITERIA constant
+  weight=$(echo "$SCORING_CRITERIA" | grep "^${criterion}|" | cut -d'|' -f2)
+  if [[ -z "$weight" ]]; then
+    print_error "Unknown criterion: '${criterion}'. Please check SCORING_CRITERIA."
+    return 1
+  fi
+  echo "$weight"
+  return 0
 }
```
```diff
@@ -0,0 +1,933 @@
+#!/usr/bin/env bash
+# shellcheck disable=SC1091,SC2001
```
The script uses a global shellcheck disable for SC2001. The repository style guide requires using "targeted inline disables with reason comments only". Additionally, SC2001 can often be avoided by using Bash's native parameter expansion, which is more efficient than sed.
For SC2001 violations (e.g., echo "$var" | sed "s/'/''/g"), you can replace them with "${var//\'/\'\'}", which is both more performant and compliant with modern shell scripting practices.
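A minimal sketch of that substitution, wrapped in a hypothetical sql_escape helper (the function name is illustrative, not part of this PR):

```bash
# Double each single quote for SQLite string literals using native
# parameter expansion instead of an echo | sed pipeline (avoids SC2001).
sql_escape() {
  local s=$1 q="'"
  printf '%s' "${s//$q/$q$q}"
}

escaped_title=$(sql_escape "$title")
```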
References
- The style guide requires ShellCheck violations to be disabled inline with a reason, not globally at the top of the file.
| mkdir -p "$SCORING_DIR" 2>/dev/null || true | ||
|
|
||
| log_stderr "db init" sqlite3 "$SCORING_DB" " | ||
| CREATE TABLE IF NOT EXISTS prompts ( |
The repository style guide specifies that "All SQLite databases use WAL mode + busy_timeout=5000". The init_db function is missing these PRAGMA settings. This can affect database concurrency and performance under load.
```diff
+  PRAGMA journal_mode=WAL;
+  PRAGMA busy_timeout=5000;
   CREATE TABLE IF NOT EXISTS prompts (
```
References
- The style guide requires all SQLite databases to be configured with WAL mode and a busy timeout.
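One way to satisfy this on every connection is a small wrapper; a sketch, assuming all DB access funnels through one hypothetical run_sql function. Note that both PRAGMA journal_mode=WAL and PRAGMA busy_timeout echo a value to stdout in the sqlite3 shell, which would pollute captured query output, so the sketch sets WAL once at init and uses the silent .timeout dot-command per connection:

```bash
# Hypothetical helpers, not part of this PR.
# WAL mode persists in the database file, so set it once at init,
# discarding the "wal" confirmation that sqlite3 prints:
sqlite3 "$SCORING_DB" "PRAGMA journal_mode=WAL;" >/dev/null

# busy_timeout is per-connection; .timeout sets it without printing.
run_sql() {
  local db="$1" sql="$2"
  sqlite3 -cmd '.timeout 5000' "$db" "$sql"
}
```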
```bash
responses=$(sqlite3 -separator '|' "$SCORING_DB" "
  SELECT r.response_id, r.model_id, r.response_time, r.token_count, r.cost_estimate,
    COALESCE((
      SELECT ROUND(
        SUM(CASE s.criterion
              WHEN 'correctness' THEN s.score * 0.30
              WHEN 'completeness' THEN s.score * 0.25
              WHEN 'code_quality' THEN s.score * 0.25
              WHEN 'clarity' THEN s.score * 0.20
              ELSE s.score * 0.25
            END)
        / NULLIF(SUM(CASE s.criterion
              WHEN 'correctness' THEN 0.30
              WHEN 'completeness' THEN 0.25
              WHEN 'code_quality' THEN 0.25
              WHEN 'clarity' THEN 0.20
              ELSE 0.25
            END), 0)
      , 2)
      FROM scores s WHERE s.response_id = r.response_id
    ), 0) as weighted_avg,
    (SELECT GROUP_CONCAT(s2.criterion || ':' || s2.score, ',')
     FROM scores s2 WHERE s2.response_id = r.response_id) as score_detail
  FROM responses r
  WHERE r.prompt_id = ${prompt_id}
  ORDER BY weighted_avg DESC;
" 2>/dev/null)
```
This sqlite3 call suppresses all stderr output with 2>/dev/null. This violates the repository style guide, which states that 2>/dev/null should not be used for blanket suppression. If there is a syntax error in the SQL or another database issue, it will be silently ignored, making debugging difficult. The log_stderr helper, if simple and self-contained, should be duplicated within this standalone script to avoid source dependencies, as per repository guidelines for script independence.
Please wrap the sqlite3 command with a duplicated log_stderr function (if simple) and remove the 2>/dev/null redirection to ensure errors are properly logged. For example:

```bash
responses=$(log_stderr "compare query" sqlite3 ...)
```

The same applies to the other similar sqlite3 calls in cmd_leaderboard, cmd_export, and cmd_history.
References

- The style guide prohibits blanket error suppression with 2>/dev/null, recommending logging instead.
- For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues.
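The log_stderr implementation itself isn't shown in this diff; a minimal self-contained version, duplicated per the script-independence rule, might look like the following sketch (an assumed shape, not the PR's actual helper):

```bash
# Assumed minimal log_stderr: run a command, pass stdout through untouched,
# label any stderr instead of discarding it, and preserve the exit status.
log_stderr() {
  local label="$1"; shift
  local tmp rc
  tmp=$(mktemp)
  "$@" 2>"$tmp"
  rc=$?
  if [[ -s "$tmp" ]]; then
    sed "s/^/[${label}] /" "$tmp" >&2
  fi
  rm -f "$tmp"
  return "$rc"
}
```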
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.agents/scripts/response-scoring-helper.sh:
- Around line 155-161: The SQL is vulnerable because category and difficulty are
interpolated directly into the sqlite3 INSERT; add the same single-quote
escaping used for title/text: declare local escaped_category and
escaped_difficulty, set escaped_category=$(echo "$category" | sed "s/'/''/g")
and escaped_difficulty=$(echo "$difficulty" | sed "s/'/''/g"), and then use
'${escaped_category}' and '${escaped_difficulty}' in the log_stderr / sqlite3
INSERT command (the prompt_id assignment) instead of the raw variables.
- Around line 607-611: The WHERE clause builds SQL by interpolating the category
variable into where_clause, which allows SQL injection; fix by properly escaping
or validating category before interpolation — e.g., sanitize category (allow
only expected values/patterns) or escape single quotes (replace ' with '' ) and
then assign where_clause="WHERE p.category = '${escaped_category}'";
alternatively use a parameterized or sqlite3-native variable-substitution approach instead of
direct interpolation; update the code that sets where_clause and any caller that
passes category in .agents/scripts/response-scoring-helper.sh to use the
sanitized/escaped variable.
- Around line 259-281: Validate numeric inputs before embedding them into SQL:
ensure prompt_id and token_count match an integer regex (e.g., ^[0-9]+$) and
ensure response_time and cost match a numeric/float regex (e.g.,
^[0-9]+(\.[0-9]+)?$); reject or default and error out if validation fails, then
use the validated variables when constructing the sqlite3 INSERT/select
statements. Also apply the same validation to the prompt_exists check (don't
interpolate an unvalidated prompt_id into the SELECT) and propagate this pattern
to other commands named in the script (cmd_record, cmd_score, cmd_compare,
cmd_leaderboard, cmd_export, cmd_history and the prompt show path) so all
numeric parameters are validated before SQL interpolation.
- Around line 716-737: The CSV export query currently uses COALESCE(..., 0)
which treats missing criteria as zero and biases the weighted average; update
the SELECT that builds the weighted_avg (inside the format == "csv" branch) to
compute a proper weighted average like cmd_compare and cmd_leaderboard: build a
numerator by summing weighted MAX(CASE WHEN s.criterion = 'correctness' THEN
s.score END) * 0.30 + ... (for completeness 0.25, code_quality 0.25, clarity
0.20) and build a denominator as NULLIF( (CASE WHEN MAX(CASE WHEN
s.criterion='correctness' THEN s.score END) IS NOT NULL THEN 0.30 ELSE 0 END) +
... , 0); then use ROUND(numerator / denominator, 2) so missing criteria are
excluded from the denominator rather than treated as zero; change the expression
that currently uses COALESCE(...) and weights to this numerator/denominator
pattern referencing the same MAX(CASE ...) expressions and weights.
🧹 Nitpick comments (5)
.agents/scripts/response-scoring-helper.sh (4)
338-342: Score validation works by accident for non-numeric input.

The -lt/-gt operators inside [[ ]] error out on non-numeric values, and 2>/dev/null suppresses the message. The non-zero exit code from the error coincidentally triggers the if body, so the user sees "must be 1-5" rather than a proper "not a number" message. This works but is fragile and misleading.

♻️ Explicit numeric check

```diff
 if [[ -z "$value" ]]; then
   continue
 fi
 # Validate score range
-if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
-  print_error "Score for ${criterion} must be 1-5, got: ${value}"
+if ! [[ "$value" =~ ^[1-5]$ ]]; then
+  print_error "Score for ${criterion} must be an integer 1-5, got: ${value}"
   return 1
 fi
```

384-398: Awk expressions interpolate shell variables into the program text: a code injection vector.

In _show_response_scores, $score, $weight, $weighted_total, and $weight_sum are interpolated directly into awk BEGIN blocks (lines 386-387, 395). If any of these values were tampered with (e.g., a crafted criterion name in the DB returning something other than a number), arbitrary awk code could execute. In practice these come from the DB and weight helper, so risk is low, but it's worth tightening.

♻️ Use awk -v to pass variables safely

```diff
-weighted_total=$(awk "BEGIN{printf \"%.4f\", $weighted_total + ($score * $weight)}")
-weight_sum=$(awk "BEGIN{printf \"%.4f\", $weight_sum + $weight}")
+weighted_total=$(awk -v wt="$weighted_total" -v s="$score" -v w="$weight" 'BEGIN{printf "%.4f", wt + (s * w)}')
+weight_sum=$(awk -v ws="$weight_sum" -v w="$weight" 'BEGIN{printf "%.4f", ws + w}')
```

And similarly on line 395:

```diff
-weighted_avg=$(awk "BEGIN{printf \"%.2f\", $weighted_total / $weight_sum}")
+weighted_avg=$(awk -v wt="$weighted_total" -v ws="$weight_sum" 'BEGIN{printf "%.2f", wt / ws}')
```

521-538: rank variable incremented inside a piped while-loop (subshell): the counter resets each invocation and the parent never sees it.

echo "$responses" | while ... creates a subshell. The rank counter works correctly within the loop for display purposes, but this is a subtle footgun if anyone later tries to use rank after the loop. The same pattern appears on line 677.

Consider using a here-string or process substitution (while ... done <<< "$responses") to keep the loop in the current shell, consistent with how _compare_json (line 565) and _show_response_scores (line 388) already do it.

♻️ Consistent here-string pattern

```diff
 local rank=0
-echo "$responses" | while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
+while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
   rank=$((rank + 1))
   ...
-done
+done <<< "$responses"
```

130-147: Unrecognized arguments are silently swallowed (*) shift ;;).

Every argument parser has a catch-all that discards unknown flags. A typo like --propmpt instead of --prompt would be silently ignored and the required parameter would appear missing, producing a confusing "Usage:" error far from the root cause. Consider emitting a warning for unknown flags.

♻️ Example: warn on unknown flags

```diff
-*) shift ;;
+*)
+  print_warning "Unknown option ignored: $1"
+  shift ;;
```

Also applies to: 239-256, 297-306, 439-444, 599-605, 702-708

tests/test-response-scoring.sh (1)

344-369: SCORING_DB_OVERRIDE is changed mid-test but never restored: future tests would silently use the empty DB.

Line 346 switches SCORING_DB_OVERRIDE to a fresh empty database for edge-case testing, but it's never reset. Any tests added after line 369 would use the empty DB instead of the one populated by earlier tests.

♻️ Restore the override after edge case tests

```diff
+# Save original DB path
+ORIGINAL_DB="$SCORING_DB_OVERRIDE"
+
 # Empty prompt list on fresh DB
 fresh_db="$TEST_DB_DIR/fresh.db"
 export SCORING_DB_OVERRIDE="$fresh_db"
 bash "$HELPER" init >/dev/null 2>&1 || true
 ...
 # Leaderboard with no data
 empty_lb=$(bash "$HELPER" leaderboard 2>&1) || true
 if echo "$empty_lb" | grep -qi "no scored\|leaderboard"; then
   pass "leaderboard handles empty database"
 else
   fail "leaderboard should handle empty database" "$empty_lb"
 fi
+
+# Restore original DB for any subsequent tests
+export SCORING_DB_OVERRIDE="$ORIGINAL_DB"
```
```bash
local escaped_title escaped_text
escaped_title=$(echo "$title" | sed "s/'/''/g")
escaped_text=$(echo "$text" | sed "s/'/''/g")

local prompt_id
prompt_id=$(log_stderr "prompt add" sqlite3 "$SCORING_DB" \
  "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${category}', '${difficulty}'); SELECT last_insert_rowid();")
```
SQL injection via unescaped category and difficulty parameters.
title and text are single-quote-escaped via sed, but category and difficulty are interpolated directly into the SQL string on line 161. A value like general'; DROP TABLE prompts;-- would be injected verbatim.
Apply the same escaping treatment as title and text:
🛡️ Proposed fix

```diff
 local escaped_title escaped_text
 escaped_title=$(echo "$title" | sed "s/'/''/g")
 escaped_text=$(echo "$text" | sed "s/'/''/g")
+local escaped_category escaped_difficulty
+escaped_category=$(echo "$category" | sed "s/'/''/g")
+escaped_difficulty=$(echo "$difficulty" | sed "s/'/''/g")
 local prompt_id
 prompt_id=$(log_stderr "prompt add" sqlite3 "$SCORING_DB" \
-  "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${category}', '${difficulty}'); SELECT last_insert_rowid();")
+  "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${escaped_category}', '${escaped_difficulty}'); SELECT last_insert_rowid();")
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 155 - 161, The SQL
is vulnerable because category and difficulty are interpolated directly into the
sqlite3 INSERT; add the same single-quote escaping used for title/text: declare
local escaped_category and escaped_difficulty, set escaped_category=$(echo
"$category" | sed "s/'/''/g") and escaped_difficulty=$(echo "$difficulty" | sed
"s/'/''/g"), and then use '${escaped_category}' and '${escaped_difficulty}' in
the log_stderr / sqlite3 INSERT command (the prompt_id assignment) instead of
the raw variables.
| if [[ -z "$prompt_id" || -z "$model_id" || -z "$response_text" ]]; then | ||
| print_error "Usage: response-scoring-helper.sh record --prompt <id> --model <model_id> --text \"response\" [--time <seconds>] [--tokens <count>] [--cost <usd>]" | ||
| return 1 | ||
| fi | ||
|
|
||
| # Verify prompt exists | ||
| local prompt_exists | ||
| prompt_exists=$(log_stderr "record check" sqlite3 "$SCORING_DB" \ | ||
| "SELECT COUNT(*) FROM prompts WHERE prompt_id = ${prompt_id};") | ||
| if [[ "$prompt_exists" == "0" ]]; then | ||
| print_error "Prompt #${prompt_id} not found" | ||
| return 1 | ||
| fi | ||
|
|
||
| local escaped_text escaped_model | ||
| escaped_text=$(echo "$response_text" | sed "s/'/''/g") | ||
| escaped_model=$(echo "$model_id" | sed "s/'/''/g") | ||
|
|
||
| local response_id | ||
| response_id=$(log_stderr "record insert" sqlite3 "$SCORING_DB" \ | ||
| "INSERT INTO responses (prompt_id, model_id, response_text, response_time, token_count, cost_estimate) | ||
| VALUES (${prompt_id}, '${escaped_model}', '${escaped_text}', ${response_time}, ${token_count}, ${cost}); | ||
| SELECT last_insert_rowid();") |
Numeric parameters (prompt_id, response_time, token_count, cost) are interpolated into SQL without integer/float validation.
If a user passes --prompt "1; DROP TABLE prompts;--" or --time "0; DROP TABLE responses", the value is spliced directly into the SQL statement. While prompt_id is checked against the DB first (line 266-267, also unvalidated), response_time, token_count, and cost go straight into the INSERT on line 280.
Add numeric validation before SQL interpolation. This pattern applies throughout the script (also in cmd_score, cmd_compare, cmd_leaderboard, cmd_export, cmd_history, and prompt show).
🛡️ Proposed fix: validate before interpolation (example for cmd_record)

```diff
 if [[ -z "$prompt_id" || -z "$model_id" || -z "$response_text" ]]; then
   print_error "Usage: ..."
   return 1
 fi
+# Validate numeric inputs
+if ! [[ "$prompt_id" =~ ^[0-9]+$ ]]; then
+  print_error "prompt_id must be a positive integer"
+  return 1
+fi
+if ! [[ "$response_time" =~ ^[0-9]*\.?[0-9]+$ ]]; then
+  print_error "response_time must be a number"
+  return 1
+fi
+if ! [[ "$token_count" =~ ^[0-9]+$ ]]; then
+  print_error "token_count must be a positive integer"
+  return 1
+fi
+if ! [[ "$cost" =~ ^[0-9]*\.?[0-9]+$ ]]; then
+  print_error "cost must be a number"
+  return 1
+fi
+
 # Verify prompt exists
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 259 - 281, Validate
numeric inputs before embedding them into SQL: ensure prompt_id and token_count
match an integer regex (e.g., ^[0-9]+$) and ensure response_time and cost match
a numeric/float regex (e.g., ^[0-9]+(\.[0-9]+)?$); reject or default and error
out if validation fails, then use the validated variables when constructing the
sqlite3 INSERT/select statements. Also apply the same validation to the
prompt_exists check (don't interpolate an unvalidated prompt_id into the SELECT)
and propagate this pattern to other commands named in the script (cmd_record,
cmd_score, cmd_compare, cmd_leaderboard, cmd_export, cmd_history and the prompt
show path) so all numeric parameters are validated before SQL interpolation.
```bash
local where_clause=""
if [[ -n "$category" ]]; then
  where_clause="WHERE p.category = '${category}'"
fi
```
SQL injection via unescaped category in leaderboard WHERE clause.
The category value from --category is interpolated directly into the SQL WHERE clause without escaping.
🛡️ Proposed fix

```diff
 local where_clause=""
 if [[ -n "$category" ]]; then
-  where_clause="WHERE p.category = '${category}'"
+  local escaped_category
+  escaped_category=$(echo "$category" | sed "s/'/''/g")
+  where_clause="WHERE p.category = '${escaped_category}'"
 fi
```
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 607 - 611, The WHERE
clause builds SQL by interpolating the category variable into where_clause,
which allows SQL injection; fix by properly escaping or validating category
before interpolation — e.g., sanitize category (allow only expected
values/patterns) or escape single quotes (replace ' with '' ) and then assign
where_clause="WHERE p.category = '${escaped_category}'"; alternatively use a
parameterized/psql variable substitution approach instead of direct
interpolation; update the code that sets where_clause and any caller that passes
category in .agents/scripts/response-scoring-helper.sh to use the
sanitized/escaped variable.
| if [[ "$format" == "csv" ]]; then | ||
| echo "prompt_id,prompt_title,response_id,model_id,correctness,completeness,code_quality,clarity,weighted_avg,response_time,token_count,cost_estimate" | ||
| sqlite3 -separator ',' "$SCORING_DB" " | ||
| SELECT r.prompt_id, p.title, r.response_id, r.model_id, | ||
| MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), | ||
| MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), | ||
| MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), | ||
| MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), | ||
| ROUND(( | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 + | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 + | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 + | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20 | ||
| ), 2), | ||
| r.response_time, r.token_count, r.cost_estimate | ||
| FROM responses r | ||
| JOIN prompts p ON r.prompt_id = p.prompt_id | ||
| LEFT JOIN scores s ON r.response_id = s.response_id | ||
| ${where_clause} | ||
| GROUP BY r.response_id | ||
| ORDER BY r.prompt_id, r.model_id; | ||
| " 2>/dev/null |
Export CSV weighted average calculation differs from compare/leaderboard: missing criteria skew the score.

The export query on lines 724-728 uses COALESCE(..., 0) for missing criteria, substituting 0 and implicitly dividing by the full weight of 1.0 (there is no division at all). Meanwhile, cmd_compare (lines 466-480) and cmd_leaderboard (lines 618-631) correctly use NULLIF(SUM(weights), 0) to exclude unscored criteria from the denominator.

If a response has only one criterion scored (e.g., correctness=5), export reports 5*0.30 = 1.50, while compare reports 5.00. This inconsistency produces misleading export data.
🔧 Align export with the same weighted-average formula used elsewhere

```diff
-ROUND((
-  COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
-  COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
-  COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
-  COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20
-), 2),
+ROUND(
+  (COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
+   COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
+   COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
+   COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20)
+  / NULLIF(
+    (CASE WHEN MAX(CASE WHEN s.criterion = 'correctness' THEN 1 END) IS NOT NULL THEN 0.30 ELSE 0 END +
+     CASE WHEN MAX(CASE WHEN s.criterion = 'completeness' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
+     CASE WHEN MAX(CASE WHEN s.criterion = 'code_quality' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
+     CASE WHEN MAX(CASE WHEN s.criterion = 'clarity' THEN 1 END) IS NOT NULL THEN 0.20 ELSE 0 END), 0)
+  , 2),
```
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 716 - 737, The CSV
export query currently uses COALESCE(..., 0) which treats missing criteria as
zero and biases the weighted average; update the SELECT that builds the
weighted_avg (inside the format == "csv" branch) to compute a proper weighted
average like cmd_compare and cmd_leaderboard: build a numerator by summing
weighted MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END) * 0.30 +
... (for completeness 0.25, code_quality 0.25, clarity 0.20) and build a
denominator as NULLIF( (CASE WHEN MAX(CASE WHEN s.criterion='correctness' THEN
s.score END) IS NOT NULL THEN 0.30 ELSE 0 END) + ... , 0); then use
ROUND(numerator / denominator, 2) so missing criteria are excluded from the
denominator rather than treated as zero; change the expression that currently
uses COALESCE(...) and weights to this numerator/denominator pattern referencing
the same MAX(CASE ...) expressions and weights.
…(t168.3)

Add response-scoring-helper.sh with SQLite-backed evaluation engine:

- Structured scoring on 4 weighted criteria: correctness (30%), completeness (25%), code quality (25%), clarity (20%)
- Side-by-side comparison with ranked output and winner declaration
- Aggregate leaderboard across all evaluations
- JSON and CSV export for analysis
- Prompt management (create, list, show)
- Response recording with timing, token count, and cost metadata
- Multi-scorer support (human, AI, etc.)
- /score-responses slash command and subagent documentation
- 41 passing tests covering all commands and edge cases
- ShellCheck clean (zero violations)

Complements existing compare-models (specs) with actual response quality evaluation. Workflow: identify candidates -> check availability -> evaluate outputs -> inform model routing decisions.
Force-pushed from 6eb9c3e to a6aefbd (Compare)
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status
[INFO] Recent monitoring activity
📈 Current Quality Metrics

Generated on: Mon Feb 9 15:44:00 UTC 2026 by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @.agents/scripts/response-scoring-helper.sh:
- Around line 58-110: The SQLite foreign key constraints in init_db (and the
comparisons/winner_id, responses.prompt_id, scores.response_id relations) are
never enforced because PRAGMA foreign_keys=ON is not set per connection; add a
PRAGMA foreign_keys = ON execution before any DB DDL/DML is run (e.g., in
init_db and in the connection wrapper used by ensure_db, cmd_record, cmd_score)
so every write/read session enables foreign keys for that connection; ensure the
PRAGMA is issued on the same sqlite3 connection used for subsequent CREATE TABLE
/ INSERT / UPDATE statements to enforce referential integrity for comparisons,
responses, and scores.
- Around line 557-588: The JSON building in _compare_json interpolates unescaped
strings (notably model_id and prompt_title) which can break JSON; either
construct the object with jq if available or ensure proper escaping of
backslashes and double-quotes (and strip/escape control chars) for model_id and
prompt_title before embedding them into the JSON string; update the same
escaping logic used for prompt_title to also run on model_id (and apply to the
leaderboard output path referenced around line 660) so all string fields are
JSON-safe.
- Around line 338-342: The current range check using [[ "$value" -lt 1 ||
"$value" -gt 5 ]] 2>/dev/null silently accepts non-numeric input; change the
validation in response-scoring-helper.sh to first assert the input is an integer
(e.g., test $value with a regex like ^[0-9]+$ using [[ ]] ), and only then
perform the numeric range check against 1–5 (the existing -lt/-gt checks) for
the variables value and criterion; if the integer test fails or the range check
fails, call print_error "Score for ${criterion} must be an integer 1-5, got:
${value}" and return 1.
🧹 Nitpick comments (4)
.agents/scripts/response-scoring-helper.sh (2)
384-391: Shell variables expanded directly inside awk program strings: prefer -v for safety and clarity.

Lines 386-387 and 395 inject $weighted_total, $score, $weight, and $weight_sum via double-quoted awk program strings. While the values here flow from the database (with a CHECK constraint) and a hardcoded weight helper, this pattern is fragile: any unexpected character in a DB value would be interpreted as awk code.

Using awk's -v flag is more robust and considered idiomatic:

♻️ Suggested improvement

```diff
-weighted_total=$(awk "BEGIN{printf \"%.4f\", $weighted_total + ($score * $weight)}")
-weight_sum=$(awk "BEGIN{printf \"%.4f\", $weight_sum + $weight}")
+weighted_total=$(awk -v wt="$weighted_total" -v sc="$score" -v w="$weight" 'BEGIN{printf "%.4f", wt + (sc * w)}')
+weight_sum=$(awk -v ws="$weight_sum" -v w="$weight" 'BEGIN{printf "%.4f", ws + w}')
```

363-368: Inconsistent use of log_stderr wrapper for sqlite3 calls.

_show_response_scores (lines 367-368) and cmd_compare (lines 454-455) call sqlite3 directly, while other commands consistently use the log_stderr wrapper. This means failures in these calls won't appear in the log file, making debugging harder.

♻️ Proposed fix (example for _show_response_scores)

```diff
 local model_id
-model_id=$(sqlite3 "$SCORING_DB" \
-  "SELECT model_id FROM responses WHERE response_id = ${response_id};")
+model_id=$(log_stderr "show scores" sqlite3 "$SCORING_DB" \
+  "SELECT model_id FROM responses WHERE response_id = ${response_id};")
```

tests/test-response-scoring.sh (2)

215-222: Missing test for non-numeric score input: it would expose the validation bypass.

The test on line 216 validates an out-of-range integer (6), which the script does catch. However, a non-numeric value like abc silently passes the validation in the helper (see the review on lines 338-342 of the helper script) and would fail at the SQLite level instead. Adding a non-numeric test case would catch this gap and serve as a regression test for the fix:

```bash
# Test score validation (non-numeric)
score_nan=$(bash "$HELPER" score --response 1 --correctness "abc" 2>&1) || true
if echo "$score_nan" | grep -qi "must be 1-5\|error"; then
  pass "score rejects non-numeric value"
else
  fail "score should reject non-numeric value" "$score_nan"
fi
```

344-369: Consider restoring SCORING_DB_OVERRIDE or adding a comment after switching to the fresh DB.

Line 346 changes SCORING_DB_OVERRIDE to a fresh database for edge-case tests. Any tests added after this block would unknowingly run against the empty DB instead of the fully populated one. A brief comment or a restoration step would prevent future confusion:

```diff
+# NOTE: SCORING_DB_OVERRIDE now points to a fresh empty DB for the remaining edge-case tests.
 fresh_db="$TEST_DB_DIR/fresh.db"
 export SCORING_DB_OVERRIDE="$fresh_db"
```
```bash
init_db() {
  mkdir -p "$SCORING_DIR" 2>/dev/null || true

  log_stderr "db init" sqlite3 "$SCORING_DB" "
  CREATE TABLE IF NOT EXISTS prompts (
    prompt_id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    prompt_text TEXT NOT NULL,
    category TEXT DEFAULT 'general',
    difficulty TEXT DEFAULT 'medium',
    created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
  );

  CREATE TABLE IF NOT EXISTS responses (
    response_id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt_id INTEGER NOT NULL,
    model_id TEXT NOT NULL,
    response_text TEXT NOT NULL,
    response_time REAL DEFAULT 0.0,
    token_count INTEGER DEFAULT 0,
    cost_estimate REAL DEFAULT 0.0,
    recorded_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id)
  );

  CREATE TABLE IF NOT EXISTS scores (
    score_id INTEGER PRIMARY KEY AUTOINCREMENT,
    response_id INTEGER NOT NULL,
    criterion TEXT NOT NULL,
    score INTEGER NOT NULL CHECK(score BETWEEN 1 AND 5),
    rationale TEXT DEFAULT '',
    scored_by TEXT DEFAULT 'human',
    scored_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    FOREIGN KEY (response_id) REFERENCES responses(response_id),
    UNIQUE(response_id, criterion, scored_by)
  );

  CREATE TABLE IF NOT EXISTS comparisons (
    comparison_id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt_id INTEGER NOT NULL,
    winner_id INTEGER,
    notes TEXT DEFAULT '',
    compared_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id),
    FOREIGN KEY (winner_id) REFERENCES responses(response_id)
  );

  CREATE INDEX IF NOT EXISTS idx_responses_prompt ON responses(prompt_id);
  CREATE INDEX IF NOT EXISTS idx_responses_model ON responses(model_id);
  CREATE INDEX IF NOT EXISTS idx_scores_response ON scores(response_id);
  "
  return 0
}
```
Foreign keys are declared but never enforced — SQLite requires an explicit pragma.
SQLite ignores FOREIGN KEY constraints by default. Without PRAGMA foreign_keys = ON; before the DML statements, referential integrity (e.g., inserting a response for a deleted prompt, or a comparison with a bogus winner_id) is not enforced at the database level.
The manual existence checks in cmd_record and cmd_score partially compensate, but the comparisons table's winner_id is never validated by application code.
🔧 Proposed fix

```diff
 init_db() {
   mkdir -p "$SCORING_DIR" 2>/dev/null || true
   log_stderr "db init" sqlite3 "$SCORING_DB" "
+  PRAGMA foreign_keys = ON;
+
   CREATE TABLE IF NOT EXISTS prompts (
```

Note: PRAGMA foreign_keys is per-connection and not persisted, so you'd also want to issue it before every write session (e.g., in ensure_db or a wrapper).
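A sketch of that per-session approach, assuming a single hypothetical db_exec entry point funnels all statements (not part of this PR):

```bash
# Hypothetical wrapper: enforce foreign keys on every connection.
# PRAGMA foreign_keys = ON is silent (no stdout), so it is safe to
# prepend even when the caller captures query output.
db_exec() {
  sqlite3 "$SCORING_DB" "PRAGMA foreign_keys = ON; $1"
}
```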
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 58 - 110, The SQLite
foreign key constraints in init_db (and the comparisons/winner_id,
responses.prompt_id, scores.response_id relations) are never enforced because
PRAGMA foreign_keys=ON is not set per connection; add a PRAGMA foreign_keys = ON
execution before any DB DDL/DML is run (e.g., in init_db and in the connection
wrapper used by ensure_db, cmd_record, cmd_score) so every write/read session
enables foreign keys for that connection; ensure the PRAGMA is issued on the
same sqlite3 connection used for subsequent CREATE TABLE / INSERT / UPDATE
statements to enforce referential integrity for comparisons, responses, and
scores.
```bash
# Validate score range
if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
  print_error "Score for ${criterion} must be 1-5, got: ${value}"
  return 1
fi
```
Score validation silently passes non-numeric input.
The -lt / -gt operators inside [[ ]] emit an error (exit code 2) on non-integer strings, and 2>/dev/null suppresses it. The effect is that a value like abc or 3; DROP TABLE scores;-- makes both comparisons fail, the overall condition evaluates to false, and the value sails through to the SQL INSERT on line 346.
This is the score-specific manifestation of the numeric-injection gap already noted, but the mechanism is subtle enough to call out: the validation looks present but doesn't actually guard non-numeric input.
🛡️ Proposed fix: validate integer first, then range-check

```diff
 if [[ -z "$value" ]]; then
   continue
 fi
-# Validate score range
-if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
+# Validate integer and score range
+if ! [[ "$value" =~ ^[0-9]+$ ]] || [[ "$value" -lt 1 || "$value" -gt 5 ]]; then
   print_error "Score for ${criterion} must be 1-5, got: ${value}"
   return 1
 fi
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 338 - 342, The
current range check using [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null
silently accepts non-numeric input; change the validation in
response-scoring-helper.sh to first assert the input is an integer (e.g., test
$value with a regex like ^[0-9]+$ using [[ ]] ), and only then perform the
numeric range check against 1–5 (the existing -lt/-gt checks) for the variables
value and criterion; if the integer test fails or the range check fails, call
print_error "Score for ${criterion} must be an integer 1-5, got: ${value}" and
return 1.
```bash
_compare_json() {
  local prompt_id="$1"
  local prompt_title="$2"
  local responses="$3"

  local json_entries=()
  local rank=0

  while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
    rank=$((rank + 1))

    local corr="null" comp="null" code="null" clar="null"
    if [[ -n "$score_detail" ]]; then
      local val
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^correctness:" | cut -d: -f2)
      [[ -n "$val" ]] && corr="$val"
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^completeness:" | cut -d: -f2)
      [[ -n "$val" ]] && comp="$val"
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^code_quality:" | cut -d: -f2)
      [[ -n "$val" ]] && code="$val"
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^clarity:" | cut -d: -f2)
      [[ -n "$val" ]] && clar="$val"
    fi

    json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${model_id}\",\"scores\":{\"correctness\":${corr},\"completeness\":${comp},\"code_quality\":${code},\"clarity\":${clar}},\"weighted_avg\":${wavg:-0},\"response_time\":${rtime:-0},\"tokens\":${tokens:-0},\"cost\":${cost:-0}}")
  done <<< "$responses"

  local escaped_title
  escaped_title=$(echo "$prompt_title" | sed 's/"/\\"/g')
  echo "{\"prompt_id\":${prompt_id},\"title\":\"${escaped_title}\",\"responses\":[$(IFS=,; echo "${json_entries[*]}")]}"
  return 0
}
```
JSON output is vulnerable to breakage from special characters in model_id.
model_id (line 581) is interpolated directly into the JSON string with zero escaping. If a model ID contains ", \, or control characters, the output is malformed JSON. The same applies to JSON leaderboard output (line 660).
prompt_title (line 585) escapes only " but not \ or control characters, which is incomplete.
Consider using jq (if available) or at minimum applying the same sed escaping for \ and ":
🔧 Proposed fix (minimal: escape backslash, then double-quote)

```diff
 while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
   rank=$((rank + 1))
+  # Escape JSON-special characters
+  local escaped_model
+  escaped_model=$(printf '%s' "$model_id" | sed 's/\\/\\\\/g; s/"/\\"/g')
   ...
-  json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${model_id}\",...")
+  json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${escaped_model}\",...")
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 557 - 588, The JSON
building in _compare_json interpolates unescaped strings (notably model_id and
prompt_title) which can break JSON; either construct the object with jq if
available or ensure proper escaping of backslashes and double-quotes (and
strip/escape control chars) for model_id and prompt_title before embedding them
into the JSON string; update the same escaping logic used for prompt_title to
also run on model_id (and apply to the leaderboard output path referenced around
line 660) so all string fields are JSON-safe.
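If jq is an acceptable dependency, each entry can be built with guaranteed escaping. A minimal sketch using the loop variables shown above (an alternative to the sed-based fix, not the PR's implementation):

```bash
# Build one JSON entry with jq so every string field is escaped correctly.
# --arg passes strings (jq escapes them); --argjson passes raw numbers.
entry=$(jq -cn \
  --argjson rank "$rank" \
  --argjson response_id "$rid" \
  --arg model "$model_id" \
  --argjson weighted_avg "${wavg:-0}" \
  '{rank: $rank, response_id: $response_id, model: $model, weighted_avg: $weighted_avg}')
json_entries+=("$entry")
```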
Summary

- response-scoring-helper.sh: SQLite-backed framework for evaluating AI model responses side-by-side with structured scoring criteria
- /score-responses slash command and response-scoring.md subagent documentation

What This Does

Extends the model comparison toolchain (t168 series) from spec-level comparison to actual response quality evaluation:

- compare-models-helper.sh
- model-availability-helper.sh
- model-registry-helper.sh
- response-scoring-helper.sh (new)

Scoring Criteria (weighted)

Correctness (30%), completeness (25%), code quality (25%), clarity (20%), each scored on a 1-5 scale.
Commands

- prompt add/list/show: manage evaluation prompts
- record: record model responses with timing/cost metadata
- score: score responses on all 4 criteria (1-5 scale)
- compare: side-by-side comparison with ranked output
- leaderboard: aggregate model rankings across evaluations
- export: JSON and CSV export for analysis
- history: scoring audit trail
- criteria: reference rubrics

Workflow

Identify candidates -> check availability -> evaluate outputs -> inform model routing decisions. A hypothetical end-to-end session is sketched below.
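A hypothetical end-to-end session, with flag names inferred from the usage strings and tests quoted in the reviews above (illustrative, not authoritative usage):

```bash
# Initialize the scoring database, add a prompt, record a model response,
# score it on the four criteria, then compare and rank.
bash .agents/scripts/response-scoring-helper.sh init
bash .agents/scripts/response-scoring-helper.sh prompt add --title "Demo" --text "Write a sort function"
bash .agents/scripts/response-scoring-helper.sh record --prompt 1 --model example-model \
  --text "def sort_items(xs): return sorted(xs)" --time 2.1 --tokens 150 --cost 0.003
bash .agents/scripts/response-scoring-helper.sh score --response 1 \
  --correctness 5 --completeness 4 --code_quality 5 --clarity 5
bash .agents/scripts/response-scoring-helper.sh compare --prompt 1
bash .agents/scripts/response-scoring-helper.sh leaderboard --json
```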
Quality

- 41 passing tests covering all commands and edge cases
- ShellCheck clean (zero violations)
- Repo conventions: local var="$1", explicit returns, shared-constants.sh

Summary by CodeRabbit

- New Features
- Documentation
- Tests