Conversation

@marcusquinn (Owner) commented Feb 9, 2026

Summary

  • Add response-scoring-helper.sh — SQLite-backed framework for evaluating AI model responses side-by-side with structured scoring criteria
  • Add /score-responses slash command and response-scoring.md subagent documentation
  • Add comprehensive test suite (41 tests, all passing)

What This Does

Extends the model comparison toolchain (t168 series) from spec-level comparison to actual response quality evaluation:

| Existing Tool | Purpose |
| --- | --- |
| compare-models-helper.sh | Compare model specs (pricing, context, capabilities) |
| model-availability-helper.sh | Check which models are available |
| model-registry-helper.sh | Track model versions and deprecations |
| response-scoring-helper.sh (new) | Evaluate actual model response quality |

Scoring Criteria (weighted)

| Criterion | Weight | Description |
| --- | --- | --- |
| Correctness | 30% | Factual accuracy and technical correctness |
| Completeness | 25% | Coverage of all requirements and edge cases |
| Code Quality | 25% | Clean code, best practices, maintainability |
| Clarity | 20% | Clear explanation, good formatting, readability |
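The weighted aggregate is a dot product of the four per-criterion scores (1-5) with the weights above. A minimal sketch of that arithmetic (illustrative only, not code from the helper script):

```bash
# Example scores on the 1-5 scale for a single response.
correctness=5 completeness=4 code_quality=5 clarity=4

# Weighted aggregate using the 30/25/25/20 weights from the table above.
weighted_avg=$(awk -v co="$correctness" -v cp="$completeness" -v cq="$code_quality" -v cl="$clarity" \
    'BEGIN { printf "%.2f", co*0.30 + cp*0.25 + cq*0.25 + cl*0.20 }')

echo "weighted_avg=${weighted_avg}"   # -> weighted_avg=4.55
```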

Commands

  • prompt add/list/show — Manage evaluation prompts
  • record — Record model responses with timing/cost metadata
  • score — Score responses on all 4 criteria (1-5 scale)
  • compare — Side-by-side comparison with ranked output
  • leaderboard — Aggregate model rankings across evaluations
  • export — JSON and CSV export for analysis
  • history — Scoring audit trail
  • criteria — Reference rubrics
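A typical session with these commands might look like the sketch below. The record and score flags match the usage strings and tests quoted in the review further down; the prompt add and compare options, the model name, and the file names are assumptions.

```bash
# Initialise the database, add a prompt, record one response, score it,
# then compare and rank (prompt add / compare flag names are assumed).
./.agents/scripts/response-scoring-helper.sh init
./.agents/scripts/response-scoring-helper.sh prompt add --title "Streaming CSV parser" --text "Refactor the CSV parser to stream input"
./.agents/scripts/response-scoring-helper.sh record --prompt 1 --model claude-sonnet --text "$(cat response.txt)" --time 12.4 --tokens 1830 --cost 0.021
./.agents/scripts/response-scoring-helper.sh score --response 1 --correctness 5 --completeness 4 --code_quality 5 --clarity 4
./.agents/scripts/response-scoring-helper.sh compare --prompt 1
./.agents/scripts/response-scoring-helper.sh leaderboard --json
```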

Workflow

1. compare-models recommend "task"  → identify candidate models
2. model-availability check <model> → verify availability
3. response-scoring record/score    → evaluate actual outputs
4. response-scoring leaderboard     → inform model routing
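Concretely, steps 1, 2, and 4 map onto the existing helpers as in the sketch below. Script names follow the table above, the recommend/check syntax is taken from this outline, the paths assume the same .agents/scripts/ location as the new helper, and the task and model name are placeholders.

```bash
./.agents/scripts/compare-models-helper.sh recommend "implement a streaming CSV parser"   # step 1: candidate models
./.agents/scripts/model-availability-helper.sh check claude-sonnet                        # step 2: availability
./.agents/scripts/response-scoring-helper.sh leaderboard                                  # step 4: rankings inform routing
```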

Quality

  • ShellCheck: zero violations
  • Tests: 41/41 passing (syntax, CRUD, scoring, comparison, export, edge cases)
  • Follows framework conventions: local var="$1", explicit returns, shared-constants.sh

Summary by CodeRabbit

  • New Features

    • Added a response-scoring workflow to evaluate AI outputs across four weighted criteria, compare responses side-by-side, view leaderboards, and export results (JSON/CSV).
  • Documentation

    • Added comprehensive guides and quick reference for the scoring workflow and usage.
  • Tests

    • Added an end-to-end test suite validating the scoring workflow, comparisons, leaderboard, export, and edge cases.

@coderabbitai bot (Contributor) commented Feb 9, 2026

Walkthrough

Adds a response scoring framework: documentation, a SQLite-backed Bash CLI helper script, and an end-to-end test suite enabling prompt management, response recording, multi-criterion scoring, comparisons, leaderboards, and data export.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & References: .agents/scripts/commands/score-responses.md, .agents/tools/ai-assistants/response-scoring.md, .agents/tools/ai-assistants/compare-models.md | New docs describing the response-scoring workflow, scoring criteria and weights, example commands, DB location/schema, and a compare-models reference to the scoring doc. |
| Response Scoring CLI: .agents/scripts/response-scoring-helper.sh | New comprehensive Bash CLI implementing SQLite-backed storage and commands: init, prompt (add/list/show), record, score (per-criterion validation and weighted aggregation), compare (table/JSON), leaderboard, export (JSON/CSV), history, criteria, help, plus internal helpers and DB schema creation. |
| Test Suite: tests/test-response-scoring.sh | New Bash end-to-end test script exercising CLI features against a temporary SQLite DB: syntax, init, prompt CRUD, recording, scoring, comparisons, leaderboard, export, history, criteria, and edge cases; includes pass/fail reporting and cleanup. |

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as Response Scoring CLI
    participant DB as SQLite Database
    participant Scorer as Scoring Logic
    participant Renderer as Output Renderer

    User->>CLI: init
    CLI->>DB: create schema (prompts, responses, scores, comparisons)
    DB-->>CLI: ready

    User->>CLI: prompt add "<text>"
    CLI->>DB: insert prompt
    DB-->>CLI: prompt_id

    User->>CLI: record <prompt_id> <model> <response_file>
    CLI->>DB: insert response + metadata
    DB-->>CLI: response_id

    User->>CLI: score <response_id> correctness=5 completeness=4 code_quality=5 clarity=5
    CLI->>Scorer: validate & compute weighted total (30/25/25/20)
    Scorer->>DB: persist per-criterion and aggregate score
    DB-->>CLI: score stored

    User->>CLI: compare <prompt_id>
    CLI->>DB: fetch responses + scores
    DB-->>CLI: result set
    CLI->>Renderer: format table or JSON
    Renderer-->>User: comparison output

    User->>CLI: leaderboard [--json]
    CLI->>DB: aggregate by model
    DB-->>CLI: rankings
    CLI->>Renderer: render leaderboard
    Renderer-->>User: ranked results

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

✨ Models answer, judges weigh the best,

SQLite holds each tested quest.
Four rubrics sing, scores rise and fall,
Compare, export, crown the call—
A leaderboard to crown the quest.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main addition: a response scoring and comparison framework for evaluating AI models. It directly reflects the core deliverables (response-scoring-helper.sh, response-scoring.md, and test suite) and is clear, concise, and meaningful for scanning PR history. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions bot commented Feb 9, 2026

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 53 code smells

[INFO] Recent monitoring activity:
Mon Feb 9 15:15:22 UTC 2026: Code review monitoring started
Mon Feb 9 15:15:23 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 53

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 53
  • VULNERABILITIES: 0

Generated on: Mon Feb 9 15:15:25 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new response scoring framework to evaluate AI model responses based on correctness, completeness, code quality, and clarity. It provides tools for managing prompts, recording responses, scoring, comparing, and generating leaderboards, enabling informed model routing and selection.

Highlights

  • New Feature: Response Scoring Framework: Introduces a SQLite-backed framework (response-scoring-helper.sh) for evaluating AI model responses using structured scoring criteria, enhancing the model comparison toolchain.
  • Slash Command and Documentation: Adds a /score-responses slash command and response-scoring.md subagent documentation for easy access and usage.
  • Comprehensive Test Suite: Includes a comprehensive test suite with 41 passing tests, ensuring the framework's reliability and adherence to coding standards.
Changelog
  • .agents/scripts/commands/score-responses.md
    • Added a new subagent command for scoring model responses.
  • .agents/scripts/response-scoring-helper.sh
    • Implemented the core logic for response scoring, including database initialization, prompt management, response recording, scoring, comparison, leaderboard generation, and export functionalities.
  • .agents/tools/ai-assistants/compare-models.md
    • Updated the compare-models documentation to include a reference to the new response-scoring tool.
  • .agents/tools/ai-assistants/response-scoring.md
    • Created a new documentation file detailing the usage and workflow of the response scoring framework.
  • tests/test-response-scoring.sh
    • Added a comprehensive test suite to validate the functionality of the response-scoring-helper.sh script.
Activity
  • New files added: .agents/scripts/commands/score-responses.md, .agents/scripts/response-scoring-helper.sh, .agents/tools/ai-assistants/response-scoring.md, tests/test-response-scoring.sh
  • Modified file: .agents/tools/ai-assistants/compare-models.md
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive framework for scoring and comparing AI model responses, which is a great addition to the model evaluation toolchain. It includes a new helper script, subagent definition, documentation, and a thorough test suite. The code is well-structured and follows most of the project's conventions.

My review focuses on improving security, maintainability, and adherence to the repository's style guide. I've identified a critical SQL injection risk and suggest a mitigation. I've also pointed out several areas where hardcoded values could be centralized to improve maintainability, and a few violations of the shell scripting style guide regarding error handling and linting practices, with modifications to align with repository rules on script independence.

Comment on lines +156 to +157
escaped_title=$(echo "$title" | sed "s/'/''/g")
escaped_text=$(echo "$text" | sed "s/'/''/g")


critical

The script constructs SQL queries by concatenating user-provided strings, which is a significant security risk leading to SQL Injection vulnerabilities. For example, title and text are directly embedded into the SQL string. A malicious input could break out of the string literal and execute arbitrary SQL commands. The repository style guide requires using parameterized queries where possible.

While sqlite3 CLI makes this difficult, using printf '%s' is safer than echo for preparing the values as it doesn't interpret backslashes or options like -n.

Suggested change
escaped_title=$(echo "$title" | sed "s/'/''/g")
escaped_text=$(echo "$text" | sed "s/'/''/g")
escaped_title=$(printf '%s' "$title" | sed "s/'/''/g")
escaped_text=$(printf '%s' "$text" | sed "s/'/''/g")

Comment on lines +418 to +428
_get_criterion_weight() {
local criterion="$1"
case "$criterion" in
correctness) echo "0.30" ;;
completeness) echo "0.25" ;;
code_quality) echo "0.25" ;;
clarity) echo "0.20" ;;
*) echo "0.25" ;;
esac
return 0
}


high

The weights for scoring criteria are hardcoded in this function, and also in SQL queries within cmd_compare, cmd_leaderboard, and cmd_export. This violates the DRY (Don't Repeat Yourself) principle and makes the script hard to maintain, as the weights are already defined in the SCORING_CRITERIA constant but are not used here. Additionally, the default case *) echo "0.25" ;; can hide bugs by assigning an arbitrary weight to unknown criteria.

This function should parse the weight from the SCORING_CRITERIA constant. For unknown criteria, it should return an error. This will centralize the logic and make the system more robust. You should then consider refactoring the SQL queries to use this centralized logic instead of hardcoded weights.

Suggested change
_get_criterion_weight() {
local criterion="$1"
case "$criterion" in
correctness) echo "0.30" ;;
completeness) echo "0.25" ;;
code_quality) echo "0.25" ;;
clarity) echo "0.20" ;;
*) echo "0.25" ;;
esac
return 0
}
_get_criterion_weight() {
local criterion="$1"
local weight
# Parse weight from the central SCORING_CRITERIA constant
weight=$(echo "$SCORING_CRITERIA" | grep "^${criterion}|" | cut -d'|' -f2)
if [[ -z "$weight" ]]; then
print_error "Unknown criterion: '${criterion}'. Please check SCORING_CRITERIA."
return 1
fi
echo "$weight"
return 0
}

@@ -0,0 +1,933 @@
#!/usr/bin/env bash
# shellcheck disable=SC1091,SC2001


medium

The script uses a global shellcheck disable for SC2001. The repository style guide requires using "targeted inline disables with reason comments only". Additionally, SC2001 can often be avoided by using Bash's native parameter expansion, which is more efficient than sed.

For SC2001 violations (e.g., echo "$var" | sed "s/'/''/g"), you can replace them with "${var//\'/\'\'}", which is both more performant and compliant with modern shell scripting practices.
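A minimal sketch of that parameter-expansion approach (the sql_escape helper name is an assumption, not code from this PR):

```bash
#!/usr/bin/env bash
# Escape a value for use inside a single-quoted SQLite string literal by
# doubling every single quote, using parameter expansion instead of echo | sed.
sql_escape() {
    local value="$1"
    local sq="'"
    printf '%s' "${value//$sq/$sq$sq}"
    return 0
}

title="Bob's test prompt"
escaped_title=$(sql_escape "$title")
printf "INSERT INTO prompts (title) VALUES ('%s');\n" "$escaped_title"
# -> INSERT INTO prompts (title) VALUES ('Bob''s test prompt');
```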

References
  1. The style guide requires ShellCheck violations to be disabled inline with a reason, not globally at the top of the file. (link)

mkdir -p "$SCORING_DIR" 2>/dev/null || true

log_stderr "db init" sqlite3 "$SCORING_DB" "
CREATE TABLE IF NOT EXISTS prompts (


medium

The repository style guide specifies that "All SQLite databases use WAL mode + busy_timeout=5000". The init_db function is missing these PRAGMA settings. This can affect database concurrency and performance under load.

Suggested change
CREATE TABLE IF NOT EXISTS prompts (
PRAGMA journal_mode=WAL;
PRAGMA busy_timeout=5000;
CREATE TABLE IF NOT EXISTS prompts (
References
  1. The style guide requires all SQLite databases to be configured with WAL mode and a busy timeout. (link)
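One way to apply these settings on every connection without polluting captured query output is a small wrapper, sketched below. The run_sql name is an assumption; journal_mode=WAL persists in the database file once set, so it can stay in init_db as suggested above, while busy_timeout and foreign_keys are per-connection.

```bash
# Sketch: per-connection settings for each sqlite3 invocation.
# .timeout sets the busy handler without emitting a result row, and
# PRAGMA foreign_keys=ON prints nothing, so captured stdout stays clean.
run_sql() {
    local db="$1"
    local sql="$2"
    sqlite3 -cmd '.timeout 5000' -cmd 'PRAGMA foreign_keys=ON;' "$db" "$sql"
}

# Usage: count=$(run_sql "$SCORING_DB" "SELECT COUNT(*) FROM prompts;")
```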

Comment on lines +463 to +489
responses=$(sqlite3 -separator '|' "$SCORING_DB" "
SELECT r.response_id, r.model_id, r.response_time, r.token_count, r.cost_estimate,
COALESCE((
SELECT ROUND(
SUM(CASE s.criterion
WHEN 'correctness' THEN s.score * 0.30
WHEN 'completeness' THEN s.score * 0.25
WHEN 'code_quality' THEN s.score * 0.25
WHEN 'clarity' THEN s.score * 0.20
ELSE s.score * 0.25
END)
/ NULLIF(SUM(CASE s.criterion
WHEN 'correctness' THEN 0.30
WHEN 'completeness' THEN 0.25
WHEN 'code_quality' THEN 0.25
WHEN 'clarity' THEN 0.20
ELSE 0.25
END), 0)
, 2)
FROM scores s WHERE s.response_id = r.response_id
), 0) as weighted_avg,
(SELECT GROUP_CONCAT(s2.criterion || ':' || s2.score, ',')
FROM scores s2 WHERE s2.response_id = r.response_id) as score_detail
FROM responses r
WHERE r.prompt_id = ${prompt_id}
ORDER BY weighted_avg DESC;
" 2>/dev/null)


medium

This sqlite3 call suppresses all stderr output with 2>/dev/null. This violates the repository style guide, which states that 2>/dev/null should not be used for blanket suppression. If there is a syntax error in the SQL or another database issue, it will be silently ignored, making debugging difficult. The log_stderr helper, if simple and self-contained, should be duplicated within this standalone script to avoid source dependencies, as per repository guidelines for script independence.

Please wrap the sqlite3 command with a duplicated log_stderr function (if simple) and remove the 2>/dev/null redirection to ensure errors are properly logged. For example:
responses=$(log_stderr "compare query" sqlite3 ...)
This should be applied to other similar sqlite3 calls in cmd_leaderboard, cmd_export, and cmd_history.

References
  1. The style guide prohibits blanket error suppression with 2>/dev/null, recommending logging instead. (link)
  2. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues.
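For reference, a self-contained stand-in could look like the sketch below. This is hypothetical: the framework's actual log_stderr implementation is not shown in this PR, and the log path is an assumption.

```bash
# Hypothetical stand-in for the framework's log_stderr helper: run the command,
# pass stdout through to the caller, append stderr to a log file instead of
# discarding it, and record a labelled failure line on non-zero exit.
log_stderr() {
    local label="$1"
    shift
    local logfile="${SCORING_DIR:-/tmp}/response-scoring.log"
    local status=0
    "$@" 2>>"$logfile" || status=$?
    if (( status != 0 )); then
        printf '[%s] %s exited with status %d\n' "$label" "$1" "$status" >>"$logfile"
    fi
    return "$status"
}

# Usage (as suggested above):
# responses=$(log_stderr "compare query" sqlite3 "$SCORING_DB" "SELECT ...;")
```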

@coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.agents/scripts/response-scoring-helper.sh:
- Around line 155-161: The SQL is vulnerable because category and difficulty are
interpolated directly into the sqlite3 INSERT; add the same single-quote
escaping used for title/text: declare local escaped_category and
escaped_difficulty, set escaped_category=$(echo "$category" | sed "s/'/''/g")
and escaped_difficulty=$(echo "$difficulty" | sed "s/'/''/g"), and then use
'${escaped_category}' and '${escaped_difficulty}' in the log_stderr / sqlite3
INSERT command (the prompt_id assignment) instead of the raw variables.
- Around line 607-611: The WHERE clause builds SQL by interpolating the category
variable into where_clause, which allows SQL injection; fix by properly escaping
or validating category before interpolation — e.g., sanitize category (allow
only expected values/patterns) or escape single quotes (replace ' with '' ) and
then assign where_clause="WHERE p.category = '${escaped_category}'";
alternatively use a parameterized/psql variable substitution approach instead of
direct interpolation; update the code that sets where_clause and any caller that
passes category in .agents/scripts/response-scoring-helper.sh to use the
sanitized/escaped variable.
- Around line 259-281: Validate numeric inputs before embedding them into SQL:
ensure prompt_id and token_count match an integer regex (e.g., ^[0-9]+$) and
ensure response_time and cost match a numeric/float regex (e.g.,
^[0-9]+(\.[0-9]+)?$); reject or default and error out if validation fails, then
use the validated variables when constructing the sqlite3 INSERT/select
statements. Also apply the same validation to the prompt_exists check (don't
interpolate an unvalidated prompt_id into the SELECT) and propagate this pattern
to other commands named in the script (cmd_record, cmd_score, cmd_compare,
cmd_leaderboard, cmd_export, cmd_history and the prompt show path) so all
numeric parameters are validated before SQL interpolation.
- Around line 716-737: The CSV export query currently uses COALESCE(..., 0)
which treats missing criteria as zero and biases the weighted average; update
the SELECT that builds the weighted_avg (inside the format == "csv" branch) to
compute a proper weighted average like cmd_compare and cmd_leaderboard: build a
numerator by summing weighted MAX(CASE WHEN s.criterion = 'correctness' THEN
s.score END) * 0.30 + ... (for completeness 0.25, code_quality 0.25, clarity
0.20) and build a denominator as NULLIF( (CASE WHEN MAX(CASE WHEN
s.criterion='correctness' THEN s.score END) IS NOT NULL THEN 0.30 ELSE 0 END) +
... , 0); then use ROUND(numerator / denominator, 2) so missing criteria are
excluded from the denominator rather than treated as zero; change the expression
that currently uses COALESCE(...) and weights to this numerator/denominator
pattern referencing the same MAX(CASE ...) expressions and weights.
🧹 Nitpick comments (5)
.agents/scripts/response-scoring-helper.sh (4)

338-342: Score validation works by accident for non-numeric input.

The -lt/-gt operators inside [[ ]] error out on non-numeric values, and 2>/dev/null suppresses the message. The non-zero exit code from the error coincidentally triggers the if body, so the user sees "must be 1-5" rather than a proper "not a number" message. This works but is fragile and misleading.

♻️ Explicit numeric check
         if [[ -z "$value" ]]; then
             continue
         fi
 
         # Validate score range
-        if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
-            print_error "Score for ${criterion} must be 1-5, got: ${value}"
+        if ! [[ "$value" =~ ^[1-5]$ ]]; then
+            print_error "Score for ${criterion} must be an integer 1-5, got: ${value}"
             return 1
         fi

384-398: Awk expressions interpolate shell variables into the program text — code injection vector.

In _show_response_scores, $score, $weight, $weighted_total, and $weight_sum are interpolated directly into awk BEGIN blocks (lines 386-387, 395). If any of these values were tampered with (e.g., a crafted criterion name in the DB returning something other than a number), arbitrary awk code could execute. In practice these come from the DB and weight helper, so risk is low, but it's worth tightening.

♻️ Use awk -v to pass variables safely
-        weighted_total=$(awk "BEGIN{printf \"%.4f\", $weighted_total + ($score * $weight)}")
-        weight_sum=$(awk "BEGIN{printf \"%.4f\", $weight_sum + $weight}")
+        weighted_total=$(awk -v wt="$weighted_total" -v s="$score" -v w="$weight" 'BEGIN{printf "%.4f", wt + (s * w)}')
+        weight_sum=$(awk -v ws="$weight_sum" -v w="$weight" 'BEGIN{printf "%.4f", ws + w}')

And similarly on line 395:

-        weighted_avg=$(awk "BEGIN{printf \"%.2f\", $weighted_total / $weight_sum}")
+        weighted_avg=$(awk -v wt="$weighted_total" -v ws="$weight_sum" 'BEGIN{printf "%.2f", wt / ws}')

521-538: rank variable incremented inside a piped while-loop (subshell) — counter resets each invocation but the parent never sees it.

echo "$responses" | while ... creates a subshell. The rank counter works correctly within the loop for display purposes, but this is a subtle footgun if anyone later tries to use rank after the loop. The same pattern appears on line 677.

Consider using a here-string or process substitution (while ... done <<< "$responses") to keep the loop in the current shell, consistent with how _compare_json (line 565) and _show_response_scores (line 388) already do it.

♻️ Consistent here-string pattern
     local rank=0
-    echo "$responses" | while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
+    while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
         rank=$((rank + 1))
         ...
-    done
+    done <<< "$responses"

130-147: Unrecognized arguments are silently swallowed by the catch-all `*) shift ;;`.

Every argument parser has a catch-all that discards unknown flags. A typo like --propmpt instead of --prompt would be silently ignored and the required parameter would appear missing, producing a confusing "Usage:" error far from the root cause. Consider emitting a warning for unknown flags.

♻️ Example: warn on unknown flags
-            *) shift ;;
+            *)
+                print_warning "Unknown option ignored: $1"
+                shift ;;

Also applies to: 239-256, 297-306, 439-444, 599-605, 702-708

tests/test-response-scoring.sh (1)

344-369: SCORING_DB_OVERRIDE is changed mid-test but never restored — future tests would silently use the empty DB.

Line 346 switches SCORING_DB_OVERRIDE to a fresh empty database for edge-case testing, but it's never reset. Any tests added after line 369 would use the empty DB instead of the one populated by earlier tests.

♻️ Restore the override after edge case tests
+# Save original DB path
+ORIGINAL_DB="$SCORING_DB_OVERRIDE"
+
 # Empty prompt list on fresh DB
 fresh_db="$TEST_DB_DIR/fresh.db"
 export SCORING_DB_OVERRIDE="$fresh_db"
 bash "$HELPER" init >/dev/null 2>&1 || true
 ...
 
 # Leaderboard with no data
 empty_lb=$(bash "$HELPER" leaderboard 2>&1) || true
 if echo "$empty_lb" | grep -qi "no scored\|leaderboard"; then
     pass "leaderboard handles empty database"
 else
     fail "leaderboard should handle empty database" "$empty_lb"
 fi
+
+# Restore original DB for any subsequent tests
+export SCORING_DB_OVERRIDE="$ORIGINAL_DB"

Comment on lines +155 to +161
local escaped_title escaped_text
escaped_title=$(echo "$title" | sed "s/'/''/g")
escaped_text=$(echo "$text" | sed "s/'/''/g")

local prompt_id
prompt_id=$(log_stderr "prompt add" sqlite3 "$SCORING_DB" \
"INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${category}', '${difficulty}'); SELECT last_insert_rowid();")

⚠️ Potential issue | 🟠 Major

SQL injection via unescaped category and difficulty parameters.

title and text are single-quote-escaped via sed, but category and difficulty are interpolated directly into the SQL string on line 161. A value like general'; DROP TABLE prompts;-- would be injected verbatim.

Apply the same escaping treatment as title and text:

🛡️ Proposed fix
             local escaped_title escaped_text
             escaped_title=$(echo "$title" | sed "s/'/''/g")
             escaped_text=$(echo "$text" | sed "s/'/''/g")
+            local escaped_category escaped_difficulty
+            escaped_category=$(echo "$category" | sed "s/'/''/g")
+            escaped_difficulty=$(echo "$difficulty" | sed "s/'/''/g")
 
             local prompt_id
             prompt_id=$(log_stderr "prompt add" sqlite3 "$SCORING_DB" \
-                "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${category}', '${difficulty}'); SELECT last_insert_rowid();")
+                "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${escaped_category}', '${escaped_difficulty}'); SELECT last_insert_rowid();")
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 155 - 161, The SQL
is vulnerable because category and difficulty are interpolated directly into the
sqlite3 INSERT; add the same single-quote escaping used for title/text: declare
local escaped_category and escaped_difficulty, set escaped_category=$(echo
"$category" | sed "s/'/''/g") and escaped_difficulty=$(echo "$difficulty" | sed
"s/'/''/g"), and then use '${escaped_category}' and '${escaped_difficulty}' in
the log_stderr / sqlite3 INSERT command (the prompt_id assignment) instead of
the raw variables.

Comment on lines +259 to +281
if [[ -z "$prompt_id" || -z "$model_id" || -z "$response_text" ]]; then
print_error "Usage: response-scoring-helper.sh record --prompt <id> --model <model_id> --text \"response\" [--time <seconds>] [--tokens <count>] [--cost <usd>]"
return 1
fi

# Verify prompt exists
local prompt_exists
prompt_exists=$(log_stderr "record check" sqlite3 "$SCORING_DB" \
"SELECT COUNT(*) FROM prompts WHERE prompt_id = ${prompt_id};")
if [[ "$prompt_exists" == "0" ]]; then
print_error "Prompt #${prompt_id} not found"
return 1
fi

local escaped_text escaped_model
escaped_text=$(echo "$response_text" | sed "s/'/''/g")
escaped_model=$(echo "$model_id" | sed "s/'/''/g")

local response_id
response_id=$(log_stderr "record insert" sqlite3 "$SCORING_DB" \
"INSERT INTO responses (prompt_id, model_id, response_text, response_time, token_count, cost_estimate)
VALUES (${prompt_id}, '${escaped_model}', '${escaped_text}', ${response_time}, ${token_count}, ${cost});
SELECT last_insert_rowid();")

⚠️ Potential issue | 🟠 Major

Numeric parameters (prompt_id, response_time, token_count, cost) are interpolated into SQL without integer/float validation.

If a user passes --prompt "1; DROP TABLE prompts;--" or --time "0; DROP TABLE responses", the value is spliced directly into the SQL statement. While prompt_id is checked against the DB first (line 266-267, also unvalidated), response_time, token_count, and cost go straight into the INSERT on line 280.

Add numeric validation before SQL interpolation. This pattern applies throughout the script (also in cmd_score, cmd_compare, cmd_leaderboard, cmd_export, cmd_history, and prompt show).

🛡️ Proposed fix — validate before interpolation (example for cmd_record)
     if [[ -z "$prompt_id" || -z "$model_id" || -z "$response_text" ]]; then
         print_error "Usage: ..."
         return 1
     fi
 
+    # Validate numeric inputs
+    if ! [[ "$prompt_id" =~ ^[0-9]+$ ]]; then
+        print_error "prompt_id must be a positive integer"
+        return 1
+    fi
+    if ! [[ "$response_time" =~ ^[0-9]*\.?[0-9]+$ ]]; then
+        print_error "response_time must be a number"
+        return 1
+    fi
+    if ! [[ "$token_count" =~ ^[0-9]+$ ]]; then
+        print_error "token_count must be a positive integer"
+        return 1
+    fi
+    if ! [[ "$cost" =~ ^[0-9]*\.?[0-9]+$ ]]; then
+        print_error "cost must be a number"
+        return 1
+    fi
+
     # Verify prompt exists
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 259 - 281, Validate
numeric inputs before embedding them into SQL: ensure prompt_id and token_count
match an integer regex (e.g., ^[0-9]+$) and ensure response_time and cost match
a numeric/float regex (e.g., ^[0-9]+(\.[0-9]+)?$); reject or default and error
out if validation fails, then use the validated variables when constructing the
sqlite3 INSERT/select statements. Also apply the same validation to the
prompt_exists check (don't interpolate an unvalidated prompt_id into the SELECT)
and propagate this pattern to other commands named in the script (cmd_record,
cmd_score, cmd_compare, cmd_leaderboard, cmd_export, cmd_history and the prompt
show path) so all numeric parameters are validated before SQL interpolation.

Comment on lines +607 to +611

local where_clause=""
if [[ -n "$category" ]]; then
where_clause="WHERE p.category = '${category}'"
fi

⚠️ Potential issue | 🟠 Major

SQL injection via unescaped category in leaderboard WHERE clause.

The category value from --category is interpolated directly into the SQL WHERE clause without escaping.

🛡️ Proposed fix
     local where_clause=""
     if [[ -n "$category" ]]; then
-        where_clause="WHERE p.category = '${category}'"
+        local escaped_category
+        escaped_category=$(echo "$category" | sed "s/'/''/g")
+        where_clause="WHERE p.category = '${escaped_category}'"
     fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local where_clause=""
if [[ -n "$category" ]]; then
where_clause="WHERE p.category = '${category}'"
fi
local where_clause=""
if [[ -n "$category" ]]; then
local escaped_category
escaped_category=$(echo "$category" | sed "s/'/''/g")
where_clause="WHERE p.category = '${escaped_category}'"
fi
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 607 - 611, The WHERE
clause builds SQL by interpolating the category variable into where_clause,
which allows SQL injection; fix by properly escaping or validating category
before interpolation — e.g., sanitize category (allow only expected
values/patterns) or escape single quotes (replace ' with '' ) and then assign
where_clause="WHERE p.category = '${escaped_category}'"; alternatively use a
parameterized/psql variable substitution approach instead of direct
interpolation; update the code that sets where_clause and any caller that passes
category in .agents/scripts/response-scoring-helper.sh to use the
sanitized/escaped variable.

Comment on lines +716 to +737
if [[ "$format" == "csv" ]]; then
echo "prompt_id,prompt_title,response_id,model_id,correctness,completeness,code_quality,clarity,weighted_avg,response_time,token_count,cost_estimate"
sqlite3 -separator ',' "$SCORING_DB" "
SELECT r.prompt_id, p.title, r.response_id, r.model_id,
MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END),
MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END),
MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END),
MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END),
ROUND((
COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20
), 2),
r.response_time, r.token_count, r.cost_estimate
FROM responses r
JOIN prompts p ON r.prompt_id = p.prompt_id
LEFT JOIN scores s ON r.response_id = s.response_id
${where_clause}
GROUP BY r.response_id
ORDER BY r.prompt_id, r.model_id;
" 2>/dev/null

⚠️ Potential issue | 🟡 Minor

Export CSV weighted average calculation differs from compare/leaderboard — missing criteria skew the score.

The export query on lines 724-728 uses COALESCE(..., 0) for missing criteria, substituting 0 and implicitly normalising against the full 1.0 weight total (there is no division by the sum of scored weights at all). Meanwhile, cmd_compare (lines 466-480) and cmd_leaderboard (lines 618-631) correctly use NULLIF(SUM(weights), 0) to exclude unscored criteria from the denominator.

If a response has only one criterion scored (e.g., correctness=5), export reports 5*0.30 = 1.50, while compare reports 5.00. This is an inconsistency that produces misleading export data.

🔧 Align export with the same weighted-average formula used elsewhere
-                   ROUND((
-                       COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
-                       COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
-                       COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
-                       COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20
-                   ), 2),
+                   ROUND(
+                       (COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
+                        COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
+                        COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
+                        COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20)
+                       / NULLIF(
+                           (CASE WHEN MAX(CASE WHEN s.criterion = 'correctness' THEN 1 END) IS NOT NULL THEN 0.30 ELSE 0 END +
+                            CASE WHEN MAX(CASE WHEN s.criterion = 'completeness' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
+                            CASE WHEN MAX(CASE WHEN s.criterion = 'code_quality' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
+                            CASE WHEN MAX(CASE WHEN s.criterion = 'clarity' THEN 1 END) IS NOT NULL THEN 0.20 ELSE 0 END), 0)
+                   , 2),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ "$format" == "csv" ]]; then
echo "prompt_id,prompt_title,response_id,model_id,correctness,completeness,code_quality,clarity,weighted_avg,response_time,token_count,cost_estimate"
sqlite3 -separator ',' "$SCORING_DB" "
SELECT r.prompt_id, p.title, r.response_id, r.model_id,
MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END),
MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END),
MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END),
MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END),
ROUND((
COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20
), 2),
r.response_time, r.token_count, r.cost_estimate
FROM responses r
JOIN prompts p ON r.prompt_id = p.prompt_id
LEFT JOIN scores s ON r.response_id = s.response_id
${where_clause}
GROUP BY r.response_id
ORDER BY r.prompt_id, r.model_id;
" 2>/dev/null
if [[ "$format" == "csv" ]]; then
echo "prompt_id,prompt_title,response_id,model_id,correctness,completeness,code_quality,clarity,weighted_avg,response_time,token_count,cost_estimate"
sqlite3 -separator ',' "$SCORING_DB" "
SELECT r.prompt_id, p.title, r.response_id, r.model_id,
MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END),
MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END),
MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END),
MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END),
ROUND(
(COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20)
/ NULLIF(
(CASE WHEN MAX(CASE WHEN s.criterion = 'correctness' THEN 1 END) IS NOT NULL THEN 0.30 ELSE 0 END +
CASE WHEN MAX(CASE WHEN s.criterion = 'completeness' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
CASE WHEN MAX(CASE WHEN s.criterion = 'code_quality' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
CASE WHEN MAX(CASE WHEN s.criterion = 'clarity' THEN 1 END) IS NOT NULL THEN 0.20 ELSE 0 END), 0)
, 2),
r.response_time, r.token_count, r.cost_estimate
FROM responses r
JOIN prompts p ON r.prompt_id = p.prompt_id
LEFT JOIN scores s ON r.response_id = s.response_id
${where_clause}
GROUP BY r.response_id
ORDER BY r.prompt_id, r.model_id;
" 2>/dev/null
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 716 - 737, The CSV
export query currently uses COALESCE(..., 0) which treats missing criteria as
zero and biases the weighted average; update the SELECT that builds the
weighted_avg (inside the format == "csv" branch) to compute a proper weighted
average like cmd_compare and cmd_leaderboard: build a numerator by summing
weighted MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END) * 0.30 +
... (for completeness 0.25, code_quality 0.25, clarity 0.20) and build a
denominator as NULLIF( (CASE WHEN MAX(CASE WHEN s.criterion='correctness' THEN
s.score END) IS NOT NULL THEN 0.30 ELSE 0 END) + ... , 0); then use
ROUND(numerator / denominator, 2) so missing criteria are excluded from the
denominator rather than treated as zero; change the expression that currently
uses COALESCE(...) and weights to this numerator/denominator pattern referencing
the same MAX(CASE ...) expressions and weights.

…(t168.3)

Add response-scoring-helper.sh with SQLite-backed evaluation engine:
- Structured scoring on 4 weighted criteria: correctness (30%),
  completeness (25%), code quality (25%), clarity (20%)
- Side-by-side comparison with ranked output and winner declaration
- Aggregate leaderboard across all evaluations
- JSON and CSV export for analysis
- Prompt management (create, list, show)
- Response recording with timing, token count, and cost metadata
- Multi-scorer support (human, AI, etc.)
- /score-responses slash command and subagent documentation
- 41 passing tests covering all commands and edge cases
- ShellCheck clean (zero violations)

Complements existing compare-models (specs) with actual response quality
evaluation. Workflow: identify candidates -> check availability -> evaluate
outputs -> inform model routing decisions.
github-actions bot commented Feb 9, 2026

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 53 code smells

[INFO] Recent monitoring activity:
Mon Feb 9 15:43:57 UTC 2026: Code review monitoring started
Mon Feb 9 15:43:58 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 53

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 53
  • VULNERABILITIES: 0

Generated on: Mon Feb 9 15:44:00 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud bot commented Feb 9, 2026

@marcusquinn merged commit c89471c into main Feb 9, 2026
9 of 11 checks passed
@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @.agents/scripts/response-scoring-helper.sh:
- Around line 58-110: The SQLite foreign key constraints in init_db (and the
comparisons/winner_id, responses.prompt_id, scores.response_id relations) are
never enforced because PRAGMA foreign_keys=ON is not set per connection; add a
PRAGMA foreign_keys = ON execution before any DB DDL/DML is run (e.g., in
init_db and in the connection wrapper used by ensure_db, cmd_record, cmd_score)
so every write/read session enables foreign keys for that connection; ensure the
PRAGMA is issued on the same sqlite3 connection used for subsequent CREATE TABLE
/ INSERT / UPDATE statements to enforce referential integrity for comparisons,
responses, and scores.
- Around line 557-588: The JSON building in _compare_json interpolates unescaped
strings (notably model_id and prompt_title) which can break JSON; either
construct the object with jq if available or ensure proper escaping of
backslashes and double-quotes (and strip/escape control chars) for model_id and
prompt_title before embedding them into the JSON string; update the same
escaping logic used for prompt_title to also run on model_id (and apply to the
leaderboard output path referenced around line 660) so all string fields are
JSON-safe.
- Around line 338-342: The current range check using [[ "$value" -lt 1 ||
"$value" -gt 5 ]] 2>/dev/null silently accepts non-numeric input; change the
validation in response-scoring-helper.sh to first assert the input is an integer
(e.g., test $value with a regex like ^[0-9]+$ using [[ ]] ), and only then
perform the numeric range check against 1–5 (the existing -lt/-gt checks) for
the variables value and criterion; if the integer test fails or the range check
fails, call print_error "Score for ${criterion} must be an integer 1-5, got:
${value}" and return 1.
🧹 Nitpick comments (4)
.agents/scripts/response-scoring-helper.sh (2)

384-391: Shell variables expanded directly inside awk program strings — prefer -v for safety and clarity.

Lines 386-387 and 395 inject $weighted_total, $score, $weight, and $weight_sum via double-quoted awk program strings. While the values here flow from the database (with a CHECK constraint) and a hardcoded weight helper, this pattern is fragile — any unexpected character in a DB value would be interpreted as awk code.

Using awk's -v flag is more robust and considered idiomatic:

♻️ Suggested improvement
-        weighted_total=$(awk "BEGIN{printf \"%.4f\", $weighted_total + ($score * $weight)}")
-        weight_sum=$(awk "BEGIN{printf \"%.4f\", $weight_sum + $weight}")
+        weighted_total=$(awk -v wt="$weighted_total" -v sc="$score" -v w="$weight" 'BEGIN{printf "%.4f", wt + (sc * w)}')
+        weight_sum=$(awk -v ws="$weight_sum" -v w="$weight" 'BEGIN{printf "%.4f", ws + w}')

363-368: Inconsistent use of log_stderr wrapper for sqlite3 calls.

_show_response_scores (line 367-368) and cmd_compare (line 454-455) call sqlite3 directly, while other commands consistently use the log_stderr wrapper. This means failures in these calls won't appear in the log file, making debugging harder.

♻️ Proposed fix (example for _show_response_scores)
     local model_id
-    model_id=$(sqlite3 "$SCORING_DB" \
-        "SELECT model_id FROM responses WHERE response_id = ${response_id};")
+    model_id=$(log_stderr "show scores" sqlite3 "$SCORING_DB" \
+        "SELECT model_id FROM responses WHERE response_id = ${response_id};")
tests/test-response-scoring.sh (2)

215-222: Missing test for non-numeric score input — would expose the validation bypass.

The test on line 216 validates out-of-range integer (6), which the script does catch. However, a non-numeric value like abc silently passes the validation in the helper (see review on line 338-342 of the helper script) and would fail at the SQLite level instead. Adding a non-numeric test case would catch this gap and serve as a regression test for the fix:

# Test score validation (non-numeric)
score_nan=$(bash "$HELPER" score --response 1 --correctness "abc" 2>&1) || true
if echo "$score_nan" | grep -qi "must be 1-5\|error"; then
    pass "score rejects non-numeric value"
else
    fail "score should reject non-numeric value" "$score_nan"
fi

344-369: Consider restoring SCORING_DB_OVERRIDE or adding a comment after switching to the fresh DB.

Line 346 changes SCORING_DB_OVERRIDE to a fresh database for edge-case tests. Any tests added after this block would unknowingly run against the empty DB instead of the fully populated one. A brief comment or a restoration step would prevent future confusion:

+# NOTE: SCORING_DB_OVERRIDE now points to a fresh empty DB for the remaining edge-case tests.
 fresh_db="$TEST_DB_DIR/fresh.db"
 export SCORING_DB_OVERRIDE="$fresh_db"

Comment on lines +58 to +110
init_db() {
mkdir -p "$SCORING_DIR" 2>/dev/null || true

log_stderr "db init" sqlite3 "$SCORING_DB" "
CREATE TABLE IF NOT EXISTS prompts (
prompt_id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
prompt_text TEXT NOT NULL,
category TEXT DEFAULT 'general',
difficulty TEXT DEFAULT 'medium',
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);

CREATE TABLE IF NOT EXISTS responses (
response_id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
model_id TEXT NOT NULL,
response_text TEXT NOT NULL,
response_time REAL DEFAULT 0.0,
token_count INTEGER DEFAULT 0,
cost_estimate REAL DEFAULT 0.0,
recorded_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id)
);

CREATE TABLE IF NOT EXISTS scores (
score_id INTEGER PRIMARY KEY AUTOINCREMENT,
response_id INTEGER NOT NULL,
criterion TEXT NOT NULL,
score INTEGER NOT NULL CHECK(score BETWEEN 1 AND 5),
rationale TEXT DEFAULT '',
scored_by TEXT DEFAULT 'human',
scored_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (response_id) REFERENCES responses(response_id),
UNIQUE(response_id, criterion, scored_by)
);

CREATE TABLE IF NOT EXISTS comparisons (
comparison_id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
winner_id INTEGER,
notes TEXT DEFAULT '',
compared_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id),
FOREIGN KEY (winner_id) REFERENCES responses(response_id)
);

CREATE INDEX IF NOT EXISTS idx_responses_prompt ON responses(prompt_id);
CREATE INDEX IF NOT EXISTS idx_responses_model ON responses(model_id);
CREATE INDEX IF NOT EXISTS idx_scores_response ON scores(response_id);
"
return 0
}

⚠️ Potential issue | 🟡 Minor

Foreign keys are declared but never enforced — SQLite requires an explicit pragma.

SQLite ignores FOREIGN KEY constraints by default. Without PRAGMA foreign_keys = ON; before the DML statements, referential integrity (e.g., inserting a response for a deleted prompt, or a comparison with a bogus winner_id) is not enforced at the database level.

The manual existence checks in cmd_record and cmd_score partially compensate, but the comparisons table's winner_id is never validated by application code.

🔧 Proposed fix
 init_db() {
     mkdir -p "$SCORING_DIR" 2>/dev/null || true
 
     log_stderr "db init" sqlite3 "$SCORING_DB" "
+        PRAGMA foreign_keys = ON;
+
         CREATE TABLE IF NOT EXISTS prompts (

Note: PRAGMA foreign_keys is per-connection and not persisted, so you'd also want to issue it before every write session (e.g., in ensure_db or a wrapper).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
init_db() {
mkdir -p "$SCORING_DIR" 2>/dev/null || true
log_stderr "db init" sqlite3 "$SCORING_DB" "
CREATE TABLE IF NOT EXISTS prompts (
prompt_id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
prompt_text TEXT NOT NULL,
category TEXT DEFAULT 'general',
difficulty TEXT DEFAULT 'medium',
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);
CREATE TABLE IF NOT EXISTS responses (
response_id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
model_id TEXT NOT NULL,
response_text TEXT NOT NULL,
response_time REAL DEFAULT 0.0,
token_count INTEGER DEFAULT 0,
cost_estimate REAL DEFAULT 0.0,
recorded_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id)
);
CREATE TABLE IF NOT EXISTS scores (
score_id INTEGER PRIMARY KEY AUTOINCREMENT,
response_id INTEGER NOT NULL,
criterion TEXT NOT NULL,
score INTEGER NOT NULL CHECK(score BETWEEN 1 AND 5),
rationale TEXT DEFAULT '',
scored_by TEXT DEFAULT 'human',
scored_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (response_id) REFERENCES responses(response_id),
UNIQUE(response_id, criterion, scored_by)
);
CREATE TABLE IF NOT EXISTS comparisons (
comparison_id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
winner_id INTEGER,
notes TEXT DEFAULT '',
compared_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id),
FOREIGN KEY (winner_id) REFERENCES responses(response_id)
);
CREATE INDEX IF NOT EXISTS idx_responses_prompt ON responses(prompt_id);
CREATE INDEX IF NOT EXISTS idx_responses_model ON responses(model_id);
CREATE INDEX IF NOT EXISTS idx_scores_response ON scores(response_id);
"
return 0
}
init_db() {
mkdir -p "$SCORING_DIR" 2>/dev/null || true
log_stderr "db init" sqlite3 "$SCORING_DB" "
PRAGMA foreign_keys = ON;
CREATE TABLE IF NOT EXISTS prompts (
prompt_id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
prompt_text TEXT NOT NULL,
category TEXT DEFAULT 'general',
difficulty TEXT DEFAULT 'medium',
created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);
CREATE TABLE IF NOT EXISTS responses (
response_id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
model_id TEXT NOT NULL,
response_text TEXT NOT NULL,
response_time REAL DEFAULT 0.0,
token_count INTEGER DEFAULT 0,
cost_estimate REAL DEFAULT 0.0,
recorded_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id)
);
CREATE TABLE IF NOT EXISTS scores (
score_id INTEGER PRIMARY KEY AUTOINCREMENT,
response_id INTEGER NOT NULL,
criterion TEXT NOT NULL,
score INTEGER NOT NULL CHECK(score BETWEEN 1 AND 5),
rationale TEXT DEFAULT '',
scored_by TEXT DEFAULT 'human',
scored_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (response_id) REFERENCES responses(response_id),
UNIQUE(response_id, criterion, scored_by)
);
CREATE TABLE IF NOT EXISTS comparisons (
comparison_id INTEGER PRIMARY KEY AUTOINCREMENT,
prompt_id INTEGER NOT NULL,
winner_id INTEGER,
notes TEXT DEFAULT '',
compared_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id),
FOREIGN KEY (winner_id) REFERENCES responses(response_id)
);
CREATE INDEX IF NOT EXISTS idx_responses_prompt ON responses(prompt_id);
CREATE INDEX IF NOT EXISTS idx_responses_model ON responses(model_id);
CREATE INDEX IF NOT EXISTS idx_scores_response ON scores(response_id);
"
return 0
}
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 58 - 110, The SQLite
foreign key constraints in init_db (and the comparisons/winner_id,
responses.prompt_id, scores.response_id relations) are never enforced because
PRAGMA foreign_keys=ON is not set per connection; add a PRAGMA foreign_keys = ON
execution before any DB DDL/DML is run (e.g., in init_db and in the connection
wrapper used by ensure_db, cmd_record, cmd_score) so every write/read session
enables foreign keys for that connection; ensure the PRAGMA is issued on the
same sqlite3 connection used for subsequent CREATE TABLE / INSERT / UPDATE
statements to enforce referential integrity for comparisons, responses, and
scores.

Comment on lines +338 to +342
# Validate score range
if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
print_error "Score for ${criterion} must be 1-5, got: ${value}"
return 1
fi

⚠️ Potential issue | 🟠 Major

Score validation silently passes non-numeric input.

The -lt / -gt operators inside [[ ]] emit an error (exit code 2) on non-integer strings, and 2>/dev/null suppresses it. The effect is that a value like abc or 3; DROP TABLE scores;-- makes both comparisons fail, the overall condition evaluates to false, and the value sails through to the SQL INSERT on line 346.

This is the score-specific manifestation of the numeric-injection gap already noted, but the mechanism is subtle enough to call out: the validation looks present but doesn't actually guard non-numeric input.

🛡️ Proposed fix — validate integer first, then range-check
         if [[ -z "$value" ]]; then
             continue
         fi
 
-        # Validate score range
-        if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
+        # Validate integer and score range
+        if ! [[ "$value" =~ ^[0-9]+$ ]] || [[ "$value" -lt 1 || "$value" -gt 5 ]]; then
             print_error "Score for ${criterion} must be 1-5, got: ${value}"
             return 1
         fi
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 338 - 342, The
current range check using [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null
silently accepts non-numeric input; change the validation in
response-scoring-helper.sh to first assert the input is an integer (e.g., test
$value with a regex like ^[0-9]+$ using [[ ]] ), and only then perform the
numeric range check against 1–5 (the existing -lt/-gt checks) for the variables
value and criterion; if the integer test fails or the range check fails, call
print_error "Score for ${criterion} must be an integer 1-5, got: ${value}" and
return 1.

Comment on lines +557 to +588
_compare_json() {
local prompt_id="$1"
local prompt_title="$2"
local responses="$3"

local json_entries=()
local rank=0

while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
rank=$((rank + 1))

local corr="null" comp="null" code="null" clar="null"
if [[ -n "$score_detail" ]]; then
local val
val=$(echo "$score_detail" | tr ',' '\n' | grep "^correctness:" | cut -d: -f2)
[[ -n "$val" ]] && corr="$val"
val=$(echo "$score_detail" | tr ',' '\n' | grep "^completeness:" | cut -d: -f2)
[[ -n "$val" ]] && comp="$val"
val=$(echo "$score_detail" | tr ',' '\n' | grep "^code_quality:" | cut -d: -f2)
[[ -n "$val" ]] && code="$val"
val=$(echo "$score_detail" | tr ',' '\n' | grep "^clarity:" | cut -d: -f2)
[[ -n "$val" ]] && clar="$val"
fi

json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${model_id}\",\"scores\":{\"correctness\":${corr},\"completeness\":${comp},\"code_quality\":${code},\"clarity\":${clar}},\"weighted_avg\":${wavg:-0},\"response_time\":${rtime:-0},\"tokens\":${tokens:-0},\"cost\":${cost:-0}}")
done <<< "$responses"

local escaped_title
escaped_title=$(echo "$prompt_title" | sed 's/"/\\"/g')
echo "{\"prompt_id\":${prompt_id},\"title\":\"${escaped_title}\",\"responses\":[$(IFS=,; echo "${json_entries[*]}")]}"
return 0
}

⚠️ Potential issue | 🟡 Minor

JSON output is vulnerable to breakage from special characters in model_id.

model_id (line 581) is interpolated directly into the JSON string with zero escaping. If a model ID contains ", \, or control characters, the output is malformed JSON. The same applies to JSON leaderboard output (line 660).

prompt_title (line 585) escapes only " but not \ or control characters, which is incomplete.

Consider using jq (if available) or at minimum applying the same sed escaping for \ and ":

🔧 Proposed fix (minimal — escape backslash then double-quote)
     while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
         rank=$((rank + 1))
+        # Escape JSON-special characters
+        local escaped_model
+        escaped_model=$(printf '%s' "$model_id" | sed 's/\\/\\\\/g; s/"/\\"/g')
 
         ...
 
-        json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${model_id}\",...")
+        json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${escaped_model}\",...")
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 557 - 588, The JSON
building in _compare_json interpolates unescaped strings (notably model_id and
prompt_title) which can break JSON; either construct the object with jq if
available or ensure proper escaping of backslashes and double-quotes (and
strip/escape control chars) for model_id and prompt_title before embedding them
into the JSON string; update the same escaping logic used for prompt_title to
also run on model_id (and apply to the leaderboard output path referenced around
line 660) so all string fields are JSON-safe.
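As a sketch of the jq route (assuming jq is available; rid, model_id, and wavg mirror the variables in the _compare_json read loop, with sample values here, and this is not the PR's code):

```bash
# Sample values standing in for one row of the compare query.
rid=3
model_id='model "beta" \ v1'
wavg=4.55

# jq handles all JSON escaping for string fields such as model_id.
entry=$(jq -cn \
    --arg rid "$rid" \
    --arg model "$model_id" \
    --arg wavg "$wavg" \
    '{response_id: ($rid | tonumber), model: $model, weighted_avg: ($wavg | tonumber)}')

echo "$entry"
# -> {"response_id":3,"model":"model \"beta\" \\ v1","weighted_avg":4.55}
```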
