feat: response comparison and scoring framework for model evaluation (t168.3) #773
Conversation
Walkthrough

Adds a response scoring framework: documentation, a SQLite-backed Bash CLI helper script, and an end-to-end test suite enabling prompt management, response recording, multi-criterion scoring, comparisons, leaderboards, and data export.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant CLI as Response Scoring CLI
    participant DB as SQLite Database
    participant Scorer as Scoring Logic
    participant Renderer as Output Renderer
    User->>CLI: init
    CLI->>DB: create schema (prompts, responses, scores, comparisons)
    DB-->>CLI: ready
    User->>CLI: prompt add "<text>"
    CLI->>DB: insert prompt
    DB-->>CLI: prompt_id
    User->>CLI: record <prompt_id> <model> <response_file>
    CLI->>DB: insert response + metadata
    DB-->>CLI: response_id
    User->>CLI: score <response_id> correctness=5 completeness=4 code_quality=5 clarity=5
    CLI->>Scorer: validate & compute weighted total (30/25/25/20)
    Scorer->>DB: persist per-criterion and aggregate score
    DB-->>CLI: score stored
    User->>CLI: compare <prompt_id>
    CLI->>DB: fetch responses + scores
    DB-->>CLI: result set
    CLI->>Renderer: format table or JSON
    Renderer-->>User: comparison output
    User->>CLI: leaderboard [--json]
    CLI->>DB: aggregate by model
    DB-->>CLI: rankings
    CLI->>Renderer: render leaderboard
    Renderer-->>User: ranked results
```
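As a sketch of the weighted total referenced in the score step (weights taken from this PR's criteria; the normalized denominator mirrors the NULLIF pattern used in the compare query reviewed below), a fully scored response works out as:

```
weighted_avg = (correctness*0.30 + completeness*0.25 + code_quality*0.25 + clarity*0.20)
             / (sum of weights for the criteria actually scored)

e.g. correctness=5, completeness=4, code_quality=5, clarity=5:
     (1.50 + 1.00 + 1.25 + 1.00) / 1.00 = 4.75
```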
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status
[INFO] Recent monitoring activity
📈 Current Quality Metrics

Generated on: Mon Feb 9 15:15:25 UTC 2026 by AI DevOps Framework Code Review Monitoring
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new response scoring framework to evaluate AI model responses based on correctness, completeness, code quality, and clarity. It provides tools for managing prompts, recording responses, scoring, comparing, and generating leaderboards, enabling informed model routing and selection.
Code Review
This pull request introduces a comprehensive framework for scoring and comparing AI model responses, which is a great addition to the model evaluation toolchain. It includes a new helper script, subagent definition, documentation, and a thorough test suite. The code is well-structured and follows most of the project's conventions.
My review focuses on security, maintainability, and adherence to the repository's style guide. I've identified a critical SQL injection risk and suggest a mitigation. I've also pointed out several places where hardcoded values could be centralized, plus a few violations of the shell scripting style guide around error handling and linting; the suggested changes keep the script self-contained, in line with the repository's rules on script independence.
```bash
escaped_title=$(echo "$title" | sed "s/'/''/g")
escaped_text=$(echo "$text" | sed "s/'/''/g")
```
The script constructs SQL queries by concatenating user-provided strings, which is a significant security risk leading to SQL Injection vulnerabilities. For example, title and text are directly embedded into the SQL string. A malicious input could break out of the string literal and execute arbitrary SQL commands. The repository style guide requires using parameterized queries where possible.
While the sqlite3 CLI makes true parameterization difficult, using printf '%s' is safer than echo for preparing the values, since printf doesn't interpret backslashes or option-like strings such as -n.
```diff
-escaped_title=$(echo "$title" | sed "s/'/''/g")
-escaped_text=$(echo "$text" | sed "s/'/''/g")
+escaped_title=$(printf '%s' "$title" | sed "s/'/''/g")
+escaped_text=$(printf '%s' "$text" | sed "s/'/''/g")
```
```bash
_get_criterion_weight() {
  local criterion="$1"
  case "$criterion" in
    correctness) echo "0.30" ;;
    completeness) echo "0.25" ;;
    code_quality) echo "0.25" ;;
    clarity) echo "0.20" ;;
    *) echo "0.25" ;;
  esac
  return 0
}
```
The weights for scoring criteria are hardcoded in this function, and also in SQL queries within cmd_compare, cmd_leaderboard, and cmd_export. This violates the DRY (Don't Repeat Yourself) principle and makes the script hard to maintain, as the weights are already defined in the SCORING_CRITERIA constant but are not used here. Additionally, the default case *) echo "0.25" ;; can hide bugs by assigning an arbitrary weight to unknown criteria.
This function should parse the weight from the SCORING_CRITERIA constant. For unknown criteria, it should return an error. This will centralize the logic and make the system more robust. You should then consider refactoring the SQL queries to use this centralized logic instead of hardcoded weights.
```diff
 _get_criterion_weight() {
   local criterion="$1"
-  case "$criterion" in
-    correctness) echo "0.30" ;;
-    completeness) echo "0.25" ;;
-    code_quality) echo "0.25" ;;
-    clarity) echo "0.20" ;;
-    *) echo "0.25" ;;
-  esac
-  return 0
+  local weight
+  # Parse weight from the central SCORING_CRITERIA constant
+  weight=$(echo "$SCORING_CRITERIA" | grep "^${criterion}|" | cut -d'|' -f2)
+  if [[ -z "$weight" ]]; then
+    print_error "Unknown criterion: '${criterion}'. Please check SCORING_CRITERIA."
+    return 1
+  fi
+  echo "$weight"
+  return 0
 }
```
```diff
@@ -0,0 +1,933 @@
+#!/usr/bin/env bash
+# shellcheck disable=SC1091,SC2001
```
The script uses a global shellcheck disable for SC2001. The repository style guide requires using "targeted inline disables with reason comments only". Additionally, SC2001 can often be avoided by using Bash's native parameter expansion, which is more efficient than sed.
For SC2001 violations (e.g., echo "$var" | sed "s/'/''/g"), you can replace them with "${var//\'/\'\'}", which is both more performant and compliant with modern shell scripting practices.
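A minimal sketch of that substitution, wrapped in a hypothetical sql_escape helper (the function name is illustrative, not part of this PR):

```bash
# Double each single quote for SQLite string literals using native
# parameter expansion instead of an echo | sed pipeline (avoids SC2001).
sql_escape() {
  local s=$1 q="'"
  printf '%s' "${s//$q/$q$q}"
}

escaped_title=$(sql_escape "$title")
```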
References
- The style guide requires ShellCheck violations to be disabled inline with a reason, not globally at the top of the file.
| mkdir -p "$SCORING_DIR" 2>/dev/null || true | ||
|
|
||
| log_stderr "db init" sqlite3 "$SCORING_DB" " | ||
| CREATE TABLE IF NOT EXISTS prompts ( |
The repository style guide specifies that "All SQLite databases use WAL mode + busy_timeout=5000". The init_db function is missing these PRAGMA settings. This can affect database concurrency and performance under load.
```diff
+  PRAGMA journal_mode=WAL;
+  PRAGMA busy_timeout=5000;
   CREATE TABLE IF NOT EXISTS prompts (
```
References
- The style guide requires all SQLite databases to be configured with WAL mode and a busy timeout.
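One way to satisfy this on every connection is a small wrapper; a sketch, assuming all DB access funnels through one hypothetical run_sql function. Note that both PRAGMA journal_mode=WAL and PRAGMA busy_timeout echo a value to stdout in the sqlite3 shell, which would pollute captured query output, so the sketch sets WAL once at init and uses the silent .timeout dot-command per connection:

```bash
# Hypothetical helpers, not part of this PR.
# WAL mode persists in the database file, so set it once at init,
# discarding the "wal" confirmation that sqlite3 prints:
sqlite3 "$SCORING_DB" "PRAGMA journal_mode=WAL;" >/dev/null

# busy_timeout is per-connection; .timeout sets it without printing.
run_sql() {
  local db="$1" sql="$2"
  sqlite3 -cmd '.timeout 5000' "$db" "$sql"
}
```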
```bash
responses=$(sqlite3 -separator '|' "$SCORING_DB" "
  SELECT r.response_id, r.model_id, r.response_time, r.token_count, r.cost_estimate,
    COALESCE((
      SELECT ROUND(
        SUM(CASE s.criterion
              WHEN 'correctness' THEN s.score * 0.30
              WHEN 'completeness' THEN s.score * 0.25
              WHEN 'code_quality' THEN s.score * 0.25
              WHEN 'clarity' THEN s.score * 0.20
              ELSE s.score * 0.25
            END)
        / NULLIF(SUM(CASE s.criterion
              WHEN 'correctness' THEN 0.30
              WHEN 'completeness' THEN 0.25
              WHEN 'code_quality' THEN 0.25
              WHEN 'clarity' THEN 0.20
              ELSE 0.25
            END), 0)
      , 2)
      FROM scores s WHERE s.response_id = r.response_id
    ), 0) as weighted_avg,
    (SELECT GROUP_CONCAT(s2.criterion || ':' || s2.score, ',')
     FROM scores s2 WHERE s2.response_id = r.response_id) as score_detail
  FROM responses r
  WHERE r.prompt_id = ${prompt_id}
  ORDER BY weighted_avg DESC;
" 2>/dev/null)
```
This sqlite3 call suppresses all stderr output with 2>/dev/null. This violates the repository style guide, which states that 2>/dev/null should not be used for blanket suppression. If there is a syntax error in the SQL or another database issue, it will be silently ignored, making debugging difficult. The log_stderr helper, if simple and self-contained, should be duplicated within this standalone script to avoid source dependencies, as per repository guidelines for script independence.
Please wrap the sqlite3 command with a duplicated log_stderr function (if simple) and remove the 2>/dev/null redirection to ensure errors are properly logged. For example:

```bash
responses=$(log_stderr "compare query" sqlite3 ...)
```

The same applies to the other similar sqlite3 calls in cmd_leaderboard, cmd_export, and cmd_history.
References

- The style guide prohibits blanket error suppression with 2>/dev/null, recommending logging instead.
- For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues.
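The log_stderr implementation itself isn't shown in this diff; a minimal self-contained version, duplicated per the script-independence rule, might look like the following sketch (an assumed shape, not the PR's actual helper):

```bash
# Assumed minimal log_stderr: run a command, pass stdout through untouched,
# label any stderr instead of discarding it, and preserve the exit status.
log_stderr() {
  local label="$1"; shift
  local tmp rc
  tmp=$(mktemp)
  "$@" 2>"$tmp"
  rc=$?
  if [[ -s "$tmp" ]]; then
    sed "s/^/[${label}] /" "$tmp" >&2
  fi
  rm -f "$tmp"
  return "$rc"
}
```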
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.agents/scripts/response-scoring-helper.sh:
- Around line 155-161: The SQL is vulnerable because category and difficulty are
interpolated directly into the sqlite3 INSERT; add the same single-quote
escaping used for title/text: declare local escaped_category and
escaped_difficulty, set escaped_category=$(echo "$category" | sed "s/'/''/g")
and escaped_difficulty=$(echo "$difficulty" | sed "s/'/''/g"), and then use
'${escaped_category}' and '${escaped_difficulty}' in the log_stderr / sqlite3
INSERT command (the prompt_id assignment) instead of the raw variables.
- Around line 607-611: The WHERE clause builds SQL by interpolating the category
variable into where_clause, which allows SQL injection; fix by properly escaping
or validating category before interpolation — e.g., sanitize category (allow
only expected values/patterns) or escape single quotes (replace ' with '' ) and
then assign where_clause="WHERE p.category = '${escaped_category}'";
alternatively use a parameterized or sqlite3-native variable-substitution approach instead of
direct interpolation; update the code that sets where_clause and any caller that
passes category in .agents/scripts/response-scoring-helper.sh to use the
sanitized/escaped variable.
- Around line 259-281: Validate numeric inputs before embedding them into SQL:
ensure prompt_id and token_count match an integer regex (e.g., ^[0-9]+$) and
ensure response_time and cost match a numeric/float regex (e.g.,
^[0-9]+(\.[0-9]+)?$); reject or default and error out if validation fails, then
use the validated variables when constructing the sqlite3 INSERT/select
statements. Also apply the same validation to the prompt_exists check (don't
interpolate an unvalidated prompt_id into the SELECT) and propagate this pattern
to other commands named in the script (cmd_record, cmd_score, cmd_compare,
cmd_leaderboard, cmd_export, cmd_history and the prompt show path) so all
numeric parameters are validated before SQL interpolation.
- Around line 716-737: The CSV export query currently uses COALESCE(..., 0)
which treats missing criteria as zero and biases the weighted average; update
the SELECT that builds the weighted_avg (inside the format == "csv" branch) to
compute a proper weighted average like cmd_compare and cmd_leaderboard: build a
numerator by summing weighted MAX(CASE WHEN s.criterion = 'correctness' THEN
s.score END) * 0.30 + ... (for completeness 0.25, code_quality 0.25, clarity
0.20) and build a denominator as NULLIF( (CASE WHEN MAX(CASE WHEN
s.criterion='correctness' THEN s.score END) IS NOT NULL THEN 0.30 ELSE 0 END) +
... , 0); then use ROUND(numerator / denominator, 2) so missing criteria are
excluded from the denominator rather than treated as zero; change the expression
that currently uses COALESCE(...) and weights to this numerator/denominator
pattern referencing the same MAX(CASE ...) expressions and weights.
🧹 Nitpick comments (5)
.agents/scripts/response-scoring-helper.sh (4)
338-342: Score validation works by accident for non-numeric input.

The -lt/-gt operators inside [[ ]] error out on non-numeric values, and 2>/dev/null suppresses the message. The non-zero exit code from the error coincidentally triggers the if body, so the user sees "must be 1-5" rather than a proper "not a number" message. This works but is fragile and misleading.

♻️ Explicit numeric check

```diff
 if [[ -z "$value" ]]; then
   continue
 fi
 # Validate score range
-if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
-  print_error "Score for ${criterion} must be 1-5, got: ${value}"
+if ! [[ "$value" =~ ^[1-5]$ ]]; then
+  print_error "Score for ${criterion} must be an integer 1-5, got: ${value}"
   return 1
 fi
```

384-398: Awk expressions interpolate shell variables into the program text: a code injection vector.

In _show_response_scores, $score, $weight, $weighted_total, and $weight_sum are interpolated directly into awk BEGIN blocks (lines 386-387, 395). If any of these values were tampered with (e.g., a crafted criterion name in the DB returning something other than a number), arbitrary awk code could execute. In practice these come from the DB and weight helper, so risk is low, but it's worth tightening.

♻️ Use awk -v to pass variables safely

```diff
-weighted_total=$(awk "BEGIN{printf \"%.4f\", $weighted_total + ($score * $weight)}")
-weight_sum=$(awk "BEGIN{printf \"%.4f\", $weight_sum + $weight}")
+weighted_total=$(awk -v wt="$weighted_total" -v s="$score" -v w="$weight" 'BEGIN{printf "%.4f", wt + (s * w)}')
+weight_sum=$(awk -v ws="$weight_sum" -v w="$weight" 'BEGIN{printf "%.4f", ws + w}')
```

And similarly on line 395:

```diff
-weighted_avg=$(awk "BEGIN{printf \"%.2f\", $weighted_total / $weight_sum}")
+weighted_avg=$(awk -v wt="$weighted_total" -v ws="$weight_sum" 'BEGIN{printf "%.2f", wt / ws}')
```

521-538: rank variable incremented inside a piped while-loop (subshell): the counter resets each invocation and the parent never sees it.

echo "$responses" | while ... creates a subshell. The rank counter works correctly within the loop for display purposes, but this is a subtle footgun if anyone later tries to use rank after the loop. The same pattern appears on line 677.

Consider using a here-string or process substitution (while ... done <<< "$responses") to keep the loop in the current shell, consistent with how _compare_json (line 565) and _show_response_scores (line 388) already do it.

♻️ Consistent here-string pattern

```diff
 local rank=0
-echo "$responses" | while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
+while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
   rank=$((rank + 1))
   ...
-done
+done <<< "$responses"
```

130-147: Unrecognized arguments are silently swallowed (*) shift ;;).

Every argument parser has a catch-all that discards unknown flags. A typo like --propmpt instead of --prompt would be silently ignored and the required parameter would appear missing, producing a confusing "Usage:" error far from the root cause. Consider emitting a warning for unknown flags.

♻️ Example: warn on unknown flags

```diff
-*) shift ;;
+*)
+  print_warning "Unknown option ignored: $1"
+  shift ;;
```

Also applies to: 239-256, 297-306, 439-444, 599-605, 702-708

tests/test-response-scoring.sh (1)

344-369: SCORING_DB_OVERRIDE is changed mid-test but never restored: future tests would silently use the empty DB.

Line 346 switches SCORING_DB_OVERRIDE to a fresh empty database for edge-case testing, but it's never reset. Any tests added after line 369 would use the empty DB instead of the one populated by earlier tests.

♻️ Restore the override after edge case tests

```diff
+# Save original DB path
+ORIGINAL_DB="$SCORING_DB_OVERRIDE"
+
 # Empty prompt list on fresh DB
 fresh_db="$TEST_DB_DIR/fresh.db"
 export SCORING_DB_OVERRIDE="$fresh_db"
 bash "$HELPER" init >/dev/null 2>&1 || true
 ...
 # Leaderboard with no data
 empty_lb=$(bash "$HELPER" leaderboard 2>&1) || true
 if echo "$empty_lb" | grep -qi "no scored\|leaderboard"; then
   pass "leaderboard handles empty database"
 else
   fail "leaderboard should handle empty database" "$empty_lb"
 fi
+
+# Restore original DB for any subsequent tests
+export SCORING_DB_OVERRIDE="$ORIGINAL_DB"
```
```bash
local escaped_title escaped_text
escaped_title=$(echo "$title" | sed "s/'/''/g")
escaped_text=$(echo "$text" | sed "s/'/''/g")

local prompt_id
prompt_id=$(log_stderr "prompt add" sqlite3 "$SCORING_DB" \
  "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${category}', '${difficulty}'); SELECT last_insert_rowid();")
```
SQL injection via unescaped category and difficulty parameters.
title and text are single-quote-escaped via sed, but category and difficulty are interpolated directly into the SQL string on line 161. A value like general'; DROP TABLE prompts;-- would be injected verbatim.
Apply the same escaping treatment as title and text:
🛡️ Proposed fix

```diff
 local escaped_title escaped_text
 escaped_title=$(echo "$title" | sed "s/'/''/g")
 escaped_text=$(echo "$text" | sed "s/'/''/g")
+local escaped_category escaped_difficulty
+escaped_category=$(echo "$category" | sed "s/'/''/g")
+escaped_difficulty=$(echo "$difficulty" | sed "s/'/''/g")
 local prompt_id
 prompt_id=$(log_stderr "prompt add" sqlite3 "$SCORING_DB" \
-  "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${category}', '${difficulty}'); SELECT last_insert_rowid();")
+  "INSERT INTO prompts (title, prompt_text, category, difficulty) VALUES ('${escaped_title}', '${escaped_text}', '${escaped_category}', '${escaped_difficulty}'); SELECT last_insert_rowid();")
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 155 - 161, The SQL
is vulnerable because category and difficulty are interpolated directly into the
sqlite3 INSERT; add the same single-quote escaping used for title/text: declare
local escaped_category and escaped_difficulty, set escaped_category=$(echo
"$category" | sed "s/'/''/g") and escaped_difficulty=$(echo "$difficulty" | sed
"s/'/''/g"), and then use '${escaped_category}' and '${escaped_difficulty}' in
the log_stderr / sqlite3 INSERT command (the prompt_id assignment) instead of
the raw variables.
| if [[ -z "$prompt_id" || -z "$model_id" || -z "$response_text" ]]; then | ||
| print_error "Usage: response-scoring-helper.sh record --prompt <id> --model <model_id> --text \"response\" [--time <seconds>] [--tokens <count>] [--cost <usd>]" | ||
| return 1 | ||
| fi | ||
|
|
||
| # Verify prompt exists | ||
| local prompt_exists | ||
| prompt_exists=$(log_stderr "record check" sqlite3 "$SCORING_DB" \ | ||
| "SELECT COUNT(*) FROM prompts WHERE prompt_id = ${prompt_id};") | ||
| if [[ "$prompt_exists" == "0" ]]; then | ||
| print_error "Prompt #${prompt_id} not found" | ||
| return 1 | ||
| fi | ||
|
|
||
| local escaped_text escaped_model | ||
| escaped_text=$(echo "$response_text" | sed "s/'/''/g") | ||
| escaped_model=$(echo "$model_id" | sed "s/'/''/g") | ||
|
|
||
| local response_id | ||
| response_id=$(log_stderr "record insert" sqlite3 "$SCORING_DB" \ | ||
| "INSERT INTO responses (prompt_id, model_id, response_text, response_time, token_count, cost_estimate) | ||
| VALUES (${prompt_id}, '${escaped_model}', '${escaped_text}', ${response_time}, ${token_count}, ${cost}); | ||
| SELECT last_insert_rowid();") |
Numeric parameters (prompt_id, response_time, token_count, cost) are interpolated into SQL without integer/float validation.
If a user passes --prompt "1; DROP TABLE prompts;--" or --time "0; DROP TABLE responses", the value is spliced directly into the SQL statement. While prompt_id is checked against the DB first (line 266-267, also unvalidated), response_time, token_count, and cost go straight into the INSERT on line 280.
Add numeric validation before SQL interpolation. This pattern applies throughout the script (also in cmd_score, cmd_compare, cmd_leaderboard, cmd_export, cmd_history, and prompt show).
🛡️ Proposed fix: validate before interpolation (example for cmd_record)

```diff
 if [[ -z "$prompt_id" || -z "$model_id" || -z "$response_text" ]]; then
   print_error "Usage: ..."
   return 1
 fi
+# Validate numeric inputs
+if ! [[ "$prompt_id" =~ ^[0-9]+$ ]]; then
+  print_error "prompt_id must be a positive integer"
+  return 1
+fi
+if ! [[ "$response_time" =~ ^[0-9]*\.?[0-9]+$ ]]; then
+  print_error "response_time must be a number"
+  return 1
+fi
+if ! [[ "$token_count" =~ ^[0-9]+$ ]]; then
+  print_error "token_count must be a positive integer"
+  return 1
+fi
+if ! [[ "$cost" =~ ^[0-9]*\.?[0-9]+$ ]]; then
+  print_error "cost must be a number"
+  return 1
+fi
+
 # Verify prompt exists
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 259 - 281, Validate
numeric inputs before embedding them into SQL: ensure prompt_id and token_count
match an integer regex (e.g., ^[0-9]+$) and ensure response_time and cost match
a numeric/float regex (e.g., ^[0-9]+(\.[0-9]+)?$); reject or default and error
out if validation fails, then use the validated variables when constructing the
sqlite3 INSERT/select statements. Also apply the same validation to the
prompt_exists check (don't interpolate an unvalidated prompt_id into the SELECT)
and propagate this pattern to other commands named in the script (cmd_record,
cmd_score, cmd_compare, cmd_leaderboard, cmd_export, cmd_history and the prompt
show path) so all numeric parameters are validated before SQL interpolation.
```bash
local where_clause=""
if [[ -n "$category" ]]; then
  where_clause="WHERE p.category = '${category}'"
fi
```
SQL injection via unescaped category in leaderboard WHERE clause.
The category value from --category is interpolated directly into the SQL WHERE clause without escaping.
🛡️ Proposed fix

```diff
 local where_clause=""
 if [[ -n "$category" ]]; then
-  where_clause="WHERE p.category = '${category}'"
+  local escaped_category
+  escaped_category=$(echo "$category" | sed "s/'/''/g")
+  where_clause="WHERE p.category = '${escaped_category}'"
 fi
```
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 607 - 611, The WHERE
clause builds SQL by interpolating the category variable into where_clause,
which allows SQL injection; fix by properly escaping or validating category
before interpolation — e.g., sanitize category (allow only expected
values/patterns) or escape single quotes (replace ' with '' ) and then assign
where_clause="WHERE p.category = '${escaped_category}'"; alternatively use a
parameterized/psql variable substitution approach instead of direct
interpolation; update the code that sets where_clause and any caller that passes
category in .agents/scripts/response-scoring-helper.sh to use the
sanitized/escaped variable.
| if [[ "$format" == "csv" ]]; then | ||
| echo "prompt_id,prompt_title,response_id,model_id,correctness,completeness,code_quality,clarity,weighted_avg,response_time,token_count,cost_estimate" | ||
| sqlite3 -separator ',' "$SCORING_DB" " | ||
| SELECT r.prompt_id, p.title, r.response_id, r.model_id, | ||
| MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), | ||
| MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), | ||
| MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), | ||
| MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), | ||
| ROUND(( | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 + | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 + | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 + | ||
| COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20 | ||
| ), 2), | ||
| r.response_time, r.token_count, r.cost_estimate | ||
| FROM responses r | ||
| JOIN prompts p ON r.prompt_id = p.prompt_id | ||
| LEFT JOIN scores s ON r.response_id = s.response_id | ||
| ${where_clause} | ||
| GROUP BY r.response_id | ||
| ORDER BY r.prompt_id, r.model_id; | ||
| " 2>/dev/null |
Export CSV weighted average calculation differs from compare/leaderboard: missing criteria skew the score.

The export query on lines 724-728 uses COALESCE(..., 0) for missing criteria, substituting 0 and implicitly dividing by the full weight of 1.0 (there is no division at all). Meanwhile, cmd_compare (lines 466-480) and cmd_leaderboard (lines 618-631) correctly use NULLIF(SUM(weights), 0) to exclude unscored criteria from the denominator.

If a response has only one criterion scored (e.g., correctness=5), export reports 5*0.30 = 1.50, while compare reports 5.00. This inconsistency produces misleading export data.
🔧 Align export with the same weighted-average formula used elsewhere

```diff
-ROUND((
-  COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
-  COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
-  COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
-  COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20
-), 2),
+ROUND(
+  (COALESCE(MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END), 0) * 0.30 +
+   COALESCE(MAX(CASE WHEN s.criterion = 'completeness' THEN s.score END), 0) * 0.25 +
+   COALESCE(MAX(CASE WHEN s.criterion = 'code_quality' THEN s.score END), 0) * 0.25 +
+   COALESCE(MAX(CASE WHEN s.criterion = 'clarity' THEN s.score END), 0) * 0.20)
+  / NULLIF(
+    (CASE WHEN MAX(CASE WHEN s.criterion = 'correctness' THEN 1 END) IS NOT NULL THEN 0.30 ELSE 0 END +
+     CASE WHEN MAX(CASE WHEN s.criterion = 'completeness' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
+     CASE WHEN MAX(CASE WHEN s.criterion = 'code_quality' THEN 1 END) IS NOT NULL THEN 0.25 ELSE 0 END +
+     CASE WHEN MAX(CASE WHEN s.criterion = 'clarity' THEN 1 END) IS NOT NULL THEN 0.20 ELSE 0 END), 0)
+  , 2),
```
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 716 - 737, The CSV
export query currently uses COALESCE(..., 0) which treats missing criteria as
zero and biases the weighted average; update the SELECT that builds the
weighted_avg (inside the format == "csv" branch) to compute a proper weighted
average like cmd_compare and cmd_leaderboard: build a numerator by summing
weighted MAX(CASE WHEN s.criterion = 'correctness' THEN s.score END) * 0.30 +
... (for completeness 0.25, code_quality 0.25, clarity 0.20) and build a
denominator as NULLIF( (CASE WHEN MAX(CASE WHEN s.criterion='correctness' THEN
s.score END) IS NOT NULL THEN 0.30 ELSE 0 END) + ... , 0); then use
ROUND(numerator / denominator, 2) so missing criteria are excluded from the
denominator rather than treated as zero; change the expression that currently
uses COALESCE(...) and weights to this numerator/denominator pattern referencing
the same MAX(CASE ...) expressions and weights.
…(t168.3)

Add response-scoring-helper.sh with SQLite-backed evaluation engine:

- Structured scoring on 4 weighted criteria: correctness (30%), completeness (25%), code quality (25%), clarity (20%)
- Side-by-side comparison with ranked output and winner declaration
- Aggregate leaderboard across all evaluations
- JSON and CSV export for analysis
- Prompt management (create, list, show)
- Response recording with timing, token count, and cost metadata
- Multi-scorer support (human, AI, etc.)
- /score-responses slash command and subagent documentation
- 41 passing tests covering all commands and edge cases
- ShellCheck clean (zero violations)

Complements existing compare-models (specs) with actual response quality evaluation. Workflow: identify candidates -> check availability -> evaluate outputs -> inform model routing decisions.
Force-pushed from 6eb9c3e to a6aefbd (Compare)
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status
[INFO] Recent monitoring activity
📈 Current Quality Metrics

Generated on: Mon Feb 9 15:44:00 UTC 2026 by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @.agents/scripts/response-scoring-helper.sh:
- Around line 58-110: The SQLite foreign key constraints in init_db (and the
comparisons/winner_id, responses.prompt_id, scores.response_id relations) are
never enforced because PRAGMA foreign_keys=ON is not set per connection; add a
PRAGMA foreign_keys = ON execution before any DB DDL/DML is run (e.g., in
init_db and in the connection wrapper used by ensure_db, cmd_record, cmd_score)
so every write/read session enables foreign keys for that connection; ensure the
PRAGMA is issued on the same sqlite3 connection used for subsequent CREATE TABLE
/ INSERT / UPDATE statements to enforce referential integrity for comparisons,
responses, and scores.
- Around line 557-588: The JSON building in _compare_json interpolates unescaped
strings (notably model_id and prompt_title) which can break JSON; either
construct the object with jq if available or ensure proper escaping of
backslashes and double-quotes (and strip/escape control chars) for model_id and
prompt_title before embedding them into the JSON string; update the same
escaping logic used for prompt_title to also run on model_id (and apply to the
leaderboard output path referenced around line 660) so all string fields are
JSON-safe.
- Around line 338-342: The current range check using [[ "$value" -lt 1 ||
"$value" -gt 5 ]] 2>/dev/null silently accepts non-numeric input; change the
validation in response-scoring-helper.sh to first assert the input is an integer
(e.g., test $value with a regex like ^[0-9]+$ using [[ ]] ), and only then
perform the numeric range check against 1–5 (the existing -lt/-gt checks) for
the variables value and criterion; if the integer test fails or the range check
fails, call print_error "Score for ${criterion} must be an integer 1-5, got:
${value}" and return 1.
🧹 Nitpick comments (4)
.agents/scripts/response-scoring-helper.sh (2)
384-391: Shell variables expanded directly inside awk program strings: prefer -v for safety and clarity.

Lines 386-387 and 395 inject $weighted_total, $score, $weight, and $weight_sum via double-quoted awk program strings. While the values here flow from the database (with a CHECK constraint) and a hardcoded weight helper, this pattern is fragile: any unexpected character in a DB value would be interpreted as awk code.

Using awk's -v flag is more robust and considered idiomatic:

♻️ Suggested improvement

```diff
-weighted_total=$(awk "BEGIN{printf \"%.4f\", $weighted_total + ($score * $weight)}")
-weight_sum=$(awk "BEGIN{printf \"%.4f\", $weight_sum + $weight}")
+weighted_total=$(awk -v wt="$weighted_total" -v sc="$score" -v w="$weight" 'BEGIN{printf "%.4f", wt + (sc * w)}')
+weight_sum=$(awk -v ws="$weight_sum" -v w="$weight" 'BEGIN{printf "%.4f", ws + w}')
```

363-368: Inconsistent use of log_stderr wrapper for sqlite3 calls.

_show_response_scores (lines 367-368) and cmd_compare (lines 454-455) call sqlite3 directly, while other commands consistently use the log_stderr wrapper. This means failures in these calls won't appear in the log file, making debugging harder.

♻️ Proposed fix (example for _show_response_scores)

```diff
 local model_id
-model_id=$(sqlite3 "$SCORING_DB" \
-  "SELECT model_id FROM responses WHERE response_id = ${response_id};")
+model_id=$(log_stderr "show scores" sqlite3 "$SCORING_DB" \
+  "SELECT model_id FROM responses WHERE response_id = ${response_id};")
```

tests/test-response-scoring.sh (2)

215-222: Missing test for non-numeric score input: it would expose the validation bypass.

The test on line 216 validates an out-of-range integer (6), which the script does catch. However, a non-numeric value like abc silently passes the validation in the helper (see the review on lines 338-342 of the helper script) and would fail at the SQLite level instead. Adding a non-numeric test case would catch this gap and serve as a regression test for the fix:

```bash
# Test score validation (non-numeric)
score_nan=$(bash "$HELPER" score --response 1 --correctness "abc" 2>&1) || true
if echo "$score_nan" | grep -qi "must be 1-5\|error"; then
  pass "score rejects non-numeric value"
else
  fail "score should reject non-numeric value" "$score_nan"
fi
```

344-369: Consider restoring SCORING_DB_OVERRIDE or adding a comment after switching to the fresh DB.

Line 346 changes SCORING_DB_OVERRIDE to a fresh database for edge-case tests. Any tests added after this block would unknowingly run against the empty DB instead of the fully populated one. A brief comment or a restoration step would prevent future confusion:

```diff
+# NOTE: SCORING_DB_OVERRIDE now points to a fresh empty DB for the remaining edge-case tests.
 fresh_db="$TEST_DB_DIR/fresh.db"
 export SCORING_DB_OVERRIDE="$fresh_db"
```
```bash
init_db() {
  mkdir -p "$SCORING_DIR" 2>/dev/null || true

  log_stderr "db init" sqlite3 "$SCORING_DB" "
  CREATE TABLE IF NOT EXISTS prompts (
    prompt_id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    prompt_text TEXT NOT NULL,
    category TEXT DEFAULT 'general',
    difficulty TEXT DEFAULT 'medium',
    created_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
  );

  CREATE TABLE IF NOT EXISTS responses (
    response_id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt_id INTEGER NOT NULL,
    model_id TEXT NOT NULL,
    response_text TEXT NOT NULL,
    response_time REAL DEFAULT 0.0,
    token_count INTEGER DEFAULT 0,
    cost_estimate REAL DEFAULT 0.0,
    recorded_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id)
  );

  CREATE TABLE IF NOT EXISTS scores (
    score_id INTEGER PRIMARY KEY AUTOINCREMENT,
    response_id INTEGER NOT NULL,
    criterion TEXT NOT NULL,
    score INTEGER NOT NULL CHECK(score BETWEEN 1 AND 5),
    rationale TEXT DEFAULT '',
    scored_by TEXT DEFAULT 'human',
    scored_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    FOREIGN KEY (response_id) REFERENCES responses(response_id),
    UNIQUE(response_id, criterion, scored_by)
  );

  CREATE TABLE IF NOT EXISTS comparisons (
    comparison_id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt_id INTEGER NOT NULL,
    winner_id INTEGER,
    notes TEXT DEFAULT '',
    compared_at TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    FOREIGN KEY (prompt_id) REFERENCES prompts(prompt_id),
    FOREIGN KEY (winner_id) REFERENCES responses(response_id)
  );

  CREATE INDEX IF NOT EXISTS idx_responses_prompt ON responses(prompt_id);
  CREATE INDEX IF NOT EXISTS idx_responses_model ON responses(model_id);
  CREATE INDEX IF NOT EXISTS idx_scores_response ON scores(response_id);
  "
  return 0
}
```
Foreign keys are declared but never enforced — SQLite requires an explicit pragma.
SQLite ignores FOREIGN KEY constraints by default. Without PRAGMA foreign_keys = ON; before the DML statements, referential integrity (e.g., inserting a response for a deleted prompt, or a comparison with a bogus winner_id) is not enforced at the database level.
The manual existence checks in cmd_record and cmd_score partially compensate, but the comparisons table's winner_id is never validated by application code.
🔧 Proposed fix

```diff
 init_db() {
   mkdir -p "$SCORING_DIR" 2>/dev/null || true
   log_stderr "db init" sqlite3 "$SCORING_DB" "
+  PRAGMA foreign_keys = ON;
+
   CREATE TABLE IF NOT EXISTS prompts (
```

Note: PRAGMA foreign_keys is per-connection and not persisted, so you'd also want to issue it before every write session (e.g., in ensure_db or a wrapper).
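A sketch of that per-session approach, assuming a single hypothetical db_exec entry point funnels all statements (not part of this PR):

```bash
# Hypothetical wrapper: enforce foreign keys on every connection.
# PRAGMA foreign_keys = ON is silent (no stdout), so it is safe to
# prepend even when the caller captures query output.
db_exec() {
  sqlite3 "$SCORING_DB" "PRAGMA foreign_keys = ON; $1"
}
```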
🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 58 - 110, The SQLite
foreign key constraints in init_db (and the comparisons/winner_id,
responses.prompt_id, scores.response_id relations) are never enforced because
PRAGMA foreign_keys=ON is not set per connection; add a PRAGMA foreign_keys = ON
execution before any DB DDL/DML is run (e.g., in init_db and in the connection
wrapper used by ensure_db, cmd_record, cmd_score) so every write/read session
enables foreign keys for that connection; ensure the PRAGMA is issued on the
same sqlite3 connection used for subsequent CREATE TABLE / INSERT / UPDATE
statements to enforce referential integrity for comparisons, responses, and
scores.
```bash
# Validate score range
if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
  print_error "Score for ${criterion} must be 1-5, got: ${value}"
  return 1
fi
```
Score validation silently passes non-numeric input.
The -lt / -gt operators inside [[ ]] emit an error (exit code 2) on non-integer strings, and 2>/dev/null suppresses it. The effect is that a value like abc or 3; DROP TABLE scores;-- makes both comparisons fail, the overall condition evaluates to false, and the value sails through to the SQL INSERT on line 346.
This is the score-specific manifestation of the numeric-injection gap already noted, but the mechanism is subtle enough to call out: the validation looks present but doesn't actually guard non-numeric input.
🛡️ Proposed fix: validate integer first, then range-check

```diff
 if [[ -z "$value" ]]; then
   continue
 fi
-# Validate score range
-if [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null; then
+# Validate integer and score range
+if ! [[ "$value" =~ ^[0-9]+$ ]] || [[ "$value" -lt 1 || "$value" -gt 5 ]]; then
   print_error "Score for ${criterion} must be 1-5, got: ${value}"
   return 1
 fi
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 338 - 342, The
current range check using [[ "$value" -lt 1 || "$value" -gt 5 ]] 2>/dev/null
silently accepts non-numeric input; change the validation in
response-scoring-helper.sh to first assert the input is an integer (e.g., test
$value with a regex like ^[0-9]+$ using [[ ]] ), and only then perform the
numeric range check against 1–5 (the existing -lt/-gt checks) for the variables
value and criterion; if the integer test fails or the range check fails, call
print_error "Score for ${criterion} must be an integer 1-5, got: ${value}" and
return 1.
```bash
_compare_json() {
  local prompt_id="$1"
  local prompt_title="$2"
  local responses="$3"

  local json_entries=()
  local rank=0

  while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
    rank=$((rank + 1))

    local corr="null" comp="null" code="null" clar="null"
    if [[ -n "$score_detail" ]]; then
      local val
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^correctness:" | cut -d: -f2)
      [[ -n "$val" ]] && corr="$val"
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^completeness:" | cut -d: -f2)
      [[ -n "$val" ]] && comp="$val"
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^code_quality:" | cut -d: -f2)
      [[ -n "$val" ]] && code="$val"
      val=$(echo "$score_detail" | tr ',' '\n' | grep "^clarity:" | cut -d: -f2)
      [[ -n "$val" ]] && clar="$val"
    fi

    json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${model_id}\",\"scores\":{\"correctness\":${corr},\"completeness\":${comp},\"code_quality\":${code},\"clarity\":${clar}},\"weighted_avg\":${wavg:-0},\"response_time\":${rtime:-0},\"tokens\":${tokens:-0},\"cost\":${cost:-0}}")
  done <<< "$responses"

  local escaped_title
  escaped_title=$(echo "$prompt_title" | sed 's/"/\\"/g')
  echo "{\"prompt_id\":${prompt_id},\"title\":\"${escaped_title}\",\"responses\":[$(IFS=,; echo "${json_entries[*]}")]}"
  return 0
}
```
JSON output is vulnerable to breakage from special characters in model_id.
model_id (line 581) is interpolated directly into the JSON string with zero escaping. If a model ID contains ", \, or control characters, the output is malformed JSON. The same applies to JSON leaderboard output (line 660).
prompt_title (line 585) escapes only " but not \ or control characters, which is incomplete.
Consider using jq (if available) or at minimum applying the same sed escaping for \ and ":
🔧 Proposed fix (minimal: escape backslash, then double-quote)

```diff
 while IFS='|' read -r rid model_id rtime tokens cost wavg score_detail; do
   rank=$((rank + 1))
+  # Escape JSON-special characters
+  local escaped_model
+  escaped_model=$(printf '%s' "$model_id" | sed 's/\\/\\\\/g; s/"/\\"/g')
   ...
-  json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${model_id}\",...")
+  json_entries+=("{\"rank\":${rank},\"response_id\":${rid},\"model\":\"${escaped_model}\",...")
```

🤖 Prompt for AI Agents
In @.agents/scripts/response-scoring-helper.sh around lines 557 - 588, The JSON
building in _compare_json interpolates unescaped strings (notably model_id and
prompt_title) which can break JSON; either construct the object with jq if
available or ensure proper escaping of backslashes and double-quotes (and
strip/escape control chars) for model_id and prompt_title before embedding them
into the JSON string; update the same escaping logic used for prompt_title to
also run on model_id (and apply to the leaderboard output path referenced around
line 660) so all string fields are JSON-safe.
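If jq is an acceptable dependency, each entry can be built with guaranteed escaping. A minimal sketch using the loop variables shown above (an alternative to the sed-based fix, not the PR's implementation):

```bash
# Build one JSON entry with jq so every string field is escaped correctly.
# --arg passes strings (jq escapes them); --argjson passes raw numbers.
entry=$(jq -cn \
  --argjson rank "$rank" \
  --argjson response_id "$rid" \
  --arg model "$model_id" \
  --argjson weighted_avg "${wavg:-0}" \
  '{rank: $rank, response_id: $response_id, model: $model, weighted_avg: $weighted_avg}')
json_entries+=("$entry")
```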
Summary

- response-scoring-helper.sh: SQLite-backed framework for evaluating AI model responses side-by-side with structured scoring criteria
- /score-responses slash command and response-scoring.md subagent documentation

What This Does

Extends the model comparison toolchain (t168 series) from spec-level comparison to actual response quality evaluation:

- compare-models-helper.sh
- model-availability-helper.sh
- model-registry-helper.sh
- response-scoring-helper.sh (new)

Scoring Criteria (weighted)

Correctness (30%), completeness (25%), code quality (25%), clarity (20%), each scored on a 1-5 scale.
Commands

- prompt add/list/show: manage evaluation prompts
- record: record model responses with timing/cost metadata
- score: score responses on all 4 criteria (1-5 scale)
- compare: side-by-side comparison with ranked output
- leaderboard: aggregate model rankings across evaluations
- export: JSON and CSV export for analysis
- history: scoring audit trail
- criteria: reference rubrics

Workflow

Identify candidates -> check availability -> evaluate outputs -> inform model routing decisions. A hypothetical end-to-end session is sketched below.
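A hypothetical end-to-end session, with flag names inferred from the usage strings and tests quoted in the reviews above (illustrative, not authoritative usage):

```bash
# Initialize the scoring database, add a prompt, record a model response,
# score it on the four criteria, then compare and rank.
bash .agents/scripts/response-scoring-helper.sh init
bash .agents/scripts/response-scoring-helper.sh prompt add --title "Demo" --text "Write a sort function"
bash .agents/scripts/response-scoring-helper.sh record --prompt 1 --model example-model \
  --text "def sort_items(xs): return sorted(xs)" --time 2.1 --tokens 150 --cost 0.003
bash .agents/scripts/response-scoring-helper.sh score --response 1 \
  --correctness 5 --completeness 4 --code_quality 5 --clarity 5
bash .agents/scripts/response-scoring-helper.sh compare --prompt 1
bash .agents/scripts/response-scoring-helper.sh leaderboard --json
```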
Quality

- 41 passing tests covering all commands and edge cases
- ShellCheck clean (zero violations)
- Repo conventions: local var="$1", explicit returns, shared-constants.sh

Summary by CodeRabbit

- New Features
- Documentation
- Tests