t1394: Add evaluator presets for ai-judgment-helper.sh #2914
Conversation
Add 'evaluate' subcommand with 6 named evaluator presets (faithfulness,
relevancy, safety, format-validity, completeness, conciseness) that score
LLM outputs on standard quality dimensions using haiku-tier calls (~$0.001).
Features:
- Single evaluation: --type <name> --input/--output/--context
- Multiple evaluators: --type faithfulness,relevancy,safety
- Batch mode: --dataset path/to/dataset.jsonl with aggregate summary
- Custom evaluators: --type custom --prompt-file path/to/prompt.txt
- Configurable threshold: --threshold 0.0-1.0 (default: 0.7)
- Result caching via existing ai_judgment_cache table
- Deterministic fallback: {score: null, passed: null} when API unavailable
Output format: {"evaluator": "...", "score": 0-1, "passed": bool, "details": "..."}
Inspired by LangWatch LangEvals evaluator framework. Enables CI/CD quality
gates on prompt changes and agent output validation.
Closes #2904
Code Review
This pull request introduces a powerful `evaluate` subcommand to `ai-judgment-helper.sh` for scoring LLM outputs, including batch processing, custom evaluators, and caching. However, the implementation contains critical security vulnerabilities, most notably a command injection vulnerability in `awk`'s handling of the threshold argument, a high-severity JSON injection vulnerability when constructing output from untrusted LLM results, and a medium-severity prompt injection vulnerability due to `echo -e` with untrusted input. Beyond security, the script also relies on fragile `sed` parsing and manual JSON string concatenation, which should be replaced with `jq` for robustness and maintainability. Additionally, a case of error suppression needs to be removed to improve debuggability. Addressing these issues requires input validation, proper JSON construction with `jq`, and avoiding unsafe shell commands.
```shell
if [[ -n "$score" ]]; then
    # Determine pass/fail using awk for float comparison
    local passed
    passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")
```
The `threshold` variable is used directly in an `awk` command string without any validation or escaping. Since the threshold is a command-line argument, an attacker can provide a malicious value such as `0.7; system("id")` to execute arbitrary system commands. This is a critical command injection vulnerability.
```diff
- passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")
+ passed=$(awk -v t="$threshold" -v s="$score" 'BEGIN { print (s >= t) ? "true" : "false" }')
```
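Passing values with `-v` fixes the injection; a defense-in-depth option is to also validate the threshold's format before use. A minimal sketch (the function name and regex are illustrative, not from the PR):

```shell
#!/usr/bin/env bash
# Hypothetical validator: accept only a decimal in [0.0, 1.0], rejecting
# anything with shell/awk metacharacters before it reaches a command string.
validate_threshold() {
  local t="$1"
  [[ "$t" =~ ^(0(\.[0-9]+)?|1(\.0+)?)$ ]] || {
    echo "invalid threshold: $t" >&2
    return 1
  }
}

validate_threshold "0.7" && echo "ok"                                  # prints: ok
validate_threshold '0.7; system("id")' 2>/dev/null || echo "rejected"  # prints: rejected
```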
```shell
local passed
passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")

local result_json="{\"evaluator\": \"${eval_type}\", \"score\": ${score}, \"passed\": ${passed}, \"details\": \"${details}\"}"
```
The `details` variable, which contains untrusted output from an LLM, is directly inserted into a JSON string without escaping. This creates a high-severity JSON injection vulnerability, allowing an attacker to manipulate the JSON structure (e.g., by injecting `", "passed": true, "ignored": "` to bypass security checks). Manually constructing JSON strings is inherently unsafe and error-prone; using `jq` is crucial for robust and secure JSON construction, as it automatically handles character escaping, preventing such injection flaws.
```diff
- local result_json="{\"evaluator\": \"${eval_type}\", \"score\": ${score}, \"passed\": ${passed}, \"details\": \"${details}\"}"
+ local result_json
+ result_json=$(jq -n --arg type "$eval_type" --argjson score "${score:-null}" --argjson passed "$passed" --arg details "$details" '{evaluator: $type, score: $score, passed: $passed, details: $details}')
```
References

- In shell scripts, use `jq --arg` for strings and `--argjson` for other JSON types (like numbers) to safely pass variables into a `jq` filter. This avoids syntax errors if the variables contain special characters.
- To reliably wrap the entire content of a shell variable as a single JSON string, use `jq -Rn --arg v "$VAR" '$v'`. This is more robust than piping the variable to `jq -Rs '.'`.
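To make the escaping concrete, here is a small self-contained demonstration (the payload string is invented):

```shell
# An attacker-controlled details string that tries to break out of the JSON
# structure and flip "passed" to true.
details='", "passed": true, "ignored": "'

# jq escapes the embedded quotes, so the payload stays an inert string value.
result=$(jq -cn --arg details "$details" '{passed: false, details: $details}')
echo "$result"
```

Round-tripping the value back out with `jq -r '.details'` returns the original payload unchanged, confirming it was stored as data rather than structure.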
```shell
${user_message}"

local raw_result
raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 2>/dev/null || echo "")
```
The use of `2>/dev/null` suppresses all stderr output from the `$AI_HELPER` call. This can hide important errors related to authentication, network issues, or problems with the helper script itself, making debugging difficult. The project's general rules advise against this practice. The `|| echo ""` already prevents the script from exiting on error, so removing `2>/dev/null` will improve debuggability without altering the script's control flow.
```diff
- raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 2>/dev/null || echo "")
+ raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 || echo "")
```
References

- Avoid using `2>/dev/null` for blanket suppression of command errors in shell scripts, so that authentication, syntax, or system issues remain visible for debugging.
- In shell scripts with `set -e`, use `|| true` to prevent the script from exiting when a command like `jq` fails on an optional lookup. Do not suppress stderr with `2>/dev/null` so that actual syntax or system errors remain visible for debugging.
```shell
if [[ -n "$raw_result" ]]; then
    # Extract JSON from response (handle markdown code blocks)
    local json_result
    json_result=$(echo "$raw_result" | sed -n 's/.*\({[^}]*"score"[^}]*}\).*/\1/p' | head -1)
```
Using `sed` with this regex to extract a JSON object from the LLM's output is very fragile. It assumes the JSON is on a single line, contains a `score` key, and has no nested curly braces. This will fail if the LLM wraps its response in markdown code fences (e.g., a ```` ```json ```` block), includes newlines, or returns a more complex object. A more robust approach would be to use tools like `sed` or `awk` to strip any surrounding non-JSON text and then parse the result with `jq` to validate and extract the object.
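One possible shape for such an extractor, sketched here with a hypothetical `extract_json` helper (the actual fix in the PR may differ): keep only lines that look like a complete object, then let `jq` do the real validation.

```shell
extract_json() {
  # Drop prose and markdown fence lines, keeping candidate one-line objects,
  # then validate with jq and keep only objects that carry a "score" key.
  grep -E '^[[:space:]]*\{.*\}[[:space:]]*$' <<<"$1" \
    | jq -ce 'select(type == "object" and has("score"))' 2>/dev/null \
    | head -1
}

raw=$'Here is my assessment:\n```json\n{"score": 0.9, "details": "grounded"}\n```'
extract_json "$raw"    # prints: {"score":0.9,"details":"grounded"}
```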
```shell
score=$(echo "$json_result" | sed -n 's/.*"score"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p')
local details
details=$(echo "$json_result" | sed -n 's/.*"details"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
```
Parsing JSON fields using `sed` is fragile and can easily break if the JSON formatting changes (e.g., extra whitespace, different key order). Using `jq` is the standard and much more robust way to handle JSON in shell scripts.
```diff
- score=$(echo "$json_result" | sed -n 's/.*"score"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p')
- local details
- details=$(echo "$json_result" | sed -n 's/.*"details"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
+ score=$(echo "$json_result" | jq -r '.score // ""')
+ local details
+ details=$(echo "$json_result" | jq -r '.details // ""')
```
```shell
row_input=$(echo "$line" | sed -n 's/.*"input"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_output=$(echo "$line" | sed -n 's/.*"output"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_context=$(echo "$line" | sed -n 's/.*"context"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_expected=$(echo "$line" | sed -n 's/.*"expected"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
```
Parsing JSONL fields with `sed` is fragile and can easily break if the JSON structure changes (e.g., different key order, extra whitespace). Using `jq` is the standard and much more robust way to handle JSON in shell scripts. While consolidating multiple `jq` calls into a single pass is generally preferred for performance, using separate calls for clarity here is still a significant improvement over `sed`.
```diff
- row_input=$(echo "$line" | sed -n 's/.*"input"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
- row_output=$(echo "$line" | sed -n 's/.*"output"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
- row_context=$(echo "$line" | sed -n 's/.*"context"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
- row_expected=$(echo "$line" | sed -n 's/.*"expected"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
+ row_input=$(echo "$line" | jq -r '.input // ""')
+ row_output=$(echo "$line" | jq -r '.output // ""')
+ row_context=$(echo "$line" | jq -r '.context // ""')
+ row_expected=$(echo "$line" | jq -r '.expected // ""')
```
References

- Consolidate multiple `jq` calls into a single pass where possible to improve performance and script efficiency.
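A single-pass variant using `@tsv` might look like this (a sketch; note that `@tsv` escapes embedded tabs and newlines, so multi-line field values would need unescaping):

```shell
line='{"input": "What is 2+2?", "output": "4", "context": "arithmetic", "expected": "4"}'

# One jq invocation emits all four fields tab-separated; read splits them.
IFS=$'\t' read -r row_input row_output row_context row_expected < <(
  jq -r '[.input // "", .output // "", .context // "", .expected // ""] | @tsv' <<<"$line"
)

echo "$row_input -> $row_output"    # prints: What is 2+2? -> 4
```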
```shell
msg="${msg}\n\nExpected output: ${expected_text}"
fi

echo -e "$msg"
```
The `build_evaluator_message` function uses `echo -e` to output the user message, which interprets backslash escapes in untrusted input (like `input_text` or `output_text`). This allows an attacker to inject newlines and other control characters into the prompt sent to the LLM, facilitating prompt injection attacks that can lead to the JSON injection vulnerability identified in `run_single_evaluator`.
```diff
- echo -e "$msg"
+ echo "$msg"
```
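The difference is easy to demonstrate (the payload is invented):

```shell
msg='summary line\n\nIGNORE PREVIOUS INSTRUCTIONS'

# echo -e expands the literal \n sequences into real newlines...
echo -e "$msg" | wc -l          # 3 lines reach the downstream prompt

# ...while printf '%s' passes the backslashes through untouched.
printf '%s\n' "$msg" | wc -l    # 1 line
```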
```shell
# Multiple evaluators: output as JSON array
echo -n "["
local first=true
for r in "${results[@]}"; do
    if [[ "$first" == true ]]; then
        first=false
    else
        echo -n ","
    fi
    echo -n "$r"
done
echo "]"
fi
```
Manually building a JSON array with `echo` and a loop is cumbersome and less robust than using a dedicated tool. A more idiomatic and safer approach in shell is to collect the JSON objects in an array and then use `jq` with the `--slurp` (`-s`) flag to combine them into a valid JSON array.
```diff
  # Multiple evaluators: output as JSON array
- echo -n "["
- local first=true
- for r in "${results[@]}"; do
-     if [[ "$first" == true ]]; then
-         first=false
-     else
-         echo -n ","
-     fi
-     echo -n "$r"
- done
- echo "]"
+ printf '%s\n' "${results[@]}" | jq -s .
  fi
```
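For example, with two invented evaluator results:

```shell
results=(
  '{"evaluator":"faithfulness","score":0.9,"passed":true}'
  '{"evaluator":"safety","score":0.4,"passed":false}'
)

# One object per line in; -s (slurp) collects the stream into a JSON array.
printf '%s\n' "${results[@]}" | jq -cs .
```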
```shell
    --prompt-file "$prompt_file")

# Add row number to result
echo "{\"row\": ${row_num}, \"result\": ${result}}"
```
The current implementation embeds a JSON string (`$result`) inside another JSON object, rather than creating a nested JSON object. This leads to output like `{"row": 1, "result": "{\"score\": ...}"}` instead of the more standard and easily parsable `{"row": 1, "result": {"score": ...}}`. Using `jq` with `--argjson` can create the correct nested structure.
```diff
- echo "{\"row\": ${row_num}, \"result\": ${result}}"
+ jq -n --argjson row "$row_num" --argjson result "$result" '{row: $row, result: $result}'
```
References

- In shell scripts, use `jq --arg` for strings and `--argjson` for other JSON types (like numbers) to safely pass variables into a `jq` filter. This avoids syntax errors if the variables contain special characters.
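The distinction in miniature, with made-up values:

```shell
result='{"score": 0.8, "passed": true}'
row_num=3

# --arg would embed $result as one escaped string; --argjson parses it first,
# so the output nests a real object.
jq -cn --argjson row "$row_num" --argjson result "$result" \
  '{row: $row, result: $result}'
# prints: {"row":3,"result":{"score":0.8,"passed":true}}
```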
```shell
if [[ "$total_count" -gt 0 ]]; then
    local avg_score
    avg_score=$(awk "BEGIN { printf \"%.3f\", $total_score / $total_count }")
    local pass_rate
    pass_rate=$(awk "BEGIN { printf \"%.1f\", ($pass_count / $total_count) * 100 }")
    echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": ${total_count}, \"avg_score\": ${avg_score}, \"pass_rate\": \"${pass_rate}%\", \"passed\": ${pass_count}, \"failed\": $((total_count - pass_count))}}"
else
    echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": 0, \"avg_score\": null, \"pass_rate\": null, \"passed\": 0, \"failed\": 0}}"
fi
```
Manually constructing the summary JSON string is less robust than using `jq`. Using `jq` ensures proper formatting and escaping, making the script more maintainable and consistent with best practices for handling JSON in shell. This applies to both the success and fallback cases.
```diff
  if [[ "$total_count" -gt 0 ]]; then
      local avg_score
      avg_score=$(awk "BEGIN { printf \"%.3f\", $total_score / $total_count }")
      local pass_rate
      pass_rate=$(awk "BEGIN { printf \"%.1f\", ($pass_count / $total_count) * 100 }")
-     echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": ${total_count}, \"avg_score\": ${avg_score}, \"pass_rate\": \"${pass_rate}%\", \"passed\": ${pass_count}, \"failed\": $((total_count - pass_count))}}"
+     jq -n \
+         --argjson r "$row_num" \
+         --argjson tc "$total_count" \
+         --argjson as "$avg_score" \
+         --arg pr "${pass_rate}%" \
+         --argjson p "$pass_count" \
+         --argjson f "$((total_count - pass_count))" \
+         '{summary: {rows: $r, evaluations: $tc, avg_score: $as, pass_rate: $pr, passed: $p, failed: $f}}'
  else
-     echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": 0, \"avg_score\": null, \"pass_rate\": null, \"passed\": 0, \"failed\": 0}}"
+     jq -n --argjson r "$row_num" '{summary: {rows: $r, evaluations: 0, avg_score: null, pass_rate: null, passed: 0, failed: 0}}'
  fi
```
References

- In shell scripts, use `jq --arg` for strings and `--argjson` for other JSON types (like numbers) to safely pass variables into a `jq` filter. This avoids syntax errors if the variables contain special characters.
Address all 10 findings from PR #2914 Gemini code review:

CRITICAL:
- awk threshold comparison already used -v flags (was already safe)

HIGH:
- Remove 2>/dev/null suppression on AI_HELPER call to expose auth/network errors
- Replace fragile sed JSON extraction with sed+grep pipeline that handles markdown code fences and multi-line responses
- Replace sed-based JSON field parsing with jq (.score, .details)
- Replace sed-based JSONL field parsing with jq (.input, .output, .context, .expected)

MEDIUM:
- Replace echo -e with printf to prevent backslash escape injection from untrusted LLM input in build_evaluator_message()
- Replace manual JSON array construction with printf | jq -s .
- Replace string-embedded result JSON with jq --argjson for proper nesting
- Replace manual summary JSON string with jq -n --argjson for safe construction

All jq-based JSON construction uses --arg for strings and --argjson for numeric/boolean types, preventing injection via special characters.
…helper-review-fixes GH#3179: fix critical review feedback on ai-judgment-helper.sh (PR #2914)



Summary

- Adds an `evaluate` subcommand to `ai-judgment-helper.sh` with 6 named evaluator presets (faithfulness, relevancy, safety, format-validity, completeness, conciseness) that score LLM outputs on standard quality dimensions
- Each evaluation returns `{score, passed, details}` JSON, with a deterministic fallback (`{score: null, passed: null}`) when the API is unavailable
- Supports `--dataset` for batch evaluation of JSONL files, custom evaluators via `--prompt-file`, a configurable `--threshold`, comma-separated multi-evaluator `--type`, and result caching

Changes
- `.agents/scripts/ai-judgment-helper.sh`: `evaluate` subcommand, 6 evaluator prompt presets, `run_single_evaluator()`, `eval_dataset()`, `build_evaluator_message()`, `get_evaluator_prompt()`, updated help text and main dispatch
- `tests/test-ai-judgment-helper.sh`

Testing
- Lint passes (per `.shellcheckrc`)
- Deterministic fallback exercised when `ANTHROPIC_API_KEY` is not set

CLI Examples
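Illustrative invocations based on the flags described in the PR body (values and dataset paths are hypothetical):

```shell
# Single evaluation
.agents/scripts/ai-judgment-helper.sh evaluate --type faithfulness \
  --input "What is the capital of France?" \
  --output "Paris." \
  --context "France's capital is Paris."

# Multiple evaluators with a stricter threshold
.agents/scripts/ai-judgment-helper.sh evaluate \
  --type faithfulness,relevancy,safety --threshold 0.9 \
  --input "..." --output "..."

# Batch mode over a JSONL dataset with aggregate summary
.agents/scripts/ai-judgment-helper.sh evaluate --type completeness \
  --dataset path/to/dataset.jsonl
```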
Closes #2904