
t1394: Add evaluator presets for ai-judgment-helper.sh#2914

Merged
marcusquinn merged 1 commit into main from feature/t1394-evaluator-presets on Mar 5, 2026
Conversation

@marcusquinn
Owner

Summary

  • Add evaluate subcommand to ai-judgment-helper.sh with 6 named evaluator presets (faithfulness, relevancy, safety, format-validity, completeness, conciseness) that score LLM outputs on standard quality dimensions
  • Each evaluator is a haiku-tier call (~$0.001) returning {score, passed, details} JSON, with deterministic fallback ({score: null, passed: null}) when API is unavailable
  • Supports --dataset for batch evaluation of JSONL files, custom evaluators via --prompt-file, configurable --threshold, multi-evaluator comma-separated --type, and result caching

Changes

File — Change
.agents/scripts/ai-judgment-helper.sh — Add evaluate subcommand, 6 evaluator prompt presets, run_single_evaluator(), eval_dataset(), build_evaluator_message(), get_evaluator_prompt(), updated help text and main dispatch
tests/test-ai-judgment-helper.sh — Add 11 new test cases covering: help listing, argument validation, fallback behavior, multi-type parsing, dataset mode, custom prompt files, caching, and live API evaluation

Testing

  • 24/24 tests pass (offline mode, no API key)
  • ShellCheck clean (only SC1091 for sourced file, handled by .shellcheckrc)
  • Live API tests skip gracefully when ANTHROPIC_API_KEY is not set
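The graceful-skip behavior described above can be sketched as a guard function; the function name and messages here are illustrative, not the actual test code:

```shell
# Assumed skip guard: a live test bails out cleanly when no API key
# is available, instead of failing the suite.
run_live_test() {
  if [[ -z "${ANTHROPIC_API_KEY:-}" ]]; then
    echo "SKIP: ANTHROPIC_API_KEY not set"
    return 0
  fi
  echo "RUN: live evaluation"
}

unset ANTHROPIC_API_KEY
run_live_test
```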

CLI Examples

# Single evaluation
ai-judgment-helper.sh evaluate --type faithfulness \
  --input "What is the capital of France?" \
  --output "The capital of France is Paris." \
  --context "France is a country in Western Europe. Its capital is Paris."

# Multiple evaluators
ai-judgment-helper.sh evaluate --type faithfulness,relevancy,safety \
  --input "Explain CORS" --output "CORS allows cross-origin requests..."

# Batch from dataset
ai-judgment-helper.sh evaluate --type relevancy --dataset path/to/dataset.jsonl

# Custom evaluator
ai-judgment-helper.sh evaluate --type custom --prompt-file my-eval.txt \
  --input "..." --output "..."
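For the `--dataset` mode, a minimal JSONL file can be built and sanity-checked before use. The row contents below are hypothetical; the key names (input, output, context, expected) follow the fields the evaluate subcommand reads per row:

```shell
# Hypothetical two-row dataset for batch evaluation.
cat > /tmp/t1394-dataset.jsonl <<'EOF'
{"input": "What is 2+2?", "output": "4", "context": "Basic arithmetic.", "expected": "4"}
{"input": "Capital of France?", "output": "Paris", "context": "Geography.", "expected": "Paris"}
EOF

# Validate that every row parses and carries the required keys
# before handing the file to --dataset.
jq -e 'has("input") and has("output")' < /tmp/t1394-dataset.jsonl > /dev/null && echo "dataset ok"
```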

Closes #2904

Add 'evaluate' subcommand with 6 named evaluator presets (faithfulness,
relevancy, safety, format-validity, completeness, conciseness) that score
LLM outputs on standard quality dimensions using haiku-tier calls (~$0.001).

Features:
- Single evaluation: --type <name> --input/--output/--context
- Multiple evaluators: --type faithfulness,relevancy,safety
- Batch mode: --dataset path/to/dataset.jsonl with aggregate summary
- Custom evaluators: --type custom --prompt-file path/to/prompt.txt
- Configurable threshold: --threshold 0.0-1.0 (default: 0.7)
- Result caching via existing ai_judgment_cache table
- Deterministic fallback: {score: null, passed: null} when API unavailable

Output format: {"evaluator": "...", "score": 0-1, "passed": bool, "details": "..."}

Inspired by LangWatch LangEvals evaluator framework. Enables CI/CD quality
gates on prompt changes and agent output validation.

Closes #2904
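A consumer such as a CI quality gate can pull fields out of the documented output shape with jq; the result value here is a made-up example:

```shell
# Hypothetical evaluator result matching the documented shape.
result='{"evaluator": "faithfulness", "score": 0.92, "passed": true, "details": "Grounded in context."}'

# Extract the fields a gate would branch on.
score=$(echo "$result" | jq -r '.score')
passed=$(echo "$result" | jq -r '.passed')
echo "score=$score passed=$passed"
```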
@github-actions bot added the enhancement label (auto-created from TODO.md tag) on Mar 5, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 5, 2026

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes and 39 seconds before requesting another review.


⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 412ab8f8-7d86-49e4-a166-ec012f37debc

📥 Commits

Reviewing files that changed from the base of the PR and between 85ce94f and da20ee1.

📒 Files selected for processing (2)
  • .agents/scripts/ai-judgment-helper.sh
  • tests/test-ai-judgment-helper.sh

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the ai-judgment-helper.sh script by introducing a robust LLM output evaluation framework. It allows users to assess the quality of AI-generated text against predefined criteria or custom rules, providing a standardized way to measure performance and ensure desired output characteristics. This feature aims to improve the reliability and utility of LLM interactions by enabling automated quality checks.

Highlights

  • New evaluate subcommand: Introduced a new evaluate subcommand to ai-judgment-helper.sh for scoring LLM outputs on various quality dimensions.
  • Six built-in evaluator presets: Added presets for faithfulness, relevancy, safety, format-validity, completeness, and conciseness, each with a specific system prompt.
  • Batch evaluation and custom prompts: Enabled batch evaluation of JSONL datasets via --dataset and support for custom evaluators using --prompt-file.
  • API fallback and caching: Implemented deterministic fallback to null scores when the API is unavailable and added caching for AI evaluation results.
  • Comprehensive testing: Added 11 new test cases covering argument validation, fallback behavior, multi-type parsing, dataset mode, custom prompts, caching, and live API evaluation.
Changelog
  • .agents/scripts/ai-judgment-helper.sh
    • Added evaluate command to the script's main dispatch and help text.
    • Defined DEFAULT_EVAL_THRESHOLD and EVAL_TYPES constants for evaluation configuration.
    • Implemented get_evaluator_prompt to generate system prompts for six distinct evaluation types.
    • Created build_evaluator_message to construct user messages for the AI evaluator based on input, output, and context.
    • Developed run_single_evaluator to handle individual evaluation calls, including API interaction, JSON parsing, and result caching.
    • Added cmd_evaluate to orchestrate the evaluation process, supporting single, multiple, and custom evaluators, as well as dataset processing.
    • Introduced eval_dataset for iterating through JSONL files and performing batch evaluations, including aggregate statistics.
    • Updated CLI examples to demonstrate usage of the new evaluate subcommand with various options.
  • tests/test-ai-judgment-helper.sh
    • Added test_evaluate_help_listed to confirm the new command and presets appear in the help output.
    • Included tests for argument validation, specifically test_evaluate_missing_type and test_evaluate_missing_output.
    • Implemented test_evaluate_fallback to verify correct behavior when the API is unavailable.
    • Added test_evaluate_multiple_types to ensure proper handling of comma-separated evaluator types.
    • Created test_evaluate_dataset and test_evaluate_dataset_not_found for batch evaluation functionality.
    • Developed test_evaluate_custom_prompt_file and test_evaluate_custom_prompt_not_found for custom evaluator scenarios.
    • Included test_evaluate_caching to validate the caching mechanism for evaluation results.
    • Added test_evaluate_with_api for live API integration testing of the evaluation feature.
    • Updated the main test function to include all newly added evaluation tests.
Activity
  • All 24 tests passed in offline mode without an API key.
  • The codebase remains ShellCheck clean, with a handled exception for SC1091.
  • Live API tests are configured to skip gracefully if ANTHROPIC_API_KEY is not set.

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 107 code smells

[INFO] Recent monitoring activity:
Thu Mar 5 13:55:54 UTC 2026: Code review monitoring started
Thu Mar 5 13:55:54 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 107

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 107
  • VULNERABILITIES: 0

Generated on: Thu Mar 5 13:55:57 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud

sonarqubecloud bot commented Mar 5, 2026


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a powerful evaluate subcommand to ai-judgment-helper.sh for scoring LLM outputs, including batch processing, custom evaluators, and caching. However, the implementation contains critical security vulnerabilities, most notably a command injection vulnerability in awk's handling of the threshold argument, a high-severity JSON injection vulnerability when constructing output from untrusted LLM results, and a medium-severity prompt injection vulnerability due to echo -e with untrusted input. Beyond security, the script also relies on fragile sed parsing and manual JSON string concatenation, which should be replaced with jq for robustness and maintainability. Additionally, a case of error suppression needs to be removed to improve debuggability. Addressing these issues requires input validation, proper JSON construction with jq, and avoiding unsafe shell commands.

if [[ -n "$score" ]]; then
# Determine pass/fail using awk for float comparison
local passed
passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")


security-critical critical

The threshold variable is used directly in an awk command string without any validation or escaping. Since threshold is a command-line argument, an attacker can provide a malicious value such as 0.7; system("id") to execute arbitrary system commands. This is a critical command injection vulnerability.

Suggested change
passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")
passed=$(awk -v t="$threshold" -v s="$score" 'BEGIN { print (s >= t) ? "true" : "false" }')
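The suggested `-v` form can be sanity-checked on its own; the values below are illustrative. Because `-v` passes values to awk as data rather than program text, a hostile threshold string cannot be executed:

```shell
# Safe float comparison: score and threshold reach awk as variables,
# never concatenated into the program source.
threshold='0.7'
score='0.85'
passed=$(awk -v t="$threshold" -v s="$score" 'BEGIN { print (s >= t) ? "true" : "false" }')
echo "$passed"
```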

local passed
passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")

local result_json="{\"evaluator\": \"${eval_type}\", \"score\": ${score}, \"passed\": ${passed}, \"details\": \"${details}\"}"


security-high high

The details variable, which contains untrusted output from an LLM, is directly inserted into a JSON string without escaping. This creates a high-severity JSON injection vulnerability, allowing an attacker to manipulate the JSON structure (e.g., by injecting ", "passed": true, "ignored": " to bypass security checks). Manually constructing JSON strings is inherently unsafe and error-prone; using jq is crucial for robust and secure JSON construction as it automatically handles character escaping, preventing such injection flaws.

Suggested change
local result_json="{\"evaluator\": \"${eval_type}\", \"score\": ${score}, \"passed\": ${passed}, \"details\": \"${details}\"}"
local result_json
result_json=$(jq -n --arg type "$eval_type" --argjson score "${score:-null}" --argjson passed "$passed" --arg details "$details" '{evaluator: $type, score: $score, passed: $passed, details: $details}')
References
  1. In shell scripts, use jq --arg for strings and --argjson for other JSON types (like numbers) to safely pass variables into a jq filter. This avoids syntax errors if the variables contain special characters.
  2. To reliably wrap the entire content of a shell variable as a single JSON string, use jq -Rn --arg v "$VAR" '$v'. This is more robust than piping the variable to jq -Rs '.'.
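A minimal sketch of the safe construction, using made-up values: quotes and backslashes in `details` are escaped by jq, so the resulting JSON stays well-formed no matter what the LLM returned.

```shell
# Untrusted string with characters that would break manual quoting.
details='He said "yes" and used a \ backslash'

# jq handles all escaping; --argjson passes the non-string values.
result_json=$(jq -n \
  --arg type "faithfulness" \
  --argjson score 0.9 \
  --argjson passed true \
  --arg details "$details" \
  '{evaluator: $type, score: $score, passed: $passed, details: $details}')

echo "$result_json" | jq -r '.details'
```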

${user_message}"

local raw_result
raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 2>/dev/null || echo "")


high

The use of 2>/dev/null suppresses all stderr output from the $AI_HELPER call. This can hide important errors related to authentication, network issues, or problems with the helper script itself, making debugging difficult. The project's general rules advise against this practice. The || echo "" already prevents the script from exiting on error, so removing 2>/dev/null will improve debuggability without altering the script's control flow.

Suggested change
raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 2>/dev/null || echo "")
raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 || echo "")
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
  2. In shell scripts with 'set -e', use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.

if [[ -n "$raw_result" ]]; then
# Extract JSON from response (handle markdown code blocks)
local json_result
json_result=$(echo "$raw_result" | sed -n 's/.*\({[^}]*"score"[^}]*}\).*/\1/p' | head -1)


high

Using sed with this regex to extract a JSON object from the LLM's output is very fragile. It assumes the JSON is on a single line, contains a score key, and has no nested curly braces. This will fail if the LLM wraps its response in markdown code fences (e.g. ```json ... ```), includes newlines, or returns a more complex object. A more robust approach would be to strip any surrounding non-JSON text with sed or awk and then parse the result with jq to validate and extract the object.

Comment on lines +901 to +903
score=$(echo "$json_result" | sed -n 's/.*"score"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p')
local details
details=$(echo "$json_result" | sed -n 's/.*"details"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')


high

Parsing JSON fields using sed is fragile and can easily break if the JSON formatting changes (e.g., extra whitespace, different key order). Using jq is the standard and much more robust way to handle JSON in shell scripts.

Suggested change
score=$(echo "$json_result" | sed -n 's/.*"score"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p')
local details
details=$(echo "$json_result" | sed -n 's/.*"details"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
score=$(echo "$json_result" | jq -r '.score // ""')
local details
details=$(echo "$json_result" | jq -r '.details // ""')

Comment on lines +1094 to +1097
row_input=$(echo "$line" | sed -n 's/.*"input"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_output=$(echo "$line" | sed -n 's/.*"output"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_context=$(echo "$line" | sed -n 's/.*"context"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_expected=$(echo "$line" | sed -n 's/.*"expected"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')


high

Parsing JSONL fields with sed is fragile and can easily break if the JSON structure changes (e.g., different key order, extra whitespace). Using jq is the standard and much more robust way to handle JSON in shell scripts. While consolidating multiple jq calls into a single pass is generally preferred for performance, using separate calls for clarity here is still a significant improvement over sed.

Suggested change
row_input=$(echo "$line" | sed -n 's/.*"input"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_output=$(echo "$line" | sed -n 's/.*"output"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_context=$(echo "$line" | sed -n 's/.*"context"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_expected=$(echo "$line" | sed -n 's/.*"expected"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_input=$(echo "$line" | jq -r '.input // ""')
row_output=$(echo "$line" | jq -r '.output // ""')
row_context=$(echo "$line" | jq -r '.context // ""')
row_expected=$(echo "$line" | jq -r '.expected // ""')
References
  1. Consolidate multiple 'jq' calls into a single pass where possible to improve performance and script efficiency.
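The jq-based row parsing is robust to key order and missing fields, which the sed patterns are not. A quick check with a hypothetical row:

```shell
# Row with keys in an order the sed pattern did not anticipate,
# and with "context" absent entirely.
line='{"output": "Paris", "input": "Capital of France?"}'

row_input=$(echo "$line" | jq -r '.input // ""')
row_context=$(echo "$line" | jq -r '.context // ""')

# Missing keys fall back to the empty string instead of garbage.
echo "input=$row_input context=[$row_context]"
```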

msg="${msg}\n\nExpected output: ${expected_text}"
fi

echo -e "$msg"


security-medium medium

The build_evaluator_message function uses echo -e to output the user message, which interprets backslash escapes in untrusted input (like input_text or output_text). This allows an attacker to inject newlines and other control characters into the prompt sent to the LLM, facilitating prompt injection attacks that can lead to the JSON injection vulnerability identified in run_single_evaluator.

Suggested change
echo -e "$msg"
echo "$msg"

Comment on lines +1044 to +1056
# Multiple evaluators: output as JSON array
echo -n "["
local first=true
for r in "${results[@]}"; do
if [[ "$first" == true ]]; then
first=false
else
echo -n ","
fi
echo -n "$r"
done
echo "]"
fi


medium

Manually building a JSON array with echo and a loop is cumbersome and less robust than using a dedicated tool. A more idiomatic and safer approach in shell is to collect the JSON objects in an array and then use jq with the --slurp (-s) flag to combine them into a valid JSON array.

Suggested change
# Multiple evaluators: output as JSON array
echo -n "["
local first=true
for r in "${results[@]}"; do
if [[ "$first" == true ]]; then
first=false
else
echo -n ","
fi
echo -n "$r"
done
echo "]"
fi
# Multiple evaluators: output as JSON array
printf '%s\n' "${results[@]}" | jq -s .
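The slurp approach can be exercised standalone with stand-in evaluator results; jq validates each element as it combines them, so a malformed entry fails loudly instead of producing broken output:

```shell
# Two stand-in per-evaluator results.
results=('{"evaluator":"faithfulness","score":0.9}' '{"evaluator":"safety","score":1}')

# One object per line in, one valid JSON array out.
json_array=$(printf '%s\n' "${results[@]}" | jq -s .)

echo "$json_array" | jq -r '.[1].evaluator'
```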

--prompt-file "$prompt_file")

# Add row number to result
echo "{\"row\": ${row_num}, \"result\": ${result}}"


medium

The current implementation embeds a JSON string ($result) inside another JSON object, rather than creating a nested JSON object. This leads to output like {"row": 1, "result": "{\"score\": ...}"} instead of the more standard and easily parsable {"row": 1, "result": {"score": ...}}. Using jq with --argjson can create the correct nested structure.

Suggested change
echo "{\"row\": ${row_num}, \"result\": ${result}}"
jq -n --argjson row "$row_num" --argjson result "$result" '{"row": $row, "result": $result}'
References
  1. In shell scripts, use jq --arg for strings and --argjson for other JSON types (like numbers) to safely pass variables into a jq filter. This avoids syntax errors if the variables contain special characters.

Comment on lines +1145 to +1153
if [[ "$total_count" -gt 0 ]]; then
local avg_score
avg_score=$(awk "BEGIN { printf \"%.3f\", $total_score / $total_count }")
local pass_rate
pass_rate=$(awk "BEGIN { printf \"%.1f\", ($pass_count / $total_count) * 100 }")
echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": ${total_count}, \"avg_score\": ${avg_score}, \"pass_rate\": \"${pass_rate}%\", \"passed\": ${pass_count}, \"failed\": $((total_count - pass_count))}}"
else
echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": 0, \"avg_score\": null, \"pass_rate\": null, \"passed\": 0, \"failed\": 0}}"
fi


medium

Manually constructing the summary JSON string is less robust than using jq. Using jq ensures proper formatting and escaping, making the script more maintainable and consistent with best practices for handling JSON in shell. This applies to both the success and fallback cases.

Suggested change
if [[ "$total_count" -gt 0 ]]; then
local avg_score
avg_score=$(awk "BEGIN { printf \"%.3f\", $total_score / $total_count }")
local pass_rate
pass_rate=$(awk "BEGIN { printf \"%.1f\", ($pass_count / $total_count) * 100 }")
echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": ${total_count}, \"avg_score\": ${avg_score}, \"pass_rate\": \"${pass_rate}%\", \"passed\": ${pass_count}, \"failed\": $((total_count - pass_count))}}"
else
echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": 0, \"avg_score\": null, \"pass_rate\": null, \"passed\": 0, \"failed\": 0}}"
fi
if [[ "$total_count" -gt 0 ]]; then
local avg_score
avg_score=$(awk "BEGIN { printf \"%.3f\", $total_score / $total_count }")
local pass_rate
pass_rate=$(awk "BEGIN { printf \"%.1f\", ($pass_count / $total_count) * 100 }")
jq -n \
--argjson r "$row_num" \
--argjson tc "$total_count" \
--argjson as "$avg_score" \
--arg pr "${pass_rate}%" \
--argjson p "$pass_count" \
--argjson f "$((total_count - pass_count))" \
'{summary: {rows: $r, evaluations: $tc, avg_score: $as, pass_rate: $pr, passed: $p, failed: $f}}'
else
jq -n --argjson r "$row_num" '{summary: {rows: $r, evaluations: 0, avg_score: null, pass_rate: null, passed: 0, failed: 0}}'
fi
References
  1. In shell scripts, use jq --arg for strings and --argjson for other JSON types (like numbers) to safely pass variables into a jq filter. This avoids syntax errors if the variables contain special characters.

@marcusquinn marcusquinn merged commit 04ea240 into main Mar 5, 2026
22 of 24 checks passed
@marcusquinn marcusquinn deleted the feature/t1394-evaluator-presets branch March 5, 2026 14:25
alex-solovyev added a commit that referenced this pull request Mar 12, 2026
Address all 10 findings from PR #2914 Gemini code review:

CRITICAL:
- awk threshold comparison already used -v flags (was already safe)

HIGH:
- Remove 2>/dev/null suppression on AI_HELPER call to expose auth/network errors
- Replace fragile sed JSON extraction with sed+grep pipeline that handles
  markdown code fences and multi-line responses
- Replace sed-based JSON field parsing with jq (.score, .details)
- Replace sed-based JSONL field parsing with jq (.input, .output, .context, .expected)

MEDIUM:
- Replace echo -e with printf to prevent backslash escape injection from
  untrusted LLM input in build_evaluator_message()
- Replace manual JSON array construction with printf | jq -s .
- Replace string-embedded result JSON with jq --argjson for proper nesting
- Replace manual summary JSON string with jq -n --argjson for safe construction

All jq-based JSON construction uses --arg for strings and --argjson for
numeric/boolean types, preventing injection via special characters.
alex-solovyev added a commit that referenced this pull request Mar 12, 2026
…helper-review-fixes

GH#3179: fix critical review feedback on ai-judgment-helper.sh (PR #2914)

Labels

enhancement Auto-created from TODO.md tag

Projects

None yet

Development

Successfully merging this pull request may close these issues.

t1394: Evaluator presets for ai-judgment-helper.sh

1 participant