t1394: Add evaluator presets for ai-judgment-helper.sh #2914
Conversation
Add 'evaluate' subcommand with 6 named evaluator presets (faithfulness,
relevancy, safety, format-validity, completeness, conciseness) that score
LLM outputs on standard quality dimensions using haiku-tier calls (~$0.001).
Features:
- Single evaluation: --type <name> --input/--output/--context
- Multiple evaluators: --type faithfulness,relevancy,safety
- Batch mode: --dataset path/to/dataset.jsonl with aggregate summary
- Custom evaluators: --type custom --prompt-file path/to/prompt.txt
- Configurable threshold: --threshold 0.0-1.0 (default: 0.7)
- Result caching via existing ai_judgment_cache table
- Deterministic fallback: {score: null, passed: null} when API unavailable
Output format: {"evaluator": "...", "score": 0-1, "passed": bool, "details": "..."}
Inspired by LangWatch LangEvals evaluator framework. Enables CI/CD quality
gates on prompt changes and agent output validation.
Closes #2904
Code Review
This pull request introduces a powerful `evaluate` subcommand to `ai-judgment-helper.sh` for scoring LLM outputs, including batch processing, custom evaluators, and caching. However, the implementation contains critical security vulnerabilities, most notably a command injection vulnerability in `awk`'s handling of the threshold argument, a high-severity JSON injection vulnerability when constructing output from untrusted LLM results, and a medium-severity prompt injection vulnerability due to `echo -e` with untrusted input. Beyond security, the script also relies on fragile `sed` parsing and manual JSON string concatenation, which should be replaced with `jq` for robustness and maintainability. Additionally, a case of error suppression needs to be removed to improve debuggability. Addressing these issues requires input validation, proper JSON construction with `jq`, and avoiding unsafe shell commands.
```shell
if [[ -n "$score" ]]; then
    # Determine pass/fail using awk for float comparison
    local passed
    passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")
```
The `threshold` variable is used directly in an `awk` command string without any validation or escaping. Since the threshold is a command-line argument, an attacker can provide a malicious value such as `0.7; system("id")` to execute arbitrary system commands. This is a critical command injection vulnerability.
```diff
- passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")
+ passed=$(awk -v t="$threshold" -v s="$score" 'BEGIN { print (s >= t) ? "true" : "false" }')
```
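Passing values with `-v` fixes the injection; a defense-in-depth option is to also validate the threshold's format before use. A minimal sketch (the function name and regex are illustrative, not from the PR):

```shell
#!/usr/bin/env bash
# Hypothetical validator: accept only a decimal in [0.0, 1.0], rejecting
# anything with shell/awk metacharacters before it reaches a command string.
validate_threshold() {
  local t="$1"
  [[ "$t" =~ ^(0(\.[0-9]+)?|1(\.0+)?)$ ]] || {
    echo "invalid threshold: $t" >&2
    return 1
  }
}

validate_threshold "0.7" && echo "ok"                                  # prints: ok
validate_threshold '0.7; system("id")' 2>/dev/null || echo "rejected"  # prints: rejected
```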
```shell
local passed
passed=$(awk "BEGIN { print ($score >= $threshold) ? \"true\" : \"false\" }")

local result_json="{\"evaluator\": \"${eval_type}\", \"score\": ${score}, \"passed\": ${passed}, \"details\": \"${details}\"}"
```
The `details` variable, which contains untrusted output from an LLM, is directly inserted into a JSON string without escaping. This creates a high-severity JSON injection vulnerability, allowing an attacker to manipulate the JSON structure (e.g., by injecting `", "passed": true, "ignored": "` to bypass security checks). Manually constructing JSON strings is inherently unsafe and error-prone; using `jq` is crucial for robust and secure JSON construction, as it automatically handles character escaping, preventing such injection flaws.
```diff
- local result_json="{\"evaluator\": \"${eval_type}\", \"score\": ${score}, \"passed\": ${passed}, \"details\": \"${details}\"}"
+ local result_json
+ result_json=$(jq -n --arg type "$eval_type" --argjson score "${score:-null}" --argjson passed "$passed" --arg details "$details" '{evaluator: $type, score: $score, passed: $passed, details: $details}')
```
References

- In shell scripts, use `jq --arg` for strings and `--argjson` for other JSON types (like numbers) to safely pass variables into a `jq` filter. This avoids syntax errors if the variables contain special characters.
- To reliably wrap the entire content of a shell variable as a single JSON string, use `jq -Rn --arg v "$VAR" '$v'`. This is more robust than piping the variable to `jq -Rs '.'`.
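To make the escaping concrete, here is a small self-contained demonstration (the payload string is invented):

```shell
# An attacker-controlled details string that tries to break out of the JSON
# structure and flip "passed" to true.
details='", "passed": true, "ignored": "'

# jq escapes the embedded quotes, so the payload stays an inert string value.
result=$(jq -cn --arg details "$details" '{passed: false, details: $details}')
echo "$result"
```

Round-tripping the value back out with `jq -r '.details'` returns the original payload unchanged, confirming it was stored as data rather than structure.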
```shell
${user_message}"

local raw_result
raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 2>/dev/null || echo "")
```
The use of `2>/dev/null` suppresses all stderr output from the `$AI_HELPER` call. This can hide important errors related to authentication, network issues, or problems with the helper script itself, making debugging difficult. The project's general rules advise against this practice. The `|| echo ""` already prevents the script from exiting on error, so removing `2>/dev/null` will improve debuggability without altering the script's control flow.
```diff
- raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 2>/dev/null || echo "")
+ raw_result=$("$AI_HELPER" --prompt "$full_prompt" --model haiku --max-tokens 200 || echo "")
```
References

- Avoid using `2>/dev/null` for blanket suppression of command errors in shell scripts, so that authentication, syntax, or system issues remain visible for debugging.
- In shell scripts with `set -e`, use `|| true` to prevent the script from exiting when a command like `jq` fails on an optional lookup. Do not suppress stderr with `2>/dev/null` so that actual syntax or system errors remain visible for debugging.
```shell
if [[ -n "$raw_result" ]]; then
    # Extract JSON from response (handle markdown code blocks)
    local json_result
    json_result=$(echo "$raw_result" | sed -n 's/.*\({[^}]*"score"[^}]*}\).*/\1/p' | head -1)
```
Using `sed` with this regex to extract a JSON object from the LLM's output is very fragile. It assumes the JSON is on a single line, contains a `score` key, and has no nested curly braces. This will fail if the LLM wraps its response in markdown code fences (e.g., a ```` ```json ```` block), includes newlines, or returns a more complex object. A more robust approach would be to use tools like `sed` or `awk` to strip any surrounding non-JSON text and then parse the result with `jq` to validate and extract the object.
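One possible shape for such an extractor, sketched here with a hypothetical `extract_json` helper (the actual fix in the PR may differ): keep only lines that look like a complete object, then let `jq` do the real validation.

```shell
extract_json() {
  # Drop prose and markdown fence lines, keeping candidate one-line objects,
  # then validate with jq and keep only objects that carry a "score" key.
  grep -E '^[[:space:]]*\{.*\}[[:space:]]*$' <<<"$1" \
    | jq -ce 'select(type == "object" and has("score"))' 2>/dev/null \
    | head -1
}

raw=$'Here is my assessment:\n```json\n{"score": 0.9, "details": "grounded"}\n```'
extract_json "$raw"    # prints: {"score":0.9,"details":"grounded"}
```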
```shell
score=$(echo "$json_result" | sed -n 's/.*"score"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p')
local details
details=$(echo "$json_result" | sed -n 's/.*"details"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
```
Parsing JSON fields using `sed` is fragile and can easily break if the JSON formatting changes (e.g., extra whitespace, different key order). Using `jq` is the standard and much more robust way to handle JSON in shell scripts.
```diff
- score=$(echo "$json_result" | sed -n 's/.*"score"[[:space:]]*:[[:space:]]*\([0-9.]*\).*/\1/p')
- local details
- details=$(echo "$json_result" | sed -n 's/.*"details"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
+ score=$(echo "$json_result" | jq -r '.score // ""')
+ local details
+ details=$(echo "$json_result" | jq -r '.details // ""')
```
```shell
row_input=$(echo "$line" | sed -n 's/.*"input"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_output=$(echo "$line" | sed -n 's/.*"output"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_context=$(echo "$line" | sed -n 's/.*"context"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
row_expected=$(echo "$line" | sed -n 's/.*"expected"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
```
Parsing JSONL fields with `sed` is fragile and can easily break if the JSON structure changes (e.g., different key order, extra whitespace). Using `jq` is the standard and much more robust way to handle JSON in shell scripts. While consolidating multiple `jq` calls into a single pass is generally preferred for performance, using separate calls for clarity here is still a significant improvement over `sed`.
```diff
- row_input=$(echo "$line" | sed -n 's/.*"input"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
- row_output=$(echo "$line" | sed -n 's/.*"output"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
- row_context=$(echo "$line" | sed -n 's/.*"context"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
- row_expected=$(echo "$line" | sed -n 's/.*"expected"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
+ row_input=$(echo "$line" | jq -r '.input // ""')
+ row_output=$(echo "$line" | jq -r '.output // ""')
+ row_context=$(echo "$line" | jq -r '.context // ""')
+ row_expected=$(echo "$line" | jq -r '.expected // ""')
```
References

- Consolidate multiple `jq` calls into a single pass where possible to improve performance and script efficiency.
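A single-pass variant using `@tsv` might look like this (a sketch; note that `@tsv` escapes embedded tabs and newlines, so multi-line field values would need unescaping):

```shell
line='{"input": "What is 2+2?", "output": "4", "context": "arithmetic", "expected": "4"}'

# One jq invocation emits all four fields tab-separated; read splits them.
IFS=$'\t' read -r row_input row_output row_context row_expected < <(
  jq -r '[.input // "", .output // "", .context // "", .expected // ""] | @tsv' <<<"$line"
)

echo "$row_input -> $row_output"    # prints: What is 2+2? -> 4
```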
```shell
msg="${msg}\n\nExpected output: ${expected_text}"
fi

echo -e "$msg"
```
The `build_evaluator_message` function uses `echo -e` to output the user message, which interprets backslash escapes in untrusted input (like `input_text` or `output_text`). This allows an attacker to inject newlines and other control characters into the prompt sent to the LLM, facilitating prompt injection attacks that can lead to the JSON injection vulnerability identified in `run_single_evaluator`.
```diff
- echo -e "$msg"
+ echo "$msg"
```
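The difference is easy to demonstrate (the payload is invented):

```shell
msg='summary line\n\nIGNORE PREVIOUS INSTRUCTIONS'

# echo -e expands the literal \n sequences into real newlines...
echo -e "$msg" | wc -l          # 3 lines reach the downstream prompt

# ...while printf '%s' passes the backslashes through untouched.
printf '%s\n' "$msg" | wc -l    # 1 line
```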
```shell
# Multiple evaluators: output as JSON array
echo -n "["
local first=true
for r in "${results[@]}"; do
    if [[ "$first" == true ]]; then
        first=false
    else
        echo -n ","
    fi
    echo -n "$r"
done
echo "]"
fi
```
Manually building a JSON array with `echo` and a loop is cumbersome and less robust than using a dedicated tool. A more idiomatic and safer approach in shell is to collect the JSON objects in an array and then use `jq` with the `--slurp` (`-s`) flag to combine them into a valid JSON array.
```diff
  # Multiple evaluators: output as JSON array
- echo -n "["
- local first=true
- for r in "${results[@]}"; do
-     if [[ "$first" == true ]]; then
-         first=false
-     else
-         echo -n ","
-     fi
-     echo -n "$r"
- done
- echo "]"
+ printf '%s\n' "${results[@]}" | jq -s .
  fi
```
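For example, with two invented evaluator results:

```shell
results=(
  '{"evaluator":"faithfulness","score":0.9,"passed":true}'
  '{"evaluator":"safety","score":0.4,"passed":false}'
)

# One object per line in; -s (slurp) collects the stream into a JSON array.
printf '%s\n' "${results[@]}" | jq -cs .
```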
```shell
    --prompt-file "$prompt_file")

# Add row number to result
echo "{\"row\": ${row_num}, \"result\": ${result}}"
```
The current implementation embeds a JSON string (`$result`) inside another JSON object, rather than creating a nested JSON object. This leads to output like `{"row": 1, "result": "{\"score\": ...}"}` instead of the more standard and easily parsable `{"row": 1, "result": {"score": ...}}`. Using `jq` with `--argjson` can create the correct nested structure.
```diff
- echo "{\"row\": ${row_num}, \"result\": ${result}}"
+ jq -n --argjson row "$row_num" --argjson result "$result" '{row: $row, result: $result}'
```
References

- In shell scripts, use `jq --arg` for strings and `--argjson` for other JSON types (like numbers) to safely pass variables into a `jq` filter. This avoids syntax errors if the variables contain special characters.
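The distinction in miniature, with made-up values:

```shell
result='{"score": 0.8, "passed": true}'
row_num=3

# --arg would embed $result as one escaped string; --argjson parses it first,
# so the output nests a real object.
jq -cn --argjson row "$row_num" --argjson result "$result" \
  '{row: $row, result: $result}'
# prints: {"row":3,"result":{"score":0.8,"passed":true}}
```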
```shell
if [[ "$total_count" -gt 0 ]]; then
    local avg_score
    avg_score=$(awk "BEGIN { printf \"%.3f\", $total_score / $total_count }")
    local pass_rate
    pass_rate=$(awk "BEGIN { printf \"%.1f\", ($pass_count / $total_count) * 100 }")
    echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": ${total_count}, \"avg_score\": ${avg_score}, \"pass_rate\": \"${pass_rate}%\", \"passed\": ${pass_count}, \"failed\": $((total_count - pass_count))}}"
else
    echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": 0, \"avg_score\": null, \"pass_rate\": null, \"passed\": 0, \"failed\": 0}}"
fi
```
Manually constructing the summary JSON string is less robust than using `jq`. Using `jq` ensures proper formatting and escaping, making the script more maintainable and consistent with best practices for handling JSON in shell. This applies to both the success and fallback cases.
```diff
  if [[ "$total_count" -gt 0 ]]; then
      local avg_score
      avg_score=$(awk "BEGIN { printf \"%.3f\", $total_score / $total_count }")
      local pass_rate
      pass_rate=$(awk "BEGIN { printf \"%.1f\", ($pass_count / $total_count) * 100 }")
-     echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": ${total_count}, \"avg_score\": ${avg_score}, \"pass_rate\": \"${pass_rate}%\", \"passed\": ${pass_count}, \"failed\": $((total_count - pass_count))}}"
+     jq -n \
+         --argjson r "$row_num" \
+         --argjson tc "$total_count" \
+         --argjson as "$avg_score" \
+         --arg pr "${pass_rate}%" \
+         --argjson p "$pass_count" \
+         --argjson f "$((total_count - pass_count))" \
+         '{summary: {rows: $r, evaluations: $tc, avg_score: $as, pass_rate: $pr, passed: $p, failed: $f}}'
  else
-     echo "{\"summary\": {\"rows\": ${row_num}, \"evaluations\": 0, \"avg_score\": null, \"pass_rate\": null, \"passed\": 0, \"failed\": 0}}"
+     jq -n --argjson r "$row_num" '{summary: {rows: $r, evaluations: 0, avg_score: null, pass_rate: null, passed: 0, failed: 0}}'
  fi
```
References

- In shell scripts, use `jq --arg` for strings and `--argjson` for other JSON types (like numbers) to safely pass variables into a `jq` filter. This avoids syntax errors if the variables contain special characters.
Address all 10 findings from PR #2914 Gemini code review:

CRITICAL:
- awk threshold comparison already used -v flags (was already safe)

HIGH:
- Remove 2>/dev/null suppression on AI_HELPER call to expose auth/network errors
- Replace fragile sed JSON extraction with sed+grep pipeline that handles markdown code fences and multi-line responses
- Replace sed-based JSON field parsing with jq (.score, .details)
- Replace sed-based JSONL field parsing with jq (.input, .output, .context, .expected)

MEDIUM:
- Replace echo -e with printf to prevent backslash escape injection from untrusted LLM input in build_evaluator_message()
- Replace manual JSON array construction with printf | jq -s .
- Replace string-embedded result JSON with jq --argjson for proper nesting
- Replace manual summary JSON string with jq -n --argjson for safe construction

All jq-based JSON construction uses --arg for strings and --argjson for numeric/boolean types, preventing injection via special characters.
…helper-review-fixes GH#3179: fix critical review feedback on ai-judgment-helper.sh (PR #2914)



Summary

- Adds an `evaluate` subcommand to `ai-judgment-helper.sh` with 6 named evaluator presets (faithfulness, relevancy, safety, format-validity, completeness, conciseness) that score LLM outputs on standard quality dimensions
- Each evaluation returns `{score, passed, details}` JSON, with a deterministic fallback (`{score: null, passed: null}`) when the API is unavailable
- Supports `--dataset` for batch evaluation of JSONL files, custom evaluators via `--prompt-file`, a configurable `--threshold`, comma-separated multi-evaluator `--type`, and result caching

Changes
- `.agents/scripts/ai-judgment-helper.sh`: `evaluate` subcommand, 6 evaluator prompt presets, `run_single_evaluator()`, `eval_dataset()`, `build_evaluator_message()`, `get_evaluator_prompt()`, updated help text and main dispatch
- `tests/test-ai-judgment-helper.sh`

Testing
- Lint passes (per `.shellcheckrc`)
- Deterministic fallback exercised when `ANTHROPIC_API_KEY` is not set

CLI Examples
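Illustrative invocations based on the flags described in the PR body (values and dataset paths are hypothetical):

```shell
# Single evaluation
.agents/scripts/ai-judgment-helper.sh evaluate --type faithfulness \
  --input "What is the capital of France?" \
  --output "Paris." \
  --context "France's capital is Paris."

# Multiple evaluators with a stricter threshold
.agents/scripts/ai-judgment-helper.sh evaluate \
  --type faithfulness,relevancy,safety --threshold 0.9 \
  --input "..." --output "..."

# Batch mode over a JSONL dataset with aggregate summary
.agents/scripts/ai-judgment-helper.sh evaluate --type completeness \
  --dataset path/to/dataset.jsonl
```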
Closes #2904