t1096: Add failure mode classification and output quality rating to evaluate.sh#1632
marcusquinn merged 2 commits into main from
Conversation
Walkthrough

This PR enriches the evaluation and pattern tracking system with metadata capabilities. It introduces failure mode classification and quality scoring functions, extends pattern storage to accept and persist these attributes, and updates the evaluation workflow to collect and record this richer metadata throughout the supervisor pipeline.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Task as Task Evaluator
    participant Eval as evaluate.sh
    participant Class as Classify & Rate
    participant Memory as Memory Storage
    participant Tracker as Pattern Tracker
    Task->>Eval: evaluate_worker_with_metadata(task_id)
    Eval->>Eval: evaluate_worker(task_id)
    Eval->>Class: classify_failure_mode(outcome_detail)
    Class-->>Eval: failure_mode (TRANSIENT|RESOURCE|LOGIC...)
    Eval->>Class: rate_output_quality(outcome_type, outcome_detail)
    Class-->>Eval: quality_score (0|1|2)
    Eval->>Eval: record_evaluation_metadata(task_id, failure_mode, quality_score)
    Eval->>Memory: store_failure_pattern() or store_success_pattern()
    Memory->>Memory: append [fmode:...] [quality:...] to content
    Memory->>Memory: extend tags with failure_mode & quality
    Memory->>Tracker: record pattern with metadata
    Tracker-->>Memory: pattern stored
    Eval-->>Task: return original verdict
```
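The flow in the diagram can be sketched in bash. Function names follow the PR description; the helper bodies are assumed to be sourced from `evaluate.sh`, so this is a sketch of the wrapper's shape rather than the actual implementation:

```shell
#!/usr/bin/env bash
# Sketch of the wrapper flow shown in the diagram above.
# Assumes evaluate_worker, classify_failure_mode, rate_output_quality,
# and record_evaluation_metadata are sourced from evaluate.sh.
evaluate_worker_with_metadata() {
    local task_id="$1" skip_ai="${2:-}"

    # 1. Run the original evaluator unchanged, e.g. -> "failed:timeout"
    local verdict
    verdict=$(evaluate_worker "$task_id" "$skip_ai")

    local outcome_type="${verdict%%:*}"
    local outcome_detail="${verdict#*:}"

    # 2. Classify and rate (no extra AI calls)
    local failure_mode quality_score
    failure_mode=$(classify_failure_mode "$outcome_detail")
    quality_score=$(rate_output_quality "$outcome_type" "$outcome_detail")

    # 3. Record enriched metadata to the pattern tracker
    record_evaluation_metadata "$task_id" "$failure_mode" "$quality_score"

    # 4. Return the original verdict so existing callers are unaffected
    echo "$verdict"
}
```

The key design point is step 4: the wrapper's stdout is byte-for-byte what `evaluate_worker` produced, which is what keeps it non-breaking.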
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Summary of Changes (Gemini Code Assist)

This pull request significantly enhances the system's ability to evaluate and track worker performance by introducing granular failure mode classification and output quality ratings. These new metrics provide deeper insight into why tasks succeed or fail and the quality of their output, which will improve pattern analysis and model routing decisions. The changes are implemented with backward compatibility in mind, ensuring existing workflows remain unaffected while enabling richer data collection.
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics

Generated on: Wed Feb 18 02:48:59 UTC 2026 by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces a valuable enhancement for failure analysis by adding failure mode classification and output quality ratings. The implementation is well-designed, using a non-breaking wrapper function (evaluate_worker_with_metadata) to introduce the new functionality, which is a great pattern. The changes are consistently applied across all relevant scripts, from evaluation and AI interaction to data storage and reporting. The extended AI prompt and robust parsing with fallbacks in evaluate.sh are particularly well-executed. I've included a couple of minor suggestions to simplify some conditional logic for improved conciseness.
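The "robust parsing with fallbacks" praised here can be illustrated with a small sketch. The two regexes are the ones quoted later in this review thread; the function and variable names are assumptions, not the PR's actual code:

```shell
#!/usr/bin/env bash
# Sketch: parse the extended verdict format, falling back to the basic one.
# Regexes come from the review comments; names here are illustrative.
parse_verdict() {
    local ai_result="$1" verdict_line
    # Try the extended format first: VERDICT:type:detail:FMODE:mode:QUALITY:n
    verdict_line=$(echo "$ai_result" \
        | grep -oE 'VERDICT:[a-z]+:[a-z_]+:FMODE:[A-Z]+:QUALITY:[012]' | head -1 || true)
    if [[ -n "$verdict_line" ]]; then
        local _ vtype vdetail fmode quality
        IFS=':' read -r _ vtype vdetail _ fmode _ quality <<<"$verdict_line"
        echo "$vtype:$vdetail fmode=$fmode quality=$quality"
    else
        # Fallback: basic VERDICT:type:detail; metadata must be derived elsewhere
        verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z_]+' | head -1 || true)
        echo "${verdict_line#VERDICT:} fmode=AMBIGUOUS quality=0"
    fi
}
```

Because the extended grep fails silently when the AI omits FMODE/QUALITY, the basic branch guarantees a usable verdict either way.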
Force-pushed bf045b7 to cdac1c8
…aluate.sh (t1096)

- Add classify_failure_mode(): maps outcome_detail strings to 5 categories (TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS) for pattern tracking
- Add rate_output_quality(): derives 3-point quality score (0/1/2) from outcome type without extra AI calls
- Add record_evaluation_metadata(): records richer fields to pattern tracker after each worker assessment
- Add evaluate_worker_with_metadata(): thin wrapper that calls evaluate_worker(), classifies failure mode, rates quality, records metadata, returns verdict unchanged
- Extend evaluate_with_ai() to request FMODE and QUALITY in AI prompt; parses extended VERDICT:type:detail:FMODE:mode:QUALITY:n format with fallback to basic format for backward compatibility
- Update pulse.sh to call evaluate_worker_with_metadata() instead of evaluate_worker()
- Extend store_failure_pattern() and store_success_pattern() in memory-integration.sh to accept and propagate failure_mode and quality_score fields
- Add --failure-mode and --quality options to pattern-tracker-helper.sh cmd_record() with validation and documentation

Chose tag-based approach for new fields (failure_mode:X, quality:N) to avoid schema changes; matches existing pattern-tracker conventions and enables immediate filtering via SQLite LIKE queries.
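A hedged sketch of the two classifiers described above. The category names and score semantics come from the commit message, but the specific detail strings matched below are illustrative assumptions, not the PR's actual list:

```shell
#!/usr/bin/env bash
# Illustrative sketch only: the matched detail strings are assumptions.
classify_failure_mode() {
    local detail="$1"
    case "$detail" in
        timeout* | rate_limit* | network*)      echo "TRANSIENT" ;;  # likely to pass on retry
        disk_full* | out_of_memory* | quota*)   echo "RESOURCE"  ;;  # environment constraint
        test_failure* | lint_error* | syntax*)  echo "LOGIC"     ;;  # code-level defect
        lock_held* | dependency_missing*)       echo "BLOCKED"   ;;  # waiting on something external
        *)                                      echo "AMBIGUOUS" ;;  # cannot tell from the detail
    esac
}

rate_output_quality() {
    local outcome_type="$1" outcome_detail="${2:-}"  # detail reserved for finer rules
    case "$outcome_type" in
        success) echo "2" ;;  # complete output
        retry)   echo "1" ;;  # partial work exists
        *)       echo "0" ;;  # failed / no usable output
    esac
}
```

Deriving the score from the outcome type alone is what makes this free: no extra AI call is needed, as the commit message notes.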
Force-pushed cdac1c8 to 1467cf0
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/pulse.sh (1)
Lines 214-231: ⚠️ Potential issue | 🟡 Minor

Proof-log `maker` field is inaccurate when the fallback path is taken. Line 230 unconditionally records `"evaluate_worker_with_metadata"` as the maker, but if the `command -v` check at line 217 fails, the actual evaluator used is `evaluate_worker`. This misattributes the proof-log entry in the fallback case.

Proposed fix: track which evaluator was actually used

```diff
+ local eval_maker="evaluate_worker"
  if command -v evaluate_worker_with_metadata &>/dev/null; then
      outcome=$(evaluate_worker_with_metadata "$tid" "$skip_ai")
+     eval_maker="evaluate_worker_with_metadata"
  else
      outcome=$(evaluate_worker "$tid" "$skip_ai")
  fi
  local outcome_type="${outcome%%:*}"
  local outcome_detail="${outcome#*:}"
  # Proof-log: record evaluation outcome (t218)
  local _eval_duration
  _eval_duration=$(_proof_log_stage_duration "$tid" "evaluate")
  write_proof_log --task "$tid" --event "evaluate" --stage "evaluate" \
      --decision "$outcome" --evidence "skip_ai=$skip_ai" \
-     --maker "evaluate_worker_with_metadata" \
+     --maker "$eval_maker" \
      ${_eval_duration:+--duration "$_eval_duration"} 2>/dev/null || true
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 214 - 231, The proof-log always sets the maker to "evaluate_worker_with_metadata" even when the script fell back to evaluate_worker; modify the code that chooses the evaluator (the command -v check and assignment to outcome) to also set a variable (e.g., evaluator_maker) to the actual tool name ("evaluate_worker_with_metadata" or "evaluate_worker") and then pass that variable to write_proof_log via the --maker flag instead of the hardcoded string; update references around evaluate_worker_with_metadata, evaluate_worker, outcome and the write_proof_log invocation so the recorded maker accurately reflects which evaluator was used.
🧹 Nitpick comments (6)
.agents/scripts/supervisor/evaluate.sh (4)
Lines 767-778: Same pattern: `failed` case branches all return `"0"`.

The explicit case for `worker_never_started* | log_file_missing* | …` and the `*` wildcard both output `0`. If you intend these to diverge later, a `# TODO` comment would signal that. Otherwise, simplify:

```diff
 failed)
-    # worker_never_started / log missing = truly no output
-    case "$outcome_detail" in
-        worker_never_started* | log_file_missing* | log_file_empty | \
-        no_log_path_in_db* | max_retries)
-            echo "0"
-            ;;
-        *)
-            echo "0"
-            ;;
-    esac
+    echo "0"
     ;;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/evaluate.sh around lines 767 - 778, The `failed)` branch in the outcome handling duplicates behavior: both the explicit `case "$outcome_detail"` patterns (worker_never_started*, log_file_missing*, etc.) and the wildcard `*` echo "0"; simplify by collapsing the inner case to a single unconditional `echo "0"` (or replace with a single-case comment like `# TODO: differentiate outcomes later`) in the `failed)` block so you remove the redundant `case` and make the intent clear in evaluate.sh's `failed)` outcome handling.
Lines 745-751: Redundant `if/else`; both branches return the same value.

Lines 747-751 branch on `work_in_progress` but both paths yield `echo "1"`. The conditional has no effect and adds cognitive noise. Simplify:

```diff
 retry)
-    # work_in_progress = partial commits exist
-    if [[ "$outcome_detail" == "work_in_progress" ]]; then
-        echo "1"
-    else
-        echo "1"
-    fi
+    echo "1"
     ;;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/evaluate.sh around lines 745 - 751, In the retry) case the if/else that checks "$outcome_detail" == "work_in_progress" is redundant because both branches echo "1"; simplify by removing the conditional and directly outputting echo "1" in the retry) block (locate the retry) case and the "$outcome_detail" check to edit).
Lines 836-845: Hardcoded `--task-type "feature"` loses task-type granularity.

`record_evaluation_metadata` always passes `--task-type "feature"` to the pattern tracker. The supervisor DB likely has richer task-type info (or it could be inferred from tags/description). This means all evaluation metadata records are tagged as "feature" regardless of actual task type, degrading the usefulness of task-type-based filtering and recommendations.

If extracting the real task type is non-trivial, at least consider accepting it as a parameter or defaulting to a more neutral value like `"unknown"`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/evaluate.sh around lines 836 - 845, The call to the pattern tracker currently hardcodes --task-type "feature" which loses real task-type information; change the invocation of "$pattern_helper" record to pass a real task type variable (e.g., use --task-type "$task_type") and ensure task_type is populated earlier (fallback to "unknown" if unset); update where record is called (the pattern_helper record invocation) and add a default assignment like task_type="${task_type:-unknown}" before the call so the tracker receives meaningful task-type values instead of always "feature".
Lines 1351-1389: Regex for extended verdict may be too restrictive for AI-generated detail strings.

The grep pattern at line 1355:

```
VERDICT:[a-z]+:[a-z_]+:FMODE:[A-Z]+:QUALITY:[012]
```

The `[a-z_]+` portion for `detail` won't match details containing digits or hyphens (e.g., if the AI returns `backend_503_error` or `context-miss`). This causes a silent fallback to the basic format, which then also uses `[a-z_]+`.

This is safe due to the fallback chain, but expanding to `[a-z0-9_-]+` would better match plausible AI responses and reduce false fallbacks.

Expand the character class for detail:

```diff
- verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z_]+:FMODE:[A-Z]+:QUALITY:[012]' | head -1 || true)
+ verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+:FMODE:[A-Z]+:QUALITY:[012]' | head -1 || true)
```

And for the basic fallback:

```diff
- basic_verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z_]+' | head -1 || true)
+ basic_verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+' | head -1 || true)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/evaluate.sh around lines 1351 - 1389, the extended and basic VERDICT grep patterns are too restrictive (they use [a-z_]+) and will miss detail strings with digits or hyphens; update the patterns used when setting verdict_line and basic_verdict_line (the grep -oE strings) to allow digits and hyphens (e.g., replace the detail class with [a-z0-9_-]+) so the extended parse block (variables verdict_line, raw, verdict, fmode_part, ai_fmode, ai_quality and exported _AI_EVAL_FMODE/_AI_EVAL_QUALITY) correctly captures AI details and the fallback behaves less often.

.agents/scripts/supervisor/pulse.sh (1)
Line 298: `store_failure_pattern`/`store_success_pattern` calls not propagating new metadata fields.

The existing calls to `store_failure_pattern` and `store_success_pattern` throughout `pulse.sh` (lines 298, 327, 395, 408, 429, 449) still use 3–4 args and don't pass the `failure_mode`/`quality_score` captured by `evaluate_worker_with_metadata`. The metadata is recorded separately via `record_evaluation_metadata` inside the wrapper, so the pattern tracker gets the data, but the memory-helper entries from these calls won't carry the extended fields.

This is likely acceptable as a first iteration (metadata is recorded once via pattern-tracker), but worth noting for future alignment if you want memory-integration records to also carry the enriched fields.
Also applies to: 327-327, 395-395, 408-408, 429-429, 449-449
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh at line 298, the calls to store_failure_pattern and store_success_pattern in pulse.sh are still using only 3–4 args so they don't propagate the failure_mode and quality_score produced by evaluate_worker_with_metadata; update each invocation of store_failure_pattern/store_success_pattern (the same places where evaluate_worker_with_metadata and record_evaluation_metadata are used) to accept and forward the additional metadata fields (failure_mode and quality_score) so memory-helper entries include the enriched metadata, ensuring call signatures match the pattern-tracker wrapper that records evaluation metadata.

.agents/scripts/pattern-tracker-helper.sh (1)
Lines 292-305: `quality_score` captured in tags/content but not as a `pattern_metadata` column.

The `quality_score` (0|1|2) is recorded at line 267 in tags (`quality:$quality_score`) and line 279 in content (`[quality:$quality_score]`), making it queryable through the learnings table. However, the INSERT at line 304 does not store it as a column in `pattern_metadata`, preventing direct SQL aggregation on that table (e.g., `SELECT AVG(quality_score) FROM pattern_metadata`). Since `quality_score` represents output completeness rather than CI test outcome (stored as `quality`), consider adding it as an INTEGER column to `pattern_metadata` if you plan future analytics that require direct SQL queries on completeness metrics.
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/pattern-tracker-helper.sh around lines 292 - 305, Add a new INTEGER column quality_score to the pattern_metadata table and include it in the INSERT/REPLACE call: introduce a shell var sql_quality_score (set to NULL or the integer value of quality_score), populate it from the existing quality_score variable, and add quality_score to the column list and $sql_quality_score to the VALUES list in the sqlite3 invocation that operates on pattern_metadata; also add a DB migration or ALTER TABLE to create the quality_score INTEGER column so queries like SELECT AVG(quality_score) work.
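If the reviewer's suggested column were adopted, the migration and a direct aggregation might look like the following sketch. The table and column names come from the comment above; the database path and seed rows are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: add the suggested quality_score INTEGER column and aggregate on it.
# Database path and sample rows are hypothetical.
db="$(mktemp -u).db"
sqlite3 "$db" <<'SQL'
CREATE TABLE pattern_metadata (task_id TEXT, failure_mode TEXT);
-- Migration suggested by the review: add the new INTEGER column
ALTER TABLE pattern_metadata ADD COLUMN quality_score INTEGER;
INSERT INTO pattern_metadata VALUES ('t1', 'TRANSIENT', 0), ('t2', 'LOGIC', 2);
SQL
# Direct SQL aggregation now works without LIKE-matching tags
sqlite3 "$db" "SELECT AVG(quality_score) FROM pattern_metadata;"
```

With the column in place, completeness analytics become a plain aggregate query instead of string-matching the tags field.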
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/pattern-tracker-helper.sh:
- Line 96: Help text is missing the new CLI flag --quality-score and currently
duplicates/confuses the existing --quality/--failure-mode options; update the
help block to include a clear entry for --quality-score (describe allowed values
0|1|2 and intended meaning) and consolidate/remove the duplicate --quality /
--failure-mode lines so callers understand that --quality accepts descriptive
strings (ci-pass-first-try|ci-pass-after-fix|needs-human) while --quality-score
maps to the numeric path that sets the quality_score variable; ensure the help
text references the exact option names (--quality-score and --quality) and align
wording with the existing validation branches t1096 (numeric quality_score) and
t1095 (string quality) so users aren’t routed to the wrong validation.
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1434-1484: The bug is that evaluate_worker is run in a subshell so
_AI_EVAL_FMODE/_AI_EVAL_QUALITY set by evaluate_with_ai are lost; modify
evaluate_worker to append AI fields to its stdout when set (e.g., emit
"type:detail:FMODE:<mode>:QUALITY:<score>") and update
evaluate_worker_with_metadata to parse those extra colon-separated tokens from
verdict into failure_mode and quality_score (falling back to
classify_failure_mode/rate_output_quality when absent), then strip the extra
tokens before echoing the original "type:detail" to callers; update references
to _AI_EVAL_FMODE/_AI_EVAL_QUALITY handling to use the parsed values and ensure
record_evaluation_metadata gets the parsed failure_mode/quality_score.
Auto-dismissed: bot review does not block autonomous pipeline
…d t1094.1 (t1107)

All t1094 deliverables confirmed implemented:

- Prompt strategy tracking → t1095 (pr:#1629)
- Output quality gradient + failure categorization → t1096 (pr:#1632)
- Token usage → t1095 (pr:#1629)
- A/B comparison → t1098+t1099 (pr:#1637, pr:#1634)
- Prompt-repeat strategy → t1097 (pr:#1631)
- Build-agent reference → t1094.1 (pr:#1633)

t1094 parent ready to be marked complete: verified:2026-02-18



Summary
- classify_failure_mode() to map outcome_detail strings to 5 categories: TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS
- rate_output_quality() to derive a 3-point quality score (0=no_output, 1=partial, 2=complete) from outcome type
- record_evaluation_metadata() to record richer fields to the pattern tracker after each worker assessment
- evaluate_worker_with_metadata() as a thin wrapper that classifies, rates, records, and returns the original verdict unchanged
- evaluate_with_ai() extended to request FMODE and QUALITY in the AI prompt; parses the extended VERDICT:type:detail:FMODE:mode:QUALITY:n format with fallback to the basic format
- pulse.sh updated to call evaluate_worker_with_metadata() instead of evaluate_worker()
- store_failure_pattern() and store_success_pattern() in memory-integration.sh extended to accept and propagate the new fields
- --failure-mode and --quality options added to pattern-tracker-helper.sh cmd_record()

Design decisions

- Tag-based approach for the new fields (failure_mode:X, quality:N) avoids schema changes and matches existing pattern-tracker conventions
- evaluate_worker_with_metadata() is a non-breaking wrapper; all existing callers of evaluate_worker() continue to work
- _AI_EVAL_FMODE/_AI_EVAL_QUALITY shell variables pass the extended fields from evaluate_with_ai() to the wrapper without changing the return value

Ref #1621
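The tag-based design can be exercised directly. A sketch of the LIKE-query filtering this enables — the `learnings` table name comes from the review thread, while the database path and sample rows are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: filter memory records by the new tag fields with SQLite LIKE.
# The learnings table name is from the review thread; rows are hypothetical.
db="$(mktemp -u).db"
sqlite3 "$db" <<'SQL'
CREATE TABLE learnings (content TEXT, tags TEXT);
INSERT INTO learnings VALUES
  ('timeout: lock held [fmode:TRANSIENT] [quality:0]', 'failure,failure_mode:TRANSIENT,quality:0'),
  ('feature shipped [quality:2]',                      'success,quality:2');
SQL
# Immediate filtering on the new tag fields, no schema change required
sqlite3 "$db" "SELECT content FROM learnings WHERE tags LIKE '%failure_mode:TRANSIENT%';"
```

This is the trade-off the design note describes: LIKE filtering works immediately against the existing schema, at the cost of string matching instead of typed columns.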