t1096: Add failure mode classification and output quality rating to evaluate.sh#1632

Merged
marcusquinn merged 2 commits into main from feature/t1096 on Feb 18, 2026
Conversation

@marcusquinn (Owner) commented Feb 18, 2026

Summary

  • Add classify_failure_mode() to map outcome_detail strings to 5 categories: TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS
  • Add rate_output_quality() to derive a 3-point quality score (0=no_output, 1=partial, 2=complete) from outcome type
  • Add record_evaluation_metadata() to record richer fields to pattern tracker after each worker assessment
  • Add evaluate_worker_with_metadata() as a thin wrapper that classifies, rates, records, and returns the original verdict unchanged
  • Extend evaluate_with_ai() to request FMODE and QUALITY in the AI prompt, parsing the extended VERDICT:type:detail:FMODE:mode:QUALITY:n format with fallback to the basic format
  • Update pulse.sh to call evaluate_worker_with_metadata() instead of evaluate_worker()
  • Extend store_failure_pattern() and store_success_pattern() in memory-integration.sh to accept and propagate new fields
  • Add --failure-mode and --quality options to pattern-tracker-helper.sh cmd_record()
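The two classifiers above can be sketched as plain case statements. This is an illustrative sketch only: the outcome_detail patterns and the exact category mappings are guesses, not the strings evaluate.sh actually matches.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the deterministic classifiers described above.
# The real outcome_detail strings in evaluate.sh may differ.
classify_failure_mode() {
  local detail="$1"
  case "$detail" in
    rate_limit* | timeout* | network_error*) echo "TRANSIENT" ;;
    out_of_memory* | disk_full* | quota*)    echo "RESOURCE" ;;
    test_failure* | lint_error* | syntax*)   echo "LOGIC" ;;
    permission_denied* | needs_human*)       echo "BLOCKED" ;;
    *)                                       echo "AMBIGUOUS" ;;
  esac
}

# Quality derives from the outcome type alone — no extra AI calls.
rate_output_quality() {
  case "$1" in
    complete) echo "2" ;;  # complete output
    retry)    echo "1" ;;  # partial work exists
    *)        echo "0" ;;  # failed / no output
  esac
}
```

Because both functions are pure string-to-string mappings, they stay cheap enough to run on every evaluation.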

Design decisions

  • Tag-based approach for new fields (failure_mode:X, quality:N) avoids schema changes and matches existing pattern-tracker conventions
  • evaluate_worker_with_metadata() is a non-breaking wrapper — all existing callers of evaluate_worker() continue to work
  • AI eval extended format uses _AI_EVAL_FMODE/_AI_EVAL_QUALITY shell variables to pass extended fields from evaluate_with_ai() to the wrapper without changing the return value
  • Deterministic classification handles all known outcome_detail strings; AI classification only used for AMBIGUOUS cases
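The extended-format parse with basic-format fallback can be sketched as follows. The _AI_EVAL_FMODE/_AI_EVAL_QUALITY names follow the PR description; the grep-based extraction and the sample AI output are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of parsing the extended AI verdict, falling back to the basic
# VERDICT:type:detail format for backward compatibility.
parse_ai_verdict() {
  local ai_result="$1" line vtype vdetail
  _AI_EVAL_FMODE="" _AI_EVAL_QUALITY=""
  line=$(grep -oE 'VERDICT:[a-z]+:[a-z_]+:FMODE:[A-Z]+:QUALITY:[012]' \
    <<<"$ai_result" | head -n1 || true)
  if [[ -n "$line" ]]; then
    # Fields: VERDICT, type, detail, FMODE, mode, QUALITY, score
    IFS=':' read -r _ vtype vdetail _ _AI_EVAL_FMODE _ _AI_EVAL_QUALITY <<<"$line"
    echo "$vtype:$vdetail"
    return 0
  fi
  # Fallback: basic format, extended fields stay empty
  line=$(grep -oE 'VERDICT:[a-z]+:[a-z_]+' <<<"$ai_result" | head -n1 || true)
  [[ -n "$line" ]] && echo "${line#VERDICT:}"
}
```

The caller still receives only the original type:detail verdict on stdout; the extended fields travel out-of-band via the shell variables.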

Ref #1621

Summary by CodeRabbit

  • New Features

    • Added quality scoring (0-2 scale) and failure mode categorization for pattern tracking
    • New failure classifications: TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS, NONE
    • Enhanced metadata recording with quality and failure mode information
  • Improvements

    • Extended pattern storage to capture richer evaluation outcomes and quality signals
    • Improved filtering and analysis capabilities through extended metadata tagging

@coderabbitai bot (Contributor) commented Feb 18, 2026

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 22 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


Walkthrough

This PR enriches the evaluation and pattern tracking system with metadata capabilities. It introduces failure mode classification and quality scoring functions, extends pattern storage to accept and persist these attributes, and updates the evaluation workflow to collect and record this richer metadata throughout the supervisor pipeline.

Changes

Cohort / File(s) Summary
Pattern Tracker Core
.agents/scripts/pattern-tracker-helper.sh
Adds CLI option --quality-score to the record command. Expands failure_mode validation to accept TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS, NONE. Introduces memory_type computation (SUCCESS_PATTERN or FAILURE_PATTERN based on outcome). Extends tagging and content construction to include failure_mode and quality_score, and persists these fields to pattern_metadata table.
Evaluation Enhancement
.agents/scripts/supervisor/evaluate.sh
Introduces four public functions: classify_failure_mode() mapping outcome details to categories, rate_output_quality() deriving 0/1/2 scores, record_evaluation_metadata() persisting rich metadata, and evaluate_worker_with_metadata() wrapping core evaluation. Enhances evaluate_with_ai to parse extended FMODE and QUALITY verdict fields with fallback compatibility.
Memory Integration
.agents/scripts/supervisor/memory-integration.sh
Extends store_failure_pattern() to accept optional failure_mode and quality_score parameters, appending them to content and tags. Updates store_success_pattern() to accept quality_score parameter and include quality signals in tags and content.
Pulse Integration
.agents/scripts/supervisor/pulse.sh
Replaces direct evaluate_worker call with conditional wrapper preferring evaluate_worker_with_metadata when available, enabling metadata capture at the pulse level while maintaining backward compatibility.

Sequence Diagram

sequenceDiagram
    participant Task as Task Evaluator
    participant Eval as evaluate.sh
    participant Class as Classify & Rate
    participant Memory as Memory Storage
    participant Tracker as Pattern Tracker

    Task->>Eval: evaluate_worker_with_metadata(task_id)
    Eval->>Eval: evaluate_worker(task_id)
    Eval->>Class: classify_failure_mode(outcome_detail)
    Class-->>Eval: failure_mode (TRANSIENT|RESOURCE|LOGIC...)
    Eval->>Class: rate_output_quality(outcome_type, outcome_detail)
    Class-->>Eval: quality_score (0|1|2)
    Eval->>Eval: record_evaluation_metadata(task_id, failure_mode, quality_score)
    Eval->>Memory: store_failure_pattern() or store_success_pattern()
    Memory->>Memory: append [fmode:...] [quality:...] to content
    Memory->>Memory: extend tags with failure_mode & quality
    Memory->>Tracker: record pattern with metadata
    Tracker-->>Memory: pattern stored
    Eval-->>Task: return original verdict

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🔍 From outcome whispers, classification blooms,
Quality scores in metadata rooms,
Pattern-tracker sings with richer tales,
Failure modes and success prevails,
Zero debt maintained, A-grade prevails! ✨

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check — Passed (check skipped: CodeRabbit’s high-level summary is enabled)
  • Title Check — Passed (the title accurately and specifically summarizes the main changes: adding failure mode classification and output quality rating to evaluate.sh, which aligns with the core objectives and file modifications)
  • Docstring Coverage — Passed (100.00% coverage; the required threshold is 80.00%)


@marcusquinn marcusquinn marked this pull request as ready for review February 18, 2026 02:48
@gemini-code-assist commented

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's ability to evaluate and track worker performance by introducing granular failure mode classification and output quality ratings. These new metrics provide deeper insights into why tasks succeed or fail and the quality of their output, which will improve pattern analysis and model routing decisions. The changes are implemented with backward compatibility in mind, ensuring existing workflows remain unaffected while enabling richer data collection.

Highlights

  • Enhanced Evaluation Metrics: Introduced classify_failure_mode() to categorize worker outcome details into five distinct failure modes (TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS) and rate_output_quality() to assign a 3-point quality score (no_output, partial, complete) to worker outputs.
  • Richer Pattern Tracking: Added record_evaluation_metadata() to store these new failure mode and quality metrics, along with AI evaluation flags, into the pattern tracker, providing more granular data for analysis.
  • Non-Breaking Evaluation Wrapper: Created evaluate_worker_with_metadata() as a wrapper around the existing evaluate_worker() function. This new function performs the classification, quality rating, and metadata recording without altering the original verdict, ensuring backward compatibility.
  • AI Prompt Extension: Modified evaluate_with_ai() to request failure mode and output quality directly from the AI model, and updated its parsing logic to handle an extended verdict format (VERDICT:type:detail:FMODE:mode:QUALITY:n).
  • Integration with Pulse Cycle: Updated pulse.sh to utilize the new evaluate_worker_with_metadata() wrapper, ensuring that all worker evaluations automatically capture and record the enhanced metadata.
  • Pattern Tracker Helper Updates: Extended pattern-tracker-helper.sh to accept --failure-mode and --quality options in its record command, and updated store_failure_pattern() and store_success_pattern() in memory-integration.sh to propagate these new fields.
Changelog
  • .agents/scripts/pattern-tracker-helper.sh
    • Refactored logging functions for improved readability.
    • Added --failure-mode and --quality options to cmd_record for capturing new evaluation metrics.
    • Implemented validation for the new failure_mode and quality_score parameters.
    • Included failure_mode and quality_score in the generated tags and content for pattern records.
    • Updated help documentation to reflect the new record options.
  • .agents/scripts/supervisor/evaluate.sh
    • Added classify_failure_mode() to map outcome details to standardized failure categories.
    • Introduced rate_output_quality() to assign a numerical quality score based on task outcome.
    • Created record_evaluation_metadata() for comprehensive logging of evaluation results to the pattern tracker.
    • Modified evaluate_with_ai() to include requests for failure mode and output quality in the AI prompt.
    • Updated evaluate_with_ai() to parse an extended AI verdict format, extracting failure mode and quality score.
    • Implemented evaluate_worker_with_metadata() as a wrapper to orchestrate evaluation, classification, rating, and metadata recording.
  • .agents/scripts/supervisor/memory-integration.sh
    • Updated store_failure_pattern() to accept and include failure_mode and quality_score in pattern content and tags.
    • Updated store_success_pattern() to accept and include quality_score in pattern content and tags, defaulting to 'NONE' for failure mode.
  • .agents/scripts/supervisor/pulse.sh
    • Modified cmd_pulse() to use evaluate_worker_with_metadata() for worker evaluation, with a fallback to evaluate_worker().
    • Updated the maker field in proof logs to reflect the use of evaluate_worker_with_metadata.
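The non-breaking parameter extension in memory-integration.sh can be sketched with trailing optional positional parameters. This is a hypothetical signature — the real function's argument order and persistence call may differ; echo stands in for the memory-helper write.

```shell
#!/usr/bin/env bash
# Signature sketch: trailing optional parameters keep existing
# three-argument callers working unchanged.
store_failure_pattern() {
  local task_id="$1" pattern="$2" detail="$3"
  local failure_mode="${4:-}" quality_score="${5:-}"
  local content="[$task_id] $pattern: $detail"
  [[ -n "$failure_mode" ]] && content+=" [fmode:$failure_mode]"
  [[ -n "$quality_score" ]] && content+=" [quality:$quality_score]"
  echo "$content"  # the real function would persist via the memory helper
}
```

Old call sites simply never set the new fields, so their stored content is byte-for-byte what it was before the change.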

@github-actions commented

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 22 code smells

[INFO] Recent monitoring activity:
Wed Feb 18 02:48:56 UTC 2026: Code review monitoring started
Wed Feb 18 02:48:56 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 22

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 22
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 02:48:59 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable enhancement for failure analysis by adding failure mode classification and output quality ratings. The implementation is well-designed, using a non-breaking wrapper function (evaluate_worker_with_metadata) to introduce the new functionality, which is a great pattern. The changes are consistently applied across all relevant scripts, from evaluation and AI interaction to data storage and reporting. The extended AI prompt and robust parsing with fallbacks in evaluate.sh are particularly well-executed. I've included a couple of minor suggestions to simplify some conditional logic for improved conciseness.

@github-actions commented

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 22 code smells

[INFO] Recent monitoring activity:
Wed Feb 18 03:02:07 UTC 2026: Code review monitoring started
Wed Feb 18 03:02:08 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 22

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 22
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 03:02:10 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

…aluate.sh (t1096)

- Add classify_failure_mode(): maps outcome_detail strings to 5 categories
  (TRANSIENT, RESOURCE, LOGIC, BLOCKED, AMBIGUOUS) for pattern tracking
- Add rate_output_quality(): derives 3-point quality score (0/1/2) from
  outcome type without extra AI calls
- Add record_evaluation_metadata(): records richer fields to pattern tracker
  after each worker assessment
- Add evaluate_worker_with_metadata(): thin wrapper that calls evaluate_worker(),
  classifies failure mode, rates quality, records metadata, returns verdict unchanged
- Extend evaluate_with_ai() to request FMODE and QUALITY in AI prompt;
  parses extended VERDICT:type:detail:FMODE:mode:QUALITY:n format with
  fallback to basic format for backward compatibility
- Update pulse.sh to call evaluate_worker_with_metadata() instead of evaluate_worker()
- Extend store_failure_pattern() and store_success_pattern() in memory-integration.sh
  to accept and propagate failure_mode and quality_score fields
- Add --failure-mode and --quality options to pattern-tracker-helper.sh cmd_record()
  with validation and documentation

Chose tag-based approach for new fields (failure_mode:X, quality:N) to avoid
schema changes — matches existing pattern-tracker conventions and enables
immediate filtering via SQLite LIKE queries.
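The tag convention above can be sketched as a small tag builder. The failure_mode:X and quality:N tag names follow the commit message; the base tags and function name are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the tag-based convention: new fields ride along as tags, so
# no schema change is needed and substring filters keep working.
build_pattern_tags() {
  local failure_mode="$1" quality="$2"
  local tags="pattern,evaluation"
  [[ -n "$failure_mode" ]] && tags+=",failure_mode:$failure_mode"
  [[ -n "$quality" ]] && tags+=",quality:$quality"
  echo "$tags"
}

# The resulting tag string is then filterable with a query such as:
#   SELECT * FROM learnings WHERE tags LIKE '%failure_mode:TRANSIENT%';
```

The trade-off is that tags are strings: aggregate queries such as averaging quality would need either string extraction in SQL or, as a later review comment suggests, a dedicated integer column.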
@github-actions commented

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 23 code smells

[INFO] Recent monitoring activity:
Wed Feb 18 03:27:19 UTC 2026: Code review monitoring started
Wed Feb 18 03:27:20 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 23

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 23
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 03:27:22 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

coderabbitai bot previously requested changes Feb 18, 2026

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/pulse.sh (1)

214-231: ⚠️ Potential issue | 🟡 Minor

Proof-log maker field is inaccurate when fallback path is taken.

Line 230 unconditionally records "evaluate_worker_with_metadata" as the maker, but if the command -v check at line 217 fails, the actual evaluator used is evaluate_worker. This misattributes the proof-log entry in the fallback case.

Proposed fix: track which evaluator was actually used
+			local eval_maker="evaluate_worker"
 			if command -v evaluate_worker_with_metadata &>/dev/null; then
 				outcome=$(evaluate_worker_with_metadata "$tid" "$skip_ai")
+				eval_maker="evaluate_worker_with_metadata"
 			else
 				outcome=$(evaluate_worker "$tid" "$skip_ai")
 			fi
 			local outcome_type="${outcome%%:*}"
 			local outcome_detail="${outcome#*:}"

 			# Proof-log: record evaluation outcome (t218)
 			local _eval_duration
 			_eval_duration=$(_proof_log_stage_duration "$tid" "evaluate")
 			write_proof_log --task "$tid" --event "evaluate" --stage "evaluate" \
 				--decision "$outcome" --evidence "skip_ai=$skip_ai" \
-				--maker "evaluate_worker_with_metadata" \
+				--maker "$eval_maker" \
 				${_eval_duration:+--duration "$_eval_duration"} 2>/dev/null || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 214 - 231, The proof-log
always sets the maker to "evaluate_worker_with_metadata" even when the script
fell back to evaluate_worker; modify the code that chooses the evaluator (the
command -v check and assignment to outcome) to also set a variable (e.g.,
evaluator_maker) to the actual tool name ("evaluate_worker_with_metadata" or
"evaluate_worker") and then pass that variable to write_proof_log via the
--maker flag instead of the hardcoded string; update references around
evaluate_worker_with_metadata, evaluate_worker, outcome and the write_proof_log
invocation so the recorded maker accurately reflects which evaluator was used.
🧹 Nitpick comments (6)
.agents/scripts/supervisor/evaluate.sh (4)

767-778: Same pattern: failed case branches all return "0".

The explicit case for worker_never_started*|log_file_missing*|… and the * wildcard both output 0. If you intend these to diverge later, a # TODO comment would signal that. Otherwise, simplify.

Simplify
 	failed)
-		# worker_never_started / log missing = truly no output
-		case "$outcome_detail" in
-		worker_never_started* | log_file_missing* | log_file_empty | \
-			no_log_path_in_db* | max_retries)
-			echo "0"
-			;;
-		*)
-			echo "0"
-			;;
-		esac
+		echo "0"
 		;;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 767 - 778, The `failed)`
branch in the outcome handling duplicates behavior: both the explicit `case
"$outcome_detail"` patterns (worker_never_started*, log_file_missing*, etc.) and
the wildcard `*` echo "0"; simplify by collapsing the inner case to a single
unconditional `echo "0"` (or replace with a single-case comment like `# TODO:
differentiate outcomes later`) in the `failed)` block so you remove the
redundant `case` and make the intent clear in evaluate.sh's `failed)` outcome
handling.

745-751: Redundant if/else — both branches return the same value.

Lines 747–751 branch on work_in_progress but both paths yield echo "1". The conditional has no effect and adds cognitive noise.

Simplify
 	retry)
-		# work_in_progress = partial commits exist
-		if [[ "$outcome_detail" == "work_in_progress" ]]; then
-			echo "1"
-		else
-			echo "1"
-		fi
+		echo "1"
 		;;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 745 - 751, In the retry)
case the if/else that checks "$outcome_detail" == "work_in_progress" is
redundant because both branches echo "1"; simplify by removing the conditional
and directly outputting echo "1" in the retry) block (locate the retry) case and
the "$outcome_detail" check to edit).

836-845: Hardcoded --task-type "feature" loses task type granularity.

record_evaluation_metadata always passes --task-type "feature" to the pattern tracker. The supervisor DB likely has richer task-type info (or it could be inferred from tags/description). This means all evaluation metadata records are tagged as "feature" regardless of actual task type, degrading the usefulness of task-type-based filtering and recommendations.

If extracting the real task type is non-trivial, at least consider accepting it as a parameter or defaulting to a more neutral value like "unknown".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 836 - 845, The call to
the pattern tracker currently hardcodes --task-type "feature" which loses real
task-type information; change the invocation of "$pattern_helper" record to pass
a real task type variable (e.g., use --task-type "$task_type") and ensure
task_type is populated earlier (fallback to "unknown" if unset); update where
record is called (the pattern_helper record invocation) and add a default
assignment like task_type="${task_type:-unknown}" before the call so the tracker
receives meaningful task-type values instead of always "feature".

1351-1389: Regex for extended verdict may be too restrictive for AI-generated detail strings.

The grep pattern at line 1355:

VERDICT:[a-z]+:[a-z_]+:FMODE:[A-Z]+:QUALITY:[012]

The [a-z_]+ portion for detail won't match details containing digits or hyphens (e.g., if the AI returns backend_503_error or context-miss). This causes a silent fallback to the basic format, which then also uses [a-z_]+.

This is safe due to the fallback chain, but expanding to [a-z0-9_-]+ would better match plausible AI responses and reduce false fallbacks.

Expand character class for detail
-	verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z_]+:FMODE:[A-Z]+:QUALITY:[012]' | head -1 || true)
+	verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+:FMODE:[A-Z]+:QUALITY:[012]' | head -1 || true)

And for the basic fallback:

-	basic_verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z_]+' | head -1 || true)
+	basic_verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+' | head -1 || true)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 1351 - 1389, The
extended and basic VERDICT grep patterns are too restrictive (they use [a-z_]+)
and will miss detail strings with digits or hyphens; update the patterns used
when setting verdict_line and basic_verdict_line (the grep -oE strings) to allow
digits and hyphens (e.g., replace the detail class with [a-z0-9_-]+) so the
extended parse block (variables verdict_line, raw, verdict, fmode_part,
ai_fmode, ai_quality and exported _AI_EVAL_FMODE/_AI_EVAL_QUALITY) correctly
captures AI details and the fallback behaves less often.
.agents/scripts/supervisor/pulse.sh (1)

298-298: store_failure_pattern / store_success_pattern calls not propagating new metadata fields.

The existing calls to store_failure_pattern and store_success_pattern throughout pulse.sh (lines 298, 327, 395, 408, 429, 449) still use 3–4 args and don't pass the failure_mode / quality_score captured by evaluate_worker_with_metadata. The metadata is recorded separately via record_evaluation_metadata inside the wrapper, so pattern-tracker gets the data — but the memory-helper entries from these calls won't carry the extended fields.

This is likely acceptable as a first iteration (metadata is recorded once via pattern-tracker), but worth noting for future alignment if you want memory-integration records to also carry the enriched fields.

Also applies to: 327-327, 395-395, 408-408, 429-429, 449-449

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh at line 298, The calls to
store_failure_pattern and store_success_pattern in pulse.sh are still using only
3–4 args so they don't propagate the failure_mode and quality_score produced by
evaluate_worker_with_metadata; update each invocation of
store_failure_pattern/store_success_pattern (the same places where
evaluate_worker_with_metadata and record_evaluation_metadata are used) to accept
and forward the additional metadata fields (failure_mode and quality_score) so
memory-helper entries include the enriched metadata, ensuring call signatures
match the pattern-tracker wrapper that records evaluation metadata.
.agents/scripts/pattern-tracker-helper.sh (1)

292-305: quality_score captured in tags/content but not as a pattern_metadata column.

The quality_score (0|1|2) is recorded at line 267 in tags (quality:$quality_score) and line 279 in content ([quality:$quality_score]), making it queryable through the learnings table. However, the INSERT at line 304 does not store it as a column in pattern_metadata, preventing direct SQL aggregation on that table (e.g., SELECT AVG(quality_score) FROM pattern_metadata). Since quality_score represents output completeness rather than CI test outcome (stored as quality), consider adding it as an INTEGER column to pattern_metadata if you plan future analytics that require direct SQL queries on completeness metrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/pattern-tracker-helper.sh around lines 292 - 305, Add a new
INTEGER column quality_score to the pattern_metadata table and include it in the
INSERT/REPLACE call: introduce a shell var sql_quality_score (set to NULL or the
integer value of quality_score), populate it from the existing quality_score
variable, and add quality_score to the column list and $sql_quality_score to the
VALUES list in the sqlite3 invocation that operates on pattern_metadata; also
add a DB migration or ALTER TABLE to create the quality_score INTEGER column so
queries like SELECT AVG(quality_score) work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/pattern-tracker-helper.sh:
- Line 96: Help text is missing the new CLI flag --quality-score and currently
duplicates/confuses the existing --quality/--failure-mode options; update the
help block to include a clear entry for --quality-score (describe allowed values
0|1|2 and intended meaning) and consolidate/remove the duplicate --quality /
--failure-mode lines so callers understand that --quality accepts descriptive
strings (ci-pass-first-try|ci-pass-after-fix|needs-human) while --quality-score
maps to the numeric path that sets the quality_score variable; ensure the help
text references the exact option names (--quality-score and --quality) and align
wording with the existing validation branches t1096 (numeric quality_score) and
t1095 (string quality) so users aren’t routed to the wrong validation.

In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1434-1484: The bug is that evaluate_worker is run in a subshell so
_AI_EVAL_FMODE/_AI_EVAL_QUALITY set by evaluate_with_ai are lost; modify
evaluate_worker to append AI fields to its stdout when set (e.g., emit
"type:detail:FMODE:<mode>:QUALITY:<score>") and update
evaluate_worker_with_metadata to parse those extra colon-separated tokens from
verdict into failure_mode and quality_score (falling back to
classify_failure_mode/rate_output_quality when absent), then strip the extra
tokens before echoing the original "type:detail" to callers; update references
to _AI_EVAL_FMODE/_AI_EVAL_QUALITY handling to use the parsed values and ensure
record_evaluation_metadata gets the parsed failure_mode/quality_score.


@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 18, 2026 03:33

Auto-dismissed: bot review does not block autonomous pipeline

@github-actions commented

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 23 code smells

[INFO] Recent monitoring activity:
Wed Feb 18 03:40:59 UTC 2026: Code review monitoring started
Wed Feb 18 03:40:59 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 23

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 23
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 03:41:01 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring


@marcusquinn marcusquinn merged commit 1811738 into main Feb 18, 2026
27 checks passed
@marcusquinn marcusquinn deleted the feature/t1096 branch February 18, 2026 03:45
marcusquinn added a commit that referenced this pull request Feb 18, 2026
…d t1094.1 (t1107)

All t1094 deliverables confirmed implemented:
- Prompt strategy tracking → t1095 (pr:#1629)
- Output quality gradient + failure categorization → t1096 (pr:#1632)
- Token usage → t1095 (pr:#1629)
- A/B comparison → t1098+t1099 (pr:#1637, pr:#1634)
- Prompt-repeat strategy → t1097 (pr:#1631)
- Build-agent reference → t1094.1 (pr:#1633)

t1094 parent ready to be marked complete: verified:2026-02-18
