feat(supervisor): self-healing diagnostic subtasks on failure/block (t147.5, t150) by marcusquinn · Pull Request #464 · marcusquinn/aidevops

marcusquinn · 2026-02-07T19:20:58Z

Summary

t147.5: Triaged all 3 review threads from PR #406, "feat: add skill scan results audit trail" (sed -i portability, severity attribution, stale severity table). All were already fixed in prior work: PR #447, "fix: sed portability, speech-to-speech reliability (t145, t141)", and the current code. Replied with evidence on each thread.
t150: Implemented supervisor self-healing: on task failure or block, the supervisor auto-creates a diagnostic subtask that investigates the root cause; when the diagnostic succeeds, the parent is re-queued.

Self-Healing Design

When a task reaches blocked or failed state, the supervisor can now automatically create a diagnostic subtask (e.g., t001.diag) that:

Inherits the parent's repo, model, and batch membership
Includes failure context in its description (error, original task description, log path)
Is dispatched like any other task in the batch
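
A minimal sketch of how the subtask ID and description might be assembled. The function names, argument order, and description wording are assumptions for illustration; the actual create_diagnostic_subtask() in supervisor-helper.sh reads these values from the supervisor database and may format them differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: derive a diagnostic subtask ID and description
# from the parent's failure context. Not the script's actual format.

# Prints the diagnostic ID for a parent task, e.g. t001 -> t001.diag
diagnostic_task_id() {
    local parent_id="$1"
    printf '%s.diag\n' "$parent_id"
}

# Prints a multi-line description carrying the failure context
# (error, original task description, log path).
diagnostic_description() {
    local parent_id="$1" parent_desc="$2" error="$3" log_path="$4"
    printf 'Diagnose failure of %s\n' "$parent_id"
    printf 'Original task: %s\n' "$parent_desc"
    printf 'Error: %s\n' "$error"
    printf 'Log: %s\n' "$log_path"
}
```

For example, `diagnostic_task_id t001` prints `t001.diag`, matching the naming shown above.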

When the diagnostic task completes, the pulse cycle detects it and:

Re-queues the parent task (blocked -> queued)
Resets retry counter to 0 (fresh attempts)
Stores the self-healing success in memory
Sends a notification
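
The re-queue step can be pictured as a small state transition. This is an illustration only: the function name, the `complete` status string, and the handling of `failed` parents are assumptions, not the logic in check_diagnostic_completions(). The description above lists only blocked -> queued explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide the parent's next state once its diagnostic
# subtask has finished. Prints "<new_state> <new_retry_count>".
parent_after_diagnostic() {
    local parent_state="$1" retries="$2" diag_state="$3"

    # Assumption: "complete" marks a successful diagnostic.
    if [ "$diag_state" = "complete" ]; then
        case "$parent_state" in
            # blocked -> queued is from the PR description;
            # treating failed the same way is an assumption.
            blocked|failed)
                echo "queued 0"   # retry counter reset for fresh attempts
                return 0
                ;;
        esac
    fi

    # Otherwise leave the parent untouched.
    echo "$parent_state $retries"
}
```

Memory storage and notification are left out of the sketch because they depend on helpers specific to the supervisor.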

Eligibility Rules (prevent infinite loops)

Diagnostic tasks cannot self-heal themselves
Max 1 heal attempt per task (configurable via SUPERVISOR_MAX_HEAL_ATTEMPTS)
auth_error, out_of_memory, merge_conflict are ineligible (require human intervention)
No duplicate diagnostics (skips if one is already active)
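
The four rules can be read as a chain of guards. The sketch below is not the PR's is_self_heal_eligible(); its name, its argument list, and the example failure type `timeout` are hypothetical. Only the three ineligible failure types and SUPERVISOR_MAX_HEAL_ATTEMPTS (default 1) come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the eligibility rules. Returns 0 if eligible.
# Args: is_diagnostic (0/1), heal_attempts, failure_type, has_active_diagnostic (0/1)
is_eligible_sketch() {
    local is_diagnostic="$1" heal_attempts="$2" failure_type="$3" has_active_diag="$4"
    local max_attempts="${SUPERVISOR_MAX_HEAL_ATTEMPTS:-1}"

    # Rule 1: diagnostic tasks never heal themselves (no recursion).
    [ "$is_diagnostic" -eq 0 ] || return 1

    # Rule 2: respect the per-task heal attempt cap.
    [ "$heal_attempts" -lt "$max_attempts" ] || return 1

    # Rule 3: these failure types need a human.
    case "$failure_type" in
        auth_error|out_of_memory|merge_conflict) return 1 ;;
    esac

    # Rule 4: skip if a diagnostic is already active for this task.
    [ "$has_active_diag" -eq 0 ] || return 1

    return 0
}
```

With the default cap, `is_eligible_sketch 0 0 timeout 0` succeeds, while a task that has already been healed once (`heal_attempts` of 1) is rejected unless the cap is raised.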

New Commands

supervisor-helper.sh self-heal <task_id> - manually trigger for a specific task
supervisor-helper.sh self-heal --batch <id> - heal all eligible in a batch
Automatic: pulse cycle Phase 1b/1c runs self-healing checks each pulse
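
As a rough illustration of the two manual forms, the argument handling could look like the sketch below. The function name, output format, and usage messages are hypothetical; cmd_self_heal in the PR may parse its arguments differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of argument handling for `self-heal`.
# Prints "task <id>" or "batch <id>"; returns 1 on a usage error.
parse_self_heal_args() {
    case "${1:-}" in
        --batch)
            if [ -z "${2:-}" ]; then
                echo "usage: self-heal --batch <id>" >&2
                return 1
            fi
            echo "batch $2"
            ;;
        "")
            echo "usage: self-heal <task_id> | --batch <id>" >&2
            return 1
            ;;
        *)
            echo "task $1"
            ;;
    esac
}
```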

DB Schema Changes

parent_task_id TEXT - links diagnostic subtask to parent
is_diagnostic INTEGER DEFAULT 0 - flags diagnostic tasks
heal_attempts INTEGER DEFAULT 0 - tracks heal attempts per task
Automatic migration for existing databases
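
One way to picture the migration, as a sketch rather than the PR's code: compare the columns already present in the tasks table against the three new ones and emit an ALTER TABLE for each that is missing. The function name and its input convention (a newline-separated list of column names) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print ALTER TABLE statements for any self-healing
# column missing from an existing tasks table.
# $1: newline-separated list of existing column names.
self_heal_migration_sql() {
    local existing="$1" spec name
    for spec in \
        "parent_task_id TEXT" \
        "is_diagnostic INTEGER DEFAULT 0" \
        "heal_attempts INTEGER DEFAULT 0"
    do
        name="${spec%% *}"   # column name is the first word of the spec
        if ! printf '%s\n' "$existing" | grep -qxF "$name"; then
            printf 'ALTER TABLE tasks ADD COLUMN %s;\n' "$spec"
        fi
    done
}
```

With the sqlite3 CLI, the column list could come from `SELECT name FROM pragma_table_info('tasks');` and the output could be piped back into sqlite3. Because only missing columns produce statements, running the sketch against an already-migrated database emits nothing.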

Testing

Verified with integration tests:

Full cycle: blocked task -> self-heal -> diagnostic created -> diagnostic completes -> parent re-queued
Eligibility: auth_error rejected, already-healed rejected, diagnostic tasks rejected
Status display: shows self-healing info, diagnostic subtasks, heal attempts

…on failure/block (t147.5, t150)

On task failure or block, the supervisor now auto-creates a diagnostic subtask (e.g., t001.diag) that investigates the root cause. When the diagnostic task completes, the parent task is re-queued with fresh retries.

Self-healing features:
- is_self_heal_eligible(): checks task type, heal attempts, failure type
- create_diagnostic_subtask(): creates diag task with failure context
- check_diagnostic_completions(): re-queues parents after diag success
- attempt_self_healing(): scans for healable tasks during pulse
- cmd_self_heal: manual trigger for specific task or batch

Eligibility rules prevent infinite loops:
- Diagnostic tasks cannot self-heal (no recursion)
- Max 1 heal attempt per task (configurable via SUPERVISOR_MAX_HEAL_ATTEMPTS)
- auth_error, out_of_memory, merge_conflict require human intervention

DB schema: adds parent_task_id, is_diagnostic, heal_attempts columns with automatic migration for existing databases.

Also triages PR #406 review threads (t147.5): all 3 threads already fixed in prior PRs. Replied with evidence on each thread.

coderabbitai · 2026-02-07T19:21:06Z

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 9 minutes and 33 seconds before requesting another review.

gemini-code-assist · 2026-02-07T19:21:15Z

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the supervisor's resilience by introducing an automated self-healing mechanism. It allows the system to proactively address and recover from task failures or blocks by spawning specialized diagnostic tasks. This automation reduces the need for manual intervention, streamlines task progression, and improves the overall robustness of the task management workflow.

Highlights

Automated Self-Healing for Tasks: The supervisor now automatically creates diagnostic subtasks for tasks that enter a 'blocked' or 'failed' state. These subtasks investigate the root cause and, upon successful completion, re-queue the original parent task with fresh retry attempts.
Eligibility Rules and Loop Prevention: Strict eligibility rules are in place to prevent infinite healing loops. Diagnostic tasks cannot self-heal, a maximum of one heal attempt is allowed per task (configurable), and certain failure types (e.g., authentication errors, out-of-memory, merge conflicts) are deemed ineligible as they require human intervention.
New self-heal Command and Pulse Integration: A new supervisor-helper.sh self-heal command has been introduced, allowing manual triggering of self-healing for specific tasks or entire batches. The automatic self-healing checks are integrated into the supervisor's pulse cycle (Phase 1b/1c).
Database Schema Enhancements: The tasks table in the database has been extended with new columns: parent_task_id to link diagnostic subtasks to their origin, is_diagnostic to flag diagnostic tasks, and heal_attempts to track healing efforts. An automatic migration ensures compatibility for existing databases.
Enhanced Status Reporting: The supervisor-helper.sh status command now includes self-healing statistics, showing active and completed diagnostic tasks, total tasks healed, and detailed self-healing information for individual tasks.

Changelog

.agents/scripts/supervisor-helper.sh
- Added self-heal command to the script's usage instructions and main command dispatcher.
- Updated internal documentation to reflect the new self-healing state transitions for tasks.
- Implemented database migration logic to add parent_task_id, is_diagnostic, and heal_attempts columns to the tasks table.
- Modified the CREATE TABLE IF NOT EXISTS tasks statement to include the new self-healing related columns and an index on parent_task_id.
- Enhanced cmd_status to display global self-healing statistics (active/completed diagnostics, total tasks healed) and detailed self-healing information for individual tasks.
- Integrated check_diagnostic_completions and attempt_self_healing functions into the cmd_pulse function for automatic execution.
- Introduced is_self_heal_eligible function to determine if a task meets criteria for self-healing.
- Added create_diagnostic_subtask function responsible for generating and queuing new diagnostic tasks.
- Implemented check_diagnostic_completions function to re-queue parent tasks after their diagnostic subtasks successfully complete.
- Developed attempt_self_healing function to scan for and initiate self-healing for eligible blocked/failed tasks.
- Created cmd_self_heal function to provide manual control over the self-healing process for tasks or batches.
- Updated help messages and environment variable descriptions to include SUPERVISOR_MAX_HEAL_ATTEMPTS.
TODO.md
- Marked task t147.5 as completed, noting that its review threads were addressed in prior PRs.
- Added and marked task t150 (supervisor self-healing) as completed, including detailed notes on its functionality, eligibility rules, and integration.

Activity

The pull request author, marcusquinn, has implemented a new self-healing feature for the supervisor system.
It addresses task t147.5 by confirming that previous review comments were already resolved in other PRs.
The main focus is on task t150, which introduces the core self-healing logic, including automatic diagnostic subtask creation, eligibility checks, and integration into the supervisor's pulse cycle.
The changes include significant updates to the supervisor-helper.sh script and database schema to support this new functionality.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a robust self-healing mechanism for the supervisor, allowing it to automatically create diagnostic subtasks for failed or blocked tasks. The implementation is comprehensive, covering eligibility checks, subtask creation, and re-queuing of the parent task upon successful diagnosis. The changes also include necessary database schema migrations, status reporting enhancements, and a manual trigger command. My review focuses on improving database query efficiency by combining multiple queries into single ones, addressing a bug that prevents multiple healing attempts, and improving error handling by logging errors instead of suppressing them, which aligns with the project's stated quality goals.
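
To illustrate the error-handling point in general terms: instead of discarding stderr (for example with a `2>/dev/null` redirect, which is an assumption about how errors were being suppressed), a wrapper can append it to a log file while preserving the exit status. The helper name and log path argument are hypothetical and do not appear in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and append its stderr to a log file
# rather than throwing it away. The command's exit status is preserved.
# Usage: run_logged <log_file> <command> [args...]
run_logged() {
    local log_file="$1"
    shift
    "$@" 2>>"$log_file"
}
```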

gemini-code-assist · 2026-02-07T19:23:44Z