feat(supervisor): self-healing diagnostic subtasks on failure/block (t147.5, t150) #464

Closed
marcusquinn wants to merge 1 commit into main from feature/t147.5

@marcusquinn
Owner

Summary

Self-Healing Design

When a task reaches blocked or failed state, the supervisor can now automatically create a diagnostic subtask (e.g., t001.diag) that:

  1. Inherits the parent's repo, model, and batch membership
  2. Includes failure context in its description (error, original task description, log path)
  3. Is dispatched like any other task in the batch
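
As a rough sketch of how the subtask id and its description could be assembled from the failure context listed above. The helper names, the description layout, and the sample error and log path are assumptions for illustration, not the code in this PR:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: names and output layout are assumptions.

# Derive the diagnostic subtask id from the parent id (t001 -> t001.diag).
diag_task_id() {
    local parent_id="$1"
    printf '%s.diag\n' "$parent_id"
}

# Compose a description carrying the failure context:
# error, original task description, and log path.
diag_description() {
    local parent_id="$1" error="$2" parent_desc="$3" log_path="$4"
    printf 'Diagnose failure of %s\n' "$parent_id"
    printf 'Error: %s\n' "$error"
    printf 'Original task: %s\n' "$parent_desc"
    printf 'Log: %s\n' "$log_path"
}

# Example with made-up values.
diag_task_id "t001"
diag_description "t001" "timeout" "Add retry logic" "/tmp/t001.log"
```

Keeping the id derivation in one place makes the "no duplicate diagnostics" check a simple lookup on the derived id.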

When the diagnostic task completes, the pulse cycle detects it and:

  1. Re-queues the parent task (blocked -> queued)
  2. Resets retry counter to 0 (fresh attempts)
  3. Stores the self-healing success in memory
  4. Sends a notification
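
The re-queue step might boil down to a single guarded UPDATE. The sketch below only prints the SQL rather than running it; the `retries` column name and the `sketch_sql_escape` helper are assumptions (the `tasks` table and the `status` and `id` columns appear in the reviewed queries):

```shell
#!/usr/bin/env bash
# Sketch only: prints SQL instead of executing it.
# "retries" is an assumed column name for the retry counter.

# Minimal stand-in for the script's sql_escape helper: doubles single quotes.
sketch_sql_escape() {
    local q="'"
    printf '%s' "${1//$q/$q$q}"
}

# Build the statement that moves a blocked parent back to queued
# and resets its retry counter.
requeue_parent_sql() {
    local parent_id
    parent_id="$(sketch_sql_escape "$1")"
    printf "UPDATE tasks SET status = 'queued', retries = 0 WHERE id = '%s' AND status = 'blocked';\n" "$parent_id"
}

requeue_parent_sql "t001"
```

Guarding on `status = 'blocked'` keeps the update a no-op if the parent was already moved by something else between pulses.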

Eligibility Rules (prevent infinite loops)

  • Diagnostic tasks cannot self-heal themselves
  • Max 1 heal attempt per task (configurable via SUPERVISOR_MAX_HEAL_ATTEMPTS)
  • auth_error, out_of_memory, merge_conflict are ineligible (require human intervention)
  • No duplicate diagnostics (skips if one is already active)
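
A minimal sketch of these rules as a pure function, assuming the caller has already looked up the task's flags. The argument order and the `skip:ineligible_failure_type` and `skip:diagnostic_already_active` reason strings are hypothetical; the other two reason strings are the ones used in the reviewed code:

```shell
#!/usr/bin/env bash
# Sketch only: argument layout and two of the reason strings are assumptions.
# Args: is_diagnostic (0/1), heal_attempts, failure_type, has_active_diag (0/1)
check_heal_eligibility() {
    local is_diag="$1" heal_attempts="$2" failure_type="$3" has_active_diag="$4"
    local max_attempts="${SUPERVISOR_MAX_HEAL_ATTEMPTS:-1}"

    # Diagnostic tasks never heal themselves (no recursion).
    if [[ "$is_diag" -eq 1 ]]; then
        echo "skip:is_diagnostic_task"; return 1
    fi
    # Respect the configured ceiling on heal attempts.
    if [[ "$heal_attempts" -ge "$max_attempts" ]]; then
        echo "skip:max_heal_attempts_reached"; return 1
    fi
    # Failure types that need a human.
    case "$failure_type" in
        auth_error|out_of_memory|merge_conflict)
            echo "skip:ineligible_failure_type"; return 1 ;;
    esac
    # Do not stack a second diagnostic on top of an active one.
    if [[ "$has_active_diag" -eq 1 ]]; then
        echo "skip:diagnostic_already_active"; return 1
    fi
    echo "eligible"
}

check_heal_eligibility 0 0 timeout 0
check_heal_eligibility 0 0 auth_error 0 || true
```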

New Commands

  • supervisor-helper.sh self-heal <task_id> - manually trigger for a specific task
  • supervisor-helper.sh self-heal --batch <id> - heal all eligible in a batch
  • Automatic: pulse cycle Phase 1b/1c runs self-healing checks each pulse
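
For illustration, the two manual forms could be told apart with a small argument parser like the one below. The function name and the `task:`/`batch:`/`error:` output tags are hypothetical and are not taken from cmd_self_heal:

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the two documented invocations
#   self-heal <task_id>
#   self-heal --batch <id>
parse_self_heal_args() {
    case "${1:-}" in
        --batch)
            # --batch requires an id as the next argument.
            if [[ -z "${2:-}" ]]; then
                echo "error:missing_batch_id"; return 1
            fi
            echo "batch:$2" ;;
        "")
            echo "error:missing_task_id"; return 1 ;;
        *)
            echo "task:$1" ;;
    esac
}

parse_self_heal_args t001
parse_self_heal_args --batch b7
```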

DB Schema Changes

  • parent_task_id TEXT - links diagnostic subtask to parent
  • is_diagnostic INTEGER DEFAULT 0 - flags diagnostic tasks
  • heal_attempts INTEGER DEFAULT 0 - tracks heal attempts per task
  • Automatic migration for existing databases
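
One way to keep such a migration idempotent is to emit ALTER statements only for columns that are missing. The sketch below takes the existing column names as a plain list (in SQLite these could come from `PRAGMA table_info(tasks)`; that the database is SQLite is an assumption here) and prints the statements instead of executing them. The function name is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: prints the ALTER statements a migration would need.
# Arg 1: space-separated list of columns that already exist.
missing_column_sql() {
    local existing=" $1 "
    local spec name
    for spec in \
        "parent_task_id TEXT" \
        "is_diagnostic INTEGER DEFAULT 0" \
        "heal_attempts INTEGER DEFAULT 0"; do
        name="${spec%% *}"   # first word of the spec is the column name
        if [[ "$existing" != *" $name "* ]]; then
            printf 'ALTER TABLE tasks ADD COLUMN %s;\n' "$spec"
        fi
    done
}

# A pre-migration table: all three statements are printed.
missing_column_sql "id status error"
```

Running it again with the new columns in the list prints nothing, which is what makes a second migration pass harmless.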

Testing

Verified with integration tests:

  • Full cycle: blocked task -> self-heal -> diagnostic created -> diagnostic completes -> parent re-queued
  • Eligibility: auth_error rejected, already-healed rejected, diagnostic tasks rejected
  • Status display: shows self-healing info, diagnostic subtasks, heal attempts

…on failure/block (t147.5, t150)

On task failure or block, the supervisor now auto-creates a diagnostic
subtask (e.g., t001.diag) that investigates the root cause. When the
diagnostic task completes, the parent task is re-queued with fresh retries.

Self-healing features:
- is_self_heal_eligible(): checks task type, heal attempts, failure type
- create_diagnostic_subtask(): creates diag task with failure context
- check_diagnostic_completions(): re-queues parents after diag success
- attempt_self_healing(): scans for healable tasks during pulse
- cmd_self_heal: manual trigger for specific task or batch

Eligibility rules prevent infinite loops:
- Diagnostic tasks cannot self-heal (no recursion)
- Max 1 heal attempt per task (configurable via SUPERVISOR_MAX_HEAL_ATTEMPTS)
- auth_error, out_of_memory, merge_conflict require human intervention

DB schema: adds parent_task_id, is_diagnostic, heal_attempts columns
with automatic migration for existing databases.

Also triages PR #406 review threads (t147.5): all 3 threads already
fixed in prior PRs. Replied with evidence on each thread.
@coderabbitai
Contributor

coderabbitai bot commented Feb 7, 2026

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 9 minutes and 33 seconds before requesting another review.

@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the supervisor's resilience by introducing an automated self-healing mechanism. It allows the system to proactively address and recover from task failures or blocks by spawning specialized diagnostic tasks. This automation reduces the need for manual intervention, streamlines task progression, and improves the overall robustness of the task management workflow.

Highlights

  • Automated Self-Healing for Tasks: The supervisor now automatically creates diagnostic subtasks for tasks that enter a 'blocked' or 'failed' state. These subtasks investigate the root cause and, upon successful completion, re-queue the original parent task with fresh retry attempts.
  • Eligibility Rules and Loop Prevention: Strict eligibility rules are in place to prevent infinite healing loops. Diagnostic tasks cannot self-heal, a maximum of one heal attempt is allowed per task (configurable), and certain failure types (e.g., authentication errors, out-of-memory, merge conflicts) are deemed ineligible as they require human intervention.
  • New self-heal Command and Pulse Integration: A new supervisor-helper.sh self-heal command has been introduced, allowing manual triggering of self-healing for specific tasks or entire batches. The automatic self-healing checks are integrated into the supervisor's pulse cycle (Phase 1b/1c).
  • Database Schema Enhancements: The tasks table in the database has been extended with new columns: parent_task_id to link diagnostic subtasks to their origin, is_diagnostic to flag diagnostic tasks, and heal_attempts to track healing efforts. An automatic migration ensures compatibility for existing databases.
  • Enhanced Status Reporting: The supervisor-helper.sh status command now includes self-healing statistics, showing active and completed diagnostic tasks, total tasks healed, and detailed self-healing information for individual tasks.

Changelog

  • .agents/scripts/supervisor-helper.sh
    • Added self-heal command to the script's usage instructions and main command dispatcher.
    • Updated internal documentation to reflect the new self-healing state transitions for tasks.
    • Implemented database migration logic to add parent_task_id, is_diagnostic, and heal_attempts columns to the tasks table.
    • Modified the CREATE TABLE IF NOT EXISTS tasks statement to include the new self-healing related columns and an index on parent_task_id.
    • Enhanced cmd_status to display global self-healing statistics (active/completed diagnostics, total tasks healed) and detailed self-healing information for individual tasks.
    • Integrated check_diagnostic_completions and attempt_self_healing functions into the cmd_pulse function for automatic execution.
    • Introduced is_self_heal_eligible function to determine if a task meets criteria for self-healing.
    • Added create_diagnostic_subtask function responsible for generating and queuing new diagnostic tasks.
    • Implemented check_diagnostic_completions function to re-queue parent tasks after their diagnostic subtasks successfully complete.
    • Developed attempt_self_healing function to scan for and initiate self-healing for eligible blocked/failed tasks.
    • Created cmd_self_heal function to provide manual control over the self-healing process for tasks or batches.
    • Updated help messages and environment variable descriptions to include SUPERVISOR_MAX_HEAL_ATTEMPTS.
  • TODO.md
    • Marked task t147.5 as completed, noting that its review threads were addressed in prior PRs.
    • Added and marked task t150 (supervisor self-healing) as completed, including detailed notes on its functionality, eligibility rules, and integration.

Activity

  • The pull request author, marcusquinn, has implemented a new self-healing feature for the supervisor system.
  • It addresses task t147.5 by confirming that previous review comments were already resolved in other PRs.
  • The main focus is on task t150, which introduces the core self-healing logic, including automatic diagnostic subtask creation, eligibility checks, and integration into the supervisor's pulse cycle.
  • The changes include significant updates to the supervisor-helper.sh script and database schema to support this new functionality.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust self-healing mechanism for the supervisor, allowing it to automatically create diagnostic subtasks for failed or blocked tasks. The implementation is comprehensive, covering eligibility checks, subtask creation, and re-queuing of the parent task upon successful diagnosis. The changes also include necessary database schema migrations, status reporting enhancements, and a manual trigger command. My review focuses on improving database query efficiency by combining multiple queries into single ones, addressing a bug that prevents multiple healing attempts, and improving error handling by logging errors instead of suppressing them, which aligns with the project's stated quality goals.

FROM tasks t
WHERE t.status IN ('blocked', 'failed')
AND t.is_diagnostic = 0
AND t.heal_attempts = 0

high

The database query to find healable tasks incorrectly hardcodes AND t.heal_attempts = 0. This prevents the self-healing mechanism from attempting to heal a task more than once, even if SUPERVISOR_MAX_HEAL_ATTEMPTS is configured to a value greater than 1. This contradicts the feature's documented behavior. The condition should be changed to respect the configured maximum number of heal attempts.

Suggested change
- AND t.heal_attempts = 0
+ AND t.heal_attempts < ${SUPERVISOR_MAX_HEAL_ATTEMPTS:-1}
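
Because the suggested clause interpolates an environment variable straight into SQL, it may be worth validating it first. A minimal sketch, with hypothetical function names, that falls back to the documented default of 1 when the value is not a plain non-negative integer:

```shell
#!/usr/bin/env bash
# Sketch only: function names are assumptions, not part of the PR.

# Return SUPERVISOR_MAX_HEAL_ATTEMPTS if it is a non-negative integer, else 1.
resolve_max_heal_attempts() {
    local raw="${SUPERVISOR_MAX_HEAL_ATTEMPTS:-1}"
    if [[ "$raw" =~ ^[0-9]+$ ]]; then
        echo "$raw"
    else
        echo "1"
    fi
}

# Build the filter clause from the validated value.
heal_filter_clause() {
    printf 'AND t.heal_attempts < %s\n' "$(resolve_max_heal_attempts)"
}

heal_filter_clause
```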

Comment on lines +1013 to +1015
task_parent=$(db "$SUPERVISOR_DB" "SELECT parent_task_id FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "")
task_is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0")
task_heal_attempts=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0")

medium

These three consecutive database calls fetch different columns from the same task row. They can be combined into a single, more efficient query to reduce database load.

Suggested change
- task_parent=$(db "$SUPERVISOR_DB" "SELECT parent_task_id FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "")
- task_is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0")
- task_heal_attempts=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0")
+ local task_info
+ task_info=$(db -separator '|' "$SUPERVISOR_DB" "SELECT ifnull(parent_task_id, ''), is_diagnostic, heal_attempts FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null)
+ IFS='|' read -r task_parent task_is_diag task_heal_attempts <<< "${task_info:-|0|0}"
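
To see how the pipe-separated row is split, here is a small standalone sketch with the query result passed in as a string. The wrapper function is hypothetical; the `read` line is the one from the suggestion. Because `|` is a non-whitespace separator, an empty leading field (a task with no parent) is preserved, and an empty result falls back to `|0|0`:

```shell
#!/usr/bin/env bash
# Sketch only: parse_task_info is a hypothetical wrapper around the
# suggested read line, fed with a literal row instead of a db call.
parse_task_info() {
    local task_info="$1"
    local task_parent task_is_diag task_heal_attempts
    IFS='|' read -r task_parent task_is_diag task_heal_attempts <<< "${task_info:-|0|0}"
    printf 'parent=%s diag=%s heals=%s\n' "$task_parent" "$task_is_diag" "$task_heal_attempts"
}

parse_task_info "t001|1|0"   # diagnostic subtask with a parent
parse_task_info "|0|2"       # ordinary task, no parent
parse_task_info ""           # query returned nothing: defaults apply
```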

Comment on lines +3585 to +3588
check_diagnostic_completions 2>/dev/null || true

# Phase 1c: Self-healing - create diagnostic subtasks for blocked/failed tasks (t150)
attempt_self_healing "${batch_id:-}" 2>/dev/null || true

medium

The calls to check_diagnostic_completions and attempt_self_healing suppress all stderr output using 2>/dev/null. This can hide important errors, making debugging difficult, and goes against the goal of t144 ("excessive 2>/dev/null suppresses real errors") mentioned in the PR description. Consider logging stderr to the supervisor log file so that errors are captured without interrupting the pulse cycle.

Suggested change
- check_diagnostic_completions 2>/dev/null || true
- # Phase 1c: Self-healing - create diagnostic subtasks for blocked/failed tasks (t150)
- attempt_self_healing "${batch_id:-}" 2>/dev/null || true
+ check_diagnostic_completions >> "$SUPERVISOR_LOG" 2>&1 || true
+ # Phase 1c: Self-healing - create diagnostic subtasks for blocked/failed tasks (t150)
+ attempt_self_healing "${batch_id:-}" >> "$SUPERVISOR_LOG" 2>&1 || true
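
The effect of the suggested redirection can be tried in isolation. In this sketch `run_phase`, `failing_phase`, and `SKETCH_LOG` are stand-ins invented for the demo: the phase fails and writes to stderr, the message lands in the log, and the caller still sees success:

```shell
#!/usr/bin/env bash
# Sketch only: names here are hypothetical stand-ins, not the script's own.
SKETCH_LOG="$(mktemp)"

# Simulates a pulse phase that fails and reports on stderr.
failing_phase() {
    echo "db locked" >&2
    return 1
}

# Append stdout and stderr to the log; never abort the caller.
run_phase() {
    "$@" >> "$SKETCH_LOG" 2>&1 || true
}

run_phase failing_phase
echo "pulse continues"
cat "$SKETCH_LOG"
```

Compared with `2>/dev/null`, the failure no longer interrupts the pulse but remains available for debugging.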

Comment on lines +416 to +418
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN parent_task_id TEXT;" 2>/dev/null || true
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN is_diagnostic INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN heal_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true

medium

The ALTER TABLE commands suppress all errors with 2>/dev/null. While the if check should prevent "duplicate column" errors, this blanket suppression can hide other critical issues, such as database permission errors or a locked database. It would be safer to remove the error suppression to ensure any unexpected migration failures are visible. The || true is sufficient to prevent the script from exiting if set -e is active.

Suggested change
- db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN parent_task_id TEXT;" 2>/dev/null || true
- db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN is_diagnostic INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
- db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN heal_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
+ db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN parent_task_id TEXT;" || true
+ db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN is_diagnostic INTEGER NOT NULL DEFAULT 0;" || true
+ db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN heal_attempts INTEGER NOT NULL DEFAULT 0;" || true

Comment on lines +4310 to +4323
local is_diag
is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
if [[ "$is_diag" -eq 1 ]]; then
    echo "skip:is_diagnostic_task"
    return 1
fi

# Check heal attempts
local heal_count
heal_count=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
if [[ "$heal_count" -ge "$max_heal_attempts" ]]; then
    echo "skip:max_heal_attempts_reached"
    return 1
fi

medium

The function makes multiple database calls to fetch is_diagnostic and heal_attempts for the same task. These can be combined into a single, more efficient query.

Suggested change
- local is_diag
- is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
- if [[ "$is_diag" -eq 1 ]]; then
-     echo "skip:is_diagnostic_task"
-     return 1
- fi
- # Check heal attempts
- local heal_count
- heal_count=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
- if [[ "$heal_count" -ge "$max_heal_attempts" ]]; then
-     echo "skip:max_heal_attempts_reached"
-     return 1
- fi
+ local is_diag heal_count
+ local task_info
+ task_info=$(db -separator '|' "$SUPERVISOR_DB" "SELECT is_diagnostic, heal_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null)
+ IFS='|' read -r is_diag heal_count <<< "${task_info:-0|0}"
+ # Check if this task is itself a diagnostic subtask
+ if [[ "${is_diag:-0}" -eq 1 ]]; then
+     echo "skip:is_diagnostic_task"
+     return 1
+ fi
+ # Check heal attempts
+ if [[ "${heal_count:-0}" -ge "$max_heal_attempts" ]]; then
+     echo "skip:max_heal_attempts_reached"
+     return 1
+ fi

Comment on lines +5137 to +5139
local tstatus terror
tstatus=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "")
terror=$(db "$SUPERVISOR_DB" "SELECT error FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "")

medium

These two consecutive database calls fetch status and error from the same task row. They can be combined into a single query for better performance.

Suggested change
- local tstatus terror
- tstatus=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "")
- terror=$(db "$SUPERVISOR_DB" "SELECT error FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "")
+ local tstatus terror task_info
+ task_info=$(db -separator '|' "$SUPERVISOR_DB" "SELECT status, ifnull(error, '') FROM tasks WHERE id = '$escaped_id';" 2>/dev/null)
+ IFS='|' read -r tstatus terror <<< "${task_info:-|}"

@marcusquinn
Owner Author

Closing: this work was completed by t150 (PR #462) which is already merged.

@marcusquinn marcusquinn closed this Feb 7, 2026
@marcusquinn marcusquinn deleted the feature/t147.5 branch February 7, 2026 20:22