feat(supervisor): self-healing diagnostic subtasks on failure/block (t147.5, t150)#464
feat(supervisor): self-healing diagnostic subtasks on failure/block (t147.5, t150)#464marcusquinn wants to merge 1 commit intomainfrom
Conversation
…on failure/block (t147.5, t150) On task failure or block, the supervisor now auto-creates a diagnostic subtask (e.g., t001.diag) that investigates the root cause. When the diagnostic task completes, the parent task is re-queued with fresh retries. Self-healing features: - is_self_heal_eligible(): checks task type, heal attempts, failure type - create_diagnostic_subtask(): creates diag task with failure context - check_diagnostic_completions(): re-queues parents after diag success - attempt_self_healing(): scans for healable tasks during pulse - cmd_self_heal: manual trigger for specific task or batch Eligibility rules prevent infinite loops: - Diagnostic tasks cannot self-heal (no recursion) - Max 1 heal attempt per task (configurable via SUPERVISOR_MAX_HEAL_ATTEMPTS) - auth_error, out_of_memory, merge_conflict require human intervention DB schema: adds parent_task_id, is_diagnostic, heal_attempts columns with automatic migration for existing databases. Also triages PR #406 review threads (t147.5): all 3 threads already fixed in prior PRs. Replied with evidence on each thread.
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the supervisor's resilience by introducing an automated self-healing mechanism. It allows the system to proactively address and recover from task failures or blocks by spawning specialized diagnostic tasks. This automation reduces the need for manual intervention, streamlines task progression, and improves the overall robustness of the task management workflow. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a robust self-healing mechanism for the supervisor, allowing it to automatically create diagnostic subtasks for failed or blocked tasks. The implementation is comprehensive, covering eligibility checks, subtask creation, and re-queuing of the parent task upon successful diagnosis. The changes also include necessary database schema migrations, status reporting enhancements, and a manual trigger command. My review focuses on improving database query efficiency by combining multiple queries into single ones, addressing a bug that prevents multiple healing attempts, and improving error handling by logging errors instead of suppressing them, which aligns with the project's stated quality goals.
| FROM tasks t | ||
| WHERE t.status IN ('blocked', 'failed') | ||
| AND t.is_diagnostic = 0 | ||
| AND t.heal_attempts = 0 |
There was a problem hiding this comment.
The database query to find healable tasks incorrectly hardcodes AND t.heal_attempts = 0. This prevents the self-healing mechanism from attempting to heal a task more than once, even if SUPERVISOR_MAX_HEAL_ATTEMPTS is configured to a value greater than 1. This contradicts the feature's documented behavior. The condition should be changed to respect the configured maximum number of heal attempts.
| AND t.heal_attempts = 0 | |
| AND t.heal_attempts < ${SUPERVISOR_MAX_HEAL_ATTEMPTS:-1} |
| task_parent=$(db "$SUPERVISOR_DB" "SELECT parent_task_id FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "") | ||
| task_is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0") | ||
| task_heal_attempts=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0") |
There was a problem hiding this comment.
These three consecutive database calls fetch different columns from the same task row. They can be combined into a single, more efficient query to reduce database load.
| task_parent=$(db "$SUPERVISOR_DB" "SELECT parent_task_id FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "") | |
| task_is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0") | |
| task_heal_attempts=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null || echo "0") | |
| local task_info | |
| task_info=$(db -separator '|' "$SUPERVISOR_DB" "SELECT ifnull(parent_task_id, ''), is_diagnostic, heal_attempts FROM tasks WHERE id = '$(sql_escape "$target")';" 2>/dev/null) | |
| IFS='|' read -r task_parent task_is_diag task_heal_attempts <<< "${task_info:-|0|0}" |
| check_diagnostic_completions 2>/dev/null || true | ||
|
|
||
| # Phase 1c: Self-healing - create diagnostic subtasks for blocked/failed tasks (t150) | ||
| attempt_self_healing "${batch_id:-}" 2>/dev/null || true |
There was a problem hiding this comment.
The calls to check_diagnostic_completions and attempt_self_healing suppress all stderr output using 2>/dev/null. This can hide important errors, making debugging difficult, and goes against the goal of t144 ("excessive 2>/dev/null suppresses real errors") mentioned in the PR description. Consider logging stderr to the supervisor log file so that errors are captured without interrupting the pulse cycle.
| check_diagnostic_completions 2>/dev/null || true | |
| # Phase 1c: Self-healing - create diagnostic subtasks for blocked/failed tasks (t150) | |
| attempt_self_healing "${batch_id:-}" 2>/dev/null || true | |
| check_diagnostic_completions >> "$SUPERVISOR_LOG" 2>&1 || true | |
| # Phase 1c: Self-healing - create diagnostic subtasks for blocked/failed tasks (t150) | |
| attempt_self_healing "${batch_id:-}" >> "$SUPERVISOR_LOG" 2>&1 || true |
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN parent_task_id TEXT;" 2>/dev/null || true | ||
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN is_diagnostic INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true | ||
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN heal_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true |
There was a problem hiding this comment.
The ALTER TABLE commands suppress all errors with 2>/dev/null. While the if check should prevent "duplicate column" errors, this blanket suppression can hide other critical issues, such as database permission errors or a locked database. It would be safer to remove the error suppression to ensure any unexpected migration failures are visible. The || true is sufficient to prevent the script from exiting if set -e is active.
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN parent_task_id TEXT;" 2>/dev/null || true | |
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN is_diagnostic INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true | |
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN heal_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true | |
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN parent_task_id TEXT;" || true | |
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN is_diagnostic INTEGER NOT NULL DEFAULT 0;" || true | |
| db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN heal_attempts INTEGER NOT NULL DEFAULT 0;" || true |
| local is_diag | ||
| is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0") | ||
| if [[ "$is_diag" -eq 1 ]]; then | ||
| echo "skip:is_diagnostic_task" | ||
| return 1 | ||
| fi | ||
|
|
||
| # Check heal attempts | ||
| local heal_count | ||
| heal_count=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0") | ||
| if [[ "$heal_count" -ge "$max_heal_attempts" ]]; then | ||
| echo "skip:max_heal_attempts_reached" | ||
| return 1 | ||
| fi |
There was a problem hiding this comment.
The function makes multiple database calls to fetch is_diagnostic and heal_attempts for the same task. These can be combined into a single, more efficient query.
| local is_diag | |
| is_diag=$(db "$SUPERVISOR_DB" "SELECT is_diagnostic FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0") | |
| if [[ "$is_diag" -eq 1 ]]; then | |
| echo "skip:is_diagnostic_task" | |
| return 1 | |
| fi | |
| # Check heal attempts | |
| local heal_count | |
| heal_count=$(db "$SUPERVISOR_DB" "SELECT heal_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0") | |
| if [[ "$heal_count" -ge "$max_heal_attempts" ]]; then | |
| echo "skip:max_heal_attempts_reached" | |
| return 1 | |
| fi | |
| local is_diag heal_count | |
| local task_info | |
| task_info=$(db -separator '|' "$SUPERVISOR_DB" "SELECT is_diagnostic, heal_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null) | |
| IFS='|' read -r is_diag heal_count <<< "${task_info:-0|0}" | |
| # Check if this task is itself a diagnostic subtask | |
| if [[ "${is_diag:-0}" -eq 1 ]]; then | |
| echo "skip:is_diagnostic_task" | |
| return 1 | |
| fi | |
| # Check heal attempts | |
| if [[ "${heal_count:-0}" -ge "$max_heal_attempts" ]]; then | |
| echo "skip:max_heal_attempts_reached" | |
| return 1 | |
| fi |
| local tstatus terror | ||
| tstatus=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "") | ||
| terror=$(db "$SUPERVISOR_DB" "SELECT error FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "") |
There was a problem hiding this comment.
These two consecutive database calls fetch status and error from the same task row. They can be combined into a single query for better performance.
| local tstatus terror | |
| tstatus=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "") | |
| terror=$(db "$SUPERVISOR_DB" "SELECT error FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "") | |
| local tstatus terror task_info | |
| task_info=$(db -separator '|' "$SUPERVISOR_DB" "SELECT status, ifnull(error, '') FROM tasks WHERE id = '$escaped_id';" 2>/dev/null) | |
| IFS='|' read -r tstatus terror <<< "${task_info:-|}" |
|
Closing: this work was completed by t150 (PR #462) which is already merged. |
Summary
Self-Healing Design
When a task reaches
blockedorfailedstate, the supervisor can now automatically create a diagnostic subtask (e.g.,t001.diag) that:When the diagnostic task completes, the pulse cycle detects it and:
blocked -> queued)Eligibility Rules (prevent infinite loops)
SUPERVISOR_MAX_HEAL_ATTEMPTS)auth_error,out_of_memory,merge_conflictare ineligible (require human intervention)New Commands
supervisor-helper.sh self-heal <task_id>- manually trigger for a specific tasksupervisor-helper.sh self-heal --batch <id>- heal all eligible in a batchDB Schema Changes
parent_task_id TEXT- links diagnostic subtask to parentis_diagnostic INTEGER DEFAULT 0- flags diagnostic tasksheal_attempts INTEGER DEFAULT 0- tracks heal attempts per taskTesting
Verified with integration tests: