fix(supervisor): prevent worker death on cron exit and fix dispatch deadlock #431

marcusquinn · 2026-02-07T14:38:33Z

Summary

Workers launched by cron pulse were dying after ~1-2 minutes because background subshells get killed when the parent cron script exits. Fixed with nohup + disown.
Dispatch loop had SC2319 bug: $? after if cmd_dispatch captured the if-condition exit code, not the actual dispatch exit code. Refactored to cmd || exit=$? pattern.
Added pulse-level health check flag so the 8-second health probe only runs once per pulse invocation, not once per task.
Dispatch failures are now logged instead of swallowed by 2>/dev/null.
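
The PR does not show the original loop, so the following is only a minimal sketch of how `$?` can end up reflecting an if-condition rather than the command itself, and of the `cmd || exit=$?` pattern named above. `fake_dispatch` is a hypothetical stand-in for `cmd_dispatch` that always returns 2.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for cmd_dispatch: always fails with status 2.
fake_dispatch() { return 2; }

# Pitfall A: inside `if ! cmd`, $? holds the negated condition status (0),
# not the 2 that the command returned.
negated_rc=""
if ! fake_dispatch; then
    negated_rc=$?
fi

# Pitfall B: after `fi`, $? is the status of the whole if statement,
# which is 0 when no branch ran.
if fake_dispatch; then
    :
fi
after_fi_rc=$?

# Pattern from the summary: capture the status directly with cmd || var=$?
dispatch_exit=0
fake_dispatch || dispatch_exit=$?

echo "negated=$negated_rc after_fi=$after_fi_rc captured=$dispatch_exit"
```

Only the last form preserves the real exit status, which is what lets the loop tell a concurrency limit apart from other failures.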

Before: 23 queued, 0 running, 0 dispatched (complete deadlock)
After: 21 queued, 3 running, 8 deployed

Resolves dispatch deadlock in batch quality-hardening-8h.

Summary by CodeRabbit

Chores
- Internal infrastructure optimizations to improve system reliability and performance under scheduled execution environments.

…eadlock

- Use nohup + disown for worker processes so they survive parent (cron) exit
- Fix SC2319: capture cmd_dispatch exit code correctly (was using $? after if)
- Add pulse-level health check caching to avoid 8s probe per task per pulse
- Log dispatch failures instead of swallowing with 2>/dev/null
- Apply same nohup fix to cmd_reprompt worker launch

Root cause: workers launched as background subshells were killed when the 2-minute cron pulse script exited, causing all tasks to die at ~1-2 minutes. The dispatch loop also had a bug where $? captured the if-condition exit code instead of cmd_dispatch's actual exit code.

gemini-code-assist · 2026-02-07T14:38:50Z

Summary of Changes

This pull request addresses critical stability and performance issues within the supervisor script. It resolves a deadlock in the task dispatching mechanism and prevents premature termination of worker processes launched by cron. Additionally, it optimizes model health checks to reduce overhead and enhances error logging for better debugging. These changes significantly improve the reliability and efficiency of task execution, as demonstrated by the transition from a complete deadlock to successful task deployment.

Highlights

Worker Process Longevity: Implemented nohup and disown for background worker processes launched by cron, preventing them from being terminated prematurely when the parent cron script exits.
Dispatch Deadlock Resolution: Corrected a shell scripting bug (SC2319) where the exit code of cmd_dispatch was not correctly captured, leading to a dispatch deadlock. The logic now accurately captures and handles dispatch outcomes, including specific logging for concurrency limits and unavailable providers.
Optimized Health Checks: Introduced a pulse-level health verification flag (_PULSE_HEALTH_VERIFIED) to ensure that the 8-second model health probe runs only once per supervisor pulse invocation, rather than for every individual task, significantly reducing overhead.
Improved Error Logging: Dispatch failures are no longer silently suppressed (2>/dev/null) and are now explicitly logged, providing better visibility into issues and aiding in debugging.
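
As a rough illustration of the pulse-level flag described in the highlights, the sketch below counts how often a probe runs across several tasks in one pulse. Only `_PULSE_HEALTH_VERIFIED` and `check_model_health` are names from the PR; `probe_model`, `probe_count`, and the task list are hypothetical, and setting the flag only on success is an assumption about the real behavior.

```shell
#!/usr/bin/env bash
# Sketch only: probe_model and probe_count are hypothetical stand-ins
# for the real 8-second health probe.
_PULSE_HEALTH_VERIFIED=""
probe_count=0

probe_model() {
    probe_count=$((probe_count + 1))   # the real probe would take ~8s here
    return 0
}

check_model_health() {
    # Skip the probe if it already succeeded during this pulse.
    if [[ -n "$_PULSE_HEALTH_VERIFIED" ]]; then
        return 0
    fi
    probe_model || return 1
    _PULSE_HEALTH_VERIFIED="true"      # assumption: set only on success
    return 0
}

# One pulse: reset the flag, then check health once per task.
_PULSE_HEALTH_VERIFIED=""
for task in t1 t2 t3; do
    check_model_health || break
done
echo "probes run: $probe_count"
```

Because the functions run in the current shell rather than a subshell, the flag set by the first call is visible to the later ones, so three tasks trigger a single probe.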

Changelog

.agents/scripts/supervisor-helper.sh
- Implemented nohup and disown for background worker processes to ensure they persist after the parent cron script exits.
- Refactored cmd_dispatch calls to correctly capture and handle its exit status, resolving a dispatch deadlock.
- Introduced _PULSE_HEALTH_VERIFIED flag to optimize model health checks, preventing redundant 8-second probes within a single pulse.
- Removed 2>/dev/null from cmd_dispatch calls, enabling logging of dispatch failures.
- Added specific log messages for concurrency limits, unavailable providers, and general dispatch failures.
TODO.md
- Updated the BLOCKED status for task t140 to include ambiguous_ai_unavailable, reflecting a change in dispatch failure reasons.

Activity

No human activity has been recorded for this pull request yet.

coderabbitai · 2026-02-07T14:38:51Z

Caution

Review failed

The pull request is closed.

Walkthrough

This PR optimizes supervisor health checks using a pulse-scoped cache flag to prevent redundant probes within single pulses, replaces direct background task invocation with nohup and disown for robust process lifecycle management, and introduces explicit exit-code handling in dispatch loops to properly respond to concurrency limits and provider unavailability.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Supervisor health & dispatch optimization `.agents/scripts/supervisor-helper.sh` | Introduces `_PULSE_HEALTH_VERIFIED` flag for per-pulse health check caching to eliminate redundant probes; replaces direct background task invocation with `nohup bash -c ...` + `disown` for robust PID handling; adds explicit exit-code handling in dispatch loops (0=success, 2=concurrency limit, 3=provider unavailable) with appropriate control flow; resets cache flag at pulse initialization. |
| Backlog documentation update `TODO.md` | Appends additional BLOCKED reason to item t140 noting an observed error mode `ambiguous_ai_unavailable` alongside existing `backend_infrastructure_error`. |
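
The exit-code handling summarized above (0=success, 2=concurrency limit, 3=provider unavailable) could look roughly like the loop below. This is a sketch under assumptions, not the code from `supervisor-helper.sh`: `cmd_dispatch_stub`, the task names, and the `dispatched` and `stop_reason` variables are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stub: task "t3" hits the concurrency limit (status 2),
# every other task dispatches successfully.
cmd_dispatch_stub() {
    case "$1" in
        t3) return 2 ;;
        *)  return 0 ;;
    esac
}

dispatched=0
stop_reason=""
for task in t1 t2 t3 t4; do
    dispatch_exit=0
    cmd_dispatch_stub "$task" || dispatch_exit=$?
    case "$dispatch_exit" in
        0) dispatched=$((dispatched + 1)) ;;
        2) stop_reason="concurrency limit"; break ;;
        3) stop_reason="provider unavailable"; break ;;
        *) echo "dispatch failed for $task (exit $dispatch_exit)" >&2 ;;
    esac
done
echo "dispatched=$dispatched stop_reason=$stop_reason"
```

Stopping on 2 and 3 avoids retrying every remaining task when the limit or the provider would reject them all, while other failures are logged to stderr and the loop moves on.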

Sequence Diagram(s)

sequenceDiagram
    participant Pulse as Pulse Executor
    participant Health as Health Checker
    participant Dispatch as Dispatch Handler
    participant Worker as Background Worker
    participant Provider as Model Provider

    Pulse->>Pulse: Reset _PULSE_HEALTH_VERIFIED flag
    Pulse->>Health: check_model_health()
    
    alt _PULSE_HEALTH_VERIFIED set
        Health-->>Pulse: Return success (cached)
    else Health not verified
        Health->>Health: Check cache (8-second probe)
        alt Cache miss or stale
            Health->>Provider: Probe health endpoint
            Provider-->>Health: Health status
        end
        Health->>Health: Set _PULSE_HEALTH_VERIFIED
        Health-->>Pulse: Return success/failure
    end
    
    alt Health check passed
        Pulse->>Dispatch: Execute dispatch loop
        Dispatch->>Dispatch: Start nohup bash -c worker command
        Dispatch->>Dispatch: Capture dispatch_exit code
        
        alt dispatch_exit == 0
            Dispatch->>Dispatch: Success, continue
        else dispatch_exit == 2
            Dispatch->>Dispatch: Concurrency limit, stop
        else dispatch_exit == 3
            Dispatch->>Dispatch: Provider unavailable, stop
        else other
            Dispatch->>Dispatch: Dispatch failure, log & handle
        end
        
        Dispatch->>Worker: disown process (background survival)
        Worker-->>Pulse: Process runs independently
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

fix: supervisor self-healing -- macOS timeout, PR detection, model names, stale PID cleanup #429: Modifies supervisor health-probing and background worker lifecycle management at code level, directly overlapping with health check caching and PID handling changes.
feat: add worker dispatch with worktree isolation to supervisor (t128.2) #377: Extends background worker dispatch/pulse logic with PID handling and backgrounding behavior, sharing dispatch and pulse control flow modifications.
fix: supervisor integration testing fixes (t128.7) #384: Directly modifies check_model_health behavior and dispatch/background worker startup with PID management, closely aligned with this PR's health check and process lifecycle changes.

Poem

🧬 Pulse-Scoped Serenity

Health checks cached within each beat,
No probes repeated, crisp and sweet.
Background workers now disown with grace,
Exit codes guide the dispatch race.
Cron survives what once would fall—
Robust and ready, standing tall. 🚀

github-actions · 2026-02-07T14:39:09Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 25 code smells

[INFO] Recent monitoring activity:
Sat Feb 7 14:39:02 UTC 2026: Code review monitoring started
Sat Feb 7 14:39:03 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 25
Sat Feb 7 14:39:03 UTC 2026: Qlty - 0 issues found, auto-formatting applied
Sat Feb 7 14:39:05 UTC 2026: Codacy analysis completed with auto-fixes

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 25
VULNERABILITIES: 0

Generated on: Sat Feb 7 14:39:08 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-07T14:39:35Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gemini-code-assist

Code Review

This pull request introduces several important fixes and improvements to the supervisor script. The use of nohup and disown to prevent worker processes from dying when launched from cron is a solid solution. The refactoring of the dispatch loop to correctly capture exit codes and improve logging is a significant enhancement to the script's robustness and debuggability. Additionally, the introduction of a pulse-level health check flag is a smart optimization that will reduce unnecessary overhead. The code changes are well-implemented, and there is one suggestion regarding code duplication to improve maintainability.

gemini-code-assist · 2026-02-07T14:41:44Z

.agents/scripts/supervisor-helper.sh

        # Also start background process as fallback (Tabby may not support OSC 1337)
-        (cd "$worktree_path" && "${cmd_parts[@]}" > "$log_file" 2>&1; echo "EXIT:$?" >> "$log_file") &
+        # Use nohup + disown to survive parent (cron) exit
+        nohup bash -c "cd '${worktree_path}' && $(printf '%q ' "${cmd_parts[@]}") > '${log_file}' 2>&1; echo \"EXIT:\$?\" >> '${log_file}'" &>/dev/null &
    else
        # Headless: background process
-        (cd "$worktree_path" && "${cmd_parts[@]}" > "$log_file" 2>&1; echo "EXIT:$?" >> "$log_file") &
+        # Use nohup + disown to survive parent (cron) exit — without this,
+        # workers die after ~2 minutes when the cron pulse script exits
+        nohup bash -c "cd '${worktree_path}' && $(printf '%q ' "${cmd_parts[@]}") > '${log_file}' 2>&1; echo \"EXIT:\$?\" >> '${log_file}'" &>/dev/null &
    fi


The command to launch the background worker is duplicated in both the if (line 1971) and else (line 1976) branches. A very similar command is also used in the cmd_reprompt function (line 2560). This makes it harder to maintain if the dispatch logic needs to be changed in the future.

Consider extracting this complex command into a local helper function to avoid repetition and improve readability.

For example, you could create a helper function like this:

    _dispatch_background_worker() {
        local work_path="$1"
        local log_path="$2"
        shift 2
        local -a command_parts=("$@")
        nohup bash -c "cd '${work_path}' && $(printf '%q ' "${command_parts[@]}") > '${log_path}' 2>&1; echo \"EXIT:\$?\" >> '${log_path}'" &>/dev/null &
    }

And then call it from both branches, which would simplify the main function's logic.
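
To make the suggestion concrete, here is a small self-contained sketch of such a helper, with a demo run that checks the `EXIT:` marker. `launch_detached`, `demo_dir`, and `demo_log` are hypothetical names, and the explicit `disown` plus `%q`-quoting of the paths are assumptions layered on top of the reviewed diff, not code from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: names and arguments are illustrative only.
launch_detached() {
    local work_dir="$1" log_file="$2"
    shift 2
    local quoted_cmd quoted_dir quoted_log
    quoted_cmd=$(printf '%q ' "$@")        # re-quote each argument for bash -c
    quoted_dir=$(printf '%q' "$work_dir")
    quoted_log=$(printf '%q' "$log_file")
    nohup bash -c "cd ${quoted_dir} && ${quoted_cmd} > ${quoted_log} 2>&1; echo \"EXIT:\$?\" >> ${quoted_log}" &>/dev/null &
    # Drop the job from this shell's job table so it is not signalled on exit.
    disown "$!" 2>/dev/null || true
}

demo_dir=$(mktemp -d)
demo_log="$demo_dir/worker.log"
launch_detached "$demo_dir" "$demo_log" bash -c 'echo "hello world"; exit 7'

# Poll briefly for the exit marker instead of waiting on the detached job.
for _ in {1..50}; do
    grep -q '^EXIT:' "$demo_log" 2>/dev/null && break
    sleep 0.1
done
cat "$demo_log"
```

Quoting the paths with `printf '%q'` rather than wrapping them in single quotes is a design choice of this sketch: it keeps the command intact even if a path contains a single quote.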

marcusquinn added 3 commits February 7, 2026 06:56

chore: mark t140 blocked in TODO.md

3d19740

Merge branch 'main' of https://github.com/marcusquinn/aidevops

d7ed7e6

marcusquinn merged commit 5901e1d into main Feb 7, 2026
5 of 6 checks passed

gemini-code-assist bot reviewed Feb 7, 2026

This was referenced Feb 7, 2026

fix(supervisor): health check 503 false positive on JSON timestamps #488

Merged

fix: suppress stdout pollution in create_task_worktree (t169, t173) #643

Merged
