-
Notifications
You must be signed in to change notification settings - Fork 5
fix(supervisor): prevent worker death on cron exit and fix dispatch deadlock #431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…eadlock - Use nohup + disown for worker processes so they survive parent (cron) exit - Fix SC2319: capture cmd_dispatch exit code correctly (was using $? after if) - Add pulse-level health check caching to avoid 8s probe per task per pulse - Log dispatch failures instead of swallowing with 2>/dev/null - Apply same nohup fix to cmd_reprompt worker launch Root cause: workers launched as background subshells were killed when the 2-minute cron pulse script exited, causing all tasks to die at ~1-2 minutes. The dispatch loop also had a bug where $? captured the if-condition exit code instead of cmd_dispatch's actual exit code.
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses critical stability and performance issues within the supervisor script. It resolves a deadlock in the task dispatching mechanism and prevents premature termination of worker processes launched by cron. Additionally, it optimizes model health checks to reduce overhead and enhances error logging for better debugging. These changes significantly improve the reliability and efficiency of task execution, as demonstrated by the transition from a complete deadlock to successful task deployment. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Caution Review failedThe pull request is closed. WalkthroughThis PR optimizes supervisor health checks using a pulse-scoped cache flag to prevent redundant probes within single pulses, replaces direct background task invocation with Changes
Sequence Diagram(s)sequenceDiagram
participant Pulse as Pulse Executor
participant Health as Health Checker
participant Dispatch as Dispatch Handler
participant Worker as Background Worker
participant Provider as Model Provider
Pulse->>Pulse: Reset _PULSE_HEALTH_VERIFIED flag
Pulse->>Health: check_model_health()
alt _PULSE_HEALTH_VERIFIED set
Health-->>Pulse: Return success (cached)
else Health not verified
Health->>Health: Check cache (8-second probe)
alt Cache miss or stale
Health->>Provider: Probe health endpoint
Provider-->>Health: Health status
end
Health->>Health: Set _PULSE_HEALTH_VERIFIED
Health-->>Pulse: Return success/failure
end
alt Health check passed
Pulse->>Dispatch: Execute dispatch loop
Dispatch->>Dispatch: Start nohup bash -c worker command
Dispatch->>Dispatch: Capture dispatch_exit code
alt dispatch_exit == 0
Dispatch->>Dispatch: Success, continue
else dispatch_exit == 2
Dispatch->>Dispatch: Concurrency limit, stop
else dispatch_exit == 3
Dispatch->>Dispatch: Provider unavailable, stop
else other
Dispatch->>Dispatch: Dispatch failure, log & handle
end
Dispatch->>Worker: disown process (background survival)
Worker-->>Pulse: Process runs independently
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Sat Feb 7 14:39:08 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces several important fixes and improvements to the supervisor script. The use of nohup and disown to prevent worker processes from dying when launched from cron is a solid solution. The refactoring of the dispatch loop to correctly capture exit codes and improve logging is a significant enhancement to the script's robustness and debuggability. Additionally, the introduction of a pulse-level health check flag is a smart optimization that will reduce unnecessary overhead. The code changes are well-implemented, and there is one suggestion regarding code duplication to improve maintainability.
| # Also start background process as fallback (Tabby may not support OSC 1337) | ||
| (cd "$worktree_path" && "${cmd_parts[@]}" > "$log_file" 2>&1; echo "EXIT:$?" >> "$log_file") & | ||
| # Use nohup + disown to survive parent (cron) exit | ||
| nohup bash -c "cd '${worktree_path}' && $(printf '%q ' "${cmd_parts[@]}") > '${log_file}' 2>&1; echo \"EXIT:\$?\" >> '${log_file}'" &>/dev/null & | ||
| else | ||
| # Headless: background process | ||
| (cd "$worktree_path" && "${cmd_parts[@]}" > "$log_file" 2>&1; echo "EXIT:$?" >> "$log_file") & | ||
| # Use nohup + disown to survive parent (cron) exit — without this, | ||
| # workers die after ~2 minutes when the cron pulse script exits | ||
| nohup bash -c "cd '${worktree_path}' && $(printf '%q ' "${cmd_parts[@]}") > '${log_file}' 2>&1; echo \"EXIT:\$?\" >> '${log_file}'" &>/dev/null & | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command to launch the background worker is duplicated in both the if (line 1971) and else (line 1976) branches. A very similar command is also used in the cmd_reprompt function (line 2560). This makes it harder to maintain if the dispatch logic needs to be changed in the future.
Consider extracting this complex command into a local helper function to avoid repetition and improve readability.
For example, you could create a helper function like this:
_dispatch_background_worker() {
local work_path="$1"
local log_path="$2"
shift 2
local -a command_parts=("$@")
nohup bash -c "cd '${work_path}' && $(printf '%q ' "${command_parts[@]}") > '${log_path}' 2>&1; echo \"EXIT:\$?\" >> '${log_path}'" &>/dev/null &
}And then call it from both branches, which would simplify the main function's logic.



Summary
nohup+disown.$?afterif cmd_dispatchcaptured the if-condition exit code, not the actual dispatch exit code. Refactored tocmd || exit=$?pattern.2>/dev/null.Before: 23 queued, 0 running, 0 dispatched (complete deadlock)
After: 21 queued, 3 running, 8 deployed
Resolves dispatch deadlock in batch
quality-hardening-8h.Summary by CodeRabbit