-
Notifications
You must be signed in to change notification settings - Fork 6
fix(supervisor): auto-recover deploying stuck state when deploy completes #948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -7222,6 +7222,49 @@ cmd_pr_lifecycle() { | |||||||||||||||||||||||||||||
| stage_timings="${stage_timings}deploying:$((stage_end - stage_start))s," | ||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # Step 4b: Auto-recover stuck deploying state (t222) | ||||||||||||||||||||||||||||||
| # If a task is already in 'deploying' (from a prior pulse where the deploy | ||||||||||||||||||||||||||||||
| # succeeded but the transition to 'deployed' failed), re-attempt the | ||||||||||||||||||||||||||||||
| # transition and housekeeping steps. The deploy itself already completed | ||||||||||||||||||||||||||||||
| # successfully — only the state transition was lost. | ||||||||||||||||||||||||||||||
| if [[ "$tstatus" == "deploying" ]]; then | ||||||||||||||||||||||||||||||
| local stage_start | ||||||||||||||||||||||||||||||
| stage_start=$(date +%s) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| log_warn "Task $task_id stuck in deploying state — attempting auto-recovery (t222)" | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| if [[ "$dry_run" == "false" ]]; then | ||||||||||||||||||||||||||||||
| # Re-run housekeeping that may have been skipped when the prior | ||||||||||||||||||||||||||||||
| # transition failed (all non-blocking, best-effort) | ||||||||||||||||||||||||||||||
| cleanup_after_merge "$task_id" 2>>"$SUPERVISOR_LOG" || log_warn "Worktree cleanup issue for $task_id during recovery (non-blocking)" | ||||||||||||||||||||||||||||||
| update_todo_on_complete "$task_id" 2>>"$SUPERVISOR_LOG" || log_warn "TODO.md update issue for $task_id during recovery (non-blocking)" | ||||||||||||||||||||||||||||||
| populate_verify_queue "$task_id" "$tpr" "$trepo" 2>>"$SUPERVISOR_LOG" || log_warn "Verify queue population issue for $task_id during recovery (non-blocking)" | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # Attempt the transition that previously failed | ||||||||||||||||||||||||||||||
| if cmd_transition "$task_id" "deployed" 2>>"$SUPERVISOR_LOG"; then | ||||||||||||||||||||||||||||||
| log_success "Auto-recovered $task_id: deploying -> deployed (t222)" | ||||||||||||||||||||||||||||||
| send_task_notification "$task_id" "deployed" "Auto-recovered from stuck deploying state" 2>>"$SUPERVISOR_LOG" || true | ||||||||||||||||||||||||||||||
| store_success_pattern "$task_id" "deployed" "" 2>>"$SUPERVISOR_LOG" || true | ||||||||||||||||||||||||||||||
| write_proof_log --task "$task_id" --event "auto_recover" --stage "deploying" \ | ||||||||||||||||||||||||||||||
| --decision "deploying->deployed" --evidence "stuck_state_recovery" \ | ||||||||||||||||||||||||||||||
| --maker "pr_lifecycle:t222" 2>/dev/null || true | ||||||||||||||||||||||||||||||
| else | ||||||||||||||||||||||||||||||
| log_error "Auto-recovery failed for $task_id — transition to deployed rejected" | ||||||||||||||||||||||||||||||
| # If the transition itself is invalid, something is deeply wrong. | ||||||||||||||||||||||||||||||
| # Transition to failed so the task doesn't stay stuck forever. | ||||||||||||||||||||||||||||||
| cmd_transition "$task_id" "failed" --error "Auto-recovery failed: deploying->deployed transition rejected (t222)" 2>>"$SUPERVISOR_LOG" || true | ||||||||||||||||||||||||||||||
| send_task_notification "$task_id" "failed" "Stuck in deploying, auto-recovery failed" 2>>"$SUPERVISOR_LOG" || true | ||||||||||||||||||||||||||||||
|
Comment on lines
+7244
to
+7256
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Avoid failing tasks when recovery races with a concurrent transition. Line 7244: If Suggested fix- if cmd_transition "$task_id" "deployed" 2>>"$SUPERVISOR_LOG"; then
+ if cmd_transition "$task_id" "deployed" 2>>"$SUPERVISOR_LOG"; then
log_success "Auto-recovered $task_id: deploying -> deployed (t222)"
send_task_notification "$task_id" "deployed" "Auto-recovered from stuck deploying state" 2>>"$SUPERVISOR_LOG" || true
store_success_pattern "$task_id" "deployed" "" 2>>"$SUPERVISOR_LOG" || true
write_proof_log --task "$task_id" --event "auto_recover" --stage "deploying" \
--decision "deploying->deployed" --evidence "stuck_state_recovery" \
--maker "pr_lifecycle:t222" 2>/dev/null || true
else
- log_error "Auto-recovery failed for $task_id — transition to deployed rejected"
- # If the transition itself is invalid, something is deeply wrong.
- # Transition to failed so the task doesn't stay stuck forever.
- cmd_transition "$task_id" "failed" --error "Auto-recovery failed: deploying->deployed transition rejected (t222)" 2>>"$SUPERVISOR_LOG" || true
- send_task_notification "$task_id" "failed" "Stuck in deploying, auto-recovery failed" 2>>"$SUPERVISOR_LOG" || true
+ # Re-check state to avoid clobbering a concurrent transition
+ local current_state
+ current_state=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || echo "")
+ if [[ "$current_state" == "deploying" ]]; then
+ log_error "Auto-recovery failed for $task_id — transition to deployed rejected"
+ cmd_transition "$task_id" "failed" --error "Auto-recovery failed: deploying->deployed transition rejected (t222)" 2>>"$SUPERVISOR_LOG" || true
+ send_task_notification "$task_id" "failed" "Stuck in deploying, auto-recovery failed" 2>>"$SUPERVISOR_LOG" || true
+ else
+ log_info "Auto-recovery skipped: $task_id already $current_state"
+ fi
fi🤖 Prompt for AI Agents |
||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||
| else | ||||||||||||||||||||||||||||||
| log_info "[dry-run] Would auto-recover $task_id from deploying to deployed" | ||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # t222: Record recovery timing | ||||||||||||||||||||||||||||||
| local stage_end | ||||||||||||||||||||||||||||||
| stage_end=$(date +%s) | ||||||||||||||||||||||||||||||
| stage_timings="${stage_timings}deploying_recovery:$((stage_end - stage_start))s," | ||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # t219: Record total lifecycle timing and log to proof-log | ||||||||||||||||||||||||||||||
| local lifecycle_end_time | ||||||||||||||||||||||||||||||
| lifecycle_end_time=$(date +%s) | ||||||||||||||||||||||||||||||
|
|
@@ -7774,6 +7817,33 @@ cmd_pulse() { | |||||||||||||||||||||||||||||
| done <<< "$stale_diags" | ||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # Phase 4d: Auto-recover stuck deploying tasks (t222) | ||||||||||||||||||||||||||||||
| # Tasks can get stuck in 'deploying' if the deploy succeeds but the | ||||||||||||||||||||||||||||||
| # transition to 'deployed' fails (e.g., DB write error, process killed | ||||||||||||||||||||||||||||||
| # mid-transition). Detect tasks in 'deploying' state for longer than | ||||||||||||||||||||||||||||||
| # the deploy timeout and auto-recover them via process_post_pr_lifecycle | ||||||||||||||||||||||||||||||
| # (which now handles the deploying state in Step 4b of cmd_pr_lifecycle). | ||||||||||||||||||||||||||||||
| local deploying_timeout_seconds="${SUPERVISOR_DEPLOY_TIMEOUT:-600}" # 10 min default | ||||||||||||||||||||||||||||||
| local stuck_deploying | ||||||||||||||||||||||||||||||
| stuck_deploying=$(db "$SUPERVISOR_DB" " | ||||||||||||||||||||||||||||||
| SELECT id, updated_at FROM tasks | ||||||||||||||||||||||||||||||
| WHERE status = 'deploying' | ||||||||||||||||||||||||||||||
| AND updated_at < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-${deploying_timeout_seconds} seconds'); | ||||||||||||||||||||||||||||||
| " 2>/dev/null || echo "") | ||||||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Error suppression with
Suggested change
References
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| if [[ -n "$stuck_deploying" ]]; then | ||||||||||||||||||||||||||||||
| while IFS='|' read -r stuck_id stuck_updated; do | ||||||||||||||||||||||||||||||
| [[ -n "$stuck_id" ]] || continue | ||||||||||||||||||||||||||||||
| log_warn " Stuck deploying: $stuck_id (last updated: ${stuck_updated:-unknown}, timeout: ${deploying_timeout_seconds}s) — triggering recovery (t222)" | ||||||||||||||||||||||||||||||
| # process_post_pr_lifecycle will pick this up and run cmd_pr_lifecycle | ||||||||||||||||||||||||||||||
| # which now handles the deploying state in Step 4b | ||||||||||||||||||||||||||||||
| cmd_pr_lifecycle "$stuck_id" 2>>"$SUPERVISOR_LOG" || { | ||||||||||||||||||||||||||||||
| log_error " Recovery failed for stuck deploying task $stuck_id — forcing to deployed" | ||||||||||||||||||||||||||||||
| cmd_transition "$stuck_id" "deployed" --error "Force-recovered from stuck deploying (t222)" 2>>"$SUPERVISOR_LOG" || true | ||||||||||||||||||||||||||||||
|
Comment on lines
+7841
to
+7842
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The fallback logic here seems incorrect. If
Suggested change
|
||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||
|
Comment on lines
+7840
to
+7843
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Guard the force-deploy fallback with a state re-check. Line 7840: The fallback forces Suggested fix- cmd_pr_lifecycle "$stuck_id" 2>>"$SUPERVISOR_LOG" || {
- log_error " Recovery failed for stuck deploying task $stuck_id — forcing to deployed"
- cmd_transition "$stuck_id" "deployed" --error "Force-recovered from stuck deploying (t222)" 2>>"$SUPERVISOR_LOG" || true
- }
+ cmd_pr_lifecycle "$stuck_id" 2>>"$SUPERVISOR_LOG" || {
+ log_error " Recovery failed for stuck deploying task $stuck_id — evaluating fallback"
+ local current_state
+ current_state=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$stuck_id")';" 2>/dev/null || echo "")
+ if [[ "$current_state" == "deploying" ]]; then
+ cmd_transition "$stuck_id" "deployed" --error "Force-recovered from stuck deploying (t222)" 2>>"$SUPERVISOR_LOG" || true
+ else
+ log_warn " Skipping force-recovery for $stuck_id (state now $current_state)"
+ fi
+ }📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents |
||||||||||||||||||||||||||||||
| done <<< "$stuck_deploying" | ||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # Phase 5: Summary | ||||||||||||||||||||||||||||||
| local total_running | ||||||||||||||||||||||||||||||
| total_running=$(cmd_running_count "${batch_id:-}") | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error suppression with
2>/dev/nullviolates the repository style guide. Rule #50 states that2>/dev/nullis only acceptable when redirecting to log files, not for blanket suppression. Errors fromwrite_proof_logshould be captured for debugging purposes.References
2>/dev/nullis acceptable ONLY when redirecting to log files, not blanket suppression. This change suppresses stderr without redirecting stdout to a log file, which can hide important errors. (link)