feat: auto-escalate model on worker failure + extend timeouts (t314) #1215

Merged
marcusquinn merged 1 commit into main from feature/t314 on Feb 12, 2026
@marcusquinn
Owner

Summary

  • Auto-escalate task model to next tier (e.g., sonnet→opus) when workers fail, so retries use a more capable model
  • Increase hung worker timeout from 30min to 1h and max runtime from 2h to 4h
  • Record failure patterns for future routing decisions

Problem

Workers dispatched with model:sonnet for complex tasks (t311.2, t311.3, t303) consistently hung at the 30min timeout. On retry, the same sonnet model was used again, producing the same failure. The existing quality gate escalation only triggers for completed tasks that fail quality checks — it never fires for hung/crashed workers.

Changes

| Change | Before | After |
| --- | --- | --- |
| SUPERVISOR_WORKER_TIMEOUT | 1800s (30min) | 3600s (1h) |
| SUPERVISOR_WORKER_MAX_RUNTIME | 7200s (2h) | 14400s (4h) |
| Model on failure | Same model reused | Auto-escalate via get_next_tier() |

New function escalate_model_on_failure() is called from attempt_self_heal() before creating diagnostic subtasks. Uses existing escalation_depth/max_escalation columns (from t132.6) to prevent infinite escalation.
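
The escalation guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the tier order inside `get_next_tier`, the argument list, and the "model depth" output format are all assumptions made for the example, and the real function reads and writes the `escalation_depth`/`max_escalation` columns rather than taking them as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: tier order and argument handling are assumptions,
# not the implementation merged in this PR.

# Print the next model tier; return non-zero if the tier is unknown
# or already the highest one.
get_next_tier() {
  case "$1" in
    haiku)  echo "sonnet" ;;
    sonnet) echo "opus" ;;
    *)      return 1 ;;
  esac
}

# Decide which model a retry should use.
# Args: current model, escalation_depth, max_escalation
# Prints "<model> <depth>" for the retry; returns non-zero if no
# escalation took place.
escalate_model_on_failure() {
  local model="$1" depth="$2" max="$3" next
  if (( depth >= max )); then
    echo "$model $depth"   # escalation budget spent: keep the model
    return 1
  fi
  if ! next="$(get_next_tier "$model")"; then
    echo "$model $depth"   # no higher tier available
    return 1
  fi
  echo "$next $((depth + 1))"
}
```

For example, `escalate_model_on_failure sonnet 0 2` prints `opus 1`, while `escalate_model_on_failure sonnet 2 2` prints `sonnet 2` and returns non-zero, which is the depth check that keeps retries from escalating forever.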

Testing

  • bash -n syntax check: PASS
  • ShellCheck: no new warnings
  • Diff: 90 insertions, 3 deletions

Ref #1212

… (t314)

When workers fail (hung, crashed, max runtime), automatically escalate the
task's model to the next tier via get_next_tier() before re-queuing. Previously,
retries repeated with the same underpowered model. Now: sonnet failures
auto-escalate to opus, haiku to sonnet, etc.

Also doubles worker timeouts:
- Hung detection: 30min -> 1h (SUPERVISOR_WORKER_TIMEOUT)
- Max runtime: 2h -> 4h (SUPERVISOR_WORKER_MAX_RUNTIME)

Complex refactoring tasks (t311.2, t311.3) consistently hit the 30min hung
timeout. The new defaults give workers adequate time while still catching
truly stuck processes.
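
To make the two limits concrete, here is a small sketch of how a supervisor might tell a hung worker from one that has exceeded its maximum runtime. The default values are the ones from this PR; the `classify_worker` helper, the environment-variable override, and the idea of measuring "hung" as seconds since last activity are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Defaults taken from this PR; allowing an environment override is an assumption.
SUPERVISOR_WORKER_TIMEOUT="${SUPERVISOR_WORKER_TIMEOUT:-3600}"          # hung detection: 1h
SUPERVISOR_WORKER_MAX_RUNTIME="${SUPERVISOR_WORKER_MAX_RUNTIME:-14400}" # max runtime: 4h

# Hypothetical helper: classify a worker from two ages in seconds.
# Args: seconds since the worker started, seconds since its last activity
classify_worker() {
  local runtime="$1" idle="$2"
  if (( runtime >= SUPERVISOR_WORKER_MAX_RUNTIME )); then
    echo "max_runtime"
  elif (( idle >= SUPERVISOR_WORKER_TIMEOUT )); then
    echo "hung"
  else
    echo "ok"
  fi
}
```

With these defaults, a worker that has been idle for 2000 seconds is still classified as `ok`, whereas the previous 1800-second threshold would have flagged it as hung.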

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 15 code smells

[INFO] Recent monitoring activity:
Thu Feb 12 03:52:20 UTC 2026: Code review monitoring started
Thu Feb 12 03:52:20 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 15

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 15
  • VULNERABILITIES: 0

Generated on: Thu Feb 12 03:52:22 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@marcusquinn marcusquinn merged commit 73703d3 into main Feb 12, 2026
10 of 11 checks passed
marcusquinn added a commit that referenced this pull request Feb 12, 2026
t303 (#1216) and t311.2 (#1218) were branched before t314 (#1215) merged,
so their squash-merges overwrote the timeout changes. Restoring:
- SUPERVISOR_WORKER_TIMEOUT: 1800 -> 3600 (1h)
- SUPERVISOR_WORKER_MAX_RUNTIME: 7200 -> 14400 (4h)
marcusquinn added a commit that referenced this pull request Feb 12, 2026
…es (#1219)

t303 (#1216) and t311.2 (#1218) were branched before t314 (#1215) merged,
so their squash-merges overwrote the timeout changes. Restoring:
- SUPERVISOR_WORKER_TIMEOUT: 1800 -> 3600 (1h)
- SUPERVISOR_WORKER_MAX_RUNTIME: 7200 -> 14400 (4h)
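
One way to catch this kind of silent overwrite is a small check that the defaults are still present after a merge. The sketch below is hypothetical: `check_timeout_defaults` is not part of the repository, and the path of the script that defines the defaults is not stated here, so it is passed in as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical regression check, not part of this PR.
# Arg: path to the script that defines the timeout defaults.
# Succeeds only if both t314 values (3600 and 14400) are still assigned.
check_timeout_defaults() {
  local file="$1"
  grep -Eq 'SUPERVISOR_WORKER_TIMEOUT[^0-9]*3600([^0-9]|$)' "$file" &&
    grep -Eq 'SUPERVISOR_WORKER_MAX_RUNTIME[^0-9]*14400([^0-9]|$)' "$file"
}
```

Running `check_timeout_defaults path/to/script.sh || echo "timeout defaults regressed"` in CI would flag a squash-merge that restores the old 1800/7200 values.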
@marcusquinn marcusquinn deleted the feature/t314 branch February 21, 2026 01:59