diff --git a/TODO.md b/TODO.md
index 3fea182b0c..d38a1c6b88 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1470,37 +1470,37 @@ t019.3.4,Update AGENTS.md with Beads integration docs,,beads,1h,45m,2025-12-21T1
 - [x] t1101 Verify and close t1081 parent task — all subtasks verified #chore #auto-dispatch ~15m model:sonnet — t1081 subtasks t1081.1-t1081.4 are all verified with merged PRs in the last 24h. Review whether the parent task t1081 ('Daily skill auto-update pipeline') is fully satisfied by these subtasks or if additional integration work is needed. If complete: mark t1081 [x] with proper proof-log referencing the subtask PRs. This unblocks t1082 ('Maintainer skill-update PR pipeline'). ref:GH#1644 assignee:marcusquinn pr:#1645 verified:2026-02-18 completed:2026-02-18
-- [ ] t1102 Mark t1079 verified and close — scripts already have set -euo pipefail #chore #auto-dispatch ~10m model:sonnet — t1079 is still open in TODO.md but investigation on GH#1572 confirmed all 8 scripts already have `set -euo pipefail`. Mark t1079 `[x]` with `verified:2026-02-18` proof-log. No PR needed since no code change was required. assignee:marcusquinn started:2026-02-18T14:50:36Z ref:GH#1646
+- [-] t1102 Mark t1079 verified and close — scripts already have set -euo pipefail #chore #auto-dispatch ~10m model:sonnet — t1079 is still open in TODO.md but investigation on GH#1572 confirmed all 8 scripts already have `set -euo pipefail`. Mark t1079 `[x]` with `verified:2026-02-18` proof-log. No PR needed since no code change was required. assignee:marcusquinn started:2026-02-18T14:50:36Z ref:GH#1646 cancelled:2026-02-18 cancel-reason:stuck-evaluating-state-manual-cleanup
 - [ ] t1103 Deduplicate t1101 in TODO.md — appears twice at lines 1471 and 1473 #bugfix #auto-dispatch ~10m model:sonnet — t1101 ('Verify and close t1081 parent task') appears twice in TODO.md. Line 1471 has no assignee, line 1473 has `assignee:marcusquinn`. Remove the duplicate (line 1471) keeping the claimed version.
This is likely from a merge conflict or race condition in planning commits. Also verify the dedup_todo_task_ids() function handles this case — the same bug that created issues #1533-#1540. ref:GH#1647
-- [ ] t1104 Verify and close t1085 parent — all 7 subtasks completed plus 4 follow-up bug fixes merged #chore #auto-dispatch ~15m model:sonnet — t1085 ('Supervisor Intelligence Upgrade') has all 7 subtasks (t1085.1-t1085.7) marked [x] with merged PRs. Additionally, 4 follow-up bug fixes were merged on the same day: PR#1640 (portable timeout), PR#1641 (remove artificial pulse counter), PR#1642 (fix --format text), PR#1643 (increase timeout). Phase 14 is now operational. Mark t1085 [x] with proof-log referencing subtask PRs and the follow-up fixes. Close GH#1599. assignee:marcusquinn started:2026-02-18T14:50:54Z ref:GH#1648
+- [-] t1104 Verify and close t1085 parent — all 7 subtasks completed plus 4 follow-up bug fixes merged #chore #auto-dispatch ~15m model:sonnet — t1085 ('Supervisor Intelligence Upgrade') has all 7 subtasks (t1085.1-t1085.7) marked [x] with merged PRs. Additionally, 4 follow-up bug fixes were merged on the same day: PR#1640 (portable timeout), PR#1641 (remove artificial pulse counter), PR#1642 (fix --format text), PR#1643 (increase timeout). Phase 14 is now operational. Mark t1085 [x] with proof-log referencing subtask PRs and the follow-up fixes. Close GH#1599. assignee:marcusquinn started:2026-02-18T14:50:54Z ref:GH#1648 cancelled:2026-02-18 cancel-reason:stuck-evaluating-state-manual-cleanup
-- [ ] t1105 Verify and close t1056 parent — all 4 subtasks completed with merged PRs #chore #auto-dispatch ~15m model:sonnet — t1056 ('Automated Matrix+Cloudron setup') has all 4 subtasks (t1056.1-t1056.4) marked [x] with merged PRs (#1470, #1471, #1474, #1473). The parent task at line 223 of TODO.md is still open. Mark it [x] with proof-log.
Note: there is a DIFFERENT t1056 at line 1359 (a bugfix about Intel brew shellenv) that was already completed — this is a task ID collision that should be flagged but not fixed here (the completed one has its own PR#1512). assignee:marcusquinn started:2026-02-18T14:51:12Z ref:GH#1649
+- [-] t1105 Verify and close t1056 parent — all 4 subtasks completed with merged PRs #chore #auto-dispatch ~15m model:sonnet — t1056 ('Automated Matrix+Cloudron setup') has all 4 subtasks (t1056.1-t1056.4) marked [x] with merged PRs (#1470, #1471, #1474, #1473). The parent task at line 223 of TODO.md is still open. Mark it [x] with proof-log. Note: there is a DIFFERENT t1056 at line 1359 (a bugfix about Intel brew shellenv) that was already completed — this is a task ID collision that should be flagged but not fixed here (the completed one has its own PR#1512). assignee:marcusquinn started:2026-02-18T14:51:12Z ref:GH#1649 cancelled:2026-02-18 cancel-reason:stuck-evaluating-state-manual-cleanup
-- [ ] t1107 Verify and close t1094 parent — only subtask t1094.1 exists, remaining scope unclear #chore #auto-dispatch ~30m model:sonnet — t1094 (Unified model performance scoring) is an 8h feature task with only one subtask (t1094.1, completed). The parent description mentions prompt strategy tracking, output quality gradient, failure categorization, token usage, and A/B comparison — but only t1094.1 (build-agent reference update) was created as a subtask. Either: (1) the remaining scope was absorbed by other tasks (t1095-t1100 cover related ground), (2) subtasks were never created, or (3) the parent needs decomposition. Investigate which deliverables from t1094's description are already implemented and which remain. If complete, mark with proof-log. If not, create remaining subtasks.
ref:GH#1655 assignee:marcusquinn started:2026-02-18T15:01:01Z
+- [-] t1107 Verify and close t1094 parent — only subtask t1094.1 exists, remaining scope unclear #chore #auto-dispatch ~30m model:sonnet — t1094 (Unified model performance scoring) is an 8h feature task with only one subtask (t1094.1, completed). The parent description mentions prompt strategy tracking, output quality gradient, failure categorization, token usage, and A/B comparison — but only t1094.1 (build-agent reference update) was created as a subtask. Either: (1) the remaining scope was absorbed by other tasks (t1095-t1100 cover related ground), (2) subtasks were never created, or (3) the parent needs decomposition. Investigate which deliverables from t1094's description are already implemented and which remain. If complete, mark with proof-log. If not, create remaining subtasks. ref:GH#1655 assignee:marcusquinn started:2026-02-18T15:01:01Z cancelled:2026-02-18 cancel-reason:stuck-evaluating-state-manual-cleanup
-- [ ] t1108 Investigate 10 high-severity audit findings — closed issues without PR linkage #chore #auto-dispatch #audit ~1h model:sonnet — The issue audit found 10 closed GitHub issues with no PR linkage or completion evidence: #1601 (t1085.2), #1581 (t1083.3), #1572 (t1079), #1540 (t1077), #1539 (t1076), #1538 (t1075), #1537 (t1074), #1535 (t1073), #1534 (t1072.3), #1533 (t1071.2). For each: (1) check if a merged PR exists that addresses the task, (2) if yes, add the PR link and re-close with evidence, (3) if no, reopen the issue. Some may be false positives from the issue-sync pipeline closing issues before PR linkage was recorded. Batch process — do not reopen issues that have legitimate merged PRs.
ref:GH#1656 assignee:marcusquinn started:2026-02-18T15:21:09Z
+- [-] t1108 Investigate 10 high-severity audit findings — closed issues without PR linkage #chore #auto-dispatch #audit ~1h model:sonnet — The issue audit found 10 closed GitHub issues with no PR linkage or completion evidence: #1601 (t1085.2), #1581 (t1083.3), #1572 (t1079), #1540 (t1077), #1539 (t1076), #1538 (t1075), #1537 (t1074), #1535 (t1073), #1534 (t1072.3), #1533 (t1071.2). For each: (1) check if a merged PR exists that addresses the task, (2) if yes, add the PR link and re-close with evidence, (3) if no, reopen the issue. Some may be false positives from the issue-sync pipeline closing issues before PR linkage was recorded. Batch process — do not reopen issues that have legitimate merged PRs. ref:GH#1656 assignee:marcusquinn started:2026-02-18T15:21:09Z cancelled:2026-02-18 cancel-reason:stuck-evaluating-state-manual-cleanup
-- [ ] t1109 Add guard against opus escalation for simple chore/verification tasks #enhancement #auto-dispatch #self-improvement ~1h model:sonnet category:automation — t1102 (a simple 'verify and close' chore task, ~10m estimate) was escalated to opus after a clean_exit_no_signal failure. This failure mode indicates a worker infrastructure issue, not insufficient model capability. The quality gate escalation logic should distinguish between task-complexity failures (wrong answers, incomplete work) and infrastructure failures (worker_never_started, clean_exit_no_signal, hung workers). Infrastructure failures should retry at the same tier, not escalate. Additionally, tasks tagged #chore with estimates <=15m should have a ceiling of sonnet unless explicitly overridden. Pattern data: t1104 and t1105 (identical chore/verify tasks) both succeeded at sonnet.
ref:GH#1657 assignee:marcusquinn started:2026-02-18T15:21:51Z
+- [-] t1109 Add guard against opus escalation for simple chore/verification tasks #enhancement #auto-dispatch #self-improvement ~1h model:sonnet category:automation — t1102 (a simple 'verify and close' chore task, ~10m estimate) was escalated to opus after a clean_exit_no_signal failure. This failure mode indicates a worker infrastructure issue, not insufficient model capability. The quality gate escalation logic should distinguish between task-complexity failures (wrong answers, incomplete work) and infrastructure failures (worker_never_started, clean_exit_no_signal, hung workers). Infrastructure failures should retry at the same tier, not escalate. Additionally, tasks tagged #chore with estimates <=15m should have a ceiling of sonnet unless explicitly overridden. Pattern data: t1104 and t1105 (identical chore/verify tasks) both succeeded at sonnet. ref:GH#1657 assignee:marcusquinn started:2026-02-18T15:21:51Z cancelled:2026-02-18 cancel-reason:superseded-by-feature/supervisor-self-heal
-- [ ] t1110 Prevent issue-sync from closing issues without PR linkage evidence #enhancement #auto-dispatch #self-improvement ~2h model:sonnet — The issue-sync GitHub Action auto-closes issues when TODO tasks are marked [x]. However, 10+ issues were closed without any PR linkage, creating high-severity audit findings. The issue-sync should verify that the TODO entry has a proof-log (pr:#NNN or verified:YYYY-MM-DD) before closing the linked GitHub issue. If no proof-log exists, the issue should remain open with a comment explaining why. This aligns with the existing pre-commit hook that warns about missing proof-logs but doesn't block. ref:GH#1658 assignee:marcusquinn started:2026-02-18T15:31:58Z
+- [-] t1110 Prevent issue-sync from closing issues without PR linkage evidence #enhancement #auto-dispatch #self-improvement ~2h model:sonnet — The issue-sync GitHub Action auto-closes issues when TODO tasks are marked [x].
However, 10+ issues were closed without any PR linkage, creating high-severity audit findings. The issue-sync should verify that the TODO entry has a proof-log (pr:#NNN or verified:YYYY-MM-DD) before closing the linked GitHub issue. If no proof-log exists, the issue should remain open with a comment explaining why. This aligns with the existing pre-commit hook that warns about missing proof-logs but doesn't block. ref:GH#1658 assignee:marcusquinn started:2026-02-18T15:31:58Z cancelled:2026-02-18 cancel-reason:superseded-by-feature/supervisor-self-heal
-- [ ] t1111 Resolve 4 stuck 'evaluating' tasks in supervisor DB #chore #auto-dispatch #self-improvement ~30m model:sonnet — The supervisor DB shows 4 tasks in 'evaluating' state with no running workers. These may be stuck from a previous pulse cycle that was interrupted (possibly by the memory-based respawn in Phase 11, or the GNU timeout bug fixed in PR#1642). Investigate which tasks are stuck in evaluating, check if their workers produced output, and either transition them to verified/failed/queued as appropriate. Check supervisor logs for the last evaluation attempt on each. ref:GH#1659
+- [-] t1111 Resolve 4 stuck 'evaluating' tasks in supervisor DB #chore #auto-dispatch #self-improvement ~30m model:sonnet — The supervisor DB shows 4 tasks in 'evaluating' state with no running workers. These may be stuck from a previous pulse cycle that was interrupted (possibly by the memory-based respawn in Phase 11, or the GNU timeout bug fixed in PR#1642). Investigate which tasks are stuck in evaluating, check if their workers produced output, and either transition them to verified/failed/queued as appropriate. Check supervisor logs for the last evaluation attempt on each.
ref:GH#1659 cancelled:2026-02-18 cancel-reason:stuck-evaluating-state-manual-cleanup
-- [ ] t1112 Add supervisor self-heal for stuck 'evaluating' tasks #enhancement #auto-dispatch #self-improvement ~1h model:sonnet category:automation — The supervisor pulse should detect tasks that have been in 'evaluating' state for longer than a threshold (e.g., 10 minutes) and automatically retry evaluation or transition them to 'failed' with a reason. Currently, if evaluation is interrupted (crash, timeout, respawn), the task stays in 'evaluating' forever. Add a Phase N check: query for tasks where status='evaluating' AND updated_at < now()-600s, log a warning, and re-queue them for evaluation. This prevents silent accumulation of stuck tasks. ref:GH#1660
+- [-] t1112 Add supervisor self-heal for stuck 'evaluating' tasks #enhancement #auto-dispatch #self-improvement ~1h model:sonnet category:automation — The supervisor pulse should detect tasks that have been in 'evaluating' state for longer than a threshold (e.g., 10 minutes) and automatically retry evaluation or transition them to 'failed' with a reason. Currently, if evaluation is interrupted (crash, timeout, respawn), the task stays in 'evaluating' forever. Add a Phase N check: query for tasks where status='evaluating' AND updated_at < now()-600s, log a warning, and re-queue them for evaluation. This prevents silent accumulation of stuck tasks. ref:GH#1660 cancelled:2026-02-18 cancel-reason:superseded-by-feature/supervisor-self-heal
 - [ ] t1113 Add worker_never_started diagnostic and auto-retry with environment check #enhancement #auto-dispatch #self-improvement ~1h model:sonnet category:automation — 5 tasks failed with 'worker_never_started:no_sentinel' on Feb 13. This means the Claude CLI was invoked but never produced output. Current behavior: mark as failed and move on.
Improved behavior: (1) before dispatch, run a lightweight Claude CLI health check (e.g., 'Claude --version' or a trivial prompt), (2) if health check fails, log the environment issue and skip dispatch until next pulse rather than burning a retry, (3) add the health check result to the dispatch log so failures are diagnosable. This prevents wasting retries on environment issues. ref:GH#1664
 - [ ] t1114 Track opus vs sonnet token cost ratio in pattern tracker for ROI analysis #enhancement #auto-dispatch #self-improvement ~2h model:sonnet category:efficiency — Pattern tracker records model tier per task but doesn't track token usage or cost. Add optional token_count and estimated_cost fields to pattern records (populated from worker evaluation output when available). This enables: (1) cost-per-task-type analysis, (2) ROI comparison between tiers (does opus's higher success rate justify 10-15x cost for chore tasks?), (3) data-driven model routing recommendations. Start by adding the fields and populating from supervisor evaluation logs which already contain duration data. ref:GH#1663 assignee:marcusquinn started:2026-02-18T16:24:54Z
-- [ ] t1115 Diagnose and fix dispatch stall — 6 queued tasks, 0 running workers #bugfix #auto-dispatch #self-improvement ~1h model:sonnet category:automation — Current state shows 6 queued tasks and 0 running workers, with 4 tasks stuck in 'evaluating'. The pulse cycle should be dispatching queued tasks but appears stalled. Investigate: (1) Is the cron pulse running? Check last pulse timestamp. (2) Are the 4 evaluating tasks blocking the dispatch loop? (3) Is there a concurrency limit being hit by counting evaluating tasks as 'active'? (4) Check supervisor-helper.sh pulse logs for errors. Fix the root cause so queued tasks flow to dispatch automatically.
ref:GH#1667
+- [-] t1115 Diagnose and fix dispatch stall — 6 queued tasks, 0 running workers #bugfix #auto-dispatch #self-improvement ~1h model:sonnet category:automation — Current state shows 6 queued tasks and 0 running workers, with 4 tasks stuck in 'evaluating'. The pulse cycle should be dispatching queued tasks but appears stalled. Investigate: (1) Is the cron pulse running? Check last pulse timestamp. (2) Are the 4 evaluating tasks blocking the dispatch loop? (3) Is there a concurrency limit being hit by counting evaluating tasks as 'active'? (4) Check supervisor-helper.sh pulse logs for errors. Fix the root cause so queued tasks flow to dispatch automatically. ref:GH#1667 cancelled:2026-02-18 cancel-reason:superseded-by-feature/supervisor-self-heal
 - [x] t1116 Unblock t1082 — verify t1081 parent is complete and mark done #chore #auto-dispatch ~15m model:sonnet — t1101 (verify and close t1081 parent) completed with PR#1645, confirming all t1081 subtasks (t1081.1-t1081.4) are verified. However, t1081 itself may still be marked open in TODO.md, which keeps t1082 blocked. Steps: (1) Check if t1081 is marked [x] in TODO.md. (2) If not, mark it complete with proof-log referencing PR#1645 and subtask PRs. (3) Verify t1082 is now unblocked. (4) If t1082 is unblocked, ensure it's in the dispatch queue. ref:GH#1668 assignee:marcusquinn started:2026-02-18T16:25:46Z pr:#1687 verified:2026-02-18 completed:2026-02-18
 - [ ] t1117 Add model tier field to supervisor dispatch logging for cost analysis #enhancement #auto-dispatch #self-improvement ~1h model:sonnet category:observability — Pattern data shows 500 opus entries vs 354 sonnet, but we can't easily determine which tasks were unnecessarily dispatched at opus. Add explicit model_tier logging to the supervisor's dispatch and evaluation records so that post-hoc analysis can identify cost waste patterns.
Fields needed: (1) requested_tier (from TODO.md model: tag), (2) actual_tier (after escalation/fallback), (3) token_count (if available from worker output). This feeds into t1114 (opus vs sonnet cost ratio tracking) and t1109 (opus escalation guard). ref:GH#1669 assignee:marcusquinn started:2026-02-18T16:26:58Z
-- [ ] t1119 Add dispatch stall auto-detection to supervisor pulse cycle #enhancement #auto-dispatch #self-improvement ~3h model:sonnet category:automation — The current state shows 7 queued tasks and 0 running workers with no automatic recovery. The supervisor pulse should detect when queued > 0 and running == 0 for more than 2 consecutive pulse cycles (4+ minutes) and trigger diagnostic actions: (1) check if dispatch.sh is functional, (2) verify Claude CLI is available, (3) check for stuck evaluating tasks blocking slots, (4) log the stall event, (5) attempt a single dispatch as a probe. This would have caught the current stall automatically instead of requiring t1115 to be manually created. ref:GH#1672
+- [-] t1119 Add dispatch stall auto-detection to supervisor pulse cycle #enhancement #auto-dispatch #self-improvement ~3h model:sonnet category:automation — The current state shows 7 queued tasks and 0 running workers with no automatic recovery. The supervisor pulse should detect when queued > 0 and running == 0 for more than 2 consecutive pulse cycles (4+ minutes) and trigger diagnostic actions: (1) check if dispatch.sh is functional, (2) verify Claude CLI is available, (3) check for stuck evaluating tasks blocking slots, (4) log the stall event, (5) attempt a single dispatch as a probe. This would have caught the current stall automatically instead of requiring t1115 to be manually created.
ref:GH#1672 cancelled:2026-02-18 cancel-reason:superseded-by-feature/supervisor-self-heal
 - [ ] t1120 Add platform abstraction to issue-sync-helper.sh — Gitea and GitLab support #feature #git #sync ~4h (ai:2h test:1h read:1h) model:sonnet ref:GH#1673 logged:2026-02-18
 - issue-sync-helper.sh currently hardcodes `gh` CLI for all operations (create, close, edit, list, search, labels)