feat(claude-loop-tick): self-sufficient background — zero-PR backoff + push-hang workaround + ship-rate metric#4146
Conversation
…backoff, push-hang awareness, ship-rate metric (B-0615 sibling) Addresses Aaron's directive 2026-05-18: fix background services so they stop wasting resources when output rate drops to zero (relevant to ServiceTitan funding optic — burning model tokens for 0 PRs/hour is untenable as a metric). Three bundled improvements (all touch background-loop self-sufficiency): 1. **Zero-PR backoff** (lines 32-39, 175-194, 304-307): track consecutive cycles where produced_pr=false via ratings file. After ZETA_CLAUDE_LOOP_BACKOFF_THRESHOLD (default 3) zero-PR cycles, multiply claudeIntervalMs linearly up to ZETA_CLAUDE_LOOP_BACKOFF_MAX_MULTIPLIER (default 30x). Resets on first produced_pr=true. Stops burning model tokens during push-hang famine or other systemic-block conditions. 2. **Push-hang workaround instructions in spawned-claude prompts** (lines 231, 250): inform the spawned claude session about the REST git-data API bypass pattern documented in PR #4145. When git push silently fails (exit 0, no remote update — B-0615), the bypass uses POST /repos/.../git/{blobs,trees,commits,refs} to land commits directly via REST. 3. **Ship-rate metric + heartbeat visibility** (lines 384-401): computes shipped/total ratio across last 10 cycles, surfaces in heartbeat log. Adds backoff_xN annotation to dueIn when in backoff state. Operational visibility for monitoring whether the famine-detection actually triggers in practice. Also wraps refresh-worldview.ts invocations in timeout --kill-after for consistency with the rule landed in PR #4145. Landed via REST git-data API because git push remains hanging system-wide at authoring time (the very failure mode this PR teaches the loop to recognize and work around). Experimental per Aaron's 2026-05-18T13:35Z: "make any changes you think will fix your backgournd service we can experiment." Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 686e0553ca
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (r.produced_pr === true) break; | ||
| if (r.produced_pr === false) consecutiveZeroPrCycles += 1; |
There was a problem hiding this comment.
Exclude drain cycles from zero-PR backoff counter
The new backoff logic increments on every produced_pr === false, but drain-mode runs never set produced_pr true (they are thread-resolution/merge cycles, not PR-creation cycles), so a healthy period with open PRs will still look like consecutive “zero-PR failures.” In that scenario the loop quickly ratchets to backoffMaxMultiplier, delaying review-thread handling and merges by up to 30x even though useful work is happening. Gate the counter to pickup-mode attempts (or record a separate success signal for drain work) before applying backoff.
Useful? React with 👍 / 👎.
…dex P1 follow-up on #4146) (#4148) PR #4146 was merged before this Codex P1 fix could land on its branch. The finding: zero-PR backoff in #4146 increments on every produced_pr= false, but drain-mode runs (thread-resolution / merge work on existing PRs) never set produced_pr=true. So a healthy drain-only period would falsely look like consecutive zero-PR failures and ratchet to 30x slowdown — delaying review-thread handling and merges. Fix: filter the counter on r.mode === 'pickup' — skip drain cycles entirely when counting. Drain mode is excluded because the success signal there is NOT produced_pr (no new PR is created) but rather thread-resolution + merge, which we don't currently track as a discrete success metric (could be future work). Backoff now only triggers when pickup-mode (the cycle that's SUPPOSED to create new PRs) consistently fails. Healthy drain operation no longer slows it down. Co-authored-by: Claude <noreply@anthropic.com>
…pline to Vera's spawned prompt (#4149) Cross-agent consistency with claude-loop-tick (PR #4146): Vera's spawned codex sessions now know about (1) timeout --kill-after for git network ops, (2) the REST git-data API bypass via bun tools/github/rest-push.ts (PR #4147), and (3) refresh-worldview should be timeout-wrapped. Three new sentences appended to the existing refresh-worldview prompt block (lines 203-209): - Wraps refresh-worldview invocation in timeout --kill-after - Generic wrap-all-git-network-ops discipline per the rule landed in PR #4145 - Push-hang workaround: prefer bun tools/github/rest-push.ts (PR #4147) over git push when push hangs No changes to Vera's loop-tick.ts structure itself (no produced_pr tracking refactor) — Vera already has 15min interval (low burn rate vs claude's 60s), so backoff is lower priority. Push-hang awareness is the high-leverage cross-cutting fix. Co-authored-by: Claude <noreply@anthropic.com>
Aaron's directive 2026-05-18T13:35Z: "make any changes you think will fix your backgournd service we can experiment." Resource-cost concern: 9-hour foreground stretch with 0 PRs + system-wide push-hang for hours = bad metric for ServiceTitan funding optic.
Three bundled improvements (all touch background-loop self-sufficiency)
1. Zero-PR backoff
When
produced_pr=falsefor N consecutive cycles (ZETA_CLAUDE_LOOP_BACKOFF_THRESHOLD, default 3), multiplyclaudeIntervalMslinearly up toZETA_CLAUDE_LOOP_BACKOFF_MAX_MULTIPLIER(default 30x). Resets on firstproduced_pr=true. Default config: 60s interval grows to 30 min when in famine.Motivation: at 60s interval with 10-min claude timeout, a stuck loop burns ~50,000 tokens/hour for 0 output. The backoff caps the waste while still polling for recovery.
2. Push-hang workaround in spawned-claude prompts
Both
pickupanddrainmode prompts now include explicit instructions for the REST git-data API bypass whengit pushsilently fails (exit 0, no remote update — B-0615). The bypass uses:Worked example: PR #4145 AND THIS VERY PR — both landed via REST git-data API because regular
git pushis still hanging system-wide as I write this.Also wraps
bun tools/github/refresh-worldview.tsinvocations intimeout --kill-after=5s 30sper the rule landed in PR #4145.3. Ship-rate metric in heartbeat log
Computes
shipped/totalratio across last 10 cycles + surfaces in heartbeat log line asship_rate=N/M. Addsbackoff_xN_zero_pr_cycles=Kannotation todue_inwhen in backoff state. Operational visibility — lets us monitor whether the famine-detection actually triggers in practice.Experiment framing
Per Aaron's authorization: this is exploratory. We can:
ZETA_CLAUDE_LOOP_BACKOFF_THRESHOLD,ZETA_CLAUDE_LOOP_BACKOFF_MAX_MULTIPLIER)THRESHOLD=999(effectively disables)No destructive changes; all opt-in via existing config knobs.
Self-eating dog food
This PR's commit (
686e055) was landed via REST git-data API whilegit pushwas hanging on this very machine — exactly the failure mode the PR teaches the loop to recognize and work around.Composes with
timeout --kill-afterdiscipline + REST bypass)Co-Authored-By: Claude noreply@anthropic.com