feat: address user feedback — zombie cleanup, ErrSkipTick, config getters#6
Conversation
…ters, docs - Zombie children: lazily prune stopped children via done channel on GetChildren/GetChild/GetChildCount — no manual cleanup needed - ErrSkipTick: sentinel for periodic handlers to skip a tick without triggering restart (transient failures like DB timeouts) - WorkerInfo.GetHandler(): expose handler for middleware type assertion, enabling the handler-as-metadata pattern - Worker config getters: GetInterval, GetRestartOnFail, GetJitterPercent, GetInitialDelay for reconciler introspection - GetChildCount(): efficient child count without allocating a sorted slice - WithSkipOnNotAcquired: convenience LockOption for log-and-skip pattern - Docs: error semantics decision table, interceptor level guidance, handler-as-metadata reconciler example, Locker compatibility note
|
Caution Review failedPull request was closed or merged during review Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant PeriodicWorker as Periodic Worker
participant Handler as CycleHandler
participant Framework
rect rgba(100,150,200,0.5)
Note over PeriodicWorker,Handler: ErrSkipTick flow (skip current tick)
Client->>PeriodicWorker: tick
PeriodicWorker->>Handler: RunCycle(ctx, info)
Handler-->>PeriodicWorker: return ErrSkipTick
PeriodicWorker->>Framework: errors.Is(err, ErrSkipTick)
Framework-->>PeriodicWorker: suppress failure, reset timer
PeriodicWorker->>Client: wait for next tick
end
rect rgba(200,100,100,0.5)
Note over PeriodicWorker,Framework: Other error flow (fail/restart)
Client->>PeriodicWorker: tick
PeriodicWorker->>Handler: RunCycle(ctx, info)
Handler-->>PeriodicWorker: return error
PeriodicWorker->>Framework: non-ErrSkipTick error
Framework->>Framework: record WorkerFailed -> backoff / restart (if enabled)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR enhances the workers framework based on production reconciler feedback by adding lazy “zombie” child cleanup, introducing ErrSkipTick for periodic handlers, exposing handler/config introspection APIs, and expanding documentation/examples around these patterns.
Changes:
- Add lazy pruning for permanently-stopped child workers via a per-child
donechannel and a newWorkerInfo.GetChildCount(). - Introduce
ErrSkipTickto allow periodic handlers to skip a tick without triggering restart, plus tests around timer-loop behavior and Run-level metrics. - Add
WorkerInfo.GetHandler()and Worker config getters (GetInterval,GetRestartOnFail,GetJitterPercent,GetInitialDelay), plus middleware convenienceWithSkipOnNotAcquiredand expanded docs/examples.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| worker.go | Adds handler exposure on WorkerInfo, child “done” tracking + lazy pruning, GetChildCount, and Worker config getters. |
| run.go | Introduces ErrSkipTick, threads child done channel out of addWorkerToSupervisor, and adjusts failure-metric accounting. |
| helpers.go | Updates periodic timer loop to treat ErrSkipTick as “continue ticking.” |
| worker_test.go | Adds tests for GetHandler, GetChildCount, zombie cleanup, and Worker config getters. |
| run_test.go | Adds integration tests for ErrSkipTick behavior in Run() and failure metrics. |
| helpers_test.go | Adds unit tests verifying ErrSkipTick (and wrapped) does not terminate the timer loop. |
| middleware/lock.go | Adds WithSkipOnNotAcquired convenience option and documents the error-handling footgun. |
| middleware/lock_test.go | Adds tests for WithSkipOnNotAcquired behavior (including nil log function). |
| middleware/README.md | Regenerates docs to include WithSkipOnNotAcquired and updated line references. |
| example_test.go | Adds reconciler example demonstrating handler-as-metadata change detection. |
| README.md | Expands docs: periodic error semantics table, interceptor guidance, and reconciler example; regenerates API docs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
run.go (1)
146-156:⚠️ Potential issue | 🟡 MinorClose the
donechannel fromclosingSupervisor.Serveto handle the suture failure threshold case.When
restartOnFail=trueand the worker exceeds suture's failure threshold, suture stops restarting but returns a generic error (notErrDoNotRestart). The final call toworkerRunService.Servetakes thereturn errbranch (line 156), leavingdoneunclosed. The child entry then remains in the parent'schildrenmap indefinitely becausepruneStoppedLockedonly detects and removes entries with a closeddonechannel.Fix: pass
donetoclosingSupervisorand close it there withsync.Onceprotection, ensuring cleanup on any supervisor exit path—including when suture gives up:type closingSupervisor struct { *suture.Supervisor closeFn func() done chan struct{} doneOnce sync.Once } func (cs *closingSupervisor) Serve(ctx context.Context) error { err := cs.Supervisor.Serve(ctx) cs.closeFn() cs.doneOnce.Do(func() { if cs.done != nil { close(cs.done) } }) return err }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@run.go` around lines 146 - 156, The done channel isn't closed when suture stops restarting without ErrDoNotRestart; update closingSupervisor to accept a done channel and a sync.Once (e.g., add fields done chan struct{} and doneOnce sync.Once to closingSupervisor), then in closingSupervisor.Serve call cs.Supervisor.Serve(ctx), invoke cs.closeFn(), and use cs.doneOnce.Do to close cs.done if non-nil before returning the error; also ensure the caller constructs closingSupervisor with the worker's done channel so the child entry is always pruned on any supervisor exit path.
🧹 Nitpick comments (2)
example_test.go (1)
359-368: Nit:solverConfig.nameis unused — drop it or use it.The
namefield is populated in every config entry but never read inside the reconciler (the map key already serves as the worker name). Removing it tightens the example, or alternatively passcfg.nametoworkers.NewWorker(...)instead ofkeyto make the field meaningful.♻️ Suggested simplification
- type solverConfig struct { - version int - name string - } + type solverConfig struct { + version int + } @@ - configs := []map[string]solverConfig{ - {"a": {version: 1, name: "a"}}, - {"a": {version: 1, name: "a"}, "b": {version: 1, name: "b"}}, - {"a": {version: 2, name: "a"}, "b": {version: 1, name: "b"}}, // a gets new version - } + configs := []map[string]solverConfig{ + {"a": {version: 1}}, + {"a": {version: 1}, "b": {version: 1}}, + {"a": {version: 2}, "b": {version: 1}}, // a gets new version + }Note: this is in a Go doc example, so re-run
make docto regenerateREADME.mdafter the change.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@example_test.go` around lines 359 - 368, The solverConfig struct includes an unused field name; either remove solverConfig.name and stop populating it in the configs slice to simplify the example, or use it by passing cfg.name into workers.NewWorker(...) (instead of the map key) so the field becomes meaningful; update the configs entries accordingly and rerun make doc to regenerate the Go doc example.worker.go (1)
277-286: Optional: avoid full prune scan inGetChild.
GetChildonly needs to know whether the requested entry is alive, but it currently callspruneStoppedLockedwhich scans all children. For maps with many children, prefer a per-entry check. (Hot-path map iteration of all children on every lookup is also potentially surprising when only one name is being queried.)♻️ Suggested local check
func (info *WorkerInfo) GetChild(name string) (Worker, bool) { info.childrenMu.Lock() defer info.childrenMu.Unlock() - info.pruneStoppedLocked() - if entry, ok := info.children[name]; ok { - return *entry.worker, true - } - return Worker{}, false + entry, ok := info.children[name] + if !ok { + return Worker{}, false + } + if entry.done != nil { + select { + case <-entry.done: + delete(info.children, name) + return Worker{}, false + default: + } + } + return *entry.worker, true }If you prefer the simpler "always full prune" model for consistency with
GetChildren/GetChildCount, feel free to keep as-is.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@worker.go` around lines 277 - 286, GetChild currently locks childrenMu then calls pruneStoppedLocked which scans all children; instead, inside WorkerInfo.GetChild hold childrenMu, look up info.children[name] directly, if not found return false, otherwise inspect the found entry's stopped/finished state (the same condition used by pruneStoppedLocked) and if it's stopped remove that single entry from the map and return false, otherwise return the live worker; keep the same locking (childrenMu.Lock/Unlock) and reuse the existing entry fields and logic from pruneStoppedLocked to determine "stopped" so you avoid a full-map scan while preserving concurrency safety.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@run.go`:
- Around line 146-156: The done channel isn't closed when suture stops
restarting without ErrDoNotRestart; update closingSupervisor to accept a done
channel and a sync.Once (e.g., add fields done chan struct{} and doneOnce
sync.Once to closingSupervisor), then in closingSupervisor.Serve call
cs.Supervisor.Serve(ctx), invoke cs.closeFn(), and use cs.doneOnce.Do to close
cs.done if non-nil before returning the error; also ensure the caller constructs
closingSupervisor with the worker's done channel so the child entry is always
pruned on any supervisor exit path.
---
Nitpick comments:
In `@example_test.go`:
- Around line 359-368: The solverConfig struct includes an unused field name;
either remove solverConfig.name and stop populating it in the configs slice to
simplify the example, or use it by passing cfg.name into workers.NewWorker(...)
(instead of the map key) so the field becomes meaningful; update the configs
entries accordingly and rerun make doc to regenerate the Go doc example.
In `@worker.go`:
- Around line 277-286: GetChild currently locks childrenMu then calls
pruneStoppedLocked which scans all children; instead, inside WorkerInfo.GetChild
hold childrenMu, look up info.children[name] directly, if not found return
false, otherwise inspect the found entry's stopped/finished state (the same
condition used by pruneStoppedLocked) and if it's stopped remove that single
entry from the map and return false, otherwise return the live worker; keep the
same locking (childrenMu.Lock/Unlock) and reuse the existing entry fields and
logic from pruneStoppedLocked to determine "stopped" so you avoid a full-map
scan while preserving concurrency safety.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 15599bbe-50c8-4404-a60f-4f1bd0ee45b4
📒 Files selected for processing (11)
README.mdexample_test.gohelpers.gohelpers_test.gomiddleware/README.mdmiddleware/lock.gomiddleware/lock_test.gorun.gorun_test.goworker.goworker_test.go
Add() held the lock but didn't call pruneStoppedLocked(), so a child that permanently stopped could block re-Add until the caller happened to call GetChildren/GetChild/GetChildCount. Now Add prunes stale entries before checking for name conflicts.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
A non-periodic worker returning ErrSkipTick would silently skip the WorkerFailed metric while still restarting — a silent failure loop. Now ErrSkipTick is only suppressed from failure metrics when the worker has an interval (periodic), matching the semantics: ErrSkipTick is only meaningful for periodic workers where the timer loop handles it.
Replace the children map + done channel with suture's Services() API. closingSupervisor now implements childService interface, exposing worker name, Worker config, and token. All child enumeration (Add/Remove/GetChildren/GetChild) queries suture directly. - Remove childEntry type, children map, childrenMu, pruneStoppedLocked, removeLocked, done channel - Add childService interface with isActive() check via inner supervisor - Remove GetChildCount (no longer cheaper than len(GetChildren())) - Fix GetChildCount doc nit (len([WorkerInfo.GetChildren]) → valid Go) Zombie children are eliminated by design — suture only returns active services, so stopped children are never visible to the parent.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
run.go:141
workerRunService.Serveno longer removes child services fromws.childSupwhen the worker attempt exits. SinceWorkerInfo.suppoints at the same supervisor that hostsworkerRunService, any children added viainfo.Add()will keep running across restarts and can even outlive a permanently-stopped worker, violating the documented scoped lifecycle (“when this worker stops, all its children stop too”). Consider reintroducing per-attempt teardown by removing all child services (e.g., those implementingchildService) fromws.childSupbefore returning fromServe, so children are stopped on both restart and permanent stop paths.
info := &WorkerInfo{
name: ws.w.name,
attempt: attempt,
handler: ws.w.handler,
sup: ws.childSup,
cfg: ws.cfg,
active: ws.active,
metrics: m,
}
err := ws.runFn(ctx, info)
if err != nil && ctx.Err() == nil && !errors.Is(err, suture.ErrDoNotRestart) &&
(ws.w.interval <= 0 || !errors.Is(err, ErrSkipTick)) {
m.WorkerFailed(ws.w.name, err)
}
// Determine whether this worker is permanently stopping.
permanentStop := !ws.w.restartOnFail || err == nil || ctx.Err() != nil || errors.Is(err, suture.ErrDoNotRestart)
if permanentStop {
ws.closeFn()
return suture.ErrDoNotRestart
}
return err
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Revert the Services()-based child management back to the done channel approach. The Services() approach introduced edge cases (grandchildren fooling isActive, performance overhead) without solving a real problem — the done channel handles all permanent stop cases correctly. The done channel is closed on permanentStop in workerRunService.Serve, which covers: nil+WithRestart(false), ErrDoNotRestart, and ctx cancel. Failure threshold does NOT close done (correct — suture keeps restarting after backoff decay). The defer in Serve cleans up grandchildren before done is closed. Also fixes: - GetChildCount doc: invalid Go syntax in comment - Test message: "nil sup" → "nil supervisor" - Test comments: remove stale GetChildCount references - Add grandchildren zombie test validating the done channel approach
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
worker_test.go (1)
254-272: Consider asserting viaEventuallyto reduce flake risk.The fixed
time.Sleep(200 * time.Millisecond)will be the slowest reliable signal in the suite. If you ever see this flaking on loaded CI, swapping toassert.Eventually(t, func() bool { return len(info.GetChildren()) == 0 }, 1*time.Second, 10*time.Millisecond, ...)would let the assertion succeed as soon as pruning is observable rather than always paying the worst case.The same applies to the other zombie tests using fixed 100ms sleeps (lines 247, 281, 295, 315, 334).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@worker_test.go` around lines 254 - 272, Replace the fixed time.Sleep + assert.Empty in TestWorkerInfo_ZombieChild_WithGrandchildren with an eventual assertion that polls info.GetChildren() until it becomes empty; specifically, remove the time.Sleep(200 * time.Millisecond) and instead use assert.Eventually(t, func() bool { return len(info.GetChildren()) == 0 }, 1*time.Second, 10*time.Millisecond, "stopped child should not appear even with live grandchildren") so the test succeeds as soon as pruning is observed; apply the same pattern to the other zombie tests that use fixed sleeps (the tests around lines with 100ms sleeps) referencing their test names and calls to info.GetChildren().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@example_test.go`:
- Around line 359-403: The example has a race when replacing a child: calling
info.Remove(key) then immediately info.Add(...) can fail to replace because
Remove is not atomic and the old child's done channel may not be closed yet;
update the example to either insert a short sleep between Remove and Add (as
done in ExampleWorkerInfo_Add_replace) or add a clarifying comment explaining
the caveat so readers don't copy this pattern into production, and remove the
unused solverConfig.name field (or stop populating it) since the map key is used
as the worker name; reference symbols: info.Remove, info.Add, workers.NewWorker,
solverHandler, solverConfig.
---
Nitpick comments:
In `@worker_test.go`:
- Around line 254-272: Replace the fixed time.Sleep + assert.Empty in
TestWorkerInfo_ZombieChild_WithGrandchildren with an eventual assertion that
polls info.GetChildren() until it becomes empty; specifically, remove the
time.Sleep(200 * time.Millisecond) and instead use assert.Eventually(t, func()
bool { return len(info.GetChildren()) == 0 }, 1*time.Second,
10*time.Millisecond, "stopped child should not appear even with live
grandchildren") so the test succeeds as soon as pruning is observed; apply the
same pattern to the other zombie tests that use fixed sleeps (the tests around
lines with 100ms sleeps) referencing their test names and calls to
info.GetChildren().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 35554b15-01ad-42c4-84fc-baa8287a5da6
📒 Files selected for processing (6)
README.mdexample_test.gorun.gorun_test.goworker.goworker_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- run.go
- worker.go
Add sleep between Remove and Add in the reconciler example to avoid the race where Add sees the stale entry before the done channel closes. Remove unused solverConfig.name field (map key is used).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ines pruneStoppedLocked was calling delete(info.children, name) without removing the underlying supervisor service, leaving an orphaned closingSupervisor running indefinitely. Now calls removeLocked which does sup.Remove(token) + delete.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Periodic workers should generally use ErrSkipTick for transient failures and ErrDoNotRestart for permanent completion, rather than disabling restart entirely.
Workers create root spans (no parent) from context.Background(). With ParentBased(TraceIDRatioBased(0.2)), 80% of worker traces were silently dropped. Now Tracing() injects a sampled remote span context before creating the span, so ParentBased always samples. Also injects the OTEL trace ID into the log context as "trace" for correlation with the tracing backend.
* docs: update workers howto for ErrSkipTick, zombie cleanup, handler-as-metadata - Add ErrSkipTick to handler return values table with usage example - Document automatic zombie child cleanup in Dynamic Workers section - Add handler-as-metadata reconciler example (config change detection via GetChild().GetHandler() type assertion) - Add WithSkipOnNotAcquired convenience + footgun warning for WithOnNotAcquired error return - Update WorkerInfo methods table with GetHandler, GetChildCount - Note Locker interface compatibility for existing implementations Companion to go-coldbrew/workers#6. * fix: address review comments on workers howto - Add defer rows.Close() to pollDatabase example - Fix WithOnNotAcquired wording (callback returns error, not function) - Rewrite automatic cleanup note with timing caveat - Simplify handler reference sharing explanation - Remove GetChildCount from WorkerInfo table (removed from API) * fix: add GetChildCount back to WorkerInfo table GetChildCount is present in the API (backed by a map, genuinely cheaper than len(GetChildren()) which allocates a sorted slice). * fix: ErrSkipTick table wording and example ctx handling - Return values table: "No effect" → "Treated like return error" for ErrSkipTick in long-running workers (it's not a no-op) - pollDatabase example: check ctx.Err() before returning ErrSkipTick so context cancellation triggers clean shutdown, not a skip * docs: recommend keeping restart enabled for periodic workers * fix: consistent suture naming in automatic cleanup note * docs: document always-sampled worker spans and trace ID in logs
Summary
Addresses 7 user feedback items from production reconciler patterns, plus 2 additional generic capabilities:
GetChildren/GetChild/GetChildCount. No manual cleanup orWithRestart(true)workaround needed.ErrSkipTick— sentinel for periodic handlers to skip a tick without triggering restart. Useful for transient failures (DB timeouts, network blips).WorkerInfo.GetHandler()— exposes the current worker's handler on WorkerInfo, enabling middleware to type-assert against handler state (handler-as-metadata pattern).GetInterval(),GetRestartOnFail(),GetJitterPercent(),GetInitialDelay()for reconciler introspection viaGetChild().GetChildCount()— efficient child count without allocating a sorted slice.WithSkipOnNotAcquired— convenienceLockOptionfor the common log-and-skip pattern, with doc warning about theWithOnNotAcquirederror-return footgun.Design decisions
WithMetadataAPI —GetChild().GetHandler()+ type assertion is the Go-idiomatic approach. Users put metadata on theirCycleHandlerstruct. SeeExample_reconcilerWithChangeDetection.ErrSkipTickhandled in timer loop — falls through to timer reset instead of exiting. Defensive check inServe()excludes it from failure metrics.Test plan
TestWorkerInfo_ZombieChild_AutoCleanup— child withWithRestart(false)stops, auto-prunedTestWorkerInfo_ZombieChild_ErrDoNotRestart— child returnsErrDoNotRestart, auto-prunedTestWorkerInfo_ZombieChild_ReAdd— re-Add succeeds after zombie pruneTestEveryIntervalWithJitter_ErrSkipTick— timer continues after ErrSkipTickTestEveryIntervalWithJitter_ErrSkipTick_Wrapped— wrapped ErrSkipTick also caughtTestRun_ErrSkipTick_PeriodicWorker— integration test via Run()TestRun_ErrSkipTick_NotCountedAsFailure— ErrSkipTick excluded from failure metricsTestWorkerInfo_GetHandler/_Nil— handler exposed on WorkerInfoTestWorker_ConfigGetters/_Defaults— all getters return correct valuesTestWorkerInfo_GetChildCount/_Nil— count without allocationTestDistributedLock_WithSkipOnNotAcquired— log function called, nil errorExample_reconcilerWithChangeDetection— handler-as-metadata pattern works end-to-endmake test— 90.7% coverage (workers), 97.1% (middleware)make lint— 0 issues, no vulnerabilitiesmake doc— READMEs regeneratedSummary by CodeRabbit
New Features
Documentation
Tests