
feat: address user feedback — zombie cleanup, ErrSkipTick, config getters #6

Merged
ankurs merged 9 commits into main from feat/workers-feedback-v2
Apr 26, 2026


ankurs commented Apr 25, 2026

Summary

Addresses 7 user feedback items from production reconciler patterns, plus 2 additional generic capabilities:

  • Zombie children auto-cleanup — stopped children are lazily pruned via a done channel on GetChildren/GetChild/GetChildCount. No manual cleanup or WithRestart(true) workaround needed.
  • ErrSkipTick — sentinel for periodic handlers to skip a tick without triggering restart. Useful for transient failures (DB timeouts, network blips).
  • WorkerInfo.GetHandler() — exposes the current worker's handler on WorkerInfo, enabling middleware to type-assert against handler state (handler-as-metadata pattern).
  • Worker config getters: GetInterval(), GetRestartOnFail(), GetJitterPercent(), GetInitialDelay() for reconciler introspection via GetChild().
  • GetChildCount() — efficient child count without allocating a sorted slice.
  • WithSkipOnNotAcquired — convenience LockOption for the common log-and-skip pattern, with doc warning about the WithOnNotAcquired error-return footgun.
  • Documentation — error semantics decision table, run-level vs worker-level interceptor guidance, handler-as-metadata reconciler example, Locker interface compatibility note.
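The difference between the log-and-skip option and the error-return footgun can be sketched with local stand-in types. Everything below (lockConfig, withOnNotAcquired, withSkipOnNotAcquired, runLocked) is hypothetical and only illustrates the behavior described in this PR; it is not the middleware package's actual API, and the silent skip when no callback is configured is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// lockConfig and the option helpers are hypothetical stand-ins,
// not the real middleware package types.
type lockConfig struct {
	onNotAcquired func(name string) error
}

type lockOption func(*lockConfig)

// withOnNotAcquired installs a callback whose returned error is propagated
// to the caller, so a non-nil error looks like a worker failure.
func withOnNotAcquired(fn func(name string) error) lockOption {
	return func(c *lockConfig) { c.onNotAcquired = fn }
}

// withSkipOnNotAcquired logs (when logFn is non-nil) and always returns nil,
// so the cycle is skipped without signalling a failure.
func withSkipOnNotAcquired(logFn func(name string)) lockOption {
	return withOnNotAcquired(func(name string) error {
		if logFn != nil {
			logFn(name)
		}
		return nil
	})
}

// runLocked runs cycle only when acquired is true; otherwise it defers to
// the configured callback (assumption: skip silently when none is set).
func runLocked(name string, acquired bool, cycle func() error, opts ...lockOption) error {
	cfg := &lockConfig{}
	for _, o := range opts {
		o(cfg)
	}
	if !acquired {
		if cfg.onNotAcquired != nil {
			return cfg.onNotAcquired(name)
		}
		return nil
	}
	return cycle()
}

func main() {
	cycle := func() error { return nil }

	// Footgun: an error returned from the callback surfaces as a failure.
	err := runLocked("sync", false, cycle, withOnNotAcquired(func(string) error {
		return errors.New("lock held elsewhere")
	}))
	fmt.Println("error-return callback:", err)

	// Log-and-skip: the cycle is skipped and the result is nil.
	err = runLocked("sync", false, cycle, withSkipOnNotAcquired(func(name string) {
		fmt.Println("lock not acquired, skipping:", name)
	}))
	fmt.Println("skip option:", err)
}
```

In this sketch the skip option is just the error-returning option with a callback that cannot fail, which is why it avoids the restart path.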

Design decisions

  • No WithMetadata API: GetChild().GetHandler() + type assertion is the Go-idiomatic approach. Users put metadata on their CycleHandler struct. See Example_reconcilerWithChangeDetection.
  • Lazy pruning over callbacks — zombie cleanup happens at read time via non-blocking channel select, not via goroutines or callback threading. Zero-cost when children are alive.
  • ErrSkipTick handled in timer loop — falls through to timer reset instead of exiting. Defensive check in Serve() excludes it from failure metrics.
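A minimal sketch of the handler-as-metadata idea from the first design decision, using local stand-in types. The names cycleHandler, worker, solverHandler, and needsReplace are hypothetical; the real RunCycle also receives a WorkerInfo argument, which is omitted here to keep the example self-contained.

```go
package main

import (
	"context"
	"fmt"
)

// cycleHandler and worker are hypothetical stand-ins for the library's
// CycleHandler interface and Worker type.
type cycleHandler interface {
	RunCycle(ctx context.Context) error
}

type worker struct {
	name    string
	handler cycleHandler
}

// GetHandler exposes the handler so callers can type-assert against it.
func (w worker) GetHandler() cycleHandler { return w.handler }

// solverHandler carries its own metadata (here, a config version).
type solverHandler struct {
	version int
}

func (h *solverHandler) RunCycle(ctx context.Context) error { return nil }

// needsReplace reports whether the running child should be replaced because
// its handler carries a different version than the desired config.
func needsReplace(child worker, wantVersion int) bool {
	h, ok := child.GetHandler().(*solverHandler)
	if !ok {
		return true // unknown or missing handler: treat as changed
	}
	return h.version != wantVersion
}

func main() {
	child := worker{name: "a", handler: &solverHandler{version: 1}}
	fmt.Println(needsReplace(child, 1)) // false
	fmt.Println(needsReplace(child, 2)) // true
}
```

Because the metadata lives on the handler struct, no separate metadata map or option is needed; the reconciler only compares what the handler already knows.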

Test plan

  • TestWorkerInfo_ZombieChild_AutoCleanup — child with WithRestart(false) stops, auto-pruned
  • TestWorkerInfo_ZombieChild_ErrDoNotRestart — child returns ErrDoNotRestart, auto-pruned
  • TestWorkerInfo_ZombieChild_ReAdd — re-Add succeeds after zombie prune
  • TestEveryIntervalWithJitter_ErrSkipTick — timer continues after ErrSkipTick
  • TestEveryIntervalWithJitter_ErrSkipTick_Wrapped — wrapped ErrSkipTick also caught
  • TestRun_ErrSkipTick_PeriodicWorker — integration test via Run()
  • TestRun_ErrSkipTick_NotCountedAsFailure — ErrSkipTick excluded from failure metrics
  • TestWorkerInfo_GetHandler / _Nil — handler exposed on WorkerInfo
  • TestWorker_ConfigGetters / _Defaults — all getters return correct values
  • TestWorkerInfo_GetChildCount / _Nil — count without allocation
  • TestDistributedLock_WithSkipOnNotAcquired — log function called, nil error
  • Example_reconcilerWithChangeDetection — handler-as-metadata pattern works end-to-end
  • make test — 90.7% coverage (workers), 97.1% (middleware)
  • make lint — 0 issues, no vulnerabilities
  • make doc — READMEs regenerated

Summary by CodeRabbit

  • New Features

    • ErrSkipTick sentinel to skip a periodic tick without counting as a failure
    • Lock option to log-and-skip when a lock is held (WithSkipOnNotAcquired)
    • Worker introspection getters (interval, delay, jitter, restart flag), child count/handler access, and test-handler option
    • Reconciler example showing config-driven child replacement
  • Documentation

    • Clarified periodic-handler return semantics, interceptor scoping/inheritance, lock behavior, and tracing sampling/log injection
  • Tests

    • Added coverage for skip-tick, lock skipping, worker getters, child pruning, tracing sampling, and reconciliation example

…ters, docs

- Zombie children: lazily prune stopped children via done channel on
  GetChildren/GetChild/GetChildCount — no manual cleanup needed
- ErrSkipTick: sentinel for periodic handlers to skip a tick without
  triggering restart (transient failures like DB timeouts)
- WorkerInfo.GetHandler(): expose handler for middleware type assertion,
  enabling the handler-as-metadata pattern
- Worker config getters: GetInterval, GetRestartOnFail, GetJitterPercent,
  GetInitialDelay for reconciler introspection
- GetChildCount(): efficient child count without allocating a sorted slice
- WithSkipOnNotAcquired: convenience LockOption for log-and-skip pattern
- Docs: error semantics decision table, interceptor level guidance,
  handler-as-metadata reconciler example, Locker compatibility note
Copilot AI review requested due to automatic review settings April 25, 2026 14:35

coderabbitai Bot commented Apr 25, 2026

Caution

Review failed

Pull request was closed or merged during review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Adds ErrSkipTick to skip a single periodic tick without triggering failure accounting, exposes runtime getters and handler introspection on Worker/WorkerInfo, implements lazy pruning of permanently stopped child workers, adds WithSkipOnNotAcquired lock-skip option, updates run/service bookkeeping, and includes a reconciler example using handler-as-metadata.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Periodic tick helpers & tests | helpers.go, helpers_test.go | everyIntervalWithJitter treats ErrSkipTick (via errors.Is) as non-fatal and continues ticking; tests added for direct and wrapped ErrSkipTick. |
| Run/service & error sentinel | run.go, run_test.go | Add exported ErrSkipTick; propagate handler into WorkerInfo; add per-worker done channel and supervisor API tweaks; tests for ErrSkipTick behavior (periodic vs non-periodic). |
| Worker API, child lifecycle & tests | worker.go, worker_test.go | Add Worker getters (GetInterval, GetInitialDelay, GetJitterPercent, GetRestartOnFail), WorkerInfo.GetHandler(), WithTestHandler, GetChildCount(); track per-child done channels, lazy pruning (pruneStoppedLocked()), and tests for pruning/zombie children and getters. |
| Distributed lock middleware & tests | middleware/lock.go, middleware/lock_test.go, middleware/README.md | Introduce WithSkipOnNotAcquired(logFn) LockOption that logs (optional) and returns nil to skip the cycle without restart; clarify WithOnNotAcquired error propagation and Locker direct-satisfaction; tests for skip & logging behavior. |
| Tracing middleware & tests | middleware/tracing.go, middleware/tracing_test.go | Add ensureSampled(ctx) to force sampling for worker spans and inject TraceID into log context; tests added for sampling behavior and uniqueness. |
| Docs, example & anchors | README.md, middleware/README.md, example_test.go | Document periodic-handler semantics, interceptor scoping/inheritance, new exported APIs and ErrSkipTick; add reconciler example Example_reconcilerWithChangeDetection; update anchors and README entries. |
| Misc runtime helpers | run.go (additional changes) | Run/service bookkeeping: return/track per-worker completion channel; minor supervisor API adjustments; expose worker restart accounting behavior change for ErrSkipTick. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant PeriodicWorker as Periodic Worker
    participant Handler as CycleHandler
    participant Framework

    rect rgba(100,150,200,0.5)
    Note over PeriodicWorker,Handler: ErrSkipTick flow (skip current tick)
    Client->>PeriodicWorker: tick
    PeriodicWorker->>Handler: RunCycle(ctx, info)
    Handler-->>PeriodicWorker: return ErrSkipTick
    PeriodicWorker->>Framework: errors.Is(err, ErrSkipTick)
    Framework-->>PeriodicWorker: suppress failure, reset timer
    PeriodicWorker->>Client: wait for next tick
    end

    rect rgba(200,100,100,0.5)
    Note over PeriodicWorker,Framework: Other error flow (fail/restart)
    Client->>PeriodicWorker: tick
    PeriodicWorker->>Handler: RunCycle(ctx, info)
    Handler-->>PeriodicWorker: return error
    PeriodicWorker->>Framework: non-ErrSkipTick error
    Framework->>Framework: record WorkerFailed -> backoff / restart (if enabled)
    end
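The ErrSkipTick flow in the diagram above can be sketched as a plain timer loop. This is a simplified stand-in, not the library's everyIntervalWithJitter: jitter and initial delay are left out, and errSkipTick, runEvery, and countTicks are hypothetical names local to the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errSkipTick is a local stand-in for the exported ErrSkipTick sentinel.
var errSkipTick = errors.New("skip tick")

// runEvery calls fn once per interval until ctx is cancelled or fn returns
// an error other than errSkipTick (checked with errors.Is, so wrapped
// sentinels are also caught).
func runEvery(ctx context.Context, interval time.Duration, fn func(context.Context) error) error {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-timer.C:
			if err := fn(ctx); err != nil && !errors.Is(err, errSkipTick) {
				return err // real failure: leave the loop
			}
			// Success or skipped tick: fall through to the timer reset.
			timer.Reset(interval)
		}
	}
}

// countTicks drives the loop: tick 2 returns a wrapped skip sentinel and
// tick 4 returns a real error that ends the loop.
func countTicks() (int, error) {
	errFatal := errors.New("fatal")
	ticks := 0
	err := runEvery(context.Background(), time.Millisecond, func(context.Context) error {
		ticks++
		switch ticks {
		case 2:
			return fmt.Errorf("db timeout: %w", errSkipTick)
		case 4:
			return errFatal
		}
		return nil
	})
	return ticks, err
}

func main() {
	ticks, err := countTicks()
	fmt.Println(ticks, err) // 4 fatal
}
```

The loop keeps running past the skipped tick and only exits on the fourth call, which is the behavior the two diagram branches describe.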

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop and I check each worker's worth,
I skip one tick softly, let the meadow breathe,
I peek at handlers, prune the sleepers thin,
Locks may whisper "skip" and leave no din,
The workers hum along — hop, hop, reprieve.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.08% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the three main features: zombie cleanup (stale child cleanup), ErrSkipTick (skip tick control), and config getters (Worker introspection methods). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



Copilot AI left a comment

Pull request overview

This PR enhances the workers framework based on production reconciler feedback by adding lazy “zombie” child cleanup, introducing ErrSkipTick for periodic handlers, exposing handler/config introspection APIs, and expanding documentation/examples around these patterns.

Changes:

  • Add lazy pruning for permanently-stopped child workers via a per-child done channel and a new WorkerInfo.GetChildCount().
  • Introduce ErrSkipTick to allow periodic handlers to skip a tick without triggering restart, plus tests around timer-loop behavior and Run-level metrics.
  • Add WorkerInfo.GetHandler() and Worker config getters (GetInterval, GetRestartOnFail, GetJitterPercent, GetInitialDelay), plus middleware convenience WithSkipOnNotAcquired and expanded docs/examples.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| worker.go | Adds handler exposure on WorkerInfo, child “done” tracking + lazy pruning, GetChildCount, and Worker config getters. |
| run.go | Introduces ErrSkipTick, threads child done channel out of addWorkerToSupervisor, and adjusts failure-metric accounting. |
| helpers.go | Updates periodic timer loop to treat ErrSkipTick as “continue ticking.” |
| worker_test.go | Adds tests for GetHandler, GetChildCount, zombie cleanup, and Worker config getters. |
| run_test.go | Adds integration tests for ErrSkipTick behavior in Run() and failure metrics. |
| helpers_test.go | Adds unit tests verifying ErrSkipTick (and wrapped) does not terminate the timer loop. |
| middleware/lock.go | Adds WithSkipOnNotAcquired convenience option and documents the error-handling footgun. |
| middleware/lock_test.go | Adds tests for WithSkipOnNotAcquired behavior (including nil log function). |
| middleware/README.md | Regenerates docs to include WithSkipOnNotAcquired and updated line references. |
| example_test.go | Adds reconciler example demonstrating handler-as-metadata change detection. |
| README.md | Expands docs: periodic error semantics table, interceptor guidance, and reconciler example; regenerates API docs. |


Comment thread run.go Outdated
Comment thread worker.go

coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
run.go (1)

146-156: ⚠️ Potential issue | 🟡 Minor

Close the done channel from closingSupervisor.Serve to handle the suture failure threshold case.

When restartOnFail=true and the worker exceeds suture's failure threshold, suture stops restarting but returns a generic error (not ErrDoNotRestart). The final call to workerRunService.Serve takes the return err branch (line 156), leaving done unclosed. The child entry then remains in the parent's children map indefinitely because pruneStoppedLocked only detects and removes entries with a closed done channel.

Fix: pass done to closingSupervisor and close it there with sync.Once protection, ensuring cleanup on any supervisor exit path—including when suture gives up:

type closingSupervisor struct {
	*suture.Supervisor
	closeFn  func()
	done     chan struct{}
	doneOnce sync.Once
}

func (cs *closingSupervisor) Serve(ctx context.Context) error {
	err := cs.Supervisor.Serve(ctx)
	cs.closeFn()
	cs.doneOnce.Do(func() {
		if cs.done != nil {
			close(cs.done)
		}
	})
	return err
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run.go` around lines 146 - 156, The done channel isn't closed when suture
stops restarting without ErrDoNotRestart; update closingSupervisor to accept a
done channel and a sync.Once (e.g., add fields done chan struct{} and doneOnce
sync.Once to closingSupervisor), then in closingSupervisor.Serve call
cs.Supervisor.Serve(ctx), invoke cs.closeFn(), and use cs.doneOnce.Do to close
cs.done if non-nil before returning the error; also ensure the caller constructs
closingSupervisor with the worker's done channel so the child entry is always
pruned on any supervisor exit path.
🧹 Nitpick comments (2)
example_test.go (1)

359-368: Nit: solverConfig.name is unused — drop it or use it.

The name field is populated in every config entry but never read inside the reconciler (the map key already serves as the worker name). Removing it tightens the example, or alternatively pass cfg.name to workers.NewWorker(...) instead of key to make the field meaningful.

♻️ Suggested simplification
-		type solverConfig struct {
-			version int
-			name    string
-		}
+		type solverConfig struct {
+			version int
+		}
@@
-		configs := []map[string]solverConfig{
-			{"a": {version: 1, name: "a"}},
-			{"a": {version: 1, name: "a"}, "b": {version: 1, name: "b"}},
-			{"a": {version: 2, name: "a"}, "b": {version: 1, name: "b"}}, // a gets new version
-		}
+		configs := []map[string]solverConfig{
+			{"a": {version: 1}},
+			{"a": {version: 1}, "b": {version: 1}},
+			{"a": {version: 2}, "b": {version: 1}}, // a gets new version
+		}

Note: this is in a Go doc example, so re-run make doc to regenerate README.md after the change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@example_test.go` around lines 359 - 368, The solverConfig struct includes an
unused field name; either remove solverConfig.name and stop populating it in the
configs slice to simplify the example, or use it by passing cfg.name into
workers.NewWorker(...) (instead of the map key) so the field becomes meaningful;
update the configs entries accordingly and rerun make doc to regenerate the Go
doc example.
worker.go (1)

277-286: Optional: avoid full prune scan in GetChild.

GetChild only needs to know whether the requested entry is alive, but it currently calls pruneStoppedLocked which scans all children. For maps with many children, prefer a per-entry check. (Hot-path map iteration of all children on every lookup is also potentially surprising when only one name is being queried.)

♻️ Suggested local check
 func (info *WorkerInfo) GetChild(name string) (Worker, bool) {
 	info.childrenMu.Lock()
 	defer info.childrenMu.Unlock()
 
-	info.pruneStoppedLocked()
-	if entry, ok := info.children[name]; ok {
-		return *entry.worker, true
-	}
-	return Worker{}, false
+	entry, ok := info.children[name]
+	if !ok {
+		return Worker{}, false
+	}
+	if entry.done != nil {
+		select {
+		case <-entry.done:
+			delete(info.children, name)
+			return Worker{}, false
+		default:
+		}
+	}
+	return *entry.worker, true
 }

If you prefer the simpler "always full prune" model for consistency with GetChildren/GetChildCount, feel free to keep as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@worker.go` around lines 277 - 286, GetChild currently locks childrenMu then
calls pruneStoppedLocked which scans all children; instead, inside
WorkerInfo.GetChild hold childrenMu, look up info.children[name] directly, if
not found return false, otherwise inspect the found entry's stopped/finished
state (the same condition used by pruneStoppedLocked) and if it's stopped remove
that single entry from the map and return false, otherwise return the live
worker; keep the same locking (childrenMu.Lock/Unlock) and reuse the existing
entry fields and logic from pruneStoppedLocked to determine "stopped" so you
avoid a full-map scan while preserving concurrency safety.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 15599bbe-50c8-4404-a60f-4f1bd0ee45b4

📥 Commits

Reviewing files that changed from the base of the PR and between d124a58 and 8ff9383.

📒 Files selected for processing (11)
  • README.md
  • example_test.go
  • helpers.go
  • helpers_test.go
  • middleware/README.md
  • middleware/lock.go
  • middleware/lock_test.go
  • run.go
  • run_test.go
  • worker.go
  • worker_test.go

Add() held the lock but didn't call pruneStoppedLocked(), so a child
that permanently stopped could block re-Add until the caller happened
to call GetChildren/GetChild/GetChildCount. Now Add prunes stale
entries before checking for name conflicts.
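The prune-before-conflict-check behavior described in this commit can be sketched with a small registry. The types and method names below (childEntry, registry, add, count) are hypothetical stand-ins for the parent's child bookkeeping; returning an error on a name conflict is an assumption made for the example, and the sketch leaves out removing the pruned service from the underlying supervisor.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// childEntry is a hypothetical stand-in for the per-child bookkeeping.
type childEntry struct {
	done chan struct{} // closed when the child stops permanently
}

type registry struct {
	mu       sync.Mutex
	children map[string]*childEntry
}

func newRegistry() *registry {
	return &registry{children: make(map[string]*childEntry)}
}

// pruneStoppedLocked drops entries whose done channel is closed.
// The caller must hold r.mu. The select has a default case, so it never
// blocks on children that are still alive.
func (r *registry) pruneStoppedLocked() {
	for name, e := range r.children {
		select {
		case <-e.done:
			delete(r.children, name)
		default:
		}
	}
}

// add prunes stale entries first, then checks for a name conflict.
// It returns the new child's done channel.
func (r *registry) add(name string) (chan struct{}, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pruneStoppedLocked()
	if _, ok := r.children[name]; ok {
		return nil, errors.New("child already exists: " + name)
	}
	e := &childEntry{done: make(chan struct{})}
	r.children[name] = e
	return e.done, nil
}

// count prunes at read time and reports the number of live children.
func (r *registry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pruneStoppedLocked()
	return len(r.children)
}

func main() {
	r := newRegistry()
	done, _ := r.add("a")
	_, err := r.add("a")
	fmt.Println(err != nil) // true: "a" is still alive

	close(done) // the child stops permanently
	_, err = r.add("a")
	fmt.Println(err == nil, r.count()) // true 1
}
```

Without the prune call inside add, the second re-add would fail until some read method happened to run first, which is the bug this commit addresses.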

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Comment thread run.go
Comment thread worker.go Outdated
ankurs added 2 commits April 25, 2026 22:50
A non-periodic worker returning ErrSkipTick would silently skip the
WorkerFailed metric while still restarting — a silent failure loop.
Now ErrSkipTick is only suppressed from failure metrics when the
worker has an interval (periodic), matching the semantics: ErrSkipTick
is only meaningful for periodic workers where the timer loop handles it.

Replace the children map + done channel with suture's Services() API.
closingSupervisor now implements childService interface, exposing
worker name, Worker config, and token. All child enumeration
(Add/Remove/GetChildren/GetChild) queries suture directly.

- Remove childEntry type, children map, childrenMu, pruneStoppedLocked,
  removeLocked, done channel
- Add childService interface with isActive() check via inner supervisor
- Remove GetChildCount (no longer cheaper than len(GetChildren()))
- Fix GetChildCount doc nit (len([WorkerInfo.GetChildren]) → valid Go)

Zombie children are eliminated by design — suture only returns active
services, so stopped children are never visible to the parent.

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

run.go:141

  • workerRunService.Serve no longer removes child services from ws.childSup when the worker attempt exits. Since WorkerInfo.sup points at the same supervisor that hosts workerRunService, any children added via info.Add() will keep running across restarts and can even outlive a permanently-stopped worker, violating the documented scoped lifecycle (“when this worker stops, all its children stop too”). Consider reintroducing per-attempt teardown by removing all child services (e.g., those implementing childService) from ws.childSup before returning from Serve, so children are stopped on both restart and permanent stop paths.
	info := &WorkerInfo{
		name:    ws.w.name,
		attempt: attempt,
		handler: ws.w.handler,
		sup:     ws.childSup,
		cfg:     ws.cfg,
		active:  ws.active,
		metrics: m,
	}

	err := ws.runFn(ctx, info)

	if err != nil && ctx.Err() == nil && !errors.Is(err, suture.ErrDoNotRestart) &&
		(ws.w.interval <= 0 || !errors.Is(err, ErrSkipTick)) {
		m.WorkerFailed(ws.w.name, err)
	}

	// Determine whether this worker is permanently stopping.
	permanentStop := !ws.w.restartOnFail || err == nil || ctx.Err() != nil || errors.Is(err, suture.ErrDoNotRestart)

	if permanentStop {
		ws.closeFn()
		return suture.ErrDoNotRestart
	}
	return err
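
The failure-accounting condition in the snippet above can be isolated into a pure function to make its cases easier to read. This is a sketch under stated assumptions: errSkipTick and errDoNotRestart are local stand-ins for ErrSkipTick and suture.ErrDoNotRestart, and shouldRecordFailure is a hypothetical helper that takes the context error as a plain argument.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Local stand-ins for ErrSkipTick and suture.ErrDoNotRestart.
var (
	errSkipTick     = errors.New("skip tick")
	errDoNotRestart = errors.New("do not restart")
)

// shouldRecordFailure mirrors the condition guarding the WorkerFailed call:
// a failure is recorded only for a real error, on a live context, that is
// not a do-not-restart signal and not a skip on a periodic worker.
func shouldRecordFailure(ctxErr, err error, interval time.Duration) bool {
	if err == nil || ctxErr != nil || errors.Is(err, errDoNotRestart) {
		return false
	}
	periodic := interval > 0
	return !(periodic && errors.Is(err, errSkipTick))
}

func main() {
	fmt.Println(shouldRecordFailure(nil, errSkipTick, time.Second))        // false
	fmt.Println(shouldRecordFailure(nil, errSkipTick, 0))                  // true
	fmt.Println(shouldRecordFailure(nil, errors.New("boom"), time.Second)) // true
	fmt.Println(shouldRecordFailure(nil, errDoNotRestart, 0))              // false
}
```

The second case is the one the earlier commit fixed: a non-periodic worker returning the skip sentinel still restarts, so it is counted as a failure rather than silently looping.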


Comment thread run.go Outdated
Comment thread worker.go Outdated
Comment thread worker_test.go Outdated
Comment thread worker_test.go Outdated
Comment thread README.md Outdated
Revert the Services()-based child management back to the done channel
approach. The Services() approach introduced edge cases (grandchildren
fooling isActive, performance overhead) without solving a real problem
— the done channel handles all permanent stop cases correctly.

The done channel is closed on permanentStop in workerRunService.Serve,
which covers: nil+WithRestart(false), ErrDoNotRestart, and ctx cancel.
Failure threshold does NOT close done (correct — suture keeps
restarting after backoff decay). The defer in Serve cleans up
grandchildren before done is closed.

Also fixes:
- GetChildCount doc: invalid Go syntax in comment
- Test message: "nil sup" → "nil supervisor"
- Test comments: remove stale GetChildCount references
- Add grandchildren zombie test validating the done channel approach

coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
worker_test.go (1)

254-272: Consider asserting via Eventually to reduce flake risk.

The fixed time.Sleep(200 * time.Millisecond) will be the slowest reliable signal in the suite. If you ever see this flaking on loaded CI, swapping to assert.Eventually(t, func() bool { return len(info.GetChildren()) == 0 }, 1*time.Second, 10*time.Millisecond, ...) would let the assertion succeed as soon as pruning is observable rather than always paying the worst case.

The same applies to the other zombie tests using fixed 100ms sleeps (lines 247, 281, 295, 315, 334).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@worker_test.go` around lines 254 - 272, Replace the fixed time.Sleep +
assert.Empty in TestWorkerInfo_ZombieChild_WithGrandchildren with an eventual
assertion that polls info.GetChildren() until it becomes empty; specifically,
remove the time.Sleep(200 * time.Millisecond) and instead use
assert.Eventually(t, func() bool { return len(info.GetChildren()) == 0 },
1*time.Second, 10*time.Millisecond, "stopped child should not appear even with
live grandchildren") so the test succeeds as soon as pruning is observed; apply
the same pattern to the other zombie tests that use fixed sleeps (the tests
around lines with 100ms sleeps) referencing their test names and calls to
info.GetChildren().
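
The polling idea behind the Eventually suggestion can be sketched with the standard library alone. The eventually helper below is a simplified, hypothetical stand-in (testify's assert.Eventually evaluates the condition differently and reports through testing.T), and prunedSoon merely simulates a child being pruned after a delay.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every tick until it returns true or waitFor elapses.
// It returns as soon as the condition holds, so the common case is fast.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// prunedSoon simulates a child that disappears after delay and reports
// whether polling observed the empty state within one second.
func prunedSoon(delay time.Duration) bool {
	var children atomic.Int32
	children.Store(1)
	go func() {
		time.Sleep(delay)
		children.Store(0) // stands in for the child being pruned
	}()
	return eventually(func() bool { return children.Load() == 0 }, time.Second, 5*time.Millisecond)
}

func main() {
	fmt.Println(prunedSoon(20 * time.Millisecond)) // true
}
```

Compared with a fixed sleep, the poll pays only as much wall-clock time as the pruning actually takes, while the one-second bound still catches a genuine failure.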
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@example_test.go`:
- Around line 359-403: The example has a race when replacing a child: calling
info.Remove(key) then immediately info.Add(...) can fail to replace because
Remove is not atomic and the old child's done channel may not be closed yet;
update the example to either insert a short sleep between Remove and Add (as
done in ExampleWorkerInfo_Add_replace) or add a clarifying comment explaining
the caveat so readers don't copy this pattern into production, and remove the
unused solverConfig.name field (or stop populating it) since the map key is used
as the worker name; reference symbols: info.Remove, info.Add, workers.NewWorker,
solverHandler, solverConfig.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35554b15-01ad-42c4-84fc-baa8287a5da6

📥 Commits

Reviewing files that changed from the base of the PR and between 8ff9383 and 074c7b4.

📒 Files selected for processing (6)
  • README.md
  • example_test.go
  • run.go
  • run_test.go
  • worker.go
  • worker_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • run.go
  • worker.go

Comment thread example_test.go
Add sleep between Remove and Add in the reconciler example to avoid
the race where Add sees the stale entry before the done channel
closes. Remove unused solverConfig.name field (map key is used).

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.



Comment thread worker.go
…ines

pruneStoppedLocked was calling delete(info.children, name) without
removing the underlying supervisor service, leaving an orphaned
closingSupervisor running indefinitely. Now calls removeLocked
which does sup.Remove(token) + delete.

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.



ankurs added 2 commits April 26, 2026 20:50
Periodic workers should generally use ErrSkipTick for transient
failures and ErrDoNotRestart for permanent completion, rather than
disabling restart entirely.

Workers create root spans (no parent) from context.Background().
With ParentBased(TraceIDRatioBased(0.2)), 80% of worker traces were
silently dropped. Now Tracing() injects a sampled remote span context
before creating the span, so ParentBased always samples.

Also injects the OTEL trace ID into the log context as "trace" for
correlation with the tracing backend.
ankurs merged commit 3626332 into main Apr 26, 2026
3 of 4 checks passed
ankurs deleted the feat/workers-feedback-v2 branch April 26, 2026 14:02
ankurs added a commit to go-coldbrew/docs.coldbrew.cloud that referenced this pull request Apr 26, 2026
* docs: update workers howto for ErrSkipTick, zombie cleanup, handler-as-metadata

- Add ErrSkipTick to handler return values table with usage example
- Document automatic zombie child cleanup in Dynamic Workers section
- Add handler-as-metadata reconciler example (config change detection
  via GetChild().GetHandler() type assertion)
- Add WithSkipOnNotAcquired convenience + footgun warning for
  WithOnNotAcquired error return
- Update WorkerInfo methods table with GetHandler, GetChildCount
- Note Locker interface compatibility for existing implementations

Companion to go-coldbrew/workers#6.

* fix: address review comments on workers howto

- Add defer rows.Close() to pollDatabase example
- Fix WithOnNotAcquired wording (callback returns error, not function)
- Rewrite automatic cleanup note with timing caveat
- Simplify handler reference sharing explanation
- Remove GetChildCount from WorkerInfo table (removed from API)

* fix: add GetChildCount back to WorkerInfo table

GetChildCount is present in the API (backed by a map, genuinely
cheaper than len(GetChildren()) which allocates a sorted slice).

* fix: ErrSkipTick table wording and example ctx handling

- Return values table: "No effect" → "Treated like return error" for
  ErrSkipTick in long-running workers (it's not a no-op)
- pollDatabase example: check ctx.Err() before returning ErrSkipTick
  so context cancellation triggers clean shutdown, not a skip

* docs: recommend keeping restart enabled for periodic workers

* fix: consistent suture naming in automatic cleanup note

* docs: document always-sampled worker spans and trace ID in logs