t1206: Add dispatch deduplication guard for repeated task failures (#1835)
marcusquinn merged 2 commits into main
Conversation
(t1206) Prevents token waste when tasks fail with the same error repeatedly:

- 10-minute cooldown enforced before re-dispatch after any failure
- Tasks blocked after 2 consecutive identical failures with a diagnostic note
- Warning logged when the same error code appears twice in succession

Implementation:

- database.sh: add last_failure_at and consecutive_failure_count columns to the tasks table (schema + migration for existing DBs)
- dispatch.sh: add check_dispatch_dedup_guard(), called in cmd_dispatch() after the max_retries check; returns 1 (block) or 2 (cooldown/defer)
- dispatch.sh: add update_failure_dedup_state() to track failure timestamps and consecutive counts with error-key normalisation
- pulse.sh: call update_failure_dedup_state() in the retry handler so dedup state is updated whenever a task is marked for retry

Configurable via env vars:

- SUPERVISOR_FAILURE_COOLDOWN_SECS (default: 600)
- SUPERVISOR_MAX_CONSECUTIVE_FAILURES (default: 2)

Addresses observed pattern: t1032.1 failed twice in 2 min (14:50, 14:52); t1030 failed twice in 22 min (14:24, 14:46) with identical errors.
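The cooldown layer described in the commit message could be sketched roughly as follows. This is illustrative only: the function name `cooldown_ok` and its epoch-seconds argument are assumptions, since the real `check_dispatch_dedup_guard()` reads `last_failure_at` from the SQLite `tasks` table.

```shell
# Hypothetical sketch of the cooldown layer; the real check_dispatch_dedup_guard()
# reads last_failure_at from SQLite rather than taking an epoch argument.
cooldown_ok() {
  local last_failure_epoch="$1" now cooldown
  cooldown="${SUPERVISOR_FAILURE_COOLDOWN_SECS:-600}"   # default: 10 minutes
  now=$(date +%s)
  if [ -n "$last_failure_epoch" ] && [ $(( now - last_failure_epoch )) -lt "$cooldown" ]; then
    return 2   # still cooling down: caller defers to the next pulse
  fi
  return 0     # no recent failure: proceed with dispatch
}
```

The distinct return code 2 matters: the caller can tell "defer and try again later" apart from the hard-block case (return 1) described below.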
Walkthrough

Adds a dispatch deduplication guard (t1206): schema columns for the last failure timestamp and consecutive failure count, a migration guard to add them if missing, guard check/update/reset functions, and integrations into the dispatch and pulse flows to enforce cooldowns and block repeated identical failures.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant Dispatch as dispatch.sh
    participant Guard as check_dispatch_dedup_guard()
    participant DB as "SQLite: tasks"
    participant Pulse as pulse.sh
    participant Updater as update_failure_dedup_state()
    Client->>Dispatch: cmd_dispatch(task_id)
    Dispatch->>Guard: check_dispatch_dedup_guard(task_id)
    Guard->>DB: SELECT last_failure_at, consecutive_failure_count, last_error
    alt Cooldown active (recent failure)
        Guard-->>Dispatch: return 2 (cooldown)
        Dispatch-->>Client: defer to next pulse
    else Blocked (threshold reached)
        Guard-->>Dispatch: return 1 (blocked)
        Dispatch->>DB: mark task blocked / log
        Dispatch-->>Client: abort dispatch
    else Proceed
        Guard-->>Dispatch: return 0 (proceed)
        Dispatch->>Client: perform dispatch
    end
    Pulse->>Pulse: task execution fails
    Pulse->>Updater: update_failure_dedup_state(task_id, error_detail)
    Updater->>DB: read/compare last_error -> increment/reset count, set last_failure_at
    Updater-->>Pulse: updated
    Pulse->>Pulse: schedule retry or mark blocked based on state
```
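The "identical failure" comparison in the diagram hinges on error-key normalisation. Per the later review notes, `update_failure_dedup_state()` compares `${error_detail%%:*}` against `${current_error%%:*}`, i.e. only the text before the first colon. A minimal sketch (the helper name `error_key` is illustrative, not a function in the PR):

```shell
# Normalise an error string to its key: everything before the first ':'.
# "E_TIMEOUT: worker exceeded 300s" and "E_TIMEOUT: worker exceeded 120s"
# therefore count as the same failure for streak purposes.
error_key() {
  printf '%s\n' "${1%%:*}"
}
```

A string without a colon is returned unchanged, so free-form errors still compare by their full text.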
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Wed Feb 18 23:13:53 UTC 2026 by the AI DevOps Framework Code Review Monitoring
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1458-1464: The warning uses a hardcoded "2" instead of the
configured SUPERVISOR_MAX_CONSECUTIVE_FAILURES; update the check and message to
use a computed threshold variable (e.g.
threshold=${SUPERVISOR_MAX_CONSECUTIVE_FAILURES:-2}) and compare new_count
against that (use [[ "$new_count" -ge "$threshold" ]]) and include "$threshold"
in the log_warn text so the message reflects the actual blocking threshold;
operate on existing symbols new_count, current_count, new_error_key,
SUPERVISOR_MAX_CONSECUTIVE_FAILURES, log_warn and task_id.
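The fix this comment asks for could look like the sketch below. `log_warn` is stubbed and `warn_if_threshold` is a hypothetical wrapper around the comparison, since the real dispatch.sh code operates inline on `new_count`:

```shell
log_warn() { printf 'WARN: %s\n' "$1" >&2; }   # stand-in for the supervisor's logger

# Compare against the configurable threshold instead of a hardcoded 2,
# and include the actual threshold in the warning text.
warn_if_threshold() {
  local new_count="$1" threshold
  threshold="${SUPERVISOR_MAX_CONSECUTIVE_FAILURES:-2}"
  if [ "$new_count" -ge "$threshold" ]; then
    log_warn "consecutive failure count $new_count reached blocking threshold $threshold"
    return 1
  fi
  return 0
}
```

With `SUPERVISOR_MAX_CONSECUTIVE_FAILURES=3`, a count of 2 no longer triggers the warning, which is exactly the mismatch the hardcoded literal caused.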
- Around line 1442-1468: The dedup logic is currently reading the updated error
because cmd_transition updates the tasks.error column before
update_failure_dedup_state runs; change the call order so
update_failure_dedup_state is invoked before cmd_transition, ensuring
update_failure_dedup_state reads the previous error value (so its comparison of
current_error_key vs new_error_key works correctly) and then call cmd_transition
to persist the new error; update any related comments to reflect the new
sequencing involving the functions update_failure_dedup_state and
cmd_transition.
- Around line 1470-1478: Reset the dedup state when a task is successfully
completed or after a successful dispatch: clear last_failure_at (set to NULL)
and reset consecutive_failure_count to 0. Locate the code path that transitions
tasks to the "complete" state (or the successful dispatch/ack block used when
re-queuing, e.g., the quality gate escalation around the earlier dispatch/queue
logic) and add an UPDATE using the same db function/SUPERVISOR_DB that sets
last_failure_at = NULL and consecutive_failure_count = 0 for the given task id
(replace the existing failure-update logic that writes last_failure_at and
new_count). This ensures the cooldown guard that reads last_failure_at and
consecutive_failure_count (the cooldown check around lines 1391–1420) does not
defer tasks based on a prior failure once the task has completed successfully.
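The reset this comment proposes can be sketched as a statement builder. `build_reset_sql` is a hypothetical helper that only constructs the SQL; the real `reset_failure_dedup_state()` would run it via the project's `db` helper with `sql_escape` applied to the task id first:

```shell
# Build the UPDATE that clears dedup state for one task, so a completed or
# re-queued task is not deferred by a stale cooldown. In dispatch.sh the
# task id would pass through sql_escape before being embedded.
build_reset_sql() {
  printf "UPDATE tasks SET last_failure_at = NULL, consecutive_failure_count = 0 WHERE id = '%s';" "$1"
}
```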
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1082-1088: Move the call to update_failure_dedup_state "$tid"
"$outcome_detail" so it executes before cmd_transition "$tid" "retrying" --error
"$outcome_detail"; the dedup logic in update_failure_dedup_state expects to read
the previous DB error (via COALESCE(error, '')) to compare with the
new_error_key, but cmd_transition writes the new error first which causes the
comparison to always match and never reset the streak; reorder the three lines
so cleanup_worker_processes "$tid" remains where it is, call
update_failure_dedup_state "$tid" "$outcome_detail" (redirecting stderr to
$SUPERVISOR_LOG || true) next, and only then call cmd_transition "$tid"
"retrying" --error "$outcome_detail" to ensure the function sees the prior error
state.
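The reordering this comment describes can be made concrete with stubs. The three function names are real pulse.sh symbols per the review, but the bodies here only record call order; `retry_handler` itself is an illustrative wrapper, not a function in the PR:

```shell
calls=""   # records the order in which the stubbed functions run
cleanup_worker_processes()   { calls="$calls cleanup"; }
update_failure_dedup_state() { calls="$calls dedup"; }
cmd_transition()             { calls="$calls transition"; }

# Retry handling with the corrected ordering: dedup state is updated while
# the previous error is still in the DB, and only then does the transition
# write the new error.
retry_handler() {
  local tid="$1" outcome_detail="$2"
  cleanup_worker_processes "$tid"
  update_failure_dedup_state "$tid" "$outcome_detail" || true
  cmd_transition "$tid" "retrying" --error "$outcome_detail"
}
```

If `cmd_transition` ran first, the dedup comparison would always see its own just-written error and the streak would never reset, which is the bug being fixed.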
Auto-dismissed: bot review does not block autonomous pipeline
- Use configurable SUPERVISOR_MAX_CONSECUTIVE_FAILURES threshold in the warning log instead of a hardcoded 2 (dispatch.sh)
- Reorder update_failure_dedup_state before cmd_transition in the retry handler so the function reads the previous DB error for accurate streak comparison (pulse.sh)
- Add reset_failure_dedup_state() to clear last_failure_at and consecutive_failure_count on task completion, preventing stale cooldown deferral on re-queued tasks (dispatch.sh)
- Call reset_failure_dedup_state in the complete handler (pulse.sh)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1491-1504: The reset_failure_dedup_state function currently clears
last_failure_at and consecutive_failure_count for a given task ID but may be
susceptible to SQL injection or quoting issues and silently swallows DB errors;
ensure you use the existing sql_escape("$task_id") output safely when
constructing the UPDATE and avoid embedding unescaped variables elsewhere, keep
the DB call via db "$SUPERVISOR_DB" and reference the same column names
(last_failure_at, consecutive_failure_count) and table (tasks), but also
propagate/log any DB errors instead of discarding them (remove the trailing
2>/dev/null || true or replace with proper error handling) so failures to reset
are visible for operators.
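One way to surface DB failures as this comment asks, instead of discarding them with `2>/dev/null || true`, is a small wrapper. `db_or_warn` is a hypothetical helper, not an existing project function:

```shell
# Run a command; on failure, log its combined output instead of silently
# discarding it, so operators can see when a dedup reset did not happen.
db_or_warn() {
  local out
  if ! out=$("$@" 2>&1); then
    printf 'WARN: dedup reset failed: %s\n' "$out" >&2
    return 1
  fi
  [ -n "$out" ] && printf '%s\n' "$out"
  return 0
}
```

In dispatch.sh this would wrap the `db "$SUPERVISOR_DB" ...` call, replacing the trailing `2>/dev/null || true`.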
- Around line 1436-1482: The update_failure_dedup_state function correctly
normalizes error keys using "${error_detail%%:*}" and "${current_error%%:*}",
properly guards empty keys with -n "$current_error_key", and uses the
configurable SUPERVISOR_MAX_CONSECUTIVE_FAILURES (max_consecutive) for the
warning threshold; no code changes required—leave update_failure_dedup_state
as-is.



Adds a dispatch deduplication guard to prevent the supervisor from re-dispatching tasks that fail with the same error in a short window.
Problem
Worker outcomes showed repeated identical failures wasting tokens: t1032.1 failed twice in 2 minutes (14:50, 14:52) and t1030 failed twice in 22 minutes (14:24, 14:46) with identical errors.
Solution
Three-layer guard implemented in the supervisor dispatch pipeline:

1. 10-minute cooldown enforced before re-dispatch after any failure (configurable via `SUPERVISOR_FAILURE_COOLDOWN_SECS`)
2. After 2 consecutive identical failures the task is moved to `blocked` status with a diagnostic note requiring manual intervention (configurable via `SUPERVISOR_MAX_CONSECUTIVE_FAILURES`)
3. Warning logged when the same error code appears twice in succession

Changes
- database.sh: Add `last_failure_at` (TEXT) and `consecutive_failure_count` (INTEGER) columns to the `tasks` table, both in the `init_db()` schema and as a migration for existing DBs
- dispatch.sh: Add `check_dispatch_dedup_guard()`, called in `cmd_dispatch()` after the max_retries check; returns 1 (block task) or 2 (defer to next pulse)
- dispatch.sh: Add `update_failure_dedup_state()` to track failure timestamps and consecutive counts with error-key normalisation (strips detail suffix for comparison)
- pulse.sh: Call `update_failure_dedup_state()` in the retry handler so dedup state is updated whenever a task is marked for retry

Verification
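For reference, the guard's two knobs could be pinned explicitly in the supervisor environment; the values shown are the defaults documented in this PR:

```shell
# Documented defaults from the PR description; override to tune the guard.
export SUPERVISOR_FAILURE_COOLDOWN_SECS=600      # seconds before re-dispatch after a failure
export SUPERVISOR_MAX_CONSECUTIVE_FAILURES=2     # identical failures before a task is blocked
```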