t1206: Add dispatch deduplication guard for repeated task failures #1835

Merged
marcusquinn merged 2 commits into main from feature/t1206
Feb 18, 2026
Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 18, 2026

Adds a dispatch deduplication guard to prevent the supervisor from re-dispatching tasks that fail with the same error in a short window.

Problem

Worker outcomes showed repeated identical failures wasting tokens:

  • t1032.1 dispatched and failed twice within 2 minutes (14:50 and 14:52) with the same error
  • t1030 failed twice within 22 minutes (14:24 and 14:46) with identical errors

Solution

Three-layer guard implemented in the supervisor dispatch pipeline:

  1. 10-minute cooldown — after any failure, the task cannot be re-dispatched for 10 minutes (configurable via SUPERVISOR_FAILURE_COOLDOWN_SECS)
  2. Consecutive failure blocking — after 2 consecutive identical failures, the task is moved to blocked status with a diagnostic note requiring manual intervention (configurable via SUPERVISOR_MAX_CONSECUTIVE_FAILURES)
  3. Warning logging — a warning is emitted whenever the same task fails with the same error code twice in succession
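These three layers can be pictured as a single return-coded check. The following is an illustrative sketch, not the shipped code: the real check_dispatch_dedup_guard() reads last_failure_at and consecutive_failure_count from SQLite, whereas here they are passed in as arguments; the return codes (1 = block, 2 = defer) and env var names follow the PR description.

```shell
# Illustrative decision logic for check_dispatch_dedup_guard().
# State is passed as arguments instead of being queried from SQLite.
# Return codes per the PR: 0 = proceed, 1 = block, 2 = defer to next pulse.
check_dispatch_dedup_guard() {
  local now_epoch="$1" last_failure_epoch="$2" consecutive_failures="$3"
  local cooldown="${SUPERVISOR_FAILURE_COOLDOWN_SECS:-600}"
  local max_consecutive="${SUPERVISOR_MAX_CONSECUTIVE_FAILURES:-2}"

  # Layer 2: block after the configured number of consecutive identical failures.
  if (( consecutive_failures >= max_consecutive )); then
    return 1
  fi
  # Layer 1: defer while the cooldown window since the last failure is open.
  if (( last_failure_epoch > 0 && now_epoch - last_failure_epoch < cooldown )); then
    return 2
  fi
  return 0
}
```

For example, a task that failed once 100 seconds ago is deferred (return 2), while a task with two identical consecutive failures is blocked (return 1) regardless of elapsed time.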

Changes

  • database.sh: Add last_failure_at (TEXT) and consecutive_failure_count (INTEGER) columns to tasks table — both in init_db() schema and as a migration for existing DBs
  • dispatch.sh: Add check_dispatch_dedup_guard() called in cmd_dispatch() after max_retries check; returns 1 (block task) or 2 (defer to next pulse)
  • dispatch.sh: Add update_failure_dedup_state() to track failure timestamps and consecutive counts with error-key normalisation (strips detail suffix for comparison)
  • pulse.sh: Call update_failure_dedup_state() in retry handler so dedup state is updated whenever a task is marked for retry
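The error-key normalisation mentioned for update_failure_dedup_state() strips everything after the first colon so that two failures differing only in detail compare as identical. A standalone sketch of that comparison (the helper name normalise_error_key is hypothetical; the `%%:*` expansion matches what the later review comments quote):

```shell
# Reduce a full error message to its key: everything before the first ':'.
# "timeout: worker 42 exceeded 300s" and "timeout: worker 7 exceeded 300s"
# both normalise to "timeout", so they count as the same failure.
normalise_error_key() {
  local error_detail="$1"
  printf '%s\n' "${error_detail%%:*}"
}
```

A message with no colon is returned unchanged, so coarse error codes still compare correctly.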

Verification

  • ShellCheck: zero violations on all 3 modified files
  • Bash syntax: clean on all 3 files
  • Supervisor globals test: 10/10 passed

Summary by CodeRabbit

  • New Features
    • Dispatch deduplication guard prevents repeated re-dispatch of tasks with identical failures; enforces a configurable cooldown (default 10 minutes) and blocks after a configurable consecutive-failure threshold.
    • Failure details are tracked to decide defer/block actions; retries update this state and successful runs reset it.
    • Failure reasons are recorded to aid troubleshooting.

…206)

Prevents token waste when tasks fail with the same error repeatedly:
- 10-minute cooldown enforced before re-dispatch after any failure
- Tasks blocked after 2 consecutive identical failures with diagnostic note
- Warning logged when same error code appears twice in succession

Implementation:
- database.sh: add last_failure_at and consecutive_failure_count columns
  to tasks table (schema + migration for existing DBs)
- dispatch.sh: add check_dispatch_dedup_guard() called in cmd_dispatch()
  after max_retries check; returns 1 (block) or 2 (cooldown/defer)
- dispatch.sh: add update_failure_dedup_state() to track failure timestamps
  and consecutive counts with error-key normalisation
- pulse.sh: call update_failure_dedup_state() in retry handler so dedup
  state is updated whenever a task is marked for retry

Configurable via env vars:
  SUPERVISOR_FAILURE_COOLDOWN_SECS (default: 600)
  SUPERVISOR_MAX_CONSECUTIVE_FAILURES (default: 2)
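Defaults like these are conventionally read with shell parameter expansion so that unset variables fall back cleanly — a minimal sketch, assuming the supervisor uses the standard `${VAR:-default}` idiom:

```shell
# Fall back to the documented defaults when the env vars are unset or empty.
cooldown_secs="${SUPERVISOR_FAILURE_COOLDOWN_SECS:-600}"
max_failures="${SUPERVISOR_MAX_CONSECUTIVE_FAILURES:-2}"
printf 'cooldown=%s max=%s\n' "$cooldown_secs" "$max_failures"
```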

Addresses observed pattern: t1032.1 failed twice in 2 min (14:50, 14:52),
t1030 failed twice in 22 min (14:24, 14:46) with identical errors.
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

Walkthrough

Adds a dispatch deduplication guard (t1206): schema columns for the last failure timestamp and consecutive failure count, a migration guard that adds them if missing, check/update/reset functions, and integrations into the dispatch and pulse flows to enforce cooldowns and blocking for repeated identical failures.

Changes

Cohort / File(s) — Summary

  • Database Schema & Migration (.agents/scripts/supervisor/database.sh): Adds last_failure_at TEXT and consecutive_failure_count INTEGER NOT NULL DEFAULT 0 to the tasks schema, plus a migration guard (t1206) that conditionally adds these columns if absent.
  • Dispatch Deduplication Logic (.agents/scripts/supervisor/dispatch.sh): Introduces check_dispatch_dedup_guard(), update_failure_dedup_state(), and reset_failure_dedup_state(); integrates the guard check into cmd_dispatch to block, defer (cooldown), or proceed based on stored failure state and error-key comparisons.
  • Pulse Failure Tracking (.agents/scripts/supervisor/pulse.sh): Calls update_failure_dedup_state() on retry paths and reset_failure_dedup_state() after successful completion to keep deduplication state in sync with task outcomes.
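The migration guard summarised above amounts to a conditional ALTER: inspect the existing columns of the tasks table, and add each new column only if absent. A dry-run sketch that prints the statements it would apply instead of touching a database (column and table names come from the PR; the real script's db helper is not reproduced here):

```shell
# Print the ALTER statements the t1206 migration would run, given the
# current column list of the tasks table (one column name per line on stdin).
print_t1206_migration() {
  local existing_columns
  existing_columns="$(cat)"
  if ! grep -qx 'last_failure_at' <<<"$existing_columns"; then
    echo "ALTER TABLE tasks ADD COLUMN last_failure_at TEXT;"
  fi
  if ! grep -qx 'consecutive_failure_count' <<<"$existing_columns"; then
    echo "ALTER TABLE tasks ADD COLUMN consecutive_failure_count INTEGER NOT NULL DEFAULT 0;"
  fi
}
```

Feeding it a column list that already contains both columns produces no output, which is what makes the migration safe to re-run.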

Sequence Diagram

sequenceDiagram
    actor Client
    participant Dispatch as dispatch.sh
    participant Guard as check_dispatch_dedup_guard()
    participant DB as "SQLite: tasks"
    participant Pulse as pulse.sh
    participant Updater as update_failure_dedup_state()

    Client->>Dispatch: cmd_dispatch(task_id)
    Dispatch->>Guard: check_dispatch_dedup_guard(task_id)
    Guard->>DB: SELECT last_failure_at, consecutive_failure_count, last_error
    alt Cooldown active (recent failure)
        Guard-->>Dispatch: return 2 (cooldown)
        Dispatch-->>Client: defer to next pulse
    else Blocked (threshold reached)
        Guard-->>Dispatch: return 1 (blocked)
        Dispatch->>DB: mark task blocked / log
        Dispatch-->>Client: abort dispatch
    else Proceed
        Guard-->>Dispatch: return 0 (proceed)
        Dispatch->>Client: perform dispatch
    end

    Pulse->>Pulse: task execution fails
    Pulse->>Updater: update_failure_dedup_state(task_id, error_detail)
    Updater->>DB: read/compare last_error -> increment/reset count, set last_failure_at
    Updater-->>Pulse: updated
    Pulse->>Pulse: schedule retry or mark blocked based on state

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🛡️ When tasks fall into the same old trap,
A tiny guard sets down a map.
Cooldowns hush the tireless try,
Counts reset when errors say goodbye.
t1206 keeps the dispatch calm — a gentle clap.

🚥 Pre-merge checks: ✅ 3 of 3 passed

  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title 't1206: Add dispatch deduplication guard for repeated task failures' clearly and specifically describes the main change: introducing a deduplication guard for handling repeated task failures.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 28 code smells

[INFO] Recent monitoring activity:
Wed Feb 18 23:13:50 UTC 2026: Code review monitoring started
Wed Feb 18 23:13:51 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 28

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 28
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 23:13:53 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

coderabbitai[bot] previously requested changes Feb 18, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1458-1464: The warning uses a hardcoded "2" instead of the
configured SUPERVISOR_MAX_CONSECUTIVE_FAILURES; update the check and message to
use a computed threshold variable (e.g.
threshold=${SUPERVISOR_MAX_CONSECUTIVE_FAILURES:-2}) and compare new_count
against that (use [[ "$new_count" -ge "$threshold" ]]) and include "$threshold"
in the log_warn text so the message reflects the actual blocking threshold;
operate on existing symbols new_count, current_count, new_error_key,
SUPERVISOR_MAX_CONSECUTIVE_FAILURES, log_warn and task_id.
- Around line 1442-1468: The dedup logic is currently reading the updated error
because cmd_transition updates the tasks.error column before
update_failure_dedup_state runs; change the call order so
update_failure_dedup_state is invoked before cmd_transition, ensuring
update_failure_dedup_state reads the previous error value (so its comparison of
current_error_key vs new_error_key works correctly) and then call cmd_transition
to persist the new error; update any related comments to reflect the new
sequencing involving the functions update_failure_dedup_state and
cmd_transition.
- Around line 1470-1478: Reset the dedup state when a task is successfully
completed or after a successful dispatch: clear last_failure_at (set to NULL)
and reset consecutive_failure_count to 0. Locate the code path that transitions
tasks to the "complete" state (or the successful dispatch/ack block used when
re-queuing, e.g., the quality gate escalation around the earlier dispatch/queue
logic) and add an UPDATE using the same db function/SUPERVISOR_DB that sets
last_failure_at = NULL and consecutive_failure_count = 0 for the given task id
(replace the existing failure-update logic that writes last_failure_at and
new_count). This ensures the cooldown guard that reads last_failure_at and
consecutive_failure_count (the cooldown check around lines 1391–1420) does not
defer tasks based on a prior failure once the task has completed successfully.

In @.agents/scripts/supervisor/pulse.sh:
- Around line 1082-1088: Move the call to update_failure_dedup_state "$tid"
"$outcome_detail" so it executes before cmd_transition "$tid" "retrying" --error
"$outcome_detail"; the dedup logic in update_failure_dedup_state expects to read
the previous DB error (via COALESCE(error, '')) to compare with the
new_error_key, but cmd_transition writes the new error first which causes the
comparison to always match and never reset the streak; reorder the three lines
so cleanup_worker_processes "$tid" remains where it is, call
update_failure_dedup_state "$tid" "$outcome_detail" (redirecting stderr to
$SUPERVISOR_LOG || true) next, and only then call cmd_transition "$tid"
"retrying" --error "$outcome_detail" to ensure the function sees the prior error
state.
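The ordering fix both pulse.sh comments ask for — update the dedup state while tasks.error still holds the previous failure, and only then let cmd_transition overwrite it — can be demonstrated with stubs that record call order. The function names come from the review comments; the stub bodies and the recording array are purely illustrative:

```shell
# Stubs record the order in which the real functions would run.
calls=()
cleanup_worker_processes()   { calls+=("cleanup"); }
update_failure_dedup_state() { calls+=("dedup"); }      # must read the OLD error
cmd_transition()             { calls+=("transition"); } # overwrites tasks.error

handle_retry() {
  local tid="$1" outcome_detail="$2"
  cleanup_worker_processes "$tid"
  # Corrected order: dedup state is updated while tasks.error still holds
  # the previous failure, then cmd_transition persists the new error.
  update_failure_dedup_state "$tid" "$outcome_detail"
  cmd_transition "$tid" "retrying" --error "$outcome_detail"
}

handle_retry t1030 "timeout: worker exceeded 300s"
printf '%s\n' "${calls[@]}"
```

With the original (broken) order, update_failure_dedup_state would compare the new error against itself and the streak would never reset.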

@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 18, 2026 23:31

Auto-dismissed: bot review does not block autonomous pipeline

- Use configurable SUPERVISOR_MAX_CONSECUTIVE_FAILURES threshold in
  warning log instead of hardcoded 2 (dispatch.sh)
- Reorder update_failure_dedup_state before cmd_transition in retry
  handler so the function reads the previous DB error for accurate
  streak comparison (pulse.sh)
- Add reset_failure_dedup_state() to clear last_failure_at and
  consecutive_failure_count on task completion, preventing stale
  cooldown deferral on re-queued tasks (dispatch.sh)
- Call reset_failure_dedup_state in complete handler (pulse.sh)
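The reset added by this follow-up can be sketched in isolation with associative arrays standing in for the tasks table columns. The real reset_failure_dedup_state() issues a SQL UPDATE via the supervisor's db helper; this in-memory stand-in only shows the state change it performs:

```shell
declare -A last_failure_at consecutive_failure_count

# In-memory stand-in for reset_failure_dedup_state(): on successful
# completion, clear the failure timestamp and zero the consecutive count
# so the cooldown guard cannot defer a re-queued task on stale state.
reset_failure_dedup_state() {
  local task_id="$1"
  last_failure_at["$task_id"]=""
  consecutive_failure_count["$task_id"]=0
}

# Simulate a task that previously failed twice, then completed.
last_failure_at[t1030]="2026-02-18T14:46:00Z"
consecutive_failure_count[t1030]=2
reset_failure_dedup_state t1030
```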
@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 28 code smells

[INFO] Recent monitoring activity:
Wed Feb 18 23:38:32 UTC 2026: Code review monitoring started
Wed Feb 18 23:38:32 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 28

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 28
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 23:38:34 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud

Contributor

@coderabbitai coderabbitai bot left a comment


🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1491-1504: The reset_failure_dedup_state function currently clears
last_failure_at and consecutive_failure_count for a given task ID but may be
susceptible to SQL injection or quoting issues and silently swallows DB errors;
ensure you use the existing sql_escape("$task_id") output safely when
constructing the UPDATE and avoid embedding unescaped variables elsewhere, keep
the DB call via db "$SUPERVISOR_DB" and reference the same column names
(last_failure_at, consecutive_failure_count) and table (tasks), but also
propagate/log any DB errors instead of discarding them (remove the trailing
2>/dev/null || true or replace with proper error handling) so failures to reset
are visible for operators.
- Around line 1436-1482: The update_failure_dedup_state function correctly
normalizes error keys using "${error_detail%%:*}" and "${current_error%%:*}",
properly guards empty keys with -n "$current_error_key", and uses the
configurable SUPERVISOR_MAX_CONSECUTIVE_FAILURES (max_consecutive) for the
warning threshold; no code changes required—leave update_failure_dedup_state
as-is.

@marcusquinn marcusquinn merged commit 7be68e9 into main Feb 18, 2026
31 checks passed
@marcusquinn marcusquinn deleted the feature/t1206 branch February 18, 2026 23:57