t1206: Add dispatch deduplication guard for repeated task failures (#1835)
marcusquinn merged 2 commits into main
Conversation
(t1206) Prevents token waste when tasks fail with the same error repeatedly:

- 10-minute cooldown enforced before re-dispatch after any failure
- Tasks blocked after 2 consecutive identical failures with a diagnostic note
- Warning logged when the same error code appears twice in succession

Implementation:

- database.sh: add last_failure_at and consecutive_failure_count columns to the tasks table (schema + migration for existing DBs)
- dispatch.sh: add check_dispatch_dedup_guard(), called in cmd_dispatch() after the max_retries check; returns 1 (block) or 2 (cooldown/defer)
- dispatch.sh: add update_failure_dedup_state() to track failure timestamps and consecutive counts with error-key normalisation
- pulse.sh: call update_failure_dedup_state() in the retry handler so dedup state is updated whenever a task is marked for retry

Configurable via env vars:

- SUPERVISOR_FAILURE_COOLDOWN_SECS (default: 600)
- SUPERVISOR_MAX_CONSECUTIVE_FAILURES (default: 2)

Addresses observed pattern: t1032.1 failed twice in 2 min (14:50, 14:52); t1030 failed twice in 22 min (14:24, 14:46) with identical errors.
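The cooldown layer described in the commit message could be sketched roughly as follows. This is illustrative only: the function name `cooldown_ok` and its epoch-seconds argument are assumptions, since the real `check_dispatch_dedup_guard()` reads `last_failure_at` from the SQLite `tasks` table.

```shell
# Hypothetical sketch of the cooldown layer; the real check_dispatch_dedup_guard()
# reads last_failure_at from SQLite rather than taking an epoch argument.
cooldown_ok() {
  local last_failure_epoch="$1" now cooldown
  cooldown="${SUPERVISOR_FAILURE_COOLDOWN_SECS:-600}"   # default: 10 minutes
  now=$(date +%s)
  if [ -n "$last_failure_epoch" ] && [ $(( now - last_failure_epoch )) -lt "$cooldown" ]; then
    return 2   # still cooling down: caller defers to the next pulse
  fi
  return 0     # no recent failure: proceed with dispatch
}
```

The distinct return code 2 matters: the caller can tell "defer and try again later" apart from the hard-block case (return 1) described below.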
Walkthrough

Adds a dispatch deduplication guard (t1206): schema columns for the last failure timestamp and consecutive failure count, a migration guard to add them if missing, guard check/update/reset functions, and integrations into the dispatch and pulse flows to enforce cooldowns and block repeated identical failures.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant Dispatch as dispatch.sh
    participant Guard as check_dispatch_dedup_guard()
    participant DB as "SQLite: tasks"
    participant Pulse as pulse.sh
    participant Updater as update_failure_dedup_state()
    Client->>Dispatch: cmd_dispatch(task_id)
    Dispatch->>Guard: check_dispatch_dedup_guard(task_id)
    Guard->>DB: SELECT last_failure_at, consecutive_failure_count, last_error
    alt Cooldown active (recent failure)
        Guard-->>Dispatch: return 2 (cooldown)
        Dispatch-->>Client: defer to next pulse
    else Blocked (threshold reached)
        Guard-->>Dispatch: return 1 (blocked)
        Dispatch->>DB: mark task blocked / log
        Dispatch-->>Client: abort dispatch
    else Proceed
        Guard-->>Dispatch: return 0 (proceed)
        Dispatch->>Client: perform dispatch
    end
    Pulse->>Pulse: task execution fails
    Pulse->>Updater: update_failure_dedup_state(task_id, error_detail)
    Updater->>DB: read/compare last_error -> increment/reset count, set last_failure_at
    Updater-->>Pulse: updated
    Pulse->>Pulse: schedule retry or mark blocked based on state
```
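The "identical failure" comparison in the diagram hinges on error-key normalisation. Per the later review notes, `update_failure_dedup_state()` compares `${error_detail%%:*}` against `${current_error%%:*}`, i.e. only the text before the first colon. A minimal sketch (the helper name `error_key` is illustrative, not a function in the PR):

```shell
# Normalise an error string to its key: everything before the first ':'.
# "E_TIMEOUT: worker exceeded 300s" and "E_TIMEOUT: worker exceeded 120s"
# therefore count as the same failure for streak purposes.
error_key() {
  printf '%s\n' "${1%%:*}"
}
```

A string without a colon is returned unchanged, so free-form errors still compare by their full text.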
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Wed Feb 18 23:13:53 UTC 2026 by the AI DevOps Framework Code Review Monitoring
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1458-1464: The warning uses a hardcoded "2" instead of the
configured SUPERVISOR_MAX_CONSECUTIVE_FAILURES; update the check and message to
use a computed threshold variable (e.g.
threshold=${SUPERVISOR_MAX_CONSECUTIVE_FAILURES:-2}) and compare new_count
against that (use [[ "$new_count" -ge "$threshold" ]]) and include "$threshold"
in the log_warn text so the message reflects the actual blocking threshold;
operate on existing symbols new_count, current_count, new_error_key,
SUPERVISOR_MAX_CONSECUTIVE_FAILURES, log_warn and task_id.
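The fix this comment asks for could look like the sketch below. `log_warn` is stubbed and `warn_if_threshold` is a hypothetical wrapper around the comparison, since the real dispatch.sh code operates inline on `new_count`:

```shell
log_warn() { printf 'WARN: %s\n' "$1" >&2; }   # stand-in for the supervisor's logger

# Compare against the configurable threshold instead of a hardcoded 2,
# and include the actual threshold in the warning text.
warn_if_threshold() {
  local new_count="$1" threshold
  threshold="${SUPERVISOR_MAX_CONSECUTIVE_FAILURES:-2}"
  if [ "$new_count" -ge "$threshold" ]; then
    log_warn "consecutive failure count $new_count reached blocking threshold $threshold"
    return 1
  fi
  return 0
}
```

With `SUPERVISOR_MAX_CONSECUTIVE_FAILURES=3`, a count of 2 no longer triggers the warning, which is exactly the mismatch the hardcoded literal caused.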
- Around line 1442-1468: The dedup logic is currently reading the updated error
because cmd_transition updates the tasks.error column before
update_failure_dedup_state runs; change the call order so
update_failure_dedup_state is invoked before cmd_transition, ensuring
update_failure_dedup_state reads the previous error value (so its comparison of
current_error_key vs new_error_key works correctly) and then call cmd_transition
to persist the new error; update any related comments to reflect the new
sequencing involving the functions update_failure_dedup_state and
cmd_transition.
- Around line 1470-1478: Reset the dedup state when a task is successfully
completed or after a successful dispatch: clear last_failure_at (set to NULL)
and reset consecutive_failure_count to 0. Locate the code path that transitions
tasks to the "complete" state (or the successful dispatch/ack block used when
re-queuing, e.g., the quality gate escalation around the earlier dispatch/queue
logic) and add an UPDATE using the same db function/SUPERVISOR_DB that sets
last_failure_at = NULL and consecutive_failure_count = 0 for the given task id
(replace the existing failure-update logic that writes last_failure_at and
new_count). This ensures the cooldown guard that reads last_failure_at and
consecutive_failure_count (the cooldown check around lines 1391–1420) does not
defer tasks based on a prior failure once the task has completed successfully.
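The reset this comment proposes can be sketched as a statement builder. `build_reset_sql` is a hypothetical helper that only constructs the SQL; the real `reset_failure_dedup_state()` would run it via the project's `db` helper with `sql_escape` applied to the task id first:

```shell
# Build the UPDATE that clears dedup state for one task, so a completed or
# re-queued task is not deferred by a stale cooldown. In dispatch.sh the
# task id would pass through sql_escape before being embedded.
build_reset_sql() {
  printf "UPDATE tasks SET last_failure_at = NULL, consecutive_failure_count = 0 WHERE id = '%s';" "$1"
}
```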
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1082-1088: Move the call to update_failure_dedup_state "$tid"
"$outcome_detail" so it executes before cmd_transition "$tid" "retrying" --error
"$outcome_detail"; the dedup logic in update_failure_dedup_state expects to read
the previous DB error (via COALESCE(error, '')) to compare with the
new_error_key, but cmd_transition writes the new error first which causes the
comparison to always match and never reset the streak; reorder the three lines
so cleanup_worker_processes "$tid" remains where it is, call
update_failure_dedup_state "$tid" "$outcome_detail" (redirecting stderr to
$SUPERVISOR_LOG || true) next, and only then call cmd_transition "$tid"
"retrying" --error "$outcome_detail" to ensure the function sees the prior error
state.
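The reordering this comment describes can be made concrete with stubs. The three function names are real pulse.sh symbols per the review, but the bodies here only record call order; `retry_handler` itself is an illustrative wrapper, not a function in the PR:

```shell
calls=""   # records the order in which the stubbed functions run
cleanup_worker_processes()   { calls="$calls cleanup"; }
update_failure_dedup_state() { calls="$calls dedup"; }
cmd_transition()             { calls="$calls transition"; }

# Retry handling with the corrected ordering: dedup state is updated while
# the previous error is still in the DB, and only then does the transition
# write the new error.
retry_handler() {
  local tid="$1" outcome_detail="$2"
  cleanup_worker_processes "$tid"
  update_failure_dedup_state "$tid" "$outcome_detail" || true
  cmd_transition "$tid" "retrying" --error "$outcome_detail"
}
```

If `cmd_transition` ran first, the dedup comparison would always see its own just-written error and the streak would never reset, which is the bug being fixed.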
Auto-dismissed: bot review does not block autonomous pipeline
- Use configurable SUPERVISOR_MAX_CONSECUTIVE_FAILURES threshold in the warning log instead of a hardcoded 2 (dispatch.sh)
- Reorder update_failure_dedup_state before cmd_transition in the retry handler so the function reads the previous DB error for accurate streak comparison (pulse.sh)
- Add reset_failure_dedup_state() to clear last_failure_at and consecutive_failure_count on task completion, preventing stale cooldown deferral on re-queued tasks (dispatch.sh)
- Call reset_failure_dedup_state in the complete handler (pulse.sh)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1491-1504: The reset_failure_dedup_state function currently clears
last_failure_at and consecutive_failure_count for a given task ID but may be
susceptible to SQL injection or quoting issues and silently swallows DB errors;
ensure you use the existing sql_escape("$task_id") output safely when
constructing the UPDATE and avoid embedding unescaped variables elsewhere, keep
the DB call via db "$SUPERVISOR_DB" and reference the same column names
(last_failure_at, consecutive_failure_count) and table (tasks), but also
propagate/log any DB errors instead of discarding them (remove the trailing
2>/dev/null || true or replace with proper error handling) so failures to reset
are visible for operators.
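One way to surface DB failures as this comment asks, instead of discarding them with `2>/dev/null || true`, is a small wrapper. `db_or_warn` is a hypothetical helper, not an existing project function:

```shell
# Run a command; on failure, log its combined output instead of silently
# discarding it, so operators can see when a dedup reset did not happen.
db_or_warn() {
  local out
  if ! out=$("$@" 2>&1); then
    printf 'WARN: dedup reset failed: %s\n' "$out" >&2
    return 1
  fi
  [ -n "$out" ] && printf '%s\n' "$out"
  return 0
}
```

In dispatch.sh this would wrap the `db "$SUPERVISOR_DB" ...` call, replacing the trailing `2>/dev/null || true`.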
- Around line 1436-1482: The update_failure_dedup_state function correctly
normalizes error keys using "${error_detail%%:*}" and "${current_error%%:*}",
properly guards empty keys with -n "$current_error_key", and uses the
configurable SUPERVISOR_MAX_CONSECUTIVE_FAILURES (max_consecutive) for the
warning threshold; no code changes required—leave update_failure_dedup_state
as-is.



Adds a dispatch deduplication guard to prevent the supervisor from re-dispatching tasks that fail with the same error in a short window.
Problem
Worker outcomes showed repeated identical failures wasting tokens: t1032.1 failed twice in 2 minutes (14:50, 14:52) and t1030 failed twice in 22 minutes (14:24, 14:46) with identical errors.
Solution
Three-layer guard implemented in the supervisor dispatch pipeline:

1. 10-minute cooldown enforced before re-dispatch after any failure (configurable via `SUPERVISOR_FAILURE_COOLDOWN_SECS`)
2. After 2 consecutive identical failures the task is moved to `blocked` status with a diagnostic note requiring manual intervention (configurable via `SUPERVISOR_MAX_CONSECUTIVE_FAILURES`)
3. Warning logged when the same error code appears twice in succession

Changes
- database.sh: Add `last_failure_at` (TEXT) and `consecutive_failure_count` (INTEGER) columns to the `tasks` table, both in the `init_db()` schema and as a migration for existing DBs
- dispatch.sh: Add `check_dispatch_dedup_guard()`, called in `cmd_dispatch()` after the max_retries check; returns 1 (block task) or 2 (defer to next pulse)
- dispatch.sh: Add `update_failure_dedup_state()` to track failure timestamps and consecutive counts with error-key normalisation (strips detail suffix for comparison)
- pulse.sh: Call `update_failure_dedup_state()` in the retry handler so dedup state is updated whenever a task is marked for retry

Verification
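For reference, the guard's two knobs could be pinned explicitly in the supervisor environment; the values shown are the defaults documented in this PR:

```shell
# Documented defaults from the PR description; override to tune the guard.
export SUPERVISOR_FAILURE_COOLDOWN_SECS=600      # seconds before re-dispatch after a failure
export SUPERVISOR_MAX_CONSECUTIVE_FAILURES=2     # identical failures before a task is blocked
```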