Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
134 changes: 133 additions & 1 deletion .agents/scripts/supervisor-helper.sh
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,8 @@
# supervisor-helper.sh pr-check <task_id> Check PR CI/review status
# supervisor-helper.sh pr-merge <task_id> [--dry-run] Merge PR (squash)
# supervisor-helper.sh self-heal <task_id> Create diagnostic subtask for failed/blocked task
# supervisor-helper.sh backup [reason] Backup supervisor database (t162)
# supervisor-helper.sh restore [backup_file] Restore from backup (lists if no file) (t162)
# supervisor-helper.sh db [sql] Direct SQLite access
# supervisor-helper.sh help
#
Expand Down Expand Up @@ -449,6 +451,108 @@ db() {
sqlite3 -cmd ".timeout 5000" "$@"
}

#######################################
# Backup supervisor database before destructive operations (t162)
# Creates timestamped copy in supervisor dir. Keeps last 5 backups.
# Usage: backup_db [reason]
#######################################
backup_db() {
local reason="${1:-manual}"

if [[ ! -f "$SUPERVISOR_DB" ]]; then
log_warn "No database to backup at: $SUPERVISOR_DB"
return 1
fi

local timestamp
timestamp=$(date -u +%Y%m%dT%H%M%SZ)
local backup_file="$SUPERVISOR_DIR/supervisor-backup-${timestamp}-${reason}.db"

# Use SQLite .backup for consistency (handles WAL correctly)
if sqlite3 "$SUPERVISOR_DB" ".backup '$backup_file'" 2>/dev/null; then
log_success "Database backed up: $backup_file"
else
# Fallback to file copy if .backup fails
if cp "$SUPERVISOR_DB" "$backup_file" 2>/dev/null; then
# Also copy WAL/SHM if present
[[ -f "${SUPERVISOR_DB}-wal" ]] && cp "${SUPERVISOR_DB}-wal" "${backup_file}-wal" 2>/dev/null || true
[[ -f "${SUPERVISOR_DB}-shm" ]] && cp "${SUPERVISOR_DB}-shm" "${backup_file}-shm" 2>/dev/null || true
log_success "Database backed up (file copy): $backup_file"
else
log_error "Failed to backup database"
return 1
fi
fi

# Prune old backups: keep last 5
local backup_count
# shellcheck disable=SC2012
backup_count=$(ls -1 "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | wc -l | tr -d ' ')
if [[ "$backup_count" -gt 5 ]]; then
local to_remove
to_remove=$((backup_count - 5))
# shellcheck disable=SC2012
ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | tail -n "$to_remove" | while IFS= read -r old_backup; do
rm -f "$old_backup" "${old_backup}-wal" "${old_backup}-shm" 2>/dev/null || true
done
log_info "Pruned $to_remove old backup(s)"
fi

echo "$backup_file"
return 0
}
Comment on lines +454 to +503
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Sanitize the backup reason before using it in filenames.

reason is user-supplied and is embedded directly in the backup filename. A value containing path separators or .. can escape $SUPERVISOR_DIR, and unsafe characters can break filenames. Normalize it to a safe slug before composing the path.

✅ Suggested fix
 backup_db() {
     local reason="${1:-manual}"
+    local safe_reason
+    safe_reason=$(echo "$reason" | tr -cs 'A-Za-z0-9._-' '_' | sed -E 's/^_+|_+$//g')
+    [[ -z "$safe_reason" ]] && safe_reason="manual"
 
     if [[ ! -f "$SUPERVISOR_DB" ]]; then
         log_warn "No database to backup at: $SUPERVISOR_DB"
         return 1
     fi
@@
-    local backup_file="$SUPERVISOR_DIR/supervisor-backup-${timestamp}-${reason}.db"
+    local backup_file="$SUPERVISOR_DIR/supervisor-backup-${timestamp}-${safe_reason}.db"

As per coding guidelines, "Reliability and robustness".

🤖 Prompt for AI Agents
In @.agents/scripts/supervisor-helper.sh around lines 454 - 503, The backup_db
function embeds user-supplied reason into filenames, allowing path traversal or
unsafe characters; before composing backup_file sanitize reason (used in
variable reason and backup_file) by stripping any path separators and leading
dots, replacing non-alphanumeric characters with a safe separator (e.g. hyphen),
collapsing repeated separators, trimming to a reasonable length, and defaulting
to "manual" if empty; perform this normalization immediately after local
reason="${1:-manual}" and use the sanitized value when creating backup_file and
when pruning/cleanup so filenames cannot escape $SUPERVISOR_DIR or contain
unsafe characters.


#######################################
# Restore supervisor database from backup (t162)
# Usage: restore_db [backup_file]
# If no file specified, lists available backups
#######################################
restore_db() {
local backup_file="${1:-}"

if [[ -z "$backup_file" ]]; then
log_info "Available backups:"
# shellcheck disable=SC2012
ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | while IFS= read -r f; do
local size
size=$(du -h "$f" 2>/dev/null | cut -f1)
local task_count
task_count=$(sqlite3 "$f" "SELECT count(*) FROM tasks;" 2>/dev/null || echo "?")
echo " $f ($size, $task_count tasks)"
done
return 0
fi

if [[ ! -f "$backup_file" ]]; then
log_error "Backup file not found: $backup_file"
return 1
fi

# Verify backup is valid SQLite
if ! sqlite3 "$backup_file" "SELECT count(*) FROM tasks;" >/dev/null 2>&1; then
log_error "Backup file is not a valid supervisor database: $backup_file"
return 1
fi

# Backup current DB before overwriting
if [[ -f "$SUPERVISOR_DB" ]]; then
backup_db "pre-restore" >/dev/null 2>&1 || true
fi

cp "$backup_file" "$SUPERVISOR_DB"
[[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true
[[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || true

local task_count
task_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM tasks;")
local batch_count
batch_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM batches;")

log_success "Database restored from: $backup_file"
log_info "Tasks: $task_count | Batches: $batch_count"
return 0
}
Comment on lines +510 to +554
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

head -20 .agents/scripts/supervisor-helper.sh && \
wc -l .agents/scripts/supervisor-helper.sh

Repository: marcusquinn/aidevops

Length of output: 1554


🏁 Script executed:

sed -n '500,560p' .agents/scripts/supervisor-helper.sh

Repository: marcusquinn/aidevops

Length of output: 2086


🏁 Script executed:

# Search for backup_db and db functions to understand their behavior
grep -n "^backup_db()" .agents/scripts/supervisor-helper.sh
grep -n "^db()" .agents/scripts/supervisor-helper.sh

Repository: marcusquinn/aidevops

Length of output: 92


🏁 Script executed:

# Check if WAL/SHM handling exists elsewhere in the script
rg -n "wal|shm" .agents/scripts/supervisor-helper.sh

Repository: marcusquinn/aidevops

Length of output: 775


🏁 Script executed:

sed -n '450,460p' .agents/scripts/supervisor-helper.sh

Repository: marcusquinn/aidevops

Length of output: 399


🏁 Script executed:

sed -n '459,500p' .agents/scripts/supervisor-helper.sh

Repository: marcusquinn/aidevops

Length of output: 1767


🌐 Web query:

SQLite WAL file replay behavior when database file replaced restored

💡 Result:

In WAL mode, the on-disk “database state” is not just db.sqlite — it’s effectively the combination of:

  • db.sqlite (main database file)
  • db.sqlite-wal (write-ahead log containing committed transactions not yet written back)
  • db.sqlite-shm (wal-index/shared-memory coordination file; rebuildable)

SQLite’s docs explicitly warn that if the database file is separated from its WAL file, committed transactions can be lost and the database can become corrupted; therefore the WAL “is part of the persistent state of the database and should be kept with the database if the database is copied or moved.” [1] They also list “Overwriting a database file with another without also deleting any hot journal (including -wal)” as an action “likely to lead to corruption.” [2]

What happens if you restore/replace only db.sqlite but leave an old db.sqlite-wal?

On the next open, SQLite will see the -wal and treat it like crash-recovery state; readers consult the WAL for the latest committed pages, and SQLite may run recovery/checkpoint activity to reconcile WAL content with the database file [1]. If the leftover -wal does not actually belong to the restored db.sqlite, you’ve “mispaird” files; SQLite can then present inconsistent data, fail with errors, or end up corrupting the restored database as it tries to interpret/replay WAL content that was for a different database incarnation [1], [2].

Practical rules

  • If you restore db.sqlite, restore the matching db.sqlite-wal too (and you may restore db.sqlite-shm, but it’s not required for recovery content). [1]
  • If you intentionally want the restored db.sqlite to stand alone, then delete db.sqlite-wal (and db.sqlite-shm) before opening the restored database, so SQLite cannot try to use stale WAL state. This aligns with SQLite’s corruption guidance about not leaving a “hot” -wal behind when swapping the main db file. [2]

Backup/restore best practice (to avoid this entirely)

Before taking a backup, ensure the DB is in a quiescent/clean state (e.g., checkpoint and close connections) so there is no meaningful -wal to carry around; otherwise you must treat db.sqlite + db.sqlite-wal as an inseparable pair. [1], [2]

Sources: SQLite WAL mode documentation [1], “How To Corrupt An SQLite Database File” (hot journals / mispairing WAL) [2].

[1] https://sqlite.org/wal.html
[2] https://www.sqlite.org/howtocorrupt.html


Remove stale WAL/SHM before restoring the database.

If ${SUPERVISOR_DB}-wal and -shm from a previous database remain, SQLite will attempt to replay them against the restored file, causing corruption or data inconsistency. These must be cleared before copying the backup. Additionally, add error handling to fail fast if the copy operation fails.

Suggested fix
     # Backup current DB before overwriting
     if [[ -f "$SUPERVISOR_DB" ]]; then
         backup_db "pre-restore" >/dev/null 2>&1 || true
     fi
 
-    cp "$backup_file" "$SUPERVISOR_DB"
+    # Remove stale WAL/SHM to avoid replaying old frames
+    rm -f "${SUPERVISOR_DB}-wal" "${SUPERVISOR_DB}-shm" 2>/dev/null || true
+
+    if ! cp "$backup_file" "$SUPERVISOR_DB"; then
+        log_error "Failed to restore database from: $backup_file"
+        return 1
+    fi
     [[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true
     [[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || true

Per coding guidelines: Reliability and robustness.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
restore_db() {
local backup_file="${1:-}"
if [[ -z "$backup_file" ]]; then
log_info "Available backups:"
# shellcheck disable=SC2012
ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | while IFS= read -r f; do
local size
size=$(du -h "$f" 2>/dev/null | cut -f1)
local task_count
task_count=$(sqlite3 "$f" "SELECT count(*) FROM tasks;" 2>/dev/null || echo "?")
echo " $f ($size, $task_count tasks)"
done
return 0
fi
if [[ ! -f "$backup_file" ]]; then
log_error "Backup file not found: $backup_file"
return 1
fi
# Verify backup is valid SQLite
if ! sqlite3 "$backup_file" "SELECT count(*) FROM tasks;" >/dev/null 2>&1; then
log_error "Backup file is not a valid supervisor database: $backup_file"
return 1
fi
# Backup current DB before overwriting
if [[ -f "$SUPERVISOR_DB" ]]; then
backup_db "pre-restore" >/dev/null 2>&1 || true
fi
cp "$backup_file" "$SUPERVISOR_DB"
[[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true
[[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || true
local task_count
task_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM tasks;")
local batch_count
batch_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM batches;")
log_success "Database restored from: $backup_file"
log_info "Tasks: $task_count | Batches: $batch_count"
return 0
}
restore_db() {
local backup_file="${1:-}"
if [[ -z "$backup_file" ]]; then
log_info "Available backups:"
# shellcheck disable=SC2012
ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | while IFS= read -r f; do
local size
size=$(du -h "$f" 2>/dev/null | cut -f1)
local task_count
task_count=$(sqlite3 "$f" "SELECT count(*) FROM tasks;" 2>/dev/null || echo "?")
echo " $f ($size, $task_count tasks)"
done
return 0
fi
if [[ ! -f "$backup_file" ]]; then
log_error "Backup file not found: $backup_file"
return 1
fi
# Verify backup is valid SQLite
if ! sqlite3 "$backup_file" "SELECT count(*) FROM tasks;" >/dev/null 2>&1; then
log_error "Backup file is not a valid supervisor database: $backup_file"
return 1
fi
# Backup current DB before overwriting
if [[ -f "$SUPERVISOR_DB" ]]; then
backup_db "pre-restore" >/dev/null 2>&1 || true
fi
# Remove stale WAL/SHM to avoid replaying old frames
rm -f "${SUPERVISOR_DB}-wal" "${SUPERVISOR_DB}-shm" 2>/dev/null || true
if ! cp "$backup_file" "$SUPERVISOR_DB"; then
log_error "Failed to restore database from: $backup_file"
return 1
fi
[[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true
[[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || true
local task_count
task_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM tasks;")
local batch_count
batch_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM batches;")
log_success "Database restored from: $backup_file"
log_info "Tasks: $task_count | Batches: $batch_count"
return 0
}
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor-helper.sh around lines 510 - 554, In
restore_db(), before copying the backup into SUPERVISOR_DB, ensure any existing
stale SQLite journal files for the target are removed (delete
"${SUPERVISOR_DB}-wal" and "${SUPERVISOR_DB}-shm" if present) so SQLite won't
replay them against the restored file; then perform the cp of "$backup_file" to
"$SUPERVISOR_DB" and immediately check the command exit status, logging an error
via log_error and returning non‑zero on failure; keep the existing logic that
conditionally copies backup-file WAL/SHM after a successful cp and retain the
pre-restore backup_db call.


#######################################
# Ensure supervisor directory and DB exist
#######################################
Expand All @@ -475,6 +579,7 @@ ensure_db() {
check_sql=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "")
if [[ -n "$check_sql" ]] && ! echo "$check_sql" | grep -q 'pr_review'; then
log_info "Migrating database schema for post-PR lifecycle states (t128.8)..."
backup_db "pre-migrate-t128.8" >/dev/null 2>&1 || log_warn "Backup failed, proceeding with migration"
db "$SUPERVISOR_DB" << 'MIGRATE'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
Expand All @@ -499,7 +604,13 @@ CREATE TABLE tasks (
completed_at TEXT,
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks SELECT * FROM tasks_old;
INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url,
created_at, started_at, completed_at, updated_at)
SELECT id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url,
created_at, started_at, completed_at, updated_at
FROM tasks_old;
DROP TABLE tasks_old;
CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status);
CREATE INDEX IF NOT EXISTS idx_tasks_repo ON tasks(repo);
Expand Down Expand Up @@ -565,6 +676,7 @@ MIGRATE
check_sql_t148=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "")
if [[ -n "$check_sql_t148" ]] && ! echo "$check_sql_t148" | grep -q 'review_triage'; then
log_info "Migrating database schema for review_triage state (t148)..."
backup_db "pre-migrate-t148" >/dev/null 2>&1 || log_warn "Backup failed, proceeding with migration"
db "$SUPERVISOR_DB" << 'MIGRATE_T148'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
Expand Down Expand Up @@ -740,6 +852,22 @@ cmd_init() {
return 0
}

#######################################
# Backup supervisor database (t162)
#######################################
cmd_backup() {
local reason="${1:-manual}"
backup_db "$reason"
}

#######################################
# Restore supervisor database from backup (t162)
#######################################
cmd_restore() {
local backup_file="${1:-}"
restore_db "$backup_file"
}

#######################################
# Add a task to the supervisor
#######################################
Expand Down Expand Up @@ -7132,6 +7260,8 @@ Usage:
supervisor-helper.sh running-count [batch_id] Count active tasks
supervisor-helper.sh reset <task_id> Reset task to queued
supervisor-helper.sh cancel <task_id|batch_id> Cancel task or batch
supervisor-helper.sh backup [reason] Backup supervisor database
supervisor-helper.sh restore [backup_file] Restore from backup (lists if no file)
supervisor-helper.sh auto-pickup [--repo path] Scan TODO.md for auto-dispatch tasks
supervisor-helper.sh cron [install|uninstall|status] Manage cron-based pulse scheduling
supervisor-helper.sh watch [--repo path] Watch TODO.md for changes (fswatch)
Expand Down Expand Up @@ -7416,6 +7546,8 @@ main() {
running-count) cmd_running_count "$@" ;;
reset) cmd_reset "$@" ;;
cancel) cmd_cancel "$@" ;;
backup) cmd_backup "$@" ;;
restore) cmd_restore "$@" ;;
db) cmd_db "$@" ;;
help|--help|-h) show_usage ;;
*) log_error "Unknown command: $command"; show_usage; return 1 ;;
Expand Down
2 changes: 2 additions & 0 deletions TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,8 @@
- Notes: Problem: supervisor adds tasks to its DB but not to TODO.md, so workers running /full-loop cannot find the task description. Fix: 1) build_dispatch_cmd now passes task description inline in the prompt (` -- description`), 2) full-loop.md Step 0 has 3-tier resolution: inline desc > TODO.md > supervisor DB, 3) full-loop-helper.sh PR creation queries supervisor DB as fallback. Also adds headless worker rules (no user prompts, no TODO.md edits, graceful auth failure handling) and PR creation hardening (gh auth check, rebase before push, proper title/body).
- [ ] t160 fix: supervisor TODO.md push fails under concurrent workers, add reconcile-todo command #bugfix #supervisor ~1h (ai:45m test:15m) logged:2026-02-08 started:2026-02-08
- Notes: Root cause: update_todo_on_complete() commits TODO.md locally but git push fails when multiple concurrent workers push to main simultaneously (non-fast-forward). The function logs a warning but treats it as non-blocking, so tasks transition to deployed even though TODO.md was never pushed. Fix: add commit_and_push_todo() helper with pull-rebase retry logic (3 attempts with backoff). Add reconcile-todo command for bulk reconciliation. Wire reconcile into pulse cycle Phase 7 (runs when no workers active). Affected 12 tasks from batch-20260208.
- [ ] t162 fix: supervisor DB safety - add backup-before-migrate and explicit column migrations #bugfix #supervisor ~30m (ai:25m test:5m) logged:2026-02-08 started:2026-02-08
- Notes: Root cause: t128.8 migration uses INSERT INTO tasks SELECT * FROM tasks_old which fails on column count mismatch if migrations run out of order. No backup before destructive table rename/recreate migrations. Fix: 1) Add backup_db() helper with SQLite .backup, timestamped copies, auto-prune to 5. 2) Add backup before t128.8 and t148 destructive migrations. 3) Fix t128.8 SELECT * to use explicit column list. 4) Add backup/restore commands for manual recovery.

Check notice on line 63 in TODO.md

View check run for this annotation

Codacy Production / Codacy Static Code Analysis

TODO.md#L63

Spaces inside emphasis markers

- [x] t152 Fix `((cleaned++))` arithmetic exit code bug in setup.sh causing silent abort under `set -e` #bug #setup ~30m actual:15m (ai:15m) ref:GH#548 logged:2026-02-08 started:2026-02-08 completed:2026-02-08

Expand Down
Loading