-
Notifications
You must be signed in to change notification settings - Fork 5
fix: supervisor DB safety - add backup-before-migrate and explicit column migrations (t162) #591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…lumn migrations (t162) - Add backup_db() helper using SQLite .backup with auto-prune to 5 copies - Add restore_db() with backup validation and pre-restore safety backup - Add backup/restore commands for manual DB recovery - Backup before destructive t128.8 and t148 migrations (table rename/recreate) - Fix t128.8 migration: replace SELECT * with explicit column list to prevent column count mismatch when migrations run out of order - Add t162 to TODO.md
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
WalkthroughThis PR introduces a backup and restore subsystem for the supervisor SQLite database in the shell automation framework. It adds backup_db and restore_db functions with CLI commands, integrates pre-migration backups into specific schema migrations, and updates help documentation to expose these capabilities. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Sun Feb 8 05:54:25 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.agents/scripts/supervisor-helper.sh:
- Around line 454-503: The backup_db function embeds user-supplied reason into
filenames, allowing path traversal or unsafe characters; before composing
backup_file sanitize reason (used in variable reason and backup_file) by
stripping any path separators and leading dots, replacing non-alphanumeric
characters with a safe separator (e.g. hyphen), collapsing repeated separators,
trimming to a reasonable length, and defaulting to "manual" if empty; perform
this normalization immediately after local reason="${1:-manual}" and use the
sanitized value when creating backup_file and when pruning/cleanup so filenames
cannot escape $SUPERVISOR_DIR or contain unsafe characters.
- Around line 510-554: In restore_db(), before copying the backup into
SUPERVISOR_DB, ensure any existing stale SQLite journal files for the target are
removed (delete "${SUPERVISOR_DB}-wal" and "${SUPERVISOR_DB}-shm" if present) so
SQLite won't replay them against the restored file; then perform the cp of
"$backup_file" to "$SUPERVISOR_DB" and immediately check the command exit
status, logging an error via log_error and returning non‑zero on failure; keep
the existing logic that conditionally copies backup-file WAL/SHM after a
successful cp and retain the pre-restore backup_db call.
| ####################################### | ||
| # Backup supervisor database before destructive operations (t162) | ||
| # Creates timestamped copy in supervisor dir. Keeps last 5 backups. | ||
| # Usage: backup_db [reason] | ||
| ####################################### | ||
| backup_db() { | ||
| local reason="${1:-manual}" | ||
|
|
||
| if [[ ! -f "$SUPERVISOR_DB" ]]; then | ||
| log_warn "No database to backup at: $SUPERVISOR_DB" | ||
| return 1 | ||
| fi | ||
|
|
||
| local timestamp | ||
| timestamp=$(date -u +%Y%m%dT%H%M%SZ) | ||
| local backup_file="$SUPERVISOR_DIR/supervisor-backup-${timestamp}-${reason}.db" | ||
|
|
||
| # Use SQLite .backup for consistency (handles WAL correctly) | ||
| if sqlite3 "$SUPERVISOR_DB" ".backup '$backup_file'" 2>/dev/null; then | ||
| log_success "Database backed up: $backup_file" | ||
| else | ||
| # Fallback to file copy if .backup fails | ||
| if cp "$SUPERVISOR_DB" "$backup_file" 2>/dev/null; then | ||
| # Also copy WAL/SHM if present | ||
| [[ -f "${SUPERVISOR_DB}-wal" ]] && cp "${SUPERVISOR_DB}-wal" "${backup_file}-wal" 2>/dev/null || true | ||
| [[ -f "${SUPERVISOR_DB}-shm" ]] && cp "${SUPERVISOR_DB}-shm" "${backup_file}-shm" 2>/dev/null || true | ||
| log_success "Database backed up (file copy): $backup_file" | ||
| else | ||
| log_error "Failed to backup database" | ||
| return 1 | ||
| fi | ||
| fi | ||
|
|
||
| # Prune old backups: keep last 5 | ||
| local backup_count | ||
| # shellcheck disable=SC2012 | ||
| backup_count=$(ls -1 "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | wc -l | tr -d ' ') | ||
| if [[ "$backup_count" -gt 5 ]]; then | ||
| local to_remove | ||
| to_remove=$((backup_count - 5)) | ||
| # shellcheck disable=SC2012 | ||
| ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | tail -n "$to_remove" | while IFS= read -r old_backup; do | ||
| rm -f "$old_backup" "${old_backup}-wal" "${old_backup}-shm" 2>/dev/null || true | ||
| done | ||
| log_info "Pruned $to_remove old backup(s)" | ||
| fi | ||
|
|
||
| echo "$backup_file" | ||
| return 0 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sanitize the backup reason before using it in filenames.
reason is user-supplied and is embedded directly in the backup filename. A value containing path separators or .. can escape $SUPERVISOR_DIR, and unsafe characters can break filenames. Normalize it to a safe slug before composing the path.
✅ Suggested fix
backup_db() {
local reason="${1:-manual}"
+ local safe_reason
+ safe_reason=$(echo "$reason" | tr -cs 'A-Za-z0-9._-' '_' | sed -E 's/^_+|_+$//g')
+ [[ -z "$safe_reason" ]] && safe_reason="manual"
if [[ ! -f "$SUPERVISOR_DB" ]]; then
log_warn "No database to backup at: $SUPERVISOR_DB"
return 1
fi
@@
- local backup_file="$SUPERVISOR_DIR/supervisor-backup-${timestamp}-${reason}.db"
+ local backup_file="$SUPERVISOR_DIR/supervisor-backup-${timestamp}-${safe_reason}.db"As per coding guidelines, "Reliability and robustness".
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor-helper.sh around lines 454 - 503, The backup_db
function embeds user-supplied reason into filenames, allowing path traversal or
unsafe characters; before composing backup_file sanitize reason (used in
variable reason and backup_file) by stripping any path separators and leading
dots, replacing non-alphanumeric characters with a safe separator (e.g. hyphen),
collapsing repeated separators, trimming to a reasonable length, and defaulting
to "manual" if empty; perform this normalization immediately after local
reason="${1:-manual}" and use the sanitized value when creating backup_file and
when pruning/cleanup so filenames cannot escape $SUPERVISOR_DIR or contain
unsafe characters.
| restore_db() { | ||
| local backup_file="${1:-}" | ||
|
|
||
| if [[ -z "$backup_file" ]]; then | ||
| log_info "Available backups:" | ||
| # shellcheck disable=SC2012 | ||
| ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | while IFS= read -r f; do | ||
| local size | ||
| size=$(du -h "$f" 2>/dev/null | cut -f1) | ||
| local task_count | ||
| task_count=$(sqlite3 "$f" "SELECT count(*) FROM tasks;" 2>/dev/null || echo "?") | ||
| echo " $f ($size, $task_count tasks)" | ||
| done | ||
| return 0 | ||
| fi | ||
|
|
||
| if [[ ! -f "$backup_file" ]]; then | ||
| log_error "Backup file not found: $backup_file" | ||
| return 1 | ||
| fi | ||
|
|
||
| # Verify backup is valid SQLite | ||
| if ! sqlite3 "$backup_file" "SELECT count(*) FROM tasks;" >/dev/null 2>&1; then | ||
| log_error "Backup file is not a valid supervisor database: $backup_file" | ||
| return 1 | ||
| fi | ||
|
|
||
| # Backup current DB before overwriting | ||
| if [[ -f "$SUPERVISOR_DB" ]]; then | ||
| backup_db "pre-restore" >/dev/null 2>&1 || true | ||
| fi | ||
|
|
||
| cp "$backup_file" "$SUPERVISOR_DB" | ||
| [[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true | ||
| [[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || true | ||
|
|
||
| local task_count | ||
| task_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM tasks;") | ||
| local batch_count | ||
| batch_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM batches;") | ||
|
|
||
| log_success "Database restored from: $backup_file" | ||
| log_info "Tasks: $task_count | Batches: $batch_count" | ||
| return 0 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
head -20 .agents/scripts/supervisor-helper.sh && \
wc -l .agents/scripts/supervisor-helper.shRepository: marcusquinn/aidevops
Length of output: 1554
🏁 Script executed:
sed -n '500,560p' .agents/scripts/supervisor-helper.shRepository: marcusquinn/aidevops
Length of output: 2086
🏁 Script executed:
# Search for backup_db and db functions to understand their behavior
grep -n "^backup_db()" .agents/scripts/supervisor-helper.sh
grep -n "^db()" .agents/scripts/supervisor-helper.shRepository: marcusquinn/aidevops
Length of output: 92
🏁 Script executed:
# Check if WAL/SHM handling exists elsewhere in the script
rg -n "wal|shm" .agents/scripts/supervisor-helper.shRepository: marcusquinn/aidevops
Length of output: 775
🏁 Script executed:
sed -n '450,460p' .agents/scripts/supervisor-helper.shRepository: marcusquinn/aidevops
Length of output: 399
🏁 Script executed:
sed -n '459,500p' .agents/scripts/supervisor-helper.shRepository: marcusquinn/aidevops
Length of output: 1767
🌐 Web query:
SQLite WAL file replay behavior when database file replaced restored
💡 Result:
In WAL mode, the on-disk “database state” is not just db.sqlite — it’s effectively the combination of:
db.sqlite(main database file)db.sqlite-wal(write-ahead log containing committed transactions not yet written back)db.sqlite-shm(wal-index/shared-memory coordination file; rebuildable)
SQLite’s docs explicitly warn that if the database file is separated from its WAL file, committed transactions can be lost and the database can become corrupted; therefore the WAL “is part of the persistent state of the database and should be kept with the database if the database is copied or moved.” [1] They also list “Overwriting a database file with another without also deleting any hot journal (including -wal)” as an action “likely to lead to corruption.” [2]
What happens if you restore/replace only db.sqlite but leave an old db.sqlite-wal?
On the next open, SQLite will see the -wal and treat it like crash-recovery state; readers consult the WAL for the latest committed pages, and SQLite may run recovery/checkpoint activity to reconcile WAL content with the database file [1]. If the leftover -wal does not actually belong to the restored db.sqlite, you’ve “mispaird” files; SQLite can then present inconsistent data, fail with errors, or end up corrupting the restored database as it tries to interpret/replay WAL content that was for a different database incarnation [1], [2].
Practical rules
- If you restore
db.sqlite, restore the matchingdb.sqlite-waltoo (and you may restoredb.sqlite-shm, but it’s not required for recovery content). [1] - If you intentionally want the restored
db.sqliteto stand alone, then deletedb.sqlite-wal(anddb.sqlite-shm) before opening the restored database, so SQLite cannot try to use stale WAL state. This aligns with SQLite’s corruption guidance about not leaving a “hot”-walbehind when swapping the main db file. [2]
Backup/restore best practice (to avoid this entirely)
Before taking a backup, ensure the DB is in a quiescent/clean state (e.g., checkpoint and close connections) so there is no meaningful -wal to carry around; otherwise you must treat db.sqlite + db.sqlite-wal as an inseparable pair. [1], [2]
Sources: SQLite WAL mode documentation [1], “How To Corrupt An SQLite Database File” (hot journals / mispairing WAL) [2].
[1] https://sqlite.org/wal.html
[2] https://www.sqlite.org/howtocorrupt.html
Remove stale WAL/SHM before restoring the database.
If ${SUPERVISOR_DB}-wal and -shm from a previous database remain, SQLite will attempt to replay them against the restored file, causing corruption or data inconsistency. These must be cleared before copying the backup. Additionally, add error handling to fail fast if the copy operation fails.
Suggested fix
# Backup current DB before overwriting
if [[ -f "$SUPERVISOR_DB" ]]; then
backup_db "pre-restore" >/dev/null 2>&1 || true
fi
- cp "$backup_file" "$SUPERVISOR_DB"
+ # Remove stale WAL/SHM to avoid replaying old frames
+ rm -f "${SUPERVISOR_DB}-wal" "${SUPERVISOR_DB}-shm" 2>/dev/null || true
+
+ if ! cp "$backup_file" "$SUPERVISOR_DB"; then
+ log_error "Failed to restore database from: $backup_file"
+ return 1
+ fi
[[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true
[[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || truePer coding guidelines: Reliability and robustness.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| restore_db() { | |
| local backup_file="${1:-}" | |
| if [[ -z "$backup_file" ]]; then | |
| log_info "Available backups:" | |
| # shellcheck disable=SC2012 | |
| ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | while IFS= read -r f; do | |
| local size | |
| size=$(du -h "$f" 2>/dev/null | cut -f1) | |
| local task_count | |
| task_count=$(sqlite3 "$f" "SELECT count(*) FROM tasks;" 2>/dev/null || echo "?") | |
| echo " $f ($size, $task_count tasks)" | |
| done | |
| return 0 | |
| fi | |
| if [[ ! -f "$backup_file" ]]; then | |
| log_error "Backup file not found: $backup_file" | |
| return 1 | |
| fi | |
| # Verify backup is valid SQLite | |
| if ! sqlite3 "$backup_file" "SELECT count(*) FROM tasks;" >/dev/null 2>&1; then | |
| log_error "Backup file is not a valid supervisor database: $backup_file" | |
| return 1 | |
| fi | |
| # Backup current DB before overwriting | |
| if [[ -f "$SUPERVISOR_DB" ]]; then | |
| backup_db "pre-restore" >/dev/null 2>&1 || true | |
| fi | |
| cp "$backup_file" "$SUPERVISOR_DB" | |
| [[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true | |
| [[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || true | |
| local task_count | |
| task_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM tasks;") | |
| local batch_count | |
| batch_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM batches;") | |
| log_success "Database restored from: $backup_file" | |
| log_info "Tasks: $task_count | Batches: $batch_count" | |
| return 0 | |
| } | |
| restore_db() { | |
| local backup_file="${1:-}" | |
| if [[ -z "$backup_file" ]]; then | |
| log_info "Available backups:" | |
| # shellcheck disable=SC2012 | |
| ls -1t "$SUPERVISOR_DIR"/supervisor-backup-*.db 2>/dev/null | while IFS= read -r f; do | |
| local size | |
| size=$(du -h "$f" 2>/dev/null | cut -f1) | |
| local task_count | |
| task_count=$(sqlite3 "$f" "SELECT count(*) FROM tasks;" 2>/dev/null || echo "?") | |
| echo " $f ($size, $task_count tasks)" | |
| done | |
| return 0 | |
| fi | |
| if [[ ! -f "$backup_file" ]]; then | |
| log_error "Backup file not found: $backup_file" | |
| return 1 | |
| fi | |
| # Verify backup is valid SQLite | |
| if ! sqlite3 "$backup_file" "SELECT count(*) FROM tasks;" >/dev/null 2>&1; then | |
| log_error "Backup file is not a valid supervisor database: $backup_file" | |
| return 1 | |
| fi | |
| # Backup current DB before overwriting | |
| if [[ -f "$SUPERVISOR_DB" ]]; then | |
| backup_db "pre-restore" >/dev/null 2>&1 || true | |
| fi | |
| # Remove stale WAL/SHM to avoid replaying old frames | |
| rm -f "${SUPERVISOR_DB}-wal" "${SUPERVISOR_DB}-shm" 2>/dev/null || true | |
| if ! cp "$backup_file" "$SUPERVISOR_DB"; then | |
| log_error "Failed to restore database from: $backup_file" | |
| return 1 | |
| fi | |
| [[ -f "${backup_file}-wal" ]] && cp "${backup_file}-wal" "${SUPERVISOR_DB}-wal" 2>/dev/null || true | |
| [[ -f "${backup_file}-shm" ]] && cp "${backup_file}-shm" "${SUPERVISOR_DB}-shm" 2>/dev/null || true | |
| local task_count | |
| task_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM tasks;") | |
| local batch_count | |
| batch_count=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM batches;") | |
| log_success "Database restored from: $backup_file" | |
| log_info "Tasks: $task_count | Batches: $batch_count" | |
| return 0 | |
| } |
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor-helper.sh around lines 510 - 554, In
restore_db(), before copying the backup into SUPERVISOR_DB, ensure any existing
stale SQLite journal files for the target are removed (delete
"${SUPERVISOR_DB}-wal" and "${SUPERVISOR_DB}-shm" if present) so SQLite won't
replay them against the restored file; then perform the cp of "$backup_file" to
"$SUPERVISOR_DB" and immediately check the command exit status, logging an error
via log_error and returning non‑zero on failure; keep the existing logic that
conditionally copies backup-file WAL/SHM after a successful cp and retain the
pre-restore backup_db call.



Summary
backup_db()/restore_db()helpers andbackup/restorecommands for supervisor DB safety and manual recoveryINSERT INTO tasks SELECT * FROM tasks_oldwith explicit column list to prevent column count mismatch when migrations run out of orderRoot Cause
The t128.8 migration used
SELECT *to copy data from the old table to the new one. If a priorALTER TABLE ADD COLUMNmigration had already added columns (e.g.,issue_url,diagnostic_of,triage_result), the column count would mismatch and the migration would fail. Since the old table was already renamed at that point, a failed transaction could leave the DB in an inconsistent state with no recovery path.Additionally, no backup was taken before any destructive migration, so there was no way to recover from a failed migration.
Changes
supervisor-helper.shbackup_db()with SQLite.backup, timestamped copies, auto-prune to 5supervisor-helper.shrestore_db()with validation and pre-restore safety backupsupervisor-helper.shcmd_backup/cmd_restorecommandssupervisor-helper.shsupervisor-helper.shSELECT *→ explicit column listTODO.mdTesting
bash -nsyntax check: passshellcheck -S error: zero violationsSummary by CodeRabbit