feat(cli): add assistant db repair with integrity-check step #32632

Merged
dvargasfuertes merged 1 commit into main from apollo/assistant-db-repair on May 29, 2026

@vellum-apollo-bot
Contributor

What

Adds assistant db repair — the second piece of the assistant db
recovery surface, after db status (#32606).

This PR ships a step-runner framework + the first repair step
(integrity check). Subsequent PRs append more steps to the sequence;
the wiring here is structured so they don't need to touch anything
besides adding their step to the STEPS array.

What this command does

Walks every page in the assistant SQLite database and reports
corruption.

$ assistant db repair
[1/1] integrity-check — starting
        Walk every database page and verify b-tree consistency
[1/1] integrity-check — ok  no corruption detected  (340.6s)
        scanned 993,829 pages

Done. 1 step ran: 1 ok, 0 failed

Runs PRAGMA integrity_check on a read-only handle. Full scan (not
quick_check) — a user typing repair is opting in to a thorough
probe.

Output paths:

  • Healthy DB → exit 0, page count detail line
  • Corrupt DB → exit 1, integrity violations surfaced verbatim (capped
    at 20 lines in human mode; full list in --json)
  • Severely malformed DB (pragma throws before yielding rows) →
    exit 1, normalized into the same corruption signal rather than
    being flagged as a step bug
  • Missing DB → exit 1, loud stderr error pointing at backup list
  • --json → single RepairReport payload with dbPath, okCount,
    errorCount, halted, and per-step result.data for scripting
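A minimal sketch of how the output paths above could be derived from the rows that `PRAGMA integrity_check` returns (SQLite yields a single `ok` row when nothing is wrong). The `StepResult` fields, the summary strings, and the `exitCodeFor` policy are assumptions for illustration; the PR names the type but does not show its shape.

```typescript
// Hypothetical shape: the PR names StepResult but does not show its fields.
type StepResult = {
  status: "ok" | "error";
  summary: string;
  detail: string[]; // lines printed in human mode
  data: Record<string, unknown>; // full payload for --json
};

const HUMAN_LINE_CAP = 20; // cap stated in the PR description

// `rows` are the text values returned by PRAGMA integrity_check.
// SQLite returns exactly one row containing "ok" for a healthy database.
function classifyIntegrityRows(rows: string[], pageCount: number): StepResult {
  const healthy = rows.length === 1 && rows[0] === "ok";
  if (healthy) {
    return {
      status: "ok",
      summary: "no corruption detected",
      detail: [`scanned ${pageCount.toLocaleString("en-US")} pages`],
      data: { pageCount },
    };
  }
  // Human mode shows at most HUMAN_LINE_CAP rows plus a pointer to --json.
  const shown = rows.slice(0, HUMAN_LINE_CAP);
  const hidden = rows.length - shown.length;
  if (hidden > 0) shown.push(`(${hidden} more; use --json for the full list)`);
  return {
    status: "error",
    summary: "corruption detected", // assumed wording
    detail: shown,
    data: { errors: rows }, // uncapped list for scripting
  };
}

// Assumed policy: exit 0 only when every step reported ok.
function exitCodeFor(results: StepResult[]): number {
  return results.every((r) => r.status === "ok") ? 0 : 1;
}
```

Keeping the cap in the presentation layer while `data.errors` stays complete is what lets `--json` consumers see every violation.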

Step-runner framework

Each step is a RepairStep with name, description, and run(ctx).
The runner:

  • Runs steps sequentially (later steps may depend on earlier ones)
  • Captures StepResult (ok or error) for each step
  • Continues past non-halting failures (best-effort repair — a corrupt
    DB shouldn't block conversation backfill from disk)
  • Stops the sequence only when a step explicitly returns halt: true
  • Never throws — uncaught errors from a step are captured as a
    synthetic error result with detail "step threw an unexpected
    error — this is a bug"
  • Records durationMs per step

That gives PR 3 (conversation backfill) and beyond a stable interface:
add a RepairStep to STEPS and the runner does the rest.
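The runner rules above can be sketched roughly as follows. This is an illustration, not the code in this PR: `RepairStep` uses the three members named above and `RepairReport` uses the fields listed under `--json`, but the `StepResult` fields and the `steps` array layout are assumptions.

```typescript
// Hypothetical shapes; only the member names mentioned in the PR text are known.
type RepairContext = { dbPath: string };
type StepResult = { status: "ok" | "error"; summary: string; halt?: boolean };
type RepairStep = {
  name: string;
  description: string;
  run(ctx: RepairContext): Promise<StepResult>;
};
type RepairReport = {
  dbPath: string;
  okCount: number;
  errorCount: number;
  halted: boolean;
  steps: Array<StepResult & { name: string; durationMs: number }>;
};

async function runSteps(steps: RepairStep[], ctx: RepairContext): Promise<RepairReport> {
  const report: RepairReport = {
    dbPath: ctx.dbPath,
    okCount: 0,
    errorCount: 0,
    halted: false,
    steps: [],
  };
  // Sequential on purpose: later steps may depend on earlier ones.
  for (const step of steps) {
    const started = Date.now();
    let result: StepResult;
    try {
      result = await step.run(ctx);
    } catch {
      // A throw is treated as a step bug and captured, never rethrown.
      result = {
        status: "error",
        summary: "step threw an unexpected error \u2014 this is a bug",
      };
    }
    const durationMs = Math.max(0, Date.now() - started);
    report.steps.push({ ...result, name: step.name, durationMs });
    if (result.status === "ok") report.okCount += 1;
    else report.errorCount += 1;
    // Only an explicit halt stops the sequence; plain errors do not.
    if (result.halt === true) {
      report.halted = true;
      break;
    }
  }
  return report;
}
```

Under this shape, adding a step means appending one object to the array passed as `steps`; the loop itself does not change.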

Risk

assistant db repair is registered as medium in the gateway risk
registry. The integrity-check step is strictly read-only, but the
description is accurate because future steps in the sequence will
mutate the database to recover state. Approving the path once gates
the whole future surface.

Description fix (review feedback on #32606)

@dvargasfuertes called out that the parent db description's
"(read-only by default)" qualifier had no referent — there's no flag
to flip it writable, and the parent has both read-only (status) and
mutating (repair) subcommands. Dropped the qualifier; description is
now just "Inspect and repair the assistant SQLite database". Lesson
absorbed into the software-engineering skill's cli-design.md.

Testing

11 unit tests, all using real bun:sqlite databases in tmp dirs (the
integrity check needs to walk actual pages — mocking would defeat the
point):

  • Healthy DB → integrity check passes, exit 0
  • Healthy DB → --json shape correct (steps, okCount, durationMs)
  • Corrupt DB → corruption surfaced, exit 1, not flagged as a bug
  • Corrupt DB → --json carries full error list
  • Missing DB → loud stderr, exit 1
  • Missing DB → --json carries missing: true
  • Runner: sequential order
  • Runner: continues past non-halting failures
  • Runner: stops on halt: true
  • Runner: captures thrown errors as synthetic error results
  • Runner: records non-negative durationMs

Smoke-tested against the live ~4 GB workspace DB:

[1/1] integrity-check — ok  no corruption detected  (340.6s)
        scanned 993,829 pages
Done. 1 step ran: 1 ok, 0 failed

5m44s wall-clock for ~994K pages on a ~4 GB DB. Acceptable for a
command the user only runs when they suspect trouble.
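As a rough sanity check on those numbers, the arithmetic below relates the page count to the database size and scan rate. The 4096-byte page size is an assumption (it is SQLite's default, but the PR does not state what this database uses); the page count and step duration come from the smoke test above.

```typescript
const pages = 993_829; // from the smoke test output
const scanSeconds = 340.6; // integrity-check step duration from the smoke test
const assumedPageSize = 4096; // ASSUMPTION: SQLite default page size, not stated in the PR

const approxBytes = pages * assumedPageSize; // 4,070,723,584 bytes
const approxGB = approxBytes / 1e9; // about 4.07 GB, consistent with "~4 GB"
const pagesPerSecond = pages / scanSeconds; // about 2,918 pages per second
const mbPerSecond = approxBytes / 1e6 / scanSeconds; // about 12 MB per second
```

If the assumed page size holds, the scan reads roughly 12 MB/s, so scan time should grow about linearly with database size on the same hardware.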

--json and missing-DB paths verified manually too.

What's next

PR 3: conversation-backfill step that replays
/workspace/conversations into SQLite. Reuses the logic from
migration 028 (already does meta + jsonl → SQLite with
conversation-level idempotency). Goal: a fresh DB after a wipe can be
rebuilt from the on-disk view.

After PR 3: write up the remaining ~95 tables and propose other
recovery steps (memory consolidation, lost-and-found triage, etc.).

Introduces a step-runner framework for the repair surface so future
remediation passes (conversation backfill, memory consolidation, etc.)
can be appended without restructuring the command. Each step produces a
structured `StepResult` and the runner aggregates them into a
`RepairReport` that renders as plain text or as JSON via `--json`.
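A sketch of what the two renderings of a `RepairReport` could look like, modeled on the sample output in the PR description. The type fields beyond `dbPath`, `okCount`, `errorCount`, and `halted` are assumptions, the "starting" lines are omitted, and the pretty-printed JSON is a guess at the `--json` layout.

```typescript
// Hypothetical shapes; `steps`, `detail`, and `summary` are illustrative names.
type StepEntry = {
  name: string;
  status: "ok" | "error";
  summary: string;
  durationMs: number;
  detail?: string[];
};
type RepairReport = {
  dbPath: string;
  okCount: number;
  errorCount: number;
  halted: boolean;
  steps: StepEntry[];
};

const DASH = "\u2014"; // separator used in the CLI output shown in the PR

function renderReport(report: RepairReport, json: boolean): string {
  // --json: one machine-readable payload, nothing else.
  if (json) return JSON.stringify(report, null, 2);

  const total = report.steps.length;
  const lines: string[] = [];
  report.steps.forEach((step, i) => {
    const secs = (step.durationMs / 1000).toFixed(1);
    lines.push(`[${i + 1}/${total}] ${step.name} ${DASH} ${step.status}  ${step.summary}  (${secs}s)`);
    // Detail lines are indented under their step.
    for (const d of step.detail ?? []) lines.push(`        ${d}`);
  });
  lines.push("");
  const noun = total === 1 ? "step" : "steps";
  lines.push(`Done. ${total} ${noun} ran: ${report.okCount} ok, ${report.errorCount} failed`);
  return lines.join("\n");
}
```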

First step: `integrity-check` — runs `PRAGMA integrity_check` on a
read-only handle. Full scan (not quick_check) because a user typing
`repair` is opting in to a thorough probe. Healthy DBs report
`ok` + page count. Damaged DBs surface the integrity_check rows
verbatim, capped at 20 lines in human output (full list in --json).
Severely-malformed DBs whose pragma throws before yielding rows are
normalized into the same corruption signal, not flagged as a step bug.

Also drops the "(read-only by default)" qualifier from the parent
`db` description per review feedback on #32606 — no flag exists to
flip the default, so the qualifier had no referent.

Gateway risk registry: `db repair` registered as medium. First step
is read-only; later steps will mutate.

Smoke-tested on the live ~4 GB workspace DB (993,829 pages, 5m44s,
no corruption). 11 unit tests pass covering healthy DB, two corrupt
seed shapes, missing DB, --json shape, and the four runner semantics
(sequential order, continue-on-error, halt-on-error, throw capture).

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 17459e48d9


async function runIntegrityCheck(ctx: RepairContext): Promise<StepResult> {
  return withDb<StepResult>(
    () => new Database(ctx.dbPath, { readonly: true }),

[P2] Handle database open failures as repair errors

When the file exists but SQLite cannot open it at all (for example, an unreadable or root-owned file, or assistant.db being a directory), this constructor throws before the inner try around PRAGMA integrity_check is reached. The generic runner then reports "step threw an unexpected error — this is a bug" instead of an actionable repair/open diagnostic. This recovery command is meant for damaged local databases, and db status already treats open failures as user-facing errors, so catch open failures here and return a normal status: "error" result.


Comment on lines +6 to +9
* 1. integrity check (this PR)
* 2. conversation backfill (next PR — replay /workspace/conversations
* into SQLite)
* 3. … more to come (memory consolidation, lost-and-found triage, etc.)

Delete references to the PR or to the future in comments.

@dvargasfuertes merged commit 77a20bc into main on May 29, 2026
14 checks passed
@dvargasfuertes deleted the apollo/assistant-db-repair branch on May 29, 2026, 22:15
dvargasfuertes pushed a commit that referenced this pull request on May 30, 2026
…ilure catch (#32642)

* feat(cli): assistant db repair — conversation-backfill step + integrity-step open-failure catch

Adds the second step to the `assistant db repair` sequence — replays
`<workspace>/conversations/<id>/{meta.json,messages.jsonl}` into the
SQLite conversations/messages tables so a wiped or restored-from-old-
backup database can be rebuilt from the on-disk view.

Architecture: the recovery body lives in a new shared module
`workspace/recovery/conversations-from-disk.ts` that takes a drizzle
handle + workspace dir and returns `{ recovered, skipped, errors,
warnings }`. Two callers consume it:

  1. workspace migration 028 — runs once at startup against the daemon's
     global `getDb()` (rewritten from 271 → 46 lines, delegates to the
     shared function)
  2. `db repair` conversation-backfill step — opens its own RW
     bun:sqlite handle with the same pragmas as the daemon, wraps it in
     drizzle, calls the shared function

Idempotent: the per-conversation existence check guards both call
sites. Malformed `meta.json` / `messages.jsonl` lines are skipped
with warnings (capped at 20 in human output, full list in --json up
to a 500-entry memory cap).
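The idempotency and warning-cap behavior described above might look roughly like this in-memory sketch. The `existingIds` set stands in for the per-conversation existence check against SQLite and `insert` stands in for the real row writes; both, along with `DiskConversation` and the warning wording, are hypothetical. Only the result field names and the 500-entry cap come from this commit message.

```typescript
type DiskConversation = { id: string; metaJson: string; messagesJsonl: string };
type BackfillResult = { recovered: number; skipped: number; errors: number; warnings: string[] };

const WARNING_MEMORY_CAP = 500; // cap stated in the commit message

function backfillConversations(
  onDisk: DiskConversation[],
  existingIds: Set<string>, // stand-in for the SQLite existence check
  insert: (id: string, meta: unknown, messages: unknown[]) => void, // stand-in for row writes
): BackfillResult {
  const result: BackfillResult = { recovered: 0, skipped: 0, errors: 0, warnings: [] };
  const warn = (msg: string): void => {
    if (result.warnings.length < WARNING_MEMORY_CAP) result.warnings.push(msg);
  };

  for (const convo of onDisk) {
    // Idempotency guard: anything already present is left untouched.
    if (existingIds.has(convo.id)) {
      result.skipped += 1;
      continue;
    }
    let meta: unknown;
    try {
      meta = JSON.parse(convo.metaJson);
    } catch {
      // Malformed meta.json is a warning, not a step failure (assumed to count as skipped).
      warn(`${convo.id}: malformed meta.json, conversation skipped`);
      result.skipped += 1;
      continue;
    }
    const messages: unknown[] = [];
    convo.messagesJsonl.split("\n").forEach((line, i) => {
      if (line.trim() === "") return;
      try {
        messages.push(JSON.parse(line));
      } catch {
        warn(`${convo.id}: malformed messages.jsonl line ${i + 1}, line skipped`);
      }
    });
    try {
      insert(convo.id, meta, messages);
      existingIds.add(convo.id);
      result.recovered += 1;
    } catch {
      result.errors += 1;
    }
  }
  return result;
}
```

Running it twice over the same input recovers nothing the second time, which is the property the idempotency test in this commit checks for.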

Two follow-ups from PR #32632 review folded in:

  - Vargas: dropped `(this PR)` / `(next PR)` / `(future)` PR-
    chronology callouts from `repair-steps.ts` and `repair.ts`
    module docs and from the `STEPS` comment. Rewritten to describe
    the abstraction (sequence of steps, append by extending the array)
    rather than the timeline. Codified in the software-engineering
    skill's `comments.md` as a lesson entry.
  - Codex P2: `integrity-check` step now catches `new Database(…)`
    failures (file-is-a-directory, unreadable file, header so broken
    SQLite refuses to attach) and surfaces them as a structured
    `status: "error"` with `data.openFailed: true` rather than
    letting the runner's generic "this is a bug" fallback eat it.
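The open-failure catch in the second bullet could be structured along these lines. This is a synchronous sketch with injected `open`, `check`, and `close` callbacks, all hypothetical; the actual step is async and goes through `withDb` and `new Database(…)`. Only `status: "error"` and `data.openFailed: true` are taken from the commit message.

```typescript
// Hypothetical shape; `summary` and the message wording are illustrative.
type StepResult = {
  status: "ok" | "error";
  summary: string;
  data?: Record<string, unknown>;
};

function runIntegrityCheckWith<H>(
  open: () => H, // e.g. a read-only database constructor call
  check: (handle: H) => StepResult, // the integrity scan itself
  close: (handle: H) => void,
): StepResult {
  let handle: H;
  try {
    handle = open();
  } catch (err) {
    // Open failures become a structured result rather than reaching
    // the runner's generic "this is a bug" fallback.
    const message = err instanceof Error ? err.message : String(err);
    return {
      status: "error",
      summary: `could not open database: ${message}`,
      data: { openFailed: true },
    };
  }
  try {
    return check(handle);
  } finally {
    close(handle); // always release the handle, even if the check throws
  }
}
```

Keeping the open in its own `try` means only open failures get the `openFailed` flag; errors raised by the check itself still follow whatever handling the step already has.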

Tests: 16 unit tests in `repair.test.ts` (11 carried, 5 new — 1
open-failure + 4 backfill: disk-only convo backfills + verifies SQLite
rows, idempotency on second run, empty-conversations-dir nothing-to-
backfill summary, malformed meta.json surfaced as a warning without
erroring the step). Migration 028's 10 tests all still pass against
the refactored delegator.

Smoke-tested on the live ~4 GB workspace DB:

  [1/2] integrity-check — ok  no corruption detected  (40.1s)
          scanned 993,829 pages
  [2/2] conversation-backfill — ok  nothing to backfill
          (773 on-disk conversations already present)  (1.5s)
  Done. 2 steps ran: 2 ok, 0 failed
  real    0m42.483s

* refactor: don't share recovery logic between migration 028 and db repair

Reverts the `workspace/recovery/conversations-from-disk.ts` shared
module + the migration 028 delegator collapse. Migration 028 is back
to its original 271-line form (unchanged from main); the repair step
gets its own self-contained copy inlined into
`repair-step-conversation-backfill.ts`.

Migrations are frozen historical snapshots. Sharing live code between
a migration and an evolving CLI command risks changing the migration's
behavior on workspaces that have already run it. The two consumers
should be free to drift — bug fixes or schema changes in the repair
step shouldn't retroactively alter what migration 028 does.

---------

Co-authored-by: vellum-apollo-bot[bot] <242025090+vellum-apollo-bot[bot]@users.noreply.github.com>