Skip to content

fix(lbug): recover from WAL corruption by quarantining .wal file (#1402)#1417

Merged
magyargergo merged 4 commits into
abhigyanpatwari:mainfrom
evander-wang:fix/wal-corruption-recovery-1402
May 8, 2026
Merged

fix(lbug): recover from WAL corruption by quarantining .wal file (#1402)#1417
magyargergo merged 4 commits into
abhigyanpatwari:mainfrom
evander-wang:fix/wal-corruption-recovery-1402

Conversation

@evander-wang

Copy link
Copy Markdown
Contributor

Summary

LadybugDB crashes when the WAL file is corrupted during db.init() — the MCP server returns an unhelpful native error. This PR detects WAL corruption, quarantines the offending .wal file, retries the open, and adds a recoverySuggestion field to MCP tool responses so the LLM can guide users.

Motivation / context

Closes #1402. Without this, any WAL corruption (power loss, disk full, kill -9 during analyze) renders the indexed knowledge graph unreachable until the user manually runs gitnexus analyze. The MCP server has no way to self-recover or to tell the user what to do.

Areas touched

  • gitnexus/ (CLI / core / MCP server)
  • gitnexus-web/ (Vite / React UI)
  • .github/ (workflows, actions)
  • eval/ or other tooling
  • Docs / agent config only (AGENTS.md, CLAUDE.md, .cursor/, llms.txt, etc.)

Scope & constraints

In scope

  • isWalCorruptionError() regex-based WAL corruption detector in lbug-config.ts
  • throwOnWalReplayFailure and enableChecksums constructor options in createLbugDatabase()
  • tryQuarantineAndReopen() — quarantine .wal → retry open in doInitLbug()
  • openReadOnlyDatabase() — unified read-only DB open with stdout silencing
  • WAL_RECOVERY_SUGGESTION constant shared across cypher/context/impact error paths
  • Fix restoreStdout() placement (was before await db.init(), now in finally)
  • Unit tests: detection regex, pool recovery, MCP feedback

Explicitly out of scope / not done here

  • No analyze-time WAL repair — this is a read-side recovery only
  • No auto-reindex trigger after quarantine — the user must run gitnexus analyze

Implementation notes

  • The read pool opens with throwOnWalReplayFailure: false so WAL corruption is caught as a thrown error (not a segfault) during db.init()
  • Quarantined WAL files are renamed to *.wal.corrupt.<timestamp> for diagnostic inspection
  • The pool init path is not the hot query path — db.init() overhead is one-time and deduplicated via initPromises

Testing & verification

  • cd gitnexus && npm test (passed: 212/213 files, 5195/5198 tests; 2 pre-existing failures in wiki-flags.test.ts)
  • cd gitnexus && npm run test:integration
  • cd gitnexus && npx tsc --noEmit (passed via pre-commit hook)
  • cd gitnexus-web && npm test
  • cd gitnexus-web && npx tsc -b --noEmit
  • Manual / Playwright E2E

Risk & rollout

  • No breaking changes — all new options default to existing behavior (throwOnWalReplayFailure ?? true, enableChecksums: true)
  • No index refresh required
  • No schema migration

Checklist

  • PR body meets repo minimum length
  • If AGENTS.md / overlays changed: headers, scope block, and changelog updated per project conventions
  • No secrets, tokens, or machine-specific paths committed

@vercel

vercel Bot commented May 7, 2026

Copy link
Copy Markdown

@evander-wang is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

Comment thread gitnexus/test/unit/mcp-wal-feedback.test.ts Fixed
@evander-wang evander-wang force-pushed the fix/wal-corruption-recovery-1402 branch from 562cebb to 6c9ffbc Compare May 7, 2026 19:01
@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
8270 8269 0 1 384s

✅ All 8269 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 77.91% 25085/32195 N/A% 🟢 ███████████████░░░░░
Branches 66.44% 15860/23870 N/A% 🟢 █████████████░░░░░░░
Functions 83.04% 2517/3031 N/A% 🟢 ████████████████░░░░
Lines 80.96% 22670/28000 N/A% 🟢 ████████████████░░░░

📋 View full run · Generated by CI

@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 6m 25s —— View job


PR #1417 Review — WAL Corruption Recovery

Tasks

  • Post initial todo list
  • Read repo context (AGENTS.md, DoD.md, etc.)
  • Read all changed source files
  • Read all new test files
  • Read related context files (stdio-capture.ts, lbug-adapter.ts, etc.)
  • Run Unicode/bidi check on changed files
  • Analyze: LadybugDB constructor arg order and defaults
  • Analyze: WAL corruption detection regex
  • Analyze: Quarantine + retry semantics
  • Analyze: Pool lifecycle (dbCache, initPromises, refCount, handle leaks)
  • Analyze: Stdout silencing/restore pairing through db.init()
  • Analyze: MCP recoverySuggestion gating and response shapes
  • Analyze: Test coverage quality and GHAS status
  • Write and post full review

Review bar for this PR

  • createLbugDatabase() new args must default to existing behavior (throwOnWalReplayFailure ?? true, enableChecksums: true)
  • isWalCorruptionError() must not classify lock, permission, schema, or generic I/O errors as WAL corruption
  • restoreStdout() must be in finally after db.init() resolves or rejects
  • Corrupted .wal must be renamed, not deleted; main DB file must be untouched
  • Recovery failure must produce clear user guidance (gitnexus analyze); success must not silently serve stale data without any hint
  • Failed DB handles must be closed before rethrow; dbCache must not hold a bad DB
  • initPromises deduplication and refCount semantics must be preserved end-to-end

Current PR state

Field Value
Head SHA 07ca3d0 (Merge main → fix/wal-corruption-recovery-1402)
Status Open, not draft
Commits 4 (2 feature commits + 2 merge-from-main)
Changed files 6
CI ✅ All checks passed — 8194/8194 tests, 0 failed, 1 skipped
Typecheck tsc --noEmit
WAL detection tests lbug-config-wal.test.ts — 5 tests
Pool recovery tests pool-wal-recovery.test.ts — 5 tests
MCP feedback tests mcp-wal-feedback.test.ts — 4 tests
GHAS comment Resolved — LocalBackend() called without args in latest head
Hidden Unicode Em-dashes (U+2014) in comments and suggestion strings only — no bidi/direction controls found
Vercel Irrelevant to this review

Branch hygiene / merge-forward assessment

Clean and mergeable.

Two merge-forward commits from main are present; no conflict artifacts, no duplicated imports, no unreachable code, no unrelated churn in the effective diff. The fix(test): remove superfluous argument from LocalBackend constructor commit is a correct, minimal fix in response to the GHAS alert — it is not masking a real test issue.


Understanding of the change

The PR adds three cooperating pieces:

  1. Detection (lbug-config.ts): isWalCorruptionError() + WAL_RECOVERY_SUGGESTION constant.
  2. Recovery (pool-adapter.ts): openReadOnlyDatabase() opens with throwOnWalReplayFailure: false so init throws cleanly; tryQuarantineAndReopen() renames the .wal and retries; doInitLbug() catches WAL errors in the retry loop, attempts recovery, and either populates dbCache from the recovered DB or throws with an actionable message.
  3. MCP UX (local-backend.ts): cypher, context, and impact each gate recoverySuggestion on isWalCorruptionError().

Findings

[medium] Silent stale-data after successful quarantine

  • Category: Data correctness / UX
  • Files: gitnexus/src/core/lbug/pool-adapter.ts (lines 366–379)
  • Issue: When tryQuarantineAndReopen() succeeds, the MCP continues serving the last-checkpointed graph state. There is no logger.warn(), no warning in the response, and no recoverySuggestion on subsequent tool calls. The user has no visible signal that the WAL was discarded and the graph may be stale.
  • Why it matters: Uncheckpointed changes — potentially representing all analysis since the last checkpoint — are silently dropped. The graph may be significantly out of date. An LLM client receiving clean results has no basis to prompt the user to re-analyze.
  • Recommended fix: Add logger.warn({ dbPath, quarantineName }, 'WAL quarantined — graph may be stale; run gitnexus analyze') inside tryQuarantineAndReopen() after the successful rename. Optionally, on first query after a recovered open, include a _warning field in responses.
  • Blocks merge: No, but a follow-up issue should be filed. Fix this →

[low] Missing context WAL positive test

  • Category: Test coverage
  • Files: gitnexus/test/unit/mcp-wal-feedback.test.ts
  • Issue: The context wrapper (lines 1684–1695 in local-backend.ts) is implemented — it returns { error, recoverySuggestion } for WAL errors and re-throws for non-WAL errors. The test covers the non-WAL throw path but has no test asserting that a WAL error produces { recoverySuggestion } instead of throwing.
  • Why it matters: The WAL return-instead-of-throw branch in context could be silently regressed.
  • Recommended fix: Add a test case: mock executeParameterized to reject with 'Corrupted wal file' and call context; assert result.recoverySuggestion is defined and the call does NOT throw. Fix this →
  • Blocks merge: No.

[low] Sparse WAL detector negative tests

  • Category: Test coverage / correctness
  • Files: gitnexus/test/unit/lbug-config-wal.test.ts
  • Issue: Negative cases cover lock, generic, path-not-found, and checksum-without-WAL. Missing: permission denied, schema mismatch, and generic I/O errors that reference "WAL" in their message but are not replay corruption.
  • Why it matters: The wal.*corrupt branch of the regex could match an unlikely but non-empty set of messages that are not WAL replay errors (e.g., a filesystem error whose message includes both "wal" and "corrupt" somewhere in a path or context string).
  • Recommended fix: Add negative cases for EACCES: permission denied '/path/to/wal' and schema version mismatch in WAL. The regex is likely safe in the specific context of db.init() error messages, but broader coverage validates it.
  • Blocks merge: No.

[low] Millisecond-precision quarantine naming

  • Category: Filesystem correctness
  • Files: gitnexus/src/core/lbug/pool-adapter.ts (line 290)
  • Issue: Date.now() provides millisecond precision. If two quarantine attempts race on the same DB in the same millisecond (very unlikely), the second fs.rename() overwrites the first on POSIX (atomic overwrite) or may fail with EEXIST on Windows.
  • Why it matters: Losing a previous quarantine file destroys diagnostic evidence on POSIX; on Windows it could fail the recovery entirely.
  • Recommended fix: Add a random suffix: `${walPath}.corrupt.${Date.now()}-${Math.random().toString(36).slice(2)}`. This is a cosmetic hardening.
  • Blocks merge: No.

LadybugDB constructor / native boundary assessment

Verdict: Acceptable, with one uncertainty.

The constructor adds two positional args after the existing 7:

new (lbugModule.Database as any)(
  databasePath,   // 0: path
  0,              // 1: bufferManagerSize
  false,          // 2: enableCompression (pinned for v0.16.0)
  options.readOnly ?? false,   // 3: readOnly
  LBUG_MAX_DB_SIZE,            // 4: maxDBSize
  true,           // 5: autoCheckpoint
  -1,             // 6: checkpointThreshold
  options.throwOnWalReplayFailure ?? true,  // 7: throwOnWalReplayFailure
  true,           // 8: enableChecksums
)
  • The as any cast is justified and contained to this one call site.
  • Existing defaults are preserved: positions 0–6 are unchanged; throwOnWalReplayFailure ?? true preserves the prior implicit default; enableChecksums: true is an explicit opt-in.
  • Comments document each positional arg.
  • lbug-config-wal.test.ts asserts the exact 9-argument signature with concrete values.
  • The native constructor argument order cannot be independently verified from this repo (LadybugDB is a third-party native module). However, CI is green with 8194/8194 tests passing across 3 platforms, which provides reasonable confidence the order is correct.

Uncertainty: If the native constructor interprets arg 7 as something other than throwOnWalReplayFailure, the detection behavior for WAL errors would be wrong. This is flagged as a point for the author/maintainer to confirm against LadybugDB's changelog or source, but does not block merge given green CI and explicit comments.


WAL corruption detection assessment

const WAL_CORRUPTION_RE = /corrupt(ed)?\s+wal|invalid\s+wal\s+record|wal.*corrupt|checksum.*wal/i;

Positive coverage (verified against test cases and regex):

  • 'Runtime exception: Corrupted wal file. Read out invalid WAL record type.' → matches corrupt(ed)?\s+wal
  • 'Error: invalid WAL record type' → matches invalid\s+wal\s+record
  • 'Checksum verification failed, the WAL file is corrupted.' → matches checksum.*wal
  • 'the WAL file is corrupted' → matches wal.*corrupt

Negative coverage (verified):

  • 'Could not set lock on file' → no WAL → false ✓
  • 'Query failed' → false ✓
  • 'LadybugDB not found at /path' → false ✓
  • 'Checksum verification failed for parquet file' → no WAL → false ✓

Non-Error handling: isWalCorruptionError(null), undefined, 42 all return false via the if (!err) return false guard and String(err) fallback. ✓

Overmatch concern (low): wal.*corrupt is the loosest branch. A filesystem error message whose interpolated path contained both "wal" and "corrupt" could falsely match. In practice, errors from db.init() come from the native LadybugDB engine (not from Node.js path interpolation), so the risk is acceptable. The missing negative cases for EACCES and schema mismatch are minor.


Quarantine / recovery assessment

WAL path: dbPath + '.wal' — correct LadybugDB convention. ✓

Only the WAL is renamed: Main DB file is untouched. ✓

Quarantine name: `${walPath}.corrupt.${Date.now()}` — unique in practice; millisecond collision on same DB is extremely unlikely. POSIX rename() is atomic. ✓ (minor hardening possible, see Findings)

Rename failure: Caught and re-thrown with an actionable error message including repoId and gitnexus analyze. ✓

Retry open: openReadOnlyDatabase(dbPath) is called after rename; if it fails, the exception propagates and doInitLbug throws the outer error with the analyze suggestion. ✓

No infinite retry: One quarantine attempt. The WAL corruption path does not re-enter the for (attempt...) loop. ✓

Active-writer race risk: Low. If gitnexus analyze is actively writing, it holds a write lock; MCP read-only open would receive a lock error ('Could not set lock'), not a WAL corruption error. The WAL corruption path is only triggered by isWalCorruptionError(), which specifically does NOT match lock errors. The two paths are practically exclusive.

Data correctness: After quarantine the DB opens from the last WAL checkpoint, which may be significantly behind the last successful analyze. No user-visible warning is emitted on the success path (see silent stale-data finding above).


Pool lifecycle assessment

initPromises deduplication: Preserved. initLbug() stores the doInitLbug() promise and removes it in finally regardless of success or failure. Concurrent callers await the same promise. ✓

No double-open: dbCache is populated exactly once per dbPath. In the WAL recovery path, shared = { db, refCount: 0, ftsLoaded: false }; dbCache.set(dbPath, shared) is inside the break-guarded try block; if it throws, the cache is never set. ✓

No bad DB in cache: If tryQuarantineAndReopen() throws, doInitLbug() re-throws immediately; shared remains undefined; dbCache.set() is never reached for the failed DB. ✓

RefCount semantics: shared.refCount++ happens only after shared is confirmed non-null (line 396), which is only reachable via a successful open path. ✓

Lock retry preserved: isLockError check and LOCK_RETRY_ATTEMPTS loop is intact; WAL corruption breaks out of the loop immediately, avoiding the lock retry delay. ✓

FTS loading: shared.ftsLoaded starts false on both fresh and recovered open; FTS is loaded before pool registration. ✓

Handle leak: openReadOnlyDatabase() closes db in its own catch block (if (db) await db.close().catch(() => {})). Failed DBs from both the initial attempt and the retry attempt are closed before the exception propagates. ✓


Stdout / MCP safety assessment

openReadOnlyDatabase() structure:

silenceStdout();         // ← outside try (cannot throw)
try {
  db = createLbugDatabase(...);
  await db.init();       // ← native output silenced
  return db;
} catch (err) {
  if (db) await db.close().catch(() => {});
  throw err;
} finally {
  restoreStdout();       // ← always runs
}
  • silenceStdout() is called before db.init(). ✓
  • restoreStdout() is in finally, runs on both success and failure paths. ✓
  • The reference counter (stdoutSilenceCount) is correct: after the first openReadOnlyDatabase fails, restoreStdout() decrements back to 0 before tryQuarantineAndReopen calls openReadOnlyDatabase again. No nesting issue. ✓
  • The safety watchdog (setInterval at 1s) exempts activeQueryCount > 0 and preWarmActive. Neither flag is set during doInitLbug's DB open loop. The watchdog could theoretically fire during a long db.init(), but the 1-second watchdog interval is shorter than a typical stuck init, and the fix is that restoreStdout() is now correctly in finally, so a stuck init won't leave stdout permanently silenced. ✓
  • setActiveStdoutWrite / getActiveStdoutWrite integration: restoreStdout() restores to getActiveStdoutWrite(), which respects the MCP sentinel registered by installGlobalStdoutSentinel. The sentinel chain is preserved. ✓
  • No diagnostic output goes to stdout. ✓

MCP recovery UX assessment

cypher (lines 1218–1230): Returns { error: msg, recoverySuggestion } for WAL; returns { error: msg } for non-WAL. ✓

context (lines 1684–1695): Wraps _contextImpl in try/catch; returns { error, recoverySuggestion } for WAL; re-throws for non-WAL. This is asymmetric with cypher (which returns errors) but correctly preserves the pre-existing context behavior. The WAL branch is properly gated.

impact (lines 2457–2470): Always returns a structured error object (existing behavior since #321); spreads recoverySuggestion only when WAL. ✓

recoverySuggestion content: 'WAL corruption detected. Run \gitnexus analyze` to rebuild the index.'` — concise, actionable, no machine-specific paths. ✓

Gating is exclusive: Non-WAL errors never receive recoverySuggestion. ✓

_contextImpl extraction: The split of context into an outer wrapper + _contextImpl is clean and preserves all original logic. ✓

Gap: The successful quarantine path does not return a recovery suggestion on subsequent tool calls. An LLM client serving healthy (but potentially stale) results after recovery has no basis to prompt the user to re-analyze. This is the stale-data finding above.


Test assessment

lbug-config-wal.test.ts: Covers 4 positive WAL messages (including a real-looking LadybugDB native message), 4 negative non-WAL messages, and 4 non-Error value types. Also asserts exact 9-argument constructor call order. Tests would fail before this PR (the function didn't exist). ✓ Minor gap: sparse on permission/schema negatives.

pool-wal-recovery.test.ts: 5 tests covering: quarantine+retry success, lock-error non-quarantine, retry-also-fails error message, no poisoned cache state, and ENOENT on .wal file. Tests use isolated mock paths, clean up via closeLbug(), and reset mocks in beforeEach. ✓

  • The setTimeout spy for the lock-retry test is functional but fragile (it mocks the global); acceptable for unit scope.
  • No test for restoreStdout() being called after db.init() failure (the test mocks at createLbugDatabase level, which bypasses the real silenceStdout/restoreStdout calls). This is an inherent limitation of the mock seam chosen.
  • No concurrent-init deduplication test during WAL recovery. Low risk given the existing initPromises test coverage elsewhere.

mcp-wal-feedback.test.ts: Covers impact WAL response, cypher WAL response, non-WAL passthrough for impact, and context non-WAL throw preservation. The GHAS alert (LocalBackend constructor argument) is resolved in the latest head. ✓ Missing: explicit test for context WAL positive path (see Findings).

Mocking quality: Tests mock at the right seams (createLbugDatabase, fs/promises, pool-adapter). The mocks are realistic enough to validate the branching logic. They do not validate native constructor arg order end-to-end (integration test would be needed for that), but that limitation is acceptable.


Hidden Unicode / hygiene assessment

Bidi/direction controls (U+202A–U+202E, U+2066–U+2069): None found in any of the 6 changed files. ✓

Non-ASCII characters found:

  • Em-dashes (U+2014) in comments throughout lbug-config.ts and pool-adapter.ts — pre-existing project style, not in executable logic. ✓
  • Em-dash in local-backend.ts line 2467 (suggestion: 'The graph query failed — try...') — pre-existing in the impact fallback string; the test assertions at lines 102 and 132 of mcp-wal-feedback.test.ts correctly mirror this string. ✓
  • Arrow (→) in comment pool-adapter.ts:440 — typographic, comment-only. ✓

No invisible whitespace, no null bytes, no bidi controls in executable code, regex literals, or test assertions. Clean.


Elegance / maintainability assessment

The change is well-scoped. Three functions are added (openReadOnlyDatabase, tryQuarantineAndReopen, isWalCorruptionError) and one constant (WAL_RECOVERY_SUGGESTION). The MCP changes are thin wrappers with clear conditional gating. No auto-reindex logic, no analyze-time repair — the PR correctly limits itself to read-side recovery.

The _contextImpl extraction is the only structural change to existing logic, and it is clean. The doInitLbug loop handles WAL as a fast-break special case, keeping the lock-retry path unchanged.

Future LadybugDB upgrades should audit positional arg counts in createLbugDatabase() and update comments accordingly. The test for constructor args provides a regression harness for this.


Final verdict

production-ready with minor follow-ups

All hard blockers are clean: restoreStdout() is correctly in finally; failed DB handles are closed before any rethrow; dbCache cannot hold a bad DB entry; the MCP response shapes for cypher/context/impact are correct; no bidi controls were found; CI is green across all platforms.

The one meaningful production concern is that a successful WAL quarantine is completely silent — the user and the LLM client receive normal query results from potentially stale graph data with no warning. This should be addressed in a follow-up by adding a logger.warn() in tryQuarantineAndReopen() after a successful rename. The missing context WAL positive test and the sparse negative cases for the WAL detector are minor gaps that should be filed as follow-ups but do not affect correctness of the shipped behavior.

The native constructor argument order cannot be independently confirmed from this repo, but green CI across 3 platforms and the explicit positional-arg test provide sufficient confidence for merge. The PR author should confirm argument positions against the LadybugDB 0.16.x changelog before the next native upgrade.


· PR branch

@magyargergo

Copy link
Copy Markdown
Collaborator

@evander-wang Can you please look into these findings? 🙏

@evander-wang evander-wang force-pushed the fix/wal-corruption-recovery-1402 branch from 6f243ff to 40d870a Compare May 8, 2026 08:02
…igyanpatwari#1402)

LadybugDB crashes when the WAL file is corrupted — the open fails with an
unrecoverable native error. This makes the pool adapter detect WAL corruption
errors, quarantine the offending .wal file, and retry the open. MCP tool
responses (cypher, context, impact) now include a recoverySuggestion field
when WAL corruption is detected.

Changes:
- Add isWalCorruptionError() regex-based detector in lbug-config.ts
- Add throwOnWalReplayFailure and enableChecksums to createLbugDatabase()
- Extract openReadOnlyDatabase() with stdout silencing + db.init()
- Add tryQuarantineAndReopen() for .wal quarantine + retry in doInitLbug
- Wrap cypher/context/impact with WAL recoverySuggestion in MCP responses
- Share WAL_RECOVERY_SUGGESTION constant across all MCP error paths
- Fix restoreStdout() placement (before db.init() → finally block)
- Add unit tests for detection, pool recovery, and MCP feedback
…bhigyanpatwari#1402)

LocalBackend has no constructor — the { registryPath } argument was ignored.
@evander-wang evander-wang force-pushed the fix/wal-corruption-recovery-1402 branch from 40d870a to d8378b1 Compare May 8, 2026 08:10
@magyargergo

Copy link
Copy Markdown
Collaborator

fyi, I'm also working on lbug because i want to make it more robust and to avoid getting IO exception: Could not set lock on file from https://github.com/LadybugDB/ladybug/blob/v0.16.1/src/common/file_system/local_file_system.cpp#L126.

@evander-wang

Copy link
Copy Markdown
Contributor Author

fyi, I'm also working on lbug because i want to make it more robust and to avoid getting IO exception: Could not set lock on file from https://github.com/LadybugDB/ladybug/blob/v0.16.1/src/common/file_system/local_file_system.cpp#L126.

That’s great — really helpful. I haven’t looked through the full code yet, but my understanding is
that this lock is probably related to the WAL issue we’re seeing, right? Just curious.

@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 6m 6s —— View job


PR #1417 Review — WAL Corruption Recovery

Tasks

  • Post initial todo list
  • Read repo context (AGENTS.md, DoD.md, GUARDRAILS.md, etc.)
  • Read all changed source files
  • Read all new test files
  • Read related context files (stdio-capture.ts, pool lifecycle, etc.)
  • Run Unicode/bidi check on changed files
  • Check git log / branch hygiene
  • Check CI status on latest head
  • Analyze WAL detection regex soundness
  • Analyze native constructor argument order
  • Analyze quarantine + retry semantics
  • Analyze pool lifecycle (dbCache, initPromises, refCount, handle leaks)
  • Analyze stdout silencing/restore pairing
  • Analyze MCP recoverySuggestion gating and response shapes
  • Analyze test coverage quality
  • Write and post full review

Review bar for this PR

  • WAL corruption is detected only for real LadybugDB WAL corruption messages — lock, permission, schema, and generic I/O errors must not trigger quarantine.
  • throwOnWalReplayFailure and enableChecksums constructor arguments must preserve existing caller defaults and match the pinned native constructor signature.
  • Quarantine must rename (never delete) the .wal and never touch the main DB file; the rename must be exception-safe.
  • The pool must not cache a failed DB, leak handles, or leave initPromises stuck — all paths must converge on a clean state.
  • stdout silencing/restoration must be exception-safe across initial open, failed open, quarantine, and reopen — no path may leave stdout silenced.
  • MCP cypher/context/impact responses must include recoverySuggestion only when isWalCorruptionError() is true, and must preserve the error field for existing clients.
  • CI, typecheck, unit tests, and GHAS are green on the latest head SHA.
  • Branch diff is scoped to the 6 WAL-recovery files only; the merge-from-main commit introduces no unrelated churn to the effective PR diff.

Current PR state

Field Value
Head SHA 2aefc47 (Merge branch 'main' → fix/wal-corruption-recovery-1402)
Status Open, not draft
Commits 4 (3 feature + 1 merge-from-main)
Changed files 6 (all WAL-recovery TypeScript)
Net diff +511 / -11
CI ✅ All 5 checks passed — Docker, CI, Dependency Review, Gitleaks, CodeQL
Typecheck tsc --noEmit (CI)
Unit tests ✅ 8264/8264 tests passed on latest run
GHAS (CodeQL) ✅ Passed — superfluous LocalBackend() constructor arg fixed in 236c204
Hidden Unicode ✅ No bidi/direction controls; em-dashes (U+2014) in comments only — not in executable code, regexes, or test strings
Vercel Irrelevant to this core/MCP reliability review
Package versions Unchanged — no lockfile or dependency churn

Branch hygiene assessment

merge-from-main commit present but harmless and merge-safe.

The merge commit 2aefc47 has two parents: d8378b1 (last PR-specific commit) and 927a172 (perf(mcp): parallelize staleness checks in list_repos (#1416) from main). That staleness commit touches git-staleness.ts, calltool-dispatch.test.ts, and staleness.test.ts — all already merged to main. GitHub's PR diff (base=main) correctly shows only the 6 WAL-recovery files. No unrelated version bumps, workflow changes, lockfile churn, or generated artifacts are introduced by the merge commit.


Understanding of the change

LadybugDB replays its WAL during db.init(). If the WAL is physically corrupted (truncated write, kill-9 during checkpoint, disk failure), LadybugDB throws a native exception instead of returning cleanly. Before this PR, that exception surfaced as a raw native error to MCP callers with no recovery path and no guidance.

The PR adds three cooperating pieces:

  1. Detection (lbug-config.ts): isWalCorruptionError() + WAL_RECOVERY_SUGGESTION — pattern-matches LadybugDB native WAL error strings and centralises the recovery message.

  2. Recovery (pool-adapter.ts): openReadOnlyDatabase() opens with throwOnWalReplayFailure: false (init throws rather than segfaulting), wraps silence/restore in finally. tryQuarantineAndReopen() renames the .wal to *.wal.corrupt.<timestamp>-<random> and retries the open. doInitLbug() catches WAL errors in the lock-retry loop, attempts quarantine once, and either populates dbCache from the recovered DB or throws with an actionable message.

  3. MCP UX (local-backend.ts): cypher, context, and impact each gate recoverySuggestion on isWalCorruptionError(), preserving existing error shapes for non-WAL failures.

What remains out of scope (intentionally): analyze-time WAL repair, auto-reindex after quarantine, and any write-side recovery.


Findings

[low] No test verifying restoreStdout is called after db.init() failure

  • Category: Test coverage / MCP stdio safety
  • Files: gitnexus/test/unit/pool-wal-recovery.test.ts
  • Issue: The mock seam is at createLbugDatabase level — the real silenceStdout/restoreStdout calls in openReadOnlyDatabase are exercised, but the test doesn't assert that restoreStdout was called after a failed db.init(). The implementation is correct (the finally block is present), but there's no regression harness that would catch a future accidental removal of restoreStdout from that path.
  • Why it matters: Leaving stdout permanently silenced after a failed init corrupts MCP JSON-RPC output. The correct path is coded but unproven by test.
  • Recommended fix: Mock silenceStdout/restoreStdout (or assert stdoutSilenceCount resets to 0) in the failed-init scenario. Fix this →
  • Blocks merge: No.

[low] Successful quarantine emits stderr log but no MCP response indicator for stale graph

  • Category: Data correctness / UX
  • Files: gitnexus/src/core/lbug/pool-adapter.ts:299-303
  • Issue: tryQuarantineAndReopen() emits a realStderrWrite warning after a successful rename — this is good for server-side visibility. However, subsequent MCP cypher/context/impact calls return clean results with no _warning field or recoverySuggestion. An LLM client receiving healthy-looking results after recovery has no in-band signal to prompt the user to re-analyze.
  • Why it matters: The quarantined WAL may have lost uncheckpointed changes from the last gitnexus analyze run. The graph is potentially stale. The stderr log is visible to humans watching the process but is invisible to the LLM tool consumer.
  • Recommended fix: This was identified in the previous automated review and the realStderrWrite call was already added as partial mitigation. A follow-up should add a transient _warning field to the first tool response after a recovered open. Filing a follow-up issue is sufficient here. Fix this →
  • Blocks merge: No.

LadybugDB / WAL recovery assessment

Detection regex:

const WAL_CORRUPTION_RE = /corrupt(ed)?\s+wal|invalid\s+wal\s+record|wal.*corrupt|checksum.*wal/i;

Positive coverage verified: the reported issue #1402 error (Corrupted wal file. Read out invalid WAL record type.), invalid WAL record type, Checksum verification failed, the WAL file is corrupted, and the WAL file is corrupted all match. ✅

Negative coverage verified (including cases added in the feedback commit): Could not set lock on file, Query failed, LadybugDB not found at /path, Checksum verification failed for parquet file, EACCES: permission denied '/path/to/wal', and schema version mismatch in WAL all return false. ✅ The path-based false-positive concern (EACCES: permission denied '/path/to/wal') and the schema mismatch concern are explicitly tested.

Non-Error inputs (null, undefined, 42, new Error('ok')) return false via the !err guard. ✅

Minor theoretical concern: checksum.*wal could match an error like Checksum error for wal parquet file (if "wal" appears in a file path in the error message). This is acceptable risk — LadybugDB's native engine does not interpolate paths into its WAL error messages, and the test suite explicitly validates the common path-mention case.

Constructor arguments:

The 9-position call is documented with inline comments and covered by an exact-match constructor test in lbug-config-wal.test.ts. throwOnWalReplayFailure ?? true preserves the pre-existing default for all callers. enableChecksums: true is an explicit opt-in (new arg). The as any cast is documented and contained to one call site. CI is green across 3 platforms, which provides reasonable cross-platform confidence in the argument order. ✅

Quarantine:

  • dbPath + '.wal' is the correct LadybugDB WAL convention. ✅
  • Rename is same-directory (no cross-device EXDEV risk). ✅
  • Main DB file is never touched. ✅
  • Quarantine name: `${walPath}.corrupt.${Date.now()}-${Math.random().toString(36).slice(2)}` — timestamp + random suffix, collision-resistant, filename-safe on Windows. ✅ (This random suffix was added in the feedback commit, addressing the millisecond-collision concern from the prior review.)
  • Failed DB handle is closed in openReadOnlyDatabase's catch block before the error propagates to the quarantine path. ✅
  • No infinite retry: the WAL corruption branch immediately breaks from the lock-retry loop; a second WAL corruption after quarantine fails at the reopen step and throws with a clear message. ✅

Active-writer race: Lock errors (Could not set lock) are classified by isLockError, not isWalCorruptionError. These paths are mutually exclusive at the classification level. A healthy writing process holds a file lock; the reader gets a lock error and retries. WAL corruption errors only arise from physical corruption, which is a different condition. ✅

Stale graph: realStderrWrite is called with a warning after successful quarantine ("graph may be stale. Run gitnexus analyze"). The PR body's "No index refresh required" applies to the general case (no schema migration for all users); users who experience WAL corruption are told to re-run gitnexus analyze. This reconciliation is correct. ✅


Pool lifecycle / concurrency assessment

openReadOnlyDatabase: silenceStdout() before the try, restoreStdout() in finally, db.close() in catch before rethrow. No path skips restore. ✅

doInitLbug: dbCache.set(dbPath, shared) only executes after a successful open (either initial or post-quarantine). Failed paths never write to dbCache. shared.refCount++ only executes when shared is confirmed non-null (line 400). ✅

initPromises deduplication: Set before doInitLbug awaits, deleted in finally — cleans up on both success and failure. Concurrent callers await the same promise; they cannot double-init or double-quarantine the same DB path. ✅

Handle leaks:

  • Initial open fails → db.close() in openReadOnlyDatabase's catch. ✅
  • Quarantine rename fails → throws immediately, no DB was opened for the retry. ✅
  • Reopen (post-quarantine) fails → db.close() in openReadOnlyDatabase's catch. ✅
  • FTS load failure after successful open: shared.ftsLoaded remains false but the DB is in dbCache with refCount++. The pool entry is not registered (pool.set is only reached after FTS load attempt). A future initLbug call would find shared in dbCache (refCount still > 0) and increment again — FTS load would be retried since ftsLoaded is false. This is pre-existing behavior, not introduced by this PR.

Lock retry interaction: WAL corruption breaks immediately from the for (let attempt...) loop — it does not retry as a lock error and does not delay for LOCK_RETRY_DELAY_MS. Lock errors still retry as before. ✅


stdout/stderr / MCP protocol assessment

openReadOnlyDatabase structure is correctly exception-safe:

silenceStdout();         // before try — cannot throw
try {
  db = createLbugDatabase(...);
  await db.init();       // native output silenced
  return db;
} catch (err) {
  if (db) await db.close().catch(() => {});
  throw err;
} finally {
  restoreStdout();       // always runs — even if db.close() throws
}

restoreStdout() is in finally, not after db.init(). This was the fix described in the PR. ✅

The stdoutSilenceCount refcount is correctly balanced: each openReadOnlyDatabase call (initial + quarantine retry) has exactly one silenceStdout and one restoreStdout. No over-decrement or under-decrement. ✅

Watchdog timer: The 1-second watchdog (activeQueryCount === 0 && !preWarmActive) can theoretically fire during db.init() if init takes > 1s, prematurely restoring stdout. This is a pre-existing accepted trade-off — the watchdog is a last-resort safety net for stuck silencing, and DB init is typically fast. The correct finally placement ensures stdout is restored even if the watchdog fires first (idempotent on double-restore due to the <= 0 guard in restoreStdout). ✅

realStderrWrite (not process.stderr) is used for the quarantine diagnostic — safe for MCP mode because stderr is not part of the JSON-RPC transport. ✅

MCP response shapes:

Tool WAL error Non-WAL error
cypher { error, recoverySuggestion } { error }
context { error, recoverySuggestion } (returns) rethrows (pre-existing behavior)
impact { error, suggestion, recoverySuggestion, ... } { error, suggestion, ... } (pre-existing #321 shape)

All three tools preserve the error field for existing clients. recoverySuggestion is gated exclusively on isWalCorruptionError(). WAL_RECOVERY_SUGGESTION is defined once and imported — no duplicated strings. ✅

The context wrapper's asymmetry (return for WAL, rethrow for non-WAL) is intentional and matches pre-existing behavior. The feedback-commit added a test for the context WAL positive path (context returns WAL recoverySuggestion on corrupted WAL error) — this gap from the prior review is resolved. ✅


Data safety / index integrity assessment

Quarantining the WAL discards uncheckpointed writes. This is acknowledged: realStderrWrite emits a stale-graph warning, and WAL_RECOVERY_SUGGESTION tells the user to run gitnexus analyze. The recovered DB is opened read-only and never mutated. No "recovered" marker is persisted to disk (intentional — no design yet for degraded-index tracking). The MCP response on the first query after recovery does not carry an in-band stale-data warning — this is the remaining follow-up item noted in the Findings section above.

The PR body's "No index refresh required" is correct for the general case (no migration needed); gitnexus analyze is the user-facing recovery step for the corrupted-WAL case specifically. There is no contradiction.


Cross-platform filesystem assessment

fs.rename(walPath, quarantineName) is same-directory — avoids EXDEV on any platform. The quarantine name uses only alphanumeric characters, dots, and hyphens — safe on Windows (no reserved characters). If the .wal file does not exist by rename time (e.g., a race with another process), fs.rename throws ENOENT, which is caught and converted to an actionable error message including gitnexus analyze. ✅

Windows locked-file scenario: if LadybugDB holds a file lock on the .wal (rare on read-only open, but possible on some native error paths), fs.rename fails; the error is caught and an actionable message is thrown. This is acceptable — the user is told to gitnexus analyze, which is the correct recovery step regardless. The ENOENT case is explicitly tested in pool-wal-recovery.test.ts. ✅

Multiple corrupted recoveries: the random suffix (added in feedback commit) ensures distinct quarantine names per recovery attempt. ✅


Test assessment

lbug-config-wal.test.ts:

  • 4 WAL-positive cases (including the exact Runtime exception: Corrupted wal file. Read out invalid WAL record type. #1402 error text) ✅
  • 6 WAL-negative cases (including permission-path-with-wal and schema-mismatch-in-WAL — added in feedback commit) ✅
  • 4 non-string input cases ✅
  • Exact 9-argument constructor call assertion (positional regression harness) ✅
  • Tests would fail before this PR — functions didn't exist. ✅

pool-wal-recovery.test.ts:

  • Quarantine + retry success (asserts fs.rename, 2 createLbugDatabase calls, stderrWriteMock message) ✅
  • Lock error does not quarantine ✅
  • Retry-also-fails produces gitnexus analyze message ✅
  • Failed recovery does not cache bad DB (isLbugReady returns false) ✅
  • ENOENT on .wal file produces actionable error ✅
  • Gap: no test asserting restoreStdout is called after failed db.init() (see Findings)

mcp-wal-feedback.test.ts:

  • impact WAL suggestion ✅
  • cypher WAL suggestion ✅
  • context WAL suggestion — present (added in feedback commit; gap from prior review resolved) ✅
  • Non-WAL impact does not include WAL suggestion ✅
  • context non-WAL throw preservation ✅
  • GHAS alert (LocalBackend() superfluous arg) resolved in 236c204

Overall: Tests are well-scoped, use isolated mock paths, reset state in beforeEach/afterEach, and validate the critical branching logic. The mock seam is at the right abstraction level. Integration-level constructor-arg verification is not possible from unit tests but is proxied by exact positional test + green CI.


Hidden Unicode / hygiene assessment

Bidi/direction controls (U+202A–U+202E, U+2066–U+2069): none found in any of the 6 changed files. ✅

Non-ASCII characters: em-dashes (U+2014) in comments (pool-adapter.ts, lbug-config.ts) and one test assertion string (mcp-wal-feedback.test.ts:102 — mirrors the pre-existing impact fallback string). No invisible whitespace, no null bytes, no bidi controls in executable code, regex literals, or test assertion strings. Clean.


Elegance / maintainability assessment

The change is well-scoped: three new functions (openReadOnlyDatabase, tryQuarantineAndReopen, isWalCorruptionError), one constant (WAL_RECOVERY_SUGGESTION), and thin MCP wrappers. The _contextImpl / _impactImpl extraction is clean — original logic is untouched, the outer wrapper adds only a try/catch. No auto-reindex logic, no analyze-time repair, no over-engineered recovery framework.

Future LadybugDB native upgrades should audit positional argument counts in createLbugDatabase(). The exact-match constructor test provides a regression harness for that. The as any cast is isolated to one call site with a documented justification.

The realStderrWrite / stderrWriteMock seam is clean — no raw process.stderr used, consistent with the existing MCP stdio architecture.


Final verdict

✅ production-ready with minor follow-ups

All hard blockers are clean: restoreStdout() is correctly placed in finally; failed DB handles are closed before any rethrow; dbCache cannot hold a bad entry; initPromises is cleaned in finally; WAL detection does not misclassify lock, permission, schema, or generic errors; the effective PR diff is scoped to the 6 WAL-recovery files; CI (including GHAS/CodeQL) is fully green on the latest head.

The two remaining follow-up items are: (1) adding an explicit test asserting restoreStdout is called after db.init() failure — the implementation is correct but the test gap means future regressions in that path go undetected; and (2) adding an in-band stale-data indicator to the first MCP response after a successful quarantine — the current realStderrWrite warning is correct but invisible to LLM clients. Neither blocks merge.

One prior-review finding has been fully addressed in the feedback commit (d8378b1): the context WAL positive test is now present, the random suffix was added to quarantine names, and the realStderrWrite stale-graph warning was added. The branch is merge-safe.


· PR branch ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Runtime exception: Corrupted wal file. Read out invalid WAL record type.

3 participants