fix(lbug): recover from WAL corruption by quarantining .wal file (#1402) #1417
@evander-wang is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
562cebb to 6c9ffbc
CI Report
✅ All checks passed
Test Results
✅ All 8269 tests passed; 1 test(s) skipped
Claude finished @magyargergo's task in 6m 25s
PR #1417 Review: WAL Corruption Recovery
Tasks
Review bar for this PR
Current PR state
Branch hygiene / merge-forward assessment
Clean and mergeable. Two merge-forward commits from
Understanding of the change
The PR adds three cooperating pieces:
Findings
[medium] Silent stale-data after successful quarantine
[low] Missing
@evander-wang Can you please look into these findings? 🙏 |
6f243ff to 40d870a
…igyanpatwari#1402)
LadybugDB crashes when the WAL file is corrupted: the open fails with an unrecoverable native error. This makes the pool adapter detect WAL corruption errors, quarantine the offending .wal file, and retry the open. MCP tool responses (cypher, context, impact) now include a recoverySuggestion field when WAL corruption is detected.
Changes:
- Add isWalCorruptionError() regex-based detector in lbug-config.ts
- Add throwOnWalReplayFailure and enableChecksums to createLbugDatabase()
- Extract openReadOnlyDatabase() with stdout silencing + db.init()
- Add tryQuarantineAndReopen() for .wal quarantine + retry in doInitLbug
- Wrap cypher/context/impact with WAL recoverySuggestion in MCP responses
- Share WAL_RECOVERY_SUGGESTION constant across all MCP error paths
- Fix restoreStdout() placement (before db.init() → finally block)
- Add unit tests for detection, pool recovery, and MCP feedback
…bhigyanpatwari#1402) LocalBackend has no constructor — the { registryPath } argument was ignored.
40d870a to d8378b1
fyi, I'm also working on lbug because i want to make it more robust and to avoid getting |
That’s great — really helpful. I haven’t looked through the full code yet, but my understanding is |
Claude finished @magyargergo's task in 6m 6s
PR #1417 Review: WAL Corruption Recovery
Tasks
Review bar for this PR
Current PR state
Branch hygiene assessment
merge-from-main commit present but harmless and merge-safe. The merge commit
Understanding of the change
LadybugDB replays its WAL during
The PR adds three cooperating pieces:
What remains out of scope (intentionally): analyze-time WAL repair, auto-reindex after quarantine, and any write-side recovery.
Findings
[low] No test verifying
| Tool | WAL error | Non-WAL error |
|---|---|---|
| cypher | { error, recoverySuggestion } | { error } |
| context | { error, recoverySuggestion } (returns) | rethrows (pre-existing behavior) |
| impact | { error, suggestion, recoverySuggestion, ... } | { error, suggestion, ... } (pre-existing #321 shape) |
All three tools preserve the error field for existing clients. recoverySuggestion is gated exclusively on isWalCorruptionError(). WAL_RECOVERY_SUGGESTION is defined once and imported — no duplicated strings. ✅
The context wrapper's asymmetry (return for WAL, rethrow for non-WAL) is intentional and matches pre-existing behavior. The feedback-commit added a test for the context WAL positive path (context returns WAL recoverySuggestion on corrupted WAL error) — this gap from the prior review is resolved. ✅
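The response shapes in the table can be sketched as follows. This is a minimal illustration, not the PR's code: the suggestion wording, the stand-in detector, and the helper names `toCypherError` and `withContextWalFeedback` are all assumptions made so the example is self-contained.

```typescript
// Hypothetical wording; the real WAL_RECOVERY_SUGGESTION text is not quoted in this thread.
const WAL_RECOVERY_SUGGESTION =
  "The graph index WAL was corrupted. Run `gitnexus analyze` to rebuild it.";

// Stand-in detector (assumption) so the sketch runs on its own.
const isWalCorruptionError = (err: unknown): boolean =>
  err instanceof Error && /corrupted wal file/i.test(err.message);

interface ToolError {
  error: string;
  recoverySuggestion?: string;
}

// cypher-style: always return an error object; attach the suggestion only for WAL errors.
function toCypherError(err: unknown): ToolError {
  const error = err instanceof Error ? err.message : String(err);
  return isWalCorruptionError(err)
    ? { error, recoverySuggestion: WAL_RECOVERY_SUGGESTION }
    : { error };
}

// context-style: return an error object for WAL errors, rethrow everything else.
async function withContextWalFeedback<T>(
  impl: () => Promise<T>,
): Promise<T | ToolError> {
  try {
    return await impl();
  } catch (err) {
    if (isWalCorruptionError(err)) return toCypherError(err);
    throw err; // non-WAL errors keep their pre-existing behavior
  }
}
```

Because the `error` field is present in both branches, a client that only reads `error` sees no change in shape.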
Data safety / index integrity assessment
Quarantining the WAL discards uncheckpointed writes. This is acknowledged: realStderrWrite emits a stale-graph warning, and WAL_RECOVERY_SUGGESTION tells the user to run gitnexus analyze. The recovered DB is opened read-only and never mutated. No "recovered" marker is persisted to disk (intentional — no design yet for degraded-index tracking). The MCP response on the first query after recovery does not carry an in-band stale-data warning — this is the remaining follow-up item noted in the Findings section above.
The PR body's "No index refresh required" is correct for the general case (no migration needed); gitnexus analyze is the user-facing recovery step for the corrupted-WAL case specifically. There is no contradiction.
Cross-platform filesystem assessment
fs.rename(walPath, quarantineName) is same-directory — avoids EXDEV on any platform. The quarantine name uses only alphanumeric characters, dots, and hyphens — safe on Windows (no reserved characters). If the .wal file does not exist by rename time (e.g., a race with another process), fs.rename throws ENOENT, which is caught and converted to an actionable error message including gitnexus analyze. ✅
Windows locked-file scenario: if LadybugDB holds a file lock on the .wal (rare on read-only open, but possible on some native error paths), fs.rename fails; the error is caught and an actionable message is thrown. This is acceptable — the user is told to gitnexus analyze, which is the correct recovery step regardless. The ENOENT case is explicitly tested in pool-wal-recovery.test.ts. ✅
Multiple corrupted recoveries: the random suffix (added in feedback commit) ensures distinct quarantine names per recovery attempt. ✅
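A rough sketch of the same-directory quarantine rename described above. The exact name format is an assumption (timestamp plus a hex suffix joined by a hyphen), and `quarantineName` and `quarantineWal` are hypothetical helper names. Only `rename` from `node:fs/promises` and `randomBytes` from `node:crypto` are real APIs here.

```typescript
import { rename } from "node:fs/promises";
import { randomBytes } from "node:crypto";

// Build a sibling path such as db.wal.corrupt.<timestamp>-<hex>.
// Only digits, letters, dots, and hyphens are appended, so the name is Windows-safe.
function quarantineName(walPath: string, now: number = Date.now()): string {
  const suffix = randomBytes(4).toString("hex"); // keeps repeated recoveries distinct
  return `${walPath}.corrupt.${now}-${suffix}`;
}

// Move the .wal aside. The target is in the same directory, so no cross-device (EXDEV) move occurs.
async function quarantineWal(walPath: string): Promise<string> {
  const target = quarantineName(walPath);
  try {
    await rename(walPath, target);
    return target;
  } catch (err) {
    // ENOENT (file already gone) or a locked file both end up here.
    const code = (err as NodeJS.ErrnoException).code ?? "unknown";
    throw new Error(
      `Could not quarantine ${walPath} (${code}). Run \`gitnexus analyze\` to rebuild the index.`,
    );
  }
}
```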
Test assessment
lbug-config-wal.test.ts:
- 4 WAL-positive cases (including the exact Runtime exception: Corrupted wal file. Read out invalid WAL record type. #1402 error text) ✅
- 6 WAL-negative cases (including permission-path-with-wal and schema-mismatch-in-WAL — added in feedback commit) ✅
- 4 non-string input cases ✅
- Exact 9-argument constructor call assertion (positional regression harness) ✅
- Tests would fail before this PR — functions didn't exist. ✅
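For orientation, a regex-based detector of the kind these cases exercise might look like the sketch below. The pattern is an assumption (the actual regex in lbug-config.ts is not quoted in this review), as is the choice to accept both strings and Error objects.

```typescript
// Hypothetical pattern; it requires the corruption wording, not just "wal" in a path.
const WAL_CORRUPTION_PATTERN = /corrupted wal file|invalid wal record/i;

function isWalCorruptionError(err: unknown): boolean {
  // Accept a raw message or an Error; any other input is treated as "not WAL corruption".
  const message =
    typeof err === "string" ? err : err instanceof Error ? err.message : undefined;
  return message !== undefined && WAL_CORRUPTION_PATTERN.test(message);
}
```

A permission error whose path merely contains `.wal` does not match, because the pattern keys on the corruption phrases and not on the file extension.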
pool-wal-recovery.test.ts:
- Quarantine + retry success (asserts fs.rename, 2 createLbugDatabase calls, stderrWriteMock message) ✅
- Lock error does not quarantine ✅
- Retry-also-fails produces gitnexus analyze message ✅
- Failed recovery does not cache bad DB (isLbugReady returns false) ✅
- ENOENT on .wal file produces actionable error ✅
- Gap: no test asserting restoreStdout is called after failed db.init() (see Findings)
mcp-wal-feedback.test.ts:
- impact WAL suggestion ✅
- cypher WAL suggestion ✅
- context WAL suggestion: present (added in feedback commit; gap from prior review resolved) ✅
- Non-WAL impact does not include WAL suggestion ✅
- context non-WAL throw preservation ✅
- GHAS alert (LocalBackend() superfluous arg) resolved in 236c204 ✅
Overall: Tests are well-scoped, use isolated mock paths, reset state in beforeEach/afterEach, and validate the critical branching logic. The mock seam is at the right abstraction level. Integration-level constructor-arg verification is not possible from unit tests but is proxied by exact positional test + green CI.
Hidden Unicode / hygiene assessment
Bidi/direction controls (U+202A–U+202E, U+2066–U+2069): none found in any of the 6 changed files. ✅
Non-ASCII characters: em-dashes (U+2014) in comments (pool-adapter.ts, lbug-config.ts) and one test assertion string (mcp-wal-feedback.test.ts:102 — mirrors the pre-existing impact fallback string). No invisible whitespace, no null bytes, no bidi controls in executable code, regex literals, or test assertion strings. Clean. ✅
Elegance / maintainability assessment
The change is well-scoped: three new functions (openReadOnlyDatabase, tryQuarantineAndReopen, isWalCorruptionError), one constant (WAL_RECOVERY_SUGGESTION), and thin MCP wrappers. The _contextImpl / _impactImpl extraction is clean — original logic is untouched, the outer wrapper adds only a try/catch. No auto-reindex logic, no analyze-time repair, no over-engineered recovery framework.
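The overall detect, quarantine, retry flow can be sketched with injected dependencies. This is an illustration under assumptions, not the PR's tryQuarantineAndReopen: the `RecoveryDeps` interface, the function name `openWithWalRecovery`, and the message wording are hypothetical.

```typescript
interface RecoveryDeps<Db> {
  open: () => Promise<Db>;                 // opens the read-only database
  quarantine: () => Promise<string>;       // moves the .wal aside, returns the new path
  isWalCorruption: (err: unknown) => boolean;
  warn: (message: string) => void;         // e.g. a stderr writer
}

async function openWithWalRecovery<Db>(deps: RecoveryDeps<Db>): Promise<Db> {
  try {
    return await deps.open();
  } catch (err) {
    // Lock, permission, and schema errors are not recoverable this way: rethrow untouched.
    if (!deps.isWalCorruption(err)) throw err;

    const movedTo = await deps.quarantine();
    deps.warn(
      `WAL quarantined to ${movedTo}; uncheckpointed writes are lost and the graph may be stale.`,
    );
    try {
      return await deps.open(); // exactly one retry
    } catch (retryErr) {
      throw new Error(
        `Reopen failed after WAL quarantine: ${String(retryErr)}. Run \`gitnexus analyze\` to rebuild the index.`,
      );
    }
  }
}
```

Injecting the four dependencies keeps the branching logic testable without touching the filesystem or the native driver.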
Future LadybugDB native upgrades should audit positional argument counts in createLbugDatabase(). The exact-match constructor test provides a regression harness for that. The as any cast is isolated to one call site with a documented justification.
The realStderrWrite / stderrWriteMock seam is clean — no raw process.stderr used, consistent with the existing MCP stdio architecture.
Final verdict
✅ production-ready with minor follow-ups
All hard blockers are clean: restoreStdout() is correctly placed in finally; failed DB handles are closed before any rethrow; dbCache cannot hold a bad entry; initPromises is cleaned in finally; WAL detection does not misclassify lock, permission, schema, or generic errors; the effective PR diff is scoped to the 6 WAL-recovery files; CI (including GHAS/CodeQL) is fully green on the latest head.
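Two of these invariants (dbCache never holds a failed entry, initPromises is cleaned in finally) can be illustrated with a small deduplicating initializer. The `Db` type and the `initOnce` helper are assumptions for the sketch; only the map names come from the review.

```typescript
type Db = { close(): void };

const dbCache = new Map<string, Db>();
const initPromises = new Map<string, Promise<Db>>();

// Concurrent callers for the same key share one in-flight open.
function initOnce(key: string, open: () => Promise<Db>): Promise<Db> {
  const cached = dbCache.get(key);
  if (cached) return Promise.resolve(cached);

  const pending = initPromises.get(key);
  if (pending) return pending;

  const attempt = Promise.resolve()
    .then(open)
    .then((db) => {
      dbCache.set(key, db); // cached only after a successful open
      return db;
    })
    .finally(() => {
      initPromises.delete(key); // cleared on success and on failure
    });

  initPromises.set(key, attempt);
  return attempt;
}
```

Starting the chain from `Promise.resolve()` means `open` runs only after the pending promise is registered, so even a synchronous throw inside `open` cannot leave a stale entry in `initPromises`.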
The two remaining follow-up items are: (1) adding an explicit test asserting restoreStdout is called after db.init() failure — the implementation is correct but the test gap means future regressions in that path go undetected; and (2) adding an in-band stale-data indicator to the first MCP response after a successful quarantine — the current realStderrWrite warning is correct but invisible to LLM clients. Neither blocks merge.
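The first follow-up concerns a try/finally pattern along these lines. This is a sketch under assumptions: the `Db` shape, the `openSilenced` name, and the injected `silence` function (which returns its own restore callback) are hypothetical stand-ins for openReadOnlyDatabase and restoreStdout.

```typescript
type Db = { init(): Promise<void>; close(): void };

async function openSilenced(
  create: () => Db,
  silence: () => () => void, // silences stdout, returns the restore callback
): Promise<Db> {
  const restoreStdout = silence();
  let db: Db | undefined;
  try {
    db = create();
    await db.init();
    return db;
  } catch (err) {
    db?.close(); // close the failed handle before rethrowing
    throw err;
  } finally {
    restoreStdout(); // runs whether init() resolved or rejected
  }
}
```

A test for the gap could pass a `Db` whose `init()` rejects and assert that the restore callback ran exactly once.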
Prior-review findings have been addressed in the feedback commit (d8378b1): the context WAL positive test is now present, the random suffix was added to quarantine names, and the realStderrWrite stale-graph warning was added. The branch is merge-safe.
Summary
LadybugDB crashes when the WAL file is corrupted during db.init(), and the MCP server returns an unhelpful native error. This PR detects WAL corruption, quarantines the offending .wal file, retries the open, and adds a recoverySuggestion field to MCP tool responses so the LLM can guide users.
Motivation / context
Closes #1402. Without this, any WAL corruption (power loss, disk full, kill -9 during analyze) renders the indexed knowledge graph unreachable until the user manually runs gitnexus analyze. The MCP server has no way to self-recover or to tell the user what to do.
Areas touched
- gitnexus/ (CLI / core / MCP server)
- gitnexus-web/ (Vite / React UI)
- .github/ (workflows, actions)
- eval/ or other tooling
- AGENTS.md, CLAUDE.md, .cursor/, llms.txt, etc.)
Scope & constraints
In scope
- isWalCorruptionError() regex-based WAL corruption detector in lbug-config.ts
- throwOnWalReplayFailure and enableChecksums constructor options in createLbugDatabase()
- tryQuarantineAndReopen(): quarantine .wal → retry open in doInitLbug()
- openReadOnlyDatabase(): unified read-only DB open with stdout silencing
- WAL_RECOVERY_SUGGESTION constant shared across cypher/context/impact error paths
- restoreStdout() placement (was before await db.init(), now in finally)
Explicitly out of scope / not done here
- gitnexus analyze
Implementation notes
- throwOnWalReplayFailure: false so WAL corruption is caught as a thrown error (not a segfault) during db.init()
- *.wal.corrupt.<timestamp> for diagnostic inspection
- db.init() overhead is one-time and deduplicated via initPromises
Testing & verification
- cd gitnexus && npm test (passed: 212/213 files, 5195/5198 tests; 2 pre-existing failures in wiki-flags.test.ts)
- cd gitnexus && npm run test:integration
- cd gitnexus && npx tsc --noEmit (passed via pre-commit hook)
- cd gitnexus-web && npm test
- cd gitnexus-web && npx tsc -b --noEmit
Risk & rollout
- throwOnWalReplayFailure ?? true, enableChecksums: true)
Checklist
- AGENTS.md / overlays changed: headers, scope block, and changelog updated per project conventions