Skip to content

fix(lbug): keep serve stable when sidecars are missing#1747

Merged
magyargergo merged 8 commits into
abhigyanpatwari:mainfrom
ChamHerry:codex/lbug-sidecar-recovery-20260521
May 21, 2026
Merged

fix(lbug): keep serve stable when sidecars are missing#1747
magyargergo merged 8 commits into
abhigyanpatwari:mainfrom
ChamHerry:codex/lbug-sidecar-recovery-20260521

Conversation

@ChamHerry

Copy link
Copy Markdown
Contributor

Summary

  • Fixes the Express 5 private-network preflight wildcard route so gitnexus serve can start reliably.
  • Adds shared LadybugDB missing-shadow WAL sidecar preflight/quarantine/finalization helpers.
  • Reuses the helper from both direct and pooled LadybugDB adapters with targeted regression coverage.

Why

Local serving could surface repeated lbug.shadow missing-sidecar warnings after interrupted writes or checkpoint edge cases. The recovery now happens before read-only open for safe tiny orphan WALs, while larger WALs are preserved for explicit recovery.

Validation

  • cd gitnexus && npx vitest run test/unit/sidecar-recovery.test.ts test/unit/lbug-adapter-wal-schema.test.ts test/unit/pool-wal-recovery.test.ts test/unit/web-ui-serving.test.ts && npx tsc --noEmit
  • Pre-commit hook: eslint/prettier + gitnexus typecheck

Follow-up

The i18n changes are intentionally kept on a separate stacked branch (ChamHerry:codex/i18n-20260521). They should be submitted to upstream after this PR lands or after an upstream base branch is available, so reviewers do not see this stability diff duplicated in the i18n PR.

Shared missing-shadow WAL recovery prevents repeated read-only open warnings when LadybugDB sidecars are absent, while the Express preflight fix keeps `gitnexus serve` compatible with Express 5 route parsing.

Constraint: LadybugDB read-only replay can require a `.shadow` sidecar that may be absent after interrupted writes or checkpoint edge cases.
Rejected: keep reactive WARN-only quarantine in each adapter | it leaves repeated user-visible warnings and duplicate recovery behavior.
Confidence: high
Scope-risk: broad
Directive: Do not silently delete large orphan WALs; only quarantine tiny orphan WALs before open and keep large WALs for explicit recovery.
Tested: cd gitnexus && npx vitest run test/unit/sidecar-recovery.test.ts test/unit/lbug-adapter-wal-schema.test.ts test/unit/pool-wal-recovery.test.ts test/unit/web-ui-serving.test.ts && npx tsc --noEmit
Not-tested: full npm test in this split branch; full unit suite passed on the source branch before PR split.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@vercel

vercel Bot commented May 21, 2026

Copy link
Copy Markdown

Someone is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@ChamHerry ChamHerry changed the title Stabilize LadybugDB sidecar recovery fix(lbug): keep serve stable when sidecars are missing May 21, 2026
@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
9410 9409 0 1 479s

✅ All 9409 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 78.11% 30902/39558 N/A% 🟢 ███████████████░░░░░
Branches 66.65% 19635/29459 N/A% 🟢 █████████████░░░░░░░
Functions 82.28% 3139/3815 N/A% 🟢 ████████████████░░░░
Lines 81.47% 27866/34201 N/A% 🟢 ████████████████░░░░

📋 View full run · Generated by CI

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 8m 23s —— View job


Production-Readiness Review: PR #1747 fix(lbug): keep serve stable when sidecars are missing


1. Review bar for this PR

This PR touches persistent local database state, sidecar recovery/quarantine logic, cross-process initialization, read-only vs writable semantics, pooled connections, and HTTP preflight routing. The bar is strict: confirmed defects in any of those lanes block merge independently. CI passing is necessary but not sufficient.


2. Problem being solved

Problem A: gitnexus serve crashes at startup under Express 5 because app.options('*', ...) is not valid Express 5 syntax (Express 5 rejects '*' as a glob path and throws SyntaxError).

Problem B: After interrupted writes or WAL checkpoint edge cases, lbug.wal can exist without lbug.shadow. A subsequent read-only open (serve, MCP) emits repeated native errors and may fail. Larger WALs should be preserved for explicit recovery; tiny orphan WALs should be silently quarantined (renamed, not deleted) before open.

Problem statement gaps verified:


3. Current PR state

  • State: Open. Head commit 23da8b1 on ChamHerry:codex/lbug-sidecar-recovery-20260521.
  • Changed files: 8 TypeScript files — gitnexus/src/core/lbug/{lbug-adapter.ts,pool-adapter.ts,sidecar-recovery.ts}, gitnexus/src/server/api.ts, and 4 test files. No i18n, workflow, Docker, dependency, parser, or generated-artifact changes visible. Branch hygiene is clean.
  • CI: GitHub Actions CI report (last visible run) shows 9,370 tests passed, typecheck success, E2E success. 1 test skipped (unrelated Ruby closure inference). 2 PR Description Check annotations marked cancelled due to higher-priority re-runs — visible state does not confirm whether these are stale superseded runs or current required-check blockers. Verify live Checks tab before merge.
  • Vercel: Bot requests team-member deployment authorization. Appears non-blocking for merge based on typical repo policy, but confirm against branch protection rules.

4. Merge status and mergeability

Visibility is incomplete. The cancelled PR Description Check annotations should be verified as stale (superseded runs) vs current blockers. Mergeability cannot be confirmed without verifying the live branch protection state. The code review below surfaces blocking defects that prevent merge regardless of CI state.


5. Repository history considered


6. Branch hygiene assessment

Clean. One commit. 8 TypeScript files scoped to gitnexus/src/core/lbug/, gitnexus/src/server/api.ts, and targeted unit tests. No i18n, workflow, Docker, dependency, parser, or generated artifact churn. git diff --name-status HEAD~1...HEAD confirms exactly the 8 files listed in the PR. Unicode bidi/control character scan over all 8 changed files: no hidden characters found.


7. Understanding of the change

sidecar-recovery.ts (new file, 254 lines): Shared helper module. Provides: inspectLbugSidecars (classifies DB+WAL+shadow state), preflightLbugSidecars (quarantines tiny orphan WALs before open, warns on large ones), quarantineWalForMissingShadow (renames .wal to .wal.missing-shadow.{ts}-{rand} — preserves forensics, never deletes), finalizeLbugSidecarsAfterClose (post-close sweep), isMissingShadowSidecarError / shadowSidecarRecoveryMessage, and test-reset export.

lbug-adapter.ts (+191/-40): Adds READ_ONLY_SHADOW_REPLAY_PROBE, reopenReadOnlyAfterMissingShadow, reopenWritableAfterMissingShadow, ensureReadOnlyConnectionUsable, and calls preflightLbugSidecars + finalizeLbugSidecarsAfterClose from doInitLbug / safeClose. Read-only open now runs the probe query before returning the connection. Writable open detects missing-shadow error during schema creation and retries after quarantine.

pool-adapter.ts (+112/-0): Adds openReadOnlyDatabase (preflight + probe + quarantine-and-retry on missing shadow), replayShadowPagesWithWritableOpen (temporary writable open to replay dirty pages), and integrates them into doInitLbug. Uses the same sidecar-recovery.ts helpers as the direct adapter.

api.ts (+2/-1): Replaces app.options('*', ...) (Express 4 glob, throws in Express 5) with app.options(PNA_PREFLIGHT_PATH_REGEX, ...) where PNA_PREFLIGHT_PATH_REGEX = /^\/.*$/. The handler calls next() — the global middleware at line 713 already sets Access-Control-Allow-Private-Network: true on all responses.


8. Findings


FINDING 1 — TOCTOU ENOENT in quarantineWalForMissingShadow pool-adapter callers (confirmed defect, fork PR Codex comment unfixed)

Severity: HIGH

Risk: When two pool-adapter processes (e.g. two concurrent gitnexus serve instances, or an MCP server restarting while serve is initializing) both detect a tiny-orphan-WAL for the same dbPath, the first quarantines successfully and the second gets ENOENT from fs.rename. This propagates as a fatal error, aborting serve initialization even though the condition was already resolved.

Evidence found:

sidecar-recovery.ts:107:

await fs.rename(walPath, quarantinePath);  // throws ENOENT if WAL already gone

No ENOENT guard here. The function doesn't distinguish "WAL vanished because a peer already handled it" from "WAL path is wrong."

Callers and their ENOENT handling:

Call site ENOENT handling
finalizeLbugSidecarsAfterClose (sidecar-recovery.ts:199) catch (err) { if (!missing(err)) { warnOnce(...) } }
preflightLbugSidecars (sidecar-recovery.ts:145) ❌ no try-catch → propagates to doInitLbug
reopenReadOnlyAfterMissingShadow (lbug-adapter.ts:459) catch { throw new Error(shadowSidecarRecoveryMessage) } → ENOENT becomes "rebuild index" error
reopenWritableAfterMissingShadow (lbug-adapter.ts:486) ❌ same pattern
openReadOnlyDatabase in pool (pool-adapter.ts:388) catch { throw new Error(shadowSidecarRecoveryMessage) }
replayShadowPagesWithWritableOpen in pool (pool-adapter.ts:351) ❌ same pattern

Key distinction: The direct adapter (lbug-adapter.ts) wraps preflightLbugSidecars inside acquireInitLock (lbug-adapter.ts:703), a cross-process file lock (lbug-adapter.ts:310). This serializes concurrent direct-adapter opens for the same dbPath, making the TOCTOU window effectively zero there.

The pool adapter (pool-adapter.ts) uses only initPromises (in-process deduplication, line 471). No acquireInitLock call exists anywhere in pool-adapter.ts. Two pool-adapter processes opening the same dbPath concurrently are unprotected. The openReadOnlyDatabase and replayShadowPagesWithWritableOpen paths are directly exposed to the TOCTOU race.

This is the exact defect the fork PR's Codex review identified, and it remains unaddressed.

Recommended fix: Either (a) guard quarantineWalForMissingShadow internally — catch ENOENT from fs.rename, re-inspect, and return success if the WAL is already gone; or (b) wrap each call site's catch block to distinguish ENOENT (non-fatal, WAL already handled) from EACCES/EPERM (still fatal, permission problem). Option (a) is more defensive. Keep ENOENT non-fatal only when re-inspection confirms the WAL is gone.

Blocks merge: YES — pool adapter lacks cross-process init lock, and ENOENT from quarantine rename is an unhandled fatal error reachable from concurrent pool opens.

Fix this →


FINDING 2 — Large WAL silently quarantined via writable schema-creation path, bypassing preflight policy

Severity: MEDIUM

Risk: preflightLbugSidecars intentionally warns but does NOT quarantine large orphan WALs (kind: 'orphan-wal', size > 4 KB) — the code at sidecar-recovery.ts:153-160 is explicit about this. However, after open, runSchemaCreationQueries (lbug-adapter.ts:555) can catch isMissingShadowSidecarError and return it to doInitLbug:759, which then calls reopenWritableAfterMissingShadow (lbug-adapter.ts:481), which calls quarantineWalForMissingShadow unconditionally — no size check. A large orphan WAL that triggers a missing-shadow error during the first DDL write will be quarantined silently without the user seeing the preflight warning.

Evidence found: lbug-adapter.ts:759-773:

const missingShadowError = await runSchemaCreationQueries(dbPath);
if (missingShadowError) {
  await safeClose();
  resetOpenConnectionState();
  const reopened = await reopenWritableAfterMissingShadow(dbPath, missingShadowError);
  // ...
}

reopenWritableAfterMissingShadow (lbug-adapter.ts:481-496) quarantines without checking WAL size.

Mitigating factor: The quarantined WAL is renamed to .wal.missing-shadow.{ts}-{rand}, not deleted. listQuarantinedMissingShadowWals allows recovery. So no silent permanent data loss occurs — the WAL is findable.

Recommended fix: In reopenWritableAfterMissingShadow, check WAL size before quarantining. If the WAL is large, fail with an actionable message ("large WAL found without shadow sidecar; manual recovery required") instead of quarantining. Alternatively, emit a WARN-level message before the rename explaining that a large WAL was quarantined during write recovery, so the user is not surprised.

Blocks merge: MAYBE — the WAL is preserved (not deleted), so there is no silent permanent data loss. However, the stated DoD criterion "large WAL files are preserved for explicit recovery and never silently deleted/quarantined in a way that hides user data loss" is inconsistent with the current writable recovery path. Whether the rename-to-quarantine-dir counts as "preserved for explicit recovery" is a judgment call. Treat as a required follow-up at minimum.

Fix this →


FINDING 3 — No TOCTOU test for quarantine rename ENOENT in adapter-level callers

Severity: MEDIUM

Risk: No test verifies behavior when fs.rename in quarantineWalForMissingShadow throws ENOENT from a pool-adapter path (openReadOnlyDatabase, replayShadowPagesWithWritableOpen). The existing test at pool-wal-recovery.test.ts:247 covers ENOENT during the corrupt WAL quarantine path (.wal.corrupt.), not the missing-shadow quarantine path (.wal.missing-shadow.).

Evidence found: pool-wal-recovery.test.ts:247-258 tests ENOENT in tryQuarantineAndReopen (corrupt WAL path). No equivalent test exists for openReadOnlyDatabase's quarantineWalForMissingShadow call at pool-adapter.ts:388, nor for replayShadowPagesWithWritableOpen at pool-adapter.ts:351.

Recommended fix: Add a test where connectionQueryMock rejects with the missing-shadow error AND fs.rename rejects with ENOENT. Assert that initLbug either succeeds (ideal) or fails with an informative message (not a raw ENOENT stack trace).

Blocks merge: NO independently, but this gap is what allowed Finding 1 to ship undetected.


FINDING 4 — Error string matching in isMissingShadowSidecarError and isReadOnlyShadowReplayError is brittle

Severity: LOW

Risk: isMissingShadowSidecarError matches /Cannot open file .*\.shadow: No such file or directory/i (sidecar-recovery.ts:64). isReadOnlyShadowReplayError matches /replay shadow pages under read-only mode/i (pool-adapter.ts:317, lbug-adapter.ts:451). These are hardcoded native LadybugDB message fragments. A LadybugDB version bump that rewrites these strings (e.g., changes "Cannot open file" to "Failed to open file", or localizes error messages) would silently break recovery — the missing-shadow errors would no longer be caught, and serve would emit the native error raw.

Evidence found: Tests use realistic strings matching the current LadybugDB output (e.g., pool-wal-recovery.test.ts:175: "IO exception: Cannot open file ${dbPath}.shadow: No such file or directory"). No LadybugDB version pinned to a specific string contract is visible. The @ladybugdb/core version in gitnexus/package.json is the source of truth (not read in this review — verify directly).

Recommended fix: Add a comment citing the LadybugDB version(s) and source file where these messages originate. If LadybugDB provides structured error codes, prefer those. At minimum, widen the shadow-missing regex to tolerate minor variations (e.g., /.shadow.*[Nn]o such file|[Cc]annot open.*\.shadow/).

Blocks merge: NO — tests use realistic strings; immediate risk is low. Mark as a follow-up before the next LadybugDB major version bump.


FINDING 5 — Express preflight test does not verify PNA header end-to-end

Severity: LOW

Risk: web-ui-serving.test.ts:328-333 verifies that app.options(PNA_PREFLIGHT_PATH_REGEX, ...) does not throw at registration time, and that a real OPTIONS request returns 404 (because no CORS/PNA middleware is set up in the test app). This confirms the route registration fix but does not verify that Access-Control-Allow-Private-Network: true is returned in production, or that an unauthorized origin does not receive the private-network grant.

Evidence found: The global middleware setting Access-Control-Allow-Private-Network: true (api.ts:713-716) runs on all requests regardless of origin, meaning even non-allowlisted origins receive the PNA header. The CORS origin callback only controls Access-Control-Allow-Origin. A browser receiving the PNA header without Access-Control-Allow-Origin will still block the request — so there is no exploitable CORS bypass. The security boundary is intact.

Recommended fix: Add a test that instantiates the full CORS + PNA middleware stack and verifies Access-Control-Allow-Private-Network: true is returned on OPTIONS requests. Not blocking, but would complete the regression coverage claimed in the PR.

Blocks merge: NO — security boundary is maintained by existing CORS middleware; this is a coverage gap, not a defect.


FINDING 6 — warnedKeys module-level global survives across serve restarts within a process

Severity: INFO

Risk: warnedKeys at sidecar-recovery.ts:19 is a Set<string> in module scope. A missing-shadow warning for a given dbPath will fire at most once per process lifetime. If serve recovers and the condition recurs, subsequent warnings are silently downgraded to debug. This is intentional anti-spam behavior, but if conditions fluctuate (e.g., repeated checkpoint failures), users may miss persistent problems.

_resetSidecarRecoveryWarningsForTest is exported for test cleanup. Production has no reset path.

Blocks merge: NO — intentional design decision, test reset provided. Worth a follow-up if serve becomes long-lived and users report persistent sidecar issues being silent.


9. PR-specific assessment sections

LadybugDB/lbug sidecar expert lane

State machine reconstructed:

State Preflight action Open action Writable schema action Post-close action
clean (no wal, no shadow) return clean open normally run schema queries no-op
wal + shadow (healthy WAL) return wal-with-shadow LadybugDB replays normally run schema queries no-op
tiny orphan wal (≤ 4KB, no shadow) quarantine WAL (rename) → re-inspect → clean open normally run schema queries quarantine if still present
large orphan wal (> 4KB, no shadow) warn, return orphan-wal, do NOT quarantine LadybugDB replay attempt quarantine if schema query triggers missing-shadow ← inconsistency warn, preserve
missing shadow after open (probe error) N/A probe → isMissingShadowSidecarError → quarantine + reopen no-op
shadow replay needed (read-only) N/A probe → isReadOnlyShadowReplayError → writable open → re-read-only no-op

The state machine is largely sound. The inconsistency in the "large orphan wal → writable schema" cell is Finding 2. All other transitions are correct.

Quarantine file naming: ${dbPath}.wal.missing-shadow.${Date.now()}-${Math.random().toString(36).slice(2)} — timestamp + random suffix. Deterministic enough for forensic debugging (timestamp identifies when quarantine occurred). Collision-safe: the random 5-7 char suffix prevents collisions on rapid concurrent calls. listQuarantinedMissingShadowWals correctly enumerates these files.

EACCES/EPERM handling: finalizeLbugSidecarsAfterClose correctly re-raises non-ENOENT errors via warnOnce. The adapter-level callers swallow all errors in the quarantine catch block (converting them to "rebuild index" messages), which may hide EACCES/EPERM. This is acceptable for the recovery path but could make permission problems harder to diagnose.

Windows handle-release timing: safeClose in lbug-adapter.ts:1542 explicitly probes Windows handle release before calling finalizeLbugSidecarsAfterClose. The pool adapter closeOne uses conn.close().catch(() => {}) (fire-and-forget) with no Windows probe — this is a pre-existing pattern, not introduced by this PR.

Direct adapter lane

preflightLbugSidecars is called at doInitLbug:741 inside the acquireInitLock critical section (doInitLbug:703). This cross-process lock serializes concurrent direct-adapter opens for the same dbPath, effectively preventing the TOCTOU race in this path. ✓

ensureReadOnlyConnectionUsable (lbug-adapter.ts:498) is also inside the lock. Its connection handle lifecycle is correct:

  • Initial handle closed before writable open (closeLbugConnection(handle) at 507, 511, 516) ✓
  • Writable handle closed in finally block at 528-530 ✓
  • Reopened handle closed on probe failure at 540 ✓

runSchemaCreationQueries (lbug-adapter.ts:555) correctly distinguishes:

  • isMissingShadowSidecarError → returns error for caller to handle
  • isWalCorruptionError → calls safeClose() + resetOpenConnectionState() + throws with actionable message ✓
  • Other errors: logs WARN if not "already exists" / busy / read-only ✓

resetOpenConnectionState (lbug-adapter.ts:548) clears currentDbPath, ftsLoaded, vectorExtensionLoaded, ensuredFTSIndexes — state reset is consistent on error paths. ✓

FTS: read-only open uses policy: 'load-only' at lbug-adapter.ts:780. Schema DDL is skipped for read-only opens (if (!readOnly) { ... runSchemaCreationQueries ... } at line 758). Read-only path does not trigger network INSTALL. ✓

safeClose (lbug-adapter.ts:1514): calls flushWAL then closes conn then db, then Windows handle probe, then finalizeLbugSidecarsAfterClose. Sequence is correct; no resources left open on the happy path. ✓

queryAndDrain resource management: lbug-adapter.ts:442 always drains (await drainQueryResult(queryResult)) before returning. Not verified from the implementation of drainQueryResult in this review, but callers depend on it consistently. No obvious leak found.

Pool adapter lane

The pool adapter does not use acquireInitLock. In-process deduplication via initPromises (pool-adapter.ts:454) is correct for single-process use. Cross-process protection is absent — this is the source of Finding 1.

openReadOnlyDatabase (pool-adapter.ts:367): preflight + read-only open + probe + quarantine-on-shadow-missing + second preflight + second open + second probe. Connection handle lifecycle in the missing-shadow path (pool-adapter.ts:384-407):

  • db.close() called when probe fails at 385 ✓
  • db set to undefined at 386 ✓
  • On final throw at 423, if (db) await db.close()

replayShadowPagesWithWritableOpen (pool-adapter.ts:342): opens a writable DB, probes, closes in finally. Correct handle management. ✓

Pool integrity check at pool-adapter.ts:698-705: throws if totalConns < MAX_CONNS_PER_REPO, catching connection leaks during pool warmup. ✓

A single bad WAL sidecar does NOT poison the pool — doInitLbug throws before the pool entry is registered (pool entry registered last, at line 602). Concurrent callers waiting on initPromises will all see the same error and each retry independently. ✓

Express 5 serve startup lane

The fix: api.ts:133: export const PNA_PREFLIGHT_PATH_REGEX = /^\/.*$/; used in place of the Express 4 '*' glob. Express 5 accepts regex routes. The regex matches all paths starting with / — correct scope for a global OPTIONS handler.

Behavior preserved: The handler at api.ts:722-724 calls next(). The Access-Control-Allow-Private-Network: true header is already set by global middleware at api.ts:713. This is correct — the OPTIONS handler does not need to set headers itself; it just lets the CORS middleware (which runs on all methods including OPTIONS) handle the response.

CORS security boundary: isAllowedOrigin (api.ts:60) returns false for non-allowlisted origins. The cors() middleware at api.ts:701 uses this. The PNA header (Access-Control-Allow-Private-Network: true) is sent to all origins by the global middleware, but browsers only act on it when Access-Control-Allow-Origin is also present — and that is controlled by the CORS callback. No security regression introduced. ✓

Test at web-ui-serving.test.ts:328-333: Verifies startup no longer crashes. The test makes a real OPTIONS request and gets 404 (no routes registered on the minimal test app). This confirms the route registration fix without false positives. Gap noted in Finding 5.

Test coverage lane

sidecar-recovery.test.ts (real FS, no mocks):

  • inspectLbugSidecars: clean state ✓, wal+shadow ✓
  • Preflight: tiny orphan WAL quarantined without WARN ✓, large orphan WAL not quarantined + WARN ✓
  • Finalize: tiny orphan WAL quarantined ✓
  • Disable via env var ✓
  • listQuarantinedMissingShadowWals filter ✓
  • Gap: No TOCTOU test (ENOENT from quarantine rename). No concurrent open simulation.

pool-wal-recovery.test.ts:

  • Corrupt WAL quarantine + retry ✓, shadow replay ✓, missing shadow quarantine + reopen ✓, lock error no-quarantine ✓, retry failure error message ✓, no poisoned state after failure ✓
  • Line 247 test ("handles quarantine gracefully when .wal file does not exist"): Tests ENOENT during the corrupt WAL quarantine path (tryQuarantineAndReopen). This path does handle ENOENT (it throws, and the test asserts rejects.toThrow(/gitnexus analyze/)). This test does NOT cover the missing-shadow quarantine ENOENT path in openReadOnlyDatabase.

lbug-adapter-wal-schema.test.ts:

  • Writable missing-shadow quarantine + retry schema creation ✓
  • Read-only open skips DDL, uses load-only FTS ✓
  • Shadow replay with temp writable open ✓
  • Read-only missing-shadow quarantine + reopen ✓
  • WAL corruption mid-schema → safeClose called ✓
  • Gap: No test for ENOENT from quarantine rename in reopenReadOnlyAfterMissingShadow or reopenWritableAfterMissingShadow.

web-ui-serving.test.ts:

  • Express 5 route registration no-throw ✓
  • Real OPTIONS request handled without crash ✓
  • Gap: No end-to-end test for Access-Control-Allow-Private-Network header presence.

Overall: Tests prove the happy-path recovery and the large-WAL non-quarantine behavior. The TOCTOU error path (Finding 1) is not tested, which is why the defect shipped.

Cross-domain lane

Both changes (Express 5 route, lbug sidecar recovery) are framed as gitnexus serve startup stability. They are causally connected at the serve-startup level. The Express route fix is a 2-line change with no interaction with the lbug changes. Combining them in one PR is acceptable given the scope, but the lbug sidecar changes are substantially larger and independently risky. If the Express fix is urgently needed, it could be fast-tracked separately. As-is, the combined PR is structurally acceptable — but the blocking lbug defect (Finding 1) prevents the whole PR from merging.


10. Back-and-forth avoided by verifying

  • Init lock scope verified: acquireInitLock is a cross-process file lock at lbug-adapter.ts:310. It protects the direct-adapter init path including preflightLbugSidecars. This eliminates the TOCTOU risk in the direct-adapter path but does NOT cover the pool adapter.
  • Large WAL not deleted verified: quarantineWalForMissingShadow uses fs.rename, not fs.unlink. Quarantined files are findable via listQuarantinedMissingShadowWals. cleanQuarantinedMissingShadowWals is the only deletion path and is not auto-invoked anywhere in this PR.
  • FTS policy verified: Read-only open uses policy: 'load-only' at both direct adapter (lbug-adapter.ts:780) and pool adapter (pool-adapter.ts:595). No DDL or network INSTALL in read-only paths.
  • State reset verified: resetOpenConnectionState clears all 4 module-level fields (currentDbPath, ftsLoaded, vectorExtensionLoaded, ensuredFTSIndexes). Called on all error paths in doInitLbug.
  • Probe query safety verified: MATCH (n) RETURN n LIMIT 1 is valid Cypher on empty/no-node graphs — returns empty result set, not an error. No schema or extension requirement. No DDL side effect. ✓
  • Branch hygiene verified: git diff --name-status HEAD~1...HEAD confirms exactly 8 files. No i18n pollution from PR feat(i18n): make web and CLI language-aware #1748. Unicode scan clean.

11. Open questions (unavoidable)

  1. Live CI state: The 2 cancelled PR Description Check annotations must be confirmed as stale superseded runs before merge. Cannot verify from local checkout.
  2. LadybugDB version and string contract: The @ladybugdb/core version in gitnexus/package.json was not read. Verify the missing-shadow and shadow-replay error strings are stable across the pinned version range before merge.
  3. Pool adapter cross-process init lock: Was the absence of acquireInitLock in pool-adapter.ts intentional (e.g., "pool is MCP-only and single-process")? If the pool is always single-process, Finding 1's TOCTOU risk is substantially reduced to a theoretical concern — but the ENOENT handling in the recovery callers remains incorrect regardless.

12. Final verdict

not production-ready

Justification: Finding 1 is a confirmed defect: quarantineWalForMissingShadow propagates ENOENT from fs.rename as a fatal error in all pool-adapter callers (openReadOnlyDatabase, replayShadowPagesWithWritableOpen), while finalizeLbugSidecarsAfterClose correctly treats ENOENT as non-fatal. The fork PR's Codex review explicitly flagged this, and it remains unaddressed. Two concurrent pool-adapter processes opening the same DB path can race on the WAL quarantine, causing serve initialization to abort with a spurious "rebuild the index" error even though the WAL was already handled. Finding 2 (large WAL quarantined during writable schema creation despite preflight deliberately warning-only for large WALs) is a secondary concern — the WAL is preserved via rename, but the behavior is inconsistent with the stated design goal. Both findings require small, targeted fixes. The Express 5 route fix and the overall sidecar recovery architecture are sound and well-tested, and both fixes should be unblocked once ENOENT handling is added to the quarantine rename call sites in the pool adapter paths.


@magyargergo

Copy link
Copy Markdown
Collaborator

@ChamHerry Could you please look into Claude's findings and address them? 🙏

@magyargergo

Copy link
Copy Markdown
Collaborator

I'm working on it now! This is an important fix and thank you for raising it! 🙏

@magyargergo magyargergo self-assigned this May 21, 2026
…aware errors (PR abhigyanpatwari#1747 review)

Addresses the production-readiness review of PR abhigyanpatwari#1747 (Findings 1, 2, 3 of 6).
Findings 4, 5, 6 are deferred to follow-ups per the plan.

1. ENOENT-tolerance scoped to pool-adapter callers only
   - `quarantineWalForMissingShadow` stays strict in `sidecar-recovery.ts`.
     The direct adapter calls it inside `acquireInitLock` (cross-process
     file lock) — ENOENT there means the file vanished under lock and
     remains a real bug to surface.
   - New `tryQuarantineForMissingShadow` local helper in `pool-adapter.ts`
     returns a discriminated union { kind: 'quarantined', path } |
     { kind: 'peer-handled' }. Catches ENOENT, re-verifies via
     statIfExists, and converts to 'peer-handled' only when WAL really
     is gone. Defensive: if ENOENT but WAL still present, throws as
     classified error rather than silently returning success.

2. Symmetric WAL-size gate on both recovery paths
   - `refuseLargeWalQuarantine` applied in both
     `reopenReadOnlyAfterMissingShadow` and
     `reopenWritableAfterMissingShadow`. Closes the read-only data-loss
     vector (large orphan WAL silently discarded would never be replayed
     by a later writable open).

3. Permission-aware error classifier
   - New `renameFailureMessage` and `isPermissionRenameError` in
     `sidecar-recovery.ts`. EACCES / EPERM / EBUSY now surface a
     permission-specific message pointing at ACLs, AV exclusions, and
     file-locks. Other codes (ENOSPC, EROFS, EIO, ENOENT) fall through
     to `shadowSidecarRecoveryMessage`.
   - Used at both pool-adapter and direct-adapter caller catches around
     `quarantineWalForMissingShadow`.
   - `doInitLbug`'s pass-through classifier extended to include the new
     permission message. The lock-retry substring match tightened so
     "file-lock error" in the permission message is not mistaken for a
     LadybugDB lock-retry trigger.

Tests
   - sidecar-recovery.test.ts: 7 new tests for `renameFailureMessage` and
     `isPermissionRenameError`.
   - pool-wal-recovery.test.ts: 6 new tests covering ENOENT race,
     EACCES/EPERM/EBUSY classification, ENOSPC fallthrough, and the
     defensive "WAL still present after ENOENT" branch.
   - lbug-adapter-wal-schema.test.ts: 5 new tests covering the symmetric
     size gate on both recovery paths, including the boundary at exactly
     TINY_ORPHAN_WAL_BYTES (4096) and the off-by-one at 4097.

Deferred (tracked as follow-up work)
   - Brittle LadybugDB error-string matching (Finding 4).
   - PNA header end-to-end coverage gap (Finding 5).
   - warnedKeys module-global persistence (Finding 6).
   - Cross-process init lock for pool-adapter.
# Conflicts:
#	gitnexus/src/server/api.ts
…am (PR abhigyanpatwari#1747 review, Findings 4 & 6)

Smallest viable response to the two remaining non-blocking findings from the
production-readiness review of PR abhigyanpatwari#1747. An earlier-revision plan proposed
regex widening + a near-miss detector + per-dbPath warn scoping; an
adversarial doc-review found those defended against hypothetical strings
LadybugDB does not produce, added observability theater with no recovery
behavior change, and did not actually fix the long-running gitnexus serve
case for hot dbPaths (where finalizeLbugSidecarsAfterClose rarely fires).
Scope shrunk to dedup + counter-based — strictly behavior-changing and
fully testable.

Finding 4 — dedup + version-coupling markers
   - `isReadOnlyShadowReplayError` was inlined in both `lbug-adapter.ts:451`
     and `pool-adapter.ts:317`. Centralized as an export from
     `sidecar-recovery.ts`. The two local copies are removed; both adapters
     now import from the shared module.
   - Both LadybugDB-coupled predicates (`isMissingShadowSidecarError` and
     `isReadOnlyShadowReplayError`) gain a `// LADYBUGDB-CONTRACT:` marker
     comment citing `@ladybugdb/core ^0.16.1`. When bumping LadybugDB,
     `git grep "LADYBUGDB-CONTRACT"` enumerates every version-coupled spot.
   - Strict matcher unchanged — when LadybugDB actually changes the error
     format, the failure mode stays loud (raw native error propagates) and
     the markers make every affected predicate trivially greppable.

Finding 6 — counter-based warn anti-spam
   - `warnedKeys: Set<string>` → `warnedKeyCounts: Map<string, number>`.
     `warnOnce` keeps its signature `(logger, key, message)` and keying
     convention unchanged — the swap is internal.
   - `WARN_MILESTONES = [1, 10, 100, 1000, 10000]`. Logarithmic spacing
     gives O(log N) warns for a condition that fires N times. Past the
     first occurrence the warn message is suffixed with "(Nth occurrence
     of this condition)" so persistence is visible in the log line itself.
   - Solves the long-running serve case: a hot dbPath hitting the same
     condition 100 times now fires 3 warns (occurrences 1, 10, 100)
     instead of 1 warn + 99 silent debug lines.

Tests (10 new in sidecar-recovery.test.ts, all green)
   - Centralized isReadOnlyShadowReplayError: positive match, false-positive
     guard, structural assertion that the duplicate regex is gone from both
     adapter files, LADYBUGDB-CONTRACT marker count.
   - Counter-based warnOnce: milestone-at-10 with suffix, milestone-at-100,
     key isolation across dbPaths, reset zeroes the counter, first-occurrence
     message does NOT carry the suffix.

Deferred (tracked separately)
   - Finding 5 — PNA header end-to-end coverage gap (CORS boundary is sound).
   - LadybugDB structured error codes (if/when the library exposes them).
   - Per-call milestone configurability — re-open if tuning is needed.
@github-actions

Copy link
Copy Markdown
Contributor

✨ PR Autofix

Found fixable formatting / unused-import issues across 96 changed lines. Comment /autofix on this PR to apply them, or run npm run lint:fix && npm run format locally.

{"schema":"gitnexus.pr-autofix/v2","state":"fixes-available","pr_number":1747,"changed_lines":96,"head_sha":"3d70660b522ca7fb6cdec344d2cac5fa64a29294","run_id":"26221379665","apply_command":"/autofix"}

@magyargergo

Copy link
Copy Markdown
Collaborator

/autofix

@github-actions

Copy link
Copy Markdown
Contributor

✅ Applied autofix and pushed a commit. (apply run)

@magyargergo magyargergo merged commit a9fef2c into abhigyanpatwari:main May 21, 2026
30 of 31 checks passed
ChamHerry pushed a commit to ChamHerry/GitNexus that referenced this pull request May 21, 2026
Add shared CLI translation resources and web i18next/react-i18next wiring so GitNexus can present core command and UI text in English or Chinese while preserving existing defaults.

Constraint: PR abhigyanpatwari#1748 was previously stacked on obsolete abhigyanpatwari#1747 work, so this branch is rebuilt from upstream/main and contains only i18n changes.\nRejected: Keeping the old stacked branch | it carried obsolete lbug sidecar commits and stayed conflicting after abhigyanpatwari#1747 merged.\nConfidence: medium\nScope-risk: broad\nDirective: Keep future i18n follow-ups separated from native-worker and lbug recovery fixes; update both locale trees together when adding translatable UI text.\nTested: GitNexus impact analysis for App, Header, useAppState, cleanCommand, doctorCommand; GitNexus detect_changes on staged linked worktree; CLI i18n vitest targets; web i18n vitest targets; gitnexus npx tsc --noEmit; gitnexus-web npx tsc -b --noEmit; gitnexus-web npm test; gitnexus-web npm run build; gitnexus npm run build.\nNot-tested: Browser visual/manual language switching; production Vercel deployment; full gitnexus npm test ended with Vitest unhandled simulated worker-pool startup crash after all 355 files and 9180 assertions passed.
ChamHerry pushed a commit to ChamHerry/GitNexus that referenced this pull request May 21, 2026
Add shared CLI translation resources and web i18next/react-i18next wiring so GitNexus can present core command and UI text in English or Chinese while preserving existing defaults. Keep the heartbeat reconnect banner on a dedicated translation key so disconnect recovery remains distinguishable from initial server connection failures.

Constraint: PR abhigyanpatwari#1748 was previously stacked on obsolete abhigyanpatwari#1747 work, so this branch is rebuilt from upstream/main and contains only i18n changes.
Rejected: Keeping the old stacked branch | it carried obsolete lbug sidecar commits and stayed conflicting after abhigyanpatwari#1747 merged.
Confidence: medium
Scope-risk: broad
Directive: Keep future i18n follow-ups separated from native-worker and lbug recovery fixes; update both locale trees together when adding translatable UI text.
Tested: GitNexus impact analysis for App, Header, useAppState, cleanCommand, doctorCommand, and AppContent; GitNexus detect_changes on the linked worktree; CLI i18n vitest targets; web i18n vitest targets; gitnexus npx tsc --noEmit; gitnexus-web npx tsc -b --noEmit; gitnexus-web npm test; gitnexus-web npm run build; gitnexus npm run build; local Playwright heartbeat-reconnect e2e spec.
Not-tested: Full browser visual/manual language switching; production Vercel deployment; full gitnexus npm test ended with Vitest unhandled simulated worker-pool startup crash after all 355 files and 9180 assertions passed.
@ChamHerry ChamHerry deleted the codex/lbug-sidecar-recovery-20260521 branch May 21, 2026 15:47
ChamHerry pushed a commit to ChamHerry/GitNexus that referenced this pull request May 21, 2026
Add shared CLI translation resources and web i18next/react-i18next wiring so GitNexus can present core command and UI text in English or Chinese while preserving existing defaults. Keep the heartbeat reconnect banner on a dedicated translation key so disconnect recovery remains distinguishable from initial server connection failures.

Constraint: PR abhigyanpatwari#1748 was previously stacked on obsolete abhigyanpatwari#1747 work, so this branch is rebuilt from upstream/main and contains only i18n changes.
Rejected: Keeping the old stacked branch | it carried obsolete lbug sidecar commits and stayed conflicting after abhigyanpatwari#1747 merged.
Confidence: medium
Scope-risk: broad
Directive: Keep future i18n follow-ups separated from native-worker and lbug recovery fixes; update both locale trees together when adding translatable UI text.
Tested: GitNexus impact analysis for App, Header, useAppState, cleanCommand, doctorCommand, and AppContent; GitNexus detect_changes on the linked worktree; CLI i18n vitest targets; web i18n vitest targets; gitnexus npx tsc --noEmit; gitnexus-web npx tsc -b --noEmit; gitnexus-web npm test; gitnexus-web npm run build; gitnexus npm run build; local Playwright heartbeat-reconnect e2e spec.
Not-tested: Full browser visual/manual language switching; production Vercel deployment; full gitnexus npm test ended with Vitest unhandled simulated worker-pool startup crash after all 355 files and 9180 assertions passed.
ChamHerry pushed a commit to ChamHerry/GitNexus that referenced this pull request May 21, 2026
Add shared CLI translation resources and web i18next/react-i18next wiring so GitNexus can present core command and UI text in English or Chinese while preserving existing defaults. Keep the heartbeat reconnect banner on a dedicated translation key so disconnect recovery remains distinguishable from initial server connection failures.

Constraint: PR abhigyanpatwari#1748 was previously stacked on obsolete abhigyanpatwari#1747 work, so this branch is rebuilt from upstream/main and contains only i18n changes.
Rejected: Keeping the old stacked branch | it carried obsolete lbug sidecar commits and stayed conflicting after abhigyanpatwari#1747 merged.
Confidence: medium
Scope-risk: broad
Directive: Keep future i18n follow-ups separated from native-worker and lbug recovery fixes; update both locale trees together when adding translatable UI text.
Tested: GitNexus impact analysis for App, Header, useAppState, cleanCommand, doctorCommand, and AppContent; GitNexus detect_changes on the linked worktree; CLI i18n vitest targets; web i18n vitest targets; gitnexus npx tsc --noEmit; gitnexus-web npx tsc -b --noEmit; gitnexus-web npm test; gitnexus-web npm run build; gitnexus npm run build; local Playwright heartbeat-reconnect e2e spec.
Not-tested: Full browser visual/manual language switching; production Vercel deployment; full gitnexus npm test ended with Vitest unhandled simulated worker-pool startup crash after all 355 files and 9180 assertions passed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants