fix(lbug): add WAL checkpoint-threshold control#1772

Merged
magyargergo merged 17 commits into main from copilot/fix-wal-rename-crash
May 22, 2026

Copilot AI commented May 22, 2026

Contributor

Fixes the WAL rename/remove race that crashed large gitnexus analyze runs mid-index when LadybugDB auto-checkpoint hit the default ~16 MiB threshold. Keeps native auto-checkpoint always on (per maintainer requirement) and adds a JS-side manual checkpoint driver that pre-empts the native rotation, so the race window moves into a path where bounded retries can absorb transient FS failures (AV scanners, NTFS shadow copy, file locks).

Behavior

  • Manual checkpoint driver (gitnexus/src/core/lbug/wal-checkpoint-driver.ts): periodic CHECKPOINT issued from JS during analyze, wrapped in 3 attempts with 50 / 200 / 500 ms jittered back-off. Retries only on isLbugCheckpointIoError; every other error surfaces immediately. Opt-out via GITNEXUS_WAL_MANUAL_CHECKPOINT=0. Driver is analyze-only; MCP and other long-lived flows keep relying on the close-time safeClose flush.
  • Default checkpoint threshold raised from -1 (Ladybug stock ~16 MiB) to 67108864 (64 MiB). Users on the default configuration no longer hit the original race, and the recovery hint recommends the same value as the new default.
  • Recovery hint on persistent checkpoint failure recommends the same 64 MiB value plus a 32 MiB fallback for small-disk / CI runners. Hint detection moved into lbug-config.ts next to isWalCorruptionError; matcher uses strict + permissive regex tiers so a future Ladybug message-shape change degrades to the permissive fallback instead of silently breaking. Ladybug version is pinned in a comment beside the regexes.
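
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in wal-checkpoint-driver.ts: `checkpointWithRetry`, its option names, and the jitter formula (base delay scaled by a factor in [0.5, 1.5)) are assumptions, and `isRetriable` stands in for isLbugCheckpointIoError. The sketch also assumes a delay is taken only between attempts, so with three attempts the 500 ms step goes unused.

```typescript
// Hypothetical sketch of a bounded, jittered retry around a manual CHECKPOINT.
// All names here are illustrative assumptions, not the PR's actual API.
type Sleep = (ms: number) => Promise<void>;

interface RetryOptions {
  isRetriable: (err: unknown) => boolean; // stand-in for isLbugCheckpointIoError
  maxAttempts?: number;                   // total attempts, including the first
  delaysMs?: readonly number[];           // base back-off per failed attempt
  random?: () => number;                  // injectable so tests are deterministic
  sleep?: Sleep;                          // injectable so tests do not wait
}

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function checkpointWithRetry(
  runCheckpoint: () => Promise<void>,
  options: RetryOptions,
): Promise<number> {
  const {
    isRetriable,
    maxAttempts = 3,
    delaysMs = [50, 200, 500],
    random = Math.random, // jitter only; not used for anything security-sensitive
    sleep = defaultSleep,
  } = options;

  for (let attempt = 1; ; attempt++) {
    try {
      await runCheckpoint();
      return attempt; // number of attempts it took to succeed
    } catch (err) {
      // Non-checkpoint errors surface immediately; so does budget exhaustion.
      if (!isRetriable(err) || attempt >= maxAttempts) throw err;
      const base = delaysMs[Math.min(attempt - 1, delaysMs.length - 1)];
      await sleep(Math.round(base * (0.5 + random())));
    }
  }
}
```

Injecting `random` and `sleep` keeps a unit suite fast and deterministic, which is one plausible reason for the injectable `randomImpl` mentioned later in this PR.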

Public surface

  • CLI flag: --wal-checkpoint-threshold <bytes> (integer ≥ -1; default 67108864; -1 keeps Ladybug stock ~16 MiB).
  • Env var: GITNEXUS_WAL_CHECKPOINT_THRESHOLD (same semantics). Flag overrides env when both are set — documented in --help.
  • Invalid env value now emits a logger.warn and falls back to the default, closing the previous CLI-vs-env asymmetry (CLI rejects; env silently fell through).
  • Typed RecoveryHint union in gitnexus/src/cli/cli-message.ts enumerating the 8 existing hint tags so this enum can't drift unnoticed. New tag: 'wal-checkpoint-threshold'.
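
As an illustration of the precedence and validation rules listed above, a resolver could look something like the sketch below. The function and constant names are hypothetical and the real parser in gitnexus may be structured differently; only the flag name, env var name, default value, and the "integer ≥ -1" rule come from this PR.

```typescript
// Hypothetical sketch of flag/env resolution for the WAL checkpoint threshold.
const DEFAULT_WAL_CHECKPOINT_THRESHOLD = 64 * 1024 * 1024; // 67108864 bytes (64 MiB)

// Accepts only base-10 integers >= -1; returns undefined for anything else.
function parseThreshold(raw: string): number | undefined {
  const trimmed = raw.trim();
  if (!/^-?\d+$/.test(trimmed)) return undefined;
  const value = Number(trimmed);
  return Number.isSafeInteger(value) && value >= -1 ? value : undefined;
}

function resolveThreshold(
  flag: string | undefined, // value of --wal-checkpoint-threshold, if given
  env: string | undefined,  // value of GITNEXUS_WAL_CHECKPOINT_THRESHOLD, if set
  warn: (message: string) => void = console.warn,
): number {
  // The flag wins when both are set, and an invalid flag is rejected outright.
  if (flag !== undefined) {
    const parsed = parseThreshold(flag);
    if (parsed === undefined) {
      throw new Error(`--wal-checkpoint-threshold expects an integer >= -1, got "${flag}"`);
    }
    return parsed;
  }
  // A non-empty but invalid env value warns and falls back to the default.
  if (env !== undefined && env.trim() !== "") {
    const parsed = parseThreshold(env);
    if (parsed !== undefined) return parsed;
    warn(
      `GITNEXUS_WAL_CHECKPOINT_THRESHOLD="${env}" is not an integer >= -1; ` +
        `using default ${DEFAULT_WAL_CHECKPOINT_THRESHOLD}`,
    );
  }
  return DEFAULT_WAL_CHECKPOINT_THRESHOLD;
}
```

For example, `resolveThreshold("33554432", "1024")` would return the flag value, while `resolveThreshold(undefined, "lots")` would warn once and return the 64 MiB default.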

Docs

  • Root README.md and gitnexus/README.md updated with the new flag, env var, default (67108864), and a tradeoff note: larger threshold reduces checkpoint frequency but increases WAL size at rotation.

Tests

  • New integration test under gitnexus/test/integration/analyze-wal-checkpoint-failure.test.ts — drives a real lbug.Database + CLI spawn via tsx and provokes a real checkpoint failure portably (planting a directory at the rename target). No vi.mock of the orchestrator; no test-only injection hook.
  • New unit suite for the checkpoint driver covering retry budget, backoff, opt-out, propagation of non-checkpoint errors, and exhaustion.
  • Existing unit tests updated for the new default threshold, the wal-* rename, the strict + permissive regex tiers, and the warn-on-invalid-env path.
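
The "directory at the rename target" trick used by the integration test can be shown in isolation with plain Node.js file APIs. The sketch below is only a demonstration of the idea: the file names are invented and are not Ladybug's real WAL or checkpoint paths, and the actual test drives a real database rather than calling rename directly.

```typescript
import { mkdirSync, mkdtempSync, renameSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Returns the error raised when renaming a file onto an existing directory,
// or undefined if the rename unexpectedly succeeded.
function provokeRenameFailure(): Error | undefined {
  const workDir = mkdtempSync(join(tmpdir(), "rename-demo-"));
  try {
    const source = join(workDir, "data.tmp");   // hypothetical file being rotated
    const target = join(workDir, "data.final"); // hypothetical rename destination

    writeFileSync(source, "payload");
    mkdirSync(target);                           // plant a directory at the target
    writeFileSync(join(target, "blocker"), "");  // keep it non-empty for good measure

    try {
      renameSync(source, target);
      return undefined;
    } catch (err) {
      return err instanceof Error ? err : new Error(String(err));
    }
  } finally {
    rmSync(workDir, { recursive: true, force: true }); // always clean up
  }
}
```

The specific error code differs by platform, so a test built this way is better off asserting that a failure occurred and that the expected recovery hint was printed, rather than matching one OS-specific code.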

Verification

  • tsc --noEmit: clean.
  • Affected unit tests: 67 / 67 passing.
  • Integration test: 1 / 1 passing.

Example:

```sh
# default (64 MiB threshold + manual driver; safer on large runs)
gitnexus analyze

# disable the manual driver
GITNEXUS_WAL_MANUAL_CHECKPOINT=0 gitnexus analyze

# tune the native threshold explicitly (33554432 bytes = 32 MiB)
gitnexus analyze --wal-checkpoint-threshold 33554432
# equivalent:
GITNEXUS_WAL_CHECKPOINT_THRESHOLD=33554432 gitnexus analyze
```

@vercel

vercel Bot commented May 22, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| gitnexus | Ready | Preview, Comment | May 22, 2026 12:32pm |


Copilot AI changed the title from "[WIP] Fix WAL rename crash during large indexing runs" to "analyze: default Ladybug WAL auto-checkpoint off; add user-controlled checkpoint toggle" May 22, 2026
Copilot AI requested a review from magyargergo May 22, 2026 04:37
@magyargergo

Collaborator

@copilot auto-checkpoint must always be on; there is no need for a CLI option for that. However, the size is limited to 16 MB when the last parameter is set to -1, and that is what I want to be able to control.

Copilot AI commented May 22, 2026

Contributor Author

Implemented in 8828eee.

  • Auto-checkpoint is now always enabled in LadybugDB.
  • Removed the on/off toggle surface (--lbug-auto-checkpoint / GITNEXUS_LBUG_AUTO_CHECKPOINT).
  • Added threshold control instead: --lbug-checkpoint-threshold <value> and GITNEXUS_LBUG_CHECKPOINT_THRESHOLD (default -1, LadybugDB default threshold behavior).

@magyargergo

Collaborator

@copilot we need to write integration tests as well, and if a checkpoint failure occurs we need to recommend using this new option. Read the Ladybug documentation on the checkpoint size and on which exception it raises to Node.js.

Copilot AI commented May 22, 2026

Contributor Author

Addressed in 508cea1.

I added checkpoint-failure guidance in analyze for Ladybug runtime I/O errors raised during WAL checkpoint file rename/remove, with a recommendation to retry using --lbug-checkpoint-threshold / GITNEXUS_LBUG_CHECKPOINT_THRESHOLD.

I also added tests:

  • unit coverage for the new recommendation path
  • integration coverage in CLI e2e (guarded when the native Ladybug binary is unavailable)

@github-actions

github-actions Bot commented May 22, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
|---|---|---|
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
|---|---|---|---|---|
| 9555 | 9554 | 0 | 1 | 536s |

✅ All 9554 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 77.7% 31505/40544 N/A% 🟢 ███████████████░░░░░
Branches 66.39% 20083/30250 N/A% 🟢 █████████████░░░░░░░
Functions 81.16% 3220/3967 N/A% 🟢 ████████████████░░░░
Lines 81.13% 28405/35009 N/A% 🟢 ████████████████░░░░

📋 View full run · Generated by CI

@magyargergo

Collaborator

@copilot update title + description to reflect our changes

Copilot AI commented May 22, 2026

Contributor Author

Updated wording proposal to match the implemented changes:

Title
analyze: keep Ladybug auto-checkpoint on; add checkpoint-threshold control

Description

  • Keep LadybugDB auto-checkpointing always enabled in analyze runs.
  • Replace the on/off toggle with threshold control:
    • --lbug-checkpoint-threshold <value>
    • GITNEXUS_LBUG_CHECKPOINT_THRESHOLD
  • Default threshold is -1 (LadybugDB default behavior).
  • Restrict checkpoint-failure guidance to checkpoint-specific Ladybug I/O error shapes.
  • Add/update tests for threshold parsing and checkpoint-specific recommendation paths.

@magyargergo magyargergo changed the title from "analyze: default Ladybug WAL auto-checkpoint off; add user-controlled checkpoint toggle" to "analyze: keep Ladybug auto-checkpoint on; add checkpoint-threshold control" May 22, 2026
…manual checkpoint driver with retry

Address review feedback on PR #1772:

- Rename CLI flag, env var, AnalyzeOptions field, recovery-hint tag, and
  parser/constants from lbug-* to engine-neutral wal-* (matches the existing
  WAL_RECOVERY_SUGGESTION / isWalCorruptionError convention).
- Raise default threshold from -1 (Ladybug stock ~16 MiB) to 64 MiB so users
  on the default config no longer hit the original rename/remove race.
- Align both READMEs to publish 67108864 (64 MiB) instead of 65536 (which
  would have made the crash more frequent).
- Add wal-checkpoint-driver.ts: a periodic manual CHECKPOINT driver wrapped
  in a 3-attempt jittered retry (50/200/500 ms), driven from runFullAnalysis.
  Opt-out via GITNEXUS_WAL_MANUAL_CHECKPOINT=0. Moves the race window into a
  JS-controllable retry surface while keeping native auto-checkpoint on.
- Move LBUG_CHECKPOINT_RENAME_RE / REMOVE_RE plus the predicate (renamed to
  isLbugCheckpointIoError) into lbug-config.ts alongside isWalCorruptionError.
  Predicate is now exported. Add a permissive fallback matcher and pin the
  matched Ladybug version in comments.
- Warn instead of silently defaulting when GITNEXUS_WAL_CHECKPOINT_THRESHOLD
  is set to a non-empty unparseable value (closes the CLI-vs-env asymmetry).
- Add a typed RecoveryHint string-literal union in cli-message.ts so future
  hint tags can't drift.
- Add a real integration test under test/integration/ that triggers a
  Ladybug checkpoint IO failure via a pre-existing directory at the rename
  target (portable across platforms; no test-only injection hook).
- Add small-disk / CI caveat (32 MiB secondary suggestion) to the recovery
  hint and README env-var rows.
- Document CLI/env precedence in the analyze --help block.
- Help placeholder: <value> -> <bytes>.
- Rename analyze-lbug-auto-checkpoint.test.ts to use the new wal-* token.
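
The strict + permissive matcher idea from the list above can be sketched as a two-tier classifier. Everything concrete here is invented for illustration: the regexes, the sample message text, and the function name are assumptions and do not reproduce Ladybug's actual error messages or the real LBUG_CHECKPOINT_RENAME_RE / REMOVE_RE patterns.

```typescript
// Hypothetical two-tier matcher. The message shapes are made up for this
// example and are NOT Ladybug's real error text.
const STRICT_RE =
  /^IO exception: (?:rename|remove) of WAL file failed during checkpoint\b/i;

// Fallback: any message mentioning rename/remove together with wal/checkpoint,
// in either order, so a reworded message still gets recognized.
const PERMISSIVE_RE =
  /\b(?:rename|remove)\b[\s\S]*\b(?:wal|checkpoint)\b|\b(?:wal|checkpoint)\b[\s\S]*\b(?:rename|remove)\b/i;

type MatchTier = "strict" | "permissive" | "none";

function classifyCheckpointError(err: unknown): MatchTier {
  const message = err instanceof Error ? err.message : String(err);
  if (STRICT_RE.test(message)) return "strict";         // known, pinned shape
  if (PERMISSIVE_RE.test(message)) return "permissive"; // degraded fallback
  return "none";                                        // unrelated error
}
```

A predicate such as isLbugCheckpointIoError could then treat both "strict" and "permissive" as a match, while logging the tier makes it visible when an upstream message change has pushed matching onto the fallback.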
@magyargergo magyargergo marked this pull request as ready for review May 22, 2026 09:26
Comment thread gitnexus/src/core/lbug/wal-checkpoint-driver.ts Fixed
@github-actions

github-actions Bot commented May 22, 2026

Contributor

✨ PR Autofix

Found fixable formatting / unused-import issues across 170 changed lines. Comment /autofix on this PR to apply them, or run npm run lint:fix && npm run format locally.

{"schema":"gitnexus.pr-autofix/v2","state":"fixes-available","pr_number":1772,"changed_lines":170,"head_sha":"9ef7caa4d4b008420d5c1016bd512afd0d78add1","run_id":"26280260878","apply_command":"/autofix"}

- Drop unused `jitteredDelay` function flagged by CodeQL in PR #1772; the
  retry loop already inlines the same calculation with the injectable
  `randomImpl` so the helper was dead. Move the non-cryptographic-by-design
  comment next to the actual jitter site.
- Apply `prettier --write` to wal-checkpoint-driver.ts and the new
  integration test to absorb the PR autofix bot's formatting findings.
@magyargergo magyargergo changed the title from "analyze: keep Ladybug auto-checkpoint on; add checkpoint-threshold control" to "fix(lbug): keep Ladybug auto-checkpoint on; add checkpoint-threshold control" May 22, 2026
@magyargergo magyargergo changed the title from "fix(lbug): keep Ladybug auto-checkpoint on; add checkpoint-threshold control" to "fix(lbug): add WALL checkpoint-threshold control" May 22, 2026
@magyargergo magyargergo changed the title from "fix(lbug): add WALL checkpoint-threshold control" to "fix(lbug): add WAL checkpoint-threshold control" May 22, 2026
@magyargergo magyargergo merged commit 87b91c8 into main May 22, 2026
62 of 67 checks passed
@magyargergo magyargergo deleted the copilot/fix-wal-rename-crash branch May 22, 2026 13:46

Development

Successfully merging this pull request may close these issues.

autoCheckpoint: true with 16MB threshold causes WAL rename crash during large indexing runs

3 participants