
fix(lbug): resolve non-ASCII paths for KuzuDB on Windows (#1811)#1817

Merged
magyargergo merged 10 commits into main from fix/issue-1811-chinese-path-encoding
May 25, 2026


@magyargergo magyargergo commented May 25, 2026

Collaborator

Summary

KuzuDB's native C++ layer uses CreateFileA (ANSI), not CreateFileW. UTF-8 bytes from Node.js are misinterpreted via the system's Active Code Page (e.g. GBK on Chinese Windows), producing a garbled path — "Error 3: The system cannot find the path specified."
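To make the failure mode concrete, the sketch below reinterprets the UTF-8 bytes of a path as a single-byte code page. It uses latin1 as a stand-in for the Active Code Page (an assumption made for portability; on Chinese Windows the code page would be GBK and the garbled characters would differ). `misreadAsAnsi` is a hypothetical helper, not part of the PR.

```typescript
// Hypothetical helper: decode the UTF-8 bytes of `p` as if they were latin1,
// mimicking an ANSI file API that does not know the bytes are UTF-8.
function misreadAsAnsi(p: string): string {
  return Buffer.from(p, "utf8").toString("latin1");
}

const asciiPath = "C:\\repos\\demo";
const cjkPath = "C:\\repos\\中文";

// ASCII bytes are identical in UTF-8 and latin1, so the path survives.
console.log(misreadAsAnsi(asciiPath) === asciiPath); // true
// Each CJK character is three UTF-8 bytes, so it turns into three unrelated
// characters and the resulting path no longer exists on disk.
console.log(misreadAsAnsi(cjkPath) === cjkPath); // false
```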

Approach

Layered workaround in toNativeSafePath():

  1. 8.3 short path (fast, stateless): resolves the path to its Windows short-name form using cmd.exe's for-variable expansion (%~sI). The path is passed through an environment variable (%GITNEXUS_SP%) instead of being interpolated into the command, which avoids command injection (CodeQL clean).
  2. NTFS junction fallback — when 8.3 is disabled, creates a junction from an ASCII temp path (os.tmpdir()/gitnexus-junction-<hash>) to path.dirname(p), returns junction/basename(p).
  3. Diagnostic warning — when both fail, logs remediation advice and returns the original path.
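The three layers above can be sketched as a small fallback chain. This is a minimal illustration under stated assumptions, not the PR's implementation: `resolveNativeSafePath`, its `platform` parameter, and the injected `shortPath` / `junction` / `warn` callbacks are hypothetical names chosen so the chain can be exercised on any platform.

```typescript
type Resolver = (p: string) => string | null;

const NON_ASCII_RE = /[^\x00-\x7F]/;

// Hypothetical, dependency-injected version of the fallback chain.
// The real toNativeSafePath() takes only the path; the extra parameters
// here exist purely to keep the sketch testable.
function resolveNativeSafePath(
  p: string,
  platform: string,
  shortPath: Resolver,
  junction: Resolver,
  warn: (msg: string) => void = console.warn,
): string {
  // No-op on non-Windows platforms and on all-ASCII paths.
  if (platform !== "win32" || !NON_ASCII_RE.test(p)) return p;

  // Layer 1 (8.3 short name), then layer 2 (NTFS junction).
  for (const attempt of [shortPath, junction]) {
    const candidate = attempt(p);
    if (candidate !== null) return candidate;
  }

  // Layer 3: both failed, so warn and hand back the original path.
  warn(`Could not derive an ASCII-safe path for ${p}`);
  return p;
}
```

Each resolver signals failure by returning `null`, so the order of the array is the only place that encodes priority.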

Applied at all paths into KuzuDB's native layer:

  • DB open: openLbugConnection() converts before createLbugDatabase()
  • COPY CSV: normalizeCopyPath() converts before embedding in SQL; loadGraphToLbug() stages CSVs in os.tmpdir() when storagePath is non-ASCII on Windows
  • Pool adapter: all 4 createLbugDatabase() call sites wrapped
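For the COPY CSV entry point, the ordering matters: the path is made native-safe first and only then converted to forward slashes. A minimal sketch, assuming a `toSafe` callback that stands in for `toNativeSafePath()` (the two-argument signature is hypothetical; the PR's `normalizeCopyPath()` calls the real function directly):

```typescript
// Hypothetical sketch: `toSafe` is injected so the example runs anywhere.
function normalizeCopyPath(p: string, toSafe: (p: string) => string): string {
  // Convert while the path still has its native separators, then switch
  // every backslash to a forward slash before embedding it in a query.
  return toSafe(p).replace(/\\/g, "/");
}

// Usage with a fake converter that pretends the 8.3 lookup succeeded.
const fakeToSafe = (p: string): string => p.replace("中文", "ZHONGW~1");
console.log(normalizeCopyPath("C:\\repos\\中文\\nodes.csv", fakeToSafe));
// C:/repos/ZHONGW~1/nodes.csv
```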

Safety

  • No-op on ASCII paths and on Linux/macOS (zero cost for existing users)
  • Junction cleanup on process.exit / SIGTERM / SIGINT (correct exit codes: 143/130 on Windows)
  • Orphan scan on startup removes stale junctions from prior crashes (uses lstatSync to avoid UNC blocking)
  • Worker threads skip junction creation (isMainThread guard) — use 8.3 only
  • Sidecar sweep operates on safePath (the path KuzuDB opened), not the original

Files changed

| File | Change |
| --- | --- |
| gitnexus/src/core/lbug/lbug-config.ts | toNativeSafePath(), junction lifecycle, signal handlers, orphan scan |
| gitnexus/src/core/lbug/lbug-adapter.ts | normalizeCopyPath() integration, tmpdir CSV staging |
| gitnexus/src/core/lbug/pool-adapter.ts | toNativeSafePath() at all 4 createLbugDatabase sites |
| gitnexus/test/unit/lbug-native-safe-path.test.ts | Unit tests: ASCII passthrough, short-path conversion, junction, nonexistent path |
| gitnexus/test/integration/lbug-non-ascii-path.test.ts | E2E: initLbug + loadGraphToLbug + query with CJK directory name |
| gitnexus/test/unit/pool-wal-recovery.test.ts | Mock updated for new toNativeSafePath export |
| gitnexus/test/unit/lbug-pool-win-fts-probe.test.ts | Mock updated for new toNativeSafePath export |
| gitnexus/scripts/cross-platform-tests.ts | Integration test registered in cross-platform matrix |
| gitnexus/vitest.config.ts | Integration test registered in lbug-db project |

Test plan

  • npx vitest run test/unit/lbug-native-safe-path.test.ts — 5 passing
  • npx vitest run test/integration/lbug-non-ascii-path.test.ts — 3 passing (all platforms)
  • npx vitest run test/unit/pool-wal-recovery.test.ts test/unit/lbug-pool-win-fts-probe.test.ts — 20 passing (no regression)
  • On Windows with a repo under a CJK-named directory: npx gitnexus analyze completes without COPY ENOENT
  • On Windows with an all-ASCII path: behavior unchanged
  • On Linux/macOS: behavior unchanged (function is a no-op)

KuzuDB's native C++ layer uses ANSI file APIs (fopen) on Windows.
When the repo path contains CJK or other non-ASCII characters, the
UTF-8 bytes from Node.js are misinterpreted as the system's Active
Code Page (e.g. GBK), producing a garbled path — "Error 3: The
system cannot find the path specified."

Add `toNativeSafePath()` which converts non-ASCII paths to their
Windows 8.3 short-name form (all-ASCII) before passing them to the
native layer. Applied to both the database open path and the COPY
CSV paths. No-ops on non-Windows and on all-ASCII paths.

Closes #1811

vercel Bot commented May 25, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| gitnexus | Ready | Preview, Comment | May 25, 2026 8:00pm |


Comment thread gitnexus/src/core/lbug/lbug-config.ts Fixed
Comment thread gitnexus/src/core/lbug/lbug-config.ts Fixed

github-actions Bot commented May 25, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
| --- | --- | --- |
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
| --- | --- | --- | --- | --- |
| 9857 | 9854 | 0 | 3 | 612s |

✅ All 9854 tests passed

3 tests skipped:
  • PHP pipeline benchmark > scales with file count (workers enabled)
  • Rust pipeline benchmark > scales with file count (workers enabled)
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

| Metric | Coverage | Covered | Base | Delta | Status |
| --- | --- | --- | --- | --- | --- |
| Statements | 79.87% | 34002/42567 | N/A% | 🟢 | ███████████████░░░░░ |
| Branches | 68.47% | 21728/31731 | N/A% | 🟢 | █████████████░░░░░░░ |
| Functions | 84.83% | 3523/4153 | N/A% | 🟢 | ████████████████░░░░ |
| Lines | 83.36% | 30625/36736 | N/A% | 🟢 | ████████████████░░░░ |

📋 View full run · Generated by CI

…1811)

- Unit tests for toNativeSafePath: ASCII passthrough, non-Windows
  no-op, Windows short-path conversion, nonexistent-path fallback
- Integration test: full initLbug + loadGraphToLbug round-trip with
  CJK characters in the storage path — runs on all platforms
- Fix toNativeSafePath to reject cmd.exe output containing '?' chars
  (replacement for unrepresentable Unicode in the console code page)
- Register integration test in vitest lbug-db project and
  cross-platform-tests.ts matrix

github-actions Bot commented May 25, 2026

Contributor

✨ PR Autofix

Found fixable formatting / unused-import issues across 16 changed lines. Comment /autofix on this PR to apply them, or run npm run lint:fix && npm run format locally.

{"schema":"gitnexus.pr-autofix/v2","state":"fixes-available","pr_number":1817,"changed_lines":16,"head_sha":"a94bfe50a6601a95d8aa4cd5b1e17131e16c4e8e","run_id":"26413426471","apply_command":"/autofix"}

@magyargergo

Collaborator Author

/autofix

@github-actions

Contributor

✅ Applied autofix and pushed a commit. (apply run)

…age (#1811)

U1+U4: toNativeSafePath now tries 8.3 short path → NTFS junction
fallback → diagnostic warning. Junctions target path.dirname(p) and
reconstruct the leaf. Handles EEXIST races. Registers cleanup on
exit/SIGTERM/SIGINT. Orphan scan on first call removes stale
junctions from prior crashes.
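The junction fallback described in this commit message can be sketched as follows. This is an illustration under assumptions, not the PR's code: `junctionPathFor` is a hypothetical name, the SHA-256 hash and its 16-character truncation are guesses at how `<hash>` might be derived, and `linkRoot` is a parameter only so the sketch can be tested outside `os.tmpdir()`.

```typescript
import * as crypto from "node:crypto";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical sketch of the junction fallback.
function junctionPathFor(p: string, linkRoot: string = os.tmpdir()): string {
  const parent = path.dirname(p);
  // Assumption: the junction name is derived from a hash of the parent dir.
  const hash = crypto.createHash("sha256").update(parent).digest("hex").slice(0, 16);
  const link = path.join(linkRoot, `gitnexus-junction-${hash}`);
  try {
    // The "junction" type only matters on Windows; other platforms create
    // an ordinary symlink, which is enough to exercise the logic.
    fs.symlinkSync(parent, link, "junction");
  } catch (err) {
    // EEXIST: an earlier call or a concurrent process already created it.
    if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
  }
  // The junction stands in for the non-ASCII parent; re-attach the leaf.
  return path.join(link, path.basename(p));
}
```

Because the link name depends only on the parent directory, repeated calls for the same database resolve to the same junction, which is what makes the EEXIST branch safe to ignore.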

U2: loadGraphToLbug redirects csvDir to os.tmpdir() when
storagePath contains non-ASCII on Windows, avoiding non-ASCII
characters in COPY FROM paths entirely.

U3: All 4 createLbugDatabase call sites in pool-adapter.ts now
wrap dbPath with toNativeSafePath.

- Fix lbug-non-ascii-path integration test: use CodeRelation (actual
  relationship table name) instead of CALLS
- Add toNativeSafePath to lbug-config.js mocks in pool-wal-recovery
  and lbug-pool-win-fts-probe tests, since pool-adapter now imports it

Reject paths containing cmd.exe metacharacters (" % | & < > ^)
before interpolating into the `for %I` short-path command.
Prevents command injection via crafted path names.

- U1: Use process.exit(0) on Windows instead of process.kill re-raise
  (SIGTERM forcefully kills on Windows, handlers never fire)
- U2: Pass safePath to openWithLockRetry so sidecar sweep targets the
  path KuzuDB actually opened, not the original non-ASCII path
- U3: Skip junction creation in worker threads (isMainThread guard) to
  prevent junction leaks from pool-adapter workers
- U4: Replace existsSync with lstatSync in orphan scan to avoid 30s
  blocking on unreachable UNC network targets

github-actions Bot commented May 25, 2026

Contributor

Claude finished @magyargergo's task in 5m 37s


PR #1817 Production-Readiness Review

Todo List

  • Post initial comment
  • Read repo docs (AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md)
  • Inspect PR diff and changed files
  • Analyze lbug-config.ts — toNativeSafePath, tryShortPath, tryJunction, signal handlers
  • Analyze lbug-adapter.ts — normalizeCopyPath, tempdir CSV staging
  • Analyze pool-adapter.ts — createLbugDatabase call sites, worker-thread behavior
  • Analyze test files — integration + unit coverage
  • Check CI status and CodeQL findings
  • Unicode/hygiene scan on changed files
  • Synthesize findings and post final review

Review Bar

Reviewer persona: Senior production-readiness reviewer for GitNexus. Adversarial but fair. The bar is whether Windows users with non-ASCII repo paths can run gitnexus analyze without COPY/open failures, without command injection, junction leaks, stale temp data, broken read-only pool behavior, WAL/lock regressions, or cross-platform regressions.


Problem Being Solved

Issue #1811: npx gitnexus analyze fails on Windows 11 PowerShell when the repo path contains Chinese characters. Root cause: KuzuDB/LadybugDB's native C++ layer uses CreateFileA (ANSI), not CreateFileW. UTF-8 bytes from Node.js are misinterpreted via the Active Code Page, producing "Error 3: The system cannot find the path." during loadGraphToLbug.


Current PR State

  • Head commit: a94bfe5 — 7 commits on fix/issue-1811-chinese-path-encoding
  • Changed files: 9 (all within gitnexus/ — no unrelated churn confirmed)
  • CI Gate: FAILING (see Finding 1 below)
  • CodeQL: ✅ Passing on latest head
  • All tests: ✅ Passing (Windows/macOS/Linux, packaged install smoke on Windows)

Merge Status and Mergeability

CI Gate is currently red. The only failing check is quality / format (Prettier). All functional checks — typecheck, tests on all three platforms, packaged install smoke on Windows, CodeQL, Gitleaks, Docker, dependency review — pass. The PR cannot merge with a red CI Gate.


Repository History Considered


Branch Hygiene Assessment

Clean. All 7 commits are scoped to the #1811 fix, tests, CI fixes, CodeQL hardening, and autofix. No unrelated domains, workflow churn, or mixed refactors.


Understanding of the Change

lbug-config.ts — adds toNativeSafePath(p: string): string which:

  1. Is a no-op on non-Windows or all-ASCII paths
  2. Tries cmd.exe /c for %I in ("${p}") do @echo %~sI to get the 8.3 short-name form (fast, stateless)
  3. Falls back to an NTFS junction from ASCII tmpdir if 8.3 is disabled
  4. Warns and returns original if both fail
  5. Registers process-exit cleanup handlers (first call only, main thread only)
  6. Scans for orphaned junctions from prior crashes (first call only)

lbug-adapter.ts: normalizeCopyPath at line 392 now calls toNativeSafePath before backslash→forward-slash conversion. loadGraphToLbug at line 864 redirects CSV staging to a hash-named tmpdir when storagePath has non-ASCII chars on Windows.

pool-adapter.ts: openReadOnlyDatabase (lines 419, 438, 452) and replayShadowPagesWithWritableOpen (line 394) all call createLbugDatabase(lbug, toNativeSafePath(dbPath), ...). Coverage is consistent across all pool DB open sites.

Tests — integration test (lbug-non-ascii-path.test.ts) creates a CJK-named temp dir and verifies initLbug + COPY + queryability on all platforms. Unit tests cover ASCII passthrough, non-Windows no-op, and Windows short-path/junction cases.


Findings


Finding 1 — CI Gate is red: Prettier format check fails

Severity: HIGH (blocks merge)

Risk: The autofix commit (01532b6) ran before the last four substantive commits (f6f0979, 243d96e, 1457499, a94bfe5). Those commits added formatting issues not caught by subsequent CI runs until the latest full CI run (workflow 26413387479, job 77752546116). The quality / format job fails at npx prettier --check .. CI Gate (77754504223) cascades to failure.

Evidence: CI status — quality / format: ❌ failure (step 5: "Run npx prettier --check ."). All other jobs pass. CI Gate: ❌.

Recommended fix: Run npm run format (or trigger /autofix) and push the formatting commit. This is the only blocker.

Blocks merge: YES



Finding 2 — CMD_UNSAFE_RE missing parentheses (functional gap, not injection)

Severity: LOW

Risk: lbug-config.ts:34: const CMD_UNSAFE_RE = /["%|&<>^]/; does not block ( or ). In cmd.exe's for %I in ("${p}") do @echo %~sI, an unescaped ) in the path closes the in (...) set early, causing cmd.exe to fail or mis-parse the FOR loop. This isn't code injection — it causes an exception that is swallowed by the catch {} at line 53, causing tryShortPath to return null and fall through to the junction fallback. Paths like C:\Project\(backup)\中文\code would silently skip 8.3 conversion and rely on the junction path.

CR/LF in paths is theoretical only: the Windows filesystem prohibits control characters (0x00–0x1F) in filenames.

Evidence: lbug-config.ts:34-57; CodeQL passes on latest head — no command injection finding.

Recommended fix: Add () to CMD_UNSAFE_RE or add a test that confirms parenthesis paths fall through to junction correctly. Not a security issue, but documents the gap.

Blocks merge: NO
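As an illustration of the recommended fix for Finding 2, the pattern quoted in the finding can be extended with the two parentheses. `isSafeForCmdInterpolation` is a hypothetical wrapper name; note that a later commit in this PR drops CMD_UNSAFE_RE entirely in favour of passing the path through an environment variable.

```typescript
// Pattern from the finding, with "(" and ")" added to the character class.
const CMD_UNSAFE_RE = /["%|&<>^()]/;

// Hypothetical guard: true when the path contains none of the characters
// that would change how cmd.exe parses `for %I in ("...") do ...`.
function isSafeForCmdInterpolation(p: string): boolean {
  return !CMD_UNSAFE_RE.test(p);
}

console.log(isSafeForCmdInterpolation("C:\\Project\\中文\\code")); // true
console.log(isSafeForCmdInterpolation("C:\\Project\\(backup)\\中文\\code")); // false
```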


Finding 3 — SIGTERM handler exits with code 0 on Windows

Severity: LOW-MEDIUM

Risk: lbug-config.ts:107-116 registers global SIGTERM/SIGINT handlers. On Windows, the handler calls process.exit(0) after cleanup. This:

  • Masks the termination with a success exit code (conventional SIGTERM exit code is 143, SIGINT is 130)
  • Calls process.exit(0) synchronously, preventing any other registered SIGTERM handlers (CLI graceful shutdown, server teardown) from firing
  • The cleanupRegistered guard prevents double-registration, but all SIGTERM handlers registered after the first toNativeSafePath call on a non-ASCII path in the main thread will be silently skipped on Windows

On POSIX, the existing approach (process.kill(process.pid, signal)) correctly re-raises the signal. The Windows path is pragmatically limited (Node.js doesn't support real POSIX signals on Windows), but process.exit(0) is the wrong exit code.

Evidence: lbug-config.ts:101-117; registerCleanupHandlers() called from toNativeSafePath on first non-ASCII path in main thread.

Recommended fix: Use process.exit(signal === 'SIGINT' ? 130 : 143) on Windows instead of process.exit(0). Or better: perform cleanup synchronously and don't call process.exit() at all, letting Node.js exit normally after all handlers run.

Blocks merge: MAYBE — risk is real if existing CLI/server has registered its own SIGTERM handlers before the first non-ASCII path is processed. In practice, on Windows the first toNativeSafePath call happens during analyze, which means the CLI has already registered its own handlers. Recommend fixing before merge regardless.
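A minimal sketch of the first recommended fix for Finding 3, assuming hypothetical names (`exitCodeForSignal`, `makeSignalHandler`) and injected `cleanup` / `exit` callbacks so the behaviour can be checked without terminating the process:

```typescript
type ExitSignal = "SIGINT" | "SIGTERM";

// Conventional exit codes are 128 + signal number (SIGINT = 2, SIGTERM = 15).
function exitCodeForSignal(signal: ExitSignal): number {
  return signal === "SIGINT" ? 130 : 143;
}

// Hypothetical handler factory. In real use `exit` would be process.exit
// and `cleanup` would remove the junctions created by this process.
function makeSignalHandler(
  cleanup: () => void,
  exit: (code: number) => void,
): (signal: ExitSignal) => void {
  return (signal) => {
    try {
      cleanup();
    } finally {
      // Exit with a non-zero code so termination is not reported as success.
      exit(exitCodeForSignal(signal));
    }
  };
}

// Possible wiring (not executed here):
//   const handler = makeSignalHandler(removeJunctions, (c) => process.exit(c));
//   process.once("SIGTERM", () => handler("SIGTERM"));
//   process.once("SIGINT", () => handler("SIGINT"));
```

The `finally` block means a throwing cleanup still ends with the signal-specific exit code rather than leaving the process running.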


Finding 4 — Junction orphan scanner has a TOCTOU race window

Severity: LOW

Risk: scanOrphanedJunctions() at lbug-config.ts:119-142 reads all entries in tmpdir prefixed with gitnexus-junction-, checks if their junction target exists via lstatSync, and deletes junctions whose target doesn't exist. In a concurrent multi-instance scenario (two gitnexus analyze processes starting near-simultaneously on the same user session), Process A could create a junction for Target1, and Process B's scan could observe that Target1 hasn't been created yet (filesystem delay) and delete Process A's junction.

Windows tmpdir is per-user (%LOCALAPPDATA%\Temp), so cross-user contamination is not possible.

Evidence: lbug-config.ts:119-142; no concurrent test coverage.

Recommended fix: The race window is narrow and the fallback is benign (Process A falls back to warning). No code change strictly required, but adding a creation-time marker (e.g., checking if the junction link itself is new rather than the target) would eliminate the race. Document the race in a comment.

Blocks merge: NO
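The orphan scan discussed in Finding 4 might look roughly like the sketch below. It is an approximation under assumptions: `removeOrphanedJunctions` is a hypothetical name, the scan directory is a parameter for testability, and only ENOENT is treated as proof that the target is gone (the PR's exact error handling is not shown in this review).

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const JUNCTION_PREFIX = "gitnexus-junction-";

// Hypothetical sketch: remove prefixed links whose target no longer exists.
// Returns the names that were removed.
function removeOrphanedJunctions(dir: string = os.tmpdir()): string[] {
  const removed: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    if (!name.startsWith(JUNCTION_PREFIX)) continue;
    const link = path.join(dir, name);
    try {
      const target = fs.readlinkSync(link); // throws if `link` is not a link
      let targetMissing = false;
      try {
        // lstatSync, as in the PR, rather than existsSync.
        fs.lstatSync(target);
      } catch (err) {
        targetMissing = (err as NodeJS.ErrnoException).code === "ENOENT";
      }
      if (targetMissing) {
        fs.unlinkSync(link);
        removed.push(name);
      }
    } catch {
      // Best effort: skip entries that cannot be read or removed.
    }
  }
  return removed;
}
```

This sketch has the same race the finding describes: a link created by another process just before its target appears would be removed.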


Finding 5 — Worker-thread junction skip: pool paths may not be covered with 8.3 disabled

Severity: LOW (likely NO impact in practice)

Risk: toNativeSafePath skips junction creation in non-main threads (lbug-config.ts:167-173). pool-adapter.ts calls toNativeSafePath(dbPath) in openReadOnlyDatabase (lines 419, 438, 452) and replayShadowPagesWithWritableOpen (line 394). These are called from doInitLbug, which runs in the MCP server's main event loop. pool-adapter.ts does not spawn worker threads internally.

However, if the pool adapter is ever invoked from a worker thread context (e.g., via a worker-thread–based MCP transport in the future), and 8.3 short names are disabled, the junction fallback is skipped and the original non-ASCII path would reach LadybugDB.

Evidence: lbug-config.ts:167-173; pool-adapter.ts:394,419,438,452; no worker thread usage confirmed in current pool-adapter.ts.

Recommended fix: Add a comment at each toNativeSafePath(dbPath) call in pool-adapter.ts noting that the caller must be the main thread for junction fallback to work. No code change required now.

Blocks merge: NO


Finding 6 — Integration test doesn't exercise junction fallback on Windows CI

Severity: LOW

Risk: The integration test at lbug-non-ascii-path.test.ts:33-53 creates a CJK tmpdir and calls toNativeSafePath. On Windows CI runners, 8.3 short names are typically enabled, so tryShortPath succeeds and the junction path is never exercised. The unit test at lbug-native-safe-path.test.ts:33-55 does test junction behavior but only on Windows. No CI path simulates 8.3 disabled + junction required.

The test is correctly wired into the lbug-db project (vitest.config.ts:72) with fileParallelism: false, and runs on all three CI platforms. The integration test verifies end-to-end queryability, which is the right DoD.

Evidence: vitest.config.ts:72; lbug-native-safe-path.test.ts:32-62; CI tests / windows-latest (platform-sensitive): ✅.

Recommended fix: No immediate action required. If a Windows CI runner with fsutil 8dot3name query <drive> showing disabled becomes available, add a CI matrix variant. Document this gap in the test file.

Blocks merge: NO


Finding 7 — Unicode scan: no dangerous characters

Severity: NONE

Non-ASCII characters in lbug-config.ts are all em-dashes in comments (U+2014, harmless). CJK characters in test files are intentional reproduction strings for issue #1811. No bidi control characters (U+202A–U+202E, U+2066–U+2069), zero-width spaces (U+200B–U+200F), or BOM (U+FEFF) found in any changed file.

Blocks merge: NO
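A hygiene scan of the kind described in Finding 7 can be approximated with a single character class built from the ranges listed there. `findDangerousUnicode` and `UnicodeHit` are hypothetical names for this sketch; the reviewer's actual tooling is not shown in the thread.

```typescript
// Ranges from the finding: bidi controls (U+202A-U+202E, U+2066-U+2069),
// zero-width and direction marks (U+200B-U+200F), and the BOM (U+FEFF).
const DANGEROUS_UNICODE_SOURCE = "[\\u202A-\\u202E\\u2066-\\u2069\\u200B-\\u200F\\uFEFF]";

interface UnicodeHit {
  index: number;
  codePoint: string;
}

function findDangerousUnicode(text: string): UnicodeHit[] {
  // A fresh regex per call avoids sharing lastIndex state between scans.
  const re = new RegExp(DANGEROUS_UNICODE_SOURCE, "g");
  const hits: UnicodeHit[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const hex = m[0].charCodeAt(0).toString(16).toUpperCase();
    hits.push({ index: m.index, codePoint: "U+" + ("0000" + hex).slice(-4) });
  }
  return hits;
}

// Dashes and CJK text are legitimate non-ASCII and are not flagged.
console.log(findDangerousUnicode("comment \u2014 中文").length); // 0
```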


PR-Specific Assessment Sections

Lane A — Windows path conversion: Logic is correct and layered. 8.3 short path first (fast, stateless), junction fallback (persistent but cleaned up), warn-and-return-original (last resort). The toNativeSafePath guard (!NON_ASCII_RE.test(p)) ensures zero cost for ASCII paths on all platforms.

Lane B — LBUG DB open, sidecar, pool: openLbugConnection correctly passes safePath to both createLbugDatabase and openWithLockRetry. isTestFixturePath(safePath) correctly rejects junction paths (junction prefix gitnexus-junction- doesn't match TEST_FIXTURE_PREFIXES gitnexus-lbug- / gitnexus-test-). Sidecar sweep operates on safePath + '.wal' / safePath + '.lock', which is correct since LadybugDB writes sidecars at the path it was opened with.

Lane C — COPY CSV tempdir staging: csvDir = toNativeSafePath(path.join(os.tmpdir(), 'gitnexus-csv-${hash}')) — since os.tmpdir() returns ASCII on Windows and the hash is hex, toNativeSafePath returns it unchanged. Hash input is storagePath (deterministic per repo); concurrent runs for the same repo would collide on the tmpdir name, but each run overwrites CSV files fresh before COPY so stale data from a prior run is overwritten. Cleanup at lbug-adapter.ts:989-1007 is best-effort and correct.
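The staging-directory derivation in Lane C can be sketched as below. The hash algorithm and the 12-character truncation are assumptions (the review only says the hash is hex and derived from storagePath), and `csvStagingDir` with its `tmpRoot` parameter is a hypothetical helper.

```typescript
import * as crypto from "node:crypto";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical sketch: a deterministic, ASCII-only staging directory name
// derived from the (possibly non-ASCII) storage path.
function csvStagingDir(storagePath: string, tmpRoot: string = os.tmpdir()): string {
  const hash = crypto.createHash("sha256").update(storagePath).digest("hex").slice(0, 12);
  return path.join(tmpRoot, `gitnexus-csv-${hash}`);
}

// The same repo always maps to the same directory, which is why concurrent
// runs for one repo would share (and overwrite) the same staging files.
const a = csvStagingDir("C:\\repos\\中文\\.gitnexus");
const b = csvStagingDir("C:\\repos\\中文\\.gitnexus");
console.log(a === b); // true
```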

Lane D — Tests and CI wiring: Integration test correctly placed in lbug-db project (line 72 of vitest.config.ts) with fileParallelism: false. Windows CI tests pass. CodeQL passes. Only failure is format check (Finding 1).

Lane E — Security and hygiene: CodeQL ✅, Gitleaks ✅, no hidden Unicode, no unrelated churn. Branch is clean.


Back-and-Forth Avoided

  1. The CodeQL alert on lbug-config.ts:44 was from an earlier commit. The latest head (a94bfe5) adds CMD_UNSAFE_RE sanitization and CodeQL now passes — verified directly from CI status.
  2. The typecheck failure mentioned in the CI bot comment was from an earlier CI run, not the latest. Latest CI shows typecheck ✅.
  3. The github-advanced-security[bot] review comments reference line 44 of an earlier version of lbug-config.ts. The sanitization fix was applied in commit 1457499. These alerts are resolved on latest head per CodeQL CI status.

Final Verdict

❌ not production-ready

The single hard blocker is the failing CI Gate (quality / format — Prettier check). Every other signal is positive: all tests pass on Windows/macOS/Linux, CodeQL is green, the approach correctly covers both the DB-open path and COPY CSV path, and the fallback chain is sound. The fix is trivial — run npm run format (or trigger /autofix) and push. The secondary finding of process.exit(0) in the Windows SIGTERM handler (Finding 3) is worth fixing in the same pass since it takes one line and prevents masking failed analyze exits as successes.

To unblock this PR:

  1. Fix Prettier formatting: run npm run format in gitnexus/ and push (or comment /autofix)
  2. Consider changing process.exit(0) to process.exit(signal === 'SIGINT' ? 130 : 143) in the Windows SIGTERM handler (lbug-config.ts:111)
  3. Verify CI Gate goes green after the format fix


@magyargergo

Collaborator Author

/autofix

@github-actions

Contributor

✅ Applied autofix and pushed a commit. (apply run)

- Use exit code 143 (SIGTERM) / 130 (SIGINT) on Windows instead of 0
  so termination is not masked as success
- Run Prettier to fix formatting (CI Gate blocker)

Pass the path via GITNEXUS_SP environment variable instead of
interpolating it into the cmd.exe command string. The FOR loop
reads %GITNEXUS_SP% from the environment, so the command text is
entirely static — no user-controlled data in the shell command.

Also removes CMD_UNSAFE_RE since the env var approach makes
character-level sanitization unnecessary.
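The environment-variable approach from this commit message can be sketched as follows. The exact invocation in the PR is not shown in this thread, so the use of `execSync` and its options is an assumption, and `tryShortPath` / `parseShortPathOutput` are illustrative names; the '?' rejection mirrors the earlier commit that discards cmd.exe output containing replacement characters.

```typescript
import { execSync } from "node:child_process";

// Static command text: the path is never spliced into this string.
const SHORT_PATH_CMD = 'for %I in ("%GITNEXUS_SP%") do @echo %~sI';

// Reject empty output and output containing '?', which cmd.exe prints for
// characters the console code page cannot represent.
function parseShortPathOutput(out: string): string | null {
  const line = out.trim();
  return line.length > 0 && !line.includes("?") ? line : null;
}

// Hypothetical wrapper; only meaningful on Windows, returns null elsewhere.
function tryShortPath(p: string): string | null {
  if (process.platform !== "win32") return null;
  try {
    // On Windows, execSync runs the command through cmd.exe by default.
    const out = execSync(SHORT_PATH_CMD, {
      env: { ...process.env, GITNEXUS_SP: p },
      encoding: "utf8",
      windowsHide: true,
    });
    return parseShortPathOutput(out);
  } catch {
    return null; // the caller falls through to the junction fallback
  }
}
```

Keeping the parsing step separate from the process spawn makes the '?' rejection testable on any platform.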
@magyargergo magyargergo changed the title fix(lbug): resolve non-ASCII paths to 8.3 short form on Windows (#1811) fix(lbug): resolve non-ASCII paths for KuzuDB on Windows (#1811) May 25, 2026
@magyargergo magyargergo merged commit d4449b4 into main May 25, 2026
32 checks passed
@magyargergo magyargergo deleted the fix/issue-1811-chinese-path-encoding branch May 25, 2026 20:28
Development

Successfully merging this pull request may close these issues.

Encoding Failure in Windows PowerShell with Chinese Directory Paths

2 participants