Skip to content

fix: shard parse cache persistence on large repos#1580

Merged
magyargergo merged 9 commits into
abhigyanpatwari:mainfrom
BlackOvOoo:fix/sharded-parse-cache-persistence
May 16, 2026
Merged

fix: shard parse cache persistence on large repos#1580
magyargergo merged 9 commits into
abhigyanpatwari:mainfrom
BlackOvOoo:fix/sharded-parse-cache-persistence

Conversation

@BlackOvOoo

Copy link
Copy Markdown
Contributor

Large repos can fail to persist parse cache with "Invalid string length" because saveParseCache serializes every chunk into one huge JSON string. This changes persistence to a sharded on-disk format under .gitnexus/parse-cache, keeps backward compatibility by loading legacy parse-cache.json when present, and adds unit coverage for sharded persistence, legacy fallback, and corrupt or missing shard tolerance.

Validation: node_modules.bin\vitest.cmd run test/unit/incremental-parse-cache.test.ts

Closes #1562.

@vercel

vercel Bot commented May 14, 2026

Copy link
Copy Markdown

@BlackOvOoo is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 4m 54s —— View job


Adversarial Production-Readiness Review — PR #1580

Tasks:

  • Read repo context files (AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, ARCHITECTURE.md, DoD.md, TESTING.md)
  • Inspect changed files (parse-cache.ts, incremental-parse-cache.test.ts)
  • Inspect callers and related orchestration code
  • Run branch hygiene checks
  • Unicode/hygiene scan (partial — see §9 below)
  • Check CI status
  • Produce findings and final verdict

1. Review bar for this PR

This PR must: (a) fix Invalid string length on large repos by eliminating the single-giant-JSON serialization path; (b) keep sharded load/save/prune correct and backward-compatible; (c) not introduce path traversal or unsafe reads from untrusted local cache files; (d) be covered by tests that catch wiring breakage; (e) pass CI (typecheck + unit tests + quality gates) on the head SHA. Documentation referencing the old parse-cache.json must be updated to reflect the new format.


2. Current PR state

  • State: Open. Base abhigyanpatwari:main, head BlackOvOoo:fix/sharded-parse-cache-persistence.
  • Head SHA: e8dcc7c (merge from main into feature branch).
  • Fix commit SHA: b75ac7d — only two files changed in the actual fix.
  • Visible delta: +142/-16 across 2 files (parse-cache.ts, incremental-parse-cache.test.ts).
  • Checks status: Checks 0 / no CI jobs visible. This is a fork PR — CI may not have been authorized or triggered. This is a merge blocker until confirmed resolved.
  • Vercel preview: Pending team authorization — not validated, but not a hard blocker per repo policy.
  • Unicode/hygiene scan: Command execution restrictions prevented running rg/grep -P scans; see §9.

3. Branch hygiene assessment

Verdict: merge-from-main commit present but harmless and merge-safe.

git diff --name-status b75ac7d..HEAD confirms the merge commit (e8dcc7c) pulled in three unrelated C++ resolution PRs (#1520, #1581, #1586). However, none of those commits modified the two PR files (parse-cache.ts, incremental-parse-cache.test.ts). The actual PR diff is clean: only the intended two files are changed from the perspective of GitHub's base→head diff. The merge is merge-safe for this change. Rebase is not required, though a rebase would produce a cleaner history.


4. Understanding of the change

Old format: saveParseCache serialized the entire Map<chunkHash, ParseWorkerResult[]> into a single Record<string, ParseWorkerResult[]> object, then called JSON.stringify() on the whole thing. On large repos with ~50 chunks of 20MB source each, this could produce a string exceeding Node.js/V8's ~512MB contiguous allocation limit, throwing Invalid string length.

New format: On save, a tmpDir ${cacheDir}.tmp is created, each cache entry is written as a separate <chunkHash>.json shard file, then index.json ({version, keys[]}) is written, the old cacheDir is removed, tmpDir is renamed to cacheDir, and the legacy parse-cache.json is removed. On load, loadShardedParseCache reads index.json, then reads each shard file individually. If loadShardedParseCache returns null (index missing/unreadable), it falls back to loadLegacyParseCache.

Cache lifecycle: run-analyze.ts L356 calls loadParseCache → pipeline runs (populating cache.usedKeys) → pruneCache drops stale entries → saveParseCache persists. The entire save is wrapped in a try/catch (L801-808) so save failure only logs a warning and never breaks analysis.


5. Findings


Finding A — Unvalidated shard keys allow path traversal

  • Severity: High
  • File/line: parse-cache.ts:205-208 (loadShardedParseCache)
  • Risk: A malicious or corrupted .gitnexus/parse-cache/index.json with a key like ../../../etc/shadow causes getCacheChunkPath(storagePath, chunkHash)path.join(cacheDir, '../../../etc/shadow.json') to escape the cache directory. This allows arbitrary file reads of any JSON-parseable .json file on the filesystem accessible to the process. The current guard is only typeof chunkHash !== 'string' || chunkHash.length === 0 — it does not reject traversal sequences or absolute paths.
  • Evidence:
    // parse-cache.ts:163
    const getCacheChunkPath = (storagePath: string, chunkHash: string): string =>
      path.join(getCacheDirPath(storagePath), `${chunkHash}.json`);
    
    // parse-cache.ts:205-208
    for (const chunkHash of data.keys) {
      if (typeof chunkHash !== 'string' || chunkHash.length === 0) continue;
      const chunkRaw = await fs.readFile(getCacheChunkPath(storagePath, chunkHash), 'utf-8');
    path.join('/repo/.gitnexus/parse-cache', '../../../etc/shadow.json')/etc/shadow.json. In the save path, keys always come from computeChunkHash() which produces sha256 hex, so the save path is safe. But the load path trusts the index file.
  • Recommended fix: Add validation: if (!/^[a-f0-9]{64}$/.test(chunkHash)) continue;. Add a test with traversal-like keys (../foo, /absolute, ..\\windows). Fix this →
  • Blocks merge: Yes — DoD §2.5 explicitly requires no new path injection surfaces on paths that consume untrusted input. This is an untrusted local file.

Finding B — Misleading "atomic" comment on saveParseCache

  • Severity: Low
  • File/line: parse-cache.ts:232-235 (doc comment for saveParseCache)
  • Risk: The comment says "Persist the cache to disk atomically (write-and-rename)". The implementation is:
    1. rm tmpDir → 2. mkdir tmpDir → 3. write shards → 4. write index → 5. rm cacheDir6. rename tmpDir → cacheDir → 7. rm legacyFile.
      There is a crash window between steps 5 and 6 where no usable cache exists on disk. This is acceptable for a cache (regenerates on next run), but the "atomic" claim is misleading — true atomic replacement would rename into an existing target (POSIX rename is atomic but requires the target to be a file, not a directory; directory rename on POSIX removes the target before rename). After a crash at step 5-6, loadParseCache correctly returns an empty cache and analysis continues safely (re-parses all chunks). Not a data-loss risk for correctness, but the comment should not say "atomically."
  • Recommended fix: Change the comment to say "near-atomic (tmp-dir write + rename)" and note that a crash between rm and rename leaves an empty cache which is safely regenerated. Fix this →
  • Blocks merge: No — behavior is safe; comment accuracy is a maintenance concern.

Finding C — Version-mismatched sharded dir silently blocks legacy fallback

  • Severity: Medium
  • File/line: parse-cache.ts:190-230 (loadShardedParseCache / loadParseCache)
  • Risk: loadShardedParseCache returns null only on thrown exception (i.e., index file missing or unreadable). It returns emptyCache() (non-null) on version mismatch, corrupt JSON structure, or missing keys array. loadParseCache falls back to legacy only when sharded === null. Consequence: If a valid legacy parse-cache.json exists alongside a version-mismatched sharded directory (e.g., after a partial migration), loadParseCache returns empty instead of using the legacy file. The user loses incremental speedup until a successful new save.
    This is likely intentional (version-mismatched sharded cache exists → treat as "we already migrated, restart fresh") but it is not documented and not tested. The test at line 166 only tests version mismatch on the legacy format.
  • Recommended fix: Add a test for "sharded dir exists with version-mismatched index.json + valid legacy file → expect empty (document that legacy is not used in this scenario)." This makes the design decision explicit and verified. Fix this →
  • Blocks merge: Maybe — the behavior is likely intentional but untested. Flag to maintainer; acceptable if explicitly noted in PR description.

Finding D — CI has not run on the head SHA

  • Severity: Critical
  • File/line: GitHub Checks panel (PR fix: shard parse cache persistence on large repos #1580)
  • Risk: No typecheck (tsc --noEmit), unit tests, quality gate (prettier/ESLint), or security scans have validated this change. This is a fork PR from BlackOvOoo/GitNexus; fork PRs require CI authorization. The DoD §3 "No false-done" gate explicitly requires the validation baseline to have been run. ci-tests.yml (unit + integration, cross-platform), ci-quality.yml (tsc, prettier, ESLint), and security workflows (CodeQL, Gitleaks, Dependency Review) have not produced results.
  • Recommended fix: A maintainer must authorize CI to run on this fork PR, or the author must push to the head branch to trigger a new run. Include evidence of passing cd gitnexus && npx tsc --noEmit && npm test before merge. The targeted test command from the PR body is node_modules\.bin\vitest.cmd run test/unit/incremental-parse-cache.test.ts — this must pass, but the full suite must also pass.
  • Blocks merge: Yes — DoD §4.2 requires tsc --noEmit + npm test for gitnexus/ changes. Confirmed zero CI runs on head SHA.

Finding E — Hidden/bidi Unicode scan not completed

  • Severity: Informational
  • File/line: Both changed files
  • Risk: GitHub rendered Unicode warnings on changed files per the trigger comment. Shell execution restrictions in this review environment prevented running rg --pcre2 or grep -P scans.
  • Evidence from file content: Reading both files via the Read tool shows no visually unusual characters. The files contain standard TypeScript/ASCII. The commit author name (尹兰鑫) contains CJK characters in commit metadata, but that does not affect source files. No bidi controls are visible in the source text.
  • Recommended fix: A maintainer should run the following before merge to confirm:
    rg -n --pcre2 "[\x{202A}-\x{202E}\x{2066}-\x{2069}\x{200B}-\x{200F}\x{FEFF}]" \
      gitnexus/src/storage/parse-cache.ts \
      gitnexus/test/unit/incremental-parse-cache.test.ts
  • Blocks merge: Conditional — blocks if bidi/hidden controls are found in executable code; does not block if scan is clean.

Finding F — Single-shard size could theoretically still exceed string limits

  • Severity: Low
  • File/line: parse-cache.ts:244-247 (shard write loop)
  • Risk: Each shard is JSON.stringify(chunkResults, mapReplacer) for one ParseWorkerResult[]. The chunk budget is ~20MB source bytes (comment line 9). While ParseWorkerResult includes expanded nodes/relationships/symbols/etc., in practice the serialized shard is expected to be well under Node's ~512MB string limit for a 20MB source chunk. However, there is no guard against oversized individual shards.
  • Evidence: On pathologically dense codebases (e.g., generated code, minified JS), one chunk's parse output could expand significantly beyond source size. No stress test exists.
  • Recommended fix: Consider adding a try/catch around each individual shard writeFile that logs a warning if serialization fails on an oversized shard (the current outer try in run-analyze.ts handles the whole save, not per-shard). A comment explaining why chunk sizing makes this unlikely would also satisfy this concern.
  • Blocks merge: No — unlikely in practice given 20MB chunk budgets; not a regression vs. the old format (which was worse).

Finding G — Documentation not updated for new cache format

  • Severity: Low
  • File/line: AGENTS.md:158,160; GUARDRAILS.md:39
  • Risk: AGENTS.md says "the tree-sitter dispatch is skipped ... at .gitnexus/parse-cache.json" and "Safe to delete .gitnexus/parse-cache.json at any time". GUARDRAILS.md says "Safe to delete .gitnexus/parse-cache.json". After this PR, the cache lives in .gitnexus/parse-cache/ (directory). Users following these docs to manually clear the cache would delete only the legacy file, not the sharded directory.
  • Recommended fix: Update both references to mention .gitnexus/parse-cache/ (directory) and note that parse-cache.json is the legacy format. DoD §2.4 requires docs to be updated when user-visible behavior changes. Fix this →
  • Blocks merge: Maybe — DoD §2.4 flag; should not be deferred.

Finding H — Test gaps for sharded-format-specific scenarios

  • Severity: Medium
  • File/line: incremental-parse-cache.test.ts (overall)
  • Risk: Tests don't cover:
    1. Path traversal key in index.json — no test for keys: ['../foo'] or /absolute key being safely skipped (required for Finding A fix)
    2. Multiple entries → multiple shard files created — the round-trip tests use 0 or 1 entries; no test proves 3 entries produce 3 <hash>.json files
    3. Version-mismatched sharded index — only legacy version mismatch is tested (line 166-181)
    4. Save overwriting existing sharded cache — no test for a second saveParseCache call replacing the first
    5. Legacy file cleanup after sharded save — the empty cache test (line 145-146) checks parse-cache.json doesn't exist, but doesn't test that a pre-existing parse-cache.json is removed after save
  • Recommended fix: Add the missing test cases listed above. Fix this →%20path%20traversal%20key%20in%20index.json%20(e.g.%20keys%3A%20%5B'..%2Ffoo'%5D)%20is%20safely%20skipped%3B%202)%20multiple%20cache%20entries%20produce%20multiple%20shard%20files%3B%203)%20version-mismatched%20sharded%20index%20returns%20empty%20(not%20legacy%20fallback)%3B%204)%20second%20saveParseCache%20call%20correctly%20replaces%20the%20first%3B%205)%20pre-existing%20parse-cache.json%20is%20removed%20after%20a%20successful%20sharded%20save.%20Branch%3A%20fix%2Fsharded-parse-cache-persistence&repo=abhigyanpatwari/GitNexus)
  • Blocks merge: Maybe — item 1 is required once Finding A is fixed; the rest are coverage gaps, not wiring failures.

6. PR-specific assessment sections

Storage migration correctness: The ShardedParseCacheIndex format is minimal and correct — {version, keys[]} with no stale entries field. PARSE_CACHE_VERSION is consistently used in both save and load paths. ParseCacheFile is correctly retained as a read-only legacy interface. The CACHE_DIRNAME/CACHE_INDEX_FILENAME/LEGACY_CACHE_FILENAME constants are clean. ✓

Load/fallback semantics: Precedence is: sharded (if index readable) → legacy (if sharded index unreadable). A version-mismatched sharded index returns emptyCache() (not null), blocking legacy fallback (see Finding C). This is likely intentional but untested. Corrupt shards are skipped with partial cache returned — correct. ✓ with caveat (Finding C)

Save atomicity and crash behavior: Crash between rm(cacheDir) and rename(tmpDir, cacheDir) leaves no cache. Safe for a cache since loadParseCache returns empty and analysis regenerates safely. Save failures are wrapped in try/catch in the caller (run-analyze.ts:801-808) and produce only a warning log — correct. The "atomically" comment is misleading (see Finding B). ✓ with caveat (Finding B)

Path safety/cache poisoning:FailsloadShardedParseCache does not validate chunkHash values from index.json against a sha256 hex pattern before constructing paths. Path traversal possible (Finding A). This is the primary blocking issue.

Large-repo performance/memory: The fix correctly addresses the root cause by eliminating the single-object JSON.stringify. Each shard is serialized independently. For 50 chunks of 20MB source, each shard serializes one ParseWorkerResult[] — expected to be well under V8 string limits. ✓ with minor caveat (Finding F)

Cross-platform filesystem behavior: fs.rm with {recursive: true, force: true} is available in Node 14.14+. fs.rename on Windows fails if destination exists (but rm(cacheDir) precedes it). The .tmp naming (${cacheDir}.tmp) is a sibling of cacheDir — same filesystem on any typical setup, making rename atomic on POSIX. No backslash path issues since path.join handles separators. Save failures caught by caller. The PR body mentions the issue was on Windows 11/Node 24 — fs.rename on Windows may be sensitive to antivirus locks, but that's acceptable given the outer catch. ⚠️ Minor (Windows antivirus not tested, unlikely to be a hard blocker)

Test coverage: Good happy-path coverage (empty cache, missing cache, legacy fallback, corrupt shard tolerance, Map/Set round-trip, prune). Missing: path traversal key safety, version-mismatched sharded index, save overwrite, multiple-shard count assertion, legacy cleanup after save. See Finding H.

CI/security validation: Zero CI checks on the head SHA (Finding D). This is a merge blocker. TypeScript typecheck, vitest unit tests, prettier, ESLint, CodeQL, Gitleaks, and Dependency Review have not run on this PR.

Hidden Unicode/hygiene: Scan could not be completed in this environment. Visual inspection of both files shows standard ASCII TypeScript. Maintainer must run the rg --pcre2 scan before merge (Finding E).


7. Final verdict

not production-ready

Three lanes block this PR: (1) loadShardedParseCache does not validate chunkHash values from disk against the expected sha256 hex pattern, creating a path traversal surface for reads from a malicious or corrupted index.json (Finding A — DoD §2.5 violation, hard security gate); (2) zero CI checks have run on the head SHA, meaning typecheck, unit tests, quality gates, and security scans are entirely unvalidated (Finding D — DoD §3, §4.2 "no false-done" gate); (3) user-facing documentation (AGENTS.md, GUARDRAILS.md) still directs users to delete parse-cache.json rather than parse-cache/, which will fail to clear the cache after this PR ships (Finding G — DoD §2.4). The underlying approach is architecturally sound and correctly addresses the Invalid string length root cause; with the path-traversal guard, test gaps filled, CI authorized, and documentation updated, this PR would be mergeable.


@magyargergo

Copy link
Copy Markdown
Collaborator

@BlackOvOoo Can you please look into claude's findings? 🙏

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
9081 9080 0 1 452s

✅ All 9080 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 78.32% 28853/36836 N/A% 🟢 ███████████████░░░░░
Branches 66.72% 18299/27423 N/A% 🟢 █████████████░░░░░░░
Functions 83.1% 2882/3468 N/A% 🟢 ████████████████░░░░
Lines 81.6% 26044/31914 N/A% 🟢 ████████████████░░░░

📋 View full run · Generated by CI

magyargergo and others added 2 commits May 14, 2026 17:27
- Reject non-sha256-hex keys from index.json before path.join (path traversal).

- saveParseCache: skip invalid keys defensively; try/catch per-shard JSON.stringify.

- Clarify save comment (tmp dir + rename vs atomic).

- Tests: hex keys throughout, traversal keys, multi-shard, version-mismatch+legacy, second save, legacy removal.

- AGENTS.md / GUARDRAILS.md: document .gitnexus/parse-cache/ vs legacy parse-cache.json.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

Copy link
Copy Markdown
Contributor

✨ PR Autofix

Found fixable formatting / unused-import issues across 32 changed lines. Comment /autofix on this PR to apply them, or run npm run lint:fix && npm run format locally.

{"schema":"gitnexus.pr-autofix/v2","state":"fixes-available","pr_number":1580,"changed_lines":32,"head_sha":"17516966527ae6b1d24267380b283a6e60c4424d","run_id":"25872475754","apply_command":"/autofix"}

@magyargergo

Copy link
Copy Markdown
Collaborator

/autofix

@github-actions

Copy link
Copy Markdown
Contributor

✅ Applied autofix and pushed a commit. (apply run)

@BlackOvOoo

Copy link
Copy Markdown
Contributor Author

Thanks, I saw the follow-up commits and the merges from main. It looks like the main issues Claude pointed out have already been addressed in the branch. Just wanted to check if there’s anything else you’d still like me to handle on my side, or if I should just wait for the remaining checks/review.

@TonyReg

TonyReg commented May 15, 2026

Copy link
Copy Markdown
Contributor

Related issue: #1553
I did run the changes on one of my big repositories, works.

@magyargergo magyargergo added bug Something isn't working 1.6.4 labels May 16, 2026
@magyargergo magyargergo merged commit 263ca35 into abhigyanpatwari:main May 16, 2026
31 of 32 checks passed
hohaivu pushed a commit to hohaivu/GitNexus that referenced this pull request May 19, 2026
* fix: shard parse cache persistence on large repos

* fix(parse-cache): validate shard keys, docs, and sharded-cache tests

- Reject non-sha256-hex keys from index.json before path.join (path traversal).

- saveParseCache: skip invalid keys defensively; try/catch per-shard JSON.stringify.

- Clarify save comment (tmp dir + rename vs atomic).

- Tests: hex keys throughout, traversal keys, multi-shard, version-mismatch+legacy, second save, legacy removal.

- AGENTS.md / GUARDRAILS.md: document .gitnexus/parse-cache/ vs legacy parse-cache.json.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(autofix): apply prettier + eslint fixes via /autofix command

---------

Co-authored-by: Gergő Magyar <gergomagyar@icloud.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
hohaivu pushed a commit to hohaivu/GitNexus that referenced this pull request May 21, 2026
* fix: shard parse cache persistence on large repos

* fix(parse-cache): validate shard keys, docs, and sharded-cache tests

- Reject non-sha256-hex keys from index.json before path.join (path traversal).

- saveParseCache: skip invalid keys defensively; try/catch per-shard JSON.stringify.

- Clarify save comment (tmp dir + rename vs atomic).

- Tests: hex keys throughout, traversal keys, multi-shard, version-mismatch+legacy, second save, legacy removal.

- AGENTS.md / GUARDRAILS.md: document .gitnexus/parse-cache/ vs legacy parse-cache.json.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(autofix): apply prettier + eslint fixes via /autofix command

---------

Co-authored-by: Gergő Magyar <gergomagyar@icloud.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
hohaivu pushed a commit to hohaivu/GitNexus that referenced this pull request Jun 2, 2026
* fix: shard parse cache persistence on large repos

* fix(parse-cache): validate shard keys, docs, and sharded-cache tests

- Reject non-sha256-hex keys from index.json before path.join (path traversal).

- saveParseCache: skip invalid keys defensively; try/catch per-shard JSON.stringify.

- Clarify save comment (tmp dir + rename vs atomic).

- Tests: hex keys throughout, traversal keys, multi-shard, version-mismatch+legacy, second save, legacy removal.

- AGENTS.md / GUARDRAILS.md: document .gitnexus/parse-cache/ vs legacy parse-cache.json.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(autofix): apply prettier + eslint fixes via /autofix command

---------

Co-authored-by: Gergő Magyar <gergomagyar@icloud.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

1.6.4 bug Something isn't working

Projects

None yet

3 participants