perf: content-addressed parse and embedding caches for faster reindexing by davidbeesley · Pull Request #533 · abhigyanpatwari/GitNexus

davidbeesley · 2026-03-27T02:40:41Z

Summary

Cache deterministic per-file operations to reduce reindexing time.

Motivation / context

When under active development, code changes quickly and the graph gets stale. Reducing reindex time improves DX.

Areas touched

gitnexus/ (CLI / core / MCP server)
gitnexus-web/ (Vite / React UI)
.github/ (workflows, actions)
eval/ or other tooling
Docs / agent config only (AGENTS.md, CLAUDE.md, .cursor/, llms.txt, etc.)

Scope & constraints

In scope

Parse cache: content-addressed per-file Tree-sitter results (SHA-256 of file content).
Embedding cache: content-addressed per-node embedding vectors (SHA-256 of embedding input text).
Both caches survive --force (content-addressed, always safe to reuse).
Replaces old LadybugDB-based embedding cache (lost on --force, keyed by node ID not content).

Explicitly out of scope / not done here

Incremental LadybugDB load

Implementation notes

Shared cache-io.ts for atomic JSON read/write/delete with version validation
Parse cache stored in .gitnexus/parse-cache.json, embedding cache in .gitnexus/embedding-cache.json
Both cleaned up by gitnexus clean (lives inside .gitnexus/)
Embedding cache invalidates on model ID or dimension change
Parse cache version bump invalidates on ParseWorkerResult schema change
mergeParseResult extracted from parsing-processor.ts for shared use by worker merge and cache replay
Embedding pipeline defers ONNX model load until after cache check — skips entirely when 100% cached

Testing & verification

cd gitnexus && npm test — 33 new tests across parse-cache.test.ts and embedding-cache.test.ts
cd gitnexus && npx tsc --noEmit
Manual: analyze --force on GitNexus verified cache hit counts and timing improvements

Risk & rollout

Cache files are new — no migration needed, absent cache triggers full rebuild
--force no longer clears parse cache (intentional — content-addressed means always correct)
Run npx gitnexus analyze after upgrading to populate caches on first run

Checklist

PR body meets repo minimum length
If AGENTS.md / overlays changed: N/A
No secrets, tokens, or machine-specific paths committed

…ndexing

vercel · 2026-03-27T02:40:46Z

@davidbeesley is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

github-actions · 2026-03-27T05:17:47Z

CI Report

✅ All checks passed

Pipeline Status

Stage	Status	Details
✅ Typecheck	`success`	tsc --noEmit
✅ Tests	`success`	unit tests, 3 platforms
✅ E2E	`success`	gitnexus-web changes only

Test Results

Tests	Passed	Failed	Skipped	Duration
4708	4707	0	1	187s

✅ All 4707 tests passed

1 test(s) skipped — expand for details

buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric	Coverage	Covered	Base	Delta	Status
Statements	71.93%	12841/17851	72.24%	📉 -0.3	🔴 ██████████████░░░░░░
Branches	61.65%	8495/13779	62.16%	📉 -0.5	🔴 ████████████░░░░░░░░
Functions	76.36%	1131/1481	75.86%	📈 +0.5	🟢 ███████████████░░░░░
Lines	74.41%	11503/15458	74.72%	📉 -0.3	🔴 ██████████████░░░░░░

_{📋 View full run · Generated by CI}

magyargergo · 2026-03-27T05:20:05Z

@claude ultra-think while reviewing this pr and check the performance characteristics and try to find gaps and potential bugs. We need a way to benchmark this very soon.

github-actions · 2026-03-27T05:20:17Z

Claude finished @magyargergo's task in 3m 19s —— View job

PR Review: Content-Addressed Parse & Embedding Caches

Tasks

Read and understand the PR context
Read all changed files in depth
Analyze performance characteristics
Identify bugs and correctness gaps
Assess benchmarking readiness
Post full review

Overview

The design is clean and the core abstractions are sound. Content-addressing both caches is the right call — it makes --force not destroying the cache safe and simplifies invalidation logic. The atomic write pattern (tmp → rename) is correct. Tests cover the happy paths and basic error handling well.

That said, deep-reading the integration across pipeline.ts, embedding-pipeline.ts, and analyze.ts surfaced several real bugs and performance issues worth addressing before merging.

Bugs

1. `pruneCache` is imported but never called — cache grows forever

analyze.ts:18 imports pruneCache. pipeline.ts:6 also imports it. Neither file actually calls it.

Impact: Every renamed or deleted file accumulates a dead entry in parse-cache.json permanently. On an actively-developed large repo over weeks, the cache file balloons with dead entries.

Fix: Call pruneCache(parseCache, new Set(allPaths)) in analyze.ts after pipelineResult returns and before saveParseCache.

Fix this →

2. `filterRawResultsByFile` is O(n × m) — quadratic for large chunks

pipeline.ts:537-572 — for each newly-parsed file, this function iterates over all raw worker results and scans every nodes/calls/imports/etc. array to extract per-file data.

For a 20MB chunk with 800 files producing 100,000 call records: filterRawResultsByFile is called 800 times, each scanning 100,000 calls = 80 million comparisons just for calls. On a mid-size repo this can add several seconds of post-parse overhead.

Better approach: During processParsingWithWorkers, the worker already has per-file results; emit a Map<filePath, ParseWorkerResult> alongside the merged result, so the cache-write loop is O(1) per file.

Fix this →

3. Sequential fallback never writes to the parse cache

pipeline.ts:777-783:

if (parseCache && chunkWorkerData?.rawResults) {
  // ← only runs when rawResults is set (worker path only)
}

When a worker pool cannot be created (small repo under 15 files, or fallback triggered), processParsing returns null and the guard chunkWorkerData?.rawResults is falsy. The sequential path also never checks the cache — it goes straight to processParsingSequential. Small repos and fallback scenarios never benefit from caching at all, and if a repo shrinks below the worker threshold after previously being cached, the cache entries go stale.

4. Double SHA-256 computation in embedding pipeline

embedding-pipeline.ts:184-193 hashes every text to check allCached, then embedding-pipeline.ts:231-241 hashes them again during the main cache-lookup pass:

// Pass 1 — just to check allCached:
for (let i = 0; i < allTexts.length; i++) {
  if (!embeddingCache.entries[embeddingTextHash(allTexts[i])]) { ...}
}

// Pass 2 — same hashes recomputed:
for (let i = 0; i < nodes.length; i++) {
  const hash = embeddingTextHash(allTexts[i]);  // ← duplicate
  ...
}

For 50,000 nodes with 100-char average text, SHA-256 over all texts twice wastes a measurable amount of time. Combine the two passes — compute the hash once in Pass 2, and remove the separate allCached check.

Fix this →

5. Potential false-positive in `belongsToFile` path matching

pipeline.ts:550-551:

const filePathSuffix = `:${filePath}`;
const belongsToFile = (id: string) =>
  id === fileNodeId || id.includes(filePathWithColon) || id.endsWith(filePathSuffix);

The endsWith(filePathSuffix) arm can match an ID belonging to a different file that shares a path suffix. Example: if filePath = "src/foo.ts", then filePathWithColon = ":src/foo.ts:" — the includes check would also match an ID like "Class:vendor/lib/src/foo.ts:Bar" since it contains :src/foo.ts:. This would cause cache entries to include relationships from different files, corrupting the per-file cached result.

Fix this →

Performance Concerns

6. Single JSON blob — scalability wall

Both caches are serialized as one JSON object. For a repo at the 50,000-node limit:

Embedding cache: 50,000 × 384 floats ≈ 75MB JSON on disk and in memory per run
Parse cache on a 3,000-file TS repo: each ParseWorkerResult entry carries nodes, relationships, symbols, calls, imports — could easily reach 100–200MB

JSON.parse on a 100MB string is synchronous and blocks the event loop for ~500ms. JSON.stringify on the same data is similarly expensive. On every analyze run — even when nothing changed and 100% hits — the full cache is loaded, parsed, and re-serialized.

Recommendation for a follow-up: Sharding the cache by filename prefix (e.g., first 2 hex chars of file hash → separate shard file) would reduce per-run I/O to only the shards touched by changed files.

7. Embedding cache loaded into memory before model init check

analyze.ts:275 loads the full embedding cache (potentially 75MB) unconditionally when --embeddings is passed. If the cache dimensions/modelId don't match, it's immediately discarded and re-created empty. This is wasteful — check the metadata fields first before deserializing the full entries.

Benchmarking Gap

The PR explicitly mentions --force on GitNexus was tested manually and the need for benchmarks is acknowledged. Here's a concrete proposal for the required benchmark infrastructure:

Suggested additions in a follow-up:

gitnexus/bench/
  run-analyze-bench.ts      # Runs analyze twice, reports cold vs warm timing + cache stats
  fixtures/                 # Small synthetic repo with ~500 known files

Key metrics to capture:

Metric	How
Cold parse time (no cache)	First run, capture total time
Warm parse time (100% cached)	Second run, assert `cacheHits == totalFiles`
Cache file size	`stat .gitnexus/parse-cache.json`
Embedding cold vs warm	With `--embeddings`, run twice
Memory peak (RSS)	`process.memoryUsage().rss` before/after each phase

The CI report shows 4701 tests passing and full typecheck success — the functional correctness is solid. The benchmarking gap is the main outstanding need before this change can be properly validated at scale.

Minor Notes

EmbeddingCache validation lives in caller, not in loadEmbeddingCache: The dimensions/modelId staleness check is in analyze.ts lines 276-280, not in the cache module itself. If another caller uses loadEmbeddingCache without re-implementing this check, stale vectors silently poison results. Consider adding a validateEmbeddingCache(cache, dimensions, modelId): boolean helper to embedding-cache.ts.
Parse cache keys are absolute paths: If the same cache is used across machines (e.g., checked into a shared volume) or after git worktree, all entries will miss. Not a current use case but worth documenting.
cacheReplayAccumulators draining: The drainAndMerge pattern (pipeline.ts line 788) is clever for cross-chunk accumulation but the comment says "drain" while it actually copies and clears. The name cacheReplayAccumulators for a variable that also accumulates fresh-parse data from cache hits is somewhat confusing.

Summary

Issue	Severity
`pruneCache` never called — unbounded cache growth	High (correctness)
`filterRawResultsByFile` O(n×m) post-parse overhead	High (performance)
Sequential path never cached	Medium
Double SHA-256 in embedding pipeline	Low (perf)
`belongsToFile` false positives	Medium (correctness)
JSON blob scalability	Medium (future)
Missing benchmark harness	High (validation)

The core design is right and the tests are good. Fixing issue #1 (prune) and #2 (O(n×m)) before merge would be my recommendation — both have clear, bounded fixes. The benchmarking infrastructure can land in a follow-up once the gaps above are addressed.

magyargergo · 2026-03-27T05:52:53Z

@davidbeesley could you please look at it? 🙏

davidbeesley · 2026-03-27T08:12:12Z

Took a crack at the list:

pruneCache never called — Addressed.
filterRawResultsByFile O(n×m) — Addressed.
Sequential fallback never cached — Not addressed. The worker and sequential paths are architecturally different as workers return per-file rawResults that feed into cache storage, while the sequential path writes directly into the graph without producing per-file results. Only parse cache writes are skipped, the embedding cache is unaffected since it operates on nodes already in the graph regardless of how they were parsed.
Double SHA-256 in embedding pipeline — Addressed.
belongsToFile false positives — Addressed.
Single JSON blob / sharding — Not addressed. Cli tool has to have the entire cached equivalent content in memory to do its work anyway. Sharding would add complexity and overhead without a clear performance benefit. Could revisit later if profiling shows io or memory bottlenecks due to .json file size.
Embedding cache loaded before model check — Addressed.
Missing benchmark harness — Not addressed.

magyargergo · 2026-03-27T08:19:08Z

Thank you for your contribution! This is now blocked on #536.

magyargergo · 2026-03-28T14:43:27Z

⚠️ Upcoming Prettier formatting — rebase instructions

PR #563 adds Prettier as the code formatter for the repo. When it merges, the bulk format commit will touch ~350 files (style-only: whitespace, quotes, trailing commas). Your branch will likely conflict.

After #563 merges, rebase your branch:

git fetch origin
git checkout <your-branch>
git rebase origin/main

# Conflicts will be formatting-only — accept your version:
git checkout --theirs .
git add .
git rebase --continue

# Then re-format your branch to match the new style:
npx prettier --write .
git add -A
git commit -m "style: apply prettier formatting"
git push --force-with-lease

New setup step: Run npm install at the repo root (not just in gitnexus/) to get prettier + activate the pre-commit hook. The hook auto-formats staged files on every commit going forward.

magyargergo · 2026-05-11T07:04:38Z

Please submit a new PR if this is still relevant

…pe-res short-circuit) (#1479) * docs: incremental indexing design spec Captures the design agreed in brainstorming on 2026-05-10: - Transitive importer closure with public-surface-change optimization - Git-only change detection (non-git repos: full rebuild as today) - New default behavior; --force opts out - New hydratePhase + loadGraphFromLbug primitive - Iterative closure expansion with parseCache reuse - incrementalInProgress dirty flag for crash recovery Prior art: PR #592 (zenprocess), PR #533 (davidbeesley), PR #1146 (azeemshaik025) — referenced and credited. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(communities): seed Leiden RNG for deterministic community detection The vendored Leiden algorithm defaults to Math.random for tie-breaking and randomized walks, which produces non-deterministic community assignments and modularity values across runs on the same graph. Pass a seeded mulberry32 RNG (LEIDEN_SEED=0xC0DE) so: - The same graph always produces the same partition - Modularity values are reproducible - Equivalence tests for incremental indexing can compare community assignments byte-for-byte This is foundational for the upcoming incremental-indexing feature (see docs/superpowers/specs/2026-05-10-incremental-indexing-design.md) where the correctness contract is incremental output ≡ full rebuild output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(incremental): change-detection, surface signatures, closure expansion Three new modules supporting the incremental-indexing pipeline: * core/incremental/git-diff.ts — getChangedFilesSinceCommit() unions 'git diff lastCommit HEAD' (committed) with 'git status --porcelain' (dirty tree). Renames flattened to delete(orig) + add(new). Throws LastCommitMissingError when lastCommit is gone (caller falls back to full rebuild). * core/incremental/surface.ts — extractSurfaceSignature() produces a stable hash of a file's publicly-visible symbols (functions, classes, methods, interfaces, types, heritage). Body-only edits → same hash. Signature/heritage changes → different hash. Drives the closure scoping optimization. * core/incremental/closure.ts — computeImporterClosure() iterative fixpoint: parse each closure file, extract surface, query DB importers, expand. Uses a parseCache so each file is parsed once. Generic over TParseResult so closure logic is decoupled from the pipeline's parse representation. 32 unit tests across the three modules. Tests cover edge cases: clean tree, dirty-only, mixed, renames, deletes, multi-hop cascade, cycle termination, surface invariance, etc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(lbug): loadGraphFromLbug, queryImporters, deleteAllCommunitiesAndProcesses Three new primitives in lbug-adapter.ts to support incremental indexing: * loadGraphFromLbug(graph, unchangedFilePaths) — streams all nodes for files in the set across every hydratable node table (excludes Community/Process — graph-wide, regenerated downstream). Then loads edges where both endpoints belong to loaded nodes, excluding MEMBER_OF / STEP_IN_PROCESS edges (also graph-wide). FilePaths chunked at 200 per query to keep statement size bounded on huge repos. Endpoint-level join filters by source-side filePath in the query, target-side checked JS-side via the loadedNodeIds set. * queryImporters(targetFilePath) — returns DISTINCT a.filePath where a -[IMPORTS]-> b and b.filePath = target. Powers closure expansion: when a changed file's surface signature changes, all its importers must be re-parsed. * deleteAllCommunitiesAndProcesses() — drops Community/Process nodes (and their edges via DETACH DELETE) at the start of each incremental run so the communities/processes phases regenerate them from the fully-merged graph. Required for the 'Leiden runs on full graph' correctness invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(pipeline): hydrate phase + parse-filter for incremental indexing Wires the incremental-indexing infrastructure into the phase-based pipeline. Three coordinated changes: * New hydratePhase (deps: structure) — loads node/edge state for files OUTSIDE ctx.options.filesToParse from the existing LadybugDB index. Runs before parse so the parse phase can produce a partial graph while downstream phases (mro, communities, processes) still see the full graph. No-op in full-rebuild mode (filesToParse unset). * PipelineOptions.filesToParse: optional ReadonlySet<string>. When set, parse phase filters scanned files to this set; hydrate fills the complement. Set by runFullAnalysis when it detects an eligible incremental run; never set by callers directly. * gitnexus-shared PipelinePhase enum: 'hydrate' added so progress callbacks can report the new phase distinctly from 'structure'. Phase order: scan → structure → hydrate → markdown,cobol → parse → routes,tools,orm → crossFile → scopeResolution → mro → communities → processes. Communities (Leiden) still runs on the full graph, satisfying the correctness invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(analyze): incremental orchestrator branch + meta schema Wires incremental indexing into runFullAnalysis. Highlights: * RepoMeta schema extended: schemaVersion, surfaceSignatures, and incrementalInProgress fields. INCREMENTAL_SCHEMA_VERSION = 1. * core/incremental/file-hash.ts — v1 surface signature: SHA-256 of file content. v2 will switch to a true surface-only signature (defined in surface.ts) so body-only edits don't expand the closure. The plumbing is signature-agnostic so the swap is local. * core/incremental/orchestrator.ts — eligibility check, closure computation (uses file-hash as the surface signal), dirty-flag management, subgraph extraction, signature merge. * run-analyze.ts adds: - hasDirtyTree() check on the existing 'lastCommit==HEAD' early-exit so an uncommitted edit triggers re-index (was a coarse equality check before). - incremental branch: try incremental first; fall through to full rebuild on any setup failure or eligibility miss. - runIncrementalBranch() — opens existing DB, deletes closure-file rows + Community/Process, runs pipeline with filesToParse, writes only the changed-subgraph back, refreshes FTS, updates meta with new surfaceSignatures and clears the dirty flag. - Full-rebuild path now populates surfaceSignatures + schemaVersion in meta.json so the next run is eligible for incremental. Crash recovery: incrementalInProgress is set BEFORE any DB mutation and cleared on success by overwriting meta.json. A crash anywhere in between leaves the flag set, and the next analyze run forces a full rebuild (cheapest path back to a known-good index). v1 limitation documented: body-only edits trigger 1-hop closure expansion (content-hash signal). True surface-only optimization is deferred to v2 — see design doc for the integration path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(incremental): drop invalid --no-renames=false from git diff The flag --no-renames=false isn't valid git syntax (it's parsed as a file path). Git's default rename detection is on; removing the flag keeps that behavior. Caught while running an end-to-end smoke test against a small fixture repo: incremental setup failed with 'Command failed: git diff --name-status -z --no-renames=false ...'. After the fix, the incremental path runs cleanly: closure is computed, hydrate phase loads unchanged-file state from DB, parse phase only re-parses files in closure, and the writeback updates only changed nodes/edges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Revert v1 incremental indexing (5 commits) Reverts the v1 design that parsed only closure files into a fresh graph and tried to hydrate the rest from DB. Real-repo equivalence test failed: cross-file resolution operates on partial parse data (closure files only), so CALLS edges that resolve through unchanged files silently fall off. Diff against full rebuild on the same edited state: -50 nodes, -425 edges, -5 communities, -48 processes. Architecture pivot: switch to PR #533-style content-addressed parse cache. Pipeline parses every file (cache-served when possible), giving cross-file resolution full data, with DB writeback then restricted to changed-file rows. Reverts: d4b9de4 fix(incremental): drop invalid --no-renames=false f35f763 feat(analyze): incremental orchestrator branch + meta schema bc03968 feat(pipeline): hydrate phase + parse-filter 98bb893 feat(lbug): loadGraphFromLbug, queryImporters, ... aa8d7ae feat(incremental): change-detection, surface signatures, closure Kept: d9e340b feat(communities): seed Leiden RNG (foundational) 8235ca3 docs: incremental indexing design spec (will be revised) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(analyze): incremental DB writeback (Option B) Equivalence-preserving incremental analyze. The pipeline still parses every file (correctness invariant: cross-file resolution / scope resolution / MRO / community detection all need full graph data); the saving comes from selectively replacing only changed-file rows in LadybugDB instead of wiping and reloading the whole graph. How it works: * On every analyze, we hash all source files (SHA-256 of content) and store the map in meta.json.fileHashes alongside schemaVersion. * The next run loads the prior map and diffs: - changed: content hash differs → file's DB rows replaced. - added: not in prior map → file's DB rows inserted. - deleted: in prior map but not on disk → file's DB rows dropped. * If the diff is non-empty AND no --force / no schema mismatch / no dirty flag, take the incremental path: - Set incrementalInProgress dirty flag (BEFORE any DB mutation). - Open existing DB (no wipe). - deleteNodesForFile() for each changed/added/deleted file. - deleteAllCommunitiesAndProcesses() — Leiden regenerates these. - extractChangedSubgraph() from the in-memory ctx.graph: nodes whose filePath is in the writable set + Community + Process + edges with at least one endpoint in the writable set (edges entirely between hydrated unchanged nodes are skipped — already in DB). - loadGraphToLbug() on the subgraph. Unchanged-file rows in DB untouched. - Recreate FTS indexes. - Update meta with new fileHashes; clear dirty flag. * Otherwise full-rebuild path runs as before. Crash recovery: incrementalInProgress is the dirty flag. Set before destructive ops; cleared on success. Set on next-run startup → forces full rebuild (cheapest path back to known-good). Other changes: * Dirty-tree gate on the existing 'lastCommit==HEAD' early-return: uncommitted edits no longer slip through as 'already up to date'. * deleteAllCommunitiesAndProcesses helper in lbug-adapter. * Skip the embedding cache+restore cycle when willTryIncremental is true — embeddings stay in DB; re-inserting them would PK-conflict. End-to-end equivalence verified on this repo (993 files, 24K nodes): incremental run produces byte-identical {nodes, edges, clusters, flows} to a full rebuild from the same edited state. Speedup is currently modest (~5% on this repo) because the parse phase still runs in full. Parse-cache integration is a separate follow-up that composes cleanly on top of this work. See docs/superpowers/specs/2026-05-10-incremental-indexing-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(analyze): chunk-level parse cache for full incremental speedup Composes with the incremental DB writeback (commit 27f3b49) to deliver the major-speedup half of incremental indexing. Previously, the parse phase ran in full on every analyze; the speedup came purely from selective DB rewriting. With this commit the parse phase also reuses prior tree-sitter output for chunks whose contents haven't changed. How it works: * Cache layer (gitnexus/src/storage/parse-cache.ts): - File: <repo>/.gitnexus/parse-cache.json. Versioned, atomic write. - Key: chunk content hash = sha256(sorted(filePath:fileContentHash for each file in chunk)). - Value: ParseWorkerResult[] (raw worker output for the chunk, pre-merge). - Granularity: per chunk (~20MB byte-budget). A change to one file invalidates only its chunk — typically 1 of ~50 on a 1000-file repo (~98% cache hit ratio on a small edit). * Worker contract (gitnexus/src/core/ingestion/parsing-processor.ts): - Extracted the chunk-result merge loop into a public mergeChunkResults() so the same logic applies to live worker output AND replayed cache entries. - processParsingWithWorkers / processParsing accept an optional outRawResults out-parameter that captures worker output before merging — used by parse-impl to populate the cache after a miss. * Parse phase wiring (parse-impl.ts): - For each chunk, compute its content hash (after reading file contents). Cache hit → mergeChunkResults() on cached results, skip the worker dispatch entirely. Cache miss → run workers normally, capture raw results, store under the chunk hash. - Cache mutations happen in-place on the ParseCache passed via PipelineOptions.parseCache. * Lifecycle (run-analyze.ts): - loadParseCache() before pipeline runs. - Cache passed via runPipelineFromRepo's PipelineOptions. - saveParseCache() after the pipeline + DB writeback succeed. Equivalence verified on this repo (993 files, 24K nodes): Cold (no cache, full work): 141.1s Warm cache + 1-file edit, incremental: 63.6s ← 55% speedup Warm cache + 1-file edit, --force: 71.6s ← 49% speedup All three runs produce byte-identical {nodes, edges, clusters, flows}. The cache survives --force (content-addressed = always correct), so even forced rebuilds get the parse-skip benefit. Why chunk-level rather than per-file: workers process sub-batches and emit aggregated ParseWorkerResults. Per-file granularity would require restructuring the worker contract; chunk-level captures most of the practical speedup with no worker-side changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(parse-impl): smaller default chunk budget (20MB→2MB) for cache granularity The parse cache is keyed at chunk granularity. With the previous 20MB budget, a typical mid-size repo (e.g. this worktree at 9MB total parseable source) fits in a single chunk — meaning ANY file change invalidates the whole chunk and re-parses every file. 2MB default produces ~5x more chunks on the same input, so a one-file edit invalidates ~1/N of cached chunks instead of the whole thing. Cold-run overhead from more chunks is <5% (one extra serialization pass per chunk). Override via GITNEXUS_CHUNK_BYTE_BUDGET env var for benchmarking. Measured on this repo (~9MB / 887 parseable files): Cold (no cache): 143s Warm cache, no source changes: 2s (early-return) Warm cache + 1-file edit: 81s (~43% off cold) Speedup is bounded by the scopeResolution phase (~58s flat regardless of parse cache) and by GitNexus's own auto-writes during analyze (AGENTS.md / .claude/skills/ etc. mutate between runs and invalidate chunks containing them). Both are addressable in follow-ups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(scope-resolution): reuse worker-produced ParsedFile + stabilize chunk order Two compounding optimizations that drop warm-cache analyze from ~134s to ~38s on a 1000-file repo (72% faster), and cold rebuild from ~143s to ~86s (40% faster) by short-circuiting work that was previously re-done. 1. SCOPE-RESOLUTION: REUSE WORKER PARSEDFILE Previously, the scope-resolution phase re-parsed every file with tree-sitter on the main thread (~58s on a 1000-file repo) because worker-produced tree-sitter Trees can't cross the worker MessageChannel. But the worker ALSO produces a artifact via , which structured-clones fine — and it's exactly what scope-resolution would re-derive. Threading those ParsedFiles through the parse phase () into ( map) lets scope- resolution skip its extract loop on a per-file basis. The fast path is bounded only by per file (cheap graph mutation). On this repo: scopeResolution went from 58s → 5s. 2. MAP-PRESERVING PARSE-CACHE SERIALIZATION is a which JSON.stringify collapses to . The first attempt at threading parsedFiles through the parse cache crashed at runtime with "importerModule.typeBindings is not iterable" because cached entries came back as plain objects. Added a JSON replacer/reviver pair in parse-cache.ts that round-trips Map and Set instances through tagged plain objects (). Symmetric: save uses replacer, load uses reviver. 3. STABLE CHUNK ORDERING The byte-budget chunker walked files in filesystem-scan order, which on Windows isn't guaranteed to be stable across runs. Even with identical source content, two scans could place files in different chunks, shifting chunk hashes and causing 100% parse-cache misses. Added a deterministic alphabetical sort on before chunking. Chunk membership is now stable across runs, so a single-file edit invalidates exactly one chunk, not all of them. Measured on this repo (993 files, 24K nodes): Cold rebuild: 86s (was 143s) Warm cache, no source changes: 3s (early-return) Warm cache + 1-file edit: 38s (was 134s) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(incremental): update spec + AGENTS.md + GUARDRAILS.md for shipped design - Rewrite docs/superpowers/specs/2026-05-10-incremental-indexing-design.md to describe the architecture that actually shipped (parse cache + incremental DB writeback + scope-resolution short-circuit), with the v1 hydrate-phase post-mortem preserved as historical context. - AGENTS.md "Keeping the Index Fresh" section: note that incremental is the new default and --force is the explicit opt-out; mention the parse-cache file location and that it's safe to delete. - GUARDRAILS.md Signs: add an "Index seems corrupt or incremental is misbehaving" entry pointing users to --force as the manual escape hatch (the dirty flag handles automatic recovery). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(autofix): apply prettier + eslint fixes via /autofix command * fix(incremental): bugbot review + CI test failures Bugbot (PR #1479): - Medium: pruneCache was exported but never called -> cache grew unbounded. Wire pruneCache into run-analyze before saveParseCache, using a transient usedKeys Set on ParseCache that the parse phase populates as it processes chunks. - Low: willTryIncremental (pre-pipeline) and isIncremental (post-pipeline) could desync, silently dropping embeddings on mispredicted runs. Removed the prediction; the embedding cache now loads unconditionally when shouldLoadCache is true. The re-insert step gates on the actual isIncremental value to avoid PK-conflicts when the incremental-writeback path keeps DB rows. CI test failures: - cli-e2e #1169 + run-analyze.test.ts #1233: my dirty-tree gate on the lastCommit==HEAD early-return saw GitNexus's own auto-generated outputs (.claude/, .cursor/, AGENTS.md, CLAUDE.md) as dirty, perpetually defeating the up-to-date fast path. Extended the pathspec exclusion to cover all auto-gen outputs, not just .gitnexus/. - ruby field-type disambig: my chunk-stability sort exposed a pre-existing order-dependency in Ruby cross-file resolution (`user.address.save -> Address#save` only resolves correctly when user.rb parses before address.rb in some configurations). Removed the sort. Filesystem ordering is stable enough in practice that the parse cache still hits the common case; the pre-existing fragility is left for a separate fix. - pipeline-graph-golden: regenerated. Seeded Leiden RNG produces a partition different from the previous Math.random snapshot. - staleness `parallel calls` was a CI timing flake; passes locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(incremental): re-insert cached embeddings on incremental path Bugbot re-review caught: deleteNodesForFile cascades to the CodeEmbedding table (DELETE WHERE e.nodeId STARTS WITH ...), so changed-file embedding rows are wiped along with their nodes. The previous fix gated re-insert on `!isIncremental`, which silently dropped those embeddings — a regression versus the full-rebuild path's "preserve embeddings by default" guarantee. Remove the `!isIncremental` gate. The per-batch try/catch already handles the unchanged-file PK-conflict case ("some may fail if node was removed, that's fine") with the same semantics, so re-inserting the full cached set on incremental works: - changed-file rows: deleted, then re-inserted from cache (preserved) - unchanged-file rows: still in DB, re-insert PK-conflicts and is silently ignored (existing rows are correct) Cost: re-inserting ~24K embeddings on incremental when only a few files changed — most are no-op conflicts. Bounded by batch size of 200; ~3-5s overhead. Worth it for correctness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(incremental): address Claude+Bugbot review findings + remove design doc Addresses CHANGES_REQUESTED review on PR #1479: 1. Remove docs/superpowers/specs/2026-05-10-incremental-indexing-design.md per maintainer request. 2. BLOCKER (Claude Finding 1, Bugbot Round 3): Stale cross-file edges between unchanged files. extractChangedSubgraph excluded edges where both endpoints were unchanged-file nodes — when a barrel/re-export file changes, cross-file resolution may update CALLS edges between two unchanged files that would then be silently lost. Fix: 1-hop importer-closure expansion of the writable set in run-analyze.ts. Before deleting/rewriting rows, query DB for importers of every changed/deleted file and add them to the writable set. Their nodes get deleted+rewritten too, so cross-file's refined edges land in the DB. Re-added queryImporters to lbug-adapter.ts. 3. BLOCKER (Claude Finding 3): Parse cache key omitted parser version. After a GitNexus upgrade, the cache silently replays pre-upgrade ParseWorkerResults against the new schema → wrong CALLS/IMPORTS/ scope edges with no visible signal. Fix: PARSE_CACHE_VERSION now embeds the gitnexus npm package version (read at module load via createRequire on package.json). Format: `${SCHEMA_BUMP}+${PKG_VERSION}` e.g. "1+1.6.4". Any release that bumps package.json automatically invalidates the on-disk cache. Mismatched versions fall through to an empty cache (next save overwrites with the new version baked in). 4. BLOCKER (Claude Finding 2): No automated tests for incremental behavior. Added 28 unit tests across 3 files: - incremental-file-hash.test.ts (10 tests) diffFileHashes classification, computeFileHash determinism, computeFileHashes batch / missing-file tolerance, sorted output. - incremental-parse-cache.test.ts (12 tests) computeChunkHash stability and order-independence, version prefix format, pruneCache, load/save round-trip on empty / missing / corrupt / version-mismatched files, AND a Map/Set round-trip test that pins the JSON replacer/reviver behaviour (without it, ParsedFile.scopes[*].typeBindings collapses to {} and downstream `.get()` / iteration throws). - incremental-subgraph-extract.test.ts (6 tests) writable-set node inclusion, Community/Process always kept, edge inclusion when at least one endpoint is writable, MEMBER_OF edges via graph-wide endpoints, empty subgraph case. 5. Medium (Claude Finding 6): AGENTS.md "Keeping the Index Fresh" said "only changed files are re-parsed." Imprecise — the pipeline parses every file every run; the cache skips tree-sitter for chunks whose contents haven't changed. Reworded to match the design doc. Test plan still expects: [x] Typecheck clean [x] All 28 new unit tests pass [x] All previously-failing tests still pass on the rebased branch [x] Equivalence verified locally (incremental ≡ --force, byte-identical stats on this repo) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(incremental): round 3 review feedback — bounded BFS, atomic meta, integration test, docs Addresses remaining findings on PR #1479 from Claude's re-review of commit ad7bd31 + verifies the outstanding Bugbot HIGH severity. 1. F1 — Transitive importer expansion (Claude, was Medium-but-noted). Previous 1-hop importer expansion missed barrel re-export chains (A imports C, C re-exports B; when B changes, only C was pulled in — A was left with potentially-stale CALLS edges to refined targets). Replaced the single pass with a bounded BFS over the IMPORTS graph (depth ≤ 4). Catches nested barrel pyramids without ballooning into a near-full rebuild on monorepos with deep re-export trees. `--force` remains the escape hatch documented in GUARDRAILS.md for cases that exceed the bound. 2. F2 — Integration test for incremental orchestration (Claude, BLOCKER, DoD §2.7). The unit tests added in ad7bd31 covered `diffFileHashes`, `extractChangedSubgraph`, `computeChunkHash`, `pruneCache`, and the Map/Set JSON round-trip — but none of them exercised the real `runFullAnalysis` orchestration. Added gitnexus/test/unit/ incremental-orchestration.test.ts with four end-to-end tests against a real git-initialized fixture repo + real LadybugDB: a. First run populates fileHashes + schemaVersion and clears incrementalInProgress on success. b. Second run on unchanged state takes the alreadyUpToDate fast path (early-return). c. Second run after a source edit takes the incremental path (not full rebuild) and rotates fileHashes for the touched file while keeping the dirty flag cleared. d. A pre-set incrementalInProgress flag forces a full rebuild that clears it (crash-recovery wire). These would catch any regression that wires `isIncremental` from a pre-pipeline prediction (the Bugbot finding from commit 5eb0597) or accidentally re-gates the embedding re-insert on `!isIncremental` (the Bugbot finding from commit 60c10f1). 3. F3 — GUARDRAILS.md docs accuracy (Claude, Low). Line 33 still said "only changed files are re-parsed" — AGENTS.md was already corrected in ad7bd31 but GUARDRAILS.md was missed. Reworded to match. 4. F5 — Atomic saveMeta (Claude, Medium; vvladescu-tb fork). The dirty flag (`incrementalInProgress`) travels through meta.json. A crash mid-write would leave a corrupt meta.json that `loadMeta` would silently treat as "no prior index", losing the flag and skipping recovery. Switched to tmp-file + rename matching saveParseCache. 5. Bugbot's "Subgraph edges reference nodes absent from subgraph" (HIGH severity). Verified as FALSE POSITIVE: `getNodeLabel` in lbug-adapter.ts derives labels from the node-ID string (parses the table prefix), not from the in-memory graph. The CSV generator writes (src_id, dst_id, type) rows without consulting node objects; `splitRelCsvByLabelPair` routes by ID-derived label; `COPY ... (from=X, to=Y)` resolves both endpoints against the live LadybugDB where unchanged-file nodes still exist. No fix needed. All 213 tests pass locally (including the 4 new integration tests and the previously-failing CI tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(incremental): address Bugbot round-4 findings (added-file shadow seed + dedupe) Bugbot review on commit e23e440 surfaced two new findings against the incremental writeback in run-analyze.ts: HIGH — Incremental BFS misses importers of newly added files. queryImporters() reads the pre-pipeline DB. For a NEWLY ADDED file there are no IMPORTS rows pointing to it yet, so unchanged files whose pre-existing import statements now resolve to the newcomer keep stale CALLS edges pointing at the OLD resolution target. LOW — Deleted files double-counted in filesToDelete. hashDiff.deleted entries can reappear in writableFiles via the BFS expansion (queryImporters can return a now-deleted path), so deleteNodesForFile() ran twice for the same file. Fixes: - Add gitnexus/src/core/incremental/shadow-candidates.ts: derive the pre-existing file paths whose JS/TS module-resolution claim an added file can steal. Pattern catalogue: same-basename/ different-extension, bare-file-beats-directory-index, and directory-index-beats-bare-file. Emit both POSIX and Windows separators because the prior fileHashes map may have been written from either OS. - In run-analyze.ts, seed the BFS frontier with shadow candidates that exist in the prior meta.fileHashes. Their importers — found via queryImporters — get pulled into the writable set so their CALLS edges re-resolve against the new file. - Dedupe filesToDelete via Set to avoid the double-call. Tests: gitnexus/test/unit/incremental-shadow-candidates.test.ts — 8 cases covering each shadow pattern, separator handling, .d.ts as a single extension token, deduplication, and the no-self-shadow invariant. All 40 incremental tests (file-hash, parse-cache, subgraph-extract, shadow-candidates, orchestration) pass locally. Note on the third Bugbot finding ("Subgraph edges reference nodes absent from subgraph"): re-anchored from a prior review pass — the code at subgraph-extract.ts:48 is unchanged. Already verified as a false positive: getNodeLabel parses labels from ID strings, CSV write is by ID, and COPY resolves against the live DB. * chore(autofix): apply prettier + eslint fixes via /autofix command * test(incremental): exact-equality stats invariant + analyze ≡ analyze --force Addresses the only remaining Claude production-readiness review finding on PR #1479 (Low-Medium, test-quality only — Claude itself said it does NOT block merge, but the central PR claim "incremental ≡ full rebuild" deserves explicit CI coverage rather than implicit trust). Changes to gitnexus/test/unit/incremental-orchestration.test.ts: 1) Tighten the existing "comment-only edit takes incremental path" test. - Replace toBeGreaterThan(0) bounds assertions on stats.files and stats.nodes with exact toBe(firstMeta) per-field equality across files / nodes / edges / communities / processes. DoD §2.7 calls out bounds-only assertions as masking regressions that drop half the graph; this swap closes that gap. - Rationale: a comment-only edit must change the file content hash (driving the incremental path) without changing any graph data. Therefore every stat MUST be identical to the first run. Anything else is a regression. 2) New test: incremental output is byte-equivalent to a full rebuild. - Run analyze → comment-only edit → analyze (incremental writeback) → analyze --force (full rebuild from same on-disk state). - Assert files / nodes / edges / communities / processes are exactly equal across the incremental and the --force passes. - This is the PR's central correctness contract, now proven by a test that exercises the real runtime path end-to-end against a real on-disk LadybugDB. All 5 orchestration tests pass locally (52s), including the new equivalence test — every stat field matches exactly between incremental and --force on the mini-repo fixture. tsc --noEmit clean. * fix(incremental): F1 cross-file edge consistency + F4 stable chunk sort + unit coverage (#1511) Patch addressing two of the still-open changes-requested findings on PR #1479, rebased onto the current feat/incremental-indexing head. F3 (parser fingerprint in the cache key), F5 (atomic saveMeta), and F6 (AGENTS.md phrasing) were already handled on the branch, so the corresponding parts of the original patch were dropped as redundant. F1 (Blocker) — Cross-file edges between unchanged files Adds `computeEffectiveWriteSet(graph, toWriteSet)` to subgraph-extract.ts: a single pass over the new graph's edges that pulls the unchanged-side file of every writable-boundary-crossing edge into the write set. run-analyze composes it ON TOP of the existing importer-BFS expansion and feeds the combined set to BOTH `deleteNodesForFile` and `extractChangedSubgraph`, so the delete cascade and the writeback subgraph cover identical files (asymmetry would leave stale rows or PK-conflict at COPY time). The BFS reads IMPORTS from the pre-pipeline DB (catches files that *stopped* importing a changed file); the edge walk reads the new graph (catches refined CALLS edges the pre-run DB couldn't predict, e.g. a barrel re-export shifting a symbol from B to D). `extractChangedSubgraph` stays a pure filter — all expansion is the orchestrator's job. F4 (Medium) — Restore alphabetical chunk sort `parseableScanned` is sorted before chunking. Filesystem-scan order isn't stable enough across runs/platforms (notably macOS APFS) to keep chunk hashes consistent, so the parse cache thrashes without it. The pre-existing Ruby cross-file resolution order-dependency the old comment cited is independent — the sort surfaces it but doesn't cause it; tracked separately rather than leaving the cache cold. Tests — incremental-subgraph-extract.test.ts Locks the F1 invariants: `extractChangedSubgraph` is a pure filter (includes only the set it's given, plus graph-wide nodes; edges fire on one writable endpoint), and `computeEffectiveWriteSet` covers the barrel-re-export scenario, the symmetric edge-into- changed-file case, the no-boundary-crossed no-op, graph-wide-node edges, and input-immutability. Supersedes the prior extractChangedSubgraph-only test file on the branch. Co-authored-by: Val Vladescu <vvladescu-tb@users.noreply.github.com> * chore(autofix): apply prettier + eslint fixes via /autofix command * fix(call-processor): register properties in pre-pass to fix order-dependent field type disambiguation + regenerate golden snapshot Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/2d66666f-861c-432e-a4b0-11f2aefca98a Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com> * fix(call-processor): port worker-path property enrichment into the sequential pre-pass Copilot's pre-pass in 8184439 fixed the Ruby attr_accessor order-dependence, but it copied the OLD in-loop registration logic, not the canonical worker path in parse-worker.ts. That left the sequential and worker paths emitting non-identical Property nodes/symbols for the same source — silently breaking the `incremental ≡ --force` invariant the moment a repo crosses the worker threshold between runs. Two concrete divergences are closed here: * Node id: worker keys Property as `${file}:${className}.${propName}` (qualified). Pre-pass was using `${file}:${propName}` (unqualified). Same source produced different graph ids depending on which path ran. * Field metadata: worker enriches each routed property with `provider.fieldExtractor` + `getFieldInfo`, falling back to `routedFieldInfo.type` for `declaredType` when the routing payload lacks one (e.g. types discovered from `@address = Address.new` ctor assignments rather than YARD `@return [Type]`), and propagates `visibility` / `isStatic` / `isReadonly`. Pre-pass did none of this, so on the sequential path `resolveFieldAccessType` failed to walk chains where the type only came from the FieldExtractor. The pre-pass now mirrors parse-worker.ts:1803-1898 verbatim, with one deliberate difference: the FieldInfo cache is scoped to a single `processCalls` invocation rather than module-level (the worker process is short-lived; the main thread is not, and a module-level cache would leak state between analyze runs). Also drops the now-stale "Defer resolution: Ruby attr_accessor properties are registered during this same loop" comment on `pendingWrites.push` — the rationale is no longer accurate after Copilot's pre-pass, but the deferral is still needed so write-access tracking sees inference that completes during the main loop. Comment updated to reflect that. Verification: * `tsc --noEmit`: 0 errors * test/unit (call-processor, call-routing, field-extraction, ruby-self-call): 224 passing * test/integration (ruby, ruby-sequential-mixin, pipeline-graph-golden): 137 passing Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(call-processor): key fieldInfoCache by filePath:startIndex, not raw byte offset Claude's review of 255bdf6 caught a real collision in the FieldInfoCache I added: keying by `classNode.startIndex` alone is a per-file byte offset, so two files that both begin with a class at byte 0 — extremely common in Ruby / Python, where files frequently open with `class Foo`, `module Foo` — collide on the same cache entry. The second file's `getFieldInfo` then returns the first file's FieldInfo map, producing wrong `declaredType` / `visibility` / `isReadonly` on its properties. Same shape as the bug that already exists in parse-worker.ts:377 (also keyed by `classNode.startIndex` in a module-level map, persistent across files processed by the same worker). Fixing the symmetric pre-existing leak in parse-worker.ts is a separate, scoped follow-up — left out of this commit to keep the fix minimal and reviewable. Cache map and key are now both string-typed. Composite key `${context.filePath}:${classNode.startIndex}` keeps the within-file hit rate (one FieldExtractor.extract() per class regardless of how many `attr_accessor` lines it has) while eliminating cross-file aliasing. Verification on the patched HEAD: * `tsc --noEmit`: 0 errors * test/unit (call-processor, call-routing, field-extraction, ruby-self-call): 224 passing * test/integration (ruby, ruby-sequential-mixin, pipeline-graph-golden): 137 passing Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Gergő Magyar <gergomagyar@icloud.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Val Vladescu <val.vladescu@thirdbridge.com> Co-authored-by: Val Vladescu <vvladescu-tb@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>

davidbeesley added 2 commits March 26, 2026 20:23

feat: add content-addressed parse and embedding caches for faster rei…

fa4ab83

…ndexing

refactor: remove dead loadCachedEmbeddings from lbug-adapter

f43deba

davidbeesley changed the title ~~[perf] Content-addressed parse and embedding caches~~ perf: content-addressed parse and embedding caches for faster reindexing Mar 27, 2026

fix: address pr feedback

4ce9240

abhigyanpatwari mentioned this pull request May 10, 2026

feat(analyze): incremental indexing (parse cache + DB writeback + scope-res short-circuit) #1479

Merged

6 tasks

magyargergo closed this May 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: content-addressed parse and embedding caches for faster reindexing#533

perf: content-addressed parse and embedding caches for faster reindexing#533
davidbeesley wants to merge 3 commits into
abhigyanpatwari:mainfrom
davidbeesley:feat/parse-and-embedding-cache

davidbeesley commented Mar 27, 2026

Uh oh!

vercel Bot commented Mar 27, 2026

Uh oh!

github-actions Bot commented Mar 27, 2026 •

edited

Loading

Uh oh!

magyargergo commented Mar 27, 2026

Uh oh!

github-actions Bot commented Mar 27, 2026 •

edited

Loading

Uh oh!

magyargergo commented Mar 27, 2026

Uh oh!

davidbeesley commented Mar 27, 2026

Uh oh!

magyargergo commented Mar 27, 2026

Uh oh!

magyargergo commented Mar 28, 2026

Uh oh!

magyargergo commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

davidbeesley commented Mar 27, 2026

Summary

Motivation / context

Areas touched

Scope & constraints

Implementation notes

Testing & verification

Risk & rollout

Checklist

Uh oh!

vercel Bot commented Mar 27, 2026

Uh oh!

github-actions Bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Report

Pipeline Status

Test Results

Code Coverage

Tests

Uh oh!

magyargergo commented Mar 27, 2026

Uh oh!

github-actions Bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Content-Addressed Parse & Embedding Caches

Overview

Bugs

1. pruneCache is imported but never called — cache grows forever

2. filterRawResultsByFile is O(n × m) — quadratic for large chunks

3. Sequential fallback never writes to the parse cache

4. Double SHA-256 computation in embedding pipeline

5. Potential false-positive in belongsToFile path matching

Performance Concerns

6. Single JSON blob — scalability wall

7. Embedding cache loaded into memory before model init check

Benchmarking Gap

Minor Notes

Summary

Uh oh!

magyargergo commented Mar 27, 2026

Uh oh!

davidbeesley commented Mar 27, 2026

Uh oh!

magyargergo commented Mar 27, 2026

Uh oh!

magyargergo commented Mar 28, 2026

⚠️ Upcoming Prettier formatting — rebase instructions

Uh oh!

magyargergo commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Mar 27, 2026 •

edited

Loading

github-actions Bot commented Mar 27, 2026 •

edited

Loading

1. `pruneCache` is imported but never called — cache grows forever

2. `filterRawResultsByFile` is O(n × m) — quadratic for large chunks

5. Potential false-positive in `belongsToFile` path matching