Skip to content

feat: add --incremental flag to analyze command#592

Closed
zenprocess wants to merge 2 commits into
abhigyanpatwari:mainfrom
zenprocess:feat/incremental-v2
Closed

feat: add --incremental flag to analyze command#592
zenprocess wants to merge 2 commits into
abhigyanpatwari:mainfrom
zenprocess:feat/incremental-v2

Conversation

@zenprocess

Copy link
Copy Markdown

Summary

Adds an --incremental flag to the analyze command that skips re-parsing unchanged files on subsequent runs. This dramatically reduces re-index time for large repos when only a few files changed.

How it works

  1. During indexing, computes SHA-256 hashes for every parsed file
  2. Stores hashes in meta.json alongside the existing repo metadata
  3. On subsequent runs with --incremental, compares current hashes against stored ones
  4. Only files with changed hashes are re-parsed — structure scanning and graph phases still run fully to ensure correct community/process detection
  5. Stale nodes (from deleted/renamed files) are removed from LadybugDB instead of wiping and recreating

Performance

For a 60K+ symbol TypeScript repo, re-index time drops from ~30s to ~2-5s when only a few files changed between runs.

Usage

gitnexus analyze /path/to/repo                # full index (default)
gitnexus analyze -i /path/to/repo             # incremental (skip unchanged)
gitnexus analyze --incremental /path/to/repo  # same

Docker:

docker run -e GITNEXUS_INCREMENTAL=1 ...      # enables --incremental on container restart

Files changed

File Change
src/storage/file-hasher.ts NEWcomputeFileHashes() and diffFileHashes()
src/cli/index.ts Register -i/--incremental CLI flag
src/cli/analyze.ts Forward incremental option to runFullAnalysis
src/core/run-analyze.ts Load previous hashes, pass to pipeline, save new hashes, incremental LadybugDB merge
src/core/ingestion/pipeline.ts Accept previousFileHashes, filter parse phase to changed files only, return fileHashes in result
src/types/pipeline.ts Add fileHashes to PipelineResult
src/storage/repo-manager.ts Add fileHashes to RepoMeta
Dockerfile Production two-stage build with incremental support
entrypoint.sh Container entrypoint — indexes repos, starts eval-server

Test plan

  • gitnexus analyze /path/to/repo produces same graph as before (no regression)
  • gitnexus analyze -i /path/to/repo on second run skips unchanged files (check console output: "Incremental: N changed, M removed, K unchanged")
  • Deleting a source file and re-running with -i removes stale nodes from the graph
  • Adding a new file and re-running with -i indexes only the new file
  • meta.json contains fileHashes after first run

Test added 2 commits March 30, 2026 23:07
Computes per-file SHA-256 hashes during indexing and stores them in meta.json.
On subsequent runs with --incremental, only files with changed hashes are
re-parsed. Structure scanning and graph phases still run fully to ensure
correct community/process detection.

For large repos (60K+ symbols), this reduces re-index time from ~30s to ~2-5s
when only a few files changed between sessions.

New files:
- src/storage/file-hasher.ts: computeFileHashes() and diffFileHashes()

Modified:
- src/cli/index.ts: Register -i/--incremental CLI flag
- src/cli/analyze.ts: Forward incremental option to runFullAnalysis
- src/core/run-analyze.ts: Load previous hashes, pass to pipeline, save new hashes,
  incremental LadybugDB merge (delete stale nodes instead of wipe-and-recreate)
- src/core/ingestion/pipeline.ts: Accept previousFileHashes, filter parse phase
  to changed files only, return fileHashes in result
- src/types/pipeline.ts: Add fileHashes to PipelineResult
- src/storage/repo-manager.ts: Add fileHashes to RepoMeta
Two-stage build: builder installs deps + compiles, slim image has
only git + gitnexus CLI. Entrypoint indexes all repos in /data/repos/
and starts the eval-server on port 3456.

Supports GITNEXUS_INCREMENTAL=1 env var to enable --incremental
on container restart (skips unchanged files).
@vercel

vercel Bot commented Mar 30, 2026

Copy link
Copy Markdown

@TESTPERSONAL is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@xkonjin xkonjin left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — Incremental Analyze

Overall: Well-structured feature, one blocking bug 🟡

The SHA-256 file hashing + diff approach for incremental indexing is clean and the batched I/O (100 files at a time) avoids fd exhaustion. The Docker + entrypoint additions look solid.

Positives

  • file-hasher.ts is well-isolated and testable. Clean separation of computeFileHashes and diffFileHashes.
  • The merged-hash logic in run-analyze.ts correctly maintains the full hash map across incremental runs.
  • Docker entrypoint auto-detects empty registry and falls back to --force, which is good UX.

Issues

1. 🔴 Missing export: deleteNodesForFile not in lbug-adapter.ts
run-analyze.ts imports deleteNodesForFile from ./lbug/lbug-adapter.js, but this function does not exist in that module. It exists in kuzu-adapter.ts. This will cause a runtime import error when incremental mode tries to delete stale nodes. Either re-export it from lbug-adapter or change the import path.

2. 🟡 Hash merge uses (pipelineResult as any).fileHashes cast
Line ~210 in run-analyze.ts mutates the pipeline result via as any cast. Consider adding fileHashes as a mutable field on the result type, or returning the merged hashes separately to avoid the type escape hatch.

3. 🟡 No tests for file-hasher.ts
The new module has no accompanying tests. At minimum, diffFileHashes deserves unit tests for the three cases: all new files, mixed changed/unchanged/removed, and no previous hashes.

4. (nit) Entrypoint $GITNEXUS_INCREMENTAL env var is undocumented
The Docker entrypoint checks GITNEXUS_INCREMENTAL=1 but this is not mentioned in any README or Dockerfile comment.

Verdict

The import bug (#1) will cause a runtime crash in incremental mode and needs fixing before merge. The rest is solid work.

@magyargergo

Copy link
Copy Markdown
Collaborator

Please submit a new PR if this is still relevant

@vvladescu-tb

vvladescu-tb commented May 11, 2026

Copy link
Copy Markdown
Contributor

Thanks for the nudge @magyargergo. This PR is superseded by #1479 (your own incremental-indexing branch), which we're already pinning at 42302bc7 and running in production against several repos.

Two companion PRs incoming from us:

  1. Tests-only PR against feat/incremental-indexing — 212 lines of unit tests covering parse-cache fingerprinting + subgraph extraction (none of which exist in feat(analyze): incremental indexing (parse cache + DB writeback + scope-res short-circuit) #1479 today). Addresses your "look into Claude's findings" review-feedback on feat(analyze): incremental indexing (parse cache + DB writeback + scope-res short-circuit) #1479 from a test-coverage angle.
  2. Issue proposing POST /reindex on eval-server — daemon consumers currently have no way to trigger an incremental re-analyze through the HTTP API; we orchestrate by spawning gitnexus analyze subprocesses. A daemon-side endpoint that wraps run-analyze + hot-swaps the LadybugDB handle would make incremental usable for the running-server use case.

Closing this in favor of those.

magyargergo added a commit that referenced this pull request May 12, 2026
…pe-res short-circuit) (#1479)

* docs: incremental indexing design spec

Captures the design agreed in brainstorming on 2026-05-10:
- Transitive importer closure with public-surface-change optimization
- Git-only change detection (non-git repos: full rebuild as today)
- New default behavior; --force opts out
- New hydratePhase + loadGraphFromLbug primitive
- Iterative closure expansion with parseCache reuse
- incrementalInProgress dirty flag for crash recovery

Prior art: PR #592 (zenprocess), PR #533 (davidbeesley),
PR #1146 (azeemshaik025) — referenced and credited.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(communities): seed Leiden RNG for deterministic community detection

The vendored Leiden algorithm defaults to Math.random for tie-breaking
and randomized walks, which produces non-deterministic community
assignments and modularity values across runs on the same graph.

Pass a seeded mulberry32 RNG (LEIDEN_SEED=0xC0DE) so:
- The same graph always produces the same partition
- Modularity values are reproducible
- Equivalence tests for incremental indexing can compare community
  assignments byte-for-byte

This is foundational for the upcoming incremental-indexing feature
(see docs/superpowers/specs/2026-05-10-incremental-indexing-design.md)
where the correctness contract is incremental output ≡ full rebuild
output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(incremental): change-detection, surface signatures, closure expansion

Three new modules supporting the incremental-indexing pipeline:

* core/incremental/git-diff.ts — getChangedFilesSinceCommit() unions
  'git diff lastCommit HEAD' (committed) with 'git status --porcelain'
  (dirty tree). Renames flattened to delete(orig) + add(new). Throws
  LastCommitMissingError when lastCommit is gone (caller falls back to
  full rebuild).

* core/incremental/surface.ts — extractSurfaceSignature() produces a
  stable hash of a file's publicly-visible symbols (functions, classes,
  methods, interfaces, types, heritage). Body-only edits → same hash.
  Signature/heritage changes → different hash. Drives the closure
  scoping optimization.

* core/incremental/closure.ts — computeImporterClosure() iterative
  fixpoint: parse each closure file, extract surface, query DB
  importers, expand. Uses a parseCache so each file is parsed once.
  Generic over TParseResult so closure logic is decoupled from the
  pipeline's parse representation.

32 unit tests across the three modules. Tests cover edge cases:
clean tree, dirty-only, mixed, renames, deletes, multi-hop cascade,
cycle termination, surface invariance, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(lbug): loadGraphFromLbug, queryImporters, deleteAllCommunitiesAndProcesses

Three new primitives in lbug-adapter.ts to support incremental indexing:

* loadGraphFromLbug(graph, unchangedFilePaths) — streams all nodes for
  files in the set across every hydratable node table (excludes
  Community/Process — graph-wide, regenerated downstream). Then loads
  edges where both endpoints belong to loaded nodes, excluding
  MEMBER_OF / STEP_IN_PROCESS edges (also graph-wide).
  FilePaths chunked at 200 per query to keep statement size bounded
  on huge repos. Endpoint-level join filters by source-side filePath
  in the query, target-side checked JS-side via the loadedNodeIds set.

* queryImporters(targetFilePath) — returns DISTINCT a.filePath where
  a -[IMPORTS]-> b and b.filePath = target. Powers closure expansion:
  when a changed file's surface signature changes, all its importers
  must be re-parsed.

* deleteAllCommunitiesAndProcesses() — drops Community/Process nodes
  (and their edges via DETACH DELETE) at the start of each incremental
  run so the communities/processes phases regenerate them from the
  fully-merged graph. Required for the 'Leiden runs on full graph'
  correctness invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pipeline): hydrate phase + parse-filter for incremental indexing

Wires the incremental-indexing infrastructure into the phase-based
pipeline. Three coordinated changes:

* New hydratePhase (deps: structure) — loads node/edge state for files
  OUTSIDE ctx.options.filesToParse from the existing LadybugDB index.
  Runs before parse so the parse phase can produce a partial graph
  while downstream phases (mro, communities, processes) still see the
  full graph. No-op in full-rebuild mode (filesToParse unset).

* PipelineOptions.filesToParse: optional ReadonlySet<string>. When
  set, parse phase filters scanned files to this set; hydrate fills
  the complement. Set by runFullAnalysis when it detects an eligible
  incremental run; never set by callers directly.

* gitnexus-shared PipelinePhase enum: 'hydrate' added so progress
  callbacks can report the new phase distinctly from 'structure'.

Phase order: scan → structure → hydrate → markdown,cobol → parse
→ routes,tools,orm → crossFile → scopeResolution → mro → communities
→ processes. Communities (Leiden) still runs on the full graph,
satisfying the correctness invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(analyze): incremental orchestrator branch + meta schema

Wires incremental indexing into runFullAnalysis. Highlights:

* RepoMeta schema extended: schemaVersion, surfaceSignatures, and
  incrementalInProgress fields. INCREMENTAL_SCHEMA_VERSION = 1.

* core/incremental/file-hash.ts — v1 surface signature: SHA-256 of file
  content. v2 will switch to a true surface-only signature (defined in
  surface.ts) so body-only edits don't expand the closure. The plumbing
  is signature-agnostic so the swap is local.

* core/incremental/orchestrator.ts — eligibility check, closure
  computation (uses file-hash as the surface signal), dirty-flag
  management, subgraph extraction, signature merge.

* run-analyze.ts adds:
  - hasDirtyTree() check on the existing 'lastCommit==HEAD' early-exit
    so an uncommitted edit triggers re-index (was a coarse equality
    check before).
  - incremental branch: try incremental first; fall through to full
    rebuild on any setup failure or eligibility miss.
  - runIncrementalBranch() — opens existing DB, deletes closure-file
    rows + Community/Process, runs pipeline with filesToParse, writes
    only the changed-subgraph back, refreshes FTS, updates meta with
    new surfaceSignatures and clears the dirty flag.
  - Full-rebuild path now populates surfaceSignatures + schemaVersion
    in meta.json so the next run is eligible for incremental.

Crash recovery: incrementalInProgress is set BEFORE any DB mutation
and cleared on success by overwriting meta.json. A crash anywhere in
between leaves the flag set, and the next analyze run forces a full
rebuild (cheapest path back to a known-good index).

v1 limitation documented: body-only edits trigger 1-hop closure
expansion (content-hash signal). True surface-only optimization is
deferred to v2 — see design doc for the integration path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(incremental): drop invalid --no-renames=false from git diff

The flag --no-renames=false isn't valid git syntax (it's parsed as a
file path). Git's default rename detection is on; removing the flag
keeps that behavior.

Caught while running an end-to-end smoke test against a small fixture
repo: incremental setup failed with 'Command failed: git diff
--name-status -z --no-renames=false ...'. After the fix, the
incremental path runs cleanly: closure is computed, hydrate phase
loads unchanged-file state from DB, parse phase only re-parses files
in closure, and the writeback updates only changed nodes/edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert v1 incremental indexing (5 commits)

Reverts the v1 design that parsed only closure files into a fresh
graph and tried to hydrate the rest from DB. Real-repo equivalence
test failed: cross-file resolution operates on partial parse data
(closure files only), so CALLS edges that resolve through unchanged
files silently fall off. Diff against full rebuild on the same
edited state: -50 nodes, -425 edges, -5 communities, -48 processes.

Architecture pivot: switch to PR #533-style content-addressed parse
cache. Pipeline parses every file (cache-served when possible),
giving cross-file resolution full data, with DB writeback then
restricted to changed-file rows.

Reverts:
  d4b9de4 fix(incremental): drop invalid --no-renames=false
  f35f763 feat(analyze): incremental orchestrator branch + meta schema
  bc03968 feat(pipeline): hydrate phase + parse-filter
  98bb893 feat(lbug): loadGraphFromLbug, queryImporters, ...
  aa8d7ae feat(incremental): change-detection, surface signatures, closure

Kept:
  d9e340b feat(communities): seed Leiden RNG (foundational)
  8235ca3 docs: incremental indexing design spec (will be revised)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(analyze): incremental DB writeback (Option B)

Equivalence-preserving incremental analyze. The pipeline still parses
every file (correctness invariant: cross-file resolution / scope
resolution / MRO / community detection all need full graph data); the
saving comes from selectively replacing only changed-file rows in
LadybugDB instead of wiping and reloading the whole graph.

How it works:

* On every analyze, we hash all source files (SHA-256 of content) and
  store the map in meta.json.fileHashes alongside schemaVersion.
* The next run loads the prior map and diffs:
  - changed: content hash differs → file's DB rows replaced.
  - added: not in prior map → file's DB rows inserted.
  - deleted: in prior map but not on disk → file's DB rows dropped.
* If the diff is non-empty AND no --force / no schema mismatch / no
  dirty flag, take the incremental path:
  - Set incrementalInProgress dirty flag (BEFORE any DB mutation).
  - Open existing DB (no wipe).
  - deleteNodesForFile() for each changed/added/deleted file.
  - deleteAllCommunitiesAndProcesses() — Leiden regenerates these.
  - extractChangedSubgraph() from the in-memory ctx.graph: nodes whose
    filePath is in the writable set + Community + Process + edges with
    at least one endpoint in the writable set (edges entirely between
    hydrated unchanged nodes are skipped — already in DB).
  - loadGraphToLbug() on the subgraph. Unchanged-file rows in DB
    untouched.
  - Recreate FTS indexes.
  - Update meta with new fileHashes; clear dirty flag.
* Otherwise full-rebuild path runs as before.

Crash recovery: incrementalInProgress is the dirty flag. Set before
destructive ops; cleared on success. Set on next-run startup → forces
full rebuild (cheapest path back to known-good).

Other changes:
* Dirty-tree gate on the existing 'lastCommit==HEAD' early-return:
  uncommitted edits no longer slip through as 'already up to date'.
* deleteAllCommunitiesAndProcesses helper in lbug-adapter.
* Skip the embedding cache+restore cycle when willTryIncremental is
  true — embeddings stay in DB; re-inserting them would PK-conflict.

End-to-end equivalence verified on this repo (993 files, 24K nodes):
incremental run produces byte-identical {nodes, edges, clusters,
flows} to a full rebuild from the same edited state.

Speedup is currently modest (~5% on this repo) because the parse
phase still runs in full. Parse-cache integration is a separate
follow-up that composes cleanly on top of this work.

See docs/superpowers/specs/2026-05-10-incremental-indexing-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(analyze): chunk-level parse cache for full incremental speedup

Composes with the incremental DB writeback (commit 27f3b49) to deliver
the major-speedup half of incremental indexing. Previously, the parse
phase ran in full on every analyze; the speedup came purely from
selective DB rewriting. With this commit the parse phase also reuses
prior tree-sitter output for chunks whose contents haven't changed.

How it works:

* Cache layer (gitnexus/src/storage/parse-cache.ts):
  - File: <repo>/.gitnexus/parse-cache.json. Versioned, atomic write.
  - Key: chunk content hash = sha256(sorted(filePath:fileContentHash
    for each file in chunk)).
  - Value: ParseWorkerResult[] (raw worker output for the chunk,
    pre-merge).
  - Granularity: per chunk (~20MB byte-budget). A change to one file
    invalidates only its chunk — typically 1 of ~50 on a 1000-file
    repo (~98% cache hit ratio on a small edit).

* Worker contract (gitnexus/src/core/ingestion/parsing-processor.ts):
  - Extracted the chunk-result merge loop into a public
    mergeChunkResults() so the same logic applies to live worker
    output AND replayed cache entries.
  - processParsingWithWorkers / processParsing accept an optional
    outRawResults out-parameter that captures worker output before
    merging — used by parse-impl to populate the cache after a miss.

* Parse phase wiring (parse-impl.ts):
  - For each chunk, compute its content hash (after reading file
    contents). Cache hit → mergeChunkResults() on cached results,
    skip the worker dispatch entirely. Cache miss → run workers
    normally, capture raw results, store under the chunk hash.
  - Cache mutations happen in-place on the ParseCache passed via
    PipelineOptions.parseCache.

* Lifecycle (run-analyze.ts):
  - loadParseCache() before pipeline runs.
  - Cache passed via runPipelineFromRepo's PipelineOptions.
  - saveParseCache() after the pipeline + DB writeback succeed.

Equivalence verified on this repo (993 files, 24K nodes):

  Cold (no cache, full work):           141.1s
  Warm cache + 1-file edit, incremental: 63.6s  ← 55% speedup
  Warm cache + 1-file edit, --force:     71.6s  ← 49% speedup

All three runs produce byte-identical {nodes, edges, clusters,
flows}. The cache survives --force (content-addressed = always
correct), so even forced rebuilds get the parse-skip benefit.

Why chunk-level rather than per-file: workers process sub-batches and
emit aggregated ParseWorkerResults. Per-file granularity would require
restructuring the worker contract; chunk-level captures most of the
practical speedup with no worker-side changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(parse-impl): smaller default chunk budget (20MB→2MB) for cache granularity

The parse cache is keyed at chunk granularity. With the previous 20MB
budget, a typical mid-size repo (e.g. this worktree at 9MB total
parseable source) fits in a single chunk — meaning ANY file change
invalidates the whole chunk and re-parses every file.

2MB default produces ~5x more chunks on the same input, so a one-file
edit invalidates ~1/N of cached chunks instead of the whole thing.
Cold-run overhead from more chunks is <5% (one extra serialization
pass per chunk).

Override via GITNEXUS_CHUNK_BYTE_BUDGET env var for benchmarking.

Measured on this repo (~9MB / 887 parseable files):
  Cold (no cache):                    143s
  Warm cache, no source changes:        2s  (early-return)
  Warm cache + 1-file edit:            81s  (~43% off cold)

Speedup is bounded by the scopeResolution phase (~58s flat regardless
of parse cache) and by GitNexus's own auto-writes during analyze
(AGENTS.md / .claude/skills/ etc. mutate between runs and invalidate
chunks containing them). Both are addressable in follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(scope-resolution): reuse worker-produced ParsedFile + stabilize chunk order

Two compounding optimizations that drop warm-cache analyze from
~134s to ~38s on a 1000-file repo (72% faster), and cold rebuild
from ~143s to ~86s (40% faster) by short-circuiting work that was
previously re-done.

1. SCOPE-RESOLUTION: REUSE WORKER PARSEDFILE

Previously, the scope-resolution phase re-parsed every file with
tree-sitter on the main thread (~58s on a 1000-file repo) because
worker-produced tree-sitter Trees can't cross the worker MessageChannel.

But the worker ALSO produces a  artifact via
, which structured-clones fine — and it's exactly
what scope-resolution would re-derive. Threading those ParsedFiles
through the parse phase () into
 ( map) lets scope-
resolution skip its extract loop on a per-file basis.

The fast path is bounded only by  per file (cheap
graph mutation). On this repo: scopeResolution went from 58s → 5s.

2. MAP-PRESERVING PARSE-CACHE SERIALIZATION

 is a
which JSON.stringify collapses to . The first attempt at threading
parsedFiles through the parse cache crashed at runtime with
"importerModule.typeBindings is not iterable" because cached entries
came back as plain objects.

Added a JSON replacer/reviver pair in parse-cache.ts that round-trips
Map and Set instances through tagged plain objects (). Symmetric: save uses replacer, load uses reviver.

3. STABLE CHUNK ORDERING

The byte-budget chunker walked files in filesystem-scan order, which
on Windows isn't guaranteed to be stable across runs. Even with
identical source content, two scans could place files in different
chunks, shifting chunk hashes and causing 100% parse-cache misses.

Added a deterministic alphabetical sort on  before
chunking. Chunk membership is now stable across runs, so a single-file
edit invalidates exactly one chunk, not all of them.

Measured on this repo (993 files, 24K nodes):
  Cold rebuild:                        86s  (was 143s)
  Warm cache, no source changes:        3s  (early-return)
  Warm cache + 1-file edit:            38s  (was 134s)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(incremental): update spec + AGENTS.md + GUARDRAILS.md for shipped design

- Rewrite docs/superpowers/specs/2026-05-10-incremental-indexing-design.md
  to describe the architecture that actually shipped (parse cache +
  incremental DB writeback + scope-resolution short-circuit), with the
  v1 hydrate-phase post-mortem preserved as historical context.
- AGENTS.md "Keeping the Index Fresh" section: note that incremental
  is the new default and --force is the explicit opt-out; mention
  the parse-cache file location and that it's safe to delete.
- GUARDRAILS.md Signs: add an "Index seems corrupt or incremental is
  misbehaving" entry pointing users to --force as the manual escape
  hatch (the dirty flag handles automatic recovery).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(autofix): apply prettier + eslint fixes via /autofix command

* fix(incremental): bugbot review + CI test failures

Bugbot (PR #1479):
- Medium: pruneCache was exported but never called -> cache grew
  unbounded. Wire pruneCache into run-analyze before saveParseCache,
  using a transient usedKeys Set on ParseCache that the parse phase
  populates as it processes chunks.
- Low: willTryIncremental (pre-pipeline) and isIncremental
  (post-pipeline) could desync, silently dropping embeddings on
  mispredicted runs. Removed the prediction; the embedding cache
  now loads unconditionally when shouldLoadCache is true. The
  re-insert step gates on the actual isIncremental value to avoid
  PK-conflicts when the incremental-writeback path keeps DB rows.

CI test failures:
- cli-e2e #1169 + run-analyze.test.ts #1233: my dirty-tree gate on
  the lastCommit==HEAD early-return saw GitNexus's own auto-generated
  outputs (.claude/, .cursor/, AGENTS.md, CLAUDE.md) as dirty,
  perpetually defeating the up-to-date fast path. Extended the
  pathspec exclusion to cover all auto-gen outputs, not just
  .gitnexus/.
- ruby field-type disambig: my chunk-stability sort exposed a
  pre-existing order-dependency in Ruby cross-file resolution
  (`user.address.save -> Address#save` only resolves correctly when
  user.rb parses before address.rb in some configurations). Removed
  the sort. Filesystem ordering is stable enough in practice that
  the parse cache still hits the common case; the pre-existing
  fragility is left for a separate fix.
- pipeline-graph-golden: regenerated. Seeded Leiden RNG produces a
  partition different from the previous Math.random snapshot.
- staleness `parallel calls` was a CI timing flake; passes locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(incremental): re-insert cached embeddings on incremental path

Bugbot re-review caught: deleteNodesForFile cascades to the
CodeEmbedding table (DELETE WHERE e.nodeId STARTS WITH ...), so
changed-file embedding rows are wiped along with their nodes. The
previous fix gated re-insert on `!isIncremental`, which silently
dropped those embeddings — a regression versus the full-rebuild path's
"preserve embeddings by default" guarantee.

Remove the `!isIncremental` gate. The per-batch try/catch already
handles the unchanged-file PK-conflict case ("some may fail if node
was removed, that's fine") with the same semantics, so re-inserting
the full cached set on incremental works:

  - changed-file rows: deleted, then re-inserted from cache (preserved)
  - unchanged-file rows: still in DB, re-insert PK-conflicts and is
    silently ignored (existing rows are correct)

Cost: re-inserting ~24K embeddings on incremental when only a few
files changed — most are no-op conflicts. Bounded by batch size of
200; ~3-5s overhead. Worth it for correctness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(incremental): address Claude+Bugbot review findings + remove design doc

Addresses CHANGES_REQUESTED review on PR #1479:

1. Remove docs/superpowers/specs/2026-05-10-incremental-indexing-design.md
   per maintainer request.

2. BLOCKER (Claude Finding 1, Bugbot Round 3): Stale cross-file edges
   between unchanged files. extractChangedSubgraph excluded edges where
   both endpoints were unchanged-file nodes — when a barrel/re-export
   file changes, cross-file resolution may update CALLS edges between
   two unchanged files that would then be silently lost.

   Fix: 1-hop importer-closure expansion of the writable set in
   run-analyze.ts. Before deleting/rewriting rows, query DB for
   importers of every changed/deleted file and add them to the writable
   set. Their nodes get deleted+rewritten too, so cross-file's refined
   edges land in the DB. Re-added queryImporters to lbug-adapter.ts.

3. BLOCKER (Claude Finding 3): Parse cache key omitted parser version.
   After a GitNexus upgrade, the cache silently replays pre-upgrade
   ParseWorkerResults against the new schema → wrong CALLS/IMPORTS/
   scope edges with no visible signal.

   Fix: PARSE_CACHE_VERSION now embeds the gitnexus npm package
   version (read at module load via createRequire on package.json).
   Format: `${SCHEMA_BUMP}+${PKG_VERSION}` e.g. "1+1.6.4". Any release
   that bumps package.json automatically invalidates the on-disk cache.
   Mismatched versions fall through to an empty cache (next save
   overwrites with the new version baked in).

4. BLOCKER (Claude Finding 2): No automated tests for incremental
   behavior. Added 28 unit tests across 3 files:

     - incremental-file-hash.test.ts (10 tests)
       diffFileHashes classification, computeFileHash determinism,
       computeFileHashes batch / missing-file tolerance, sorted output.

     - incremental-parse-cache.test.ts (12 tests)
       computeChunkHash stability and order-independence, version
       prefix format, pruneCache, load/save round-trip on empty /
       missing / corrupt / version-mismatched files, AND a Map/Set
       round-trip test that pins the JSON replacer/reviver behaviour
       (without it, ParsedFile.scopes[*].typeBindings collapses to
       {} and downstream `.get()` / iteration throws).

     - incremental-subgraph-extract.test.ts (6 tests)
       writable-set node inclusion, Community/Process always kept,
       edge inclusion when at least one endpoint is writable, MEMBER_OF
       edges via graph-wide endpoints, empty subgraph case.

5. Medium (Claude Finding 6): AGENTS.md "Keeping the Index Fresh"
   said "only changed files are re-parsed." Imprecise — the pipeline
   parses every file every run; the cache skips tree-sitter for chunks
   whose contents haven't changed. Reworded to match the design doc.

Test plan still expects:
  [x] Typecheck clean
  [x] All 28 new unit tests pass
  [x] All previously-failing tests still pass on the rebased branch
  [x] Equivalence verified locally (incremental ≡ --force, byte-identical
      stats on this repo)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(incremental): round 3 review feedback — bounded BFS, atomic meta, integration test, docs

Addresses remaining findings on PR #1479 from Claude's re-review of
commit ad7bd31 + verifies the outstanding Bugbot HIGH severity.

1. F1 — Transitive importer expansion (Claude, was Medium-but-noted).
   Previous 1-hop importer expansion missed barrel re-export chains
   (A imports C, C re-exports B; when B changes, only C was pulled in
   — A was left with potentially-stale CALLS edges to refined targets).
   Replaced the single pass with a bounded BFS over the IMPORTS graph
   (depth ≤ 4). Catches nested barrel pyramids without ballooning into
   a near-full rebuild on monorepos with deep re-export trees. `--force`
   remains the escape hatch documented in GUARDRAILS.md for cases that
   exceed the bound.

2. F2 — Integration test for incremental orchestration (Claude, BLOCKER,
   DoD §2.7). The unit tests added in ad7bd31 covered `diffFileHashes`,
   `extractChangedSubgraph`, `computeChunkHash`, `pruneCache`, and the
   Map/Set JSON round-trip — but none of them exercised the real
   `runFullAnalysis` orchestration. Added gitnexus/test/unit/
   incremental-orchestration.test.ts with four end-to-end tests against
   a real git-initialized fixture repo + real LadybugDB:

     a. First run populates fileHashes + schemaVersion and clears
        incrementalInProgress on success.
     b. Second run on unchanged state takes the alreadyUpToDate fast
        path (early-return).
     c. Second run after a source edit takes the incremental path
        (not full rebuild) and rotates fileHashes for the touched file
        while keeping the dirty flag cleared.
     d. A pre-set incrementalInProgress flag forces a full rebuild
        that clears it (crash-recovery wire).

   These would catch any regression that wires `isIncremental` from a
   pre-pipeline prediction (the Bugbot finding from commit 5eb0597) or
   accidentally re-gates the embedding re-insert on `!isIncremental`
   (the Bugbot finding from commit 60c10f1).

3. F3 — GUARDRAILS.md docs accuracy (Claude, Low). Line 33 still said
   "only changed files are re-parsed" — AGENTS.md was already corrected
   in ad7bd31 but GUARDRAILS.md was missed. Reworded to match.

4. F5 — Atomic saveMeta (Claude, Medium; vvladescu-tb fork). The dirty
   flag (`incrementalInProgress`) travels through meta.json. A crash
   mid-write would leave a corrupt meta.json that `loadMeta` would
   silently treat as "no prior index", losing the flag and skipping
   recovery. Switched to tmp-file + rename matching saveParseCache.

5. Bugbot's "Subgraph edges reference nodes absent from subgraph"
   (HIGH severity). Verified as FALSE POSITIVE: `getNodeLabel` in
   lbug-adapter.ts derives labels from the node-ID string (parses
   the table prefix), not from the in-memory graph. The CSV
   generator writes (src_id, dst_id, type) rows without consulting
   node objects; `splitRelCsvByLabelPair` routes by ID-derived label;
   `COPY ... (from=X, to=Y)` resolves both endpoints against the live
   LadybugDB where unchanged-file nodes still exist. No fix needed.

All 213 tests pass locally (including the 4 new integration tests
and the previously-failing CI tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(incremental): address Bugbot round-4 findings (added-file shadow seed + dedupe)

Bugbot review on commit e23e440 surfaced two new findings against the
incremental writeback in run-analyze.ts:

  HIGH — Incremental BFS misses importers of newly added files.
    queryImporters() reads the pre-pipeline DB. For a NEWLY ADDED
    file there are no IMPORTS rows pointing to it yet, so unchanged
    files whose pre-existing import statements now resolve to the
    newcomer keep stale CALLS edges pointing at the OLD resolution
    target.

  LOW — Deleted files double-counted in filesToDelete.
    hashDiff.deleted entries can reappear in writableFiles via the
    BFS expansion (queryImporters can return a now-deleted path),
    so deleteNodesForFile() ran twice for the same file.

Fixes:

  - Add gitnexus/src/core/incremental/shadow-candidates.ts: derive
    the pre-existing file paths whose JS/TS module-resolution claim
    an added file can steal. Pattern catalogue: same-basename/
    different-extension, bare-file-beats-directory-index, and
    directory-index-beats-bare-file. Emit both POSIX and Windows
    separators because the prior fileHashes map may have been
    written from either OS.

  - In run-analyze.ts, seed the BFS frontier with shadow candidates
    that exist in the prior meta.fileHashes. Their importers — found
    via queryImporters — get pulled into the writable set so their
    CALLS edges re-resolve against the new file.

  - Dedupe filesToDelete via Set to avoid the double-call.

Tests: gitnexus/test/unit/incremental-shadow-candidates.test.ts —
8 cases covering each shadow pattern, separator handling, .d.ts as
a single extension token, deduplication, and the no-self-shadow
invariant. All 40 incremental tests (file-hash, parse-cache,
subgraph-extract, shadow-candidates, orchestration) pass locally.

Note on the third Bugbot finding ("Subgraph edges reference nodes
absent from subgraph"): re-anchored from a prior review pass — the
code at subgraph-extract.ts:48 is unchanged. Already verified as a
false positive: getNodeLabel parses labels from ID strings, CSV
write is by ID, and COPY resolves against the live DB.

* chore(autofix): apply prettier + eslint fixes via /autofix command

* test(incremental): exact-equality stats invariant + analyze ≡ analyze --force

Addresses the only remaining Claude production-readiness review finding
on PR #1479 (Low-Medium, test-quality only — Claude itself said it does
NOT block merge, but the central PR claim "incremental ≡ full rebuild"
deserves explicit CI coverage rather than implicit trust).

Changes to gitnexus/test/unit/incremental-orchestration.test.ts:

1) Tighten the existing "comment-only edit takes incremental path" test.
   - Replace toBeGreaterThan(0) bounds assertions on stats.files and
     stats.nodes with exact toBe(firstMeta) per-field equality across
     files / nodes / edges / communities / processes. DoD §2.7 calls
     out bounds-only assertions as masking regressions that drop half
     the graph; this swap closes that gap.
   - Rationale: a comment-only edit must change the file content hash
     (driving the incremental path) without changing any graph data.
     Therefore every stat MUST be identical to the first run. Anything
     else is a regression.

2) New test: incremental output is byte-equivalent to a full rebuild.
   - Run analyze → comment-only edit → analyze (incremental writeback)
     → analyze --force (full rebuild from same on-disk state).
   - Assert files / nodes / edges / communities / processes are exactly
     equal across the incremental and the --force passes.
   - This is the PR's central correctness contract, now proven by a
     test that exercises the real runtime path end-to-end against a
     real on-disk LadybugDB.

All 5 orchestration tests pass locally (52s), including the new
equivalence test — every stat field matches exactly between incremental
and --force on the mini-repo fixture.

tsc --noEmit clean.

* fix(incremental): F1 cross-file edge consistency + F4 stable chunk sort + unit coverage (#1511)

Patch addressing two of the still-open changes-requested findings on PR
#1479, rebased onto the current feat/incremental-indexing head. F3
(parser fingerprint in the cache key), F5 (atomic saveMeta), and F6
(AGENTS.md phrasing) were already handled on the branch, so the
corresponding parts of the original patch were dropped as redundant.

  F1 (Blocker) — Cross-file edges between unchanged files
    Adds `computeEffectiveWriteSet(graph, toWriteSet)` to
    subgraph-extract.ts: a single pass over the new graph's edges that
    pulls the unchanged-side file of every writable-boundary-crossing
    edge into the write set. run-analyze composes it ON TOP of the
    existing importer-BFS expansion and feeds the combined set to BOTH
    `deleteNodesForFile` and `extractChangedSubgraph`, so the delete
    cascade and the writeback subgraph cover identical files (asymmetry
    would leave stale rows or PK-conflict at COPY time). The BFS reads
    IMPORTS from the pre-pipeline DB (catches files that *stopped*
    importing a changed file); the edge walk reads the new graph
    (catches refined CALLS edges the pre-run DB couldn't predict, e.g.
    a barrel re-export shifting a symbol from B to D). `extractChangedSubgraph`
    stays a pure filter — all expansion is the orchestrator's job.

  F4 (Medium) — Restore alphabetical chunk sort
    `parseableScanned` is sorted before chunking. Filesystem-scan order
    isn't stable enough across runs/platforms (notably macOS APFS) to
    keep chunk hashes consistent, so the parse cache thrashes without
    it. The pre-existing Ruby cross-file resolution order-dependency the
    old comment cited is independent — the sort surfaces it but doesn't
    cause it; tracked separately rather than leaving the cache cold.

  Tests — incremental-subgraph-extract.test.ts
    Locks the F1 invariants: `extractChangedSubgraph` is a pure filter
    (includes only the set it's given, plus graph-wide nodes; edges
    fire on one writable endpoint), and `computeEffectiveWriteSet`
    covers the barrel-re-export scenario, the symmetric edge-into-
    changed-file case, the no-boundary-crossed no-op, graph-wide-node
    edges, and input-immutability. Supersedes the prior
    extractChangedSubgraph-only test file on the branch.

Co-authored-by: Val Vladescu <vvladescu-tb@users.noreply.github.com>

* chore(autofix): apply prettier + eslint fixes via /autofix command

* fix(call-processor): register properties in pre-pass to fix order-dependent field type disambiguation + regenerate golden snapshot

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/2d66666f-861c-432e-a4b0-11f2aefca98a

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>

* fix(call-processor): port worker-path property enrichment into the sequential pre-pass

Copilot's pre-pass in 8184439 fixed the Ruby attr_accessor order-dependence,
but it copied the OLD in-loop registration logic, not the canonical worker
path in parse-worker.ts. That left the sequential and worker paths emitting
non-identical Property nodes/symbols for the same source — silently breaking
the `incremental ≡ --force` invariant the moment a repo crosses the worker
threshold between runs.

Two concrete divergences are closed here:

  * Node id: worker keys Property as `${file}:${className}.${propName}`
    (qualified). Pre-pass was using `${file}:${propName}` (unqualified).
    Same source produced different graph ids depending on which path ran.

  * Field metadata: worker enriches each routed property with
    `provider.fieldExtractor` + `getFieldInfo`, falling back to
    `routedFieldInfo.type` for `declaredType` when the routing payload
    lacks one (e.g. types discovered from `@address = Address.new`
    ctor assignments rather than YARD `@return [Type]`), and propagates
    `visibility` / `isStatic` / `isReadonly`. Pre-pass did none of this,
    so on the sequential path `resolveFieldAccessType` failed to walk
    chains where the type only came from the FieldExtractor.

The pre-pass now mirrors parse-worker.ts:1803-1898 verbatim, with one
deliberate difference: the FieldInfo cache is scoped to a single
`processCalls` invocation rather than module-level (the worker process
is short-lived; the main thread is not, and a module-level cache would
leak state between analyze runs).

Also drops the now-stale "Defer resolution: Ruby attr_accessor properties
are registered during this same loop" comment on `pendingWrites.push` —
the rationale is no longer accurate after Copilot's pre-pass, but the
deferral is still needed so write-access tracking sees inference that
completes during the main loop. Comment updated to reflect that.

Verification:
  * `tsc --noEmit`: 0 errors
  * test/unit (call-processor, call-routing, field-extraction, ruby-self-call): 224 passing
  * test/integration (ruby, ruby-sequential-mixin, pipeline-graph-golden): 137 passing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(call-processor): key fieldInfoCache by filePath:startIndex, not raw byte offset

Claude's review of 255bdf6 caught a real collision in the FieldInfoCache I
added: keying by `classNode.startIndex` alone is a per-file byte offset, so
two files that both begin with a class at byte 0 — extremely common in Ruby /
Python, where files frequently open with `class Foo`, `module Foo` — collide
on the same cache entry. The second file's `getFieldInfo` then returns the
first file's FieldInfo map, producing wrong `declaredType` / `visibility` /
`isReadonly` on its properties.

Same shape as the bug that already exists in parse-worker.ts:377 (also keyed
by `classNode.startIndex` in a module-level map, persistent across files
processed by the same worker). Fixing the symmetric pre-existing leak in
parse-worker.ts is a separate, scoped follow-up — left out of this commit to
keep the fix minimal and reviewable.

Cache map and key are now both string-typed. Composite key
`${context.filePath}:${classNode.startIndex}` keeps the within-file hit rate
(one FieldExtractor.extract() per class regardless of how many
`attr_accessor` lines it has) while eliminating cross-file aliasing.

Verification on the patched HEAD:
  * `tsc --noEmit`: 0 errors
  * test/unit (call-processor, call-routing, field-extraction, ruby-self-call): 224 passing
  * test/integration (ruby, ruby-sequential-mixin, pipeline-graph-golden): 137 passing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Gergő Magyar <gergomagyar@icloud.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Val Vladescu <val.vladescu@thirdbridge.com>
Co-authored-by: Val Vladescu <vvladescu-tb@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants