Skip to content

Fix/detect changes hunk mapping#141

Closed
PurpleNewNew wants to merge 3 commits into
abhigyanpatwari:mainfrom
PurpleNewNew:fix/detect-changes-hunk-mapping
Closed

Fix/detect changes hunk mapping#141
PurpleNewNew wants to merge 3 commits into
abhigyanpatwari:mainfrom
PurpleNewNew:fix/detect-changes-hunk-mapping

Conversation

@PurpleNewNew

Copy link
Copy Markdown
Contributor

Summary

This PR upgrades MCP detect_changes from file-level detection to hunk-level symbol mapping, and fixes line-side alignment issues that previously caused missed matches.

What changed

  • Switched from git diff --name-only to full patch parsing (--unified=0 --no-color --find-renames --relative).
  • Added robust diff parsing:
    • hunk headers (@@ -a,b +c,d @@)
    • quoted/escaped diff --git paths (including octal escapes)
    • normalized path handling.
  • Replaced file substring matching with exact file-path + line-range overlap matching.
  • Added support for change types: Modified, Added, Deleted, Renamed.
  • Added fallback UNINDEXED:<file> nodes when no indexed symbol matches, so file changes are not silently dropped.
  • Enriched output with:
    • summary.changed_hunks
    • changed_file_hunks
    • per-symbol changed_ranges, changed_new_ranges, hunk_count
    • filePath in affected process steps.

Critical fix included

  • Symbol matching now prefers old-side hunk ranges (oldRanges) and keeps newRanges for diagnostics.
  • This addresses missed detections in scenarios such as:
    • pure deletions
    • insertions near top-of-file followed by edits later (line-shift issue).

Scope

  • 1 file changed: gitnexus/src/mcp/local/local-backend.ts
  • Branch commits:
    • 2094935 map detect_changes diff hunks to symbols
    • db91b94 parse quoted diff header paths in detect_changes
    • e0e9dfd align detect_changes symbol mapping to old-side hunks

Validation

  • npm run build passed.
  • Manual smoke checks confirmed expected old/new range behavior for deletion and insertion+modify cases.

Dear Reviewer

This PR is a substantial behavior change and may include changes you do not want. (︶^︶)
Please review carefully before merging, especially output-schema expectations and old/new range semantics for downstream consumers.

@vercel

vercel Bot commented Mar 2, 2026

Copy link
Copy Markdown

@PurpleNewNew is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

Repository owner deleted a comment from claude Bot Mar 16, 2026
@magyargergo

Copy link
Copy Markdown
Collaborator

@claude ultra-think during reviewing this PR and try to find performance issues and verify them.

@claude

claude Bot commented Mar 16, 2026

Copy link
Copy Markdown

Claude finished @magyargergo's task in 3m 31s —— View job


Performance Review: detect_changes Hunk Mapping

Tasks:

  • Gather context and understand the request
  • Read the changed file and full diff
  • Analyze algorithmic complexity
  • Analyze regex and string operation performance
  • Analyze memory usage patterns
  • Verify findings against the actual implementation
  • Post comprehensive performance review

Summary

The PR is a meaningful correctness upgrade but introduces several new performance regressions compared to the old --name-only approach, and inherits one N+1 pattern from the original code with a larger blast radius. Below are findings ordered by severity, each verified against the source.


🔴 Critical

1. execFileSync with no maxBuffer — can crash on large diffs

File: local-backend.ts:1289

const output = execFileSync('git', diffArgs, { cwd: repo.repoPath, encoding: 'utf-8' });

The old implementation used --name-only, producing output of ~50 bytes/file (e.g. 500 files ≈ 25KB). The new implementation captures the full unified patch (--unified=0 --find-renames). For a large PR or compare scope against a stale base branch, this can easily produce multi-megabyte output.

Node.js execFileSync defaults to maxBuffer: 1024 * 1024 (1 MB). A diff touching 200 files with 20 hunks and 50-line average hunk bodies ≈ 16MB — 16× over the default, producing an uncaught ENOBUFS error. Unlike the rg call at line 1552, no maxBuffer or timeout is set here.

Additionally, there is no timeout set, so a slow git process on a large repo or network-mounted filesystem will hang the entire Node.js event loop indefinitely.

Fix this →


2. N+1 DB query pattern for symbols — up to 3N sequential queries

File: local-backend.ts:1327-1338

for (const fileChange of changedFiles) {                     // N iterations
  for (const candidatePath of candidatePaths) {              // up to 3 iterations
    symbols = await executeQuery(repo.id, `
      MATCH (n) WHERE n.filePath = '${escaped}' ...
    `);
  }
}

For N changed files, this fires up to 3N sequential, awaited DB round-trips (one per candidate path: filePath, oldPath, newPath). For a 50-file diff that's up to 150 sequential queries where a single batched query would suffice:

MATCH (n) WHERE n.filePath IN ['path1', 'path2', ...]
RETURN n.id, n.name, labels(n)[0], n.filePath, n.startLine, n.endLine

The old code also had N queries, but only one path per file and capped results at LIMIT 20. This PR removes that cap and triples the potential query count.

Fix this →


3. N+1 DB query pattern for processes — 1 query per changed symbol

File: local-backend.ts:1395-1420

for (const sym of changedSymbols) {           // M iterations
  const procs = await executeQuery(repo.id, `
    MATCH (n {id: '${escaped}'})-[r:CodeRelation ...]->(p:Process) ...
  `);
}

Each changed symbol triggers its own process-lookup query. This pattern existed in the old code but is now worse: the old code capped symbols at LIMIT 20 per file and only matched by file path substring. The new code uses exact path + line-range overlap which can return far more symbols, multiplying M significantly.

For 100 changed symbols: 100 sequential queries → should be 1 WHERE n.id IN [...] query.

Fix this →


🟡 Moderate

4. Entire diff materialized into memory via split('\n')

File: local-backend.ts:164

const lines = diffText.split('\n');

split() eagerly allocates an array of every line in the diff. With the old --name-only, this is negligible. With full patch output, a large diff (10K+ lines) creates a 10K+ element array in one shot. For the compare scope on a main branch, this could be hundreds of thousands of lines. A line-by-line approach (iterating with indexOf('\n')) avoids allocating the full array at once.

5. execFileSync blocks the Node.js event loop

File: local-backend.ts:1289

execFileSync is synchronous — it blocks the event loop for the entire duration of the git process. On large repos, a cold git diff HEAD can take 1–5 seconds. During this time no other MCP requests can be processed. The async execFile (promisified) would allow other requests to proceed concurrently. The rg call in rename at line 1552 shares this problem.

6. Unnecessary dynamic import of built-in child_process

File: local-backend.ts:1264

const { execFileSync } = await import('child_process');

The comment at line 12 explains that lazy imports are used to avoid loading onnxruntime-node at startup. child_process is a Node.js built-in — it never crashes, is always available, and incurs zero startup overhead. This await import(...) adds a microtask and Promise allocation on every detectChanges call. It should be a static top-of-file import.


🟢 Minor

7. decodeGitQuotedPath uses string concatenation in a character loop

File: local-backend.ts:75-121

let out = '';
for (let i = 0; i < value.length; i++) {
  out += ch;  // O(n²) string allocations in worst case

In JavaScript, string += char in a tight loop creates n intermediate string objects for an n-character path. For typical short file paths this is negligible (< 200 chars), but for deeply escaped Unicode paths it degrades. Using const parts: string[] = []; ... parts.push(ch); return parts.join('') is the idiomatic O(n) approach.

8. rangesOverlap is linear rather than binary

File: local-backend.ts:157-160

return ranges.some((r) => start <= r.end && end >= r.start);

The ranges array produced by mergeRanges is already sorted. A binary search for the first range where r.end >= start would be O(log n) vs O(n). In practice, hunk counts are typically small (1–20), so the impact is minor, but on the hot path (called for every symbol in every changed file) it accumulates.

9. Duplicate field in output doubles serialization

File: local-backend.ts:1440-1441

ranges: f.oldRanges,
old_ranges: f.oldRanges,

Both ranges and old_ranges point to the same array object and will be independently serialized into the JSON output, doubling the bytes transmitted for oldRanges data. One of them should be removed.


Verified Complexity Summary

Scenario Old code New code
DB queries per invocation N + M Up to 3N + M (where N = changed files, M = changed symbols)
Git output size ~50 bytes/file Full patch (~5–50KB/file)
Memory for diff Negligible O(total diff lines) in single array
Event loop block O(git diff --name-only time) O(git diff full patch time) — larger

For a typical small diff (5 files, 10 symbols) the new code remains fast in absolute terms. The regressions surface on large diffs (50+ files, compare scope against a distant base), where the N+1 DB queries are the dominant cost and the missing maxBuffer becomes a hard failure.

@magyargergo

Copy link
Copy Markdown
Collaborator

⚠️ Upcoming Prettier formatting — rebase instructions

PR #563 adds Prettier as the code formatter for the repo. When it merges, the bulk format commit will touch ~350 files (style-only: whitespace, quotes, trailing commas). Your branch will likely conflict.

After #563 merges, rebase your branch:

git fetch origin
git checkout <your-branch>
git rebase origin/main

# Conflicts will be formatting-only — accept your version:
git checkout --theirs .
git add .
git rebase --continue

# Then re-format your branch to match the new style:
npx prettier --write .
git add -A
git commit -m "style: apply prettier formatting"
git push --force-with-lease

New setup step: Run npm install at the repo root (not just in gitnexus/) to get prettier + activate the pre-commit hook. The hook auto-formats staged files on every commit going forward.

@magyargergo

Copy link
Copy Markdown
Collaborator

Please submit a new PR if this is still relevant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants