Fix/detect changes hunk mapping by PurpleNewNew · Pull Request #141 · abhigyanpatwari/GitNexus

PurpleNewNew · 2026-03-02T03:22:47Z

Summary

This PR upgrades MCP detect_changes from file-level detection to hunk-level symbol mapping, and fixes line-side alignment issues that previously caused missed matches.

What changed

Switched from git diff --name-only to full patch parsing (--unified=0 --no-color --find-renames --relative).
Added robust diff parsing:
- hunk headers (@@ -a,b +c,d @@)
- quoted/escaped diff --git paths (including octal escapes)
- normalized path handling.
Replaced file substring matching with exact file-path + line-range overlap matching.
Added support for change types: Modified, Added, Deleted, Renamed.
Added fallback UNINDEXED:<file> nodes when no indexed symbol matches, so file changes are not silently dropped.
Enriched output with:
- summary.changed_hunks
- changed_file_hunks
- per-symbol changed_ranges, changed_new_ranges, hunk_count
- filePath in affected process steps.

Critical fix included

Symbol matching now prefers old-side hunk ranges (oldRanges) and keeps newRanges for diagnostics.
This addresses missed detections in scenarios such as:
- pure deletions
- insertions near top-of-file followed by edits later (line-shift issue).

Scope

1 file changed: gitnexus/src/mcp/local/local-backend.ts
Branch commits:
- 2094935 map detect_changes diff hunks to symbols
- db91b94 parse quoted diff header paths in detect_changes
- e0e9dfd align detect_changes symbol mapping to old-side hunks

Validation

npm run build passed.
Manual smoke checks confirmed expected old/new range behavior for deletion and insertion+modify cases.

Dear Reviewer

This PR is a substantial behavior change and may include changes you do not want. （︶^︶）
Please review carefully before merging, especially output-schema expectations and old/new range semantics for downstream consumers.

vercel · 2026-03-02T03:22:51Z

@PurpleNewNew is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

magyargergo · 2026-03-16T21:34:06Z

@claude ultra-think during reviewing this PR and try to find performance issues and verify them.

claude · 2026-03-16T21:34:24Z

Claude finished @magyargergo's task in 3m 31s —— View job

Performance Review: `detect_changes` Hunk Mapping

Tasks:

Gather context and understand the request
Read the changed file and full diff
Analyze algorithmic complexity
Analyze regex and string operation performance
Analyze memory usage patterns
Verify findings against the actual implementation
Post comprehensive performance review

Summary

The PR is a meaningful correctness upgrade but introduces several new performance regressions compared to the old --name-only approach, and inherits one N+1 pattern from the original code with a larger blast radius. Below are findings ordered by severity, each verified against the source.

🔴 Critical

1. `execFileSync` with no `maxBuffer` — can crash on large diffs

File: local-backend.ts:1289

const output = execFileSync('git', diffArgs, { cwd: repo.repoPath, encoding: 'utf-8' });

The old implementation used --name-only, producing output of ~50 bytes/file (e.g. 500 files ≈ 25KB). The new implementation captures the full unified patch (--unified=0 --find-renames). For a large PR or compare scope against a stale base branch, this can easily produce multi-megabyte output.

Node.js execFileSync defaults to maxBuffer: 1024 * 1024 (1 MB). A diff touching 200 files with 20 hunks and 50-line average hunk bodies ≈ 16MB — 16× over the default, producing an uncaught ENOBUFS error. Unlike the rg call at line 1552, no maxBuffer or timeout is set here.

Additionally, there is no timeout set, so a slow git process on a large repo or network-mounted filesystem will hang the entire Node.js event loop indefinitely.

Fix this →

2. N+1 DB query pattern for symbols — up to 3N sequential queries

File: local-backend.ts:1327-1338

for (const fileChange of changedFiles) {                     // N iterations
  for (const candidatePath of candidatePaths) {              // up to 3 iterations
    symbols = await executeQuery(repo.id, `
      MATCH (n) WHERE n.filePath = '${escaped}' ...
    `);
  }
}

For N changed files, this fires up to 3N sequential, awaited DB round-trips (one per candidate path: filePath, oldPath, newPath). For a 50-file diff that's up to 150 sequential queries where a single batched query would suffice:

MATCH (n) WHERE n.filePath IN ['path1', 'path2', ...]
RETURN n.id, n.name, labels(n)[0], n.filePath, n.startLine, n.endLine

The old code also had N queries, but only one path per file and capped results at LIMIT 20. This PR removes that cap and triples the potential query count.

Fix this →

3. N+1 DB query pattern for processes — 1 query per changed symbol

File: local-backend.ts:1395-1420

for (const sym of changedSymbols) {           // M iterations
  const procs = await executeQuery(repo.id, `
    MATCH (n {id: '${escaped}'})-[r:CodeRelation ...]->(p:Process) ...
  `);
}

Each changed symbol triggers its own process-lookup query. This pattern existed in the old code but is now worse: the old code capped symbols at LIMIT 20 per file and only matched by file path substring. The new code uses exact path + line-range overlap which can return far more symbols, multiplying M significantly.

For 100 changed symbols: 100 sequential queries → should be 1 WHERE n.id IN [...] query.

Fix this →

🟡 Moderate

4. Entire diff materialized into memory via `split('\n')`

File: local-backend.ts:164

const lines = diffText.split('\n');

split() eagerly allocates an array of every line in the diff. With the old --name-only, this is negligible. With full patch output, a large diff (10K+ lines) creates a 10K+ element array in one shot. For the compare scope on a main branch, this could be hundreds of thousands of lines. A line-by-line approach (iterating with indexOf('\n')) avoids allocating the full array at once.

5. `execFileSync` blocks the Node.js event loop

File: local-backend.ts:1289

execFileSync is synchronous — it blocks the event loop for the entire duration of the git process. On large repos, a cold git diff HEAD can take 1–5 seconds. During this time no other MCP requests can be processed. The async execFile (promisified) would allow other requests to proceed concurrently. The rg call in rename at line 1552 shares this problem.

6. Unnecessary dynamic import of built-in `child_process`

File: local-backend.ts:1264

const { execFileSync } = await import('child_process');

The comment at line 12 explains that lazy imports are used to avoid loading onnxruntime-node at startup. child_process is a Node.js built-in — it never crashes, is always available, and incurs zero startup overhead. This await import(...) adds a microtask and Promise allocation on every detectChanges call. It should be a static top-of-file import.

🟢 Minor

7. `decodeGitQuotedPath` uses string concatenation in a character loop

File: local-backend.ts:75-121

let out = '';
for (let i = 0; i < value.length; i++) {
  out += ch;  // O(n²) string allocations in worst case

In JavaScript, string += char in a tight loop creates n intermediate string objects for an n-character path. For typical short file paths this is negligible (< 200 chars), but for deeply escaped Unicode paths it degrades. Using const parts: string[] = []; ... parts.push(ch); return parts.join('') is the idiomatic O(n) approach.

8. `rangesOverlap` is linear rather than binary

File: local-backend.ts:157-160

return ranges.some((r) => start <= r.end && end >= r.start);

The ranges array produced by mergeRanges is already sorted. A binary search for the first range where r.end >= start would be O(log n) vs O(n). In practice, hunk counts are typically small (1–20), so the impact is minor, but on the hot path (called for every symbol in every changed file) it accumulates.

9. Duplicate field in output doubles serialization

File: local-backend.ts:1440-1441

ranges: f.oldRanges,
old_ranges: f.oldRanges,

Both ranges and old_ranges point to the same array object and will be independently serialized into the JSON output, doubling the bytes transmitted for oldRanges data. One of them should be removed.

Verified Complexity Summary

Scenario	Old code	New code
DB queries per invocation	`N + M`	Up to `3N + M` (where N = changed files, M = changed symbols)
Git output size	~50 bytes/file	Full patch (~5–50KB/file)
Memory for diff	Negligible	`O(total diff lines)` in single array
Event loop block	`O(git diff --name-only time)`	`O(git diff full patch time)` — larger

For a typical small diff (5 files, 10 symbols) the new code remains fast in absolute terms. The regressions surface on large diffs (50+ files, compare scope against a distant base), where the N+1 DB queries are the dominant cost and the missing maxBuffer becomes a hard failure.

magyargergo · 2026-03-28T14:44:38Z

⚠️ Upcoming Prettier formatting — rebase instructions

PR #563 adds Prettier as the code formatter for the repo. When it merges, the bulk format commit will touch ~350 files (style-only: whitespace, quotes, trailing commas). Your branch will likely conflict.

After #563 merges, rebase your branch:

git fetch origin
git checkout <your-branch>
git rebase origin/main

# Conflicts will be formatting-only — accept your version:
git checkout --theirs .
git add .
git rebase --continue

# Then re-format your branch to match the new style:
npx prettier --write .
git add -A
git commit -m "style: apply prettier formatting"
git push --force-with-lease

New setup step: Run npm install at the repo root (not just in gitnexus/) to get prettier + activate the pre-commit hook. The hook auto-formats staged files on every commit going forward.

magyargergo · 2026-05-11T07:02:34Z

Please submit a new PR if this is still relevant

PurpleNewNew added 3 commits February 28, 2026 16:51

fix(mcp): map detect_changes diff hunks to symbols

2094935

fix(mcp): parse quoted diff header paths in detect_changes

db91b94

fix(mcp): align detect_changes symbol mapping to old-side hunks

e0e9dfd

magyargergo force-pushed the main branch from cf3f885 to b4fbf33 Compare March 8, 2026 18:20

Repository owner deleted a comment from claude Bot Mar 16, 2026

magyargergo force-pushed the main branch from 7b00e7f to b48cfe9 Compare March 18, 2026 07:31

github-actions Bot mentioned this pull request Mar 22, 2026

Triage Sweep Report — 2026-03-22 #422

Open

magyargergo closed this May 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix/detect changes hunk mapping#141

Fix/detect changes hunk mapping#141
PurpleNewNew wants to merge 3 commits into
abhigyanpatwari:mainfrom
PurpleNewNew:fix/detect-changes-hunk-mapping

PurpleNewNew commented Mar 2, 2026

Uh oh!

vercel Bot commented Mar 2, 2026

Uh oh!

magyargergo commented Mar 16, 2026

Uh oh!

claude Bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

magyargergo commented Mar 28, 2026

Uh oh!

magyargergo commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

PurpleNewNew commented Mar 2, 2026

Summary

What changed

Critical fix included

Scope

Validation

Dear Reviewer

Uh oh!

vercel Bot commented Mar 2, 2026

Uh oh!

magyargergo commented Mar 16, 2026

Uh oh!

claude Bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance Review: detect_changes Hunk Mapping

Summary

🔴 Critical

1. execFileSync with no maxBuffer — can crash on large diffs

2. N+1 DB query pattern for symbols — up to 3N sequential queries

3. N+1 DB query pattern for processes — 1 query per changed symbol

🟡 Moderate

4. Entire diff materialized into memory via split('\n')

5. execFileSync blocks the Node.js event loop

6. Unnecessary dynamic import of built-in child_process

🟢 Minor

7. decodeGitQuotedPath uses string concatenation in a character loop

8. rangesOverlap is linear rather than binary

9. Duplicate field in output doubles serialization

Verified Complexity Summary

Uh oh!

magyargergo commented Mar 28, 2026

⚠️ Upcoming Prettier formatting — rebase instructions

Uh oh!

magyargergo commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Mar 16, 2026 •

edited

Loading

Performance Review: `detect_changes` Hunk Mapping

1. `execFileSync` with no `maxBuffer` — can crash on large diffs

4. Entire diff materialized into memory via `split('\n')`

5. `execFileSync` blocks the Node.js event loop

6. Unnecessary dynamic import of built-in `child_process`

7. `decodeGitQuotedPath` uses string concatenation in a character loop

8. `rangesOverlap` is linear rather than binary