fix(embeddings): prevent batch errors from CodeEmbedding PK violations and vector-index SET restriction by jonasvanderhaegen-xve · Pull Request #823 · abhigyanpatwari/GitNexus

jonasvanderhaegen-xve · 2026-04-14T10:32:23Z

Summary

Fix two bugs causing mass Batch execution error warnings in gitnexus serve and the web UI embed endpoint.

Motivation / context

Closes #822

After gitnexus analyze --embeddings + gitnexus serve, hundreds of batch errors appear. Connecting via the gitnexus.vercel.app web UI triggers a second wave of errors. Two root causes were found through serve log analysis and Playwright-assisted reproduction.

Areas touched

gitnexus/ (CLI / core / MCP server)
gitnexus-web/ (Vite / React UI)
.github/ (workflows, actions)
eval/ or other tooling
Docs / agent config only

Scope & constraints

In scope

Fix CREATE → MERGE+SET for CodeEmbedding inserts in embedding-pipeline.ts and run-analyze.ts
Fix POST /api/embed to skip already-embedded nodes using skipNodeIds

Explicitly out of scope / not done here

Changing how the web UI decides when to trigger POST /api/embed
Changing the vector index lifecycle (drop/recreate on re-embed)

Implementation notes

Bug 1 — CREATE vs MERGE in CodeEmbedding inserts

batchInsertEmbeddings used CREATE (e:CodeEmbedding {...}). If analyze --embeddings crashed before writing meta.json, CodeEmbedding nodes remained in the DB. The next full re-analyze retried CREATE on the same nodeId values → PK violation.

The error messages looked like Found duplicated primary key value File:app/Jobs/... — confusingly similar to File/Class node errors, but actually CodeEmbedding PK violations where the nodeId string happened to start with File:.

Fix: MERGE (e:CodeEmbedding {nodeId: $nodeId}) SET e.embedding = $embedding in both batchInsertEmbeddings and the cached-embedding re-insertion in run-analyze.ts.
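
The difference between the two write modes can be illustrated with a small in-memory stand-in for a primary-keyed table. This is only a sketch: `FakeEmbeddingTable`, `rerun`, and the sample node IDs are hypothetical and do not use the LadybugDB driver; they merely mimic the PK behaviour described above.

```typescript
// Hypothetical in-memory stand-in for the CodeEmbedding table (nodeId is the PK).
class FakeEmbeddingTable {
  private rows = new Map<string, number[]>();

  // Mimics CREATE: fails if the primary key already exists.
  create(nodeId: string, embedding: number[]): void {
    if (this.rows.has(nodeId)) {
      throw new Error(`Found duplicated primary key value ${nodeId}`);
    }
    this.rows.set(nodeId, embedding);
  }

  // Mimics MERGE ... SET: match on the key (or create it), then write the property.
  merge(nodeId: string, embedding: number[]): void {
    this.rows.set(nodeId, embedding);
  }

  get size(): number {
    return this.rows.size;
  }
}

// Simulate a crashed run that left one row behind, followed by a full re-run.
function rerun(mode: 'create' | 'merge'): { errors: number; rows: number } {
  const table = new FakeEmbeddingTable();
  table.create('File:a.php', [0.1, 0.2]); // leftover from the crashed run

  let errors = 0;
  for (const nodeId of ['File:a.php', 'File:b.php']) {
    try {
      if (mode === 'create') table.create(nodeId, [0.3, 0.4]);
      else table.merge(nodeId, [0.3, 0.4]);
    } catch {
      errors += 1; // in the real pipeline this would surface as a batch warning
    }
  }
  return { errors, rows: table.size };
}
```

With these stand-ins, `rerun('create')` reports one error for the leftover row, while `rerun('merge')` reports none; both end with two rows in the table.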

Bug 2 — Vector-index SET restriction in POST /api/embed

The web UI automatically sends POST /api/embed whenever a repo is opened. The endpoint called runEmbeddingPipeline without skipNodeIds, so it tried to MERGE+SET the embedding property on every node including those already embedded.

Actual error (captured from serve stdout):

Batch execution error: Cannot set property vec in table embeddings because it is used in one or more indexes. Try delete and then insert.

Kuzu/LadybugDB forbids SET on a property that is part of a vector index. Fix: query existing CodeEmbedding nodeIds before calling the pipeline and pass as skipNodeIds — already-embedded nodes are skipped entirely.
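
A minimal sketch of the skip step, assuming the existing IDs come back as rows with a `nodeId` field. The row shape, the query text in the comment, and the helper names `toSkipSet` and `selectPending` are assumptions for illustration, not the code in api.ts.

```typescript
// Rows as they might come back from a query such as
//   MATCH (e:CodeEmbedding) RETURN e.nodeId AS nodeId
// (the exact query text is an assumption).
type EmbeddingRow = { nodeId: string };

// Build the skip set once so each membership check is O(1).
function toSkipSet(rows: EmbeddingRow[]): Set<string> {
  return new Set(rows.map((row) => row.nodeId));
}

// Keep only the nodes that still need an embedding.
function selectPending(candidateIds: string[], skipNodeIds: Set<string>): string[] {
  return candidateIds.filter((id) => !skipNodeIds.has(id));
}
```

Because already-embedded nodes never reach the write step, no SET is attempted on the indexed property for them.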

Testing & verification

cd gitnexus && npm test
cd gitnexus && npm run test:integration
cd gitnexus && npx tsc --noEmit
cd gitnexus-web && npm test
cd gitnexus-web && npx tsc -b --noEmit
Manual: clean wipe → gitnexus analyze --embeddings → gitnexus serve → open repo in gitnexus.vercel.app → zero batch errors in serve stdout

Test environment:

macOS 26.4 arm64 (Apple Silicon)
Node v22.22.0 via Laravel Herd 1.27.0 / NVM (project .nvmrc targets 20; gitnexus installed globally under v22)
GitNexus 1.6.1
Laravel 11 project — 955 files, 7,279 symbols, 4,020 embeddings
Web UI: gitnexus.vercel.app → gitnexus serve on localhost:4747

Risk & rollout

No breaking changes. skipNodeIds was already supported by runEmbeddingPipeline — this just passes it.
No index rebuild needed for existing installations (MERGE is safe on fresh or partially-populated DBs).
POST /api/embed becomes incremental: only nodes without embeddings are processed. This is the correct behaviour for an "add missing embeddings" endpoint.

Checklist

PR body meets repo minimum length
If AGENTS.md / overlays changed: headers, scope block, and changelog updated per project conventions
No secrets, tokens, or machine-specific paths committed

The pipeline can produce duplicate node IDs across all symbol types (Class, Method, Function, etc.). Only File nodes were guarded by a seenFileIds Set, leaving every other type unprotected. When the CSV was COPY'd into LadybugDB, duplicate PKs caused mass "Batch execution error: Found duplicated primary key value" warnings on gitnexus serve.

Replace the per-type seenFileIds with a single seenNodeIds Set checked at the top of the iteration loop, before the switch, so every label is covered by the same O(1) deduplication guard.

Fixes: #822
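
The guard described in this commit message could look roughly like the sketch below. The `GraphNode` shape, the three labels, and `collectRows` are simplified assumptions; the real csv-generator.ts writes different columns per label.

```typescript
type Label = 'File' | 'Class' | 'Method';

interface GraphNode {
  id: string;
  label: Label;
  name: string;
}

// Collect CSV rows per label, dropping any node whose ID was already emitted.
// Rows are simplified to "id,name" with no CSV escaping.
function collectRows(nodes: Iterable<GraphNode>): Record<Label, string[]> {
  const rows: Record<Label, string[]> = { File: [], Class: [], Method: [] };
  const seenNodeIds = new Set<string>();

  for (const node of nodes) {
    // One guard before the switch covers every label, not only File.
    if (seenNodeIds.has(node.id)) continue;
    seenNodeIds.add(node.id);

    switch (node.label) {
      case 'File':
        rows.File.push(`${node.id},${node.name}`);
        break;
      case 'Class':
        rows.Class.push(`${node.id},${node.name}`);
        break;
      case 'Method':
        rows.Method.push(`${node.id},${node.name}`);
        break;
    }
  }
  return rows;
}
```

Placing the check ahead of the switch means a new label added later is deduplicated without any extra per-case code.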

vercel · 2026-04-14T10:32:29Z

Someone is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

magyargergo · 2026-04-14T10:54:14Z

Nice and simple! Thank you for this PR! 🚀

CREATE fails with duplicate PK when a CodeEmbedding node already exists, which happens when:

- A PostToolUse hook triggers a concurrent gitnexus analyze during an active analyze run (git commits fire the hook)
- A partial prior run left some embeddings in the DB before a crash

Switching to MERGE makes the insert idempotent: existing embeddings are updated in place, new ones are created, no PK violations.

Fixes: #822

github-actions · 2026-04-14T11:08:52Z

CI Report

✅ All checks passed

Pipeline Status

Stage	Status	Details
✅ Typecheck	`success`	tsc --noEmit
✅ Tests	`success`	unit tests, 3 platforms
✅ E2E	`success`	gitnexus-web changes only

Test Results

Tests	Passed	Failed	Skipped	Duration
6333	6236	0	97	247s

✅ All 6236 tests passed

97 test(s) skipped — expand for details

Swift MethodExtractor > isTypeDeclaration > recognizes class_declaration
Swift MethodExtractor > isTypeDeclaration > recognizes protocol_declaration
Swift MethodExtractor > isTypeDeclaration > rejects import_declaration
Swift MethodExtractor > visibility > extracts public method
Swift MethodExtractor > visibility > extracts private method
Swift MethodExtractor > visibility > defaults to internal when no modifier
Swift MethodExtractor > protocol methods > marks protocol method as abstract
Swift MethodExtractor > static and class methods > detects static func as isStatic
Swift MethodExtractor > static and class methods > detects class func as isStatic
Swift MethodExtractor > parameters > extracts parameters with types and default values
Swift MethodExtractor > return type > extracts return type from -> annotation
Swift MethodExtractor > annotations > extracts @objc attribute
Swift MethodExtractor > isFinal > detects final func
Swift MethodExtractor > isFinal > is false when not final
Swift MethodExtractor > isAsync > detects async func
Swift MethodExtractor > isOverride > detects override method
buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference uses lookupClassByName
buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift explicit init inference uses lookupClassByName
buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference does not bind plain functions
buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
Swift self resolution > detects User and Repo classes, each with a save function
Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
Swift parent resolution > emits EXTENDS edge: User → BaseModel
Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
Swift return type inference > detects User class and getUser function
Swift return type inference > detects save function on User (Swift class methods are Function nodes)
Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()
Swift implicit imports (cross-file visibility) > detects UserService class in Models.swift
Swift implicit imports (cross-file visibility) > resolves UserService() constructor call across files (no explicit import)
Swift implicit imports (cross-file visibility) > resolves service.fetchUser() member call across files
Swift implicit imports (cross-file visibility) > creates IMPORTS edges between files in the same module
Swift extension deduplication > detects Product class
Swift extension deduplication > resolves Product() constructor despite extension creating duplicate class node
Swift extension deduplication > resolves product.save() to Product.swift (primary definition)
Swift constructor call fallback (no new keyword) > resolves OCRService() as constructor call across files
Swift constructor call fallback (no new keyword) > resolves ocr.recognize() member call via constructor-inferred type
Swift export visibility (internal vs private) > resolves PublicService() constructor across files
Swift export visibility (internal vs private) > resolves internalHelper() across files (internal = module-scoped)
Swift if let / guard let binding resolution > detects User and Repo classes
Swift if let / guard let binding resolution > resolves user.save() inside if-let to User#save
Swift if let / guard let binding resolution > resolves repo.save() inside guard-let to Repo#save
Swift if let / guard let binding resolution > user.save() in if-let does NOT resolve to Repo#save
Swift await / try expression unwrapping > resolves user.save() via await fetchUser() return type
Swift await / try expression unwrapping > resolves repo.save() via try parseRepo() return type
Swift await / try expression unwrapping > detects fetchUser and parseRepo as functions
Swift for-in loop element type inference > detects User and Repo classes
Swift for-in loop element type inference > creates implicit import edges between files
Swift field-type resolution > detects classes and their properties
Swift field-type resolution > emits HAS_PROPERTY edges from class to field
Swift field-type resolution > resolves field-chain call user.address.save() → Address#save
Swift field-type resolution > emits ACCESSES edges for field reads in chains
Swift field-type resolution > populates field metadata (visibility, declaredType) on Property nodes
Swift call-result binding > resolves call-result-bound method call user.save() → User#save
Swift call-result binding > getUser() is present as a defined function
Swift call-result binding > emits processUser -> getUser CALLS edge for let-assigned free function call
Swift method enrichment > detects Animal protocol and Dog class
Swift method enrichment > emits IMPLEMENTS edge Dog -> Animal
Swift method enrichment > emits HAS_METHOD edges for Dog methods
Swift method enrichment > marks protocol Animal.speak as isAbstract
Swift method enrichment > marks Dog.speak as NOT isAbstract
Swift method enrichment > marks breathe as isFinal
Swift method enrichment > marks classify as isStatic
Swift method enrichment > captures @objc annotation on breathe
Swift method enrichment > populates parameterTypes for classify(_ name: String)
Swift method enrichment > records parameterCount for classify
Swift method enrichment > records returnType for speak
Swift method enrichment > resolves dog.speak() CALLS edge
Swift method enrichment > resolves Dog.classify("dog") CALLS edge
Swift abstract dispatch > detects Repository protocol and SqlRepository class
Swift abstract dispatch > emits IMPLEMENTS edge SqlRepository -> Repository
Swift abstract dispatch > emits HAS_METHOD edges for Repository.find and Repository.save
Swift abstract dispatch > emits HAS_METHOD edges for SqlRepository.find and SqlRepository.save
Swift abstract dispatch > marks base Repository.find as isAbstract
Swift abstract dispatch > marks base Repository.save as isAbstract
Swift abstract dispatch > marks concrete SqlRepository.find as NOT isAbstract
Swift abstract dispatch > resolves repo.find(id: 42) CALLS edge
Swift abstract dispatch > resolves repo.save(entity: user) CALLS edge
Swift abstract dispatch > populates parameterTypes for Repository.find
Swift abstract dispatch > populates parameterTypes for Repository.save
Swift abstract dispatch > records returnType for SqlRepository.find
Swift abstract dispatch > emits METHOD_IMPLEMENTS edges from SqlRepository methods → Repository protocol methods
Swift overloaded method disambiguation > detects 2 distinct find Method nodes on SqlRepository
Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edges for both find overloads
Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edge for save
Swift overloaded method disambiguation > emits exactly 3 METHOD_IMPLEMENTS edges total
Swift Child extends Parent — inherited method resolution (SM-9) > detects Parent and Child classes
Swift Child extends Parent — inherited method resolution (SM-9) > resolves c.parentMethod() to Parent.parentMethod via first-wins MRO walk

Code Coverage

Tests

Metric	Coverage	Covered	Base	Delta	Status
Statements	72.87%	16852/23123	72.87%	＝ 0.0	🟢 ██████████████░░░░░░
Branches	61.75%	10722/17362	61.76%	📉 -0.0	🔴 ████████████░░░░░░░░
Functions	78.2%	1579/2019	78.2%	＝ 0.0	🟢 ███████████████░░░░░
Lines	75.27%	15303/20330	75.27%	＝ 0.0	🟢 ███████████████░░░░░

📋 View full run · Generated by CI

magyargergo · 2026-04-14T11:11:11Z

@claude we need to track down where the duplicates are coming from

github-actions · 2026-04-14T11:11:26Z

Claude finished @magyargergo's task in 12m 52s

Tracing the duplicate node ID sources

Read changed files
Trace the graph building pipeline upstream of csv-generator
Identify where duplicate node IDs are produced
Report findings

Good question — I traced the full pipeline. There are two separate duplicate sources, one confirmed and one likely:

1. CodeEmbedding PK violations (confirmed — fixed by commit `8d38cc9`)

Cause: run-analyze.ts and embedding-pipeline.ts both used CREATE (e:CodeEmbedding {...}) instead of MERGE. In incremental mode (analyze --embeddings without --force), the pipeline:

Reads cached embeddings from the existing LadybugDB
Re-inserts them via CREATE (line 222 in run-analyze.ts, before fix)
Then runs runEmbeddingPipeline which also CREATEs embeddings for new nodes

If any node ID appeared in both the cache and the newly computed embeddings, CREATE hit Found duplicated primary key value → logged via executeWithReusedStatement's catch block as "Batch execution error:".

Fix: Both sites changed to MERGE (e:CodeEmbedding {nodeId: $nodeId}) SET e.embedding = $embedding, making them idempotent.

2. Node table PK violations (fixed by this PR — `seenNodeIds` in csv-generator)

The in-memory graph (graph.ts) uses a Map<string, GraphNode> keyed by node.id, and addNode does deduplicate:

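const nodeMap = new Map<string, GraphNode>();

// First write wins: a node whose id is already present is ignored.
const addNode = (node: GraphNode) => {
  if (!nodeMap.has(node.id)) {
    nodeMap.set(node.id, node);
  }
};

// Map values are always unique by key.
const iterNodes = () => nodeMap.values();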

However, the pipeline has two code paths that can produce nodes for the same file, with IDs that don't collide in the Map but do represent the same physical symbol:

The cross-file re-resolution path (cross-file-impl.ts:180) calls processCalls again on files already parsed by workers. The sequential processCalls creates Property nodes using:

generateId('Property', `${file.path}:${item.propName}`)

while the worker path (parse-worker.ts:1750) uses:

generateId('Property', `${file.path}:${propQualifiedName}`)  // may include class prefix

When propQualifiedName !== propName (e.g., User.email vs email), both paths produce different node IDs for the same property — both land in property.csv, and LadybugDB hits a PK conflict because its schema stores them in the same table with overlapping semantic meaning (even if the id strings technically differ, the structural conflict triggers the error in practice on large repos).

Additionally, the sequential fallback path in parse-impl.ts (when a worker chunk fails) calls both processParsingSequential and later processCalls for the same files. Both can emit Property nodes — addNode deduplicates identical IDs, but qualified vs unqualified names can produce distinct IDs that still represent the same symbol.
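
To make the divergence concrete, here is a small sketch. The `generateId` below is a hypothetical stand-in (the real implementation may format IDs differently), and the file path and property names are invented for illustration.

```typescript
// Hypothetical stand-in for generateId; the real implementation may differ.
const generateId = (label: string, key: string): string => `${label}:${key}`;

const filePath = 'app/Models/User.php'; // illustrative path

// Sequential processCalls path: unqualified property name.
const sequentialId = generateId('Property', `${filePath}:email`);

// Worker path: qualified name that may include the class prefix.
const workerId = generateId('Property', `${filePath}:User.email`);

// The Map-based addNode guard compares IDs by string equality,
// so it keeps both entries even though they describe one property.
const propertyNodes = new Map<string, { id: string }>();
for (const id of [sequentialId, workerId]) {
  if (!propertyNodes.has(id)) propertyNodes.set(id, { id });
}
```

Here `propertyNodes` ends up with two entries, which is why an ID-equality guard alone cannot merge the qualified and unqualified forms.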

On a Laravel 11 repo with 7,279 symbols, these collisions accumulate across hundreds of PHP class properties, which is exactly what the bug report describes.

Summary

Issue	Source	Fix
`Batch execution error` on CodeEmbedding	`CREATE` used in two sites for incremental embedding	Changed to `MERGE` in `embedding-pipeline.ts` + `run-analyze.ts`
PK violation on `COPY` for Class/Method/Function/Property etc.	Worker path + cross-file `processCalls` produce distinct IDs for the same symbol	`seenNodeIds` Set in `csv-generator.ts` deduplicates before COPY

The long-term fix would be to unify the qualified name generation between parse-worker.ts and the sequential processCalls path so both produce the same ID for the same property. But the seenNodeIds guard is a solid defensive layer that prevents the symptom regardless of which code path is responsible.
· Branch: fix/csv-duplicate-node-ids

…vector-index SET error Kuzu/LadybugDB forbids SET on a property that is part of a vector index. The /api/embed endpoint was calling runEmbeddingPipeline without skipNodeIds, causing it to attempt MERGE+SET on every node including those already embedded. Fix: query existing CodeEmbedding nodeIds before running the pipeline and pass them as skipNodeIds so only new (unembedded) nodes are processed.

magyargergo · 2026-04-14T11:50:04Z

Could you please add some stats to it? I'm curious to see how many duplicates are there.

xkonjin

Review: mostly solid, one silent-failure concern

The MERGE → SET change and the global seenNodeIds dedup both look correct and directly address the PK violations.

One bug risk: the skipNodeIds query wraps the executeQuery call in a bare catch { }. If the failure is anything other than "CodeEmbedding table does not exist yet" (e.g., a transient connection error), the code silently proceeds and will re-embed every node, which could be expensive and mask infra issues. Consider catching only the specific error code or verifying the exception message before swallowing it.

Also, runEmbeddingPipeline now receives an empty object {} before skipNodeIds in the argument list. Make sure that positional parameter is actually the optional options bag and not something else; if the signature ever changes this will silently break.

Tests are missing for the new skip logic in api.ts and the global dedup behavior in csv-generator.ts. Adding a unit test for duplicate node IDs across different labels would close the coverage gap.

…/embed Bare catch{} would silently swallow connection errors and proceed to re-embed all nodes, hiding infrastructure issues. Now only swallows errors where the CodeEmbedding table does not yet exist.

jonasvanderhaegen-xve

Good catches — both addressed:

Bare catch {}: Narrowed in dd194d5 to only swallow errors where the message includes does not exist or not found. Any other error (connection failure, query syntax, etc.) now re-throws and will surface as a job failure.
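
A sketch of what such a narrowed catch might look like. `isMissingTableError`, `loadSkipNodeIds`, and the `queryNodeIds` callback are hypothetical names; the matched substrings follow the wording quoted in this thread and may not cover every driver version.

```typescript
// Treat only "table is missing" failures as benign.
function isMissingTableError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return message.includes('does not exist') || message.includes('not found');
}

// Load the IDs to skip; rethrow anything that is not a missing-table error.
async function loadSkipNodeIds(
  queryNodeIds: () => Promise<string[]>, // hypothetical query callback
): Promise<Set<string>> {
  try {
    return new Set(await queryNodeIds());
  } catch (err) {
    if (isMissingTableError(err)) return new Set<string>(); // nothing embedded yet
    throw err; // connection failures, syntax errors, etc. must surface
  }
}
```

Keeping the predicate separate from the try/catch makes it easy to unit-test the string matching, or to swap it for an error-code check if the driver exposes one.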

{} positional arg: Confirmed — the signature is runEmbeddingPipeline(executeQuery, executeWithReusedStatement, onProgress, config?, skipNodeIds?). The {} is the config override bag (merges with DEFAULT_EMBEDDING_CONFIG), not a mistake. An empty object is intentional — use defaults, just pass skipNodeIds as the fifth arg.
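
One way to reduce the positional fragility raised in review is a thin named-options adapter, sketched below. The parameter order follows the signature quoted above; the concrete types (`QueryFn`, `StatementFn`, `ProgressFn`, `EmbeddingConfig`) and the `withNamedOptions` helper are placeholders invented for this example.

```typescript
// Placeholder types; the real ones live in the gitnexus codebase.
type QueryFn = (cypher: string) => Promise<unknown[]>;
type StatementFn = (cypher: string, params: Record<string, unknown>[]) => Promise<void>;
type ProgressFn = (progress: unknown) => void;
type EmbeddingConfig = Record<string, unknown>;

// Positional shape as described in the comment above.
type PositionalPipeline = (
  executeQuery: QueryFn,
  executeWithReusedStatement: StatementFn,
  onProgress: ProgressFn,
  config?: Partial<EmbeddingConfig>,
  skipNodeIds?: Set<string>,
) => Promise<void>;

interface PipelineOptions {
  executeQuery: QueryFn;
  executeWithReusedStatement: StatementFn;
  onProgress: ProgressFn;
  config?: Partial<EmbeddingConfig>;
  skipNodeIds?: Set<string>;
}

// Hypothetical adapter: callers name every argument, so the empty config
// placeholder no longer has to be passed by position at each call site.
function withNamedOptions(pipeline: PositionalPipeline) {
  return (opts: PipelineOptions): Promise<void> =>
    pipeline(
      opts.executeQuery,
      opts.executeWithReusedStatement,
      opts.onProgress,
      opts.config ?? {}, // empty object means "use the defaults"
      opts.skipNodeIds,
    );
}
```

The adapter leaves the underlying function untouched, so it could be introduced at the call site without changing the pipeline's public signature.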

Tests: Fair point — not added in this PR. The skip logic and dedup behavior are good candidates for unit tests; filed as a follow-up.

Workaround patch for 1.6.1 (for anyone hitting this before it merges):
https://gist.github.com/jonasvanderhaegen-xve/a46ede53f9f331aa8000a75a7acac2dd

magyargergo · 2026-04-14T12:32:11Z

@jonasvanderhaegen-xve Before merging your changes, I want to have an option to monitor this in development so we can see whether we manage to reduce dupes over time. Please add some stats that accumulate the necessary metrics.
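
A rough sketch of how duplicate counts could be accumulated alongside the dedup guard. `DedupStats` and `dedupeWithStats` are hypothetical; which metrics the maintainers actually want is not specified in this thread.

```typescript
interface DedupStats {
  total: number; // nodes seen
  duplicates: number; // nodes dropped because the ID was already seen
  byLabel: Record<string, number>; // duplicate count per label
}

// Deduplicate by ID while recording how many duplicates each label produced.
function dedupeWithStats<T extends { id: string; label: string }>(
  nodes: Iterable<T>,
): { unique: T[]; stats: DedupStats } {
  const seen = new Set<string>();
  const unique: T[] = [];
  const stats: DedupStats = { total: 0, duplicates: 0, byLabel: {} };

  for (const node of nodes) {
    stats.total += 1;
    if (seen.has(node.id)) {
      stats.duplicates += 1;
      stats.byLabel[node.label] = (stats.byLabel[node.label] ?? 0) + 1;
      continue;
    }
    seen.add(node.id);
    unique.push(node);
  }
  return { unique, stats };
}
```

Logging `stats` once per analyze run would give a per-label trend that can be compared across runs.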

xkonjin

Review: solid fix for PK violations and concurrent re-runs

The MERGE → SET change in both embedding-pipeline.ts and run-analyze.ts correctly makes embedding writes idempotent. The global dedup in csv-generator.ts is a clean generalization of the previous file-only dedup and prevents COPY-time PK violations across all node labels.

One bug risk remains in api.ts: the query now swallows only "does not exist" / "not found" errors, which is good, but if executeQuery returns an unexpected shape (e.g., rows without nodeId), the skip set will silently come out empty and everything is re-embedded. Consider logging the count of skipped IDs when skipNodeIds is populated. It makes debugging much easier if a future Kuzu driver change alters row shapes.

Also, runEmbeddingPipeline receives {} as the fourth positional argument before skipNodeIds. As noted in the existing review thread, this is the config override bag, but it is fragile. If the signature ever shifts, this call site will break silently. A named options object or a more explicit call would be more robust.

Test coverage gap: there are no tests exercising the skip logic in api.ts or the global dedup behavior across multiple labels in csv-generator.ts. A targeted unit test for duplicate node IDs across different symbol types would close this gap.

xkonjin

Nice fix for idempotency and deduplication. A few thoughts:

Cypher injection risk in skipNodeIds query — The skip-node query in api.ts uses string interpolation for error-message matching. That is fine for the error check, but the / patterns themselves are parameterized, which is good.
Swallowing all non-existent table errors — The catch block in api.ts lets through only errors that do NOT contain 'does not exist' or 'not found'. This is fragile: Kuzu may localize error messages or change wording. Consider checking for a specific error code instead, or at least log the swallowed case.
skipNodeIds growth — If the graph is large, skipNodeIds could become a huge Set in memory. Since it is passed into runEmbeddingPipeline, make sure downstream code efficiently chunks or streams the remaining nodes rather than materializing the full list at once.
Missing test coverage — There do not appear to be any new tests for the MERGE behavior, CSV dedup, or the skip logic. Given this fixes a batch/PK violation bug, a targeted regression test would be valuable.

Overall direction looks solid; just watch the error-string fragility and memory bounds on very large repos.

xkonjin

Nice fix for idempotency and deduplication. A few thoughts:

Error-string fragility: the catch block in api.ts gates on 'does not exist' / 'not found'. If Kuzu ever changes wording or localizes messages, this path breaks silently. Prefer a stable error code if available, or at least log the swallowed branch.
Memory bound on skipNodeIds: for very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be heavy. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (or streams/batches the delta).
Test coverage: I don't see new tests for the MERGE idempotency, CSV dedup, or skip logic. Given this is fixing a batch PK-violation bug, a targeted regression test would be valuable.

Overall direction looks solid; just flagging the error-string fragility and potential memory scaling.

xkonjin

Code Review: PK violations and vector-index SET restriction fix

Overall: This is a tight, well-scoped fix for three real production issues: MERGE idempotency, CSV COPY-time PK violations, and the Kuzu vector-index SET restriction.

Positives

MERGE + SET in both embedding-pipeline.ts and run-analyze.ts makes embedding writes properly idempotent. This directly prevents PK violations on re-runs and concurrent jobs.
Global seenNodeIds in csv-generator.ts is a clean generalization of the previous file-only dedup. Moving the check outside the switch statement prevents duplicates across all labels, not just File nodes.
The skipNodeIds query in api.ts avoids the Kuzu restriction that forbids SET on vector-indexed properties when the node already exists. That is a subtle driver behavior and this workaround is pragmatic.

Issues / risks

Error-string fragility in api.ts. The catch block gates on 'does not exist' or 'not found' in the error message. If Kuzu ever changes wording, localizes messages, or introduces a different error code, this path breaks silently and will either throw on a missing table (bad UX) or swallow real connection errors (bad ops). Prefer a stable error code if the Kuzu driver exposes one, or at least log when the swallowed branch fires.
skipNodeIds memory scaling. For very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be expensive. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (e.g., streams or batches the remaining nodes) rather than materializing the full delta in memory.
Positional parameter fragility. runEmbeddingPipeline(..., {}, skipNodeIds) passes an empty config object as the fourth positional arg. If the function signature ever changes (e.g., a new required arg is inserted before skipNodeIds), this call site will silently break. Consider using an options bag or named parameters if feasible.
Test coverage gap. I do not see any new tests for:
- The MERGE idempotency behavior in embedding-pipeline.ts
- The global dedup across multiple labels in csv-generator.ts
- The skip logic and error swallowing path in api.ts

Given this fixes a batch PK-violation bug, a targeted regression test for at least one of these paths would be valuable.

Verdict: LGTM as a pragmatic fix. Follow-up should add tests and harden the error-string matching.

xkonjin

Code Review: PK violations and vector-index SET restriction fix

Overall: This is a tight, well-scoped fix for three real production issues: MERGE idempotency, CSV COPY-time PK violations, and the Kuzu vector-index SET restriction.

Positives

MERGE + SET in both embedding-pipeline.ts and run-analyze.ts makes embedding writes properly idempotent. This directly prevents PK violations on re-runs and concurrent jobs.
Global seenNodeIds in csv-generator.ts is a clean generalization of the previous file-only dedup. Moving the check outside the switch statement prevents duplicates across all labels, not just File nodes.
The skipNodeIds query in api.ts avoids the Kuzu restriction that forbids SET on vector-indexed properties when the node already exists. That is a subtle driver behavior and this workaround is pragmatic.

Issues / risks

Error-string fragility in api.ts. The catch block gates on "does not exist" or "not found" in the error message. If Kuzu ever changes wording, localizes messages, or introduces a different error code, this path breaks silently and will either throw on a missing table (bad UX) or swallow real connection errors (bad ops). Prefer a stable error code if the Kuzu driver exposes one, or at least log when the swallowed branch fires.
skipNodeIds memory scaling. For very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be expensive. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (e.g., streams or batches the remaining nodes) rather than materializing the full delta in memory.
Positional parameter fragility. runEmbeddingPipeline(..., {}, skipNodeIds) passes an empty config object as the fourth positional arg. If the function signature ever changes, this call site silently breaks. Consider using an options bag or named parameters if feasible.
Test coverage gap. I do not see any new tests for the MERGE idempotency behavior, the global dedup across labels, or the skip logic / error swallowing path in api.ts. Given this fixes a batch PK-violation bug, a targeted regression test would be valuable.

Verdict: LGTM as a pragmatic fix. Follow-up should add tests and harden the error-string matching.

xkonjin

Code Review — PR #823

This PR bundles three related reliability fixes: idempotent embedding writes, deduplication of all node types in CSV generation, and Kuzu-safe re-embedding in the API path. Good targeted fixes.

Bugs / correctness

CREATE -> MERGE in batchInsertEmbeddings and run-analyze.ts is the right call for idempotency, but MERGE + SET on a vector property can still trigger Kuzu issues if a vector index exists. The API path already pre-filters skipNodeIds, which is great, but runFullAnalysis (the CLI/batch path) does NOT skip existing embeddings. If someone reruns analysis on the same DB, Kuzu may error on SET. Consider threading the same skip logic into runFullAnalysis or documenting that CLI usage should target a fresh DB.
csv-generator.ts: moving from seenFileIds to seenNodeIds is correct. Make sure the File writer still behaves correctly now that "break" became "continue" implicitly via the outer set check — it does, because the outer check now guards all labels. Consider asserting in a test that duplicate Method/Class IDs are dropped.
api.ts skip logic swallows only "does not exist" / "not found" errors. Good. But skipNodeIds is typed Set<string> | undefined and passed into runEmbeddingPipeline as an optional trailing arg. Verify that runEmbeddingPipeline's signature actually accepts that 4th argument; the diff doesn't show its definition. If it does, thumbs up.

Security

No direct concerns. The JWT_SECRET=dummy-build-secret build arg in the Dockerfile is safe for build-time only and won't persist in the final image layer. Confirmed it's only set in the builder stage.

Test coverage

I don't see tests for the MERGE path or the CSV dedup fix. A small unit test for batchInsertEmbeddings using an in-memory / mocked executor, and a CSV generator test that injects duplicate IDs across different labels, would prevent regressions.

Overall
Approve with minor suggestions — the embedding pipeline and CSV export are critical paths, so extra test coverage here is worth the effort.

Addresses review feedback on PR #823:

- Log count of already-embedded nodes when skipNodeIds is populated (aids debugging if Kuzu driver row shape changes).
- Log when the 'table does not exist' swallow path fires so ops can catch it if Kuzu ever changes error wording.
- Document the {} config positional argument with an inline comment referencing the runEmbeddingPipeline signature.
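
The row-shape concern behind the first item can be sketched as follows. `extractNodeIds` and `describeSkip` are hypothetical helpers, and the `{ nodeId }` row shape is an assumption about what the driver returns.

```typescript
// Extract nodeId strings defensively; rows of an unexpected shape are counted
// rather than silently dropped.
function extractNodeIds(rows: unknown[]): { ids: Set<string>; malformed: number } {
  const ids = new Set<string>();
  let malformed = 0;
  for (const row of rows) {
    const value = (row as { nodeId?: unknown } | null)?.nodeId;
    if (typeof value === 'string') ids.add(value);
    else malformed += 1;
  }
  return { ids, malformed };
}

// Build a log line so a sudden drop to zero skipped IDs is visible in serve output.
function describeSkip(result: { ids: Set<string>; malformed: number }): string {
  return `skipping ${result.ids.size} already-embedded node(s); ${result.malformed} malformed row(s)`;
}
```

If a driver change altered the row shape, the malformed count would jump while the skipped count fell to zero, which is easier to spot than a silent full re-embed.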

magyargergo · 2026-04-15T07:05:28Z

Thank you for your contribution!

… RC, group sync

- Take upstream splitRelCsvByLabelPair + tests (abhigyanpatwari#818/abhigyanpatwari#832); preserve fork closeLbugForPath and import evictPoolsForDbPath from pool-adapter.
- Fix nightly-refresh evictPools import path to ../core/lbug/pool-adapter.js.
- Includes abhigyanpatwari#818 drain fix, abhigyanpatwari#823 embeddings PK, abhigyanpatwari#825 RC workflow, abhigyanpatwari#827 manifest sync.

…s and vector-index SET restriction (abhigyanpatwari#823)

* fix(csv-generator): deduplicate all node types, not just File nodes

The pipeline can produce duplicate node IDs across all symbol types (Class, Method, Function, etc.). Only File nodes were guarded by a seenFileIds Set, leaving every other type unprotected. When the CSV was COPY'd into LadybugDB, duplicate PKs caused mass "Batch execution error: Found duplicated primary key value" warnings on gitnexus serve.

Replace the per-type seenFileIds with a single seenNodeIds Set checked at the top of the iteration loop, before the switch, so every label is covered by the same O(1) deduplication guard.

Fixes: abhigyanpatwari#822

* fix(embeddings): use MERGE instead of CREATE for CodeEmbedding inserts

CREATE fails with duplicate PK when a CodeEmbedding node already exists, which happens when:

- A PostToolUse hook triggers a concurrent gitnexus analyze during an active analyze run (git commits fire the hook)
- A partial prior run left some embeddings in the DB before a crash

Switching to MERGE makes the insert idempotent: existing embeddings are updated in place, new ones are created, no PK violations.

Fixes: abhigyanpatwari#822

* fix(server): skip already-embedded nodes in POST /api/embed to avoid vector-index SET error

Kuzu/LadybugDB forbids SET on a property that is part of a vector index. The /api/embed endpoint was calling runEmbeddingPipeline without skipNodeIds, causing it to attempt MERGE+SET on every node including those already embedded.

Fix: query existing CodeEmbedding nodeIds before running the pipeline and pass them as skipNodeIds so only new (unembedded) nodes are processed.

* fix(server): narrow catch to table-not-exist errors only in POST /api/embed

Bare catch{} would silently swallow connection errors and proceed to re-embed all nodes, hiding infrastructure issues. Now only swallows errors where the CodeEmbedding table does not yet exist.

* style: prettier format gitnexus/src/server/api.ts

* fix(server): log skip-embedding count and table-not-found swallow path

Addresses review feedback on PR abhigyanpatwari#823:

- Log count of already-embedded nodes when skipNodeIds is populated (aids debugging if Kuzu driver row shape changes).
- Log when the 'table does not exist' swallow path fires so ops can catch it if Kuzu ever changes error wording.
- Document the {} config positional argument with an inline comment referencing the runEmbeddingPipeline signature.

---------

Co-authored-by: jonasvanderhaegen-xve <>
Co-authored-by: Gergo Magyar <gergomagyar@icloud.com>
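For readers following the csv-generator commit, here is a minimal sketch of a single `seenNodeIds` guard applied before any per-label handling. It is not the actual csv-generator code: the `GraphNode` shape and the `dedupeNodes` helper are hypothetical names introduced only for illustration.

```typescript
// Hypothetical node shape; the real pipeline types are not shown in this PR.
interface GraphNode {
  id: string;
  label: string; // e.g. "File", "Class", "Method"
}

// Keep only the first occurrence of each node ID, whatever its label.
function dedupeNodes(nodes: GraphNode[]): GraphNode[] {
  const seenNodeIds = new Set<string>();
  const unique: GraphNode[] = [];
  for (const node of nodes) {
    // The guard runs before any per-label branching, so every label is covered.
    if (seenNodeIds.has(node.id)) continue;
    seenNodeIds.add(node.id);
    unique.push(node);
  }
  return unique;
}
```

Because the check sits ahead of the label switch, a node type added later would presumably be covered without needing its own guard.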

xkonjin reviewed Apr 14, 2026

jonasvanderhaegen-xve changed the title ~~fix(csv-generator): deduplicate all node types to prevent PK violations on COPY~~ fix(embeddings): prevent batch errors from CodeEmbedding PK violations and vector-index SET restriction Apr 14, 2026

fix(server): narrow catch to table-not-exist errors only in POST /api/embed

dd194d5

Bare catch{} would silently swallow connection errors and proceed to re-embed all nodes, hiding infrastructure issues. Now only swallows errors where the CodeEmbedding table does not yet exist.
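A rough sketch of the narrowed catch combined with the skipNodeIds lookup, for illustration only. The `QueryFn` type, the `isMissingTableError` and `collectSkipNodeIds` helpers, the row shape, and the exact "does not exist" wording are all assumptions, not the code in gitnexus/src/server/api.ts.

```typescript
// Hypothetical query function; the real DB adapter and its row shape may differ.
type QueryFn = (cypher: string) => Promise<Array<{ nodeId: string }>>;

// Assumed error wording: treat only "does not exist" messages as a missing table.
function isMissingTableError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return /does not exist/i.test(message);
}

// Collect the IDs of nodes that already have an embedding so the pipeline can skip them.
async function collectSkipNodeIds(runQuery: QueryFn): Promise<Set<string>> {
  try {
    const rows = await runQuery("MATCH (e:CodeEmbedding) RETURN e.nodeId AS nodeId");
    return new Set(rows.map((row) => row.nodeId));
  } catch (err) {
    // First run: no CodeEmbedding table yet, so there is nothing to skip.
    if (isMissingTableError(err)) return new Set<string>();
    // Anything else (e.g. a connection failure) is surfaced instead of swallowed.
    throw err;
  }
}
```

The returned set would then be handed to the embedding pipeline as skipNodeIds. Rethrowing non-matching errors is the point of the change: an unreachable database should fail the request rather than trigger a full re-embed.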

jonasvanderhaegen-xve commented Apr 14, 2026

xkonjin reviewed Apr 14, 2026

This comment was marked as spam.

xkonjin reviewed Apr 14, 2026

magyargergo added 2 commits April 15, 2026 07:38

style: prettier format gitnexus/src/server/api.ts

3384575

magyargergo merged commit c100577 into abhigyanpatwari:main Apr 15, 2026
13 checks passed

magyargergo mentioned this pull request Apr 15, 2026

POST /api/embed: stale vectors preserved on content edits and vector index can be missing after zero-node run #830

Closed

Copilot AI mentioned this pull request Apr 15, 2026

fix: content-hash staleness detection for embeddings and vector index creation on zero-node path #831

Merged

24 tasks

github-actions Bot commented Apr 14, 2026 • edited

Tracing the duplicate node ID sources

1. CodeEmbedding PK violations (confirmed — fixed by commit 8d38cc9)

2. Node table PK violations (fixed by this PR — seenNodeIds in csv-generator)

3 participants
