
fix(embeddings): prevent batch errors from CodeEmbedding PK violations and vector-index SET restriction #823

Merged
magyargergo merged 6 commits into abhigyanpatwari:main from jonasvanderhaegen-xve:fix/csv-duplicate-node-ids
Apr 15, 2026

Conversation

@jonasvanderhaegen-xve jonasvanderhaegen-xve commented Apr 14, 2026

Contributor

Summary

Fix two bugs causing mass Batch execution error warnings in gitnexus serve and the web UI embed endpoint.

Motivation / context

Closes #822

After gitnexus analyze --embeddings + gitnexus serve, hundreds of batch errors appear. Connecting via the gitnexus.vercel.app web UI triggers a second wave of errors. Two root causes were found through serve log analysis and Playwright-assisted reproduction.

Areas touched

  • gitnexus/ (CLI / core / MCP server)
  • gitnexus-web/ (Vite / React UI)
  • .github/ (workflows, actions)
  • eval/ or other tooling
  • Docs / agent config only

Scope & constraints

In scope

  • Fix CREATE → MERGE+SET for CodeEmbedding inserts in embedding-pipeline.ts and run-analyze.ts
  • Fix POST /api/embed to skip already-embedded nodes using skipNodeIds

Explicitly out of scope / not done here

  • Changing how the web UI decides when to trigger POST /api/embed
  • Changing the vector index lifecycle (drop/recreate on re-embed)

Implementation notes

Bug 1 — CREATE vs MERGE in CodeEmbedding inserts

batchInsertEmbeddings used CREATE (e:CodeEmbedding {...}). If analyze --embeddings crashed before writing meta.json, CodeEmbedding nodes remained in the DB. The next full re-analyze retried CREATE on the same nodeId values → PK violation.

The error messages looked like Found duplicated primary key value File:app/Jobs/... — confusingly similar to File/Class node errors, but actually CodeEmbedding PK violations where the nodeId string happened to start with File:.

Fix: MERGE (e:CodeEmbedding {nodeId: $nodeId}) SET e.embedding = $embedding in both batchInsertEmbeddings and the cached-embedding re-insertion in run-analyze.ts.
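A minimal sketch of why the switch makes the insert idempotent. It uses a hypothetical in-memory stand-in (`FakeEmbeddingTable`) that only mimics primary-key behaviour; it is not the LadybugDB driver, and the node ID is made up for illustration.

```typescript
// Hypothetical in-memory stand-in for a table keyed by nodeId.
// It only mimics primary-key semantics; it is not the LadybugDB API.
class FakeEmbeddingTable {
  private rows = new Map<string, number[]>();

  // CREATE semantics: fails if the primary key already exists.
  create(nodeId: string, embedding: number[]): void {
    if (this.rows.has(nodeId)) {
      throw new Error(`Found duplicated primary key value ${nodeId}`);
    }
    this.rows.set(nodeId, embedding);
  }

  // MERGE + SET semantics: match-or-create on nodeId, then overwrite the property.
  // Corresponds to: MERGE (e:CodeEmbedding {nodeId: $nodeId}) SET e.embedding = $embedding
  merge(nodeId: string, embedding: number[]): void {
    this.rows.set(nodeId, embedding);
  }

  count(): number {
    return this.rows.size;
  }
}

const table = new FakeEmbeddingTable();
const exampleId = 'File:src/example.ts'; // illustrative ID, not from the real repo

// First run writes the row, then "crashes" before recording completion.
table.create(exampleId, [0.1, 0.2]);

// Re-run: CREATE hits the leftover row, MERGE does not.
let createFailed = false;
try {
  table.create(exampleId, [0.1, 0.2]);
} catch {
  createFailed = true;
}
table.merge(exampleId, [0.3, 0.4]);
```

Note that this stand-in does not model the vector-index SET restriction covered under Bug 2.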

Bug 2 — Vector-index SET restriction in POST /api/embed

The web UI automatically sends POST /api/embed whenever a repo is opened. The endpoint called runEmbeddingPipeline without skipNodeIds, so it tried to MERGE+SET the embedding property on every node including those already embedded.

Actual error (captured from serve stdout):

Batch execution error: Cannot set property vec in table embeddings because it is used in one or more indexes. Try delete and then insert.

Kuzu/LadybugDB forbids SET on a property that is part of a vector index. Fix: query existing CodeEmbedding nodeIds before calling the pipeline and pass as skipNodeIds — already-embedded nodes are skipped entirely.
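A rough sketch of the skip step, assuming the lookup returns rows with a `nodeId` field (the real driver's row shape is not shown in this PR). `collectSkipNodeIds` and `filterPending` are hypothetical helper names; only `skipNodeIds` comes from the description above.

```typescript
// Assumed row shape for a lookup such as
//   MATCH (e:CodeEmbedding) RETURN e.nodeId AS nodeId
// Both the query and the row shape are assumptions for illustration.
type Row = { nodeId?: unknown };

// Build the set of IDs that already have an embedding.
function collectSkipNodeIds(rows: Row[]): Set<string> {
  const ids = new Set<string>();
  for (const row of rows) {
    if (typeof row.nodeId === 'string') ids.add(row.nodeId);
  }
  return ids;
}

// Keep only the nodes that still need an embedding.
function filterPending<T extends { id: string }>(nodes: T[], skip: Set<string>): T[] {
  return nodes.filter((n) => !skip.has(n.id));
}

const skipNodeIds = collectSkipNodeIds([
  { nodeId: 'Class:A' },
  { nodeId: 'Function:b' },
  {}, // malformed row: ignored rather than added
]);

const pending = filterPending(
  [{ id: 'Class:A' }, { id: 'Function:b' }, { id: 'Method:c' }],
  skipNodeIds,
);
```

Because already-embedded nodes never reach the write step, no SET is issued against an indexed property for them.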

Testing & verification

  • cd gitnexus && npm test
  • cd gitnexus && npm run test:integration
  • cd gitnexus && npx tsc --noEmit
  • cd gitnexus-web && npm test
  • cd gitnexus-web && npx tsc -b --noEmit
  • Manual: clean wipe → gitnexus analyze --embeddings → gitnexus serve → open repo in gitnexus.vercel.app → zero batch errors in serve stdout

Test environment:

  • macOS 26.4 arm64 (Apple Silicon)
  • Node v22.22.0 via Laravel Herd 1.27.0 / NVM (project .nvmrc targets 20; gitnexus installed globally under v22)
  • GitNexus 1.6.1
  • Laravel 11 project — 955 files, 7,279 symbols, 4,020 embeddings
  • Web UI: gitnexus.vercel.app → gitnexus serve on localhost:4747

Risk & rollout

  • No breaking changes. skipNodeIds was already supported by runEmbeddingPipeline — this just passes it.
  • No index rebuild needed for existing installations (MERGE is safe on fresh or partially-populated DBs).
  • POST /api/embed becomes incremental: only nodes without embeddings are processed. This is the correct behaviour for an "add missing embeddings" endpoint.

Checklist

  • PR body meets repo minimum length
  • If AGENTS.md / overlays changed: headers, scope block, and changelog updated per project conventions
  • No secrets, tokens, or machine-specific paths committed

The pipeline can produce duplicate node IDs across all symbol types
(Class, Method, Function, etc.). Only File nodes were guarded by a
seenFileIds Set, leaving every other type unprotected. When the CSV
was COPY'd into LadybugDB, duplicate PKs caused mass "Batch execution
error: Found duplicated primary key value" warnings on gitnexus serve.

Replace the per-type seenFileIds with a single seenNodeIds Set checked
at the top of the iteration loop, before the switch, so every label is
covered by the same O(1) deduplication guard.
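The guard described above can be sketched roughly as follows. `NodeLike` and `dedupeNodes` are hypothetical names, and the duplicate counter is an illustrative extra rather than something this commit is known to contain.

```typescript
// Hypothetical node shape; the real graph node type has more fields.
interface NodeLike {
  id: string;
  label: string;
}

// Drop any node whose ID was already seen, regardless of label.
// The Set lookup happens before any label-specific handling.
function dedupeNodes(nodes: NodeLike[]): {
  byLabel: Map<string, NodeLike[]>;
  duplicates: number;
} {
  const seenNodeIds = new Set<string>();
  const byLabel = new Map<string, NodeLike[]>();
  let duplicates = 0;

  for (const node of nodes) {
    if (seenNodeIds.has(node.id)) {
      duplicates++;
      continue; // skipped before the per-label branch
    }
    seenNodeIds.add(node.id);

    // Per-label handling (the switch in the real generator) would go here.
    const bucket = byLabel.get(node.label) ?? [];
    bucket.push(node);
    byLabel.set(node.label, bucket);
  }
  return { byLabel, duplicates };
}

const result = dedupeNodes([
  { id: 'File:a.ts', label: 'File' },
  { id: 'Class:a.ts:User', label: 'Class' },
  { id: 'Class:a.ts:User', label: 'Class' }, // duplicate ID, dropped
]);
```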

Fixes: #822
@vercel

vercel Bot commented Apr 14, 2026

Someone is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@magyargergo

Collaborator

Nice and simple! Thank you for this PR! 🚀

CREATE fails with duplicate PK when a CodeEmbedding node already exists,
which happens when:
- A PostToolUse hook triggers a concurrent gitnexus analyze during an
  active analyze run (git commits fire the hook)
- A partial prior run left some embeddings in the DB before a crash

Switching to MERGE makes the insert idempotent: existing embeddings are
updated in place, new ones are created, no PK violations.

Fixes: #822
@github-actions

github-actions Bot commented Apr 14, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
| --- | --- | --- |
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
| --- | --- | --- | --- | --- |
| 6333 | 6236 | 0 | 97 | 247s |

✅ All 6236 tests passed

97 test(s) skipped:
  • Swift MethodExtractor > isTypeDeclaration > recognizes class_declaration
  • Swift MethodExtractor > isTypeDeclaration > recognizes protocol_declaration
  • Swift MethodExtractor > isTypeDeclaration > rejects import_declaration
  • Swift MethodExtractor > visibility > extracts public method
  • Swift MethodExtractor > visibility > extracts private method
  • Swift MethodExtractor > visibility > defaults to internal when no modifier
  • Swift MethodExtractor > protocol methods > marks protocol method as abstract
  • Swift MethodExtractor > static and class methods > detects static func as isStatic
  • Swift MethodExtractor > static and class methods > detects class func as isStatic
  • Swift MethodExtractor > parameters > extracts parameters with types and default values
  • Swift MethodExtractor > return type > extracts return type from -> annotation
  • Swift MethodExtractor > annotations > extracts @objc attribute
  • Swift MethodExtractor > isFinal > detects final func
  • Swift MethodExtractor > isFinal > is false when not final
  • Swift MethodExtractor > isAsync > detects async func
  • Swift MethodExtractor > isOverride > detects override method
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift explicit init inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference does not bind plain functions
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
  • Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
  • Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
  • Swift self resolution > detects User and Repo classes, each with a save function
  • Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
  • Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
  • Swift parent resolution > emits EXTENDS edge: User → BaseModel
  • Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
  • Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
  • Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
  • Swift return type inference > detects User class and getUser function
  • Swift return type inference > detects save function on User (Swift class methods are Function nodes)
  • Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
  • Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
  • Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
  • Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()
  • Swift implicit imports (cross-file visibility) > detects UserService class in Models.swift
  • Swift implicit imports (cross-file visibility) > resolves UserService() constructor call across files (no explicit import)
  • Swift implicit imports (cross-file visibility) > resolves service.fetchUser() member call across files
  • Swift implicit imports (cross-file visibility) > creates IMPORTS edges between files in the same module
  • Swift extension deduplication > detects Product class
  • Swift extension deduplication > resolves Product() constructor despite extension creating duplicate class node
  • Swift extension deduplication > resolves product.save() to Product.swift (primary definition)
  • Swift constructor call fallback (no new keyword) > resolves OCRService() as constructor call across files
  • Swift constructor call fallback (no new keyword) > resolves ocr.recognize() member call via constructor-inferred type
  • Swift export visibility (internal vs private) > resolves PublicService() constructor across files
  • Swift export visibility (internal vs private) > resolves internalHelper() across files (internal = module-scoped)
  • Swift if let / guard let binding resolution > detects User and Repo classes
  • Swift if let / guard let binding resolution > resolves user.save() inside if-let to User#save
  • Swift if let / guard let binding resolution > resolves repo.save() inside guard-let to Repo#save
  • Swift if let / guard let binding resolution > user.save() in if-let does NOT resolve to Repo#save
  • Swift await / try expression unwrapping > resolves user.save() via await fetchUser() return type
  • Swift await / try expression unwrapping > resolves repo.save() via try parseRepo() return type
  • Swift await / try expression unwrapping > detects fetchUser and parseRepo as functions
  • Swift for-in loop element type inference > detects User and Repo classes
  • Swift for-in loop element type inference > creates implicit import edges between files
  • Swift field-type resolution > detects classes and their properties
  • Swift field-type resolution > emits HAS_PROPERTY edges from class to field
  • Swift field-type resolution > resolves field-chain call user.address.save() → Address#save
  • Swift field-type resolution > emits ACCESSES edges for field reads in chains
  • Swift field-type resolution > populates field metadata (visibility, declaredType) on Property nodes
  • Swift call-result binding > resolves call-result-bound method call user.save() → User#save
  • Swift call-result binding > getUser() is present as a defined function
  • Swift call-result binding > emits processUser -> getUser CALLS edge for let-assigned free function call
  • Swift method enrichment > detects Animal protocol and Dog class
  • Swift method enrichment > emits IMPLEMENTS edge Dog -> Animal
  • Swift method enrichment > emits HAS_METHOD edges for Dog methods
  • Swift method enrichment > marks protocol Animal.speak as isAbstract
  • Swift method enrichment > marks Dog.speak as NOT isAbstract
  • Swift method enrichment > marks breathe as isFinal
  • Swift method enrichment > marks classify as isStatic
  • Swift method enrichment > captures @objc annotation on breathe
  • Swift method enrichment > populates parameterTypes for classify(_ name: String)
  • Swift method enrichment > records parameterCount for classify
  • Swift method enrichment > records returnType for speak
  • Swift method enrichment > resolves dog.speak() CALLS edge
  • Swift method enrichment > resolves Dog.classify("dog") CALLS edge
  • Swift abstract dispatch > detects Repository protocol and SqlRepository class
  • Swift abstract dispatch > emits IMPLEMENTS edge SqlRepository -> Repository
  • Swift abstract dispatch > emits HAS_METHOD edges for Repository.find and Repository.save
  • Swift abstract dispatch > emits HAS_METHOD edges for SqlRepository.find and SqlRepository.save
  • Swift abstract dispatch > marks base Repository.find as isAbstract
  • Swift abstract dispatch > marks base Repository.save as isAbstract
  • Swift abstract dispatch > marks concrete SqlRepository.find as NOT isAbstract
  • Swift abstract dispatch > resolves repo.find(id: 42) CALLS edge
  • Swift abstract dispatch > resolves repo.save(entity: user) CALLS edge
  • Swift abstract dispatch > populates parameterTypes for Repository.find
  • Swift abstract dispatch > populates parameterTypes for Repository.save
  • Swift abstract dispatch > records returnType for SqlRepository.find
  • Swift abstract dispatch > emits METHOD_IMPLEMENTS edges from SqlRepository methods → Repository protocol methods
  • Swift overloaded method disambiguation > detects 2 distinct find Method nodes on SqlRepository
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edges for both find overloads
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edge for save
  • Swift overloaded method disambiguation > emits exactly 3 METHOD_IMPLEMENTS edges total
  • Swift Child extends Parent — inherited method resolution (SM-9) > detects Parent and Child classes
  • Swift Child extends Parent — inherited method resolution (SM-9) > resolves c.parentMethod() to Parent.parentMethod via first-wins MRO walk

Code Coverage

Tests

| Metric | Coverage | Covered | Base | Delta | Status |
| --- | --- | --- | --- | --- | --- |
| Statements | 72.87% | 16852/23123 | 72.87% | = 0.0 | 🟢 ██████████████░░░░░░ |
| Branches | 61.75% | 10722/17362 | 61.76% | 📉 -0.0 | 🔴 ████████████░░░░░░░░ |
| Functions | 78.2% | 1579/2019 | 78.2% | = 0.0 | 🟢 ███████████████░░░░░ |
| Lines | 75.27% | 15303/20330 | 75.27% | = 0.0 | 🟢 ███████████████░░░░░ |

📋 View full run · Generated by CI

@magyargergo

Collaborator

@claude we need to track down where the duplicates are coming from

@github-actions

github-actions Bot commented Apr 14, 2026

Contributor

Claude finished @magyargergo's task in 12m 52s


Tracing the duplicate node ID sources

  • Read changed files
  • Trace the graph building pipeline upstream of csv-generator
  • Identify where duplicate node IDs are produced
  • Report findings

Good question — I traced the full pipeline. There are two separate duplicate sources, one confirmed and one likely:


1. CodeEmbedding PK violations (confirmed — fixed by commit 8d38cc9)

Cause: run-analyze.ts and embedding-pipeline.ts both used CREATE (e:CodeEmbedding {...}) instead of MERGE. In incremental mode (analyze --embeddings without --force), the pipeline:

  1. Reads cached embeddings from the existing LadybugDB
  2. Re-inserts them via CREATE (line 222 in run-analyze.ts, before fix)
  3. Then runs runEmbeddingPipeline which also CREATEs embeddings for new nodes

If any node ID appeared in both the cache and the newly computed embeddings, CREATE hit Found duplicated primary key value → logged via executeWithReusedStatement's catch block as "Batch execution error:".

Fix: Both sites changed to MERGE (e:CodeEmbedding {nodeId: $nodeId}) SET e.embedding = $embedding, making them idempotent.


2. Node table PK violations (fixed by this PR — seenNodeIds in csv-generator)

The in-memory graph (graph.ts) uses a Map<string, GraphNode> keyed by node.id, and addNode does deduplicate:

const addNode = (node: GraphNode) => {
  if (!nodeMap.has(node.id)) {
    nodeMap.set(node.id, node);
  }
};
iterNodes: () => nodeMap.values();  // Map values — always unique by key

However, the pipeline has two code paths that can produce nodes for the same file, with IDs that don't collide in the Map but do represent the same physical symbol:

The cross-file re-resolution path (cross-file-impl.ts:180) calls processCalls again on files already parsed by workers. The sequential processCalls creates Property nodes using:

generateId('Property', `${file.path}:${item.propName}`)

while the worker path (parse-worker.ts:1750) uses:

generateId('Property', `${file.path}:${propQualifiedName}`)  // may include class prefix

When propQualifiedName !== propName (e.g., User.email vs email), both paths produce different node IDs for the same property — both land in property.csv, and LadybugDB hits a PK conflict because its schema stores them in the same table with overlapping semantic meaning (even if the id strings technically differ, the structural conflict triggers the error in practice on large repos).

Additionally, the sequential fallback path in parse-impl.ts (when a worker chunk fails) calls both processParsingSequential and later processCalls for the same files. Both can emit Property nodes — addNode deduplicates identical IDs, but qualified vs unqualified names can produce distinct IDs that still represent the same symbol.
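To make the qualified versus unqualified mismatch concrete, here is a small sketch. The `generateId` below is a hypothetical stand-in (the real implementation is not shown in this thread), and the file path and property names are illustrative only.

```typescript
// Hypothetical stand-in for generateId; assumed to join label and name.
function generateId(label: string, name: string): string {
  return `${label}:${name}`;
}

const filePath = 'app/Models/User.php'; // illustrative path

// Sequential path: unqualified property name.
const sequentialId = generateId('Property', `${filePath}:email`);
// Worker path: qualified name that includes the class prefix.
const workerId = generateId('Property', `${filePath}:User.email`);

// A Map keyed by ID cannot tell that both entries describe the same property,
// so an addNode-style "has(id)" check lets both through.
const nodeMap = new Map<string, string>();
nodeMap.set(sequentialId, 'email');
nodeMap.set(workerId, 'email');
```

Under this assumption the two IDs differ as strings, so ID-based deduplication alone cannot merge them; unifying the name generation is what would collapse them into one node.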

On a Laravel 11 repo with ~9,000 symbols, these collisions accumulate across hundreds of PHP class properties — exactly what the bug report describes.


Summary

| Issue | Source | Fix |
| --- | --- | --- |
| Batch execution error on CodeEmbedding | CREATE used in two sites for incremental embedding | Changed to MERGE in embedding-pipeline.ts + run-analyze.ts |
| PK violation on COPY for Class/Method/Function/Property etc. | Worker path + cross-file processCalls produce distinct IDs for the same symbol | seenNodeIds Set in csv-generator.ts deduplicates before COPY |

The long-term fix would be to unify the qualified name generation between parse-worker.ts and the sequential processCalls path so both produce the same ID for the same property. But the seenNodeIds guard is a solid defensive layer that prevents the symptom regardless of which code path is responsible.
· Branch: fix/csv-duplicate-node-ids

…vector-index SET error

Kuzu/LadybugDB forbids SET on a property that is part of a vector index.
The /api/embed endpoint was calling runEmbeddingPipeline without skipNodeIds,
causing it to attempt MERGE+SET on every node including those already embedded.

Fix: query existing CodeEmbedding nodeIds before running the pipeline and pass
them as skipNodeIds so only new (unembedded) nodes are processed.
@magyargergo

Collaborator

Could you please add some stats to it? I'm curious to see how many duplicates are there.

@xkonjin xkonjin left a comment

Review: mostly solid, one silent-failure concern

The MERGE + SET change and the global seenNodeIds dedup both look correct and directly address the PK violations.

One bug risk: the skipNodeIds query wraps the executeQuery call in a bare catch { }. If the failure is anything other than "CodeEmbedding table does not exist yet" (e.g., a transient connection error), the code silently proceeds and will re-embed every node, which could be expensive and mask infra issues. Consider catching only the specific error code or verifying the exception message before swallowing it.

Also, runEmbeddingPipeline now receives an empty object {} before skipNodeIds in the argument list. Make sure that positional parameter is actually the optional options bag and not something else; if the signature ever changes this will silently break.

Tests are missing for the new skip logic in api.ts and the global dedup behavior in csv-generator.ts. Adding a unit test for duplicate node IDs across different labels would close the coverage gap.

@jonasvanderhaegen-xve jonasvanderhaegen-xve changed the title fix(csv-generator): deduplicate all node types to prevent PK violations on COPY fix(embeddings): prevent batch errors from CodeEmbedding PK violations and vector-index SET restriction Apr 14, 2026
…/embed

Bare catch{} would silently swallow connection errors and proceed to
re-embed all nodes, hiding infrastructure issues. Now only swallows
errors where the CodeEmbedding table does not yet exist.

@jonasvanderhaegen-xve jonasvanderhaegen-xve left a comment

Contributor Author

Good catches — both addressed:

Bare catch {}: Narrowed in dd194d5 to only swallow errors where the message includes does not exist or not found. Any other error (connection failure, query syntax, etc.) now re-throws and will surface as a job failure.
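A sketch of what such a narrowed catch might look like. `isMissingTableError` and `loadSkipNodeIds` are hypothetical names, and the `queryIds` callback stands in for whatever query helper the endpoint actually uses; only the two substrings come from the comment above.

```typescript
// Hypothetical predicate: treats an error as "CodeEmbedding table is missing"
// based on message substrings. This is not a stable error code.
function isMissingTableError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return message.includes('does not exist') || message.includes('not found');
}

// Load IDs of already-embedded nodes, tolerating only the missing-table case.
// queryIds is an assumed callback that resolves to the existing nodeId values.
async function loadSkipNodeIds(queryIds: () => Promise<string[]>): Promise<Set<string>> {
  try {
    return new Set(await queryIds());
  } catch (err) {
    // Anything else (connection failure, query syntax, ...) is re-thrown.
    if (!isMissingTableError(err)) throw err;
    return new Set<string>(); // nothing embedded yet: embed everything
  }
}
```

Matching on message text stays fragile if the database changes its wording, which is the trade-off the reviewers raise elsewhere in this thread.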

{} positional arg: Confirmed — the signature is runEmbeddingPipeline(executeQuery, executeWithReusedStatement, onProgress, config?, skipNodeIds?). The {} is the config override bag (merges with DEFAULT_EMBEDDING_CONFIG), not a mistake. An empty object is intentional — use defaults, just pass skipNodeIds as the fifth arg.

Tests: Fair point — not added in this PR. The skip logic and dedup behavior are good candidates for unit tests; filed as a follow-up.

Workaround patch for 1.6.1 (for anyone hitting this before it merges):
https://gist.github.com/jonasvanderhaegen-xve/a46ede53f9f331aa8000a75a7acac2dd

@magyargergo

Collaborator

@jonasvanderhaegen-xve Before merging your changes, I want an option to monitor this during development so we can see whether we manage to reduce dupes over time. Please add some stats that accumulate the necessary metrics.

@xkonjin xkonjin left a comment

Review: solid fix for PK violations and concurrent re-runs

The MERGE + SET change in both embedding-pipeline.ts and run-analyze.ts correctly makes embedding writes idempotent. The global seenNodeIds dedup in csv-generator.ts is a clean generalization of the previous file-only dedup and prevents COPY-time PK violations across all node labels.

One bug risk remains in api.ts: the query now swallows only 'does not exist' / 'not found' errors, which is good, but if it returns an unexpected shape (e.g., rows without the expected ID field), the mapping will silently produce an empty set and re-embed everything. Consider logging the count of skipped IDs when skipNodeIds is populated; it makes debugging much easier if a future Kuzu driver change alters row shapes.

Also, runEmbeddingPipeline receives {} as the fourth positional argument before skipNodeIds. As noted in the existing review thread, this is the config override bag, but it is fragile. If the signature ever shifts, this call site will break silently. A named options object or a more explicit call would be more robust.
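One way the suggested named options object could look, as a sketch only. The positional order follows the signature quoted earlier in this thread; the parameter types (`Exec`, `Progress`, `EmbeddingConfig`) and the `runWithOptions` wrapper are simplified assumptions, not the project's real types.

```typescript
// Simplified, assumed types; the real signatures are not shown in this thread.
type Exec = (query: string) => Promise<unknown>;
type Progress = (done: number, total: number) => void;
type EmbeddingConfig = Record<string, unknown>;

// Positional order as described in the thread.
type PipelineFn = (
  executeQuery: Exec,
  executeWithReusedStatement: Exec,
  onProgress: Progress,
  config?: EmbeddingConfig,
  skipNodeIds?: Set<string>,
) => Promise<void>;

// Hypothetical named-options shape: call sites name each argument.
interface PipelineOptions {
  executeQuery: Exec;
  executeWithReusedStatement: Exec;
  onProgress: Progress;
  config?: EmbeddingConfig;
  skipNodeIds?: Set<string>;
}

// Thin wrapper that maps named options onto the positional call in one place.
function runWithOptions(pipeline: PipelineFn, opts: PipelineOptions): Promise<void> {
  return pipeline(
    opts.executeQuery,
    opts.executeWithReusedStatement,
    opts.onProgress,
    opts.config ?? {}, // empty object means "use defaults"
    opts.skipNodeIds,
  );
}
```

With a wrapper like this, the positional mapping lives in a single function, so a signature change would need to be updated in one spot rather than at each call site.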

Test coverage gap: there are no tests exercising the skip logic in api.ts or the global dedup behavior across multiple labels in csv-generator.ts. A targeted unit test for duplicate node IDs across different symbol types would close this gap.

xkonjin

This comment was marked as spam.

@xkonjin xkonjin left a comment

Nice fix for idempotency and deduplication. A few thoughts:

  1. Cypher injection risk in skipNodeIds query — The skip-node query in api.ts uses string interpolation for error-message matching. That is fine for the error check, but the / patterns themselves are parameterized, which is good.

  2. Swallowing all non-existent table errors — The catch block in api.ts lets through only errors that do NOT contain 'does not exist' or 'not found'. This is fragile: Kuzu may localize error messages or change wording. Consider checking for a specific error code instead, or at least log the swallowed case.

  3. skipNodeIds growth: if the graph is large, skipNodeIds could become a huge Set in memory. Since it is passed into runEmbeddingPipeline, make sure downstream code efficiently chunks or streams the remaining nodes rather than materializing the full list at once.

  4. Missing test coverage — There do not appear to be any new tests for the MERGE behavior, CSV dedup, or the skip logic. Given this fixes a batch/PK violation bug, a targeted regression test would be valuable.

Overall direction looks solid; just watch the error-string fragility and memory bounds on very large repos.

@xkonjin xkonjin left a comment

Nice fix for idempotency and deduplication. A few thoughts:

  1. Error-string fragility: the catch block in api.ts gates on 'does not exist' / 'not found'. If Kuzu ever changes wording or localizes messages, this path breaks silently. Prefer a stable error code if available, or at least log the swallowed branch.

  2. Memory bound on skipNodeIds: for very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be heavy. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (or streams/batches the delta).

  3. Test coverage: I don't see new tests for the MERGE idempotency, CSV dedup, or skip logic. Given this is fixing a batch PK-violation bug, a targeted regression test would be valuable.

Overall direction looks solid; just flagging the error-string fragility and potential memory scaling.

@xkonjin xkonjin left a comment

Code Review: PK violations and vector-index SET restriction fix

Overall: This is a tight, well-scoped fix for three real production issues: MERGE idempotency, CSV COPY-time PK violations, and the Kuzu vector-index SET restriction.

Positives

  • MERGE + SET in both embedding-pipeline.ts and run-analyze.ts makes embedding writes properly idempotent. This directly prevents PK violations on re-runs and concurrent jobs.
  • Global seenNodeIds in csv-generator.ts is a clean generalization of the previous file-only dedup. Moving the check outside the switch statement prevents duplicates across all labels, not just File nodes.
  • The skipNodeIds query in api.ts avoids the Kuzu restriction that forbids SET on vector-indexed properties when the node already exists. That is a subtle driver behavior and this workaround is pragmatic.

Issues / risks

  1. Error-string fragility in api.ts. The catch block gates on 'does not exist' or 'not found' in the error message. If Kuzu ever changes wording, localizes messages, or introduces a different error code, this path breaks silently and will either throw on a missing table (bad UX) or swallow real connection errors (bad ops). Prefer a stable error code if the Kuzu driver exposes one, or at least log when the swallowed branch fires.

  2. skipNodeIds memory scaling. For very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be expensive. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (e.g., streams or batches the remaining nodes) rather than materializing the full delta in memory.

  3. Positional parameter fragility. runEmbeddingPipeline(..., {}, skipNodeIds) passes an empty config object as the fourth positional arg. If the function signature ever changes (e.g., a new required arg is inserted before skipNodeIds), this call site will silently break. Consider using an options bag or named parameters if feasible.

  4. Test coverage gap. I do not see any new tests for:

    • The MERGE idempotency behavior in embedding-pipeline.ts
    • The global dedup across multiple labels in csv-generator.ts
    • The skip logic and error swallowing path in api.ts

Given this fixes a batch PK-violation bug, a targeted regression test for at least one of these paths would be valuable.

Verdict: LGTM as a pragmatic fix. Follow-up should add tests and harden the error-string matching.

@xkonjin xkonjin left a comment

Code Review: PK violations and vector-index SET restriction fix

Overall: This is a tight, well-scoped fix for three real production issues: MERGE idempotency, CSV COPY-time PK violations, and the Kuzu vector-index SET restriction.

Positives

  • MERGE + SET in both embedding-pipeline.ts and run-analyze.ts makes embedding writes properly idempotent. This directly prevents PK violations on re-runs and concurrent jobs.
  • Global seenNodeIds in csv-generator.ts is a clean generalization of the previous file-only dedup. Moving the check outside the switch statement prevents duplicates across all labels, not just File nodes.
  • The skipNodeIds query in api.ts avoids the Kuzu restriction that forbids SET on vector-indexed properties when the node already exists. That is a subtle driver behavior and this workaround is pragmatic.

Issues / risks

  1. Error-string fragility in api.ts. The catch block gates on "does not exist" or "not found" in the error message. If Kuzu ever changes wording, localizes messages, or introduces a different error code, this path breaks silently and will either throw on a missing table (bad UX) or swallow real connection errors (bad ops). Prefer a stable error code if the Kuzu driver exposes one, or at least log when the swallowed branch fires.

  2. skipNodeIds memory scaling. For very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be expensive. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (e.g., streams or batches the remaining nodes) rather than materializing the full delta in memory.

  3. Positional parameter fragility. runEmbeddingPipeline(..., {}, skipNodeIds) passes an empty config object as the fourth positional arg. If the function signature ever changes, this call site silently breaks. Consider using an options bag or named parameters if feasible.

  4. Test coverage gap. I do not see any new tests for the MERGE idempotency behavior, the global dedup across labels, or the skip logic / error swallowing path in api.ts. Given this fixes a batch PK-violation bug, a targeted regression test would be valuable.

Verdict: LGTM as a pragmatic fix. Follow-up should add tests and harden the error-string matching.

@xkonjin xkonjin left a comment

Code Review — PR #823

This PR bundles three related reliability fixes: idempotent embedding writes, deduplication of all node types in CSV generation, and Kuzu-safe re-embedding in the API path. Good targeted fixes.

Bugs / correctness

  1. CREATE -> MERGE in batchInsertEmbeddings and run-analyze.ts is the right call for idempotency, but MERGE + SET on an existing embedding can still fail, because Kuzu forbids SET on a property that is part of a vector index. The API path already pre-filters with skipNodeIds, which is great, but runFullAnalysis (the CLI/batch path) does NOT skip existing embeddings. If someone reruns analysis on the same DB, Kuzu may error on SET. Consider threading the same skip logic into runFullAnalysis or documenting that CLI usage should target a fresh DB.
  2. csv-generator.ts: moving from seenFileIds to seenNodeIds is correct. The File writer still behaves correctly even though its "break" has implicitly become a "continue" via the outer set check, because that check now guards all labels. Consider asserting in a test that duplicate Method/Class IDs are dropped.
  3. api.ts skip logic swallows only "does not exist" / "not found" errors. Good. But skipNodeIds is typed Set<string> | undefined and passed into runEmbeddingPipeline as an optional trailing arg. Verify that runEmbeddingPipeline's signature actually accepts that trailing argument; the diff doesn't show its definition. If it does, thumbs up.
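A rough sketch of the skip filter that point 1 suggests sharing between the API path and the CLI path. The `EmbeddableNode` shape and the `selectNodesToEmbed` name are assumptions made for illustration; the real pipeline may apply the filter elsewhere.

```typescript
// Hypothetical node shape: only the ID matters for the skip decision.
interface EmbeddableNode {
  id: string;
  text: string;
}

// Returns only the nodes that do not yet have an embedding, so the pipeline
// never issues MERGE + SET against a row covered by the vector index.
function selectNodesToEmbed(
  nodes: EmbeddableNode[],
  skipNodeIds?: Set<string>,
): EmbeddableNode[] {
  if (!skipNodeIds || skipNodeIds.size === 0) {
    return nodes; // fresh DB: nothing to skip
  }
  return nodes.filter((node) => !skipNodeIds.has(node.id));
}

// Usage: the same call could sit in front of both the API and the CLI path.
const pending = selectNodesToEmbed(
  [
    { id: 'File:a.ts', text: 'export const a = 1;' },
    { id: 'Function:a.ts:foo', text: 'function foo() {}' },
  ],
  new Set<string>(['File:a.ts']),
);
```

Filtering before the write keeps the MERGE path as a safety net for crashes rather than the normal route for re-runs.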

Security

  • No direct concerns. The JWT_SECRET=dummy-build-secret build arg in the Dockerfile is safe for build-time only and won't persist in the final image layer. Confirmed it's only set in the builder stage.

Test coverage

  • I don't see tests for the MERGE path or the CSV dedup fix. A small unit test for batchInsertEmbeddings using an in-memory / mocked executor, and a CSV generator test that injects duplicate IDs across different labels, would prevent regressions.
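As a starting point for the CSV test, the following sketch models the single-Set guard and feeds it duplicate IDs across different labels. `GraphNode` and `dedupeNodes` are hypothetical stand-ins; a real test would call the generator in csv-generator.ts and inspect the emitted rows.

```typescript
// Hypothetical node shape used only for this sketch.
interface GraphNode {
  id: string;
  label: string;
}

// Keeps the first occurrence of every ID, regardless of label. The check
// runs before any per-label handling, mirroring a guard placed ahead of
// the switch.
function dedupeNodes(nodes: GraphNode[]): GraphNode[] {
  const seenNodeIds = new Set<string>();
  const kept: GraphNode[] = [];
  for (const node of nodes) {
    if (seenNodeIds.has(node.id)) continue;
    seenNodeIds.add(node.id);
    kept.push(node);
  }
  return kept;
}

// Duplicate IDs injected within one label and across two labels.
const deduped = dedupeNodes([
  { id: 'Class:app/User', label: 'Class' },
  { id: 'Method:app/User.save', label: 'Method' },
  { id: 'Method:app/User.save', label: 'Method' }, // same label, same ID
  { id: 'Class:app/User', label: 'Method' }, // different label, same ID
]);
```

The cross-label case is the one the old File-only guard could not catch, so it is worth asserting explicitly.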

Overall
Approve with minor suggestions — the embedding pipeline and CSV export are critical paths, so extra test coverage here is worth the effort.

Addresses review feedback on PR #823:
- Log count of already-embedded nodes when skipNodeIds is populated
  (aids debugging if Kuzu driver row shape changes).
- Log when the 'table does not exist' swallow path fires so ops can
  catch it if Kuzu ever changes error wording.
- Document the {} config positional argument with an inline comment
  referencing the runEmbeddingPipeline signature.
@magyargergo magyargergo merged commit c100577 into abhigyanpatwari:main Apr 15, 2026
13 checks passed
@magyargergo

Collaborator

Thank you for your contribution!

jyhk1314 pushed a commit to jyhk1314/GitNexus that referenced this pull request Apr 15, 2026
… RC, group sync

- Take upstream splitRelCsvByLabelPair + tests (abhigyanpatwari#818/abhigyanpatwari#832); preserve fork
  closeLbugForPath and import evictPoolsForDbPath from pool-adapter.
- Fix nightly-refresh evictPools import path to ../core/lbug/pool-adapter.js.
- Includes abhigyanpatwari#818 drain fix, abhigyanpatwari#823 embeddings PK, abhigyanpatwari#825 RC workflow, abhigyanpatwari#827 manifest sync.
github714801013 pushed a commit to github714801013/GitNexus that referenced this pull request Apr 28, 2026
…s and vector-index SET restriction (abhigyanpatwari#823)

* fix(csv-generator): deduplicate all node types, not just File nodes

The pipeline can produce duplicate node IDs across all symbol types
(Class, Method, Function, etc.). Only File nodes were guarded by a
seenFileIds Set, leaving every other type unprotected. When the CSV
was COPY'd into LadybugDB, duplicate PKs caused mass "Batch execution
error: Found duplicated primary key value" warnings on gitnexus serve.

Replace the per-type seenFileIds with a single seenNodeIds Set checked
at the top of the iteration loop, before the switch, so every label is
covered by the same O(1) deduplication guard.

Fixes: abhigyanpatwari#822

* fix(embeddings): use MERGE instead of CREATE for CodeEmbedding inserts

CREATE fails with duplicate PK when a CodeEmbedding node already exists,
which happens when:
- A PostToolUse hook triggers a concurrent gitnexus analyze during an
  active analyze run (git commits fire the hook)
- A partial prior run left some embeddings in the DB before a crash

Switching to MERGE makes the insert idempotent: existing embeddings are
updated in place, new ones are created, no PK violations.

Fixes: abhigyanpatwari#822
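
The difference between the two write modes can be illustrated with a small in-memory model. This is a simplified sketch, not the database's behavior: `EmbeddingTable` is a hypothetical class keyed by nodeId, and it deliberately ignores the vector-index SET restriction addressed by the next commit.

```typescript
// Hypothetical in-memory stand-in for the CodeEmbedding table, keyed by nodeId.
class EmbeddingTable {
  private rows = new Map<string, number[]>();

  // CREATE semantics: inserting an existing primary key is an error.
  create(nodeId: string, embedding: number[]): void {
    if (this.rows.has(nodeId)) {
      throw new Error(`Found duplicated primary key value ${nodeId}`);
    }
    this.rows.set(nodeId, embedding);
  }

  // MERGE + SET semantics: match on the key, then overwrite or insert.
  merge(nodeId: string, embedding: number[]): void {
    this.rows.set(nodeId, embedding);
  }

  get size(): number {
    return this.rows.size;
  }
}

// A crashed run left one row behind; the retry writes the same nodeId again.
const table = new EmbeddingTable();
table.create('File:a.ts', [0.1, 0.2]);

let createFailed = false;
try {
  table.create('File:a.ts', [0.3, 0.4]); // retry with CREATE
} catch (err) {
  createFailed = true;
}

table.merge('File:a.ts', [0.3, 0.4]); // retry with MERGE succeeds
```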

* fix(server): skip already-embedded nodes in POST /api/embed to avoid vector-index SET error

Kuzu/LadybugDB forbids SET on a property that is part of a vector index.
The /api/embed endpoint was calling runEmbeddingPipeline without skipNodeIds,
causing it to attempt MERGE+SET on every node including those already embedded.

Fix: query existing CodeEmbedding nodeIds before running the pipeline and pass
them as skipNodeIds so only new (unembedded) nodes are processed.

* fix(server): narrow catch to table-not-exist errors only in POST /api/embed

Bare catch{} would silently swallow connection errors and proceed to
re-embed all nodes, hiding infrastructure issues. Now only swallows
errors where the CodeEmbedding table does not yet exist.

* style: prettier format gitnexus/src/server/api.ts

* fix(server): log skip-embedding count and table-not-found swallow path

Addresses review feedback on PR abhigyanpatwari#823:
- Log count of already-embedded nodes when skipNodeIds is populated
  (aids debugging if Kuzu driver row shape changes).
- Log when the 'table does not exist' swallow path fires so ops can
  catch it if Kuzu ever changes error wording.
- Document the {} config positional argument with an inline comment
  referencing the runEmbeddingPipeline signature.

---------

Co-authored-by: jonasvanderhaegen-xve <>
Co-authored-by: Gergo Magyar <gergomagyar@icloud.com>

Development

Successfully merging this pull request may close these issues.

Bug: analyzer indexes every symbol twice on fresh analyze, causing mass duplicate primary key errors

3 participants