fix(embeddings): prevent batch errors from CodeEmbedding PK violations and vector-index SET restriction#823
Conversation
The pipeline can produce duplicate node IDs across all symbol types (Class, Method, Function, etc.). Only File nodes were guarded by a seenFileIds Set, leaving every other type unprotected. When the CSV was COPY'd into LadybugDB, duplicate PKs caused mass "Batch execution error: Found duplicated primary key value" warnings on gitnexus serve. Replace the per-type seenFileIds with a single seenNodeIds Set checked at the top of the iteration loop, before the switch, so every label is covered by the same O(1) deduplication guard. Fixes: #822
|
Someone is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
|
Nice and simple! Thank you for this PR! 🚀 |
CREATE fails with duplicate PK when a CodeEmbedding node already exists, which happens when: - A PostToolUse hook triggers a concurrent gitnexus analyze during an active analyze run (git commits fire the hook) - A partial prior run left some embeddings in the DB before a crash Switching to MERGE makes the insert idempotent: existing embeddings are updated in place, new ones are created, no PK violations. Fixes: #822
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 6236 tests passed 97 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
|
@claude we need to track down where the duplicates are coming from |
|
Claude finished @magyargergo's task in 12m 52s —— View job Tracing the duplicate node ID sources
Good question — I traced the full pipeline. There are two separate duplicate sources, one confirmed and one likely: 1. CodeEmbedding PK violations (confirmed — fixed by commit
|
| Issue | Source | Fix |
|---|---|---|
Batch execution error on CodeEmbedding |
CREATE used in two sites for incremental embedding |
Changed to MERGE in embedding-pipeline.ts + run-analyze.ts |
PK violation on COPY for Class/Method/Function/Property etc. |
Worker path + cross-file processCalls produce distinct IDs for the same symbol |
seenNodeIds Set in csv-generator.ts deduplicates before COPY |
The long-term fix would be to unify the qualified name generation between parse-worker.ts and the sequential processCalls path so both produce the same ID for the same property. But the seenNodeIds guard is a solid defensive layer that prevents the symptom regardless of which code path is responsible.
· Branch: fix/csv-duplicate-node-ids
…vector-index SET error Kuzu/LadybugDB forbids SET on a property that is part of a vector index. The /api/embed endpoint was calling runEmbeddingPipeline without skipNodeIds, causing it to attempt MERGE+SET on every node including those already embedded. Fix: query existing CodeEmbedding nodeIds before running the pipeline and pass them as skipNodeIds so only new (unembedded) nodes are processed.
|
Could you please add some stats to it? I'm curious to see how many duplicates are there. |
xkonjin
left a comment
There was a problem hiding this comment.
Review: mostly solid, one silent-failure concern
The MERGE → SET change and the global seenNodeIds dedup both look correct and directly address the PK violations.
One bug risk: the skipNodeIds query wraps the executeQuery call in a bare catch { }. If the failure is anything other than "CodeEmbedding table does not exist yet" (e.g., a transient connection error), the code silently proceeds and will re-embed every node, which could be expensive and mask infra issues. Consider catching only the specific error code or verifying the exception message before swallowing it.
Also, runEmbeddingPipeline now receives an empty object {} before skipNodeIds in the argument list. Make sure that positional parameter is actually the optional options bag and not something else; if the signature ever changes this will silently break.
Tests are missing for the new skip logic in api.ts and the global dedup behavior in csv-generator.ts. Adding a unit test for duplicate node IDs across different labels would close the coverage gap.
…/embed
Bare catch{} would silently swallow connection errors and proceed to
re-embed all nodes, hiding infrastructure issues. Now only swallows
errors where the CodeEmbedding table does not yet exist.
jonasvanderhaegen-xve
left a comment
There was a problem hiding this comment.
Good catches — both addressed:
Bare catch {}: Narrowed in dd194d5 to only swallow errors where the message includes does not exist or not found. Any other error (connection failure, query syntax, etc.) now re-throws and will surface as a job failure.
{} positional arg: Confirmed — the signature is runEmbeddingPipeline(executeQuery, executeWithReusedStatement, onProgress, config?, skipNodeIds?). The {} is the config override bag (merges with DEFAULT_EMBEDDING_CONFIG), not a mistake. An empty object is intentional — use defaults, just pass skipNodeIds as the fifth arg.
Tests: Fair point — not added in this PR. The skip logic and dedup behavior are good candidates for unit tests; filed as a follow-up.
Workaround patch for 1.6.1 (for anyone hitting this before it merges):
https://gist.github.com/jonasvanderhaegen-xve/a46ede53f9f331aa8000a75a7acac2dd
|
@jonasvanderhaegen-xve Before mergin your changes in I want to have an option to monitor this when in development so we can see if we managed to reduce dupes over time. Please add some stats that accumulates necessary metrics. |
xkonjin
left a comment
There was a problem hiding this comment.
Review: solid fix for PK violations and concurrent re-runs
The MERGE → SET change in both and correctly makes embedding writes idempotent. The global dedup in is a clean generalization of the previous file-only dedup and prevents COPY-time PK violations across all node labels.
One bug risk remains in : the query now swallows only / errors, which is good, but if returns an unexpected shape (e.g., rows without ), the will silently produce an empty set and re-embed everything. Consider logging the count of skipped IDs when is populated — it makes debugging much easier if a future Kuzu driver change alters row shapes.
Also, receives as the fourth positional argument before . As noted in the existing review thread, this is the override bag, but it is fragile. If the signature ever shifts, this call site will break silently. A named options object or a more explicit call would be more robust.
Test coverage gap: there are no tests exercising the skip logic in or the global dedup behavior across multiple labels in . A targeted unit test for duplicate node IDs across different symbol types would close this gap.
xkonjin
left a comment
There was a problem hiding this comment.
Nice fix for idempotency and deduplication. A few thoughts:
-
Cypher injection risk in skipNodeIds query — The skip-node query in api.ts uses string interpolation for error-message matching. That is fine for the error check, but the / patterns themselves are parameterized, which is good.
-
Swallowing all non-existent table errors — The catch block in api.ts lets through only errors that do NOT contain 'does not exist' or 'not found'. This is fragile: Kuzu may localize error messages or change wording. Consider checking for a specific error code instead, or at least log the swallowed case.
-
skipNodeIds growth — If the graph is large, could become a huge Set in memory. Since it is passed into , make sure downstream code efficiently chunks or streams the remaining nodes rather than materializing the full list at once.
-
Missing test coverage — There do not appear to be any new tests for the MERGE behavior, CSV dedup, or the skip logic. Given this fixes a batch/PK violation bug, a targeted regression test would be valuable.
Overall direction looks solid; just watch the error-string fragility and memory bounds on very large repos.
xkonjin
left a comment
There was a problem hiding this comment.
Nice fix for idempotency and deduplication. A few thoughts:
-
Error-string fragility: the catch block in api.ts gates on 'does not exist' / 'not found'. If Kuzu ever changes wording or localizes messages, this path breaks silently. Prefer a stable error code if available, or at least log the swallowed branch.
-
Memory bound on skipNodeIds: for very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be heavy. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (or streams/batches the delta).
-
Test coverage: I don't see new tests for the MERGE idempotency, CSV dedup, or skip logic. Given this is fixing a batch PK-violation bug, a targeted regression test would be valuable.
Overall direction looks solid; just flagging the error-string fragility and potential memory scaling.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: PK violations and vector-index SET restriction fix
Overall: This is a tight, well-scoped fix for three real production issues: MERGE idempotency, CSV COPY-time PK violations, and the Kuzu vector-index SET restriction.
Positives
- MERGE + SET in both embedding-pipeline.ts and run-analyze.ts makes embedding writes properly idempotent. This directly prevents PK violations on re-runs and concurrent jobs.
- Global seenNodeIds in csv-generator.ts is a clean generalization of the previous file-only dedup. Moving the check outside the switch statement prevents duplicates across all labels, not just File nodes.
- The skipNodeIds query in api.ts avoids the Kuzu restriction that forbids SET on vector-indexed properties when the node already exists. That is a subtle driver behavior and this workaround is pragmatic.
Issues / risks
-
Error-string fragility in api.ts. The catch block gates on 'does not exist' or 'not found' in the error message. If Kuzu ever changes wording, localizes messages, or introduces a different error code, this path breaks silently and will either throw on a missing table (bad UX) or swallow real connection errors (bad ops). Prefer a stable error code if the Kuzu driver exposes one, or at least log when the swallowed branch fires.
-
skipNodeIds memory scaling. For very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be expensive. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (e.g., streams or batches the remaining nodes) rather than materializing the full delta in memory.
-
Positional parameter fragility. passes an empty config object as the fourth positional arg. If the function signature ever changes (e.g., a new required arg is inserted before skipNodeIds), this call site will silently break. Consider using an options bag or named parameters if feasible.
-
Test coverage gap. I do not see any new tests for:
- The MERGE idempotency behavior in embedding-pipeline.ts
- The global dedup across multiple labels in csv-generator.ts
- The skip logic and error swallowing path in api.ts
Given this fixes a batch PK-violation bug, a targeted regression test for at least one of these paths would be valuable.
Verdict: LGTM as a pragmatic fix. Follow-up should add tests and harden the error-string matching.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: PK violations and vector-index SET restriction fix
Overall: This is a tight, well-scoped fix for three real production issues: MERGE idempotency, CSV COPY-time PK violations, and the Kuzu vector-index SET restriction.
Positives
- MERGE + SET in both embedding-pipeline.ts and run-analyze.ts makes embedding writes properly idempotent. This directly prevents PK violations on re-runs and concurrent jobs.
- Global seenNodeIds in csv-generator.ts is a clean generalization of the previous file-only dedup. Moving the check outside the switch statement prevents duplicates across all labels, not just File nodes.
- The skipNodeIds query in api.ts avoids the Kuzu restriction that forbids SET on vector-indexed properties when the node already exists. That is a subtle driver behavior and this workaround is pragmatic.
Issues / risks
-
Error-string fragility in api.ts. The catch block gates on "does not exist" or "not found" in the error message. If Kuzu ever changes wording, localizes messages, or introduces a different error code, this path breaks silently and will either throw on a missing table (bad UX) or swallow real connection errors (bad ops). Prefer a stable error code if the Kuzu driver exposes one, or at least log when the swallowed branch fires.
-
skipNodeIds memory scaling. For very large graphs, building a full Set of existing node IDs in memory before running the pipeline could be expensive. Please confirm that runEmbeddingPipeline handles large skip lists efficiently (e.g., streams or batches the remaining nodes) rather than materializing the full delta in memory.
-
Positional parameter fragility.
runEmbeddingPipeline(..., {}, skipNodeIds)passes an empty config object as the fourth positional arg. If the function signature ever changes, this call site silently breaks. Consider using an options bag or named parameters if feasible. -
Test coverage gap. I do not see any new tests for the MERGE idempotency behavior, the global dedup across labels, or the skip logic / error swallowing path in api.ts. Given this fixes a batch PK-violation bug, a targeted regression test would be valuable.
Verdict: LGTM as a pragmatic fix. Follow-up should add tests and harden the error-string matching.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review — PR #823
This PR bundles three related reliability fixes: idempotent embedding writes, deduplication of all node types in CSV generation, and Kuzu-safe re-embedding in the API path. Good targeted fixes.
Bugs / correctness
- CREATE -> MERGE in batchInsertEmbeddings and run-analyze.ts is the right call for idempotency, but MERGE + SET on a vector property can still trigger Kuzu issues if a vector index exists. The API path already pre-filters skipNodeIds, which is great, but runFullAnalysis (the CLI/batch path) does NOT skip existing embeddings. If someone reruns analysis on the same DB, Kuzu may error on SET. Consider threading the same skip logic into runFullAnalysis or documenting that CLI usage should target a fresh DB.
- csv-generator.ts: moving from seenFileIds to seenNodeIds is correct. Make sure the File writer still behaves correctly now that "break" became "continue" implicitly via the outer set check — it does, because the outer check now guards all labels. Consider asserting in a test that duplicate Method/Class IDs are dropped.
- api.ts skip logic swallows only "does not exist" / "not found" errors. Good. But skipNodeIds is typed Set | undefined and passed into runEmbeddingPipeline as an optional trailing arg. Verify that runEmbeddingPipeline's signature actually accepts that 4th argument; the diff doesn't show its definition. If it does, thumbs up.
Security
- No direct concerns. The JWT_SECRET=dummy-build-secret build arg in the Dockerfile is safe for build-time only and won't persist in the final image layer. Confirmed it's only set in the builder stage.
Test coverage
- I don't see tests for the MERGE path or the CSV dedup fix. A small unit test for batchInsertEmbeddings using an in-memory / mocked executor, and a CSV generator test that injects duplicate IDs across different labels, would prevent regressions.
Overall
Approve with minor suggestions — the embedding pipeline and CSV export are critical paths, so extra test coverage here is worth the effort.
Addresses review feedback on PR #823: - Log count of already-embedded nodes when skipNodeIds is populated (aids debugging if Kuzu driver row shape changes). - Log when the 'table does not exist' swallow path fires so ops can catch it if Kuzu ever changes error wording. - Document the {} config positional argument with an inline comment referencing the runEmbeddingPipeline signature.
|
Thank you for your contribution! |
… RC, group sync - Take upstream splitRelCsvByLabelPair + tests (abhigyanpatwari#818/abhigyanpatwari#832); preserve fork closeLbugForPath and import evictPoolsForDbPath from pool-adapter. - Fix nightly-refresh evictPools import path to ../core/lbug/pool-adapter.js. - Includes abhigyanpatwari#818 drain fix, abhigyanpatwari#823 embeddings PK, abhigyanpatwari#825 RC workflow, abhigyanpatwari#827 manifest sync.
…s and vector-index SET restriction (abhigyanpatwari#823) * fix(csv-generator): deduplicate all node types, not just File nodes The pipeline can produce duplicate node IDs across all symbol types (Class, Method, Function, etc.). Only File nodes were guarded by a seenFileIds Set, leaving every other type unprotected. When the CSV was COPY'd into LadybugDB, duplicate PKs caused mass "Batch execution error: Found duplicated primary key value" warnings on gitnexus serve. Replace the per-type seenFileIds with a single seenNodeIds Set checked at the top of the iteration loop, before the switch, so every label is covered by the same O(1) deduplication guard. Fixes: abhigyanpatwari#822 * fix(embeddings): use MERGE instead of CREATE for CodeEmbedding inserts CREATE fails with duplicate PK when a CodeEmbedding node already exists, which happens when: - A PostToolUse hook triggers a concurrent gitnexus analyze during an active analyze run (git commits fire the hook) - A partial prior run left some embeddings in the DB before a crash Switching to MERGE makes the insert idempotent: existing embeddings are updated in place, new ones are created, no PK violations. Fixes: abhigyanpatwari#822 * fix(server): skip already-embedded nodes in POST /api/embed to avoid vector-index SET error Kuzu/LadybugDB forbids SET on a property that is part of a vector index. The /api/embed endpoint was calling runEmbeddingPipeline without skipNodeIds, causing it to attempt MERGE+SET on every node including those already embedded. Fix: query existing CodeEmbedding nodeIds before running the pipeline and pass them as skipNodeIds so only new (unembedded) nodes are processed. * fix(server): narrow catch to table-not-exist errors only in POST /api/embed Bare catch{} would silently swallow connection errors and proceed to re-embed all nodes, hiding infrastructure issues. Now only swallows errors where the CodeEmbedding table does not yet exist. * style: prettier format gitnexus/src/server/api.ts * fix(server): log skip-embedding count and table-not-found swallow path Addresses review feedback on PR abhigyanpatwari#823: - Log count of already-embedded nodes when skipNodeIds is populated (aids debugging if Kuzu driver row shape changes). - Log when the 'table does not exist' swallow path fires so ops can catch it if Kuzu ever changes error wording. - Document the {} config positional argument with an inline comment referencing the runEmbeddingPipeline signature. --------- Co-authored-by: jonasvanderhaegen-xve <> Co-authored-by: Gergo Magyar <gergomagyar@icloud.com>
Summary
Fix two bugs causing mass
Batch execution errorwarnings ingitnexus serveand the web UI embed endpoint.Motivation / context
Closes #822
After
gitnexus analyze --embeddings+gitnexus serve, hundreds of batch errors appear. Connecting via the gitnexus.vercel.app web UI triggers a second wave of errors. Two root causes were found through serve log analysis and Playwright-assisted reproduction.Areas touched
gitnexus/(CLI / core / MCP server)gitnexus-web/(Vite / React UI).github/(workflows, actions)eval/or other toolingScope & constraints
In scope
CREATE→MERGE+SETforCodeEmbeddinginserts inembedding-pipeline.tsandrun-analyze.tsPOST /api/embedto skip already-embedded nodes usingskipNodeIdsExplicitly out of scope / not done here
POST /api/embedImplementation notes
Bug 1 —
CREATEvsMERGEinCodeEmbeddinginsertsbatchInsertEmbeddingsusedCREATE (e:CodeEmbedding {...}). Ifanalyze --embeddingscrashed before writingmeta.json,CodeEmbeddingnodes remained in the DB. The next full re-analyze retriedCREATEon the samenodeIdvalues → PK violation.The error messages looked like
Found duplicated primary key value File:app/Jobs/...— confusingly similar to File/Class node errors, but actuallyCodeEmbeddingPK violations where thenodeIdstring happened to start withFile:.Fix:
MERGE (e:CodeEmbedding {nodeId: $nodeId}) SET e.embedding = $embeddingin bothbatchInsertEmbeddingsand the cached-embedding re-insertion inrun-analyze.ts.Bug 2 — Vector-index SET restriction in
POST /api/embedThe web UI automatically sends
POST /api/embedwhenever a repo is opened. The endpoint calledrunEmbeddingPipelinewithoutskipNodeIds, so it tried toMERGE+SETtheembeddingproperty on every node including those already embedded.Actual error (captured from serve stdout):
Kuzu/LadybugDB forbids
SETon a property that is part of a vector index. Fix: query existingCodeEmbeddingnodeIds before calling the pipeline and pass asskipNodeIds— already-embedded nodes are skipped entirely.Testing & verification
cd gitnexus && npm testcd gitnexus && npm run test:integrationcd gitnexus && npx tsc --noEmitcd gitnexus-web && npm testcd gitnexus-web && npx tsc -b --noEmitgitnexus analyze --embeddings→gitnexus serve→ open repo in gitnexus.vercel.app → zero batch errors in serve stdoutTest environment:
.nvmrctargets20; gitnexus installed globally under v22)gitnexus serveon localhost:4747Risk & rollout
skipNodeIdswas already supported byrunEmbeddingPipeline— this just passes it.POST /api/embedbecomes incremental: only nodes without embeddings are processed. This is the correct behaviour for an "add missing embeddings" endpoint.Checklist
AGENTS.md/ overlays changed: headers, scope block, and changelog updated per project conventions