
feat(analyze): preserve existing embeddings by default; --force regenerates them; add --drop-embeddings opt-out (CLI + HTTP API)#1055

Merged: magyargergo merged 4 commits into main from copilot/fix-embeddings-deletion-issue on Apr 24, 2026

Conversation

Copilot AI commented Apr 24, 2026

Contributor

gitnexus analyze (with or without --force) silently wiped every embedding from the index whenever --embeddings was omitted, throwing away minutes of work on every routine re-analyze (post-commit hooks, agent integrations, CI). The information needed to make this safe (stats.embeddings in meta.json) was already written but never read.

This PR implements proposal (1) from the issue: preserve on default, with --force upgraded to regenerate embeddings when the repo was already embedded, and an explicit --drop-embeddings opt-out wired through both the CLI and the HTTP API.

Behavior change

| Invocation | Before | After |
| --- | --- | --- |
| analyze (no flag) on a repo with embeddings | wipes all embeddings | preserves them (no generation) |
| analyze --force on a repo with embeddings | wipes all embeddings | preserves cache + regenerates / tops up new/changed nodes |
| analyze --force on a repo with no embeddings | wipes (nothing to wipe) | unchanged |
| analyze --embeddings | regenerates / tops up | unchanged |
| analyze --drop-embeddings | n/a | new explicit wipe |

Changes

  • gitnexus/src/core/embedding-mode.ts (new) — pure helper deriveEmbeddingMode(options, existingEmbeddingCount) that returns { shouldGenerateEmbeddings, preserveExistingEmbeddings, forceRegenerateEmbeddings, shouldLoadCache }. Lives in its own module (no native imports) so the branching contract can be unit-tested without LadybugDB / tree-sitter.
  • gitnexus/src/core/run-analyze.ts: adds dropEmbeddings to AnalyzeOptions and delegates flag derivation to deriveEmbeddingMode. The resolved shouldGenerateEmbeddings flag gates the Phase-4 generation pass, so a forced re-index of an embedded repo loads the cache, restores it, and regenerates for new/changed nodes instead of quietly downgrading to "preserve only". The cache load fires whenever the resolved load flag is set, regardless of --force (it was previously gated by !options.force, which is why the issue's --force repro also wiped). Each of the preserve / force-regenerate / explicit-drop branches logs an explicit line, and the cache-load catch now emits "Warning: could not load cached embeddings (<reason>). Embeddings will not be preserved on this run." so corrupt-DB / schema-mismatch failures are no longer indistinguishable from the original silent-data-loss bug.
  • gitnexus/src/cli/analyze.ts + cli/index.ts — new --drop-embeddings flag, plumbed through AnalyzeOptions. Worker IPC (analyze-worker.ts) forwards AnalyzeOptions as a plain object.
  • gitnexus/src/server/api.ts: POST /api/analyze now destructures dropEmbeddings from the request body and forwards it to the analyze worker as options.dropEmbeddings. Without this, the new escape hatch was dead code on the HTTP/web-UI path.
  • Docs
    • AGENTS.md "Keeping the Index Fresh" block (the old text wrongly said plain analyze "deletes existing vectors") and the three gitnexus-cli SKILL tables.
    • GUARDRAILS.md — non-negotiable Embeddings pipeline #5 rewritten to "plain analyze now preserves; --drop-embeddings is the explicit wipe", and the "Embeddings vanished after analyze" Sign updated to point at --drop-embeddings and the new cache-load warning as the only paths to zero.
  • Tests: gitnexus/test/unit/run-analyze.test.ts now contains 9 behavioral cases over deriveEmbeddingMode covering every (force × embeddings × dropEmbeddings × existingCount) combination, including default-preserve, --force regenerate, --force no-op when no embeddings exist, and --drop-embeddings suppressing cache load (also when combined with --force or --embeddings).
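
The flag derivation described in the list above can be illustrated with a minimal sketch. It is not the code from embedding-mode.ts: the function name deriveEmbeddingModeSketch, the two interface names, and the exact boolean expressions are assumptions reconstructed from the behavior table. In particular, how --drop-embeddings interacts with --force regeneration is assumed here (drop wins), since the PR text only states that it suppresses the cache load.

```typescript
// Hypothetical shapes: field names follow the PR description, but the real
// definitions live in gitnexus/src/core/embedding-mode.ts and may differ.
interface EmbeddingFlagOptions {
  force?: boolean;
  embeddings?: boolean;
  dropEmbeddings?: boolean;
}

interface EmbeddingMode {
  shouldGenerateEmbeddings: boolean;
  preserveExistingEmbeddings: boolean;
  forceRegenerateEmbeddings: boolean;
  shouldLoadCache: boolean;
}

// Sketch only: the boolean logic is inferred from the "Behavior change" table.
function deriveEmbeddingModeSketch(
  options: EmbeddingFlagOptions,
  existingEmbeddingCount: number,
): EmbeddingMode {
  const hasExisting = existingEmbeddingCount > 0;
  const drop = options.dropEmbeddings === true;
  const explicit = options.embeddings === true;
  const force = options.force === true;

  // --force on an already-embedded repo implies regeneration.
  // Assumption: an explicit drop cancels this implied regeneration.
  const forceRegenerateEmbeddings = force && hasExisting && !explicit && !drop;

  // Plain analyze on an embedded repo keeps vectors and generates nothing.
  const preserveExistingEmbeddings = hasExisting && !explicit && !force && !drop;

  return {
    shouldGenerateEmbeddings: explicit || forceRegenerateEmbeddings,
    preserveExistingEmbeddings,
    forceRegenerateEmbeddings,
    // --drop-embeddings suppresses the cache load whatever else is set.
    shouldLoadCache: hasExisting && !drop,
  };
}

// Print the resolved mode for the main rows of the behavior table.
console.log(deriveEmbeddingModeSketch({}, 120));
console.log(deriveEmbeddingModeSketch({ force: true }, 120));
console.log(deriveEmbeddingModeSketch({ force: true }, 0));
console.log(deriveEmbeddingModeSketch({ dropEmbeddings: true, force: true }, 120));
```

Keeping the helper free of database or parser imports is what lets each row of the table be asserted directly, which is the design choice the first bullet describes.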

Notes for reviewers

  • I dropped the !options.force guard from the cache-load deliberately. --force retains its original meaning of "re-run the pipeline"; on top of that, when the repo was already embedded it now also re-runs the embedding pass (per maintainer feedback) instead of either wiping or silently preserving. Repos that never had embeddings see no change in --force behavior.
  • The dimensions-mismatch branch in phase 3.5 still discards the cache cleanly, so a model swap continues to behave correctly — it just means a default analyze after a model change leaves the repo with zero embeddings until --embeddings (or --force) is passed, which matches the prior behavior in that edge case.
  • Validation: npx tsc --noEmit shows no new errors on the changed files (pre-existing gitnexus-shared workspace-resolution errors are unrelated sandbox noise). vitest itself can't boot in the sandbox because the global setup requires the LadybugDB N-API binding and the sandbox npm install can't complete the native postinstall; the new pure deriveEmbeddingMode cases were verified locally with tsx (9 passed, 0 failed). Relying on CI for the full suite.
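
The dimensions-mismatch note above can be made concrete with a small sketch. The PR does not show how phase 3.5 detects a mismatch, so everything here is hypothetical: the CachedEmbedding shape, the usableCache name, and the all-or-nothing rule that any vector of the wrong length discards the whole cache.

```typescript
// Hypothetical cache entry; the real cached shape is not shown in this PR.
interface CachedEmbedding {
  nodeId: string;
  embedding: number[];
}

// Assumption: a single vector with the wrong length means the cache came from
// a different model, so the entire cache is dropped rather than filtered.
function usableCache(
  cached: CachedEmbedding[],
  expectedDimensions: number,
): CachedEmbedding[] {
  const mismatch = cached.some((e) => e.embedding.length !== expectedDimensions);
  return mismatch ? [] : cached;
}

const cache: CachedEmbedding[] = [
  { nodeId: "fn:a", embedding: [0.1, 0.2, 0.3] },
  { nodeId: "fn:b", embedding: [0.4, 0.5, 0.6] },
];

console.log(usableCache(cache, 3).length); // same model: cache kept
console.log(usableCache(cache, 4).length); // model swap: cache discarded
```

Under this assumption, a default analyze after a model swap restores nothing, which matches the edge case described in the note: the repo stays at zero embeddings until --embeddings or --force is passed.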

@vercel

vercel Bot commented Apr 24, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| gitnexus | Ready | Preview, Comment | Apr 24, 2026 8:07am |


…eddings opt-out

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/da1da041-afcd-4d38-8a2f-39ca52a462ff

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix embedding deletion in gitnexus analyze command" to "analyze: preserve existing embeddings by default; add --drop-embeddings opt-out" Apr 24, 2026
Copilot AI requested a review from magyargergo April 24, 2026 07:36
@magyargergo

Collaborator

@copilot if there are embeddings and user is using the --force option, we must re-calculate the embeddings too but only if they were present.

…e+top-up)

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/e2759765-b8f6-453a-8c28-595439d23cb4

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>

Copilot AI commented Apr 24, 2026

Contributor Author

@copilot if there are embeddings and user is using the --force option, we must re-calculate the embeddings too but only if they were present.

Done in bafd3df. --force now auto-implies --embeddings when the existing index already has embeddings (existingMeta.stats.embeddings > 0): the cache is loaded and restored, and the generation phase runs to top up new/changed nodes. If the repo had no embeddings to begin with, --force keeps its old behavior. The run logs "--force on a repo with N existing embeddings: regenerating embeddings for new/changed nodes." so the behavior is observable.

Copilot AI changed the title from "analyze: preserve existing embeddings by default; add --drop-embeddings opt-out" to "analyze: preserve existing embeddings by default; --force regenerates them; add --drop-embeddings opt-out" Apr 24, 2026
@magyargergo magyargergo marked this pull request as ready for review April 24, 2026 07:50
@github-actions

github-actions Bot commented Apr 24, 2026

Contributor

Claude finished @magyargergo's task in 3m 39s


Production Readiness Review

Tasks:

  • Gather context (DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, ARCHITECTURE.md)
  • Read changed files
  • Analyze runtime paths and contracts
  • Produce review findings
  • Final verdict

Review bar for this change

Derived from DoD.md for this specific diff:

  • Correctness §2.1 — The fix must work end-to-end on all surfaces (CLI, MCP, HTTP API/worker); no dead-code paths or partial wiring.
  • Contracts §2.4: AnalyzeOptions gains dropEmbeddings; every consumer of that type must handle it or explicitly ignore it.
  • Persisted data §2.4 — Embeddings are a persisted artifact; the rollback/wipe path (--drop-embeddings) must be reachable by all callers.
  • Design §2.3 — Smallest correct solution; no speculative flags or logic branches.
  • Tests §2.7 — Tests must cover the real changed path; type assertions alone are insufficient; fixtures must be realistic enough for the risk.
  • Observability §2.8 — Silent data loss was the original bug; the fix must not introduce silent failure in the new code paths.
  • Docs/Contracts §2.4 — All docs that describe the old behavior must be updated; stale authoritative docs are a correctness risk for agents and contributors.
  • Agent guardrails §3: DoD.md §3 explicitly calls out "Embeddings preserved" as a guardrail gate. Any doc that instructs agents on embedding behavior must reflect the new contract.

Understanding of the change

What changed:

  • run-analyze.ts — core logic: derives preserveExistingEmbeddings, forceRegenerateEmbeddings, shouldGenerateEmbeddings from options, loads cache when needed, re-inserts cached embeddings after DB rebuild, and logs each branch explicitly.
  • cli/analyze.ts + cli/index.ts — new --drop-embeddings CLI flag added and plumbed through to runFullAnalysis.
  • AGENTS.md — "Keeping the Index Fresh" block updated.
  • Three SKILL.md files — updated tables.
  • gitnexus/test/unit/run-analyze.test.ts — type-level smoke test for dropEmbeddings.

What should remain unchanged: All existing --embeddings, --force behavior on repos with no embeddings. The --force flag retains its original meaning; behavior on repos with zero embeddings is identical.

Runtime paths affected:

  1. CLI → analyzeCommand → runFullAnalysis
  2. HTTP API → POST /api/analyze → analyze-worker.ts (fork) → runFullAnalysis

Contracts at risk: AnalyzeOptions interface (two copies: cli/analyze.ts and core/run-analyze.ts), HTTP API request body schema, IPC message options field.


Findings

[high] --drop-embeddings is silently ignored via the HTTP API / web UI path

  • Category: Partial wiring / contract drift

  • Files: gitnexus/src/server/api.ts:1148, 1342

  • Issue: The POST /api/analyze handler destructures only { url, path, force, embeddings } from req.body and passes options: { force: !!force, embeddings: !!embeddings } to the worker (line 1342). dropEmbeddings is never extracted from the body nor forwarded.

    The PR description states: "Worker IPC (analyze-worker.ts) forwards AnalyzeOptions as a plain object, so the HTTP-bridge path picks it up with no further wiring." This is incorrect. analyze-worker.ts does correctly accept and pass through any AnalyzeOptions object it receives — but api.ts constructs the options object itself and doesn't include dropEmbeddings.

    Positive-path behavior (preserve-by-default when dropEmbeddings is absent) works correctly through the HTTP path because undefined is falsy. But a caller using POST /api/analyze with { "dropEmbeddings": true } in the body will have that value silently ignored; embeddings will be preserved when the user intended to wipe them.

  • Why it matters here: The web UI is a first-class consumer of the HTTP API. Users who trigger re-analysis through the UI or via the REST API have no way to invoke the new escape hatch. The escape hatch was the correctness fix for legitimate wipe scenarios (e.g., model swap via API path).

  • Recommended fix:

    // api.ts line 1148
    const { url: repoUrl, path: repoLocalPath, force, embeddings, dropEmbeddings } = req.body;
    
    // api.ts line 1342
    options: { force: !!force, embeddings: !!embeddings, dropEmbeddings: !!dropEmbeddings },
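
One way to guard against this kind of drift is to build the worker options in a single typed function, so that a field missing from the mapping becomes a compile error. The sketch below is an illustration only: AnalyzeRequestBody, WorkerAnalyzeOptions, and toWorkerOptions are hypothetical names, and the real handler in api.ts may carry more fields.

```typescript
// Hypothetical request-body shape for POST /api/analyze; values arrive as
// untrusted JSON, so the flags are typed as unknown before coercion.
interface AnalyzeRequestBody {
  url?: string;
  path?: string;
  force?: unknown;
  embeddings?: unknown;
  dropEmbeddings?: unknown;
}

// Hypothetical worker options; every field is required, so forgetting one in
// toWorkerOptions fails type-checking instead of being silently dropped.
interface WorkerAnalyzeOptions {
  force: boolean;
  embeddings: boolean;
  dropEmbeddings: boolean;
}

function toWorkerOptions(body: AnalyzeRequestBody): WorkerAnalyzeOptions {
  return {
    force: !!body.force,
    embeddings: !!body.embeddings,
    dropEmbeddings: !!body.dropEmbeddings,
  };
}

// An absent flag is coerced to false, so preserve-by-default still holds.
console.log(toWorkerOptions({ path: "/repo" }));
// The explicit wipe request now reaches the worker.
console.log(toWorkerOptions({ path: "/repo", dropEmbeddings: true }));
```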



[high] GUARDRAILS.md still describes the old (now-incorrect) embedding behavior

  • Category: Stale authoritative documentation / contract drift

  • Files: GUARDRAILS.md (not changed in this PR)

  • Issue: Two places in GUARDRAILS.md directly contradict the new behavior:

    1. Non-negotiable Embeddings pipeline #5 (line 22): "Preserve embeddings — if .gitnexus/meta.json shows embeddings, use npx gitnexus analyze --embeddings; plain analyze drops them." — The second clause is now false: plain analyze preserves them.

    2. Sign "Embeddings vanished after analyze" (line ~38): "Why: Embedding generation is opt-in; analyze without the flag does not preserve prior vectors." — Also false post-PR.

    GUARDRAILS.md is a source-of-truth document that agents and contributors read to understand safe operational behavior. The stale text will cause agents to incorrectly add --embeddings when it is unnecessary and, more importantly, fail to recognize --drop-embeddings as the intentional wipe path.

  • Why it matters here: DoD.md §2.4 requires: "If user-visible behavior, public usage, CLI help, or README examples change, the relevant docs, examples, help text, or migration notes are updated in the same change." GUARDRAILS.md is not a README but is equally authoritative per DoD.md §1. The PR updated AGENTS.md but left GUARDRAILS.md contradicting the new behavior.

  • Recommended fix: Update GUARDRAILS.md non-negotiable Embeddings pipeline #5 and the "Embeddings vanished after analyze" Sign to reflect that plain analyze now preserves embeddings and --drop-embeddings is the explicit wipe path.


[medium] Cache-load failure is swallowed silently — silent data loss survives the fix

  • Category: Observability / correctness

  • Files: gitnexus/src/core/run-analyze.ts:224–237

  • Issue: The cache-load block:

    try {
      await initLbug(lbugPath);
      const cached = await loadCachedEmbeddings();
      cachedEmbeddingNodeIds = cached.embeddingNodeIds;
      cachedEmbeddings = cached.embeddings;
      await closeLbug();
    } catch {
      try { await closeLbug(); } catch { /* swallow */ }
    }

    Any failure (corrupt db, schema mismatch, native crash) causes cachedEmbeddings to stay empty with zero log output. The pipeline then continues, eventually producing embeddingCount = 0 in meta.json — the same observable symptom as the original bug. A user who sees embeddings disappear will not know whether the fix failed silently or they forgot --embeddings.

  • Why it matters here: The PR's primary motivation is to end silent embedding loss. Swallowing cache-load failures in the main non-error path re-introduces silent data loss through a different door. The original bug was already diagnosed via stats.embeddings; a failed cache-load is equally undetectable without this fix.

  • Recommended fix: Add a log() call in the outer catch:

    } catch (err: any) {
      log(`Warning: could not load cached embeddings (${err?.message ?? err}). Embeddings will not be preserved.`);
      try { await closeLbug(); } catch { /* swallow */ }
    }
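
The same idea can be written as a self-contained helper with injected dependencies, which makes the warning path testable without a real database. This is a sketch under assumptions: loadCacheOrWarn, CacheDeps, and the EmbeddingCache shape are hypothetical stand-ins for initLbug, loadCachedEmbeddings, closeLbug, and log.

```typescript
// Hypothetical cache shape, loosely modeled on the fields used in the review.
interface EmbeddingCache {
  embeddingNodeIds: Set<string>;
  embeddings: Array<{ nodeId: string; embedding: number[] }>;
}

// Injected stand-ins for the database open/load/close calls and the logger.
interface CacheDeps {
  open: () => Promise<void>;
  load: () => Promise<EmbeddingCache>;
  close: () => Promise<void>;
  log: (message: string) => void;
}

async function loadCacheOrWarn(deps: CacheDeps): Promise<EmbeddingCache> {
  const empty: EmbeddingCache = { embeddingNodeIds: new Set<string>(), embeddings: [] };
  try {
    await deps.open();
    return await deps.load();
  } catch (err: unknown) {
    // Surface the reason so a failed load is distinguishable from a wipe.
    const reason = err instanceof Error ? err.message : String(err);
    deps.log(
      `Warning: could not load cached embeddings (${reason}). Embeddings will not be preserved on this run.`,
    );
    return empty;
  } finally {
    // Close on every path; a close failure must not mask the original error.
    try {
      await deps.close();
    } catch {
      /* nothing useful to report here */
    }
  }
}
```

Called with a loader that throws, the helper returns an empty cache and logs exactly one warning containing the failure reason, which is the observable signal the finding asks for.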



[medium] Tests do not prove runtime behavior — they only assert the type compiles

  • Category: Test strength

  • Files: gitnexus/test/unit/run-analyze.test.ts

  • Issue: The single new test is:

    it('AnalyzeOptions accepts dropEmbeddings (compile-time check)', async () => {
      const opts: AnalyzeOptions = { embeddings: false, dropEmbeddings: true };
      expect(opts.dropEmbeddings).toBe(true);
      expect(typeof mod.runFullAnalysis).toBe('function');
    });

    This would pass even if the dropEmbeddings field were added to the type but never read in runFullAnalysis. It would not catch:

    • preserveExistingEmbeddings being computed incorrectly
    • forceRegenerateEmbeddings not triggering when --force + embeddings exist
    • The cache restore being skipped due to a logic inversion
    • dropEmbeddings: true failing to suppress cache-load (the branch at line 223 depends on !options.dropEmbeddings)

    DoD.md §2.7: "Tests cover the real changed path — they would fail if behavior, wiring, or contracts were broken." The test as written cannot fail on any of the above.

  • Why it matters here: This is a behavior change to a critical data-preservation path. The risk surface is the multi-flag logic in run-analyze.ts lines 189–238. Unit tests that mock loadCachedEmbeddings and spy on initLbug/closeLbug would be proportionate to the risk.

  • Recommended fix: Add at minimum three unit tests covering: (1) preserveExistingEmbeddings = true when no flags are passed and existingMeta.stats.embeddings > 0, (2) forceRegenerateEmbeddings = true when --force is set and existingMeta.stats.embeddings > 0, (3) dropEmbeddings: true suppresses cache-load. These can be pure logic tests over the flag derivation without needing a real DB.
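
As a rough illustration of what "pure logic tests over the flag derivation" could look like, the sketch below enumerates the flag grid and checks four invariants taken from the PR's behavior table. It assumes the derivation is available as a plain function; the names Flags, Mode, Derive, and checkEmbeddingInvariants are hypothetical, and the existing-count values 0 and 42 are arbitrary stand-ins for "no embeddings" and "some embeddings".

```typescript
interface Flags {
  force: boolean;
  embeddings: boolean;
  dropEmbeddings: boolean;
}

interface Mode {
  shouldGenerateEmbeddings: boolean;
  preserveExistingEmbeddings: boolean;
  forceRegenerateEmbeddings: boolean;
  shouldLoadCache: boolean;
}

// Any flag-derivation function with this shape can be checked.
type Derive = (flags: Flags, existingCount: number) => Mode;

// Returns one message per violated invariant; an empty array means all hold.
function checkEmbeddingInvariants(derive: Derive): string[] {
  const violations: string[] = [];
  const bools = [false, true];
  for (const force of bools) {
    for (const embeddings of bools) {
      for (const dropEmbeddings of bools) {
        for (const existingCount of [0, 42]) {
          const flags: Flags = { force, embeddings, dropEmbeddings };
          const mode = derive(flags, existingCount);
          const label = JSON.stringify({ ...flags, existingCount });
          const noFlags = !force && !embeddings && !dropEmbeddings;
          const forceOnly = force && !embeddings && !dropEmbeddings;

          if (dropEmbeddings && mode.shouldLoadCache) {
            violations.push(`${label}: --drop-embeddings must suppress cache load`);
          }
          if (noFlags && existingCount > 0 &&
              !(mode.preserveExistingEmbeddings && !mode.shouldGenerateEmbeddings)) {
            violations.push(`${label}: plain analyze must preserve without generating`);
          }
          if (forceOnly && existingCount > 0 &&
              !(mode.forceRegenerateEmbeddings && mode.shouldGenerateEmbeddings)) {
            violations.push(`${label}: --force on an embedded repo must regenerate`);
          }
          if (forceOnly && existingCount === 0 && mode.shouldGenerateEmbeddings) {
            violations.push(`${label}: --force without embeddings must not generate`);
          }
        }
      }
    }
  }
  return violations;
}
```

A check like this fails if the dropEmbeddings field is added to the type but never read, which is the gap the compile-time test could not catch.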


Definition of Done check

| DoD Item | Status |
| --- | --- |
| Behavior implemented end-to-end on real runtime path | partially satisfied: CLI path is complete; HTTP API path preserves-by-default correctly but dropEmbeddings escape hatch is not wired |
| Edge cases handled or documented as out of scope | partially satisfied: cache-load failure is silently swallowed with no observable signal |
| Contracts preserved or updated in all consumers | partially satisfied: AnalyzeOptions updated in CLI and core; api.ts HTTP schema not updated |
| Docs updated where user-visible behavior changed | partially satisfied: AGENTS.md updated; GUARDRAILS.md contradicts new behavior |
| Tests cover the real changed path | not satisfied: type-assertion only; no behavioral coverage |
| Logging is proportionate: no silent catches that swallow diagnostics | not satisfied: cache-load failure produces no output |
| Destructive operations are opt-in or guarded | satisfied: --drop-embeddings is explicit opt-in |
| Scope matches: no unrelated churn | satisfied |

Validation assessment

  • Evidence of tests: One new test in run-analyze.test.ts — type assertion only; no behavioral coverage of the new flag logic.
  • Evidence of typecheck/build: PR notes that tsc --noEmit and vitest could not be run locally (native tree-sitter build constraint). CI is the only validation gate.
  • Missing or unverifiable:
    • No evidence that --force + embeddings case was integration-tested end-to-end.
    • No evidence that dropEmbeddings: true actually suppresses cache loading vs. merely being passed through.
    • GUARDRAILS.md not changed; api.ts not in the changed-files list — gap was not caught before submission.

Final verdict

not production-ready

The primary behavior fix (preserve-by-default on the default analyze path) is correctly implemented in run-analyze.ts and works end-to-end for both CLI and HTTP paths. However, the --drop-embeddings escape hatch is dead code on the HTTP API path because api.ts does not extract or forward it from the request body — the PR description's claim that "the HTTP-bridge path picks it up with no further wiring" is demonstrably incorrect. GUARDRAILS.md non-negotiable #5 and the "Embeddings vanished after analyze" Sign now actively contradict the new behavior, which creates an agent correctness risk in a source-of-truth document that DoD.md requires to be kept accurate. Additionally, a silent catch {} on the cache-load re-introduces unlogged embedding loss through a new failure mode, and the single new test proves only that a type field exists, not that any of the new logic branches are correct. The two fixes needed are small and targeted; this should be blocked until they land.

…extract pure deriveEmbeddingMode + behavioral tests; sync GUARDRAILS.md

Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/7d88e595-cbd8-47b2-ba4f-fb5b9a60cda4

Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com>

Copilot AI commented Apr 24, 2026

Contributor Author

Claude finished @magyargergo's task in 3m 39s

Production Readiness Review

...

All four findings addressed in ba36230:

  1. [high] HTTP API --drop-embeddings wiring: api.ts now destructures dropEmbeddings from the request body and forwards it to the worker as options.dropEmbeddings. The PR description claim was wrong; thanks for catching it.
  2. [high] GUARDRAILS.md — Non-negotiable Embeddings pipeline #5 rewritten to "plain analyze now preserves; --drop-embeddings is the explicit wipe", and the "Embeddings vanished after analyze" Sign updated to point at --drop-embeddings / cache-load warnings as the only paths to zero.
  3. [medium] Silent cache-load swallow: the outer catch now emits "Warning: could not load cached embeddings (<reason>). Embeddings will not be preserved on this run." so corrupt-DB / schema-mismatch failures are no longer indistinguishable from the original bug.
  4. [medium] Tests prove behavior, not just types — extracted the flag derivation into pure src/core/embedding-mode.ts#deriveEmbeddingMode (re-exported from run-analyze.ts) and added 9 behavioral cases covering every (force × embeddings × dropEmbeddings × existingCount) combination, including the three you called out: default-preserve, --force regenerate, and --drop-embeddings suppressing cache load. Verified locally (9 passed, 0 failed); vitest itself can't run in the sandbox due to the LadybugDB native binding, so I'm relying on CI for the full suite.

Copilot AI changed the title from "analyze: preserve existing embeddings by default; --force regenerates them; add --drop-embeddings opt-out" to "analyze: preserve existing embeddings by default; --force regenerates them; add --drop-embeddings opt-out (CLI + HTTP API)" Apr 24, 2026
@magyargergo magyargergo changed the title from "analyze: preserve existing embeddings by default; --force regenerates them; add --drop-embeddings opt-out (CLI + HTTP API)" to "feat(analyze): preserve existing embeddings by default; --force regenerates them; add --drop-embeddings opt-out (CLI + HTTP API)" Apr 24, 2026
@magyargergo magyargergo merged commit 2b0392c into main Apr 24, 2026
28 of 29 checks passed
@magyargergo magyargergo deleted the copilot/fix-embeddings-deletion-issue branch April 24, 2026 12:07
Development

Successfully merging this pull request may close these issues.

analyze silently wipes existing embeddings when run without --embeddings

2 participants