diff --git a/.claude/skills/gitnexus/gitnexus-cli/SKILL.md b/.claude/skills/gitnexus/gitnexus-cli/SKILL.md index c9e0af341a..499bfb932a 100644 --- a/.claude/skills/gitnexus/gitnexus-cli/SKILL.md +++ b/.claude/skills/gitnexus/gitnexus-cli/SKILL.md @@ -17,10 +17,11 @@ npx gitnexus analyze Run from the project root. This parses all source files, builds the knowledge graph, writes it to `.gitnexus/`, and generates CLAUDE.md / AGENTS.md context files. -| Flag | Effect | -| -------------- | ---------------------------------------------------------------- | -| `--force` | Force full re-index even if up to date | -| `--embeddings` | Enable embedding generation for semantic search (off by default) | +| Flag | Effect | +| ------------------- | ------------------------------------------------------------------------------------------------------- | +| `--force` | Force full re-index even if up to date | +| `--embeddings` | Enable embedding generation for semantic search (off by default) | +| `--drop-embeddings` | Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` preserves them. | **When to run:** First time in a project, after major code changes, or when `gitnexus://repo/{name}/context` reports the index is stale. In Claude Code, a PostToolUse hook runs `analyze` automatically after `git commit` and `git merge`, preserving embeddings if previously generated. 
diff --git a/AGENTS.md b/AGENTS.md index 17ff788bd3..8e2fa4610e 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -148,11 +148,12 @@ Indexed as **GitNexus** (4325 symbols, 10556 relationships, 300 execution flows) ## Keeping the Index Fresh ```bash -npx gitnexus analyze # basic refresh -npx gitnexus analyze --embeddings # preserve embeddings +npx gitnexus analyze # basic refresh; preserves any existing embeddings +npx gitnexus analyze --embeddings # also generate embeddings for new/changed nodes +npx gitnexus analyze --drop-embeddings # explicit opt-in to wipe existing embeddings ``` -Check `.gitnexus/meta.json` `stats.embeddings` (0 = none). Running without `--embeddings` deletes existing vectors. +Check `.gitnexus/meta.json` `stats.embeddings` (0 = none). A plain `analyze` no longer drops existing vectors — pass `--drop-embeddings` to wipe. > Claude Code: PostToolUse hook handles this after `git commit` and `git merge`. diff --git a/GUARDRAILS.md b/GUARDRAILS.md index ac48ab906e..1cc0327591 100644 --- a/GUARDRAILS.md +++ b/GUARDRAILS.md @@ -19,7 +19,7 @@ Maintainer may widen scope per task. 2. **Never rename with find-and-replace** in GitNexus-indexed projects — use `rename` MCP tool with `dry_run: true` first, review `graph` vs `text_search` edits. No separate `gitnexus rename` CLI exists. 3. **Run impact analysis before editing shared symbols** — `impact` (upstream) for functions/classes/methods others call. Do not ignore HIGH/CRITICAL without maintainer sign-off. 4. **Run `detect_changes` before commit** — confirm diffs map to expected symbols/processes when the graph is available. -5. **Preserve embeddings** — if `.gitnexus/meta.json` shows embeddings, use `npx gitnexus analyze --embeddings`; plain `analyze` drops them. +5. **Preserve embeddings** — plain `npx gitnexus analyze` now preserves any embeddings recorded in `.gitnexus/meta.json` (the previous behavior wiped them). 
Use `--embeddings` to also generate vectors for new/changed nodes; use `--drop-embeddings` only when an explicit wipe is intended (e.g., model swap). --- @@ -36,8 +36,8 @@ Format: **Trigger → Instruction → Reason**. Append new Signs when the same m ### Embeddings vanished after analyze - **Trigger:** Semantic search quality drops; `stats.embeddings` in `meta.json` is 0 after refresh. -- **Do:** `npx gitnexus analyze --embeddings`, confirm `meta.json` reflects stored embeddings. -- **Why:** Embedding generation is opt-in; analyze without the flag does not preserve prior vectors. +- **Do:** Re-run `npx gitnexus analyze --embeddings` to regenerate. Check the analyze log for a `Warning: could not load cached embeddings` line — if present, the cache restore failed (corrupt DB / schema mismatch) and the rebuild had nothing to preserve. If you intentionally passed `--drop-embeddings`, this is expected. +- **Why:** Plain `analyze` preserves prior vectors by re-inserting them after the rebuild; the only ways to end up at zero are an explicit `--drop-embeddings`, a cache-load failure (now logged), or a model/dimension change that invalidates the cache. ### MCP lists no repos diff --git a/gitnexus-claude-plugin/skills/gitnexus-cli/SKILL.md b/gitnexus-claude-plugin/skills/gitnexus-cli/SKILL.md index 607aa8c4a6..1c38face40 100644 --- a/gitnexus-claude-plugin/skills/gitnexus-cli/SKILL.md +++ b/gitnexus-claude-plugin/skills/gitnexus-cli/SKILL.md @@ -21,6 +21,7 @@ Run from the project root. This parses all source files, builds the knowledge gr |------|--------| | `--force` | Force full re-index even if up to date | | `--embeddings` | Enable embedding generation for semantic search (off by default) | +| `--drop-embeddings` | Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` preserves them. | **When to run:** First time in a project, after major code changes, or when `gitnexus://repo/{name}/context` reports the index is stale. 
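Reviewer note: the GUARDRAILS and AGENTS.md hunks above both tell the reader to check `stats.embeddings` in `.gitnexus/meta.json` (0 = none). A minimal sketch of that check follows. It assumes only the `stats.embeddings` field named in the text; the `MetaLike` type and the `embeddingCountFromMeta` helper are hypothetical names used for illustration and are not part of this PR.

```typescript
// Hypothetical shape: only `stats.embeddings` is taken from the docs above.
// Any other fields that meta.json may carry are ignored here.
interface MetaLike {
  stats?: { embeddings?: number };
}

// Returns the recorded embedding count, treating a missing field as 0.
// This mirrors the `existingMeta?.stats?.embeddings ?? 0` expression in run-analyze.ts.
function embeddingCountFromMeta(metaJson: string): number {
  const meta = JSON.parse(metaJson) as MetaLike | null;
  return meta?.stats?.embeddings ?? 0;
}

// True when a plain `analyze` would have something to preserve.
function hasStoredEmbeddings(metaJson: string): boolean {
  return embeddingCountFromMeta(metaJson) > 0;
}
```

In practice the input string would come from reading `.gitnexus/meta.json` in the project root; the helper takes the text rather than a path so that it stays free of file-system dependencies.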
diff --git a/gitnexus/skills/gitnexus-cli.md b/gitnexus/skills/gitnexus-cli.md index c9e0af341a..a10104aefb 100644 --- a/gitnexus/skills/gitnexus-cli.md +++ b/gitnexus/skills/gitnexus-cli.md @@ -21,6 +21,7 @@ Run from the project root. This parses all source files, builds the knowledge gr | -------------- | ---------------------------------------------------------------- | | `--force` | Force full re-index even if up to date | | `--embeddings` | Enable embedding generation for semantic search (off by default) | +| `--drop-embeddings` | Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` preserves them. | **When to run:** First time in a project, after major code changes, or when `gitnexus://repo/{name}/context` reports the index is stale. In Claude Code, a PostToolUse hook runs `analyze` automatically after `git commit` and `git merge`, preserving embeddings if previously generated. diff --git a/gitnexus/src/cli/analyze.ts b/gitnexus/src/cli/analyze.ts index 7d76c88143..37a0ca5b37 100644 --- a/gitnexus/src/cli/analyze.ts +++ b/gitnexus/src/cli/analyze.ts @@ -56,6 +56,12 @@ function ensureHeap(): boolean { export interface AnalyzeOptions { force?: boolean; embeddings?: boolean; + /** + * Explicitly drop existing embeddings on rebuild instead of preserving + * them. Without this flag, a routine `analyze` keeps any embeddings + * already present in the index even when `--embeddings` is omitted. + */ + dropEmbeddings?: boolean; skills?: boolean; verbose?: boolean; /** Skip AGENTS.md and CLAUDE.md gitnexus block updates. */ @@ -226,6 +232,7 @@ export const analyzeCommand = async (inputPath?: string, options?: AnalyzeOption // collision guard (see allowDuplicateName below). 
force: options?.force || options?.skills, embeddings: options?.embeddings, + dropEmbeddings: options?.dropEmbeddings, skipGit: options?.skipGit, skipAgentsMd: options?.skipAgentsMd, noStats: options?.noStats, diff --git a/gitnexus/src/cli/index.ts b/gitnexus/src/cli/index.ts index beb2f47f21..41e48a7802 100644 --- a/gitnexus/src/cli/index.ts +++ b/gitnexus/src/cli/index.ts @@ -24,6 +24,11 @@ program .description('Index a repository (full analysis)') .option('-f, --force', 'Force full re-index even if up to date') .option('--embeddings', 'Enable embedding generation for semantic search (off by default)') + .option( + '--drop-embeddings', + 'Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` ' + + 'preserves any embeddings already present in the index.', + ) .option('--skills', 'Generate repo-specific skill files from detected communities') .option('--skip-agents-md', 'Skip updating the gitnexus section in AGENTS.md and CLAUDE.md') .option('--no-stats', 'Omit volatile file/symbol counts from AGENTS.md and CLAUDE.md') diff --git a/gitnexus/src/core/embedding-mode.ts b/gitnexus/src/core/embedding-mode.ts new file mode 100644 index 0000000000..7c5b5cf02c --- /dev/null +++ b/gitnexus/src/core/embedding-mode.ts @@ -0,0 +1,54 @@ +/** + * Pure derivation of the embedding-mode flags for `runFullAnalysis`. + * + * Lives in its own module (no native imports) so the branching contract can + * be unit-tested without spinning up LadybugDB, tree-sitter, or any of the + * other side-effecting dependencies pulled in by `run-analyze.ts`. 
+ * + * Semantics: + * --drop-embeddings -> wipe (skip cache load entirely) + * --embeddings -> load cache, restore, then generate + * --force + existing>0 -> load cache, restore, then generate (regenerate top-up) + * (default) + existing>0 -> preserve only (load + restore, no generation) + * (default) or --force + existing=0 -> no cache work, no preservation (--embeddings still attempts the cache load) + */ + +export interface EmbeddingModeInput { + force?: boolean; + embeddings?: boolean; + dropEmbeddings?: boolean; +} + +export interface EmbeddingMode { + /** True when phase 4 should run the embedding generation pipeline. */ + shouldGenerateEmbeddings: boolean; + /** True when we should load the cache to re-insert vectors after rebuild without generating new ones. */ + preserveExistingEmbeddings: boolean; + /** True when `--force` upgraded a default analyze into a regeneration because the repo was already embedded. */ + forceRegenerateEmbeddings: boolean; + /** True when we need to load cached embeddings from the existing DB before the rebuild. 
*/ + shouldLoadCache: boolean; +} + +export function deriveEmbeddingMode( + options: EmbeddingModeInput, + existingEmbeddingCount: number, +): EmbeddingMode { + const hasExisting = existingEmbeddingCount > 0; + const drop = !!options.dropEmbeddings; + const explicit = !!options.embeddings; + const force = !!options.force; + + const forceRegenerateEmbeddings = force && !explicit && !drop && hasExisting; + const preserveExistingEmbeddings = + !explicit && !drop && !forceRegenerateEmbeddings && hasExisting; + const shouldGenerateEmbeddings = explicit || forceRegenerateEmbeddings; + const shouldLoadCache = !drop && (shouldGenerateEmbeddings || preserveExistingEmbeddings); + + return { + shouldGenerateEmbeddings, + preserveExistingEmbeddings, + forceRegenerateEmbeddings, + shouldLoadCache, + }; +} diff --git a/gitnexus/src/core/run-analyze.ts b/gitnexus/src/core/run-analyze.ts index 00e0574acc..6198746e6c 100644 --- a/gitnexus/src/core/run-analyze.ts +++ b/gitnexus/src/core/run-analyze.ts @@ -53,6 +53,15 @@ export interface AnalyzeOptions { */ force?: boolean; embeddings?: boolean; + /** + * Explicitly drop any embeddings present in the existing index instead of + * preserving them. Only meaningful when `embeddings` is false/undefined: + * the default behavior in that case is to load the previously generated + * embeddings and re-insert them after the rebuild so a routine + * re-analyze does not silently wipe a long embedding pass (#issue: analyze + * silently wipes existing embeddings when run without --embeddings). + */ + dropEmbeddings?: boolean; skipGit?: boolean; /** Skip AGENTS.md and CLAUDE.md gitnexus block updates. */ skipAgentsMd?: boolean; @@ -94,6 +103,12 @@ export interface AnalyzeResult { /** Threshold: auto-skip embeddings for repos with more nodes than this */ const EMBEDDING_NODE_LIMIT = 50_000; +// Re-export the pure flag-derivation helper so external callers (and tests) +// keep importing from this module's stable surface. 
+export { deriveEmbeddingMode } from './embedding-mode.js'; +export type { EmbeddingMode } from './embedding-mode.js'; +import { deriveEmbeddingMode as _deriveEmbeddingMode } from './embedding-mode.js'; + export const PHASE_LABELS: Record = { extracting: 'Scanning files', structure: 'Building structure', @@ -160,10 +175,51 @@ export async function runFullAnalysis( } // ── Cache embeddings from existing index before rebuild ──────────── + // Four modes: + // --embeddings -> load cache, restore, then generate any new ones + // --force (with existing + // embeddings) -> auto-imply --embeddings: load cache, restore, + // regenerate embeddings for new/changed nodes + // (a forced re-index of an embedded repo + // shouldn't quietly downgrade to "preserve only") + // (default) -> if existing index has embeddings, preserve them + // (load + restore, but do not generate); otherwise no-op + // --drop-embeddings -> skip cache load entirely; rebuild wipes embeddings + // + // The default-preserve branch is what makes a routine `analyze` (e.g. a + // post-commit hook) safe: a multi-minute embedding pass is no longer + // silently dropped just because the caller omitted `--embeddings`. let cachedEmbeddingNodeIds = new Set(); let cachedEmbeddings: CachedEmbedding[] = []; - if (options.embeddings && existingMeta && !options.force) { + const existingEmbeddingCount = existingMeta?.stats?.embeddings ?? 0; + const { + forceRegenerateEmbeddings, + preserveExistingEmbeddings, + shouldGenerateEmbeddings, + shouldLoadCache, + } = _deriveEmbeddingMode(options, existingEmbeddingCount); + + if (options.dropEmbeddings && existingEmbeddingCount > 0) { + log( + `Dropping ${existingEmbeddingCount} existing embeddings (--drop-embeddings). ` + + `Re-run with --embeddings to regenerate.`, + ); + } else if (forceRegenerateEmbeddings) { + log( + `--force on a repo with ${existingEmbeddingCount} existing embeddings: ` + + `regenerating embeddings for new/changed nodes. 
` + + `Pass --drop-embeddings to wipe them instead.`, + ); + } else if (preserveExistingEmbeddings) { + log( + `Preserving ${existingEmbeddingCount} existing embeddings. ` + + `Pass --embeddings to also generate embeddings for new/changed nodes, ` + + `or --drop-embeddings to wipe them.`, + ); + } + + if (shouldLoadCache && existingMeta) { try { progress('embeddings', 0, 'Caching embeddings...'); await initLbug(lbugPath); @@ -171,7 +227,17 @@ export async function runFullAnalysis( cachedEmbeddingNodeIds = cached.embeddingNodeIds; cachedEmbeddings = cached.embeddings; await closeLbug(); - } catch { + } catch (err: any) { + // Surface cache-load failures explicitly: silently swallowing here would + // re-introduce the original silent-data-loss symptom (embeddings end up + // at 0 in meta.json with no diagnostic) through a different door. + log( + `Warning: could not load cached embeddings ` + + `(${err?.message ?? String(err)}). ` + + `Embeddings will not be preserved on this run.`, + ); + cachedEmbeddingNodeIds = new Set(); + cachedEmbeddings = []; try { await closeLbug(); } catch { @@ -253,7 +319,7 @@ export async function runFullAnalysis( const stats = await getLbugStats(); let embeddingSkipped = true; - if (options.embeddings) { + if (shouldGenerateEmbeddings) { if (stats.nodes <= EMBEDDING_NODE_LIMIT) { embeddingSkipped = false; } diff --git a/gitnexus/src/server/api.ts b/gitnexus/src/server/api.ts index 6cae3bbe11..9c82477d4f 100644 --- a/gitnexus/src/server/api.ts +++ b/gitnexus/src/server/api.ts @@ -1145,7 +1145,7 @@ export const createServer = async (port: number, host: string = '127.0.0.1') => // POST /api/analyze — start a new analysis job app.post('/api/analyze', async (req, res) => { try { - const { url: repoUrl, path: repoLocalPath, force, embeddings } = req.body; + const { url: repoUrl, path: repoLocalPath, force, embeddings, dropEmbeddings } = req.body; // Input type validation if (repoUrl !== undefined && typeof repoUrl !== 'string') { @@ -1339,7 
+1339,11 @@ export const createServer = async (port: number, host: string = '127.0.0.1') => child.send({ type: 'start', repoPath: targetPath, - options: { force: !!force, embeddings: !!embeddings }, + options: { + force: !!force, + embeddings: !!embeddings, + dropEmbeddings: !!dropEmbeddings, + }, }); }; diff --git a/gitnexus/test/unit/run-analyze.test.ts b/gitnexus/test/unit/run-analyze.test.ts index 4a1637192c..747fffe4a7 100644 --- a/gitnexus/test/unit/run-analyze.test.ts +++ b/gitnexus/test/unit/run-analyze.test.ts @@ -1,4 +1,5 @@ import { describe, it, expect } from 'vitest'; +import { deriveEmbeddingMode } from '../../src/core/embedding-mode.js'; describe('run-analyze module', () => { it('exports runFullAnalysis as a function', async () => { @@ -12,3 +13,83 @@ describe('run-analyze module', () => { expect(mod.PHASE_LABELS.parsing).toBe('Parsing code'); }); }); + +describe('deriveEmbeddingMode', () => { + // Default `analyze` on a repo with existing embeddings: must preserve, must + // NOT regenerate, must load the cache so phase 3.5 can re-insert vectors. + it('default + existing>0 → preserve only (load cache, no generation)', () => { + const m = deriveEmbeddingMode({}, 1234); + expect(m.preserveExistingEmbeddings).toBe(true); + expect(m.shouldGenerateEmbeddings).toBe(false); + expect(m.forceRegenerateEmbeddings).toBe(false); + expect(m.shouldLoadCache).toBe(true); + }); + + it('default + existing=0 → no-op (no preserve, no generation, no cache load)', () => { + const m = deriveEmbeddingMode({}, 0); + expect(m.preserveExistingEmbeddings).toBe(false); + expect(m.shouldGenerateEmbeddings).toBe(false); + expect(m.forceRegenerateEmbeddings).toBe(false); + expect(m.shouldLoadCache).toBe(false); + }); + + // The headline behavior change requested in PR feedback: --force on an + // already-embedded repo must regenerate (top up new/changed nodes), not + // silently downgrade to "preserve only". 
+ it('--force + existing>0 → forceRegenerate + generate + load cache', () => { + const m = deriveEmbeddingMode({ force: true }, 500); + expect(m.forceRegenerateEmbeddings).toBe(true); + expect(m.shouldGenerateEmbeddings).toBe(true); + expect(m.preserveExistingEmbeddings).toBe(false); + expect(m.shouldLoadCache).toBe(true); + }); + + it('--force + existing=0 → no embedding work (force keeps prior semantics)', () => { + const m = deriveEmbeddingMode({ force: true }, 0); + expect(m.forceRegenerateEmbeddings).toBe(false); + expect(m.shouldGenerateEmbeddings).toBe(false); + expect(m.preserveExistingEmbeddings).toBe(false); + expect(m.shouldLoadCache).toBe(false); + }); + + it('--embeddings → generate + load cache (incremental top-up)', () => { + const m = deriveEmbeddingMode({ embeddings: true }, 500); + expect(m.shouldGenerateEmbeddings).toBe(true); + expect(m.preserveExistingEmbeddings).toBe(false); + expect(m.shouldLoadCache).toBe(true); + }); + + it('--embeddings + existing=0 → generate; cache load still fires (harmless empty load)', () => { + const m = deriveEmbeddingMode({ embeddings: true }, 0); + expect(m.shouldGenerateEmbeddings).toBe(true); + // Cache load is gated at the call site by `existingMeta`, not by count; + // when explicit `--embeddings` is set we always attempt the load so any + // stray vectors from a partial prior run get picked up. + expect(m.shouldLoadCache).toBe(true); + }); + + // --drop-embeddings is the explicit wipe path; it must suppress cache load + // even when --force is also set (the dominant escape hatch). 
+ it('--drop-embeddings → suppresses cache load, no generation', () => { + const m = deriveEmbeddingMode({ dropEmbeddings: true }, 1234); + expect(m.shouldLoadCache).toBe(false); + expect(m.shouldGenerateEmbeddings).toBe(false); + expect(m.preserveExistingEmbeddings).toBe(false); + expect(m.forceRegenerateEmbeddings).toBe(false); + }); + + it('--force + --drop-embeddings → drop wins (no cache load, no generation)', () => { + const m = deriveEmbeddingMode({ force: true, dropEmbeddings: true }, 1234); + expect(m.shouldLoadCache).toBe(false); + expect(m.shouldGenerateEmbeddings).toBe(false); + expect(m.forceRegenerateEmbeddings).toBe(false); + }); + + it('--embeddings + --drop-embeddings → drop suppresses cache load (no preservation)', () => { + // --embeddings still generates, but the prior vectors are wiped first. + const m = deriveEmbeddingMode({ embeddings: true, dropEmbeddings: true }, 1234); + expect(m.shouldLoadCache).toBe(false); + expect(m.shouldGenerateEmbeddings).toBe(true); + expect(m.preserveExistingEmbeddings).toBe(false); + }); +});
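Reviewer note: the flag combinations exercised by the tests above can be summarized as a single label per combination. The sketch below copies the body of `deriveEmbeddingMode` from `embedding-mode.ts` in this diff so that it runs on its own; the `ModeLabel` type and the `describeMode` helper are hypothetical additions for illustration and do not exist in the PR.

```typescript
interface EmbeddingModeInput {
  force?: boolean;
  embeddings?: boolean;
  dropEmbeddings?: boolean;
}

interface EmbeddingMode {
  shouldGenerateEmbeddings: boolean;
  preserveExistingEmbeddings: boolean;
  forceRegenerateEmbeddings: boolean;
  shouldLoadCache: boolean;
}

// Copied from embedding-mode.ts in this diff (logic unchanged).
function deriveEmbeddingMode(
  options: EmbeddingModeInput,
  existingEmbeddingCount: number,
): EmbeddingMode {
  const hasExisting = existingEmbeddingCount > 0;
  const drop = !!options.dropEmbeddings;
  const explicit = !!options.embeddings;
  const force = !!options.force;

  const forceRegenerateEmbeddings = force && !explicit && !drop && hasExisting;
  const preserveExistingEmbeddings =
    !explicit && !drop && !forceRegenerateEmbeddings && hasExisting;
  const shouldGenerateEmbeddings = explicit || forceRegenerateEmbeddings;
  const shouldLoadCache = !drop && (shouldGenerateEmbeddings || preserveExistingEmbeddings);

  return {
    shouldGenerateEmbeddings,
    preserveExistingEmbeddings,
    forceRegenerateEmbeddings,
    shouldLoadCache,
  };
}

// Hypothetical helper (not in the PR): collapse the flags into one readable label.
type ModeLabel =
  | 'wipe-and-generate'
  | 'wipe'
  | 'force-regenerate'
  | 'generate'
  | 'preserve'
  | 'none';

function describeMode(options: EmbeddingModeInput, existing: number): ModeLabel {
  const m = deriveEmbeddingMode(options, existing);
  if (options.dropEmbeddings) {
    // Drop always skips the cache load; --embeddings may still generate afterwards.
    return m.shouldGenerateEmbeddings ? 'wipe-and-generate' : 'wipe';
  }
  if (m.forceRegenerateEmbeddings) return 'force-regenerate';
  if (m.shouldGenerateEmbeddings) return 'generate';
  if (m.preserveExistingEmbeddings) return 'preserve';
  return 'none';
}
```

Under these labels, `describeMode({}, 1234)` yields `'preserve'` and `describeMode({ force: true }, 0)` yields `'none'`, matching the corresponding test cases. One case worth noting for review is `{ embeddings: true }` with zero existing embeddings: it still sets `shouldLoadCache`, as the test comment explains.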