9 changes: 5 additions & 4 deletions .claude/skills/gitnexus/gitnexus-cli/SKILL.md
@@ -17,10 +17,11 @@ npx gitnexus analyze

Run from the project root. This parses all source files, builds the knowledge graph, writes it to `.gitnexus/`, and generates CLAUDE.md / AGENTS.md context files.

-| Flag | Effect |
-| -------------- | ---------------------------------------------------------------- |
-| `--force` | Force full re-index even if up to date |
-| `--embeddings` | Enable embedding generation for semantic search (off by default) |
+| Flag | Effect |
+| ------------------- | ------------------------------------------------------------------------------------------------------- |
+| `--force` | Force full re-index even if up to date |
+| `--embeddings` | Enable embedding generation for semantic search (off by default) |
+| `--drop-embeddings` | Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` preserves them. |

**When to run:** First time in a project, after major code changes, or when `gitnexus://repo/{name}/context` reports the index is stale. In Claude Code, a PostToolUse hook runs `analyze` automatically after `git commit` and `git merge`, preserving embeddings if previously generated.

7 changes: 4 additions & 3 deletions AGENTS.md
@@ -148,11 +148,12 @@ Indexed as **GitNexus** (4325 symbols, 10556 relationships, 300 execution flows)
## Keeping the Index Fresh

```bash
-npx gitnexus analyze              # basic refresh
-npx gitnexus analyze --embeddings # preserve embeddings
+npx gitnexus analyze                    # basic refresh; preserves any existing embeddings
+npx gitnexus analyze --embeddings       # also generate embeddings for new/changed nodes
+npx gitnexus analyze --drop-embeddings  # explicit opt-in to wipe existing embeddings
```

-Check `.gitnexus/meta.json` `stats.embeddings` (0 = none). Running without `--embeddings` deletes existing vectors.
+Check `.gitnexus/meta.json` `stats.embeddings` (0 = none). A plain `analyze` no longer drops existing vectors — pass `--drop-embeddings` to wipe.

> Claude Code: PostToolUse hook handles this after `git commit` and `git merge`.
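
The `stats.embeddings` check described above can be scripted. The following is a minimal sketch, assuming only that `meta.json` is a JSON object with a numeric `stats.embeddings` field; the helper name `embeddingCount` is hypothetical and not part of GitNexus.

```typescript
// Hypothetical helper, not part of GitNexus: extracts the embedding count
// from the text of `.gitnexus/meta.json`. Only the `stats.embeddings`
// field is assumed; a missing or non-numeric value is treated as 0 (none).
function embeddingCount(metaJsonText: string): number {
  const meta: unknown = JSON.parse(metaJsonText);
  if (typeof meta !== 'object' || meta === null) return 0;
  const stats = (meta as { stats?: unknown }).stats;
  if (typeof stats !== 'object' || stats === null) return 0;
  const count = (stats as { embeddings?: unknown }).embeddings;
  return typeof count === 'number' && Number.isFinite(count) ? count : 0;
}

// Inline sample for illustration; in practice the text would be read from
// `.gitnexus/meta.json` in the project root.
const sampleMeta = '{"stats": {"embeddings": 1234}}';
console.log(embeddingCount(sampleMeta) > 0 ? 'embeddings present' : 'no embeddings');
```

Comparing the count before and after a refresh is one way to confirm that a plain `analyze` kept the existing vectors.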

6 changes: 3 additions & 3 deletions GUARDRAILS.md
@@ -19,7 +19,7 @@ Maintainer may widen scope per task.
2. **Never rename with find-and-replace** in GitNexus-indexed projects — use `rename` MCP tool with `dry_run: true` first, review `graph` vs `text_search` edits. No separate `gitnexus rename` CLI exists.
3. **Run impact analysis before editing shared symbols** — `impact` (upstream) for functions/classes/methods others call. Do not ignore HIGH/CRITICAL without maintainer sign-off.
4. **Run `detect_changes` before commit** — confirm diffs map to expected symbols/processes when the graph is available.
-5. **Preserve embeddings** — if `.gitnexus/meta.json` shows embeddings, use `npx gitnexus analyze --embeddings`; plain `analyze` drops them.
+5. **Preserve embeddings** — plain `npx gitnexus analyze` now preserves any embeddings recorded in `.gitnexus/meta.json` (the previous behavior wiped them). Use `--embeddings` to also generate vectors for new/changed nodes; use `--drop-embeddings` only when an explicit wipe is intended (e.g., model swap).

---

@@ -36,8 +36,8 @@ Format: **Trigger → Instruction → Reason**. Append new Signs when the same m
### Embeddings vanished after analyze

- **Trigger:** Semantic search quality drops; `stats.embeddings` in `meta.json` is 0 after refresh.
-- **Do:** `npx gitnexus analyze --embeddings`, confirm `meta.json` reflects stored embeddings.
-- **Why:** Embedding generation is opt-in; analyze without the flag does not preserve prior vectors.
+- **Do:** Re-run `npx gitnexus analyze --embeddings` to regenerate. Check the analyze log for a `Warning: could not load cached embeddings` line — if present, the cache restore failed (corrupt DB / schema mismatch) and the rebuild had nothing to preserve. If you intentionally passed `--drop-embeddings`, this is expected.
+- **Why:** Plain `analyze` preserves prior vectors by re-inserting them after the rebuild; the only ways to end up at zero are an explicit `--drop-embeddings`, a cache-load failure (now logged), or a model/dimension change that invalidates the cache.
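
The triage in this Sign can be expressed as a small decision function. This is only a sketch: `diagnoseZeroEmbeddings` and its result labels are hypothetical names introduced for illustration, and the one string taken from the source is the warning prefix quoted in the **Do** step.

```typescript
// Hypothetical triage helper, not part of GitNexus. It encodes the three
// causes listed in the Sign for `stats.embeddings` ending up at 0.
type ZeroEmbeddingsCause =
  | 'explicit-drop'
  | 'cache-load-failure'
  | 'check-model-or-dimension-change';

// Warning prefix as quoted in the Sign above.
const CACHE_WARNING = 'Warning: could not load cached embeddings';

function diagnoseZeroEmbeddings(
  analyzeLog: string,
  passedDropEmbeddings: boolean,
): ZeroEmbeddingsCause {
  // An intentional wipe is expected behavior, so check it first.
  if (passedDropEmbeddings) return 'explicit-drop';
  // The cache restore failed and the rebuild had nothing to preserve.
  if (analyzeLog.includes(CACHE_WARNING)) return 'cache-load-failure';
  // Remaining documented cause: a model/dimension change invalidated the cache.
  return 'check-model-or-dimension-change';
}

console.log(diagnoseZeroEmbeddings(`${CACHE_WARNING} (schema mismatch).`, false));
```

In all three cases the remedy in the Sign is the same: re-run `npx gitnexus analyze --embeddings` to regenerate.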

### MCP lists no repos

1 change: 1 addition & 0 deletions gitnexus-claude-plugin/skills/gitnexus-cli/SKILL.md
@@ -21,6 +21,7 @@ Run from the project root. This parses all source files, builds the knowledge gr
|------|--------|
| `--force` | Force full re-index even if up to date |
| `--embeddings` | Enable embedding generation for semantic search (off by default) |
| `--drop-embeddings` | Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` preserves them. |

**When to run:** First time in a project, after major code changes, or when `gitnexus://repo/{name}/context` reports the index is stale.

1 change: 1 addition & 0 deletions gitnexus/skills/gitnexus-cli.md
@@ -21,6 +21,7 @@ Run from the project root. This parses all source files, builds the knowledge gr
| -------------- | ---------------------------------------------------------------- |
| `--force` | Force full re-index even if up to date |
| `--embeddings` | Enable embedding generation for semantic search (off by default) |
| `--drop-embeddings` | Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` preserves them. |

**When to run:** First time in a project, after major code changes, or when `gitnexus://repo/{name}/context` reports the index is stale. In Claude Code, a PostToolUse hook runs `analyze` automatically after `git commit` and `git merge`, preserving embeddings if previously generated.

7 changes: 7 additions & 0 deletions gitnexus/src/cli/analyze.ts
@@ -56,6 +56,12 @@ function ensureHeap(): boolean {
export interface AnalyzeOptions {
force?: boolean;
embeddings?: boolean;
/**
* Explicitly drop existing embeddings on rebuild instead of preserving
* them. Without this flag, a routine `analyze` keeps any embeddings
* already present in the index even when `--embeddings` is omitted.
*/
dropEmbeddings?: boolean;
skills?: boolean;
verbose?: boolean;
/** Skip AGENTS.md and CLAUDE.md gitnexus block updates. */
@@ -226,6 +232,7 @@ export const analyzeCommand = async (inputPath?: string, options?: AnalyzeOption
// collision guard (see allowDuplicateName below).
force: options?.force || options?.skills,
embeddings: options?.embeddings,
dropEmbeddings: options?.dropEmbeddings,
skipGit: options?.skipGit,
skipAgentsMd: options?.skipAgentsMd,
noStats: options?.noStats,
5 changes: 5 additions & 0 deletions gitnexus/src/cli/index.ts
@@ -24,6 +24,11 @@ program
.description('Index a repository (full analysis)')
.option('-f, --force', 'Force full re-index even if up to date')
.option('--embeddings', 'Enable embedding generation for semantic search (off by default)')
.option(
'--drop-embeddings',
'Drop existing embeddings on rebuild. By default, an `analyze` without `--embeddings` ' +
'preserves any embeddings already present in the index.',
)
.option('--skills', 'Generate repo-specific skill files from detected communities')
.option('--skip-agents-md', 'Skip updating the gitnexus section in AGENTS.md and CLAUDE.md')
.option('--no-stats', 'Omit volatile file/symbol counts from AGENTS.md and CLAUDE.md')
54 changes: 54 additions & 0 deletions gitnexus/src/core/embedding-mode.ts
@@ -0,0 +1,54 @@
/**
 * Pure derivation of the embedding-mode flags for `runFullAnalysis`.
 *
 * Lives in its own module (no native imports) so the branching contract can
 * be unit-tested without spinning up LadybugDB, tree-sitter, or any of the
 * other side-effecting dependencies pulled in by `run-analyze.ts`.
 *
 * Semantics:
 *   --drop-embeddings        -> wipe (skip cache load entirely)
 *   --embeddings             -> load cache, restore, then generate
 *   --force + existing>0     -> load cache, restore, then generate (regenerate top-up)
 *   (default) + existing>0   -> preserve only (load + restore, no generation)
 *   any path with existing=0 -> no cache work, no preservation
 */

export interface EmbeddingModeInput {
  force?: boolean;
  embeddings?: boolean;
  dropEmbeddings?: boolean;
}

export interface EmbeddingMode {
  /** True when phase 4 should run the embedding generation pipeline. */
  shouldGenerateEmbeddings: boolean;
  /** True when we should load the cache to re-insert vectors after rebuild without generating new ones. */
  preserveExistingEmbeddings: boolean;
  /** True when `--force` upgraded a default analyze into a regeneration because the repo was already embedded. */
  forceRegenerateEmbeddings: boolean;
  /** True when we need to load cached embeddings from the existing DB before the rebuild. */
  shouldLoadCache: boolean;
}

export function deriveEmbeddingMode(
  options: EmbeddingModeInput,
  existingEmbeddingCount: number,
): EmbeddingMode {
  const hasExisting = existingEmbeddingCount > 0;
  const drop = !!options.dropEmbeddings;
  const explicit = !!options.embeddings;
  const force = !!options.force;

  const forceRegenerateEmbeddings = force && !explicit && !drop && hasExisting;
  const preserveExistingEmbeddings =
    !explicit && !drop && !forceRegenerateEmbeddings && hasExisting;
  const shouldGenerateEmbeddings = explicit || forceRegenerateEmbeddings;
  const shouldLoadCache = !drop && (shouldGenerateEmbeddings || preserveExistingEmbeddings);

  return {
    shouldGenerateEmbeddings,
    preserveExistingEmbeddings,
    forceRegenerateEmbeddings,
    shouldLoadCache,
  };
}
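
To read the branching contract at a glance, the sketch below restates the derivation inline (so the snippet runs standalone) and maps each flag combination to a one-line summary. `describeMode` and its labels are hypothetical names added for illustration; the derivation itself mirrors `deriveEmbeddingMode` from this file.

```typescript
interface EmbeddingModeInput {
  force?: boolean;
  embeddings?: boolean;
  dropEmbeddings?: boolean;
}

interface EmbeddingMode {
  shouldGenerateEmbeddings: boolean;
  preserveExistingEmbeddings: boolean;
  forceRegenerateEmbeddings: boolean;
  shouldLoadCache: boolean;
}

// Restated from embedding-mode.ts so this example is self-contained.
function deriveEmbeddingMode(options: EmbeddingModeInput, existingEmbeddingCount: number): EmbeddingMode {
  const hasExisting = existingEmbeddingCount > 0;
  const drop = !!options.dropEmbeddings;
  const explicit = !!options.embeddings;
  const force = !!options.force;
  const forceRegenerateEmbeddings = force && !explicit && !drop && hasExisting;
  const preserveExistingEmbeddings = !explicit && !drop && !forceRegenerateEmbeddings && hasExisting;
  const shouldGenerateEmbeddings = explicit || forceRegenerateEmbeddings;
  const shouldLoadCache = !drop && (shouldGenerateEmbeddings || preserveExistingEmbeddings);
  return { shouldGenerateEmbeddings, preserveExistingEmbeddings, forceRegenerateEmbeddings, shouldLoadCache };
}

// Hypothetical helper: collapses the four booleans into a readable label.
function describeMode(options: EmbeddingModeInput, existing: number): string {
  const m = deriveEmbeddingMode(options, existing);
  if (m.shouldGenerateEmbeddings) {
    return m.shouldLoadCache ? 'restore, then generate' : 'wipe, then generate';
  }
  if (m.preserveExistingEmbeddings) return 'preserve only';
  if (options.dropEmbeddings && existing > 0) return 'wipe';
  return 'no embedding work';
}

const flagCases: Array<[string, EmbeddingModeInput, number]> = [
  ['(default), existing=1234', {}, 1234],
  ['(default), existing=0', {}, 0],
  ['--force, existing=500', { force: true }, 500],
  ['--embeddings, existing=500', { embeddings: true }, 500],
  ['--drop-embeddings, existing=1234', { dropEmbeddings: true }, 1234],
  ['--embeddings --drop-embeddings, existing=1234', { embeddings: true, dropEmbeddings: true }, 1234],
];

for (const [label, options, existing] of flagCases) {
  console.log(`${label} -> ${describeMode(options, existing)}`);
}
```

Keeping the derivation free of I/O is what makes a table-driven check like this possible without a database.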
72 changes: 69 additions & 3 deletions gitnexus/src/core/run-analyze.ts
@@ -53,6 +53,15 @@ export interface AnalyzeOptions {
*/
force?: boolean;
embeddings?: boolean;
/**
* Explicitly drop any embeddings present in the existing index instead of
* preserving them. Only meaningful when `embeddings` is false/undefined:
* the default behavior in that case is to load the previously generated
* embeddings and re-insert them after the rebuild so a routine
* re-analyze does not silently wipe a long embedding pass (#issue: analyze
* silently wipes existing embeddings when run without --embeddings).
*/
dropEmbeddings?: boolean;
skipGit?: boolean;
/** Skip AGENTS.md and CLAUDE.md gitnexus block updates. */
skipAgentsMd?: boolean;
@@ -94,6 +103,12 @@ export interface AnalyzeResult {
/** Threshold: auto-skip embeddings for repos with more nodes than this */
const EMBEDDING_NODE_LIMIT = 50_000;

// Re-export the pure flag-derivation helper so external callers (and tests)
// keep importing from this module's stable surface.
export { deriveEmbeddingMode } from './embedding-mode.js';
export type { EmbeddingMode } from './embedding-mode.js';
import { deriveEmbeddingMode as _deriveEmbeddingMode } from './embedding-mode.js';

export const PHASE_LABELS: Record<string, string> = {
extracting: 'Scanning files',
structure: 'Building structure',
@@ -160,18 +175,69 @@ export async function runFullAnalysis(
   }

   // ── Cache embeddings from existing index before rebuild ────────────
+  // Four modes:
+  //   --embeddings         -> load cache, restore, then generate any new ones
+  //   --force (with existing
+  //     embeddings)        -> auto-imply --embeddings: load cache, restore,
+  //                           regenerate embeddings for new/changed nodes
+  //                           (a forced re-index of an embedded repo
+  //                           shouldn't quietly downgrade to "preserve only")
+  //   (default)            -> if existing index has embeddings, preserve them
+  //                           (load + restore, but do not generate); otherwise no-op
+  //   --drop-embeddings    -> skip cache load entirely; rebuild wipes embeddings
+  //
+  // The default-preserve branch is what makes a routine `analyze` (e.g. a
+  // post-commit hook) safe: a multi-minute embedding pass is no longer
+  // silently dropped just because the caller omitted `--embeddings`.
   let cachedEmbeddingNodeIds = new Set<string>();
   let cachedEmbeddings: CachedEmbedding[] = [];

-  if (options.embeddings && existingMeta && !options.force) {
+  const existingEmbeddingCount = existingMeta?.stats?.embeddings ?? 0;
+  const {
+    forceRegenerateEmbeddings,
+    preserveExistingEmbeddings,
+    shouldGenerateEmbeddings,
+    shouldLoadCache,
+  } = _deriveEmbeddingMode(options, existingEmbeddingCount);
+
+  if (options.dropEmbeddings && existingEmbeddingCount > 0) {
+    log(
+      `Dropping ${existingEmbeddingCount} existing embeddings (--drop-embeddings). ` +
+        `Re-run with --embeddings to regenerate.`,
+    );
+  } else if (forceRegenerateEmbeddings) {
+    log(
+      `--force on a repo with ${existingEmbeddingCount} existing embeddings: ` +
+        `regenerating embeddings for new/changed nodes. ` +
+        `Pass --drop-embeddings to wipe them instead.`,
+    );
+  } else if (preserveExistingEmbeddings) {
+    log(
+      `Preserving ${existingEmbeddingCount} existing embeddings. ` +
+        `Pass --embeddings to also generate embeddings for new/changed nodes, ` +
+        `or --drop-embeddings to wipe them.`,
+    );
+  }
+
+  if (shouldLoadCache && existingMeta) {
     try {
       progress('embeddings', 0, 'Caching embeddings...');
       await initLbug(lbugPath);
       const cached = await loadCachedEmbeddings();
       cachedEmbeddingNodeIds = cached.embeddingNodeIds;
       cachedEmbeddings = cached.embeddings;
       await closeLbug();
-    } catch {
+    } catch (err: any) {
+      // Surface cache-load failures explicitly: silently swallowing here would
+      // re-introduce the original silent-data-loss symptom (embeddings end up
+      // at 0 in meta.json with no diagnostic) through a different door.
+      log(
+        `Warning: could not load cached embeddings ` +
+          `(${err?.message ?? String(err)}). ` +
+          `Embeddings will not be preserved on this run.`,
+      );
+      cachedEmbeddingNodeIds = new Set<string>();
+      cachedEmbeddings = [];
       try {
         await closeLbug();
       } catch {
@@ -253,7 +319,7 @@ export async function runFullAnalysis(
const stats = await getLbugStats();
let embeddingSkipped = true;

-if (options.embeddings) {
+if (shouldGenerateEmbeddings) {
if (stats.nodes <= EMBEDDING_NODE_LIMIT) {
embeddingSkipped = false;
}
8 changes: 6 additions & 2 deletions gitnexus/src/server/api.ts
@@ -1145,7 +1145,7 @@ export const createServer = async (port: number, host: string = '127.0.0.1') =>
// POST /api/analyze — start a new analysis job
app.post('/api/analyze', async (req, res) => {
try {
-const { url: repoUrl, path: repoLocalPath, force, embeddings } = req.body;
+const { url: repoUrl, path: repoLocalPath, force, embeddings, dropEmbeddings } = req.body;

// Input type validation
if (repoUrl !== undefined && typeof repoUrl !== 'string') {
@@ -1339,7 +1339,11 @@ export const createServer = async (port: number, host: string = '127.0.0.1') =>
child.send({
type: 'start',
repoPath: targetPath,
-options: { force: !!force, embeddings: !!embeddings },
+options: {
+  force: !!force,
+  embeddings: !!embeddings,
+  dropEmbeddings: !!dropEmbeddings,
+},
});
};
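
For API callers, the new field rides along in the same request body as `force` and `embeddings`. The sketch below builds such a request. The field names come from the destructuring in the handler above; the helper `buildAnalyzeRequest`, the port `3000`, the repository path, and the JSON content type are assumptions made for illustration, not confirmed details of the server.

```typescript
// Field names mirror the destructuring of `req.body` in the handler above.
interface AnalyzeRequestBody {
  url?: string;
  path?: string;
  force?: boolean;
  embeddings?: boolean;
  dropEmbeddings?: boolean;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical client helper, not part of GitNexus. Assumes the server
// accepts a JSON-encoded body on POST /api/analyze.
function buildAnalyzeRequest(baseUrl: string, body: AnalyzeRequestBody): PreparedRequest {
  return {
    url: `${baseUrl}/api/analyze`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    },
  };
}

// Port 3000 and the path are placeholders; 127.0.0.1 is the default host
// in `createServer`.
const analyzeRequest = buildAnalyzeRequest('http://127.0.0.1:3000', {
  path: '/path/to/repo',
  dropEmbeddings: true,
});
console.log(analyzeRequest.url, analyzeRequest.init.body);
```

The prepared `url` and `init` can be handed to `fetch(analyzeRequest.url, analyzeRequest.init)`. Because the server coerces each flag with `!!`, omitting `dropEmbeddings` is equivalent to sending `false`.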

81 changes: 81 additions & 0 deletions gitnexus/test/unit/run-analyze.test.ts
@@ -1,4 +1,5 @@
import { describe, it, expect } from 'vitest';
import { deriveEmbeddingMode } from '../../src/core/embedding-mode.js';

describe('run-analyze module', () => {
it('exports runFullAnalysis as a function', async () => {
@@ -12,3 +13,83 @@ describe('run-analyze module', () => {
expect(mod.PHASE_LABELS.parsing).toBe('Parsing code');
});
});

describe('deriveEmbeddingMode', () => {
// Default `analyze` on a repo with existing embeddings: must preserve, must
// NOT regenerate, must load the cache so phase 3.5 can re-insert vectors.
it('default + existing>0 → preserve only (load cache, no generation)', () => {
const m = deriveEmbeddingMode({}, 1234);
expect(m.preserveExistingEmbeddings).toBe(true);
expect(m.shouldGenerateEmbeddings).toBe(false);
expect(m.forceRegenerateEmbeddings).toBe(false);
expect(m.shouldLoadCache).toBe(true);
});

it('default + existing=0 → no-op (no preserve, no generation, no cache load)', () => {
const m = deriveEmbeddingMode({}, 0);
expect(m.preserveExistingEmbeddings).toBe(false);
expect(m.shouldGenerateEmbeddings).toBe(false);
expect(m.forceRegenerateEmbeddings).toBe(false);
expect(m.shouldLoadCache).toBe(false);
});

// The headline behavior change requested in PR feedback: --force on an
// already-embedded repo must regenerate (top up new/changed nodes), not
// silently downgrade to "preserve only".
it('--force + existing>0 → forceRegenerate + generate + load cache', () => {
const m = deriveEmbeddingMode({ force: true }, 500);
expect(m.forceRegenerateEmbeddings).toBe(true);
expect(m.shouldGenerateEmbeddings).toBe(true);
expect(m.preserveExistingEmbeddings).toBe(false);
expect(m.shouldLoadCache).toBe(true);
});

it('--force + existing=0 → no embedding work (force keeps prior semantics)', () => {
const m = deriveEmbeddingMode({ force: true }, 0);
expect(m.forceRegenerateEmbeddings).toBe(false);
expect(m.shouldGenerateEmbeddings).toBe(false);
expect(m.preserveExistingEmbeddings).toBe(false);
expect(m.shouldLoadCache).toBe(false);
});

it('--embeddings → generate + load cache (incremental top-up)', () => {
const m = deriveEmbeddingMode({ embeddings: true }, 500);
expect(m.shouldGenerateEmbeddings).toBe(true);
expect(m.preserveExistingEmbeddings).toBe(false);
expect(m.shouldLoadCache).toBe(true);
});

it('--embeddings + existing=0 → generate; cache load still fires (harmless empty load)', () => {
const m = deriveEmbeddingMode({ embeddings: true }, 0);
expect(m.shouldGenerateEmbeddings).toBe(true);
// Cache load is gated at the call site by `existingMeta`, not by count;
// when explicit `--embeddings` is set we always attempt the load so any
// stray vectors from a partial prior run get picked up.
expect(m.shouldLoadCache).toBe(true);
});

// --drop-embeddings is the explicit wipe path; it must suppress cache load
// even when --force is also set (the dominant escape hatch).
it('--drop-embeddings → suppresses cache load, no generation', () => {
const m = deriveEmbeddingMode({ dropEmbeddings: true }, 1234);
expect(m.shouldLoadCache).toBe(false);
expect(m.shouldGenerateEmbeddings).toBe(false);
expect(m.preserveExistingEmbeddings).toBe(false);
expect(m.forceRegenerateEmbeddings).toBe(false);
});

it('--force + --drop-embeddings → drop wins (no cache load, no generation)', () => {
const m = deriveEmbeddingMode({ force: true, dropEmbeddings: true }, 1234);
expect(m.shouldLoadCache).toBe(false);
expect(m.shouldGenerateEmbeddings).toBe(false);
expect(m.forceRegenerateEmbeddings).toBe(false);
});

it('--embeddings + --drop-embeddings → drop suppresses cache load (no preservation)', () => {
// --embeddings still generates, but the prior vectors are wiped first.
const m = deriveEmbeddingMode({ embeddings: true, dropEmbeddings: true }, 1234);
expect(m.shouldLoadCache).toBe(false);
expect(m.shouldGenerateEmbeddings).toBe(true);
expect(m.preserveExistingEmbeddings).toBe(false);
});
});