Skip to content

feat: add --ignore-embeddings-node-limit flag to bypass 50K node cap#389

Closed
RyuzakiH wants to merge 1 commit into
abhigyanpatwari:mainfrom
RyuzakiH:feat/ignore-embeddings-node-limit
Closed

feat: add --ignore-embeddings-node-limit flag to bypass 50K node cap#389
RyuzakiH wants to merge 1 commit into
abhigyanpatwari:mainfrom
RyuzakiH:feat/ignore-embeddings-node-limit

Conversation

@RyuzakiH

Copy link
Copy Markdown
Contributor

Summary

This PR adds a new CLI flag, --ignore-embeddings-node-limit, to the analyze command.

When used together with --embeddings, this flag bypasses the existing 50,000-node safety cap for embedding generation.

It also updates the skip message to clearly inform users that they can override the limit with --ignore-embeddings-node-limit.

Closes #382

Changes

  • Added --ignore-embeddings-node-limit to the analyze command
  • Allowed embedding generation to continue beyond the 50,000-node limit when the flag is provided
  • Updated the skip message to mention the override flag

Test Plan

  • Run npx gitnexus analyze --embeddings on a repository with more than 50,000 nodes and verify that embeddings are skipped with the updated message mentioning --ignore-embeddings-node-limit
  • Run npx gitnexus analyze --embeddings --ignore-embeddings-node-limit on the same repository and verify that embeddings are generated
  • Run npx gitnexus analyze --help and verify that the new flag appears in the CLI help output
  • Run the existing test suite (npm test) and verify that there are no regressions

@vercel

vercel Bot commented Mar 19, 2026

Copy link
Copy Markdown

@RyuzakiH is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

Comment thread gitnexus/src/cli/index.ts
Comment on lines 28 to +29
.option('--embeddings', 'Enable embedding generation for semantic search (off by default)')
.option('--ignore-embeddings-node-limit', 'Bypass the 50,000-node safety cap for embedding generation (use with --embeddings on large repos)')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could merge them together and have

--embeddings [limit]

What do you think?

@zander-raycraft zander-raycraft added enhancement New feature or request bug-fix PR tag for bug fixes labels Mar 21, 2026
@zander-raycraft

Copy link
Copy Markdown
Collaborator

@claude review this, check for redundancies or misses in optimization for the method for this new flag. Can you also suggest an e2e test for this

@zander-raycraft zander-raycraft removed the bug-fix PR tag for bug fixes label Mar 21, 2026
@github-actions

github-actions Bot commented Mar 21, 2026

Copy link
Copy Markdown
Contributor

Claude finished @zander-raycraft's task in 1m 53s —— View job


Code Review: --ignore-embeddings-node-limit flag

  • Gather context and read changed files
  • Review for redundancies and optimization misses
  • Suggest e2e test

Review

The implementation is minimal and correct. The logic in analyze.ts:271–275 is the key change:

if (stats.nodes > EMBEDDING_NODE_LIMIT && !options.ignoreEmbeddingsNodeLimit) {
  embeddingSkipReason = `skipped (${stats.nodes.toLocaleString()} nodes > ${EMBEDDING_NODE_LIMIT.toLocaleString()} limit; use --ignore-embeddings-node-limit to override)`;
} else {
  embeddingSkipped = false;
}

This is clean — no redundancy, no wasted work. Short-circuit logic is handled correctly by the outer if (options?.embeddings) guard.


Issues Found

1. Silent no-op when flag is used without --embeddings (medium)

--ignore-embeddings-node-limit without --embeddings is silently ignored. Users who misread the docs will see no feedback. A one-liner warning would help:

// In analyze.ts, after the verbose check (~line 81):
if (options?.ignoreEmbeddingsNodeLimit && !options?.embeddings) {
  console.warn('  Warning: --ignore-embeddings-node-limit has no effect without --embeddings\n');
}

Fix this →

2. No warning about memory risk when bypassing the cap (low)

The 50K limit exists to prevent OOM. When it's bypassed, the user gets no indication that they're in risky territory. Consider logging a one-time advisory after the node count is known:

// After the embeddingSkipped = false branch:
if (options.ignoreEmbeddingsNodeLimit) {
  console.warn(`  Warning: bypassing node limit (${stats.nodes.toLocaleString()} nodes) — ensure sufficient memory\n`);
}

3. Reviewer's suggestion: --embeddings [limit] (design question)

magyargergo suggested collapsing both flags into --embeddings [limit] so --embeddings uses the default cap and --embeddings 0 bypasses it. This would be a cleaner UX but is a larger API change. The current two-flag approach works fine as long as the no-op case (point 1) is handled.

4. AnalyzeOptions interface field placement (nit)

ignoreEmbeddingsNodeLimit is placed at line 49 between embeddings and skills. It would read more clearly directly after embeddings — which it already is. This is fine as-is.


Suggested E2E Test

The challenge: we can't easily create a >50K node repo in tests. The best approach is a two-part strategy — a unit test for the branch logic, and a CLI-flag-registration test using the existing e2e harness.

Unit test — add to test/unit/cli-commands.test.ts or a new test/unit/analyze-node-limit.test.ts:

import { describe, it, expect, vi, beforeEach } from 'vitest';

// Mock all heavy deps so we can call analyzeCommand in-process
vi.mock('../../src/core/ingestion/pipeline.js', () => ({
  runPipelineFromRepo: vi.fn().mockResolvedValue({
    graph: {},
    repoPath: '/tmp/fake',
    totalFileCount: 10,
    communityResult: { stats: { totalCommunities: 0 }, communities: [] },
    processResult: { stats: { totalProcesses: 0 } },
  }),
}));
vi.mock('../../src/core/lbug/lbug-adapter.js', () => ({
  initLbug: vi.fn().mockResolvedValue(undefined),
  loadGraphToLbug: vi.fn().mockResolvedValue({ warnings: [] }),
  getLbugStats: vi.fn().mockResolvedValue({ nodes: 75_000, edges: 0 }), // over limit
  executeQuery: vi.fn().mockResolvedValue([{ cnt: 0 }]),
  executeWithReusedStatement: vi.fn().mockResolvedValue(undefined),
  closeLbug: vi.fn().mockResolvedValue(undefined),
  createFTSIndex: vi.fn().mockResolvedValue(undefined),
  loadCachedEmbeddings: vi.fn().mockResolvedValue({ embeddingNodeIds: new Set(), embeddings: [] }),
}));
vi.mock('../../src/storage/repo-manager.js', () => ({
  getStoragePaths: vi.fn().mockReturnValue({ storagePath: '/tmp/.gitnexus', lbugPath: '/tmp/.gitnexus/lbug.db' }),
  saveMeta: vi.fn().mockResolvedValue(undefined),
  loadMeta: vi.fn().mockResolvedValue(null),
  addToGitignore: vi.fn().mockResolvedValue(undefined),
  registerRepo: vi.fn().mockResolvedValue(undefined),
  getGlobalRegistryPath: vi.fn().mockReturnValue('/tmp/.gitnexus/registry.json'),
  cleanupOldKuzuFiles: vi.fn().mockResolvedValue({ found: false, needsReindex: false }),
}));
vi.mock('../../src/storage/git.js', () => ({
  getCurrentCommit: vi.fn().mockReturnValue('abc123'),
  isGitRepo: vi.fn().mockReturnValue(true),
  getGitRoot: vi.fn().mockReturnValue('/tmp/fake'),
}));
vi.mock('../../src/cli/ai-context.js', () => ({
  generateAIContextFiles: vi.fn().mockResolvedValue({ files: [] }),
}));
vi.mock('../../src/cli/skill-gen.js', () => ({
  generateSkillFiles: vi.fn().mockResolvedValue({ skills: [] }),
}));

describe('analyzeCommand -- embeddings node limit', () => {
  let consoleOutput: string[];
  let exitCode: number | undefined;
  let embeddingPipelineCalled: boolean;

  beforeEach(() => {
    consoleOutput = [];
    exitCode = undefined;
    embeddingPipelineCalled = false;
    vi.spyOn(process, 'stdout', 'get').mockReturnValue(process.stdout); // no-op
    vi.spyOn(console, 'log').mockImplementation((...args) => { consoleOutput.push(args.join(' ')); });
    vi.spyOn(process, 'exit').mockImplementation((code?: number) => { exitCode = code; throw new Error('process.exit'); });
  });

  it('skips embeddings when node count exceeds limit and flag not set', async () => {
    vi.doMock('../../src/core/embeddings/embedding-pipeline.js', () => ({
      runEmbeddingPipeline: vi.fn().mockImplementation(() => { embeddingPipelineCalled = true; }),
    }));
    const { analyzeCommand } = await import('../../src/cli/analyze.js');

    await analyzeCommand('/tmp/fake', { embeddings: true }).catch(() => {});

    const summary = consoleOutput.find(l => l.includes('Embeddings'));
    expect(summary).toMatch(/skipped/);
    expect(summary).toMatch(/--ignore-embeddings-node-limit/);
    expect(embeddingPipelineCalled).toBe(false);
  });

  it('runs embeddings when node count exceeds limit and --ignore-embeddings-node-limit is set', async () => {
    const runEmbeddingPipeline = vi.fn().mockResolvedValue(undefined);
    vi.doMock('../../src/core/embeddings/embedding-pipeline.js', () => ({ runEmbeddingPipeline }));
    const { analyzeCommand } = await import('../../src/cli/analyze.js');

    await analyzeCommand('/tmp/fake', { embeddings: true, ignoreEmbeddingsNodeLimit: true }).catch(() => {});

    expect(runEmbeddingPipeline).toHaveBeenCalled();
    const summary = consoleOutput.find(l => l.includes('Embeddings'));
    expect(summary).not.toMatch(/skipped/);
  });
});

CLI-flag-registration test — add to the existing describe('unhappy path') block in test/integration/cli-e2e.test.ts:

it('analyze --help lists --ignore-embeddings-node-limit flag', () => {
  const result = runCliRaw(['analyze', '--help'], MINI_REPO);
  if (result.status === null) return;

  expect(result.status).toBe(0);
  expect(result.stdout).toMatch(/--ignore-embeddings-node-limit/);
  expect(result.stdout).toMatch(/--embeddings/);
});

This is lightweight, has zero deps on a large repo, and guards against the flag being accidentally removed.


Summary

Finding Severity Actionable?
Silent no-op without --embeddings Medium Yes — add a console.warn
No OOM warning when bypassing cap Low Yes — add advisory log
--embeddings [limit] API design Design Maintainer decision
Implementation logic ✅ Correct No changes needed

@zander-raycraft

Copy link
Copy Markdown
Collaborator

@RyuzakiH can you address the issues raised in the comments above in a new commit to this pr?

@ShunsukeHayashi

Copy link
Copy Markdown
Contributor

This is a must-have for Enterprise adoptions.

We process and track context across multiple large repositories and enterprise codebases for autonomous agents (using OpenClaw MAS with 39 agents). The 50,000-node limit has been a hard blocker for computing the full blast radius on enterprise-scale mono-repos.

We absolutely need the ability to compute full embeddings across large codebases, even if it takes more compute resources. This --ignore-embeddings-node-limit flag allows us to run DeepWiki generation on massive repos during CI/CD without the agent losing scope of distant dependencies.

Tested this locally by patching our version and it works great. Looking forward to this getting merged.

@magyargergo

Copy link
Copy Markdown
Collaborator

⚠️ Upcoming Prettier formatting — rebase instructions

PR #563 adds Prettier as the code formatter for the repo. When it merges, the bulk format commit will touch ~350 files (style-only: whitespace, quotes, trailing commas). Your branch will likely conflict.

After #563 merges, rebase your branch:

git fetch origin
git checkout <your-branch>
git rebase origin/main

# Conflicts will be formatting-only — accept your version:
git checkout --theirs .
git add .
git rebase --continue

# Then re-format your branch to match the new style:
npx prettier --write .
git add -A
git commit -m "style: apply prettier formatting"
git push --force-with-lease

New setup step: Run npm install at the repo root (not just in gitnexus/) to get prettier + activate the pre-commit hook. The hook auto-formats staged files on every commit going forward.

@JorrinKievit

Copy link
Copy Markdown
Contributor

@magyargergo Hey! Also looking for this change, should I open a new PR?

@magyargergo

Copy link
Copy Markdown
Collaborator

@JorrinKievit feel free to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request - Allow exceeding 50,000 node limit for embeddings

5 participants