Skip to content

feat: add optional limit arg to --embeddings flag (closes #382)#1375

Merged
magyargergo merged 2 commits into
abhigyanpatwari:mainfrom
JorrinKievit:feat/embeddings-node-limit-arg
May 6, 2026
Merged

feat: add optional limit arg to --embeddings flag (closes #382)#1375
magyargergo merged 2 commits into
abhigyanpatwari:mainfrom
JorrinKievit:feat/embeddings-node-limit-arg

Conversation

@JorrinKievit

Copy link
Copy Markdown
Contributor

Summary

Reimplements stale PR #389 with the design @magyargergo suggested in review — collapse the proposed --ignore-embeddings-node-limit flag into an optional argument on the existing --embeddings flag so the API stays small.

--embeddings        # keep default 50,000-node safety cap (current behavior)
--embeddings 0      # disable the cap entirely
--embeddings <n>    # use a custom cap

Closes #382.

Why

The 50K-node embedding cap is a real OOM guardrail, but it's a hard blocker on large mono-repos. Users hitting it had no in-CLI way to opt in to running anyway. The original PR added a second flag; this version folds the override into the existing flag instead — same capability, smaller API surface.

Behavior changes

  • Skip message — when the cap blocks generation, the log now tells users how to override:

    Embeddings skipped: 73,412 nodes exceeds the 50,000-node safety cap. Override with \--embeddings 0` to disable the cap, or `--embeddings ` to set a custom cap.`

  • OOM advisory — when --embeddings 0 is used on a repo larger than the default cap, a one-line warning logs that the cap exists for a reason.
  • Validation — non-numeric or negative values fail fast with a clear error before the pipeline starts.

Implementation

  • cli/index.ts--embeddings becomes --embeddings [limit]
  • cli/analyze.ts — parses + validates the optional argument, forwards embeddingsNodeLimit to runFullAnalysis
  • core/run-analyze.ts — replaces the hard-coded EMBEDDING_NODE_LIMIT constant with the per-call override; 0 is treated as "no cap"

Test Plan

  • npx gitnexus analyze --embeddings on a >50K-node repo → embeddings skipped, message mentions --embeddings 0
  • npx gitnexus analyze --embeddings 0 on the same repo → embeddings generated, OOM advisory logged
  • npx gitnexus analyze --embeddings 100000 → custom cap honored
  • npx gitnexus analyze --embeddings abc → fails fast with validation error
  • npx gitnexus analyze --help → shows the new [limit] syntax
  • Existing deriveEmbeddingMode unit tests still pass (boolean coercion preserved)

Replaces the hard-coded 50,000-node embedding cap with an optional
argument on `--embeddings`:

  --embeddings        keeps the default 50K-node safety cap
  --embeddings 0      disables the cap entirely
  --embeddings <n>    sets a custom cap

When the cap is hit, the skip message now points users at the override
syntax. When the cap is disabled on a >50K-node repo, an OOM advisory
is logged.

Reimplements stale PR abhigyanpatwari#389; merges the previously separate
`--ignore-embeddings-node-limit` flag into the existing `--embeddings`
flag per maintainer review feedback.

Closes abhigyanpatwari#382
@vercel

vercel Bot commented May 6, 2026

Copy link
Copy Markdown

Someone is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 4m 25s —— View job


PR #1375 Review — feat: add optional limit arg to --embeddings flag

Tasks

  • Post initial comment
  • Read repo context files (DoD.md, AGENTS.md referenced)
  • Read changed files (index.ts, analyze.ts, run-analyze.ts)
  • Inspect relevant tests (run-analyze.test.ts, analyze-worker-timeout.test.ts, cli-index-help.test.ts, embedding-config.test.ts)
  • Check CI status
  • Run Unicode/hygiene checks
  • Search for key symbols (EMBEDDING_NODE_LIMIT, AnalyzeOptions, deriveEmbeddingMode, etc.)
  • Compile full review findings
  • Post final review

Review bar for this PR

  • --embeddings (bare) must keep existing 50,000-node cap behavior and not regress deriveEmbeddingMode semantics.
  • Omitting --embeddings must leave embedding generation off by default.
  • --embeddings 0 must enable embeddings and disable the node-count cap only, logging an OOM advisory on repos >50k nodes.
  • --embeddings <positive integer> must enable embeddings with a custom cap.
  • Invalid inputs (non-numeric, negative, decimal, NaN, Infinity) must fail fast before the pipeline starts.
  • AnalyzeOptions types across CLI and core must be consistent with no duplicate/conflicting embeddings declarations.
  • runFullAnalysis must keep the default 50k cap unless an explicit override is passed; no old hardcoded guard may survive.
  • Tests must cover the new parsing validation and cap logic, not just the pre-existing deriveEmbeddingMode behavior.

Current PR state

Item Value
Head SHA f43d5d8d3aa91f18ebd343900b9db47e425de618
Status Open, non-draft
Commits 1
Changed files 3 (cli/index.ts, cli/analyze.ts, core/run-analyze.ts)
CI — main CI In progress (started 2026-05-06T11:13:25Z)
CI — CodeQL ✅ Passed
CI — Dependency Review ✅ Passed
CI — Gitleaks ✅ Passed
CI — PR Description Check ✅ Passed
CI — Docker Build In progress
Issue #382 Linked, closes line in PR body ✅
Stale PR #389 Superseded by design (folded into one flag) ✅
Hidden Unicode (bidi/invisible controls) None found — only em dashes, box-drawing chars, and arrows in comments/progress bar (all visible, safe)
Vercel Irrelevant — CLI/core change only
Latest code matches PR body Yes

Understanding of the change

The 50k-node cap in runFullAnalysis exists because generating embeddings for very large graphs can exhaust heap memory. Users on large monorepos had no in-CLI way to opt out of the skip. This PR folds the override into --embeddings [limit] rather than adding a second flag. Commander's optional-value syntax means the flag is undefined when omitted, true when bare, and a string when given a value. The CLI converts the string to a typed number | undefined and passes embeddingsNodeLimit to runFullAnalysis; the pipeline uses options.embeddingsNodeLimit ?? DEFAULT_EMBEDDING_NODE_LIMIT. 0 is the conventional "no cap" sentinel.


Findings

[Medium] Missing tests for new parsing and cap logic

  • Category: Test coverage (DoD §2.7)
  • Files: gitnexus/src/cli/analyze.ts (lines 343–354), gitnexus/src/core/run-analyze.ts (lines 342–363)
  • Issue: The PR ships no automated tests for: --embeddings abc (fail fast), --embeddings -1 (fail fast), --embeddings 1.5 (fail fast), --embeddings 0 (cap disabled), --embeddings 100000 (custom cap), the OOM advisory log, the updated skip message, or the runFullAnalysis cap-decision logic. The existing analyze-worker-timeout.test.ts provides an exact pattern: mock runFullAnalysis, call analyzeCommand, assert process.exitCode === 1 and runFullAnalysisMock not called. The runFullAnalysis cap logic can be unit-tested by injecting a mock stats.nodes value and asserting on whether runEmbeddingPipeline was called and what was logged.
  • Why it matters: DoD §2.7 requires tests to "fail if behavior, wiring, or contracts were broken." Today, the cap logic could regress entirely — reverting to a hard-coded guard — and no test would catch it. The PR body's test plan is manual-only.
  • Recommended fix: Add a gitnexus/test/unit/analyze-embeddings-limit.test.ts using the analyze-worker-timeout.test.ts mock-and-call pattern to cover CLI validation cases, and extend run-analyze.test.ts with cap-decision tests (or extract nodeLimit logic into embedding-mode.ts where it can be unit-tested cheaply).
  • Blocks merge: No — the feature is correct — but this gap must be closed in a follow-up before the PR is considered fully Done against the project's own DoD bar. Fix this →

[Minor] Validation fires after progress bar start — terminal left dirty on error

  • Category: Observability/operability (DoD §2.8), UX correctness
  • Files: gitnexus/src/cli/analyze.ts lines 286 vs 343–356
  • Issue: The progress bar starts at line 286 (bar.start(...)) and console.error is redirected to barLog at line 313. The --embeddings validation block at lines 343–354 uses console.error(...) then process.exit(1). Since bar.stop() is never called, the terminal exits with the cursor hidden (hideCursor: true) and the progress bar not cleaned up. Every other validation in this function that sets process.exitCode = 1; return (lines 167–210) runs before the bar starts. The embeddings validation is the only one placed after bar start and using process.exit(1) directly.
  • Why it matters: On most terminals this means the cursor vanishes after --embeddings abc. Minor UX degradation, not a data-safety issue, but inconsistent with the existing pattern.
  • Recommended fix: Move the embeddings validation block to before bar.start() (e.g., immediately after the workerTimeout block at line 175), using process.exitCode = 1; return to match the existing pattern. Fix this →
  • Blocks merge: No.

[Minor/Informational] Commander optional arg consumes positional repo path

  • Category: CLI usability, Commander edge case
  • Files: gitnexus/src/cli/index.ts line 27, gitnexus/src/cli/analyze.ts lines 343–354
  • Issue: With --embeddings [limit], Commander consumes the next non-flag token as the optional value. So gitnexus analyze --embeddings . treats . as the limit (not the repo path), fails validation with got ".", and leaves inputPath unset. The more natural gitnexus analyze . --embeddings and gitnexus analyze --embeddings 0 . both work correctly. This is inherent to Commander's optional-value behavior and documented in Commander.js docs, but can catch users who put the path before the flag.
  • Why it matters: Not a safety issue — the validation does catch it — but the error message "got ." is confusing. --embeddings --drop-embeddings is safe (Commander recognizes -- tokens as flags, not values). --embeddings -1 is also safe (Commander treats -1 as a flag token, so embeddings = true and -1 causes a Commander unknown-option error before reaching the validator — the negative-value validation in analyze.ts is thus unreachable in practice).
  • Recommended fix: Add a note in the help text: (place before the path argument, e.g. --embeddings 0 <path>), or add a test documenting Commander's behavior.
  • Blocks merge: No.

CLI option assessment

index.ts:27 — option declared as --embeddings [limit] ✅. Help text at lines 28–30 explains the 50k cap, 0 to disable, and custom <n> ✅. Commander behavior is correctly handled by the analyze.ts code: undefined (omitted), true (bare flag), string (with value). The --embeddings --drop-embeddings combination safely gives true for both. The option does not have an adjacent .option that conflicts. No positional-path greedy issue under normal usage patterns (gitnexus analyze . --embeddings 0), but see finding above for --embeddings . form. Help text is clear and practical.

Analyze parsing assessment

analyze.ts:96–147AnalyzeOptions.embeddings?: boolean | string is declared exactly once ✅. No duplicate. The comment at lines 98–104 explains Commander's three-valued behavior clearly.

Validation (lines 343–354):

  • Number(options.embeddings) with !Number.isInteger(parsed) || parsed < 0 correctly rejects: "abc" (NaN, not integer), "1.5" (not integer), "-1" (< 0), "NaN" (NaN, not integer), "Infinity" (not integer) ✅
  • "0" → 0, integer, ≥ 0 → accepted ✅
  • "001" → 1, accepted (reasonable) ✅
  • "" — Commander will not produce "" for an optional-value flag ✅

embeddingsEnabled = !!options?.embeddings correctly handles all cases: undefined→false, true→true, "0"→true, "100000"→true ✅.

Call at lines 358–384: no stale embeddings: options?.embeddings line. Only embeddings: embeddingsEnabled and embeddingsNodeLimit passed — clean ✅.

The one issue is placement after bar.start() (see finding above).

runFullAnalysis cap assessment

run-analyze.ts:118DEFAULT_EMBEDDING_NODE_LIMIT = 50_000 ✅. Named clearly.

Cap logic (lines 342–363):

const nodeLimit = options.embeddingsNodeLimit ?? DEFAULT_EMBEDDING_NODE_LIMIT;
const capDisabled = nodeLimit === 0;
if (capDisabled || stats.nodes <= nodeLimit) { embeddingSkipped = false; ... }

No old hardcoded if (stats.nodes <= EMBEDDING_NODE_LIMIT) outer guard exists ✅. The feature is not broken by a surviving legacy guard. 0capDisabled = true → always generates ✅. Custom cap → stats.nodes <= nodeLimit check ✅. Default (undefined) → 50k cap ✅.

OOM advisory fires only when capDisabled && stats.nodes > DEFAULT_EMBEDDING_NODE_LIMIT — small repos are not noisy ✅. Skip message uses nodeLimit.toLocaleString() (active cap, not hardcoded 50k) ✅. Skip message includes --embeddings 0 and --embeddings <n> guidance ✅.

embeddingSkipped starts true and is set false only when generating — preserving behavior for the preserve-existing and no-op paths ✅.

OOM/resource-safety assessment

Default path is safe: embeddingsNodeLimit is undefined → 50k cap applied. Disabling requires explicit --embeddings 0 at the CLI, which passes embeddingsNodeLimit: 0 to runFullAnalysis. No environment variable or config path can accidentally set embeddingsNodeLimit: 0 — it's only ever set from the parsed CLI arg. Programmatic callers via analyze-worker.ts pass msg.options (type AnalyzeOptions from run-analyze.ts) where embeddingsNodeLimit defaults to undefined unless the server explicitly sets it; the server currently sends no such field. No other guardrail (embedding batching, ONNX thread limits, heap size) is touched. Warning text is proportionate — one log line, not alarming.

The one gap: runFullAnalysis itself does not re-validate embeddingsNodeLimit (e.g., reject negative values from programmatic callers). The CLI validates, but the API is public. A negative value would behave like a positive cap (stats.nodes <= negative_number → always skip), so the failure mode is safe by accident but not by design. Worth documenting or adding a cheap guard in runFullAnalysis.

Test assessment

Existing coverage (unaffected by this PR):

  • deriveEmbeddingMode — 8 tests cover the boolean-level embedding mode derivation ✅
  • runFullAnalysis exports and fast-path — 2 tests ✅
  • embedding-config.test.ts — env var overrides and invalid values ✅
  • analyze-worker-timeout.test.ts — validation pattern for analyzeCommand

New coverage (required by this PR, absent):

  • CLI parsing: --embeddings abc, --embeddings -1, --embeddings 1.5, --embeddings 0, --embeddings 100000, bare --embeddings
  • runFullAnalysis cap decisions: skip when nodes > default, generate when nodes <= custom, generate when cap disabled, advisory log only when cap disabled and nodes > default
  • Skip message content (includes --embeddings 0 guidance)

The analyze-worker-timeout.test.ts template maps directly to what's needed. This is the most actionable gap in the PR.

Hidden Unicode / hygiene assessment

Bidirectional Unicode check (U+202A–202E, U+2066–2069): clean — no bidi controls found in any of the three changed files.

Non-ASCII characters present: em dashes (U+2014) in comments and log strings, box-drawing characters (U+2500, U+2550) in section header comments, block characters (U+2588, U+2591) in the progress bar format string, and right arrows (U+2192) in comments. All are visible characters, pre-existing in the codebase, and none appear in executable code paths that could affect parsing or behavior. No invisible or control-character injection risk.

No stale duplicate fields, no commented-out code, no unrelated file churn. Diff is clean and scoped to the three intended files.

Elegance / maintainability assessment

The design is minimal and correct. One flag, three states (default/custom/disabled), validated at the CLI boundary. The runFullAnalysis API is extended with one optional field (embeddingsNodeLimit?: number) with clear JSDoc semantics. DEFAULT_EMBEDDING_NODE_LIMIT is named clearly. The comment at analyze.ts:340–342 explains Commander's three-valued behavior without clutter. No second flag introduced, no over-abstraction. The two AnalyzeOptions interfaces (CLI vs. core) having the same name across modules could cause confusion for future contributors navigating imports, but both are clearly scoped and the different embeddings types (CLI: boolean | string, core: boolean) reflect their distinct responsibilities.


Final verdict

Production-ready with minor follow-ups.

The core logic is correct end-to-end: --embeddings 0 does bypass the 50k cap (the old guard is gone), bare --embeddings preserves old behavior exactly, invalid inputs are rejected before the pipeline starts, the OOM guardrail holds by default, and no other option is regressed. The TypeScript types are consistent. No hidden Unicode affects executable code.

The two follow-ups to address are: (1) add automated tests for the new parsing validation and cap-decision logic — the existing analyze-worker-timeout.test.ts provides an exact template and this is the clearest gap against the project's own DoD; (2) move the validation block before bar.start() or clean up the bar on exit to avoid leaving the terminal with a hidden cursor when --embeddings abc is passed. Neither blocks the feature from working correctly, but both should be resolved before the PR is considered fully closed against the project's DoD bar.
| Branch

@magyargergo

Copy link
Copy Markdown
Collaborator

@JorrinKievit Can you please look at these findings above? 🙏

Addresses PR abhigyanpatwari#1375 review feedback:

- Move --embeddings validation before bar.start() so an invalid value
  can't leave the terminal with the progress bar's hidden cursor still
  active. Switches process.exit(1) to the sibling pattern
  (process.exitCode = 1; return).
- Extract the node-count cap decision into deriveEmbeddingCap() in
  embedding-mode.ts so the branching can be unit-tested without
  spinning up LadybugDB or the pipeline.
- Add analyze-embeddings-limit.test.ts: rejects abc/-1/1.5/NaN/Infinity
  before the pipeline starts; bare flag forwards undefined limit;
  --embeddings 0 forwards 0; --embeddings <n> forwards n; omitted flag
  keeps embeddings off.
- Extend run-analyze.test.ts with deriveEmbeddingCap cases covering the
  default cap, boundary equality, custom cap (above + below default),
  and the cap-disabled sentinel.
@JorrinKievit

Copy link
Copy Markdown
Contributor Author

@JorrinKievit Can you please look at these findings above? 🙏

done!

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 4m 15s —— View job


PR #1375 Review — feat: add optional limit arg to --embeddings flag

Tasks

  • Post initial comment
  • Read repo context files (DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md)
  • Read changed source files (index.ts, analyze.ts, run-analyze.ts, embedding-mode.ts)
  • Inspect test files (analyze-embeddings-limit.test.ts, run-analyze.test.ts)
  • Check CI status
  • Run Unicode/hygiene checks
  • Search for key symbols (EMBEDDING_NODE_LIMIT, embeddingsNodeLimit, deriveEmbeddingCap, etc.)
  • Compile and post full review

Review bar for this PR

  • --embeddings (bare) must preserve the default 50,000-node cap and not regress deriveEmbeddingMode semantics.
  • Omitting --embeddings must keep embedding generation off by default.
  • --embeddings 0 must enable embeddings and disable the node-count cap only, logging an OOM advisory on repos >50k nodes.
  • --embeddings <positive integer> must enable embeddings with a custom cap.
  • Invalid inputs (non-numeric, negative as string, decimal, NaN, Infinity) must fail fast before the pipeline starts.
  • AnalyzeOptions types must be clean: CLI-layer boolean | string, core-layer boolean + embeddingsNodeLimit?: number.
  • runFullAnalysis must use the per-call override with no surviving legacy hardcoded guard.
  • Tests must cover parsing validation and cap-decision logic, not just pre-existing deriveEmbeddingMode behavior.

Current PR state

Item Value
Head SHA 730f9f9 (2 commits on branch since main)
Status Open, non-draft
Commits 2 (feat: add optional limit arg + test+chore: cover --embeddings [limit])
Changed files 6cli/index.ts, cli/analyze.ts, core/embedding-mode.ts, core/run-analyze.ts, test/unit/analyze-embeddings-limit.test.ts, test/unit/run-analyze.test.ts
CI — Dependency Review ✅ Passed
CI — Gitleaks ✅ Passed
CI — CodeQL ✅ Passed
CI — Main CI ⏳ In progress
CI — Docker Build ⏳ In progress
Issue #382 Linked via "Closes #382" ✅
Stale PR #389 Intentionally superseded — one flag, three states ✅
Hidden Unicode (bidi/invisible) None found in any changed file — em dashes, box-drawing, bullets in comments only (all visible, pre-existing) ✅
Vercel Irrelevant — CLI/core only ✅
Latest code vs PR body Exceeds PR bodyembedding-mode.ts extraction and full test coverage added in commit 2, not mentioned in original PR description

Note for reviewer: The PR now has 6 changed files, not 3. The second commit (test+chore) addressed both gaps flagged in the prior review: (1) missing automated tests and (2) validation firing after bar.start(). Both are resolved.


Understanding of the change

The 50k-node cap exists because generating embeddings for very large graphs can exhaust heap memory — the ONNX runtime loads full model weights and processes every node's text. Users on large monorepos had no in-CLI way to opt out of the automatic skip. This PR folds the override into the existing --embeddings [limit] flag rather than adding a second flag.

Commander's optional-value syntax produces three states: undefined (omitted), true (bare flag), string (with value). The CLI's AnalyzeOptions.embeddings?: boolean | string captures this; embeddingsEnabled = !!options?.embeddings converts to boolean for the core; embeddingsNodeLimit carries the parsed numeric cap as a clean typed value. The validation block now runs before bar.start(), matching all sibling validations.

Critically, the cap logic has been extracted from inline code in run-analyze.ts into a pure deriveEmbeddingCap(nodeCount, embeddingsNodeLimit) function in embedding-mode.ts — the same file as deriveEmbeddingMode. This makes both branch contracts unit-testable without spinning up LadybugDB, and the new tests take full advantage of this.


Findings

[Informational] Commander tokenizer makes --embeddings -1 unreachable from real CLI

  • Category: CLI usability, Commander edge case
  • Files: gitnexus/test/unit/analyze-embeddings-limit.test.ts:44
  • Issue: The test at line 44 lists '-1' (string) in the invalid-values array and passes it as embeddings: '-1' directly to analyzeCommand(). This correctly tests the programmatic path. Via the actual CLI however, --embeddings -1 causes Commander to tokenize -1 as a flag-like token, not as the optional value — so options.embeddings becomes true (bare flag) and Commander emits an "unknown option" error. The parsed < 0 guard at analyze.ts:184 is therefore unreachable from the real CLI for the specific case of -1. For --embeddings=-1 (equals form), Commander does parse it as the string "-1" and the guard fires correctly.
  • Why it matters: The test documents programmatic safety but the test label says "rejects invalid --embeddings value -1". A future contributor may not realize the actual CLI behavior differs from what the test implies.
  • Recommended fix: Add a comment in the test explaining the CLI vs programmatic distinction for -1, or document it in the help text: "Note: --embeddings -1 is intercepted by Commander as an unknown option; use --embeddings=-1 to test the validator." Not blocking — behavior is safe either way.
  • Blocks merge: No.

[Informational] OOM advisory and skip message text are not directly tested

  • Category: Test coverage (DoD §2.7)
  • Files: gitnexus/src/core/run-analyze.ts:351–364, gitnexus/test/unit/run-analyze.test.ts
  • Issue: deriveEmbeddingCap is fully tested (6 cases covering all branches). But the log strings assembled in run-analyze.ts lines 352–364 (OOM advisory text and skip message with --embeddings 0 / --embeddings <n> guidance) are not asserted. If someone changes the message text or removes the advisory branch, no test fails.
  • Why it matters: Messages are user-visible and part of the UX promise made in the PR description. Minor — behavior correctness is covered; only text content is untested.
  • Recommended fix: Optionally add a spy on callbacks.onLog in a future test of runFullAnalysis to assert message content for the skip and advisory paths. Not required for merge.
  • Blocks merge: No.

[Informational] runFullAnalysis does not re-validate embeddingsNodeLimit for programmatic callers

  • Category: API robustness, OOM guardrail (DoD §2.1)
  • Files: gitnexus/src/core/run-analyze.ts:69, gitnexus/src/core/embedding-mode.ts:55–63
  • Issue: The CLI validates embeddingsNodeLimit (non-negative integer, no decimals). A programmatic caller that bypasses the CLI and passes embeddingsNodeLimit: -5 would have deriveEmbeddingCap(nodeCount, -5)nodeCount > -5 → always true → always skip. The failure mode is safe by accident (always skips for negative), but the API contract is silent on this.
  • Why it matters: Low practical risk — no current caller passes negative values and the CLI guards against it. But worth a JSDoc note on embeddingsNodeLimit in the core AnalyzeOptions interface.
  • Recommended fix: Add to the JSDoc at run-analyze.ts:69: "Negative values are not validated here — use the CLI for user-facing input validation. A negative value will behave as an always-skip cap." One line, no code change needed.
  • Blocks merge: No.

CLI parsing assessment

index.ts:27--embeddings [limit] ✅. Help text (lines 28–30) is clear, concise, and actionable: mentions off-by-default, 50k cap, 0 to disable, custom <n>. ✅

analyze.ts:177–193 — Validation block sits before bar.start() at line 305. This resolves the prior review's finding about hidden cursor on error exit. The block uses process.exitCode = 1; return to match every other early-exit validation in the function. ✅

Validation coverage:

  • "abc"Number("abc") = NaN, Number.isInteger(NaN) = false → rejected ✅
  • "1.5"Number.isInteger(1.5) = false → rejected ✅
  • "NaN" → NaN path → rejected ✅
  • "Infinity"Number.isInteger(Infinity) = false → rejected ✅
  • "-1" (string, programmatic) → parsed < 0 → rejected ✅
  • "0" → parsed=0, integer, ≥0 → accepted, embeddingsNodeLimit=0
  • "100000" → integer, ≥0 → accepted, embeddingsNodeLimit=100000
  • true (bare flag) → string branch skipped, embeddingsNodeLimit=undefined
  • undefined (omitted) → string branch skipped, embeddingsEnabled=false

No duplicate embeddings: property — only embeddings: embeddingsEnabled at line 367. ✅

--embeddings --drop-embeddings: Commander sees --drop-embeddings as a flag token, so options.embeddings = true and options.dropEmbeddings = true. Both are handled correctly downstream by deriveEmbeddingMode. ✅


Core embedding cap assessment

embedding-mode.ts:55–63deriveEmbeddingCap is a pure function extracting the cap-decision logic. This is a net improvement over the previous inline implementation noted in the prior review.

const nodeLimit = embeddingsNodeLimit ?? DEFAULT_EMBEDDING_NODE_LIMIT;
const capDisabled = nodeLimit === 0;
const skipForCap = !capDisabled && nodeCount > nodeLimit;

Truth table:

embeddingsNodeLimit nodeCount capDisabled skipForCap
undefined ≤50k false false → generate ✅
undefined >50k false true → skip ✅
0 any true false → generate ✅
100000 ≤100k false false → generate ✅
100000 >100k false true → skip ✅

No surviving legacy if (stats.nodes <= EMBEDDING_NODE_LIMIT) outer guard. ✅

run-analyze.ts:343–365:

  • embeddingSkipped starts true, set to false only when generating ✅
  • OOM advisory fires only when capDisabled && stats.nodes > DEFAULT_EMBEDDING_NODE_LIMIT — small repos are not noisy ✅
  • Skip message uses nodeLimit.toLocaleString() (active cap, not hardcoded 50k) ✅
  • semanticMode set in all branches ('vector-index' or 'exact-scan'), falls back via runtimeCapabilities

OOM guardrail assessment

The default path is safe: embeddingsNodeLimit is undefined in AnalyzeOptions and the CLI only sets it when the user passes a string value. No environment variable or config path can accidentally disable the cap — the only assignment in the codebase is embeddingsNodeLimit = parsed at analyze.ts:192, which requires an explicit string argument.

false, null, and empty string cannot disable the cap: !!options?.embeddings with false or null sets embeddingsEnabled=false, short-circuiting the entire cap decision. Empty string cannot reach the validator from Commander.

Custom caps above 50k do not trigger the OOM advisory — only capDisabled && stats.nodes > DEFAULT_EMBEDDING_NODE_LIMIT does. A user setting --embeddings 75000 on a 60k-node repo proceeds silently; the guardrail is the user's custom choice. This is correct behavior.

The error handler in analyze.ts:541–555 already guides users to increase NODE_OPTIONS=--max-old-space-size if they hit OOM during analysis. The OOM advisory for --embeddings 0 does not duplicate heap guidance — it references the cap rationale only. Appropriate.


Existing embedding behavior assessment

deriveEmbeddingMode is unchanged. It receives options.embeddings as boolean (after embeddingsEnabled = !!options?.embeddings), which is identical to the previous behavior. All 8 existing tests pass. ✅

--drop-embeddings alone: embeddingsEnabled=falseshouldGenerateEmbeddings=false, shouldLoadCache=false — unchanged. ✅

--embeddings 0 --drop-embeddings: embeddingsEnabled=true, embeddingsNodeLimit=0. deriveEmbeddingMode({embeddings:true, dropEmbeddings:true}, n)shouldGenerateEmbeddings=true, shouldLoadCache=false. Then deriveEmbeddingCap(nodes, 0) → generate with no cache load. Embeddings are generated fresh from scratch. Coherent and intentional. ✅ (Covered by test at run-analyze.test.ts:135.)

Already-up-to-date fast path: returns before reaching any cap logic. embeddingsNodeLimit is not read. repoName derivation is unchanged. ✅

Embedding server mapping and backend selection: unchanged — runEmbeddingPipeline signature and call site at run-analyze.ts:401–418 are unmodified. ✅


Test assessment

analyze-embeddings-limit.test.ts (new, 101 lines):

  • Uses the same mock-and-call pattern as analyze-worker-timeout.test.ts
  • Mocks: run-analyze, lbug-adapter, repo-manager, git, max-file-size — correct minimal set ✅
  • NODE_OPTIONS injection at beforeEach:41 bypasses ensureHeap() re-exec ✅
  • vi.resetModules() + dynamic import('../../src/cli/analyze.js') per test ensures clean module state ✅
  • Invalid values: abc, -1, 1.5, NaN, Infinity → asserts process.exitCode===1, runFullAnalysis not called, error message contains expected text ✅
  • Bare --embeddingsopts.embeddingsNodeLimit is undefined
  • --embeddings 0opts.embeddingsNodeLimit === 0
  • --embeddings 100000opts.embeddingsNodeLimit === 100000
  • Omitted --embeddingsopts.embeddings === false

run-analyze.test.ts (extended, +41 lines):

  • Imports deriveEmbeddingCap and DEFAULT_EMBEDDING_NODE_LIMIT from embedding-mode.ts directly ✅
  • 6 deriveEmbeddingCap tests: default cap, exceeds default, boundary (equals cap = no skip), limit=0 disables, custom above/below, custom below default ✅
  • Tests exercise the production function, not copied logic ✅
  • Tests would fail before this PR (the function didn't exist) ✅

Coverage gaps (minor, non-blocking):

  • OOM advisory and skip message text not asserted
  • --embeddings -1 test tests programmatic path; CLI path differs (Commander tokenizer)

Hidden Unicode / hygiene assessment

Bidi/invisible controls (U+202A–202E, U+2066–2069): none found in any of the 6 changed files. ✅

Non-ASCII characters present: em dashes (U+2014) in comments, en dashes (U+2013) in ranges, box-drawing chars (U+2500/U+2550) in section headers, block chars (U+2588/U+2591) in progress bar format, bullet (U+2022) in error messages, right arrow (U+2192) in comments. All visible, pre-existing in the codebase, none in executable code paths. ✅

No stale or duplicate properties: only embeddings: embeddingsEnabled at analyze.ts:367 — the old embeddings: options?.embeddings that could have leaked Commander string types into core is absent. ✅

No unrelated file churn. No package-lock changes. Diff is precisely scoped to the 4 source files + 2 test files for this feature. ✅


Elegance / maintainability assessment

The extraction of deriveEmbeddingCap into embedding-mode.ts is the standout design decision. Both pure derivation functions now live in the same module, separated from all native dependencies, and can be tested in isolation. This goes beyond what was asked and materially improves the codebase's testability surface.

CLI concerns stay in the CLI layer: boolean | string parsing, process.exitCode management, progress bar lifecycle. Core receives embeddings: boolean + embeddingsNodeLimit?: number — no Commander semantics leak. ✅

DEFAULT_EMBEDDING_NODE_LIMIT is defined once in embedding-mode.ts and re-exported from run-analyze.ts for backward compatibility with external callers. ✅

The comment at analyze.ts:98–104 explaining Commander's three-valued behavior is clear and appropriately placed. ✅

The implementation is the smallest correct solution: no speculative abstraction, no second flag, no broad embeddings refactor. Easy to extend if a future config-based cap is added (just another embeddingsNodeLimit source, same core path). ✅


Final verdict

Production-ready.

All three blocking signals from the prior review are resolved: automated tests now cover CLI parsing validation and cap-decision logic; the validation block was moved before bar.start() so terminal state is clean on error exit; the cap logic is centralized in a pure deriveEmbeddingCap function with no surviving legacy guard. The core API boundary is clean — runFullAnalysis receives typed boolean + number | undefined, never a Commander string. No hidden Unicode affects executable code. The deriveEmbeddingCap extraction is a concrete improvement to the codebase's testability beyond the stated scope of the PR.

The three remaining informational items (Commander tokenizer behavior for --embeddings -1, untested advisory message text, and the absence of embeddingsNodeLimit validation for programmatic callers) are documented above but none constitute a correctness risk or production hazard. The OOM guardrail holds by default, explicit opt-out works correctly, and invalid CLI inputs are rejected before the pipeline starts.
| Branch

@magyargergo magyargergo merged commit 96578aa into abhigyanpatwari:main May 6, 2026
24 of 25 checks passed
@magyargergo magyargergo mentioned this pull request May 10, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request - Allow exceeding 50,000 node limit for embeddings

2 participants