diff --git a/README.md b/README.md
index 6e6c7193d1..b3caf7f7a2 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,5 @@
# GitNexus
+
**⚠️ Important Notice:** GitNexus has NO official cryptocurrency, token, or coin. Any token/coin using the GitNexus name on Pump.fun or any other platform is **not affiliated with, endorsed by, or created by** this project or its maintainers. Do not purchase any cryptocurrency claiming association with GitNexus.
@@ -30,14 +31,9 @@
Indexes any codebase into a knowledge graph — every dependency, call chain, cluster, and execution flow — then exposes it through smart tools so AI agents never miss code.
-
-
-
https://github.com/user-attachments/assets/172685ba-8e54-4ea7-9ad1-e31a3398da72
-
-
-> *Like DeepWiki, but deeper.* DeepWiki helps you *understand* code. GitNexus lets you *analyze* it — because a knowledge graph tracks every relationship, not just descriptions.
+> _Like DeepWiki, but deeper._ DeepWiki helps you _understand_ code. GitNexus lets you _analyze_ it — because a knowledge graph tracks every relationship, not just descriptions.
**TL;DR:** The **Web UI** is a quick way to chat with any repo. The **CLI + MCP** is how you make your AI agent actually reliable — it gives Cursor, Claude Code, Codex, and friends a deep architectural view of your codebase so they stop missing dependencies, breaking call chains, and shipping blind edits. Even smaller models get full architectural clarity, making it compete with Goliath models.
@@ -47,18 +43,17 @@ https://github.com/user-attachments/assets/172685ba-8e54-4ea7-9ad1-e31a3398da72
[](https://www.star-history.com/#abhigyanpatwari/GitNexus&type=date&legend=top-left)
-
## Two Ways to Use GitNexus
-| | **CLI + MCP** | **Web UI** |
-| ----------------- | -------------------------------------------------------------- | ------------------------------------------------------------ |
-| **What** | Index repos locally, connect AI agents via MCP | Visual graph explorer + AI chat in browser |
-| **For** | Daily development with Cursor, Claude Code, Codex, Windsurf, OpenCode | Quick exploration, demos, one-off analysis |
-| **Scale** | Full repos, any size | Limited by browser memory (~5k files), or unlimited via backend mode |
-| **Install** | `npm install -g gitnexus` | No install — [gitnexus.vercel.app](https://gitnexus.vercel.app) |
-| **Storage** | LadybugDB native (fast, persistent) | LadybugDB WASM (in-memory, per session) |
-| **Parsing** | Tree-sitter native bindings | Tree-sitter WASM |
-| **Privacy** | Everything local, no network | Everything in-browser, no server |
+| | **CLI + MCP** | **Web UI** |
+| ----------- | --------------------------------------------------------------------- | -------------------------------------------------------------------- |
+| **What** | Index repos locally, connect AI agents via MCP | Visual graph explorer + AI chat in browser |
+| **For** | Daily development with Cursor, Claude Code, Codex, Windsurf, OpenCode | Quick exploration, demos, one-off analysis |
+| **Scale** | Full repos, any size | Limited by browser memory (~5k files), or unlimited via backend mode |
+| **Install** | `npm install -g gitnexus` | No install — [gitnexus.vercel.app](https://gitnexus.vercel.app) |
+| **Storage** | LadybugDB native (fast, persistent) | LadybugDB WASM (in-memory, per session) |
+| **Parsing** | Tree-sitter native bindings | Tree-sitter WASM |
+| **Privacy** | Everything local, no network | Everything in-browser, no server |
> **Bridge mode:** `gitnexus serve` connects the two — the web UI auto-detects the local server and can browse all your CLI-indexed repos without re-uploading or re-indexing.
@@ -69,6 +64,7 @@ https://github.com/user-attachments/assets/172685ba-8e54-4ea7-9ad1-e31a3398da72
GitNexus is available as an **enterprise offering** - either as a fully managed **SaaS** or a **self-hosted** deployment. Also available for **commercial use** of the OSS version with proper licensing.
Enterprise includes:
+
- **PR Review** - automated blast radius analysis on pull requests
- **Auto-updating Code Wiki** - always up-to-date documentation (Code Wiki is also available in OSS)
- **Auto-reindexing** - knowledge graph stays fresh automatically
@@ -77,6 +73,7 @@ Enterprise includes:
- **Priority feature/language support** - request new languages or features
**Upcoming:**
+
- Auto regression forensics
- End-to-end test generation
@@ -117,13 +114,13 @@ To configure MCP for your editor, run `npx gitnexus setup` once — or set it up
### Editor Support
-| Editor | MCP | Skills | Hooks (auto-augment) | Support |
-| --------------------- | --- | ------ | -------------------- | -------------- |
-| **Claude Code** | Yes | Yes | Yes (PreToolUse + PostToolUse) | **Full** |
-| **Cursor** | Yes | Yes | Yes (postToolUse, [manual install](gitnexus-cursor-integration/README.md#hook-install)) | **Full** |
-| **Codex** | Yes | Yes | — | MCP + Skills |
-| **Windsurf** | Yes | — | — | MCP |
-| **OpenCode** | Yes | Yes | — | MCP + Skills |
+| Editor | MCP | Skills | Hooks (auto-augment) | Support |
+| --------------- | --- | ------ | --------------------------------------------------------------------------------------- | ------------ |
+| **Claude Code** | Yes | Yes | Yes (PreToolUse + PostToolUse) | **Full** |
+| **Cursor** | Yes | Yes | Yes (postToolUse, [manual install](gitnexus-cursor-integration/README.md#hook-install)) | **Full** |
+| **Codex** | Yes | Yes | — | MCP + Skills |
+| **Windsurf** | Yes | — | — | MCP |
+| **OpenCode** | Yes | Yes | — | MCP + Skills |
> **Claude Code** gets the deepest integration: MCP tools + agent skills + PreToolUse hooks that enrich searches with graph context + PostToolUse hooks that detect a stale index after commits and prompt the agent to reindex.
@@ -131,10 +128,10 @@ To configure MCP for your editor, run `npx gitnexus setup` once — or set it up
Built by the community — not officially maintained, but worth checking out.
-| Project | Author | Description |
-|---------|--------|-------------|
-| [pi-gitnexus](https://github.com/tintinweb/pi-gitnexus) | [@tintinweb](https://github.com/tintinweb) | GitNexus plugin for [pi](https://pi.dev) — `pi install npm:pi-gitnexus` |
-| [gitnexus-stable-ops](https://github.com/ShunsukeHayashi/gitnexus-stable-ops) | [@ShunsukeHayashi](https://github.com/ShunsukeHayashi) | Stable ops & deployment workflows (Miyabi ecosystem) |
+| Project | Author | Description |
+| ----------------------------------------------------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------------------- |
+| [pi-gitnexus](https://github.com/tintinweb/pi-gitnexus) | [@tintinweb](https://github.com/tintinweb) | GitNexus plugin for [pi](https://pi.dev) — `pi install npm:pi-gitnexus` |
+| [gitnexus-stable-ops](https://github.com/ShunsukeHayashi/gitnexus-stable-ops) | [@ShunsukeHayashi](https://github.com/ShunsukeHayashi) | Stable ops & deployment workflows (Miyabi ecosystem) |
> Have a project built on GitNexus? Open a PR to add it here!
@@ -206,6 +203,7 @@ gitnexus analyze --skip-git # Index folders that are not Git repositories
gitnexus analyze --embeddings # Enable embedding generation (slower, better search)
gitnexus analyze --verbose # Log skipped files when parsers are unavailable
gitnexus analyze --worker-timeout 60 # Increase worker idle timeout for slow parses
+gitnexus analyze --workers <n> # Parse worker pool size (default: cores-1, capped at 16; 0 = sequential)
gitnexus mcp # Start MCP server (stdio) — serves all indexed repos
gitnexus serve # Start local HTTP server (multi-repo) for web UI connection
gitnexus list # List all indexed repositories
@@ -230,6 +228,25 @@ gitnexus group status # Check staleness of repos in a group
If `analyze` reports a worker parse timeout on a large or unusual repository, it keeps running and falls back safely. To give slow worker jobs more time, use `gitnexus analyze --worker-timeout 60` or set `GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS=60000`. For very large files, `GITNEXUS_WORKER_SUB_BATCH_MAX_BYTES` controls the worker job byte budget.
+#### Environment variables
+
+Most `analyze` knobs are also CLI flags (`--workers`, `--worker-timeout`, `--max-file-size`, `--verbose`). Use the env-var form when you'd otherwise repeat the same flag every run, or when invoking GitNexus from a long-running host (MCP server, eval-server, CI shell) that already manages its own environment. CLI flags take precedence over env vars; env vars take precedence over built-in defaults.
+
+| Variable | Default | Effect | Tune when… |
+| -------------------------------------- | ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
+| `GITNEXUS_WORKER_POOL_SIZE` | `cores - 1`, capped at 16 | Parse worker pool size. `0` disables the pool (sequential fallback). Equivalent to `--workers <n>`. | Constrained containers (cgroup CPU limits), CI runners with explicit quotas, or debugging a worker-only crash via `0`. |
+| `GITNEXUS_PARSE_CHUNK_CONCURRENCY` | `2` | Number of chunks whose file contents may be read into memory in parallel while the pool dispatches the current chunk. Worker dispatch itself stays serial. | Repos large enough to chunk (multi-MB total source) where disk I/O is a measurable fraction of analyze wall-clock. |
+| `GITNEXUS_VERBOSE` | unset | When `1`, enables verbose ingestion logs (skipped-file warnings, per-chunk throughput, parse-cache stats). Equivalent to `--verbose`. | Debugging an analyze that "completed" but seems to have missed files; tuning `--workers` / chunk concurrency against observable throughput. |
+| `GITNEXUS_MAX_FILE_SIZE` | `512` (KB) | Walker skip threshold in KB. Hard cap is `32768` (tree-sitter buffer ceiling). Equivalent to `--max-file-size <kb>`. | Indexing repos with intentionally-large source files (generated parsers, vendored bundles) that should still be parsed. |
+| `GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS` | `30000` | Worker idle timeout in milliseconds before retry/fallback. Equivalent to `--worker-timeout <seconds>` × 1000. | Slow-parsing files (large minified JS, deeply-nested TS types) that legitimately need more than 30s. |
+| `GITNEXUS_WORKER_SUB_BATCH_MAX_BYTES` | `8388608` (8 MB) | Per-job byte budget the pool will send to a worker in one `postMessage`. | Very large individual files; mostly diagnostic — bumping past 8 MB risks structured-clone memory pressure. |
+| `GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT` | `3` | Max replacement spawns per worker slot before the slot is dropped from the active rotation. Bounds respawn loops on a chronically-crashing slot. | Hosts where a flaky worker should retry more (raise) or fail-fast (lower) before the slot is dropped. |
+| `GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS` | `5 × subBatchTimeoutMs` | Total retry wall-time budget per job before quarantining. Combined with `timeoutBackoffFactor`, prevents exponentially-growing retries from stalling for hours. | Slow files that legitimately need long total retry windows; lower to fail-fast on stalls. |
+| `GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD`| `max(3, poolSize)` | Per-slot consecutive deaths before the pool's circuit breaker trips. After tripping, every subsequent dispatch rejects until a fresh pool is created. | Hosts where a SIGSEGV-prone native grammar should trip the breaker sooner; CI runners that should fail loudly. |
+| `GITNEXUS_CHUNK_BYTE_BUDGET` | `2097152` (2 MB) | Chunk boundary used for cache-key composition and dispatch. Smaller = finer-grained cache hits but more dispatch overhead. | Tuning incremental-analyze cache behavior on monorepos. |
+| `GITNEXUS_NO_GITIGNORE` | unset | When set, skips `.gitignore` parsing. `.gitnexusignore` is still honored. | Indexing a repo whose `.gitignore` excludes files you actually want indexed (e.g., generated code committed for cross-repo lookup). |
+| `GITNEXUS_SKIP_OPTIONAL_GRAMMARS` | unset | When `=1` strictly, skips native builds for `tree-sitter-dart` / `tree-sitter-proto` at install time. | Installing on a host without a C++ toolchain; you're willing to skip Dart/Proto parsing. |
+
#### Publishing to understand-quickly (opt-in)
[`looptech-ai/understand-quickly`](https://github.com/looptech-ai/understand-quickly) is a public registry of code-knowledge graphs that lists `gitnexus@1` as a first-class format. After registering your repo once (`npx @understand-quickly/cli add` or the [wizard](https://looptech-ai.github.io/understand-quickly/add.html)), `gitnexus publish` fires a single `repository_dispatch` event so the registry resyncs your entry on demand instead of waiting for the nightly job.
@@ -240,27 +257,27 @@ It is opt-in and a no-op without `UNDERSTAND_QUICKLY_TOKEN` — a fine-grained G
**16 tools** exposed via MCP (11 per-repo + 5 group):
-| Tool | What It Does | `repo` Param |
-| ------------------ | ----------------------------------------------------------------- | -------------- |
-| `list_repos` | Discover all indexed repositories | — |
-| `query` | Process-grouped hybrid search (BM25 + semantic + RRF) | Optional |
-| `context` | 360-degree symbol view — categorized refs, process participation | Optional |
-| `impact` | Blast radius analysis with depth grouping and confidence | Optional |
-| `detect_changes` | Git-diff impact — maps changed lines to affected processes | Optional |
-| `rename` | Multi-file coordinated rename with graph + text search | Optional |
-| `cypher` | Raw Cypher graph queries | Optional |
-| `group_list` | List configured repository groups | — |
-| `group_sync` | Extract contracts and match across repos/services | — |
-| `group_contracts`| Inspect extracted contracts and cross-links | — |
-| `group_query` | Search execution flows across all repos in a group | — |
-| `group_status` | Check staleness of repos in a group | — |
+| Tool | What It Does | `repo` Param |
+| ----------------- | ---------------------------------------------------------------- | ------------ |
+| `list_repos` | Discover all indexed repositories | — |
+| `query` | Process-grouped hybrid search (BM25 + semantic + RRF) | Optional |
+| `context` | 360-degree symbol view — categorized refs, process participation | Optional |
+| `impact` | Blast radius analysis with depth grouping and confidence | Optional |
+| `detect_changes` | Git-diff impact — maps changed lines to affected processes | Optional |
+| `rename` | Multi-file coordinated rename with graph + text search | Optional |
+| `cypher` | Raw Cypher graph queries | Optional |
+| `group_list` | List configured repository groups | — |
+| `group_sync` | Extract contracts and match across repos/services | — |
+| `group_contracts` | Inspect extracted contracts and cross-links | — |
+| `group_query` | Search execution flows across all repos in a group | — |
+| `group_status` | Check staleness of repos in a group | — |
> When only one repo is indexed, the `repo` parameter is optional. With multiple repos, specify which one: `query({query: "auth", repo: "my-app"})`.
**Resources** for instant context:
-| Resource | Purpose |
-| ----------------------------------------- | ---------------------------------------------------- |
+| Resource | Purpose |
+| --------------------------------------- | ---------------------------------------------------- |
| `gitnexus://repos` | List all indexed repositories (read this first) |
| `gitnexus://repo/{name}/context` | Codebase stats, staleness check, and available tools |
| `gitnexus://repo/{name}/clusters` | All functional clusters with cohesion scores |
@@ -271,9 +288,9 @@ It is opt-in and a no-op without `UNDERSTAND_QUICKLY_TOKEN` — a fine-grained G
**2 MCP prompts** for guided workflows:
-| Prompt | What It Does |
-| ----------------- | ------------------------------------------------------------------------- |
-| `detect_impact` | Pre-commit change analysis — scope, affected processes, risk level |
+| Prompt | What It Does |
+| --------------- | ------------------------------------------------------------------------- |
+| `detect_impact` | Pre-commit change analysis — scope, affected processes, risk level |
| `generate_map` | Architecture documentation from the knowledge graph with mermaid diagrams |
**4 agent skills** installed to `.claude/skills/` automatically:
@@ -360,10 +377,10 @@ npx gitnexus@latest serve
The official Docker setup ships **two signed images** orchestrated by `docker-compose.yaml`. Each image is published to both **GitHub Container Registry** (GHCR) and **Docker Hub** — same build, same digest, same Cosign signature — so pick whichever registry you prefer:
-| Purpose | GHCR (default in `docker-compose.yaml`) | Docker Hub mirror |
-| ---------------------------------------------------------------------- | --------------------------------------------- | ------------------------------------------- |
-| CLI / `gitnexus serve` backend (HTTP API on port `4747`, MCP, indexer) | `ghcr.io/abhigyanpatwari/gitnexus:latest` | `akonlabs/gitnexus:latest` |
-| Static web UI (port `4173`) | `ghcr.io/abhigyanpatwari/gitnexus-web:latest` | `akonlabs/gitnexus-web:latest` |
+| Purpose | GHCR (default in `docker-compose.yaml`) | Docker Hub mirror |
+| ---------------------------------------------------------------------- | --------------------------------------------- | ------------------------------ |
+| CLI / `gitnexus serve` backend (HTTP API on port `4747`, MCP, indexer) | `ghcr.io/abhigyanpatwari/gitnexus:latest` | `akonlabs/gitnexus:latest` |
+| Static web UI (port `4173`) | `ghcr.io/abhigyanpatwari/gitnexus-web:latest` | `akonlabs/gitnexus-web:latest` |
> **Heads-up — image rename.** Earlier releases published the web UI under
> `ghcr.io/abhigyanpatwari/gitnexus`. Starting with the introduction of the
@@ -579,22 +596,22 @@ GitNexus builds a complete knowledge graph of your codebase through a multi-phas
### Supported Languages
-| Language | Imports | Named Bindings | Exports | Heritage | Type Annotations | Constructor Inference | Config | Frameworks | Entry Points |
-|----------|---------|----------------|---------|----------|-----------------|---------------------|--------|------------|-------------|
-| TypeScript | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| JavaScript | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ | ✓ | ✓ |
-| Python | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Java | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
-| Kotlin | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
-| C# | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Go | ✓ | — | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Rust | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
-| PHP | ✓ | ✓ | ✓ | — | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Ruby | ✓ | — | ✓ | ✓ | — | ✓ | — | ✓ | ✓ |
-| Swift | — | — | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| C | — | — | ✓ | — | ✓ | ✓ | — | ✓ | ✓ |
-| C++ | — | — | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
-| Dart | ✓ | — | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
+| Language | Imports | Named Bindings | Exports | Heritage | Type Annotations | Constructor Inference | Config | Frameworks | Entry Points |
+| ---------- | ------- | -------------- | ------- | -------- | ---------------- | --------------------- | ------ | ---------- | ------------ |
+| TypeScript | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| JavaScript | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ | ✓ | ✓ |
+| Python | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| Java | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
+| Kotlin | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
+| C# | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| Go | ✓ | — | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| Rust | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
+| PHP | ✓ | ✓ | ✓ | — | ✓ | ✓ | ✓ | ✓ | ✓ |
+| Ruby | ✓ | — | ✓ | ✓ | — | ✓ | — | ✓ | ✓ |
+| Swift | — | — | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| C | — | — | ✓ | — | ✓ | ✓ | — | ✓ | ✓ |
+| C++ | — | — | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
+| Dart | ✓ | — | ✓ | ✓ | ✓ | ✓ | — | ✓ | ✓ |
**Imports** — cross-file import resolution · **Named Bindings** — `import { X as Y }` / re-export tracking · **Exports** — public/exported symbol detection · **Heritage** — class inheritance, interfaces, mixins · **Type Annotations** — explicit type extraction for receiver resolution · **Constructor Inference** — infer receiver type from constructor calls (`self`/`this` resolution included for all languages) · **Config** — language toolchain config parsing (tsconfig, go.mod, etc.) · **Frameworks** — AST-based framework pattern detection · **Entry Points** — entry point scoring heuristics
@@ -739,16 +756,16 @@ The wiki generator reads the indexed graph structure, groups files into modules
## Tech Stack
-| Layer | CLI | Web |
-| ------------------------- | ------------------------------------- | --------------------------------------- |
+| Layer | CLI | Web |
+| ------------------- | ------------------------------------- | --------------------------------------- |
| **Runtime** | Node.js (native) | Browser (WASM) |
| **Parsing** | Tree-sitter native bindings | Tree-sitter WASM |
-| **Database** | LadybugDB native | LadybugDB WASM |
+| **Database** | LadybugDB native | LadybugDB WASM |
| **Embeddings** | HuggingFace transformers.js (GPU/CPU) | transformers.js (WebGPU/WASM) |
| **Search** | BM25 + semantic + RRF | BM25 + semantic + RRF |
| **Agent Interface** | MCP (stdio) | LangChain ReAct agent |
-| **Visualization** | — | Sigma.js + Graphology (WebGL) |
-| **Frontend** | — | React 18, TypeScript, Vite, Tailwind v4 |
+| **Visualization** | — | Sigma.js + Graphology (WebGL) |
+| **Frontend** | — | React 18, TypeScript, Vite, Tailwind v4 |
| **Clustering** | Graphology | Graphology |
| **Concurrency** | Worker threads + async | Web Workers + Comlink |
@@ -764,12 +781,12 @@ The wiki generator reads the indexed graph structure, groups files into modules
### Recently Completed
-- [X] Constructor-Inferred Type Resolution, `self`/`this` Receiver Mapping
-- [X] Wiki Generation, Multi-File Rename, Git-Diff Impact Analysis
-- [X] Process-Grouped Search, 360-Degree Context, Claude Code Hooks
-- [X] Multi-Repo MCP, Zero-Config Setup, 14 Language Support
-- [X] Community Detection, Process Detection, Confidence Scoring
-- [X] Hybrid Search, Vector Index
+- [x] Constructor-Inferred Type Resolution, `self`/`this` Receiver Mapping
+- [x] Wiki Generation, Multi-File Rename, Git-Diff Impact Analysis
+- [x] Process-Grouped Search, 360-Degree Context, Claude Code Hooks
+- [x] Multi-Repo MCP, Zero-Config Setup, 14 Language Support
+- [x] Community Detection, Process Detection, Confidence Scoring
+- [x] Hybrid Search, Vector Index
---
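The Environment variables section added in this README diff states a precedence rule: CLI flags win over env vars, and env vars win over built-in defaults. A minimal sketch of that rule for the pool-size knob follows. It is not GitNexus source: `resolveWorkerPoolSize` and `parseCount` are hypothetical names, and falling through to the next source on malformed input is an assumption. Only the `GITNEXUS_WORKER_POOL_SIZE` name, the `cores - 1` default, the cap of 16, and `0` meaning sequential come from the README text.

```typescript
// Hypothetical sketch of "CLI flag > env var > built-in default".
// Function names and malformed-input handling are assumptions, not GitNexus code.

const DEFAULT_POOL_SIZE_CAP = 16; // documented cap on the built-in default

/** Parse a non-negative integer; return undefined for missing or malformed input. */
function parseCount(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === '') return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n >= 0 ? n : undefined;
}

/**
 * Resolve the parse worker pool size.
 * A result of 0 means "no pool, parse sequentially".
 */
function resolveWorkerPoolSize(
  cliFlag: string | undefined,
  env: Record<string, string | undefined>,
  cores: number,
): number {
  // 1. An explicit CLI flag wins.
  const fromFlag = parseCount(cliFlag);
  if (fromFlag !== undefined) return fromFlag;

  // 2. Otherwise the env var applies.
  const fromEnv = parseCount(env['GITNEXUS_WORKER_POOL_SIZE']);
  if (fromEnv !== undefined) return fromEnv;

  // 3. Otherwise the built-in default: cores - 1, capped at 16.
  return Math.max(0, Math.min(cores - 1, DEFAULT_POOL_SIZE_CAP));
}
```

With 12 cores, `resolveWorkerPoolSize('4', { GITNEXUS_WORKER_POOL_SIZE: '8' }, 12)` returns `4`, the same call without the flag returns `8`, and with neither set it returns `11`. Explicit values are deliberately left uncapped in this sketch, since the cap is described as applying to the default.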
diff --git a/gitnexus/README.md b/gitnexus/README.md
index 6400530135..61ae131ace 100644
--- a/gitnexus/README.md
+++ b/gitnexus/README.md
@@ -359,6 +359,16 @@ npx gitnexus analyze
For repositories with very large source files, `GITNEXUS_WORKER_SUB_BATCH_MAX_BYTES` controls the worker job byte budget. The default is **8388608 bytes (8 MB)**.
+### Worker pool resilience tuning
+
+Three env vars expose the pool's resilience layers (respawn budget, cumulative-timeout cap, circuit breaker). Defaults are tuned for typical repos; bump them when an analyze legitimately needs more retries, or lower them to fail-fast on a known-bad shape.
+
+| Variable | Default | Effect |
+| ------------------------------------------------- | ------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
+| `GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT` | `3` | Max replacement spawns per slot before the slot is dropped from the active rotation. |
+| `GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS` | `5 × subBatchTimeoutMs` | Total retry wall-time budget per job before quarantining. Bounds exponentially-growing retry waits. |
+| `GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD` | `max(3, poolSize)` | Per-slot consecutive deaths before the pool's circuit breaker trips. After tripping, dispatches require a fresh pool. |
+
## Privacy
- All processing happens locally on your machine
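The defaults in the "Worker pool resilience tuning" table of this hunk are partly derived from other settings (`5 × subBatchTimeoutMs`, `max(3, poolSize)`). The sketch below shows that arithmetic under stated assumptions: `resolveResilienceConfig`, `readPositiveInt`, and the `ResilienceConfig` shape are hypothetical, and ignoring malformed env values is a guess at reasonable behavior rather than what the pool actually does.

```typescript
// Hypothetical sketch of the documented resilience defaults.
// Only the env var names and default formulas come from the README table.

interface ResilienceConfig {
  maxRespawnsPerSlot: number;
  maxCumulativeTimeoutMs: number;
  consecutiveFailureThreshold: number;
}

/** Parse a strictly positive integer; undefined for missing or malformed input (assumption). */
function readPositiveInt(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

function resolveResilienceConfig(
  env: Record<string, string | undefined>,
  subBatchTimeoutMs: number,
  poolSize: number,
): ResilienceConfig {
  return {
    // Default: 3 replacement spawns per slot.
    maxRespawnsPerSlot:
      readPositiveInt(env['GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT']) ?? 3,
    // Default: five times the per-job idle timeout.
    maxCumulativeTimeoutMs:
      readPositiveInt(env['GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS']) ??
      5 * subBatchTimeoutMs,
    // Default: at least 3, scaling with pool size.
    consecutiveFailureThreshold:
      readPositiveInt(env['GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD']) ??
      Math.max(3, poolSize),
  };
}
```

With the default 30 000 ms idle timeout and a pool of 4, this yields a 150 000 ms cumulative budget and a breaker threshold of 4. Raising the idle timeout to 60 000 ms raises the cumulative budget to 300 000 ms unless the env var pins it.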
diff --git a/gitnexus/bench/parse-throughput.md b/gitnexus/bench/parse-throughput.md
new file mode 100644
index 0000000000..24a7c98b1f
--- /dev/null
+++ b/gitnexus/bench/parse-throughput.md
@@ -0,0 +1,175 @@
+# Parse-throughput benchmark (scaffold)
+
+> **Status: methodology + harness scaffold, no measurement data yet.**
+> The Latest measurement table below contains `_TBD_` placeholders.
+> This file ships intentionally without numbers — populating it
+> requires a dedicated bench-pass against the U6 fixture (and ideally
+> a real-world TS-root-scale repo) on consistent hardware, which is
+> tracked as future work rather than gated on PR #1693's merge.
+> Until the table is populated, the load-bearing perf-regression
+> protection lives in `gitnexus/test/integration/parse-impl-large-fixture.test.ts`
+> (U6, 30 s wall-clock budget via `Promise.race`).
+
+Tracks `runChunkedParseAndResolve` wall-clock + peak heap on a synthetic
+fixture so PR #1693's "analyze no longer hangs on TS-root-shaped loads"
+claim is measurable, not just asserted by smoke tests. The harness
+recipe below is deliberately small enough to re-run in a few minutes
+when the bench-pass is undertaken.
+
+---
+
+## Methodology
+
+### Fixture
+
+Synthetic TypeScript repo, _not_ a clone of microsoft/TypeScript. The CI cost
+of cloning real-world repos is prohibitive; the synthetic shape exercises
+the same pipeline paths (chunking, deferred extraction, cross-chunk
+imports + heritage) without the disk-I/O overhead. Larger numbers can be
+manually captured against real repos and cross-referenced here, but the
+authoritative regression-tracking shape is the synthetic fixture so runs
+are reproducible across hardware.
+
+The fixture matches the structure pinned by
+`gitnexus/test/integration/parse-impl-large-fixture.test.ts` (U6):
+
+- 15 small modules (`mod0.ts` … `mod14.ts`), one exported function each.
+- 1 dense `complex.ts` with 30 functions + 1 class + 1 interface.
+- 1 `index.ts` re-exporting every symbol from every module.
+
+`GITNEXUS_CHUNK_BYTE_BUDGET=64` forces multi-chunk parsing on this small
+fixture — without that override the whole thing fits in one chunk and
+the deferred-extraction path is not exercised end-to-end.
+
+### What to measure
+
+| Metric | How |
+| --------------------------- | -------------------------------------------------------------------------------- |
+| Wall-clock total | `Date.now()` delta around `runChunkedParseAndResolve` |
+| Peak heap | Sample `process.memoryUsage().heapUsed` every 50 ms during the run; keep the max |
+| Chunks observed | Count distinct `Parsing chunk X/Y` progress messages |
+| `getStats()` final snapshot | Quarantined paths, dropped slots, breaker state |
+
+### Hardware shape (record alongside each measurement)
+
+- OS + version
+- CPU model + logical core count
+- RAM
+- Node version
+- gitnexus commit SHA (so the snapshot is anchored to a tree, not "main")
+
+---
+
+## Harness recipe
+
+The U6 test (`test/integration/parse-impl-large-fixture.test.ts`) is the
+checked-in mini-benchmark — it exercises the same fixture and bounds the
+wall-clock at 30 s via `Promise.race`. To produce a richer snapshot for
+this doc, run it under instrumentation:
+
+```bash
+# From the gitnexus/ subdir:
+cd gitnexus
+# Single-threaded baseline (sequential fallback):
+npx vitest run test/integration/parse-impl-large-fixture.test.ts --reporter=verbose
+
+# Worker-pool path (requires built dist/ — pre-built by `npm run build`):
+npm run build && \
+ GITNEXUS_WORKER_POOL_SIZE=4 \
+ GITNEXUS_PARSE_CHUNK_CONCURRENCY=2 \
+ GITNEXUS_VERBOSE=1 \
+ npx vitest run test/integration/parse-impl-large-fixture.test.ts --reporter=verbose
+```
+
+For peak-heap sampling, wrap the dispatch call in a Node script that
+polls `process.memoryUsage()`. A future helper at
+`gitnexus/bench/scripts/parse-throughput.ts` would automate this (it is
+the plan's stretch goal). Until that lands, capture peak heap manually via:
+
+```bash
+node --inspect=0 \
+ --require ./scripts/heap-sampler.js \
+ ./node_modules/.bin/vitest run test/integration/parse-impl-large-fixture.test.ts
+```
+
+---
+
+## Latest measurement
+
+> _No bench-pass data has been collected yet; this file is the
+> methodology + harness scaffold. The only recorded data point is the
+> U6 wall-clock smoke baseline below; the worker-pool rows are
+> placeholders for future bench-pass output._
+
+The U6 integration test (`gitnexus/test/integration/parse-impl-large-fixture.test.ts`)
+was observed completing the synthetic fixture in **~6 seconds** under
+the sequential path (`skipWorkers: true`) on the development machine,
+well under the 30 s `Promise.race` wall-clock budget. That number is a
+smoke baseline only — recorded here for reference, not as a regression
+target.
+
+| Path | files/s | wall-clock | peak heap | chunks | quarantined |
+| ------------------------------------------ | ------- | -------------------- | --------- | ------ | ----------- |
+| Sequential (`skipWorkers: true`, U6 smoke) | _TBD_ | ~6 s _(observation)_ | _TBD_ | 17 | 0 |
+| Worker pool, `--workers 4`, concurrency 2 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | 0 |
+| Worker pool, `--workers 1`, concurrency 1 | _TBD_ | _TBD_ | _TBD_ | _TBD_ | 0 |
+
+**Hardware:** _TBD — record OS, CPU, RAM, Node version, gitnexus SHA at
+the time of the bench-pass that populates the table above._
+
+---
+
+## Operator-tuning quick reference
+
+Cross-links to the env vars documented in the [README](../../README.md#environment-variables).
+Use this section as a starting point when the benchmark numbers above
+suggest a tuning opportunity for your hardware shape.
+
+- **CPU-bound, big repo, lots of cores:** raise `GITNEXUS_WORKER_POOL_SIZE`
+ past the default cap of 16. The 16-worker cap exists because past that
+ point main-thread merge / extraction dominates; if you've measurably
+ ruled that out, the env var lifts the cap explicitly. (See
+ `worker-pool.ts` `DEFAULT_POOL_SIZE_CAP`.)
+- **Slow files (large minified JS, deep TS types):** raise
+ `GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS` past 30 000 ms. The cumulative
+ budget is 5× this value (U10 pins this) so a 60 s idle timeout permits
+ 300 s of total retry-and-split wall-clock before quarantining the file.
+- **Constrained container (cgroup CPU limit):** the pool now uses
+ `os.availableParallelism()` (U3 H2), which honors cgroup limits — no
+ manual `GITNEXUS_WORKER_POOL_SIZE` override needed unless the auto-
+ resolved value is too aggressive for your I/O budget.
+- **Long-running host (eval-server, MCP daemon) running back-to-back
+ analyzes:** `--workers` is now threaded through `AnalyzeOptions`
+ (U2 B2), so per-invocation sizing is honored without `process.env`
+ state leaking across calls. `GITNEXUS_VERBOSE` is similarly snapshot/
+ restore-bracketed.
+
+---
+
+## What this benchmark does NOT measure
+
+- **Real-repo performance.** The synthetic fixture is sized for CI; it
+ doesn't exercise the cumulative-load shape (50k files, occasional
+ pathological file) that drove the original PR #1693 hang report. Real-
+ repo numbers should be captured ad-hoc against the user's target repo
+ and cross-referenced here only as supplementary evidence.
+- **Worker-pool resilience under real crashes.** That's verified by the
+ `worker-pool.test.ts` integration tests (real `process.exit`, real
+ `error` events, real protocol violations) and the unit suite. The
+ benchmark cares about throughput on the happy path.
+- **IPC repack throughput.** Phase 3 of the PR #1693 plan introduces a
+ transferList + binary wire-format IPC repack (U16-U17). Once that
+ lands, an `IPC repack` row should be added to the "Latest measurement"
+ table above with before/after numbers on the same hardware.
+
+---
+
+## Related artifacts
+
+- Plan: `docs/plans/2026-05-20-001-feat-pr1693-resilience-hardening-and-ipc-repack-plan.md`
+- Integration test (mini-benchmark with wall-clock guard): `gitnexus/test/integration/parse-impl-large-fixture.test.ts` (U6)
+- Operator env-var reference: `README.md` → Environment variables
+- Resilience layer tests: `gitnexus/test/unit/worker-pool-resilience.test.ts`,
+ `worker-pool-cumulative-timeout.test.ts`,
+ `worker-pool-windows-quarantine.test.ts`,
+ `worker-pool-slot-generation.test.ts`
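
The timeout-tuning bullet in the benchmark notes above states that the cumulative retry budget is 5× the sub-batch idle timeout. A minimal sketch of that arithmetic follows. It assumes an explicit override (the role `GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS` plays) wins whenever it is a positive finite number; `resolveCumulativeBudgetMs` and `CUMULATIVE_MULTIPLIER` are hypothetical names, not the pool's actual identifiers.

```typescript
// Hypothetical helper: illustrates the "5x the idle timeout" default only.
const CUMULATIVE_MULTIPLIER = 5;

function resolveCumulativeBudgetMs(subBatchTimeoutMs: number, overrideMs?: number): number {
  // Assumption: a positive, finite override replaces the derived default.
  if (overrideMs !== undefined && Number.isFinite(overrideMs) && overrideMs > 0) {
    return overrideMs;
  }
  return subBatchTimeoutMs * CUMULATIVE_MULTIPLIER;
}

const defaultBudget = resolveCumulativeBudgetMs(30_000); // 150000 ms (150 s)
const raisedBudget = resolveCumulativeBudgetMs(60_000); // 300000 ms (300 s)
```

Raising only the idle timeout therefore scales the total retry-and-split window with it, which is why a 60 s idle timeout yields the 300 s figure quoted in the notes.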
diff --git a/gitnexus/src/cli/analyze.ts b/gitnexus/src/cli/analyze.ts
index 70183ecf94..6ea04df05d 100644
--- a/gitnexus/src/cli/analyze.ts
+++ b/gitnexus/src/cli/analyze.ts
@@ -148,7 +148,7 @@ function ensureHeap(): boolean {
stdio: 'inherit',
env: { ...process.env, NODE_OPTIONS: `${nodeOpts} ${HEAP_FLAG}`.trim() },
});
- } catch (e: any) {
+ } catch (e: unknown) {
if (childProcessLikelyOom(e)) {
cliError(
` Analysis likely ran out of memory.\n` +
@@ -159,11 +159,50 @@ function ensureHeap(): boolean {
{ recoveryHint: 'heap-oom-respawn' },
);
}
- process.exitCode = e.status ?? 1;
+ const status =
+ typeof e === 'object' && e !== null && 'status' in e && typeof e.status === 'number'
+ ? e.status
+ : 1;
+ process.exitCode = status;
}
return true;
}
+/**
+ * GITNEXUS_* env vars that `analyzeCommand` writes for backward-compatible
+ * downstream consumption. Snapshotted at function entry and restored in the
+ * finally block so that programmatic callers (tests, long-running hosts)
+ * don't see leaked state across invocations. `GITNEXUS_WORKER_POOL_SIZE` is
+ * NOT in this list: that knob is threaded through `runFullAnalysis` options
+ * (see `workerPoolSize` plumbing) so the CLI never has to mutate `process.env`
+ * for it in the first place.
+ */
+const ANALYZE_CLI_ENV_KEYS = [
+ 'GITNEXUS_VERBOSE',
+ 'GITNEXUS_MAX_FILE_SIZE',
+ 'GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS',
+ 'GITNEXUS_EMBEDDING_THREADS',
+ 'GITNEXUS_EMBEDDING_BATCH_SIZE',
+ 'GITNEXUS_EMBEDDING_SUB_BATCH_SIZE',
+ 'GITNEXUS_EMBEDDING_DEVICE',
+] as const;
+
+type AnalyzeEnvSnapshot = Record<(typeof ANALYZE_CLI_ENV_KEYS)[number], string | undefined>;
+
+const snapshotAnalyzeEnv = (): AnalyzeEnvSnapshot => {
+ const snap = {} as AnalyzeEnvSnapshot;
+ for (const k of ANALYZE_CLI_ENV_KEYS) snap[k] = process.env[k];
+ return snap;
+};
+
+const restoreAnalyzeEnv = (snap: AnalyzeEnvSnapshot): void => {
+ for (const k of ANALYZE_CLI_ENV_KEYS) {
+ const v = snap[k];
+ if (v === undefined) delete process.env[k];
+ else process.env[k] = v;
+ }
+};
+
export interface AnalyzeOptions {
force?: boolean;
repairFts?: boolean;
@@ -226,6 +265,8 @@ export interface AnalyzeOptions {
maxFileSize?: string;
/** Override worker sub-batch idle timeout in seconds. */
workerTimeout?: string;
+ /** Parse worker pool size; 0 disables workers (sequential fallback). */
+ workers?: string;
embeddingThreads?: string;
embeddingBatchSize?: string;
embeddingSubBatchSize?: string;
@@ -259,6 +300,22 @@ export const analyzeCommand = async (inputPath?: string, options?: AnalyzeOption
// a stack trace and a non-zero exit code instead of a silent exit 0.
installFatalHandlers();
+ // Snapshot the GITNEXUS_* env vars that the impl writes for downstream
+ // consumption, so they don't leak across `analyzeCommand` invocations in
+ // programmatic callers (tests, long-running hosts). `process.exit(0)` on
+ // the success path bypasses `finally` — intentional: when the process is
+ // exiting, restoration is moot. For early-return paths (validation
+ // errors) and the alreadyUpToDate fast path the finally restores the
+ // pre-call values.
+ const envSnap = snapshotAnalyzeEnv();
+ try {
+ await analyzeCommandImpl(inputPath, options);
+ } finally {
+ restoreAnalyzeEnv(envSnap);
+ }
+};
+
+const analyzeCommandImpl = async (inputPath?: string, options?: AnalyzeOptions): Promise<void> => {
if (options?.verbose) {
process.env.GITNEXUS_VERBOSE = '1';
}
@@ -279,6 +336,26 @@ export const analyzeCommand = async (inputPath?: string, options?: AnalyzeOption
);
}
+ // `--workers` is threaded through `runFullAnalysis` options → PipelineOptions
+ // → createWorkerPool, intentionally bypassing the GITNEXUS_WORKER_POOL_SIZE
+ // env channel so this CLI surface never mutates `process.env` for pool size.
+ // Tests can therefore re-invoke analyzeCommand with different --workers
+ // values back-to-back and observe the value they passed, not whatever the
+ // previous call leaked.
+ let workerPoolSize: number | undefined;
+ if (options?.workers !== undefined) {
+ const parsedWorkers = Number(options.workers);
+ if (!Number.isInteger(parsedWorkers) || parsedWorkers < 0) {
+ cliError(
+ ' --workers must be a non-negative integer. ' +
+ 'Pass 0 to disable the worker pool (sequential fallback).\n',
+ );
+ process.exitCode = 1;
+ return;
+ }
+ workerPoolSize = parsedWorkers;
+ }
+
// Parse `--embeddings [limit]`: `true` → default cap, string → numeric cap
// (0 disables the cap entirely). Validated up here so failures match the
// sibling-validation pattern (exit before bar.start() — otherwise
@@ -551,6 +628,10 @@ export const analyzeCommand = async (inputPath?: string, options?: AnalyzeOption
// be able to accept the duplicate name without also paying the
// cost of a full pipeline re-index. See #829 review round 2.
allowDuplicateName: options?.allowDuplicateName,
+ // Worker pool size threaded from --workers, replacing the previous
+ // GITNEXUS_WORKER_POOL_SIZE env mutation. `undefined` defers to the
+ // env / auto-formula fallback inside the pipeline.
+ workerPoolSize,
},
{
onProgress: (_phase, percent, message) => {
@@ -688,7 +769,7 @@ export const analyzeCommand = async (inputPath?: string, options?: AnalyzeOption
}
console.log('');
- } catch (err: any) {
+ } catch (err: unknown) {
clearInterval(elapsedTimer);
process.removeListener('SIGINT', sigintHandler);
console.log = origLog;
@@ -698,7 +779,7 @@ export const analyzeCommand = async (inputPath?: string, options?: AnalyzeOption
console.error = origError;
bar.stop();
- const msg = err.message || String(err);
+ const msg = err instanceof Error ? err.message : String(err);
// Registry name-collision from --name (#829) — surface as an
// actionable error rather than a generic stack-trace.
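
The snapshot/restore bracket added to `analyzeCommand` can be exercised in isolation. The sketch below mirrors the shape of `snapshotAnalyzeEnv` and `restoreAnalyzeEnv`, but takes the env object and the key list as parameters so it runs without touching `process.env`. `EnvLike`, `snapshotEnv`, and `restoreEnv` are hypothetical names used for illustration only.

```typescript
// Hypothetical stand-in for process.env so the sketch has no Node dependency.
type EnvLike = Record<string, string | undefined>;

function snapshotEnv<K extends string>(
  env: EnvLike,
  keys: readonly K[],
): Record<K, string | undefined> {
  const snap = {} as Record<K, string | undefined>;
  for (const k of keys) snap[k] = env[k];
  return snap;
}

function restoreEnv<K extends string>(
  env: EnvLike,
  keys: readonly K[],
  snap: Record<K, string | undefined>,
): void {
  for (const k of keys) {
    const v = snap[k];
    // A key that was unset at snapshot time must be removed again,
    // not left behind as an empty or stale value.
    if (v === undefined) delete env[k];
    else env[k] = v;
  }
}

// Usage: mutate inside the bracket, then restore the pre-call state.
const demoEnv: EnvLike = { GITNEXUS_VERBOSE: '0' };
const demoKeys = ['GITNEXUS_VERBOSE', 'GITNEXUS_MAX_FILE_SIZE'] as const;
const demoSnap = snapshotEnv(demoEnv, demoKeys);
demoEnv.GITNEXUS_VERBOSE = '1';
demoEnv.GITNEXUS_MAX_FILE_SIZE = '1024';
restoreEnv(demoEnv, demoKeys, demoSnap);
```

Distinguishing "was unset" from "was set" is the point of the `delete` branch: restoring by assignment alone would leave keys defined that the caller never had.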
diff --git a/gitnexus/src/cli/index.ts b/gitnexus/src/cli/index.ts
index 9450698d3f..31446cf6fb 100644
--- a/gitnexus/src/cli/index.ts
+++ b/gitnexus/src/cli/index.ts
@@ -71,6 +71,10 @@ program
'--worker-timeout ',
'Worker sub-batch idle timeout before retry/fallback. Default: 30.',
)
+ .option(
+ '--workers ',
+ 'Parse worker pool size. Default: cores-1 capped at 16. Pass 0 to disable workers (sequential).',
+ )
.option('--embedding-threads ', 'Limit local ONNX embedding CPU threads')
.option('--embedding-batch-size ', 'Number of nodes per embedding batch')
.option('--embedding-sub-batch-size ', 'Number of chunks per embedding model call')
@@ -82,6 +86,11 @@ program
' GITNEXUS_MAX_FILE_SIZE=N Override large-file skip threshold (KB). Default 512, max 32768.\n' +
' GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS=N Worker idle timeout in milliseconds. Default 30000.\n' +
' GITNEXUS_WORKER_SUB_BATCH_MAX_BYTES=N Worker job byte budget. Default 8388608.\n' +
+ ' GITNEXUS_WORKER_POOL_SIZE=N Parse worker count override. Default cores-1 capped at 16.\n' +
+ ' GITNEXUS_PARSE_CHUNK_CONCURRENCY=N Concurrent in-flight parse chunks. Default 2.\n' +
+ ' GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT=N Max replacement spawns per slot before drop. Default 3.\n' +
+ ' GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS=N Total retry wall-time per job. Default 5x sub-batch timeout.\n' +
+ ' GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD=N Per-slot deaths to trip circuit breaker. Default max(3, poolSize).\n' +
' GITNEXUS_EMBEDDING_THREADS=N Limit local ONNX CPU threads for --embeddings.\n' +
' GITNEXUS_SEMANTIC_EXACT_SCAN_LIMIT=N Max embedding chunks for exact-scan fallback. Default 10000.\n' +
'\nTip: `.gitnexusignore` supports `.gitignore`-style negation. Add e.g.\n' +
diff --git a/gitnexus/src/cli/wiki.ts b/gitnexus/src/cli/wiki.ts
index 8089dd2f2c..6211d371cc 100644
--- a/gitnexus/src/cli/wiki.ts
+++ b/gitnexus/src/cli/wiki.ts
@@ -107,6 +107,24 @@ function prompt(question: string, hide = false): Promise {
}
export const wikiCommand = async (inputPath?: string, options?: WikiCommandOptions) => {
+ // Snapshot GITNEXUS_VERBOSE at entry — wikiCommand mutates it (the impl
+ // below) so cursor-client (process.env-driven) sees the right value during
+ // this run. Restored in finally so back-to-back wiki calls in long-running
+ // hosts don't leak verbose state from one invocation to the next. Pairs
+ // with the same snapshot/restore pattern in `analyzeCommand`.
+ const originalVerbose = process.env.GITNEXUS_VERBOSE;
+ try {
+ await wikiCommandImpl(inputPath, options);
+ } finally {
+ if (originalVerbose === undefined) {
+ delete process.env.GITNEXUS_VERBOSE;
+ } else {
+ process.env.GITNEXUS_VERBOSE = originalVerbose;
+ }
+ }
+};
+
+const wikiCommandImpl = async (inputPath?: string, options?: WikiCommandOptions): Promise<void> => {
// Set verbose mode globally for cursor-client to pick up
if (options?.verbose) {
process.env.GITNEXUS_VERBOSE = '1';
diff --git a/gitnexus/src/core/ingestion/languages/typescript/captures.ts b/gitnexus/src/core/ingestion/languages/typescript/captures.ts
index b82edcff81..22f48e80d9 100644
--- a/gitnexus/src/core/ingestion/languages/typescript/captures.ts
+++ b/gitnexus/src/core/ingestion/languages/typescript/captures.ts
@@ -64,7 +64,7 @@ const CALL_TAGS = [
'@reference.call.constructor',
] as const;
-function pickFirstDefined(grouped: CaptureMatch, tags: readonly string[]): Capture | undefined {
+function pickFirstCapture(grouped: CaptureMatch, tags: readonly string[]): Capture | undefined {
for (const tag of tags) {
const cap = grouped[tag];
if (cap !== undefined) return cap;
@@ -72,6 +72,17 @@ function pickFirstDefined(grouped: CaptureMatch, tags: readonly string[]): Captu
return undefined;
}
+function pickFirstNode(
+  grouped: Record<string, SyntaxNode>,
+ tags: readonly string[],
+): SyntaxNode | undefined {
+ for (const tag of tags) {
+ const node = grouped[tag];
+ if (node !== undefined) return node;
+ }
+ return undefined;
+}
+
/**
* Drop `@reference.read.member` matches whose underlying `member_expression`
* is NOT actually a read context:
@@ -113,6 +124,34 @@ function shouldEmitReadMember(memberNode: SyntaxNode): boolean {
}
}
+/** Walks the parent chain from `node` (inclusive), returning the first node
+ * whose type matches, or null. Faster than `findNodeAtRange` when the caller
+ * already holds the anchor node — avoids re-scanning the tree from the root. */
+function findSelfOrAncestorOfType(node: SyntaxNode | undefined, type: string): SyntaxNode | null {
+ if (node === undefined) return null;
+ let current: SyntaxNode | null = node;
+ while (current !== null) {
+ if (current.type === type) return current;
+ current = current.parent;
+ }
+ return null;
+}
+
+/** Walks the parent chain from `node` (inclusive), returning the first node
+ * whose type is in the set, or null. Plural form of {@link findSelfOrAncestorOfType}. */
+function findSelfOrAncestorOfTypes(
+ node: SyntaxNode | undefined,
+ types: readonly string[],
+): SyntaxNode | null {
+ if (node === undefined) return null;
+ let current: SyntaxNode | null = node;
+ while (current !== null) {
+ if (types.includes(current.type)) return current;
+ current = current.parent;
+ }
+ return null;
+}
+
export function emitTsScopeCaptures(
sourceText: string,
filePath: string,
@@ -151,9 +190,11 @@ export function emitTsScopeCaptures(
// `@`; we put it back so the central extractor's prefix lookups
// (`@scope.`, `@declaration.`, …) work.
const grouped: Record<string, Capture> = {};
+      const groupedNodes: Record<string, SyntaxNode> = {};
for (const c of m.captures) {
const tag = '@' + c.name;
grouped[tag] = nodeToCapture(tag, c.node);
+ groupedNodes[tag] = c.node;
}
if (Object.keys(grouped).length === 0) continue;
@@ -165,6 +206,10 @@ export function emitTsScopeCaptures(
if (grouped['@import.statement'] !== undefined) {
const stmtCapture = grouped['@import.statement'];
const stmtNode =
+ findSelfOrAncestorOfTypes(groupedNodes['@import.statement'], [
+ 'import_statement',
+ 'export_statement',
+ ]) ??
findNodeAtRange(tree.rootNode, stmtCapture.range, 'import_statement') ??
findNodeAtRange(tree.rootNode, stmtCapture.range, 'export_statement');
if (stmtNode !== null) {
@@ -183,7 +228,9 @@ export function emitTsScopeCaptures(
// `splitDynamicImport` branch consumes.
if (grouped['@import.dynamic'] !== undefined) {
const dynCapture = grouped['@import.dynamic'];
- const callNode = findNodeAtRange(tree.rootNode, dynCapture.range, 'call_expression');
+ const callNode =
+ findSelfOrAncestorOfType(groupedNodes['@import.dynamic'], 'call_expression') ??
+ findNodeAtRange(tree.rootNode, dynCapture.range, 'call_expression');
if (callNode !== null) {
const decomposed = splitImportStatement(callNode);
for (const d of decomposed) out.push(d);
@@ -197,7 +244,9 @@ export function emitTsScopeCaptures(
// we rely on this emit-side filter so the query stays simple.
if (grouped['@reference.read.member'] !== undefined) {
const anchor = grouped['@reference.read.member'];
- const memberNode = findNodeAtRange(tree.rootNode, anchor.range, 'member_expression');
+ const memberNode =
+ findSelfOrAncestorOfType(groupedNodes['@reference.read.member'], 'member_expression') ??
+ findNodeAtRange(tree.rootNode, anchor.range, 'member_expression');
if (memberNode === null || !shouldEmitReadMember(memberNode)) {
continue;
}
@@ -208,9 +257,10 @@ export function emitTsScopeCaptures(
// overloads — TypeScript supports overload signatures via
// function_signature, so `parameterTypes` is populated when
// available.
- const declAnchor = pickFirstDefined(grouped, FUNCTION_DECL_TAGS);
+ const declAnchor = pickFirstCapture(grouped, FUNCTION_DECL_TAGS);
+ const declAnchorNode = pickFirstNode(groupedNodes, FUNCTION_DECL_TAGS);
if (declAnchor !== undefined) {
- const fnNode = findFunctionNode(tree.rootNode, declAnchor.range);
+ const fnNode = findFunctionNode(tree.rootNode, declAnchor.range, declAnchorNode);
if (fnNode !== null) {
const arity = computeTsArityMetadata(fnNode);
if (arity.parameterCount !== undefined) {
@@ -255,9 +305,11 @@ export function emitTsScopeCaptures(
// calls to disambiguate by props-arity, a JSX-aware arity
// synthesizer would need to count `jsx_attribute` children of the
// opening tag instead of `arguments`.
- const callAnchor = pickFirstDefined(grouped, CALL_TAGS);
+ const callAnchor = pickFirstCapture(grouped, CALL_TAGS);
+ const callAnchorNode = pickFirstNode(groupedNodes, CALL_TAGS);
if (callAnchor !== undefined && grouped['@reference.arity'] === undefined) {
const callNode =
+ findSelfOrAncestorOfTypes(callAnchorNode, ['call_expression', 'new_expression']) ??
findNodeAtRange(tree.rootNode, callAnchor.range, 'call_expression') ??
findNodeAtRange(tree.rootNode, callAnchor.range, 'new_expression');
if (callNode !== null) {
@@ -293,7 +345,11 @@ export function emitTsScopeCaptures(
// lookup instead of synthesis — covered by `tsReceiverBinding`.
const scopeFnAnchor = grouped['@scope.function'];
if (scopeFnAnchor !== undefined) {
- const fnNode = findFunctionNode(tree.rootNode, scopeFnAnchor.range);
+ const fnNode = findFunctionNode(
+ tree.rootNode,
+ scopeFnAnchor.range,
+ groupedNodes['@scope.function'],
+ );
if (fnNode !== null) {
const synth = synthesizeTsReceiverBinding(fnNode);
if (synth !== null) out.push(synth);
@@ -518,7 +574,13 @@ function inferArgType(argNode: SyntaxNode): string {
* The `@scope.function` anchor range covers the whole node, but the
* tag alone doesn't identify which node type among the many TS
* function-likes. */
-function findFunctionNode(rootNode: SyntaxNode, range: Capture['range']): SyntaxNode | null {
+function findFunctionNode(
+ rootNode: SyntaxNode,
+ range: Capture['range'],
+ anchorNode?: SyntaxNode,
+): SyntaxNode | null {
+ const fromAnchor = findSelfOrAncestorOfTypes(anchorNode, FUNCTION_NODE_TYPES);
+ if (fromAnchor !== null) return fromAnchor;
for (const nodeType of FUNCTION_NODE_TYPES) {
const n = findNodeAtRange(rootNode, range, nodeType);
if (n !== null) return n;
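
The anchor-first lookup in this file walks the parent chain from a node the caller already holds and only falls back to a root scan when that walk finds nothing. The sketch below reproduces the walk on a hand-built chain. `MiniNode` is a hypothetical two-field stand-in for tree-sitter's `SyntaxNode`, and the node instances are invented for the example.

```typescript
// Hypothetical minimal node shape: only the fields the walk needs.
interface MiniNode {
  type: string;
  parent: MiniNode | null;
}

function findSelfOrAncestorOfTypes(
  node: MiniNode | undefined,
  types: readonly string[],
): MiniNode | null {
  if (node === undefined) return null;
  let current: MiniNode | null = node;
  while (current !== null) {
    if (types.includes(current.type)) return current; // inclusive of `node` itself
    current = current.parent;
  }
  return null;
}

// program -> call_expression -> member_expression -> identifier
const programNode: MiniNode = { type: 'program', parent: null };
const callNode: MiniNode = { type: 'call_expression', parent: programNode };
const memberNode: MiniNode = { type: 'member_expression', parent: callNode };
const identNode: MiniNode = { type: 'identifier', parent: memberNode };

const enclosingCall = findSelfOrAncestorOfTypes(identNode, ['call_expression', 'new_expression']);
const noMatch = findSelfOrAncestorOfTypes(identNode, ['import_statement']); // null
```

The walk costs at most the depth of the anchor, whereas a range lookup from the root has to descend through the tree again for every capture.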
diff --git a/gitnexus/src/core/ingestion/parsing-processor.ts b/gitnexus/src/core/ingestion/parsing-processor.ts
index 5cf398beeb..28a080eb26 100644
--- a/gitnexus/src/core/ingestion/parsing-processor.ts
+++ b/gitnexus/src/core/ingestion/parsing-processor.ts
@@ -832,6 +832,14 @@ const processParsingSequential = async (
// Public API
// ============================================================================
+/**
+ * Per-`WorkerPool` log-dedup state for quarantine reporting. Keyed on the
+ * pool instance so multiple concurrent pools (test fixtures, future
+ * multi-pool callers) each get their own seen-set. WeakMap entries vanish
+ * when the pool is garbage-collected.
+ */
+const loggedQuarantineByPool = new WeakMap<WorkerPool, Set<string>>();
+
export const processParsing = async (
graph: KnowledgeGraph,
files: { path: string; content: string }[],
@@ -874,25 +882,75 @@ export const processParsing = async (
`[scope-resolution prof] worker pool engaged for ${files.length} files — cross-phase tree cache will be empty; scope-resolution re-parses.`,
);
}
- try {
- return await processParsingWithWorkers(
- graph,
- files,
- symbolTable,
- astCache,
- workerPool,
- reportProgress,
- outRawResults,
- );
- } catch (err) {
- const message = err instanceof Error ? err.message : String(err);
- logger.warn({ message }, 'Worker pool parsing stopped; continuing with sequential parser:');
- reportProgress?.(
- lastProgress,
- files.length,
- `Sequential fallback after worker issue: ${message}`,
- );
+ // U20 design pivot: the worker pool's resilience layers
+ // (respawn budget, circuit breaker, quarantine, slot-attribution,
+ // cumulative timeout) are the SOLE contract for handling worker
+ // failures. There is no sequential-parser fallback for either
+ // partial quarantine or full pool failure — the operator must see
+ // a clear hard signal when workers can't recover, instead of a
+ // silently-degraded graph from a possibly-crashing main-thread
+ // sequential parser. A failing tree-sitter native binding that
+ // quarantined a worker would, under the previous design, re-trigger
+ // the same SIGSEGV on the main thread; we avoid that risk entirely.
+ //
+ // - Partial quarantine: the file is missing from this run's graph;
+ // the per-chunk warn log below surfaces it; U2's chunk-cache
+ // write-guard in parse-impl.ts keeps the chunk uncached so the
+ // next analyze gets a cache miss and a fresh pool retries.
+ // - Full pool failure: `WorkerPoolDispatchError` propagates from
+ // `processParsingWithWorkers` up through this function. The
+ // analyze run errors out instead of falling back to sequential.
+ const data = await processParsingWithWorkers(
+ graph,
+ files,
+ symbolTable,
+ astCache,
+ workerPool,
+ reportProgress,
+ outRawResults,
+ );
+ // Session-scoped quarantine (worker-pool resilience Layer 3): surface
+ // any files this pool has decided are unsafe for workers so the
+ // operator can see what was skipped. The pool already filtered them
+ // out of dispatch; we only need to log + progress-report. Quarantine
+ // is session-scoped per pool instance — a fresh `createWorkerPool`
+ // call clears it.
+ //
+ // Dedup: log full path list only for entries newly quarantined since
+ // the previous dispatch on the same pool. The per-chunk progress
+ // message still surfaces the count for UX continuity, but the
+ // structured `quarantinedFiles` payload is only emitted when there
+ // is new signal — prevents O(quarantine × chunks) log spam.
+ const quarantineSnapshot = workerPool.getQuarantinedPaths?.() ?? [];
+ const quarantineSet = new Set(quarantineSnapshot);
+ if (quarantineSet.size > 0) {
+ const quarantinedInChunk = files.filter((file) => quarantineSet.has(file.path));
+ if (quarantinedInChunk.length > 0) {
+        const seenForPool = loggedQuarantineByPool.get(workerPool) ?? new Set<string>();
+ const newlyQuarantined = quarantinedInChunk
+ .map((file) => file.path)
+ .filter((p) => !seenForPool.has(p));
+ for (const p of newlyQuarantined) seenForPool.add(p);
+ loggedQuarantineByPool.set(workerPool, seenForPool);
+ if (newlyQuarantined.length > 0) {
+ logger.warn(
+ {
+ newlyQuarantined,
+ cumulativeQuarantine: quarantineSet.size,
+ chunkSkipped: quarantinedInChunk.length,
+ },
+ `Worker quarantine: ${newlyQuarantined.length} new file(s) skipped this chunk ` +
+ `(${quarantinedInChunk.length} skipped total, ${quarantineSet.size} cumulative).`,
+ );
+ }
+ reportProgress?.(
+ lastProgress,
+ files.length,
+ `${quarantinedInChunk.length} worker-quarantined file(s) skipped`,
+ );
+ }
}
+ return data;
}
// Fallback: sequential parsing (no pre-extracted data)
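
The quarantine logging above reports a path only the first time it is seen for a given pool. A minimal sketch of that per-pool dedup follows, with a plain object standing in for the `WorkerPool` instance. `seenByPool` and `takeNewlyQuarantined` are hypothetical names, and the file paths are invented.

```typescript
// Keyed on the pool object; entries become collectable once the pool is unreachable.
const seenByPool = new WeakMap<object, Set<string>>();

function takeNewlyQuarantined(pool: object, quarantinedInChunk: readonly string[]): string[] {
  const seen = seenByPool.get(pool) ?? new Set<string>();
  // Filter first, then record, so each path is reported once per pool.
  const fresh = quarantinedInChunk.filter((p) => !seen.has(p));
  for (const p of fresh) seen.add(p);
  seenByPool.set(pool, seen);
  return fresh;
}

const poolA = {};
const poolB = {};
const firstReport = takeNewlyQuarantined(poolA, ['a.ts', 'b.ts']); // ['a.ts', 'b.ts']
const secondReport = takeNewlyQuarantined(poolA, ['b.ts', 'c.ts']); // ['c.ts']
const otherPoolReport = takeNewlyQuarantined(poolB, ['a.ts']); // ['a.ts']
```

Because the seen-set belongs to the pool rather than to the module, a fresh pool starts with an empty set and reports its quarantined files again.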
diff --git a/gitnexus/src/core/ingestion/pipeline-phases/parse-impl.ts b/gitnexus/src/core/ingestion/pipeline-phases/parse-impl.ts
index 17cfaab3f5..46945d74c9 100644
--- a/gitnexus/src/core/ingestion/pipeline-phases/parse-impl.ts
+++ b/gitnexus/src/core/ingestion/pipeline-phases/parse-impl.ts
@@ -55,6 +55,7 @@ import type {
ExtractedCall,
ExtractedDecoratorRoute,
ExtractedFetchCall,
+ ExtractedImport,
ExtractedORMQuery,
ExtractedRoute,
ExtractedToolDef,
@@ -69,6 +70,7 @@ import path from 'node:path';
import { fileURLToPath, pathToFileURL } from 'node:url';
import { isDev } from '../utils/env.js';
+import { isVerboseIngestionEnabled } from '../utils/verbose.js';
import { synthesizeWildcardImportBindings, needsSynthesis } from './wildcard-synthesis.js';
import { extractORMQueriesInline } from './orm-extraction.js';
@@ -85,11 +87,24 @@ import { logger } from '../../logger.js';
* gives a useful invalidation floor (~1/N chunks on a multi-MB repo)
* while keeping worker dispatch overhead under 5% on cold runs.
*/
-const CHUNK_BYTE_BUDGET = (() => {
+/**
+ * Built-in chunk byte budget when neither `PipelineOptions.chunkByteBudget`
+ * nor `GITNEXUS_CHUNK_BYTE_BUDGET` is set. Tuned to give a useful
+ * cache-invalidation floor (~1/N chunks on a multi-MB repo) while keeping
+ * worker dispatch overhead under 5% on cold runs. Resolution happens at
+ * call time inside `runChunkedParseAndResolve` (U14 from PR #1693 review)
+ * — previously this was a module-load IIFE, which froze the env value at
+ * import time and meant per-call option threading silently no-op'd.
+ */
+const DEFAULT_CHUNK_BYTE_BUDGET = 2 * 1024 * 1024;
+
+function resolveChunkByteBudget(options?: PipelineOptions): number {
+ const opt = options?.chunkByteBudget;
+ if (typeof opt === 'number' && Number.isFinite(opt) && opt > 0) return opt;
const env = Number(process.env.GITNEXUS_CHUNK_BYTE_BUDGET);
if (Number.isFinite(env) && env > 0) return env;
- return 2 * 1024 * 1024;
-})();
+ return DEFAULT_CHUNK_BYTE_BUDGET;
+}
// ── Main parse + resolve function ──────────────────────────────────────────
@@ -177,18 +192,28 @@ export async function runChunkedParseAndResolve(
if (totalParseable === 0) {
onProgress({
phase: 'parsing',
- percent: 82,
+ // Skip directly to the end of the parse-phase progress band (M2 from PR
+ // #1693 review). Parse 20-70%, deferred 70-95%; nothing in either runs
+ // when there's no parseable file, so jump to 95.
+ percent: 95,
message: 'No parseable files found — skipping parsing phase',
stats: { filesProcessed: 0, totalFiles: 0, nodesCreated: graph.nodeCount },
});
}
- // Build byte-budget chunks
+ // Build byte-budget chunks. The budget is resolved per-call (U14): options
+ // first, then env, then the built-in default. Pre-U14 this was a
+ // module-load IIFE constant, which froze the env value at import time
+ // and made `PipelineOptions.chunkByteBudget` silently no-op on warm test
+ // runs. Resolving in the function body restores per-call configurability
+ // and matches the pattern used by resolveAutoPoolSize and the U1
+ // parseChunkConcurrency resolver.
+ const chunkByteBudget = resolveChunkByteBudget(options);
const chunks: string[][] = [];
let currentChunk: string[] = [];
let currentBytes = 0;
for (const file of parseableScanned) {
- if (currentChunk.length > 0 && currentBytes + file.size > CHUNK_BYTE_BUDGET) {
+ if (currentChunk.length > 0 && currentBytes + file.size > chunkByteBudget) {
chunks.push(currentChunk);
currentChunk = [];
currentBytes = 0;
@@ -203,16 +228,22 @@ export async function runChunkedParseAndResolve(
if (isDev) {
const totalMB = parseableScanned.reduce((s, f) => s + f.size, 0) / (1024 * 1024);
logger.info(
- `📂 Scan: ${totalFiles} paths, ${totalParseable} parseable (${totalMB.toFixed(0)}MB), ${numChunks} chunks @ ${CHUNK_BYTE_BUDGET / (1024 * 1024)}MB budget`,
+ `📂 Scan: ${totalFiles} paths, ${totalParseable} parseable (${totalMB.toFixed(0)}MB), ${numChunks} chunks @ ${chunkByteBudget / (1024 * 1024)}MB budget`,
);
}
- onProgress({
- phase: 'parsing',
- percent: 20,
- message: `Parsing ${totalParseable} files in ${numChunks} chunk${numChunks !== 1 ? 's' : ''}...`,
- stats: { filesProcessed: 0, totalFiles: totalParseable, nodesCreated: graph.nodeCount },
- });
+ // Skip the "Parsing N files..." announcement when there's nothing to parse
+ // — the early-return branch above already emitted percent 95 ("skipping
+ // parsing phase"), and emitting percent 20 here would regress the
+ // progress stream non-monotonically (M2 from PR #1693 review).
+ if (totalParseable > 0) {
+ onProgress({
+ phase: 'parsing',
+ percent: 20,
+ message: `Parsing ${totalParseable} files in ${numChunks} chunk${numChunks !== 1 ? 's' : ''}...`,
+ stats: { filesProcessed: 0, totalFiles: totalParseable, nodesCreated: graph.nodeCount },
+ });
+ }
// Don't spawn workers for tiny repos — overhead exceeds benefit.
// Test suites may lower the thresholds via `options.workerThresholdsForTest`
@@ -221,18 +252,33 @@ export async function runChunkedParseAndResolve(
const MIN_BYTES_FOR_WORKERS = options?.workerThresholdsForTest?.minBytes ?? 512 * 1024;
const totalBytes = parseableScanned.reduce((s, f) => s + f.size, 0);
- // Create worker pool once, reuse across chunks
+ // Create worker pool once, reuse across chunks.
+ //
+ // `workerPoolSize === 0` is a programmatic equivalent of `skipWorkers:
+ // true` per the `PipelineOptions.workerPoolSize` contract. Short-
+ // circuiting here avoids constructing a useless pool that rejects
+ // every dispatch (with a `Worker pool parsing stopped` warn log per
+ // chunk) just to fall back to the sequential path via the error
+ // catch — the gate honors the docstring directly.
let workerPool: WorkerPool | undefined;
if (
!options?.skipWorkers &&
+ options?.workerPoolSize !== 0 &&
(totalParseable >= MIN_FILES_FOR_WORKERS || totalBytes >= MIN_BYTES_FOR_WORKERS)
) {
try {
- let workerUrl = new URL('../workers/parse-worker.js', import.meta.url);
+ // U20.U3 test-only injection: integration tests pass a custom
+ // worker script URL via `workerUrlForTest` (mirrors the
+ // `workerThresholdsForTest` precedent) so they can drive the
+ // chunk-loop with deterministically-misbehaving workers without
+ // mocking the module import graph. When unset, the normal src/
+ // → dist/ resolution runs.
+ let workerUrl =
+ options?.workerUrlForTest ?? new URL('../workers/parse-worker.js', import.meta.url);
// When running under vitest, import.meta.url points to src/ where no .js exists.
// Fall back to the compiled dist/ worker so the pool can spawn real worker threads.
const thisDir = fileURLToPath(new URL('.', import.meta.url));
- if (!fs.existsSync(fileURLToPath(workerUrl))) {
+ if (!options?.workerUrlForTest && !fs.existsSync(fileURLToPath(workerUrl))) {
const distWorker = path.resolve(
thisDir,
'..',
@@ -249,7 +295,7 @@ export async function runChunkedParseAndResolve(
workerUrl = pathToFileURL(distWorker);
}
}
- workerPool = createWorkerPool(workerUrl);
+ workerPool = createWorkerPool(workerUrl, options?.workerPoolSize);
} catch (err) {
logger.warn(
{ err: (err as Error).message },
@@ -301,6 +347,16 @@ export async function runChunkedParseAndResolve(
const deferredWorkerHeritage: ExtractedHeritage[] = [];
const deferredConstructorBindings: FileConstructorBindings[] = [];
const deferredAssignments: ExtractedAssignment[] = [];
+ // Imports accumulated across chunks. Previously processed per-chunk
+ // via `processImportsFromExtracted` inside the chunk loop, which
+ // forced workers to sit idle on the main thread's extraction pass
+ // between chunk dispatches (4-5% CPU utilization symptom). Deferring
+ // to a single end-of-loop pass lets the worker pool start chunk N+1
+    // immediately after chunk N's worker dispatch returns. End-of-loop
+    // resolution also has strictly more information, because the graph then
+    // holds every chunk's symbols, which improves cross-chunk import targets.
+ const deferredWorkerImports: ExtractedImport[] = [];
+ let anyChunkNeedsWildcardSynth = false;
// Aggregated per-file ParsedFile artifacts produced by workers' calls
// to `extractParsedFile`. Threaded through to the scope-resolution
// phase so it can SKIP its own re-extraction on cache hits — this is
@@ -317,10 +373,54 @@ export async function runChunkedParseAndResolve(
let chunkCacheMisses = 0;
try {
+ // U1 — bounded chunk concurrency (B1 from PR #1693 review): pre-fetch
+ // chunk file contents up to `parseChunkConcurrency` chunks ahead of the
+ // dispatch cursor so file I/O overlaps with worker compute. Worker
+ // dispatch itself stays serial because `WorkerPool.dispatch` is not
+ // reentrant (concurrent calls would race on the shared per-slot
+ // busy/in-flight state). With concurrency=1 behavior is identical to
+ // the pure-serial loop. F4: deferred-state aggregation still happens
+ // in chunkIdx order (the for-loop below iterates sequentially), so
+ // cross-chunk processors see deterministic input regardless of
+ // file-read completion order. Honors options.parseChunkConcurrency
+ // (threaded from the CLI), then GITNEXUS_PARSE_CHUNK_CONCURRENCY env
+ // (default 2 — matches the help text the CLI advertises).
+ const parseChunkConcurrency = ((): number => {
+ const opt = options?.parseChunkConcurrency;
+ if (typeof opt === 'number' && Number.isInteger(opt) && opt >= 1) return opt;
+ const env = Number(process.env.GITNEXUS_PARSE_CHUNK_CONCURRENCY);
+ if (Number.isInteger(env) && env >= 1) return env;
+ return 2;
+ })();
+    const chunkContentPromises = new Array<Promise<Map<string, string>> | undefined>(numChunks);
+ const startChunkPrefetch = (i: number): void => {
+ if (i >= numChunks || chunkContentPromises[i] !== undefined) return;
+ chunkContentPromises[i] = readFileContents(repoPath, chunks[i]);
+ };
+ for (let i = 0; i < Math.min(parseChunkConcurrency, numChunks); i++) {
+ startChunkPrefetch(i);
+ }
+
+ // Hoisted loop-invariant: GITNEXUS_VERBOSE / NODE_ENV are read once
+ // (not on every chunk). Previously evaluated at the top of the loop
+ // body, which re-read process.env on every iteration even though
+ // the env can't change mid-run.
+ const verboseThroughputLog = isDev || isVerboseIngestionEnabled();
+
for (let chunkIdx = 0; chunkIdx < numChunks; chunkIdx++) {
const chunkPaths = chunks[chunkIdx];
-
- const chunkContents = await readFileContents(repoPath, chunkPaths);
+ // Start wall-clock for the per-chunk throughput log emitted at end
+ // of this iteration. The gate is computed once above; here we just
+ // sample the clock if the gate is on. Computed when either
+ // NODE_ENV=development OR the operator passed `--verbose`
+ // (GITNEXUS_VERBOSE) — the previous `isDev`-only gate meant
+ // operators running `gitnexus analyze --verbose` in production
+ // never saw the log (M3 from PR #1693 review).
+ const chunkStartMs: number | null = verboseThroughputLog ? Date.now() : null;
+
+ const chunkContents = await chunkContentPromises[chunkIdx]!;
+ chunkContentPromises[chunkIdx] = undefined; // release the in-memory copy
+ startChunkPrefetch(chunkIdx + parseChunkConcurrency);
const chunkFiles = chunkPaths
.filter((p) => chunkContents.has(p))
.map((p) => ({ path: p, content: chunkContents.get(p)! }));
@@ -357,7 +457,11 @@ export async function runChunkedParseAndResolve(
const cachedFiles = chunkFiles.length;
onProgress({
phase: 'parsing',
- percent: Math.round(20 + ((filesParsedSoFar + cachedFiles) / totalParseable) * 62),
+ // Parse phase covers 20-70 (50 points). Deferred extraction below
+ // takes 70-95 so the UI advances through the (potentially long)
+ // resolution stages instead of holding at 82 (M2 from PR #1693
+ // review).
+ percent: Math.round(20 + ((filesParsedSoFar + cachedFiles) / totalParseable) * 50),
message: `Parsing chunk ${chunkIdx + 1}/${numChunks} (cache)...`,
stats: {
filesProcessed: filesParsedSoFar + cachedFiles,
@@ -378,7 +482,8 @@ export async function runChunkedParseAndResolve(
scopeTreeCache,
(current, _total, filePath) => {
const globalCurrent = filesParsedSoFar + current;
- const parsingProgress = 20 + (globalCurrent / totalParseable) * 62;
+ // Parse phase covers 20-70 (M2). Deferred extraction handles 70-95.
+ const parsingProgress = 20 + (globalCurrent / totalParseable) * 50;
onProgress({
phase: 'parsing',
percent: Math.round(parsingProgress),
@@ -399,56 +504,63 @@ export async function runChunkedParseAndResolve(
// Persist the raw results for this chunk hash. Sequential path
// doesn't populate rawResults (it writes directly to graph), so
// small repos without worker pool simply don't cache. That's fine.
+ //
+ // U20.U2: refuse the write when any chunk file is in the
+ // worker pool's cumulative quarantine snapshot. The chunkHash
+ // is computed from EVERY file in the chunk, but the pool's
+ // Layer 3 quarantine filters quarantined files out of dispatch
+ // — so `rawResults` is narrower than the chunkHash key implies.
+ // Caching it would silently replay incomplete results on the
+ // next run with unchanged content (the corruption class Codex's
+ // adversarial review of PR #1693 flagged).
+ //
+ // Skipping the write means the next analyze gets a cache miss
+ // for this chunk and re-dispatches against a fresh worker pool
+ // (quarantine is session-scoped — `createQuarantine` is called
+ // per-pool at worker-pool.ts), giving the quarantined file
+ // another chance. If quarantine fires again, U20.U1's
+ // sequential gap-fill still produces a complete graph for this
+ // run; the cache just stays empty for this chunk until a fully-
+ // clean dispatch lands.
if (parseCache && chunkHash && rawResults.length > 0) {
- parseCache.entries.set(chunkHash, rawResults);
- if (isDev) {
- logger.info(
- `📦 parse-cache MISS+store: chunk ${chunkIdx + 1}/${numChunks} (${chunkFiles.length} files, ${chunkHash.slice(0, 8)})`,
- );
+ const quarantineSnapshot = workerPool?.getQuarantinedPaths?.() ?? [];
+ const quarantineSet = new Set(quarantineSnapshot);
+ const chunkHadQuarantine = chunkFiles.some((f) => quarantineSet.has(f.path));
+ if (chunkHadQuarantine) {
+ if (isDev) {
+ const quarantinedInChunk = chunkFiles.filter((f) => quarantineSet.has(f.path)).length;
+ logger.info(
+ `📦 parse-cache SKIP: chunk ${chunkIdx + 1}/${numChunks} ` +
+ `had ${quarantinedInChunk} worker-quarantined file(s); ` +
+ `next run will rediscover (${chunkHash.slice(0, 8)})`,
+ );
+ }
+ } else {
+ parseCache.entries.set(chunkHash, rawResults);
+ if (isDev) {
+ logger.info(
+ `📦 parse-cache MISS+store: chunk ${chunkIdx + 1}/${numChunks} (${chunkFiles.length} files, ${chunkHash.slice(0, 8)})`,
+ );
+ }
}
}
}
- const chunkBasePercent = 20 + (filesParsedSoFar / totalParseable) * 62;
-
+ // Per-chunk extraction passes (processImportsFromExtracted,
+ // processHeritageFromExtracted, processRoutesFromExtracted,
+ // synthesizeWildcardImportBindings, seedCrossFileReceiverTypes)
+ // moved out of the chunk loop into a single end-of-loop pass below.
+ // Reason: per-chunk extraction blocked the chunk loop on
+ // main-thread work between worker dispatches — workers sat idle
+ // and total CPU utilization plateaued at 4-5% on multi-core boxes.
+ // Deferring keeps workers busy chunk-after-chunk; resolution sees
+ // strictly-more-information (full repo graph) so cross-chunk import
+ // and heritage targets resolve at least as well as before.
if (chunkWorkerData) {
- await processImportsFromExtracted(
- graph,
- allPathObjects,
- chunkWorkerData.imports,
- ctx,
- (current, total) => {
- onProgress({
- phase: 'parsing',
- percent: Math.round(chunkBasePercent),
- message: `Resolving imports (chunk ${chunkIdx + 1}/${numChunks})...`,
- detail: `${current}/${total} files`,
- stats: {
- filesProcessed: filesParsedSoFar,
- totalFiles: totalParseable,
- nodesCreated: graph.nodeCount,
- },
- });
- },
- repoPath,
- importCtx,
- );
if (chunkNeedsSynthesis[chunkIdx]) {
- synthesizeWildcardImportBindings(graph, ctx);
- hasSynthesized = true;
- }
- if (exportedTypeMap.size > 0 && ctx.namedImportMap.size > 0) {
- const { enrichedCount } = seedCrossFileReceiverTypes(
- chunkWorkerData.calls,
- ctx.namedImportMap,
- exportedTypeMap,
- );
- if (isDev && enrichedCount > 0) {
- logger.info(
- `🔗 E1: Seeded ${enrichedCount} cross-file receiver types (chunk ${chunkIdx + 1})`,
- );
- }
+ anyChunkNeedsWildcardSynth = true;
}
+ for (const item of chunkWorkerData.imports) deferredWorkerImports.push(item);
for (const item of chunkWorkerData.calls) deferredWorkerCalls.push(item);
for (const item of chunkWorkerData.heritage) deferredWorkerHeritage.push(item);
for (const item of chunkWorkerData.constructorBindings)
@@ -463,35 +575,6 @@ export async function runChunkedParseAndResolve(
for (const item of chunkWorkerData.assignments) deferredAssignments.push(item);
}
- await Promise.all([
- processHeritageFromExtracted(graph, chunkWorkerData.heritage, ctx, (current, total) => {
- onProgress({
- phase: 'parsing',
- percent: Math.round(chunkBasePercent),
- message: `Resolving heritage (chunk ${chunkIdx + 1}/${numChunks})...`,
- detail: `${current}/${total} records`,
- stats: {
- filesProcessed: filesParsedSoFar,
- totalFiles: totalParseable,
- nodesCreated: graph.nodeCount,
- },
- });
- }),
- processRoutesFromExtracted(graph, chunkWorkerData.routes ?? [], ctx, (current, total) => {
- onProgress({
- phase: 'parsing',
- percent: Math.round(chunkBasePercent),
- message: `Resolving routes (chunk ${chunkIdx + 1}/${numChunks})...`,
- detail: `${current}/${total} routes`,
- stats: {
- filesProcessed: filesParsedSoFar,
- totalFiles: totalParseable,
- nodesCreated: graph.nodeCount,
- },
- });
- }),
- ]);
-
if (chunkWorkerData.fileScopeBindings?.length) {
for (const { filePath, bindings } of chunkWorkerData.fileScopeBindings) {
if (typeof filePath !== 'string' || filePath.length === 0) continue;
@@ -530,6 +613,24 @@ export async function runChunkedParseAndResolve(
filesParsedSoFar += chunkFiles.length;
astCache.clear();
+
+ // Throughput observability (U3): emit a per-chunk metrics line
+ // under verbose ingestion mode so operators can verify CPU
+ // utilization moved + tune `--workers` / batch sizes without
+ // guessing. Cheap snapshot — just reads pool closure state.
+ if (verboseThroughputLog && chunkStartMs !== null) {
+ const elapsedMs = Date.now() - chunkStartMs;
+ const filesPerSec = elapsedMs > 0 ? (chunkFiles.length * 1000) / elapsedMs : 0;
+ const stats = workerPool?.getStats?.();
+ const poolFrag = stats
+ ? ` pool: ${stats.activeSlots}/${stats.size} active, ` +
+ `${stats.quarantined} quarantined${stats.poolBroken ? ', BROKEN' : ''}`
+ : ' (sequential)';
+ logger.info(
+ `📊 chunk ${chunkIdx + 1}/${numChunks}: ${chunkFiles.length} files in ${elapsedMs}ms ` +
+ `(${filesPerSec.toFixed(1)} files/s)${poolFrag}`,
+ );
+ }
}
if (isDev && parseCache && (chunkCacheHits > 0 || chunkCacheMisses > 0)) {
@@ -538,10 +639,129 @@ export async function runChunkedParseAndResolve(
);
}
+ // Deferred end-of-loop extraction (moved out of the per-chunk block):
+ // 1. processImportsFromExtracted on all chunks' imports
+ // 2. synthesizeWildcardImportBindings (if any chunk had wildcards)
+ // 3. seedCrossFileReceiverTypes on deferred calls (depends on
+ // namedImportMap populated by step 1)
+ // 4. processHeritageFromExtracted on all chunks' heritage
+ // 5. processRoutesFromExtracted on all chunks' routes
+ // Same logic as the prior per-chunk passes, just batched — resolution
+ // sees the full repo graph instead of just current-and-earlier chunks.
+ // Deferred extraction band (M2 from PR #1693 review): the 4 stages below
+ // each get their own 5-10 point slice of the 70-95 range so percent
+ // advances monotonically through the (potentially long) resolution work
+ // instead of holding flat at 82. Stages that are skipped (zero-length
+ // input) leave their band as a no-op jump — the next stage still starts
+ // at its own band, preserving monotonicity.
+ // imports: 70 -> 75 (5)
+ // heritage: 75 -> 80 (5)
+ // routes: 80 -> 85 (5)
+ // calls: 85 -> 95 (10)
+ if (deferredWorkerImports.length > 0) {
+ await processImportsFromExtracted(
+ graph,
+ allPathObjects,
+ deferredWorkerImports,
+ ctx,
+ (current, total) => {
+ const ratio = total > 0 ? current / total : 1;
+ onProgress({
+ phase: 'parsing',
+ percent: 70 + Math.round(ratio * 5),
+ message: 'Resolving imports (all chunks)...',
+ detail: `${current}/${total} files`,
+ stats: {
+ filesProcessed: filesParsedSoFar,
+ totalFiles: totalParseable,
+ nodesCreated: graph.nodeCount,
+ },
+ });
+ },
+ repoPath,
+ importCtx,
+ );
+ // U15 (lightweight M1): processImportsFromExtracted is the sole
+ // consumer of `deferredWorkerImports`. Free the array now so the
+ // GC can reclaim the per-file ExtractedImport records before the
+ // heavier downstream stages run (heritage, routes, calls). Peak
+ // accumulator memory drops from O(repo) to O(repo - imports) for
+ // the remainder of the deferred phase. The future per-chunk
+ // streaming upgrade can rewrite this with the same correctness
+ // contract once profile data shows it's warranted.
+ deferredWorkerImports.length = 0;
+ }
+ if (anyChunkNeedsWildcardSynth) {
+ synthesizeWildcardImportBindings(graph, ctx);
+ hasSynthesized = true;
+ }
+ // L5 from PR #1693 review: populate `exportedTypeMap` from the in-progress
+ // graph BEFORE `seedCrossFileReceiverTypes` runs. Previously the seeding
+ // branch below was reached with `exportedTypeMap.size === 0` in the
+ // worker path (the map was only built at the post-parse block far below,
+ // AFTER the seeding branch), so the seed dead-coded itself silently and
+ // call resolution never got the cross-file receiver-type enrichment.
+ // The post-parse builder still runs as a defensive fallback on the
+ // sequential path; its `size === 0` guard means we don't pay the cost
+ // twice on the worker path.
+ if (exportedTypeMap.size === 0 && graph.nodeCount > 0) {
+ const graphExports = buildExportedTypeMapFromGraph(graph, ctx.model.symbols);
+ for (const [fp, exports] of graphExports) exportedTypeMap.set(fp, exports);
+ }
+ if (exportedTypeMap.size > 0 && ctx.namedImportMap.size > 0 && deferredWorkerCalls.length > 0) {
+ const { enrichedCount } = seedCrossFileReceiverTypes(
+ deferredWorkerCalls,
+ ctx.namedImportMap,
+ exportedTypeMap,
+ );
+ if (isDev && enrichedCount > 0) {
+ logger.info(`🔗 E1: Seeded ${enrichedCount} cross-file receiver types (all chunks)`);
+ }
+ }
+ if (deferredWorkerHeritage.length > 0) {
+ await processHeritageFromExtracted(graph, deferredWorkerHeritage, ctx, (current, total) => {
+ const ratio = total > 0 ? current / total : 1;
+ onProgress({
+ phase: 'parsing',
+ percent: 75 + Math.round(ratio * 5),
+ message: 'Resolving heritage (all chunks)...',
+ detail: `${current}/${total} records`,
+ stats: {
+ filesProcessed: filesParsedSoFar,
+ totalFiles: totalParseable,
+ nodesCreated: graph.nodeCount,
+ },
+ });
+ });
+ }
+ if (allExtractedRoutes.length > 0) {
+ await processRoutesFromExtracted(graph, allExtractedRoutes, ctx, (current, total) => {
+ const ratio = total > 0 ? current / total : 1;
+ onProgress({
+ phase: 'parsing',
+ percent: 80 + Math.round(ratio * 5),
+ message: 'Resolving routes (all chunks)...',
+ detail: `${current}/${total} routes`,
+ stats: {
+ filesProcessed: filesParsedSoFar,
+ totalFiles: totalParseable,
+ nodesCreated: graph.nodeCount,
+ },
+ });
+ });
+ }
+
const fullWorkerHeritageMap =
deferredWorkerHeritage.length > 0
? buildHeritageMap(deferredWorkerHeritage, ctx, getHeritageStrategyForLanguage)
: undefined;
+ // U15 (lightweight M1): buildHeritageMap is the LAST consumer of the
+ // raw `deferredWorkerHeritage` records — processCallsFromExtracted
+ // below reads from the derived `fullWorkerHeritageMap` instead. Free
+ // the raw heritage array now so the GC can reclaim it before the
+ // (potentially long) call-resolution stage. processHeritageFromExtracted
+ // earlier was a read-only consumer (pushed to graph, didn't drain).
+ deferredWorkerHeritage.length = 0;
if (deferredWorkerCalls.length > 0) {
await processCallsFromExtracted(
@@ -549,9 +769,13 @@ export async function runChunkedParseAndResolve(
deferredWorkerCalls,
ctx,
(current, total) => {
+ const ratio = total > 0 ? current / total : 1;
onProgress({
phase: 'parsing',
- percent: 82,
+ // Calls is the longest deferred stage on real repos — give it the
+ // 10-point tail 85-95 so the progress bar visibly advances during
+ // call resolution instead of holding at 82 (M2).
+ percent: 85 + Math.round(ratio * 10),
message: 'Resolving calls (all chunks)...',
detail: `${current}/${total} files`,
stats: {
@@ -576,6 +800,20 @@ export async function runChunkedParseAndResolve(
bindingAccumulator,
);
}
+ // U15 (lightweight M1): all three arrays have had their last consumer
+ // by the time we reach this point — processCallsFromExtracted drained
+ // `deferredWorkerCalls` and read `deferredConstructorBindings`;
+ // processAssignmentsFromExtracted drained `deferredAssignments` and
+ // also read `deferredConstructorBindings`. Free them now so the
+ // function-scope references die before downstream graph-build /
+ // scope-resolution starts using its own working memory. Note: arrays
+ // returned in the function result object (allFetchCalls,
+ // allExtractedRoutes, allDecoratorRoutes, allToolDefs, allORMQueries,
+ // allParsedFiles) intentionally stay live — downstream consumers
+ // need them.
+ deferredWorkerCalls.length = 0;
+ deferredConstructorBindings.length = 0;
+ deferredAssignments.length = 0;
} finally {
await workerPool?.terminate();
}
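Reviewer note: the prefetch window introduced in the chunk loop above can be read in isolation as the sketch below. It is a minimal illustration under stated assumptions, not the code in this PR: `processChunksWithPrefetch`, `ReadChunk`, and `handle` are hypothetical names, `readChunk` stands in for `readFileContents`, and `handle` stands in for the serial worker dispatch. The content type `Map<string, string>` is inferred from how `chunkContents` is used.

```typescript
// Hypothetical stand-in for readFileContents: resolves a chunk's paths to their contents.
type ReadChunk = (paths: string[]) => Promise<Map<string, string>>;

async function processChunksWithPrefetch(
  chunks: string[][],
  readChunk: ReadChunk,
  concurrency: number,
  // Hypothetical stand-in for the serial parse dispatch of one chunk.
  handle: (contents: Map<string, string>, chunkIdx: number) => Promise<void>,
): Promise<void> {
  // Assumption: clamp to at least 1 so the loop always makes progress.
  const windowSize = Math.max(1, Math.floor(concurrency));
  const pending = new Array<Promise<Map<string, string>> | undefined>(chunks.length);

  // Start reading chunk i unless it is out of range or already in flight.
  const startPrefetch = (i: number): void => {
    if (i >= chunks.length || pending[i] !== undefined) return;
    pending[i] = readChunk(chunks[i]);
  };

  // Prime the window: the first `windowSize` reads begin before any chunk is handled.
  for (let i = 0; i < Math.min(windowSize, chunks.length); i++) startPrefetch(i);

  for (let i = 0; i < chunks.length; i++) {
    const contents = await pending[i]!;
    pending[i] = undefined; // drop the reference so the contents can be collected
    startPrefetch(i + windowSize); // refill the window before the slow step
    await handle(contents, i); // handling stays strictly serial
  }
}
```

With `concurrency = 1` the read of chunk N+1 starts only once chunk N has been read, which matches the pre-change ordering. Larger values let reads overlap with `handle` while keeping at most `concurrency` chunk contents in flight.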
diff --git a/gitnexus/src/core/ingestion/pipeline.ts b/gitnexus/src/core/ingestion/pipeline.ts
index 1ee8e102fb..ec5828b3b2 100644
--- a/gitnexus/src/core/ingestion/pipeline.ts
+++ b/gitnexus/src/core/ingestion/pipeline.ts
@@ -55,6 +55,16 @@ export interface PipelineOptions {
minFiles?: number;
minBytes?: number;
};
+ /**
+ * @internal Test-only override for the worker script URL the pool
+ * spawns. When unset, parse-impl resolves `parse-worker.js` from the
+ * adjacent `workers/` directory (or the compiled `dist/` fallback
+ * under vitest). Integration tests use this to inject a custom
+ * worker script that deterministically triggers worker-pool
+ * resilience paths (e.g., crash-on-poison-file) — same precedent as
+ * `workerThresholdsForTest`. Do not use from production call sites.
+ */
+ workerUrlForTest?: URL;
/**
* Incremental-indexing parse cache. When provided:
* - The parse phase looks up each chunk's content hash in
@@ -68,6 +78,46 @@ export interface PipelineOptions {
* See `gitnexus/src/storage/parse-cache.ts`.
*/
parseCache?: import('../../storage/parse-cache.js').ParseCache;
+ /**
+ * Worker pool size override, threaded from the CLI `--workers` flag
+ * via `AnalyzeOptions`. When set, parse-impl passes this directly to
+ * `createWorkerPool` so the pool sizing bypasses the env-var fallback
+ * in `resolveAutoPoolSize`. The env-var channel
+ * (`GITNEXUS_WORKER_POOL_SIZE`) remains as a back-compat fallback when
+ * this field is undefined. Setting `workerPoolSize: 0` disables the
+ * pool entirely (sequential fallback) — equivalent to `skipWorkers`
+ * but expressed in the same units as `--workers` so long-running
+ * hosts (eval-server, MCP daemon) can size per-call without leaking
+ * `process.env` state across analyze invocations.
+ */
+ workerPoolSize?: number;
+ /**
+ * Number of chunks whose file contents may be read into memory in
+ * parallel while the worker pool is busy dispatching the current
+ * chunk. Pre-fetching overlaps disk I/O for chunk N+1..N+K with the
+ * worker compute on chunk N — modest but real wall-clock win on
+ * repos large enough to chunk. Worker dispatch itself remains serial
+ * because `WorkerPool.dispatch` is not reentrant (concurrent calls
+ * would race on the shared per-slot busy/in-flight state).
+ *
+ * `1` matches today's pure-serial behavior; `2` is the documented
+ * default (`GITNEXUS_PARSE_CHUNK_CONCURRENCY`). Falls back to the
+ * env var when undefined; defaults to 2 when neither is set.
+ */
+ parseChunkConcurrency?: number;
+ /**
+ * Byte budget per parse chunk (in bytes). When set, parse-impl uses
+ * this instead of the `GITNEXUS_CHUNK_BYTE_BUDGET` env var or the
+ * built-in 2 MB default. Smaller values produce more chunks (finer
+ * cache-hit granularity, more worker dispatches); larger values
+ * batch more files per dispatch.
+ *
+ * Threading the value through options instead of the env var lets
+ * tests vary the chunk layout per-call without `vi.resetModules` and
+ * lets long-running hosts (eval-server, MCP daemon) size per-call
+ * without leaking `process.env` state across invocations.
+ */
+ chunkByteBudget?: number;
}
// ── Phase registry ─────────────────────────────────────────────────────────
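Reviewer note: the doc comment on `parseChunkConcurrency` describes a three-step precedence (explicit option, then `GITNEXUS_PARSE_CHUNK_CONCURRENCY`, then a default of 2). A minimal sketch of that precedence follows. The function name `resolveParseChunkConcurrency` and the validation rule (integer of at least 1) are assumptions made for illustration, and the environment is passed in as a plain record so the example stays self-contained. The actual resolution in parse-impl may differ.

```typescript
// Hypothetical helper: resolves the prefetch window from option, env var, or default.
function resolveParseChunkConcurrency(
  optionValue: number | undefined,
  env: Record<string, string | undefined>,
): number {
  // Assumption: only integers >= 1 count as usable values.
  const isValid = (n: number): boolean => Number.isInteger(n) && n >= 1;

  // 1. An explicit per-call option wins when it is usable.
  if (optionValue !== undefined && isValid(optionValue)) return optionValue;

  // 2. Otherwise fall back to the environment variable named in the doc comment.
  const raw = env['GITNEXUS_PARSE_CHUNK_CONCURRENCY'];
  if (raw !== undefined) {
    const parsed = Number.parseInt(raw, 10);
    if (isValid(parsed)) return parsed;
  }

  // 3. Documented default when neither source supplies a usable value.
  return 2;
}
```

Taking the environment as an argument mirrors the motivation stated for `chunkByteBudget`: callers can vary the value per call without mutating `process.env`.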
diff --git a/gitnexus/src/core/ingestion/workers/parse-worker.ts b/gitnexus/src/core/ingestion/workers/parse-worker.ts
index da681b0700..eee9593a16 100644
--- a/gitnexus/src/core/ingestion/workers/parse-worker.ts
+++ b/gitnexus/src/core/ingestion/workers/parse-worker.ts
@@ -301,10 +301,7 @@ export interface ParseWorkerInput {
content: string;
}
-type WorkerIncomingMessage =
- | { type: 'sub-batch'; files: ParseWorkerInput[] }
- | { type: 'flush' }
- | ParseWorkerInput[];
+type WorkerIncomingMessage = { type: 'sub-batch'; files: ParseWorkerInput[] } | { type: 'flush' };
// ============================================================================
// Worker-local parser + language map
@@ -1401,6 +1398,15 @@ const processFileGroup = (
// Skip files larger than the max tree-sitter buffer (32 MB)
if (getTreeSitterContentByteLength(file.content) > TREE_SITTER_MAX_BUFFER) continue;
+ // Authoritative in-flight signal for the pool: lets `WorkerPool` exclude
+ // exactly this file if the worker dies during parse/extract, instead of
+ // guessing from `items[lastProgress]` (which the language-grouped order
+ // here would defeat). The pool gracefully ignores this when running an
+ // older worker build that doesn't emit it.
+ if (parentPort) {
+ parentPort.postMessage({ type: 'starting-file', path: file.path });
+ }
+
// Vue SFC preprocessing: extract