
feat(ingestion): ScopeExtractor driver — 5-pass CaptureMatch → ParsedFile (#919, RFC #909 Ring 2 PKG)#965

Merged
magyargergo merged 2 commits into main from rfc/scope-resolution/919-scope-extractor on Apr 18, 2026

@magyargergo

Collaborator

Closes #919. Ring 2 PKG kickoff — the central driver that turns a language provider's capture matches into a per-file ParsedFile artifact. Foundation for the rest of Ring 2 PKG (#920 parse-worker integration, #921 finalize orchestrator, #922 import adapters).

What's new

New shared contracts (gitnexus-shared)

| File | Role |
|---|---|
| `parsed-file.ts` | `ParsedFile`: per-file extraction artifact with `scopes`, `parsedImports`, `localDefs`, `referenceSites`. Structural superset of `FinalizeFile` so the finalize orchestrator threads it through unchanged. |
| `reference-site.ts` | `ReferenceSite`: pre-resolution usage fact with name, atRange, inScope, kind, optional callForm/explicitReceiver/arity. Converted to `Reference` records by the resolution phase. |
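
To make the two contracts described above concrete, here is a minimal sketch of how the shapes might look. The field names come from the PR description; every field type, the `SourceRange` helper, the `filePath` field, and the example values are assumptions for illustration, not the actual `gitnexus-shared` definitions.

```ts
// Hypothetical shapes: field names follow the PR description; every type is an assumption.
interface SourceRange {
  startLine: number;
  startCol: number;
  endLine: number;
  endCol: number;
}

interface ReferenceSite {
  name: string;
  atRange: SourceRange;
  inScope: string; // assumed: ID of the innermost enclosing scope
  kind: string; // assumed: the real set of reference kinds is not listed here
  callForm?: string;
  explicitReceiver?: string;
  arity?: number;
}

interface ParsedFile {
  filePath: string; // assumed to be carried on the artifact
  scopes: readonly unknown[]; // Scope records, elided in this sketch
  parsedImports: readonly unknown[]; // raw ParsedImport records
  localDefs: readonly unknown[]; // SymbolDefinition records
  referenceSites: readonly ReferenceSite[];
}

// A member call `user.save()` recorded as a pre-resolution usage fact.
const saveCall: ReferenceSite = {
  name: "save",
  atRange: { startLine: 3, startCol: 5, endLine: 3, endCol: 9 },
  inScope: "scope:method:process",
  kind: "call",
  callForm: "member",
  explicitReceiver: "user",
  arity: 0,
};

const parsedFile: ParsedFile = {
  filePath: "src/user.ts",
  scopes: [],
  parsedImports: [],
  localDefs: [],
  referenceSites: [saveCall],
};
```

The point of the split is that a `ReferenceSite` records only what was written at the usage site; deciding which definition it refers to is left to the later resolution phase.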

Ring 1 collateral tweak

`LanguageProvider.emitScopeCaptures` now returns `Promise<readonly CaptureMatch[]>` (was `readonly Capture[]`). Pre-grouping per tree-sitter match is the provider's job — the extractor expects coherent matches, not flat captures. No consumers yet (all languages still on legacy DAG), so no breakage.
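
As a rough illustration of what "pre-grouping per match" means for a provider, the sketch below folds a flat capture list into one record per match. The `FlatCapture` shape, its `matchIndex` field, and the `groupByMatch` helper are hypothetical; the real `Capture` and `CaptureMatch` types may differ.

```ts
// Hypothetical flat capture: `matchIndex` says which query match produced it.
interface FlatCapture {
  matchIndex: number;
  name: string; // e.g. "@reference.call.member"
  text: string;
}

// Assumed match shape: one record per match, keyed by capture name.
type CaptureMatch = Readonly<Record<string, FlatCapture>>;

function groupByMatch(captures: readonly FlatCapture[]): CaptureMatch[] {
  const byMatch = new Map<number, Record<string, FlatCapture>>();
  for (const capture of captures) {
    let group = byMatch.get(capture.matchIndex);
    if (group === undefined) {
      group = {};
      byMatch.set(capture.matchIndex, group);
    }
    group[capture.name] = capture;
  }
  // Map iteration preserves insertion order, so matches come out in first-seen order.
  return Array.from(byMatch.values());
}

const grouped = groupByMatch([
  { matchIndex: 0, name: "@reference.call.member", text: "save" },
  { matchIndex: 0, name: "@reference.receiver", text: "user" },
  { matchIndex: 1, name: "@scope.function", text: "function f() {}" },
]);
```

Here the call name and its receiver end up in the same record, which is the coherence the extractor relies on.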

New CLI module (`gitnexus/`)

`src/core/ingestion/scope-extractor.ts` — single entry point:

```ts
extract(matches: readonly CaptureMatch[], filePath: string, provider: LanguageProvider): ParsedFile
```

The five passes (RFC §5.3)

| # | Topic | What it does |
|---|---|---|
| 1 | `@scope.*` | Build scope tree via range-containment parent derivation. Honors `provider.shouldCreateScope` (skip-but-reparent) + `provider.resolveScopeKind` (override default suffix mapping). `buildScopeTree` validates invariants. |
| 2 | `@declaration.*` | Attach `SymbolDefinition` + local `BindingRef` to innermost scope (or hoisted via `provider.bindingScopeFor`). Populates `Scope.ownedDefs` + `Scope.bindings`. |
| 3 | `@import.*` | Call `provider.interpretImport` per match. Attach raw `ParsedImport[]` to `ParsedFile` (finalize links in Phase 2). |
| 4 | `@type-binding.*` | Call `provider.interpretTypeBinding` per match. Write `TypeRef` into the host scope's `typeBindings`. |
| 5 | `@reference.*` | Emit one `ReferenceSite` per match. Call form from declarative sub-tag (`@reference.call.member`) or `provider.classifyCallForm`. |
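
Each pass consumes only the matches for its own topic, so the driver has to route matches by capture-name prefix first. The following is a minimal sketch of such routing, assuming a match is a record keyed by capture name and that all captures in one match share a topic; the helper names and types are illustrative, not the actual implementation.

```ts
type Topic = "scope" | "declaration" | "import" | "type-binding" | "reference";

const TOPICS: readonly Topic[] = ["scope", "declaration", "import", "type-binding", "reference"];

// Assumed match shape: capture name -> captured text.
type CaptureMatch = Readonly<Record<string, { text: string }>>;

// The topic is the first path segment of the capture name: "@import.source" -> "import".
function topicOf(match: CaptureMatch): Topic | undefined {
  for (const key of Object.keys(match)) {
    for (const topic of TOPICS) {
      if (key.startsWith("@" + topic + ".")) return topic;
    }
  }
  return undefined;
}

function partitionByTopic(matches: readonly CaptureMatch[]): Record<Topic, CaptureMatch[]> {
  const buckets: Record<Topic, CaptureMatch[]> = {
    scope: [],
    declaration: [],
    import: [],
    "type-binding": [],
    reference: [],
  };
  for (const match of matches) {
    const topic = topicOf(match);
    if (topic !== undefined) buckets[topic].push(match); // unknown topics are ignored here
  }
  return buckets;
}

const buckets = partitionByTopic([
  { "@scope.function": { text: "function f() {}" } },
  { "@reference.call.member": { text: "save" }, "@reference.receiver": { text: "user" } },
  { "@import.statement": { text: "import x from 'y'" }, "@import.source": { text: "'y'" } },
]);
```

One linear scan fills all five buckets, after which each pass can run over its own bucket independently.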

Design principles

  • Source-agnostic. Consumes `CaptureMatch[]` from providers; no `Tree` / `SyntaxNode` types leak. Works for tree-sitter providers and COBOL's regex tagger.
  • One AST walk per language. Providers do the walk inside `emitScopeCaptures`; this driver does zero traversal.
  • Invariants delegated. `ScopeTree.buildScopeTree` (from #912, RING2-SHARED-1: Scope / ScopeTree / PositionIndex in gitnexus-shared) enforces structural rules (non-Module has parent, parent contains child, siblings don't overlap). Malformed captures throw `ScopeTreeInvariantError`; the driver doesn't try to repair them.
  • Sub-tag whitelist. `@reference.receiver`, `@declaration.name`, `@import.source`, etc. are excluded from anchor selection so the broadest-range heuristic doesn't misidentify them. Bug surfaced in the end-to-end fixture test (member call with a wider-range receiver) and was fixed before commit.
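
The sub-tag exclusion can be sketched as follows. This assumes the anchor is the widest capture of the topic, that width is measured as line delta times 1,000,000 plus column delta, and that the set contains only the three sub-tags named above (the real list is longer). `pickAnchor` and the `Capture` shape are hypothetical names.

```ts
interface Capture {
  startLine: number;
  startCol: number;
  endLine: number;
  endCol: number;
  text: string;
}

type CaptureMatch = Readonly<Record<string, Capture>>;

// Only the sub-tags named in the PR description; the real set has more entries.
const KNOWN_SUB_TAGS: ReadonlySet<string> = new Set([
  "@reference.receiver",
  "@declaration.name",
  "@import.source",
]);

// Assumed width measure: lines dominate, columns break ties.
function spanOf(c: Capture): number {
  return (c.endLine - c.startLine) * 1_000_000 + (c.endCol - c.startCol);
}

function pickAnchor(match: CaptureMatch, topic: string): string | undefined {
  let bestName: string | undefined;
  let bestSpan = -Infinity;
  for (const name of Object.keys(match)) {
    if (!name.startsWith("@" + topic + ".")) continue;
    if (KNOWN_SUB_TAGS.has(name)) continue; // sub-tags never anchor their topic
    const width = spanOf(match[name]);
    if (width > bestSpan) {
      bestSpan = width;
      bestName = name;
    }
  }
  return bestName;
}

// Member call where the receiver capture is wider than the call-name capture.
const memberCall: CaptureMatch = {
  "@reference.call.member": { startLine: 1, startCol: 5, endLine: 1, endCol: 8, text: "save" },
  "@reference.receiver": { startLine: 1, startCol: 0, endLine: 1, endCol: 10, text: "user" },
};

const anchorName = pickAnchor(memberCall, "reference");
```

Without the `KNOWN_SUB_TAGS` check, the receiver (width 10) would beat the call name (width 3) and be treated as the anchor, which is the failure mode the PR describes.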

Tests (23, all passing)

Organized by pass so regressions localize:

  • Pass 1 (8): module-only · nested · deeply-nested · siblings · `shouldCreateScope === false` reparenting · `resolveScopeKind` override · overlap throws · no-Module throws
  • Pass 2 (4): Class attached to enclosing scope · `localDefs` populated · `bindingScopeFor` hoisting · unknown kinds dropped
  • Pass 3 (3): `interpretImport` produces `ParsedImport[]` · null returns dropped · no hook = no imports
  • Pass 4 (2): parameter-annotation TypeRef attached · null returns skipped
  • Pass 5 (5): call inScope anchor · `@reference.call.member` sub-tag · `classifyCallForm` fallback · all 6 reference kinds · arity parsing
  • End-to-end (1): multi-pass fixture exercising all 5 passes together

Verification

  • `tsc --noEmit` clean (both `gitnexus-shared` and `gitnexus`)
  • `gitnexus-shared` build clean
  • 23/23 new tests pass
  • Full scope-resolution / model / shadow suite: 285/285 pass

What's NOT in this PR

The extractor is callable in isolation (tests prove this) but not wired into the real pipeline yet. This deliberate isolation keeps #919's scope tight and lets the downstream tickets land as small, reviewable integrations.

Part of

…File (#919, RFC #909 Ring 2 PKG)

Kicks off Ring 2 PKG. Implements RFC §5.3 + §3.2 Phase 1: the central,
source-agnostic driver that turns a language provider's `CaptureMatch[]`
into a `ParsedFile` — the per-file artifact the finalize orchestrator
(#921) feeds into the shared `finalize()` algorithm (#915).

## Files

### New shared contracts
  - `gitnexus-shared/src/scope-resolution/parsed-file.ts`
    Per-file extraction artifact: scopes, parsedImports, localDefs,
    referenceSites. Structural superset of `FinalizeFile` so the
    finalize orchestrator threads `ParsedFile` through unchanged.
  - `gitnexus-shared/src/scope-resolution/reference-site.ts`
    Pre-resolution usage fact: name, atRange, inScope, kind, optional
    callForm/explicitReceiver/arity. Converted to `Reference` records
    by the resolution phase (populates `ReferenceIndex`).

### Ring 1 collateral tweak
  - `language-provider.ts: emitScopeCaptures` now returns
    `Promise<readonly CaptureMatch[]>` (was `readonly Capture[]`).
    Pre-grouping per tree-sitter match is the provider's job — the
    extractor expects coherent matches, not flat captures. No
    consumers yet (all languages still on legacy DAG), so no breakage.
    Docstring updated.

### New CLI module
  - `gitnexus/src/core/ingestion/scope-extractor.ts`
    Single entry point: `extract(matches, filePath, provider): ParsedFile`.
    Five-pass pipeline:

      Pass 1 — Build scope tree. `@scope.*` → `ScopeDraft[]` via
        range-containment parent derivation. Honors
        `provider.shouldCreateScope` (skip-but-reparent-children) and
        `provider.resolveScopeKind`. Throws `ScopeTreeInvariantError`
        via `buildScopeTree` on malformed input.

      Pass 2 — Attach declarations + local bindings. `@declaration.*`
        → `SymbolDefinition` + `BindingRef { origin: 'local' }`.
        Default attachment: innermost containing scope. Hoisting via
        `provider.bindingScopeFor`.

      Pass 3 — Collect raw imports. `@import.*` → `ParsedImport` via
        `provider.interpretImport`. Attached to ParsedFile
        (finalize resolves owning scope in Phase 2).

      Pass 4 — Collect type bindings. `@type-binding.*` →
        `TypeRef` via `provider.interpretTypeBinding` →
        `scope.typeBindings`. Hoistable via `bindingScopeFor`.

      Pass 5 — Collect reference sites. `@reference.*` →
        `ReferenceSite[]`. Call form from declarative sub-tag
        (`@reference.call.member`) or `provider.classifyCallForm`.

### Tests
  - `gitnexus/test/unit/scope-resolution/scope-extractor.test.ts`
    23 tests: 22 organized by pass plus one end-to-end fixture exercising
    all 5 passes together. MockProvider emits synthetic
    `CaptureMatch[]` with no AST, so the extractor is a pure function
    of those matches.

## Design notes

- **Source-agnostic.** No `Tree` / `SyntaxNode` types leak into the
  driver. Works for tree-sitter providers and COBOL's regex tagger.
- **One AST walk per language.** Providers do the walk inside
  `emitScopeCaptures`; this driver does zero traversal.
- **Invariants delegated.** `ScopeTree.buildScopeTree` enforces
  structural rules (non-Module has parent, parent contains child,
  siblings don't overlap). The extractor doesn't try to repair
  malformed captures.
- **Sub-tag whitelist.** `@reference.receiver`, `@declaration.name`,
  `@import.source`, etc. are known sub-tags — excluded from anchor
  selection so the broadest-range heuristic doesn't mis-identify them
  as anchors for their topic. Bug surfaced in the end-to-end fixture
  test (member call with a large-range receiver) and was fixed before
  commit.

## Verification

  - `tsc --noEmit` clean (both `gitnexus-shared` and `gitnexus`)
  - `gitnexus-shared` build clean
  - 23/23 new tests pass
  - Full scope-resolution / model / shadow suite: **285/285 pass**

## Closes part of #909. Unblocks
  - #920 parse-worker integration (emit ParsedFile from the worker)
  - #921 finalize orchestrator (consume ParsedFile[] workspace-wide)
  - #922 per-language import adapters
@vercel

vercel Bot commented Apr 18, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| gitnexus | Ready | Preview, Comment | Apr 18, 2026 6:14pm |


@magyargergo

Collaborator Author

@claude Act as a senior reviewer for GitNexus. Your job is to determine whether this PR is production-ready for this repo, not to give a generic code review.

You are reviewing a PR in the GitNexus monorepo:

  • gitnexus/ → CLI + MCP
  • gitnexus-web/
  • gitnexus-shared/

Your task has 2 phases, in this exact order:

PHASE 1 — DEFINE THE BAR
Before reviewing the diff, establish a concise repo-specific definition of “production-ready” for GitNexus, based only on the repo docs and the affected area.
Keep this definition practical and reviewable. Do not invent standards that are not grounded in the repo.

PHASE 2 — REVIEW THE PR AGAINST THAT BAR
Review the actual diff only after defining the bar.
Stay tightly scoped to the changed code and its direct consequences.


CONTEXT TO LOAD FIRST
Read these before reviewing:

  • AGENTS.md
  • GUARDRAILS.md
  • CONTRIBUTING.md
  • TESTING.md
  • ARCHITECTURE.md

Additional context:


PRIMARY OBJECTIVE
Decide whether this PR is safe, correct, maintainable, and operationally acceptable to merge into production for GitNexus.

Do not optimize for completeness at the expense of signal.
Do not pad the review.
Do not propose unrelated refactors.
Do not restate the PR description unless needed for verification.


REVIEW RULES

  • Every finding must be grounded in specific evidence from the diff or directly relevant surrounding code.
  • Every finding must include path:line.
  • If you make a behavioral claim, cite the code that proves it.
  • If you make a performance claim, explain the mechanism.
  • If something cannot be verified from the diff alone, explicitly say so.
  • Distinguish clearly between:
    • verified issue
    • plausible risk
    • unverified concern
  • Avoid vague wording like “might be better” or “could be improved” unless you explain exactly why.
  • Keep the review focused on this PR’s scope only.

For each finding, assign one severity:

  • BLOCKING → must be fixed before merge
  • NON-BLOCKING → valid issue, but merge may still be acceptable
  • NIT → stylistic/minor, not merge-relevant

REPO-SPECIFIC REVIEW CHECKLIST
Use these exact headings.

1. Correctness & functional completeness

Check:

  • Does the implementation actually satisfy the PR claim?
    • ManifestExtractor is truly invoked
    • config.links produces non-zero cross-links where expected
  • Resolver contracts are preserved:
    • resolveSymbol remains exact-match
    • label-scoped Cypher remains correct per contract type
    • flag any regression toward fuzzy or unscoped matching
  • Graph schema integrity is preserved:
    • no silent changes to node labels
    • no silent changes to edge types
    • no silent changes to ID generation (generateId)
  • Call out any missing wiring, partial integration, dead branch, or mismatch between tests and runtime behavior

2. Code clarity & clean code

Check:

  • naming quality
  • local cohesion
  • dead code
  • unnecessary abstraction
  • hidden control flow
  • confusing indirection
  • adherence to repo conventions:
    • direct imports from gitnexus-shared
    • no barrel re-export regressions
    • no // removed comments
    • no unused re-exports
  • no drive-by refactors outside stated scope per CONTRIBUTING.md and GUARDRAILS.md § Scope

3. Test coverage & change safety

Evaluate against TESTING.md:

  • Are there unit tests under gitnexus/test/unit/ covering the newly wired path?
  • Is there a regression guard for 0-link → N-link behavior?
  • Are assertions meaningful rather than tautological?
  • Are fixtures realistic for manifest inputs?
  • If memoization/cache was introduced, is there a test proving hit/miss behavior and correctness?
  • Is there evidence the expected validation path would pass for staged gitnexus/ files?
    • tsc --noEmit
    • vitest run --project default
      If not verifiable, say exactly what is missing.

4. Performance

Inspect for:

  • hot-path overhead in ingestion/group sync
  • excess allocations per manifest entry
  • redundant Cypher round-trips
  • missed batching or missed parallelism (Promise.all) where it materially matters
  • O(n²) or repeated lookup patterns on large repos
  • memoization tradeoffs:
    • correctness
    • invalidation
    • bounded vs unbounded memory growth
      Do not speculate casually; explain the mechanism and likely impact.

5. Operational risk

Check:

  • Windows/cross-platform safety:
    • stream lifecycle
    • FD/file handle lifecycle
    • path separator assumptions
    • anything resembling prior ENOTEMPTY-style lifecycle regressions
  • LadybugDB single-writer invariant is preserved
  • Embeddings preservation:
    • no silent breakage of --embeddings
    • .gitnexus/meta.json.stats.embeddings not silently zeroed by changed paths
  • MCP contracts remain compatible:
    • group_*
    • query
    • context
    • impact
    • detect_changes
    • rename
    • cypher
      Flag any schema or contract break without migration note
  • staleness behavior still triggers correctly (gitnexus/src/mcp/staleness.ts)
  • rollback safety:
    • can this PR be reverted safely without re-analyze?
    • if not, explain why

6. Maintainability

Check:

  • Does the change respect Pipeline Phase DAG boundaries?
  • Is cross-phase coupling being introduced?
  • Are manifest-extractor entry points discoverable from group/sync.ts?
  • Would the next contributor know where to extend this?
  • Is shared logic placed appropriately:
    • in gitnexus-shared/
    • or gitnexus/src/core/**/utils.ts
    • and not duplicated between CLI and web bridge (server/api.ts)?

NON-NEGOTIABLES TO VERIFY
From GUARDRAILS.md / CONTRIBUTING.md:

  • no secrets, tokens, or machine-specific paths committed
  • no blind find-and-replace renames
  • if a symbol rename occurred, rename MCP tool or equivalent careful review was used
  • upstream impact was considered for modified shared symbols
  • no HIGH/CRITICAL impact left unaddressed
  • PR title follows conventional commit style + [area] Short description
  • PolyForm Noncommercial license headers/metadata untouched

OUTPUT FORMAT

Production-ready bar for this PR

Give a concise repo-specific definition in 5–10 bullets.

Findings

For each finding, use this structure exactly:

[BLOCKING|NON-BLOCKING|NIT] Short title

  • Evidence: path:line, path:line
  • Why it matters: ...
  • Recommendation: ...
  • Confidence: High / Medium / Low

Only include findings that are real and supported.
If there are no findings in a section, say:

  • No issues found in this area based on the diff and reviewed context.

What I could not fully verify

List any behavior you could not prove from the diff, tests, or available context.

Final verdict

Choose exactly one:

  • ✅ Ready to merge — no blocking issues.
  • ⚠️ Merge after addressing blocking items.
  • ❌ Not ready — fundamental concerns.

If verdict is not ready, list the blocking items in priority order.


IMPORTANT REVIEW BEHAVIOR

  • Be strict on correctness and contracts.
  • Be conservative about operational risk.
  • Be explicit about uncertainty.
  • Do not suggest unrelated refactors.
  • Do not turn nits into blockers.
  • Do not reward intent; review actual implementation.

You must also ensure we are on track with the RFC plan! Check the issue tickets involved!

@github-actions

github-actions Bot commented Apr 18, 2026

Contributor

Claude finished @magyargergo's task in 4m 39s


PR Review: feat(ingestion): ScopeExtractor driver (#965)

Tasks

  • Post initial comment
  • Read repo docs (AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, ARCHITECTURE.md)
  • Read changed files (scope-extractor.ts, parsed-file.ts, reference-site.ts, language-provider.ts, index.ts, test file)
  • Check related issues and RFC track alignment
  • Phase 1: Define production-ready bar
  • Phase 2: Review diff against bar
  • Post final verdict

Production-ready bar for this PR

Based on AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, and ARCHITECTURE.md:

  • TypeScript-clean: tsc --noEmit must pass in both gitnexus-shared/ and gitnexus/. PR claims both pass; inspection of the diff shows no obvious type gaps.
  • Test coverage: New behavior must have unit tests under gitnexus/test/unit/. Pass-by-pass isolation is expected (TESTING.md § Unit). 23 tests are claimed; all passing.
  • Pipeline DAG boundaries respected: New code in gitnexus/src/core/ingestion/ must not name languages or cross phase boundaries. The extractor is source-agnostic by construction; language behavior is behind hooks.
  • No language-specific logic in shared ingestion code: AGENTS.md "Never Do" — verified against scope-extractor.ts (no language names, no direct language imports).
  • Shared symbols in gitnexus-shared/ only: ParsedFile and ReferenceSite correctly land in gitnexus-shared/src/scope-resolution/ and are type-exported in index.ts.
  • Interface changes to LanguageProvider have no live consumers: The emitScopeCaptures return-type change must not break existing providers. Verified: no language file in gitnexus/src/core/ingestion/languages/ implements this hook.
  • ParsedFile is a structural superset of FinalizeFile: Required for the finalize orchestrator (#921) to thread through unchanged.
  • Invariant enforcement delegated correctly: ScopeTreeInvariantError propagates up rather than being silently swallowed; malformed provider output surfaces as a bug, not a data condition.
  • No secrets, tokens, or machine-specific paths.
  • PR title follows conventional-commit format.

Findings

1. Correctness & functional completeness

No issues found in the extractor's core logic. The five-pass pipeline matches RFC §5.3 exactly:

  • Pass 1 builds the scope tree via range-containment parent derivation, consulting shouldCreateScope and resolveScopeKind. Sort order (startLine ASC, endLine DESC) is correct for parent-before-child processing.
  • Pass 2 attaches SymbolDefinition + BindingRef{origin:'local'} to the correct scopes; bindingScopeFor hoisting is honored.
  • Pass 3 delegates entirely to interpretImport; null drops are handled.
  • Pass 4 attaches TypeRef to innermost scope with bindingScopeFor override support.
  • Pass 5 emits ReferenceSite with declarative sub-tag priority over hook-based classification.
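
The Pass 5 priority rule (declarative sub-tag first, provider hook second) can be illustrated with a small sketch. `resolveCallForm` is a hypothetical helper, and representing call forms as plain strings is an assumption.

```ts
// Hypothetical helper: the name and the string-typed call form are assumptions.
const CALL_PREFIX = "@reference.call.";

function resolveCallForm(
  anchorName: string,
  classify?: () => string | undefined,
): string | undefined {
  // A declarative sub-tag such as "@reference.call.member" wins outright.
  if (anchorName.startsWith(CALL_PREFIX) && anchorName.length > CALL_PREFIX.length) {
    return anchorName.slice(CALL_PREFIX.length);
  }
  // Otherwise defer to the provider hook, if one was supplied.
  return classify ? classify() : undefined;
}

const fromSubTag = resolveCallForm("@reference.call.member", () => "free");
const fromHook = resolveCallForm("@reference.call", () => "free");
const unclassified = resolveCallForm("@reference.call");
```

In the first call the hook is never consulted, so a provider that tags call forms directly in its query does not need to implement `classifyCallForm` at all.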

The CaptureMatch-based return type for emitScopeCaptures is the correct design: providers pre-group per tree-sitter match; the extractor routes by anchor topic. Verified no existing provider implements this hook (grep over gitnexus/src/core/ingestion/languages/ is clean).

ParsedFile shape is a structural superset of FinalizeFile (filePath, moduleScope, parsedImports, localDefs) as claimed — the finalize orchestrator (#921) can pick those four fields without adaptation.


2. Code clarity & clean code

NON-BLOCKING — ownerDefIdFor is a permanently-returning-undefined stub
  • Evidence: scope-extractor.ts:445–453
      function ownerDefIdFor(innermost: ScopeDraft, drafts: readonly ScopeDraft[]): string | undefined {
        void innermost;
        void drafts;
        return undefined;
      }
  • Why it matters: This function is called when a declaration lands inside a Class or Namespace scope (line 418: ownerId: ownerDefIdFor(innermost, drafts)). The resulting SymbolDefinition always has ownerId: undefined. The MethodDispatch index from #917 (RING2-SHARED-6: ClassRegistry / MethodRegistry / FieldRegistry + evidence composition) keys off def.ownerId for method lookup. Providers can set ownerId at their interpreter hooks, but the extractor never fills it in, so a future Ring 3 language that relies on the extractor to propagate ownerId for cross-scope method dispatch will silently get undefined.
  • Recommendation: Either document this as an explicit deferred contract in the PR/issue (Ring 3 providers must set def.ownerId directly in their declaration captures), or remove the dead code path since ownerId: undefined is the same as omitting ownerId from the spread. The void innermost; void drafts; suppressors signal that the implementation is incomplete. A // TODO(#921): ownerId populated by class-registry during finalize comment would make the deferral explicit.
  • Confidence: High
NIT — Inconsistent filePath sourcing between passes 4 and 5
  • Evidence: scope-extractor.ts:593 (Pass 4: drafts[0]!.filePath), scope-extractor.ts:635 (Pass 5: anyFilePathFromScopeTree(scopeTree))
  • Why it matters: Both are equivalent: all drafts and all scopes in a single-file extraction share the same filePath. But using two different patterns for the same operation adds cognitive load and makes anyFilePathFromScopeTree look like it's solving a problem that doesn't exist.
  • Recommendation: Pick one convention and use it in both passes. filePath is already a parameter to extract() — passing it down directly is the simplest approach.
  • Confidence: High
NIT — scopeTree in passes 2–5 contains pre-mutation scope snapshots
  • Evidence: scope-extractor.ts:102–106. scopes = scopeDrafts.map(draftToScope) is called before any pass 2–5 mutations, so scopeTree.getScope(id) returns scopes with empty bindings, empty ownedDefs, empty typeBindings.
  • Why it matters: classifyCallForm(captures, enclosingScope) receives this pre-pass snapshot. If any Ring 3 language implements classifyCallForm relying on enclosingScope.bindings to decide call form (e.g., checking whether the callee name is locally bound), it will always see empty bindings and produce wrong classifications.
  • Recommendation: Add a code comment near the scopeTree construction explaining that it is a structural-only snapshot (parent/range queries) and that binding data is not available inside hook calls during extraction. This prevents a future Ring 3 implementor from writing a classifyCallForm that reads bindings.
  • Confidence: Medium (no current implementers; issue arises only at Ring 3)
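
The snapshot hazard described in this finding can be reproduced in isolation. The sketch below assumes that converting a draft to a scope copies the binding list at call time; the `ScopeDraft` and `Scope` shapes here are simplified stand-ins, not the real types.

```ts
// Simplified stand-ins for the real draft and scope types.
interface ScopeDraft {
  id: string;
  bindings: string[];
}

interface Scope {
  readonly id: string;
  readonly bindings: readonly string[];
}

// Assumed behavior: the conversion copies the bindings as they are right now.
function draftToScope(draft: ScopeDraft): Scope {
  return { id: draft.id, bindings: draft.bindings.slice() };
}

const draft: ScopeDraft = { id: "scope:fn", bindings: [] };

// Snapshot taken up front, as for the structural scope tree.
const earlySnapshot = draftToScope(draft);

// A later pass attaches a local binding to the draft.
draft.bindings.push("user");

// Converting again after the mutation sees the binding.
const finalScope = draftToScope(draft);
```

A hook that receives `earlySnapshot` sees no bindings even though `finalScope` has one, which is why reading `enclosingScope.bindings` inside a hook would be unreliable under this assumption.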

3. Test coverage & change safety

Broadly solid. 23 tests, organized per-pass, with an end-to-end fixture that exercises all five passes together.

NON-BLOCKING — No dedicated regression for the anchor-vs-receiver bug
  • Evidence: PR description: "Bug surfaced in the end-to-end fixture test (member call with a wider-range receiver) and was fixed before commit." The KNOWN_SUB_TAGS set at scope-extractor.ts:765 explicitly excludes @reference.receiver. The end-to-end fixture uses a @reference.receiver capture (scope-extractor.test.ts:473) but doesn't assert that the receiver wasn't picked as the anchor.
  • Why it matters: The bug was: a @reference.call.member match where @reference.receiver had a wider range than the call anchor, causing anchorCaptureFor to pick the receiver as the anchor. The fix is KNOWN_SUB_TAGS.has(name) exclusion. Without an isolated regression test, the fix is correct but the failure mode is not guarded.
  • Recommendation: Add a targeted test: a @reference.call.member match where @reference.receiver spans columns 0–10 and the call name spans 5–8 (receiver is wider), asserting that referenceSites[0].name is the call name ('save'), not the receiver text.
  • Confidence: High
NIT — mockProvider uses as unknown as LanguageProvider
  • Evidence: scope-extractor.test.ts:109
  • Why it matters: Required fields (isBuiltInName, importSemantics, mroStrategy, heritageDefaultEdge) are absent. Any future change to extract() that reads a required field from provider would silently get undefined in all tests. The double cast through unknown bypasses type checking, so tests would pass even with wrong behavior.
  • Recommendation: Build a minimal valid LanguageProvider via defineLanguage() with a stub config and then override hooks via object spread. This is slightly more boilerplate but ensures required-field coverage automatically evolves with the interface.
  • Confidence: Medium
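
One way to avoid the double cast is a typed stub factory, sketched below. The `ProviderLike` interface is a hypothetical stand-in: the four required field names come from this finding, but their types, the `"stub"` placeholder values, and the `makeStubProvider` helper are assumptions, and the repo's own `defineLanguage()` is not used here because its signature is not shown.

```ts
// Hypothetical stand-in for LanguageProvider: field names from the finding, types assumed.
interface ProviderLike {
  isBuiltInName: (name: string) => boolean;
  importSemantics: string;
  mroStrategy: string;
  heritageDefaultEdge: string;
  // Optional hooks a test may override.
  shouldCreateScope?: (captureName: string) => boolean;
  classifyCallForm?: () => string | undefined;
}

function makeStubProvider(overrides: Partial<ProviderLike> = {}): ProviderLike {
  return {
    // Placeholder values only; they carry no real semantics.
    isBuiltInName: () => false,
    importSemantics: "stub",
    mroStrategy: "stub",
    heritageDefaultEdge: "stub",
    ...overrides,
  };
}

const stubProvider = makeStubProvider({ classifyCallForm: () => "member" });
```

Because the return type is checked, adding a new required field to the interface makes `makeStubProvider` fail to compile until the stub is updated, instead of leaving tests to read `undefined` at runtime.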

4. Performance

No issues. The five-pass design is O(N_matches) per pass with small constants:

  • partitionByTopic: one linear scan over all matches, Object.keys() per match (typically 2–5 keys).
  • anchorCaptureFor: O(K) per match where K = keys per match.
  • draftToScope is called twice (lines 102, 146) — O(N_scopes) total, negligible.
  • draftById maps in passes 2 and 4 are O(N_scopes) to build, not shared between passes — minor redundancy but not hot-path.
  • No Cypher round-trips (extractor is pure in-memory).
  • No unbounded memoization.

The (endLine - startLine) * 1_000_000 span heuristic in anchorCaptureFor (line 797) is safe for realistic files (<1M lines); no overflow concern in practice.
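
A quick arithmetic check of that bound, assuming the heuristic has the form line delta times 1,000,000 plus column delta (only the multiplication is quoted above; the column term and the `spanOf` helper are assumptions):

```ts
// Assumed form of the heuristic: line delta dominates, column delta breaks ties.
const LINE_WEIGHT = 1_000_000;

function spanOf(startLine: number, startCol: number, endLine: number, endCol: number): number {
  return (endLine - startLine) * LINE_WEIGHT + (endCol - startCol);
}

// Largest line delta whose weighted value is still an exactly representable integer.
const maxSafeLineDelta = Math.floor(Number.MAX_SAFE_INTEGER / LINE_WEIGHT);

// A single-line capture 999,999 columns wide still ranks below a capture spanning two lines.
const wideSingleLine = spanOf(10, 0, 10, 999_999);
const narrowTwoLines = spanOf(10, 0, 11, 1);
```

Under this assumed formula the weighted value stays an exact integer up to a line delta of about 9 billion, so the concern only arises far beyond any realistic file size.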


5. Operational risk

  • No pipeline wiring. extract() is callable but not invoked from any phase. parse-worker.ts, pipeline.ts, pipeline-phases/parse.ts are untouched. The pipeline produces the same output as before this PR.
  • LadybugDB single-writer invariant: Not implicated. No DB writes.
  • MCP contracts: Unchanged. No tools.ts, resources.ts, or schema.ts touches.
  • Embeddings: Not affected. No run-analyze.ts changes.
  • Staleness behavior: Unchanged.
  • Rollback safety: Clean. extract() is not called from any live path; reverting this PR requires no re-analyze.
  • Cross-platform: No file I/O, no path manipulation, no stream lifecycle. Pure TypeScript computation.
  • emitScopeCaptures return type break: No existing language provider implements this hook (grep confirms zero matches in languages/). Change is safe.

6. Maintainability

Strong. The change respects Pipeline Phase DAG boundaries precisely:

| Concern | Assessment |
|---|---|
| Language-specific logic in shared ingestion | None. Every language decision is behind an optional hook. |
| Cross-phase coupling | None. extract() takes captures, returns ParsedFile; no Phase dependency. |
| ParsedFile/ReferenceSite placement | Correct: gitnexus-shared/src/scope-resolution/. |
| index.ts exports | Type-only exports; no value re-export regressions. No barrel anti-patterns. |
| Drive-by refactors | None found. The language-provider.ts change is strictly within scope (updated return type + JSDoc). |
| KNOWN_SUB_TAGS extensibility | One-line addition per new sub-tag convention; well-documented. |
| Next contributor discoverability | extract() is the clear entry point. Five-pass structure mirrors the RFC section numbering. Tests are organized by pass for easy regression localization. |

The deliberate isolation strategy (extractor is callable but not wired to the pipeline) is architecturally sound and matches the Ring 2 PKG ticket breakdown.


Non-negotiables

| Check | Result |
|---|---|
| No secrets/tokens/machine-specific paths | |
| No blind find-and-replace renames | ✅ (Capture → CaptureMatch is a targeted return-type change on an un-implemented hook) |
| emitScopeCaptures change: upstream impact considered | ✅ Zero consumers confirmed |
| No HIGH/CRITICAL impact unaddressed | ✅ (no consumers of changed symbol) |
| PR title follows conventional commit | feat(ingestion): ... |
| PolyForm license untouched | |

RFC Plan alignment

Ring 2 SHARED is fully landed: #912 (ScopeTree + PositionIndex), #915 (finalize), #917 (registries), #918 (shadow). This PR (#919) is the correct next step as Ring 2 PKG kickoff. The remaining Ring 2 PKG work (#920 parse-worker, #921 finalize orchestrator, #922 import adapters, #923 shadow harness) is correctly deferred. On track.


What I could not fully verify

  • tsc --noEmit pass for both packages — claimed in the PR description, not locally runnable in this review context. The diff is type-clean on inspection.
  • vitest run 285/285 — claimed, not re-runnable here. Test logic appears sound.
  • buildScopeTree and buildPositionIndex invariant behavior under the exact draft snapshots used in tests: these are Ring 2 SHARED (#912, RING2-SHARED-1: Scope / ScopeTree / PositionIndex in gitnexus-shared) and assumed correct per prior review of that PR.
  • ScopeTree.getScope behavior when scopeTree.byId contains pre-mutation snapshots — correct structurally, but the interaction with classifyCallForm implementations is only verifiable once Ring 3 providers are written.

Final verdict

Ready to merge — no blocking issues.

The extractor is correctly implemented, source-agnostic, invariant-enforcing, and well-tested for its stated scope. The two NON-BLOCKING findings are real but not merge-blockers for a deliberately-unwired driver:

  1. ownerDefIdFor stub: should be tracked in a follow-up issue or explicitly noted in #920/#921 as a Ring 3 provider responsibility.
  2. Missing regression test for receiver-wider-than-anchor: the bug was fixed but the specific failure mode isn't guarded against regression.

@github-actions

github-actions Bot commented Apr 18, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
|---|---|---|
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
|---|---|---|---|---|
| 6821 | 6724 | 0 | 97 | 254s |

✅ All 6724 tests passed

97 test(s) skipped:
  • Swift MethodExtractor > isTypeDeclaration > recognizes class_declaration
  • Swift MethodExtractor > isTypeDeclaration > recognizes protocol_declaration
  • Swift MethodExtractor > isTypeDeclaration > rejects import_declaration
  • Swift MethodExtractor > visibility > extracts public method
  • Swift MethodExtractor > visibility > extracts private method
  • Swift MethodExtractor > visibility > defaults to internal when no modifier
  • Swift MethodExtractor > protocol methods > marks protocol method as abstract
  • Swift MethodExtractor > static and class methods > detects static func as isStatic
  • Swift MethodExtractor > static and class methods > detects class func as isStatic
  • Swift MethodExtractor > parameters > extracts parameters with types and default values
  • Swift MethodExtractor > return type > extracts return type from -> annotation
  • Swift MethodExtractor > annotations > extracts @objc attribute
  • Swift MethodExtractor > isFinal > detects final func
  • Swift MethodExtractor > isFinal > is false when not final
  • Swift MethodExtractor > isAsync > detects async func
  • Swift MethodExtractor > isOverride > detects override method
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift explicit init inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference does not bind plain functions
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
  • Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
  • Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
  • Swift self resolution > detects User and Repo classes, each with a save function
  • Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
  • Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
  • Swift parent resolution > emits EXTENDS edge: User → BaseModel
  • Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
  • Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
  • Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
  • Swift return type inference > detects User class and getUser function
  • Swift return type inference > detects save function on User (Swift class methods are Function nodes)
  • Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
  • Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
  • Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
  • Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()
  • Swift implicit imports (cross-file visibility) > detects UserService class in Models.swift
  • Swift implicit imports (cross-file visibility) > resolves UserService() constructor call across files (no explicit import)
  • Swift implicit imports (cross-file visibility) > resolves service.fetchUser() member call across files
  • Swift implicit imports (cross-file visibility) > creates IMPORTS edges between files in the same module
  • Swift extension deduplication > detects Product class
  • Swift extension deduplication > resolves Product() constructor despite extension creating duplicate class node
  • Swift extension deduplication > resolves product.save() to Product.swift (primary definition)
  • Swift constructor call fallback (no new keyword) > resolves OCRService() as constructor call across files
  • Swift constructor call fallback (no new keyword) > resolves ocr.recognize() member call via constructor-inferred type
  • Swift export visibility (internal vs private) > resolves PublicService() constructor across files
  • Swift export visibility (internal vs private) > resolves internalHelper() across files (internal = module-scoped)
  • Swift if let / guard let binding resolution > detects User and Repo classes
  • Swift if let / guard let binding resolution > resolves user.save() inside if-let to User#save
  • Swift if let / guard let binding resolution > resolves repo.save() inside guard-let to Repo#save
  • Swift if let / guard let binding resolution > user.save() in if-let does NOT resolve to Repo#save
  • Swift await / try expression unwrapping > resolves user.save() via await fetchUser() return type
  • Swift await / try expression unwrapping > resolves repo.save() via try parseRepo() return type
  • Swift await / try expression unwrapping > detects fetchUser and parseRepo as functions
  • Swift for-in loop element type inference > detects User and Repo classes
  • Swift for-in loop element type inference > creates implicit import edges between files
  • Swift field-type resolution > detects classes and their properties
  • Swift field-type resolution > emits HAS_PROPERTY edges from class to field
  • Swift field-type resolution > resolves field-chain call user.address.save() → Address#save
  • Swift field-type resolution > emits ACCESSES edges for field reads in chains
  • Swift field-type resolution > populates field metadata (visibility, declaredType) on Property nodes
  • Swift call-result binding > resolves call-result-bound method call user.save() → User#save
  • Swift call-result binding > getUser() is present as a defined function
  • Swift call-result binding > emits processUser -> getUser CALLS edge for let-assigned free function call
  • Swift method enrichment > detects Animal protocol and Dog class
  • Swift method enrichment > emits IMPLEMENTS edge Dog -> Animal
  • Swift method enrichment > emits HAS_METHOD edges for Dog methods
  • Swift method enrichment > marks protocol Animal.speak as isAbstract
  • Swift method enrichment > marks Dog.speak as NOT isAbstract
  • Swift method enrichment > marks breathe as isFinal
  • Swift method enrichment > marks classify as isStatic
  • Swift method enrichment > captures @objc annotation on breathe
  • Swift method enrichment > populates parameterTypes for classify(_ name: String)
  • Swift method enrichment > records parameterCount for classify
  • Swift method enrichment > records returnType for speak
  • Swift method enrichment > resolves dog.speak() CALLS edge
  • Swift method enrichment > resolves Dog.classify("dog") CALLS edge
  • Swift abstract dispatch > detects Repository protocol and SqlRepository class
  • Swift abstract dispatch > emits IMPLEMENTS edge SqlRepository -> Repository
  • Swift abstract dispatch > emits HAS_METHOD edges for Repository.find and Repository.save
  • Swift abstract dispatch > emits HAS_METHOD edges for SqlRepository.find and SqlRepository.save
  • Swift abstract dispatch > marks base Repository.find as isAbstract
  • Swift abstract dispatch > marks base Repository.save as isAbstract
  • Swift abstract dispatch > marks concrete SqlRepository.find as NOT isAbstract
  • Swift abstract dispatch > resolves repo.find(id: 42) CALLS edge
  • Swift abstract dispatch > resolves repo.save(entity: user) CALLS edge
  • Swift abstract dispatch > populates parameterTypes for Repository.find
  • Swift abstract dispatch > populates parameterTypes for Repository.save
  • Swift abstract dispatch > records returnType for SqlRepository.find
  • Swift abstract dispatch > emits METHOD_IMPLEMENTS edges from SqlRepository methods → Repository protocol methods
  • Swift overloaded method disambiguation > detects 2 distinct find Method nodes on SqlRepository
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edges for both find overloads
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edge for save
  • Swift overloaded method disambiguation > emits exactly 3 METHOD_IMPLEMENTS edges total
  • Swift Child extends Parent — inherited method resolution (SM-9) > detects Parent and Child classes
  • Swift Child extends Parent — inherited method resolution (SM-9) > resolves c.parentMethod() to Parent.parentMethod via first-wins MRO walk

Code Coverage

Tests

| Metric | Coverage | Covered | Base | Delta | Status |
| --- | --- | --- | --- | --- | --- |
| Statements | 73.37% | 18141/24724 | 73.32% | 📈 +0.1 | 🟢 ██████████████░░░░░░ |
| Branches | 62.31% | 11512/18473 | 62.32% | 📉 -0.0 | 🔴 ████████████░░░░░░░░ |
| Functions | 78.07% | 1724/2208 | 77.97% | 📈 +0.1 | 🟢 ███████████████░░░░░ |
| Lines | 75.94% | 16443/21650 | 75.87% | 📈 +0.1 | 🟢 ███████████████░░░░░ |


Addresses all 5 items from the PR #965 review in-PR.

## Structural changes

- **Extract `ScopeExtractorHooks` as the narrow dependency surface.**
  The extractor now declares its dependency on a `Pick`-narrowed subset
  of `LanguageProvider` (just the 6 scope-resolution hooks it actually
  reads). Test mocks implement exactly that interface — no more
  `as unknown as LanguageProvider` cast hiding missing-field bugs.
  Adding a new hook read becomes a compile error, not a silent test
  pass. (Finding 3.2)

- **Remove dead `ownerDefIdFor` stub + `isOwnerKind` helper.** The
  function always returned `undefined` with `void innermost; void
  drafts;` suppressors — an incomplete-implementation signal. The code
  path was also misleading: creating a clone of the def with
  `ownerId: undefined` is structurally identical to keeping the
  original. Pass 2 now keeps the def as-is. Contract is documented in
  a code comment: providers that need `ownerId` set it from their
  declaration hook; `finalize` (via #914 `MethodDispatchIndex`) fills
  in method/field `ownerId` in a post-extraction pass that has full
  def visibility. (Finding 2.1)

- **Standardize `filePath` threading across passes 4 and 5.** Pass 4
  was reading `drafts[0]!.filePath`; pass 5 was reading
  `anyFilePathFromScopeTree(scopeTree)`. The two were equivalent but
  inconsistent. Both passes now take `filePath` as a parameter from the
  top-level `extract()` call, and the `anyFilePathFromScopeTree` helper
  is removed. (Finding 2.2)
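
The `Pick`-narrowing described in the first bullet can be sketched roughly as follows. The six hook names are the ones this PR mentions; the `LanguageProvider` shape, every signature, and the `countScopes` consumer are simplified assumptions for illustration, not the real contracts.

```typescript
// Hypothetical, heavily simplified stand-in for the real LanguageProvider.
// Hook names come from the PR text; all signatures are assumptions.
interface LanguageProvider {
  id: string;
  extensions: readonly string[];
  shouldCreateScope(tag: string): boolean;
  resolveScopeKind(tag: string): string | undefined;
  bindingScopeFor(declTag: string, innermostScopeId: string): string | undefined;
  interpretImport(text: string): { source: string } | undefined;
  interpretTypeBinding(text: string): { typeName: string } | undefined;
  classifyCallForm(text: string): 'free' | 'member' | 'constructor' | undefined;
}

// The narrow dependency surface: only the hooks the extractor reads.
type ScopeExtractorHooks = Pick<
  LanguageProvider,
  | 'shouldCreateScope'
  | 'resolveScopeKind'
  | 'bindingScopeFor'
  | 'interpretImport'
  | 'interpretTypeBinding'
  | 'classifyCallForm'
>;

// Hypothetical consumer standing in for extract(): it can only touch
// members that exist on ScopeExtractorHooks.
function countScopes(tags: readonly string[], hooks: ScopeExtractorHooks): number {
  return tags.filter((tag) => hooks.shouldCreateScope(tag)).length;
}

// A mock typed exactly as the narrow interface, with no double-cast.
const mockHooks: ScopeExtractorHooks = {
  shouldCreateScope: (tag) => tag !== '@scope.block',
  resolveScopeKind: () => undefined,
  bindingScopeFor: () => undefined,
  interpretImport: () => undefined,
  interpretTypeBinding: () => undefined,
  classifyCallForm: () => undefined,
};

const created = countScopes(
  ['@scope.module', '@scope.block', '@scope.function'],
  mockHooks,
);
```

In this sketch, a consumer that tried to read `hooks.id` would fail to compile, because `id` is not part of `ScopeExtractorHooks`; a mock that omitted one of the six hooks would fail the same way.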

## Documentation

- **Snapshot-semantics comment on `scopeTree` + `positionIndex`.** The
  hooks called during Passes 2-5 receive a `scopeTree` built BEFORE any
  bindings/ownedDefs/typeBindings were written. Hooks MUST NOT rely on
  `scope.bindings` etc. being populated — they're for parent/range/kind
  queries only. Added a doc block at the `scopeTree`/`positionIndex`
  construction site so future Ring 3 implementers don't write a
  `classifyCallForm` that reads bindings. (Finding 2.3)
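
As an illustration of a hook that stays within the snapshot contract, the sketch below answers a question using only parent, range, and kind data. The `ScopeSnapshot` shape and the `enclosingScopeOfKind` helper are hypothetical and do not mirror the real `scopeTree` API.

```typescript
// Hypothetical read-only view of a scope as hooks would see it during
// Passes 2-5: structure only, no bindings / ownedDefs / typeBindings.
type ScopeKind = 'Module' | 'Class' | 'Function' | 'Block';

interface Range { start: number; end: number }

interface ScopeSnapshot {
  readonly id: string;
  readonly kind: ScopeKind;
  readonly range: Range;
  readonly parentId: string | undefined;
}

type ScopeTreeSnapshot = ReadonlyMap<string, ScopeSnapshot>;

// Walks parent links upward and returns the nearest scope of the given kind.
// It never inspects binding data, so it is safe against a tree that was
// built before any bindings were written.
function enclosingScopeOfKind(
  tree: ScopeTreeSnapshot,
  startId: string,
  kind: ScopeKind,
): ScopeSnapshot | undefined {
  let current = tree.get(startId);
  while (current !== undefined) {
    if (current.kind === kind) return current;
    current = current.parentId === undefined ? undefined : tree.get(current.parentId);
  }
  return undefined;
}

const tree: ScopeTreeSnapshot = new Map<string, ScopeSnapshot>([
  ['m', { id: 'm', kind: 'Module', range: { start: 0, end: 100 }, parentId: undefined }],
  ['c', { id: 'c', kind: 'Class', range: { start: 10, end: 90 }, parentId: 'm' }],
  ['f', { id: 'f', kind: 'Function', range: { start: 20, end: 40 }, parentId: 'c' }],
]);

const owner = enclosingScopeOfKind(tree, 'f', 'Class');
```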

## Tests

- **Regression for the anchor-vs-receiver bug** (Finding 3.1): a
  member-call match where `@reference.receiver` spans columns 0-10
  (wider) and the call name spans 11-15 (narrower). Without the
  `KNOWN_SUB_TAGS` exclusion, the broadest-range heuristic would have
  picked the receiver; the test pins that the call name is the one
  that ends up in `referenceSites[0].name`.

- **Mock provider now types exactly `ScopeExtractorHooks`**, no more
  double-cast. Any future hook added to `extract()` that isn't in
  `ScopeExtractorHooks` is a compile error.
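
The shape of that regression can be sketched as below. The `Capture` type, the `pickAnchor` helper, and the exact contents of `KNOWN_SUB_TAGS` are assumptions made for illustration; only the tag names and column ranges are taken from this PR.

```typescript
// Hypothetical flat capture: a tag plus a single-line column range.
interface Capture {
  tag: string;
  startCol: number;
  endCol: number;
  text: string;
}

// Assumed contents; the PR lists these three sub-tags followed by "etc.".
const KNOWN_SUB_TAGS: ReadonlySet<string> = new Set([
  '@reference.receiver',
  '@declaration.name',
  '@import.source',
]);

const width = (c: Capture): number => c.endCol - c.startCol;

// Broadest-range anchor selection for one topic. When excludeSubTags is
// false this reproduces the bug: a wide sub-tag can win over the anchor.
function pickAnchor(
  captures: readonly Capture[],
  topic: string,
  excludeSubTags: boolean,
): Capture | undefined {
  const prefix = `@${topic}.`;
  let best: Capture | undefined;
  for (const c of captures) {
    if (!c.tag.startsWith(prefix)) continue;
    if (excludeSubTags && KNOWN_SUB_TAGS.has(c.tag)) continue;
    if (best === undefined || width(c) > width(best)) best = c;
  }
  return best;
}

// Member call: receiver spans columns 0-10, call name spans 11-15.
const memberCall: Capture[] = [
  { tag: '@reference.receiver', startCol: 0, endCol: 10, text: 'repository' },
  { tag: '@reference.call.member', startCol: 11, endCol: 15, text: 'save' },
];

const withoutExclusion = pickAnchor(memberCall, 'reference', false);
const withExclusion = pickAnchor(memberCall, 'reference', true);
```

Without the exclusion the wider receiver capture is selected; with it, the narrower call-name capture is the anchor, which is the behavior the regression test pins.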

## Verification

- `tsc --noEmit` clean in both `gitnexus-shared` and `gitnexus`
- `gitnexus-shared` build clean
- 24/24 scope-extractor tests pass (+1 regression)
- Full scope-resolution / model / shadow suite: **286/286 pass**
@magyargergo magyargergo merged commit c6a291d into main Apr 18, 2026
14 checks passed
@magyargergo magyargergo deleted the rfc/scope-resolution/919-scope-extractor branch April 18, 2026 18:29
github714801013 pushed a commit to github714801013/GitNexus that referenced this pull request Apr 28, 2026
…File (abhigyanpatwari#919, RFC abhigyanpatwari#909 Ring 2 PKG) (abhigyanpatwari#965)

* feat(ingestion): ScopeExtractor driver — 5-pass CaptureMatch → ParsedFile (abhigyanpatwari#919, RFC abhigyanpatwari#909 Ring 2 PKG)

Kicks off Ring 2 PKG. Implements RFC §5.3 + §3.2 Phase 1: the central,
source-agnostic driver that turns a language provider's `CaptureMatch[]`
into a `ParsedFile` — the per-file artifact the finalize orchestrator
(abhigyanpatwari#921) feeds into the shared `finalize()` algorithm (abhigyanpatwari#915).

## Files

### New shared contracts
  - `gitnexus-shared/src/scope-resolution/parsed-file.ts`
    Per-file extraction artifact: scopes, parsedImports, localDefs,
    referenceSites. Structural superset of `FinalizeFile` so the
    finalize orchestrator threads `ParsedFile` through unchanged.
  - `gitnexus-shared/src/scope-resolution/reference-site.ts`
    Pre-resolution usage fact: name, atRange, inScope, kind, optional
    callForm/explicitReceiver/arity. Converted to `Reference` records
    by the resolution phase (populates `ReferenceIndex`).

### Ring 1 collateral tweak
  - `language-provider.ts: emitScopeCaptures` now returns
    `Promise<readonly CaptureMatch[]>` (was `readonly Capture[]`).
    Pre-grouping per tree-sitter match is the provider's job — the
    extractor expects coherent matches, not flat captures. No
    consumers yet (all languages still on legacy DAG), so no breakage.
    Docstring updated.

### New CLI module
  - `gitnexus/src/core/ingestion/scope-extractor.ts`
    Single entry point: `extract(matches, filePath, provider): ParsedFile`.
    Five-pass pipeline:

      Pass 1 — Build scope tree. `@scope.*` → `ScopeDraft[]` via
        range-containment parent derivation. Honors
        `provider.shouldCreateScope` (skip-but-reparent-children) and
        `provider.resolveScopeKind`. Throws `ScopeTreeInvariantError`
        via `buildScopeTree` on malformed input.

      Pass 2 — Attach declarations + local bindings. `@declaration.*`
        → `SymbolDefinition` + `BindingRef { origin: 'local' }`.
        Default attachment: innermost containing scope. Hoisting via
        `provider.bindingScopeFor`.

      Pass 3 — Collect raw imports. `@import.*` → `ParsedImport` via
        `provider.interpretImport`. Attached to ParsedFile
        (finalize resolves owning scope in Phase 2).

      Pass 4 — Collect type bindings. `@type-binding.*` →
        `TypeRef` via `provider.interpretTypeBinding` →
        `scope.typeBindings`. Hoistable via `bindingScopeFor`.

      Pass 5 — Collect reference sites. `@reference.*` →
        `ReferenceSite[]`. Call form from declarative sub-tag
        (`@reference.call.member`) or `provider.classifyCallForm`.
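
A minimal sketch of the range-containment parent derivation from Pass 1, including one way skip-but-reparent could fall out of it. The `RawScope` and `ScopeDraft` shapes and the `deriveParents` function are hypothetical simplifications (offsets instead of row/column ranges, a predicate instead of the real `shouldCreateScope` hook); the actual driver may differ.

```typescript
interface Range { start: number; end: number }
interface RawScope { id: string; range: Range }
interface ScopeDraft extends RawScope { parentId: string | undefined }

const size = (r: Range): number => r.end - r.start;

// Strict containment: identical ranges never parent each other.
const strictlyContains = (outer: Range, inner: Range): boolean =>
  outer.start <= inner.start && inner.end <= outer.end && size(outer) > size(inner);

// The parent of a scope is the smallest kept scope that strictly contains it.
// Dropping skipped scopes before the search reparents their children onto
// the nearest surviving ancestor.
function deriveParents(
  raw: readonly RawScope[],
  shouldCreate: (scope: RawScope) => boolean,
): ScopeDraft[] {
  const kept = raw.filter(shouldCreate);
  return kept.map((child) => {
    let parent: RawScope | undefined;
    for (const candidate of kept) {
      if (!strictlyContains(candidate.range, child.range)) continue;
      if (parent === undefined || size(candidate.range) < size(parent.range)) {
        parent = candidate;
      }
    }
    return { ...child, parentId: parent?.id };
  });
}

const drafts = deriveParents(
  [
    { id: 'module', range: { start: 0, end: 100 } },
    { id: 'class', range: { start: 10, end: 90 } },
    { id: 'block', range: { start: 20, end: 80 } }, // skipped below
    { id: 'fn', range: { start: 30, end: 40 } },
  ],
  (scope) => scope.id !== 'block',
);

const parentOf = (id: string): string | undefined =>
  drafts.find((d) => d.id === id)?.parentId;
```

Here `fn` sits inside the skipped `block`, so its derived parent is `class`. The quadratic search keeps the sketch short; a sorted sweep with a stack would be the usual choice for large files.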

### Tests
  - `gitnexus/test/unit/scope-resolution/scope-extractor.test.ts`
    23 tests organized by pass + one end-to-end fixture exercising
    all 5 passes together. MockProvider emits synthetic
    `CaptureMatch[]` with no AST — extractor is pure given those.

## Design notes

- **Source-agnostic.** No `Tree` / `SyntaxNode` types leak into the
  driver. Works for tree-sitter providers and COBOL's regex tagger.
- **One AST walk per language.** Providers do the walk inside
  `emitScopeCaptures`; this driver does zero traversal.
- **Invariants delegated.** `ScopeTree.buildScopeTree` enforces
  structural rules (non-Module has parent, parent contains child,
  siblings don't overlap). The extractor doesn't try to repair
  malformed captures.
- **Sub-tag whitelist.** `@reference.receiver`, `@declaration.name`,
  `@import.source`, etc. are known sub-tags — excluded from anchor
  selection so the broadest-range heuristic doesn't mis-identify them
  as anchors for their topic. Bug surfaced in the end-to-end fixture
  test (member call with a large-range receiver) and was fixed before
  commit.
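
The three structural rules named under "Invariants delegated" can be checked along these lines. This is only a sketch: `ScopeNode` and `findViolations` are hypothetical, and it collects messages in an array where the real `buildScopeTree` is described as throwing `ScopeTreeInvariantError`.

```typescript
interface Range { start: number; end: number }

interface ScopeNode {
  id: string;
  kind: 'Module' | 'Class' | 'Function' | 'Block';
  range: Range;
  parentId?: string;
}

// Returns one message per violated rule instance; an empty array means
// the tree satisfies all three rules.
function findViolations(scopes: readonly ScopeNode[]): string[] {
  const problems: string[] = [];
  const byId = new Map<string, ScopeNode>();
  const childrenOf = new Map<string, ScopeNode[]>();
  scopes.forEach((s) => byId.set(s.id, s));

  for (const s of scopes) {
    const parent = s.parentId === undefined ? undefined : byId.get(s.parentId);
    if (parent === undefined) {
      // Rule 1: every non-Module scope has a parent.
      if (s.kind !== 'Module') problems.push(`${s.id}: non-Module scope has no parent`);
      continue;
    }
    // Rule 2: the parent's range contains the child's range.
    if (parent.range.start > s.range.start || s.range.end > parent.range.end) {
      problems.push(`${s.id}: not contained in parent ${parent.id}`);
    }
    const siblings = childrenOf.get(parent.id) ?? [];
    siblings.push(s);
    childrenOf.set(parent.id, siblings);
  }

  // Rule 3: siblings do not overlap (checked pairwise after sorting by start).
  childrenOf.forEach((siblings, parentId) => {
    const sorted = [...siblings].sort((a, b) => a.range.start - b.range.start);
    let previous: ScopeNode | undefined;
    for (const current of sorted) {
      if (previous !== undefined && current.range.start < previous.range.end) {
        problems.push(`${previous.id} and ${current.id}: siblings overlap under ${parentId}`);
      }
      previous = current;
    }
  });

  return problems;
}

const wellFormed: ScopeNode[] = [
  { id: 'm', kind: 'Module', range: { start: 0, end: 100 } },
  { id: 'a', kind: 'Function', range: { start: 10, end: 40 }, parentId: 'm' },
  { id: 'b', kind: 'Function', range: { start: 50, end: 90 }, parentId: 'm' },
];

const malformed: ScopeNode[] = [
  { id: 'm', kind: 'Module', range: { start: 0, end: 100 } },
  { id: 'a', kind: 'Function', range: { start: 10, end: 40 }, parentId: 'm' },
  { id: 'b', kind: 'Function', range: { start: 30, end: 90 }, parentId: 'm' },
];

const okProblems = findViolations(wellFormed);
const badProblems = findViolations(malformed);
```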

## Verification

  - `tsc --noEmit` clean (both `gitnexus-shared` and `gitnexus`)
  - `gitnexus-shared` build clean
  - 23/23 new tests pass
  - Full scope-resolution / model / shadow suite: **285/285 pass**

## Closes part of abhigyanpatwari#909. Unblocks
  - abhigyanpatwari#920 parse-worker integration (emit ParsedFile from the worker)
  - abhigyanpatwari#921 finalize orchestrator (consume ParsedFile[] workspace-wide)
  - abhigyanpatwari#922 per-language import adapters

* chore(ingestion): address abhigyanpatwari#919 review findings on the extractor

Development

Successfully merging this pull request may close these issues.

RING2-PKG-1: ScopeExtractor driver (tree-sitter + provider hooks → shared ScopeTree)
