Skip to content

feat(group): extractor expansion + manifest extractor (2/4 of #606 split)#796

Merged
magyargergo merged 10 commits into
abhigyanpatwari:mainfrom
ivkond:feat/group-extractors
Apr 13, 2026
Merged

feat(group): extractor expansion + manifest extractor (2/4 of #606 split)#796
magyargergo merged 10 commits into
abhigyanpatwari:mainfrom
ivkond:feat/group-extractors

Conversation

@ivkond

@ivkond ivkond commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Part 2 of 4 in the split of #606 per @magyargergo's request. Follows #795 (already merged) but has no code-level dependency on it — both PRs could have been reviewed in parallel.

Closes #792.

What changed

Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups.

New (228 LOC prod + 308 LOC tests):

  • `extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml` cross-links. HTTP paths canonicalized; gRPC resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::::` uid when unresolved, so cross-impact still has a stable anchor.

Modified (+794 LOC prod + 956 LOC tests):

  • `extractors/grpc-extractor.ts` (+522) — `.proto` parser with string/comment-aware brace counting; canonical service/method IDs; Go / Java / Python / TS server + client detection
  • `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib, NestJS decorators, Python FastAPI, Spring, Java consumers (restTemplate / WebClient / OkHttp)
  • `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection, kafka-go Writer/Reader, Python NATS, JetStream

Self-review fixes folded in

Carried forward from the original #606 self-review on `d15b8cb`:

  • HIGH — `manifest-extractor.resolveSymbol` was too fuzzy. `CONTAINS` match + unconditional `.proto`-filename fallback meant manifest "/orders" matched "/suborders" and gRPC entries returned a random proto symbol from any repo with any `.proto` file. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback. Regression tests included.
  • MED — proto parser brace-depth now sanitizes strings and comments first. A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal. Regression tests for string and comment variants.
  • MED — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal). Regression test with a for-loop producing multiple messages.
  • MED — `resolveSymbol` no longer has a silent `catch { /* fall through */ }`; logs the error via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path.

How to verify

```bash
cd gitnexus
npx tsc --noEmit
npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks
npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks
npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks
npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks
```

Local: typecheck clean; 99/99 extractor tests pass (grpc 43, http 18, topic 30, manifest 8).

Risk / rollback

Low. Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` consumed by `sync.ts` in the next split (#793). No existing behavior changes for users who don't run `group sync`. Rollback is a clean `git revert`; the three modified extractors revert to their pre-PR versions which are subsets of the new functionality.

Scope discipline (per `GUARDRAILS.md`)

Dependencies

🤖 Co-authored with Claude Code — findings and fixes written by me, Claude ran the test/edit loop. I reviewed and signed off on every change before pushing.

@vercel

vercel Bot commented Apr 11, 2026

Copy link
Copy Markdown

@ivkond is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
6194 6097 0 97 226s

✅ All 6097 tests passed

97 test(s) skipped — expand for details
  • Swift MethodExtractor > isTypeDeclaration > recognizes class_declaration
  • Swift MethodExtractor > isTypeDeclaration > recognizes protocol_declaration
  • Swift MethodExtractor > isTypeDeclaration > rejects import_declaration
  • Swift MethodExtractor > visibility > extracts public method
  • Swift MethodExtractor > visibility > extracts private method
  • Swift MethodExtractor > visibility > defaults to internal when no modifier
  • Swift MethodExtractor > protocol methods > marks protocol method as abstract
  • Swift MethodExtractor > static and class methods > detects static func as isStatic
  • Swift MethodExtractor > static and class methods > detects class func as isStatic
  • Swift MethodExtractor > parameters > extracts parameters with types and default values
  • Swift MethodExtractor > return type > extracts return type from -> annotation
  • Swift MethodExtractor > annotations > extracts @objc attribute
  • Swift MethodExtractor > isFinal > detects final func
  • Swift MethodExtractor > isFinal > is false when not final
  • Swift MethodExtractor > isAsync > detects async func
  • Swift MethodExtractor > isOverride > detects override method
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift explicit init inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference does not bind plain functions
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
  • Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
  • Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
  • Swift self resolution > detects User and Repo classes, each with a save function
  • Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
  • Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
  • Swift parent resolution > emits EXTENDS edge: User → BaseModel
  • Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
  • Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
  • Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
  • Swift return type inference > detects User class and getUser function
  • Swift return type inference > detects save function on User (Swift class methods are Function nodes)
  • Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
  • Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
  • Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
  • Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()
  • Swift implicit imports (cross-file visibility) > detects UserService class in Models.swift
  • Swift implicit imports (cross-file visibility) > resolves UserService() constructor call across files (no explicit import)
  • Swift implicit imports (cross-file visibility) > resolves service.fetchUser() member call across files
  • Swift implicit imports (cross-file visibility) > creates IMPORTS edges between files in the same module
  • Swift extension deduplication > detects Product class
  • Swift extension deduplication > resolves Product() constructor despite extension creating duplicate class node
  • Swift extension deduplication > resolves product.save() to Product.swift (primary definition)
  • Swift constructor call fallback (no new keyword) > resolves OCRService() as constructor call across files
  • Swift constructor call fallback (no new keyword) > resolves ocr.recognize() member call via constructor-inferred type
  • Swift export visibility (internal vs private) > resolves PublicService() constructor across files
  • Swift export visibility (internal vs private) > resolves internalHelper() across files (internal = module-scoped)
  • Swift if let / guard let binding resolution > detects User and Repo classes
  • Swift if let / guard let binding resolution > resolves user.save() inside if-let to User#save
  • Swift if let / guard let binding resolution > resolves repo.save() inside guard-let to Repo#save
  • Swift if let / guard let binding resolution > user.save() in if-let does NOT resolve to Repo#save
  • Swift await / try expression unwrapping > resolves user.save() via await fetchUser() return type
  • Swift await / try expression unwrapping > resolves repo.save() via try parseRepo() return type
  • Swift await / try expression unwrapping > detects fetchUser and parseRepo as functions
  • Swift for-in loop element type inference > detects User and Repo classes
  • Swift for-in loop element type inference > creates implicit import edges between files
  • Swift field-type resolution > detects classes and their properties
  • Swift field-type resolution > emits HAS_PROPERTY edges from class to field
  • Swift field-type resolution > resolves field-chain call user.address.save() → Address#save
  • Swift field-type resolution > emits ACCESSES edges for field reads in chains
  • Swift field-type resolution > populates field metadata (visibility, declaredType) on Property nodes
  • Swift call-result binding > resolves call-result-bound method call user.save() → User#save
  • Swift call-result binding > getUser() is present as a defined function
  • Swift call-result binding > emits processUser -> getUser CALLS edge for let-assigned free function call
  • Swift method enrichment > detects Animal protocol and Dog class
  • Swift method enrichment > emits IMPLEMENTS edge Dog -> Animal
  • Swift method enrichment > emits HAS_METHOD edges for Dog methods
  • Swift method enrichment > marks protocol Animal.speak as isAbstract
  • Swift method enrichment > marks Dog.speak as NOT isAbstract
  • Swift method enrichment > marks breathe as isFinal
  • Swift method enrichment > marks classify as isStatic
  • Swift method enrichment > captures @objc annotation on breathe
  • Swift method enrichment > populates parameterTypes for classify(_ name: String)
  • Swift method enrichment > records parameterCount for classify
  • Swift method enrichment > records returnType for speak
  • Swift method enrichment > resolves dog.speak() CALLS edge
  • Swift method enrichment > resolves Dog.classify("dog") CALLS edge
  • Swift abstract dispatch > detects Repository protocol and SqlRepository class
  • Swift abstract dispatch > emits IMPLEMENTS edge SqlRepository -> Repository
  • Swift abstract dispatch > emits HAS_METHOD edges for Repository.find and Repository.save
  • Swift abstract dispatch > emits HAS_METHOD edges for SqlRepository.find and SqlRepository.save
  • Swift abstract dispatch > marks base Repository.find as isAbstract
  • Swift abstract dispatch > marks base Repository.save as isAbstract
  • Swift abstract dispatch > marks concrete SqlRepository.find as NOT isAbstract
  • Swift abstract dispatch > resolves repo.find(id: 42) CALLS edge
  • Swift abstract dispatch > resolves repo.save(entity: user) CALLS edge
  • Swift abstract dispatch > populates parameterTypes for Repository.find
  • Swift abstract dispatch > populates parameterTypes for Repository.save
  • Swift abstract dispatch > records returnType for SqlRepository.find
  • Swift abstract dispatch > emits METHOD_IMPLEMENTS edges from SqlRepository methods → Repository protocol methods
  • Swift overloaded method disambiguation > detects 2 distinct find Method nodes on SqlRepository
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edges for both find overloads
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edge for save
  • Swift overloaded method disambiguation > emits exactly 3 METHOD_IMPLEMENTS edges total
  • Swift Child extends Parent — inherited method resolution (SM-9) > detects Parent and Child classes
  • Swift Child extends Parent — inherited method resolution (SM-9) > resolves c.parentMethod() to Parent.parentMethod via first-wins MRO walk

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 72.56% 16509/22750 72.44% 📈 +0.1 🟢 ██████████████░░░░░░
Branches 61.49% 10576/17197 61.4% 📈 +0.1 🟢 ████████████░░░░░░░░
Functions 77.47% 1531/1976 77.16% 📈 +0.3 🟢 ███████████████░░░░░
Lines 74.99% 14988/19986 74.59% 📈 +0.4 🟢 ██████████████░░░░░░

📋 View full run · Generated by CI

@magyargergo

Copy link
Copy Markdown
Collaborator

@claude is this pr ready for production? Is this following clean code principles? Are there any performance bottlenecks? Is the code structured in a way that will unlock easy contributions in the future?

@github-actions

github-actions Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Claude encountered an error —— View job


PR Review

  • Gather context
  • Read production source files
  • Analyze: production readiness, clean code, performance, extensibility
  • Post findings

Verdict: Approve with notes

CI is green, 99/99 extractor tests pass, self-review fixes are solid, and rollback risk is low. The code is mergeable. That said, there are a few things worth flagging now before the codebase grows further.


Production Readiness ✅

The PR is production-ready given its current scope. Extractors produce ExtractedContract[] that isn't yet consumed by a user-facing surface, the synthetic-uid fallback ensures cross-impact stability for unresolved manifest entries, and the console.warn logging provides enough signal for debugging without breaking on errors.

The self-review fixes carried in from #606 are exactly the right things to have caught: the CONTAINS-fuzzy match false positive was a latent correctness bug and the sarama regex fix changes semantics meaningfully in loop-heavy producer code.


Clean Code Issues

1. readSafe is copy-pasted three timesgrpc-extractor.ts:7-17, http-route-extractor.ts:48-58, topic-extractor.ts:12-22. The path-traversal guard (the relToBase.startsWith('..') check) is security-sensitive logic. If it ever needs a fix, the next contributor has to know to update three files. This should live in a shared fs-utils.ts.

Fix this →

2. dedupe is duplicated with inconsistent behaviorgrpc-extractor.ts:715-729 keeps the highest-confidence entry (useful!), but http-route-extractor.ts:248-258 and topic-extractor.ts:364-374 both just take the first seen. Inconsistent behavior for the same operation makes it hard for contributors to know what to copy when adding a new extractor.

3. mapNames is defined twice in http-route-extractor.ts — once in pickJavaHandlerName (line 66) and again in inferMethodFromFileScan (line 211). These are identical Record<string, string> objects; one module-level constant would suffice.

4. ManifestExtractor is a stateless class — it holds no instance fields, so class ManifestExtractor is just a namespace. This is a common antipattern in functional TypeScript: plain exported functions would be clearer and easier to test in isolation, since callers don't need to new anything.

5. Magic confidence numbers — values like 0.85, 0.8, 0.75, 0.65, 0.55 are scattered across grpc-extractor.ts without named constants. A const CONFIDENCE = { PROTO_PARSE: 0.85, GO_REGISTER_WITH_PROTO: 0.8, ... } block would make the scoring rationale legible and consistent for future language additions.


Performance Bottlenecks

1. Double glob + double parse for .proto filesGrpcExtractor.extract() calls buildProtoContext(repoPath) (which does glob('**/*.proto', ...) and reads every file) and then immediately does its own glob('**/*.proto', ...) again to call parseProtoFile on each one (grpc-extractor.ts:341-357). That's two full filesystem traversals and two full parses of every proto file in the repo. The fix is straightforward: buildProtoContext should return provider contracts alongside the service map, or extract() should reuse the contents map that buildProtoContext already built internally.

Fix this →

2. Regex objects recreated per file in TopicExtractor.scanFiletopic-extractor.ts:333 does new RegExp(pattern.regex.source, pattern.regex.flags) for all 13 patterns on every single file. This is needed to reset lastIndex, but allocating 13 new RegExp objects per file is wasteful. The standard fix is to reset re.lastIndex = 0 before each file instead of constructing a new object.

3. Proto file reads are serialbuildProtoContext reads files in a for loop with fs.readFileSync. For repos with many proto files, switching to Promise.all over async reads would help. Same applies to source-file scanning in all three extractors.

4. longestSharedSegmentRun is O(n²) in segment count — called on every proto service candidate lookup (resolveProtoConflict). For repos with deep directory structures and many services of the same name this could be slow. In practice it's fine for most repos, but worth documenting the complexity assumption (e.g. // O(n²) in path depth — acceptable for typical repo structures).


Extensibility

The overall structure unlocks contributions well. The PatternDef[] approach in topic-extractor.ts is genuinely excellent — adding a new broker or language is literally just appending to a constant array. The grpc language-specific scan* methods are cleanly separated and easy to extend.

Two things that could trip up future contributors:

1. The manifest resolver's topic/lib queries are label-unawaremanifest-extractor.ts:144-149 and 183-188 both run MATCH (n) WHERE n.name = $contract with no label filter. Any node in the graph named "payments" would match a topic called "payments", including a Java class, a file, or a variable. Adding WHERE n:Topic (or whatever label the indexer uses) would make this robust. Same concern applies to the gRPC method lookup at line 160: n.name = 'Create' would match any graph node named Create, not necessarily a gRPC method.

Fix this →

2. buildContractId in manifest-extractor.ts has no default case — if ContractType gains a new member in the future, TypeScript will not produce an error here (since the switch exhausts all current variants but has no explicit exhaustiveness guard). Adding a default: link.type satisfies never assertion would make future extension safe.


Minor Nits

  • _role parameter in inferMethodFromFileScan (http-route-extractor.ts:207) is unused and could be removed.
  • scanPythonConsumers in grpc-extractor.ts:617 filters ['Mock', 'Test', 'Fake', 'Stub'].includes(name)TestStub, FakeStub, etc. would slip through. Low risk but worth noting for future expansion.
  • The NestJS regex @(Get|Post|...)\s*\(\s*['"\]?([^'"\`)])['"`]?\s)(line 321) will match unquoted decorator arguments like@get(myPathVar), producing myPathVar` as a literal path. Probably fine for now but fragile.

Summary: Merge-ready. The double proto glob is the most actionable fix before the next split PR lands, since sync.ts (#793) will call these extractors on real repos and the redundant I/O will show up in profiling. The label-unaware Cypher queries in manifest-extractor.ts are the most correctness-relevant issue. The readSafe deduplication is a hygiene item that will get worse with each new extractor added in the remaining splits.

@magyargergo magyargergo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivkond I don't think it's ready to be merged! I think, we should work on this a little more and try to find a better approach. Right now, we are doing regex based lookups, would it be possible to utilise tree-sitter nodes?

@magyargergo

magyargergo commented Apr 11, 2026

Copy link
Copy Markdown
Collaborator

Can we make them language agnostic? The rule of thumb is we dont want in the top level processors/extractors to have langauge dependencies.

@ivkond ivkond force-pushed the feat/group-extractors branch from 115a591 to ea06d11 Compare April 11, 2026 23:21

@magyargergo magyargergo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plerase make sure we use tree-sitters appropriately over regexp 🙏

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivkond

ivkond commented Apr 12, 2026

Copy link
Copy Markdown
Contributor Author

@magyargergo, checked tree-sitter-proto — the grammar peers on tree-sitter ^0.21 while we run 0.25, so the native binary won't load (ABI mismatch). The .proto parser we have is already sanitizing comments + strings and handles nested braces correctly (43 tests cover it including the braces-in-strings and braces-in-comments regressions).

Happy to revisit once the grammar is updated to the 0.25 ABI, or if you'd prefer we fork + rebuild it ourselves — let us know.

In the meantime, all three extractor source-scan paths (topic, http, grpc) have been migrated to the language-agnostic plugin architecture per your earlier feedback — the .proto in-tree parser is the only remaining non-tree-sitter code path and is documented as an explicit exception.

@magyargergo

Copy link
Copy Markdown
Collaborator

I'm just concious about performance and that's why I'm pushing for it. Could we maybe spend today with looking for an alternatice fork that may have sorted this? 🙏

ivkond and others added 7 commits April 12, 2026 13:36
Part 2 of 4 in the split of #606 (ticket: #792). Follows #795
(bridge.lbug storage foundation, already merged), but this PR has no
code-level dependency on #795 — it only imports types and the
ContractExtractor interface that existed on upstream main before
either PR. It could have been reviewed in parallel with #795.

## What changed

Expands the 3 existing contract extractors with substantially more
language/framework coverage, and adds a new `manifest-extractor`
that resolves `group.yaml`-declared cross-links against the per-repo
graph via exact-name lookups.

### New file (228 LOC)

- `gitnexus/src/core/group/extractors/manifest-extractor.ts` —
  exact graph lookup for `group.yaml`-declared cross-links. HTTP
  paths are canonicalized before Route.name matching; gRPC is
  resolved by service/method name (NO `.proto`-filename fallback);
  topic and lib use exact-name match. Falls back to a synthetic
  `manifest::<repo>::<contractId>` uid when the graph has no
  matching symbol, so cross-impact traversal still has a stable
  anchor for the contract.

### Modified extractors (+958 LOC prod)

- `extractors/grpc-extractor.ts` (+522) — `.proto` parser with
  comment and string-literal sanitization (braces inside strings no
  longer truncate service bodies); package/service/method canonical
  IDs; server/client detection across Go (`grpc.NewServer`,
  `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`,
  `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and
  TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`).
- `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib
  `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI
  decorators, Java Spring `@RequestMapping`/`@GetMapping`,
  restTemplate / WebClient / OkHttp consumers.
- `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}`
  struct literal detection (replaces a constructor-anchored regex
  that missed topics inside producer loops), kafka-go Writer/Reader,
  Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream
  helpers.

### Modified and new tests (+1264 LOC)

- `grpc-extractor.test.ts` (+539) — full coverage of the new proto
  parser (strings-with-braces regression, comments-with-braces
  regression), per-language server/client detection
- `http-route-extractor.test.ts` (+240) — per-framework route
  extraction + normalization edge cases
- `topic-extractor.test.ts` (+177) — the sarama in-loop regression,
  JetStream, Python NATS, kafka-go Writer/Reader
- `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization,
  gRPC exact lookup with proto-fallback regression, lib and topic
  exact matching, synthetic-uid fallback behavior

### Self-review fixes folded in

Carried forward from the #606 self-review (commit `d15b8cb`):

- **HIGH #1** — `manifest-extractor.resolveSymbol` was too fuzzy.
  Previously used `CONTAINS` on route/name fields plus an
  unconditional `filePath ENDS WITH '.proto'` fallback for gRPC.
  Consequences: `/orders` matched `/suborders`, and any repo with
  any `.proto` file returned a random proto symbol for a gRPC
  manifest entry. Replaced with exact equality + deterministic
  `ORDER BY` + synthetic-uid fallback for unresolved manifests.
  Regression tests included.
- **MED #3** — gRPC proto parser brace-depth counting now sanitizes
  strings and comments first (`stripProtoCommentsAndStrings`). A
  valid proto with `option deprecated_reason = "use NewService {
  instead"` used to have its service body closed early by the `"{"`
  inside the literal, silently dropping methods after the offending
  string. Regression tests for both string-with-brace and
  comment-with-brace cases.
- **MED #4** — sarama Kafka regex changed from
  `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on
  constructor, caught only first topic in a loop) to
  `sarama.ProducerMessage{...Topic:}` (matches every struct literal
  directly). Regression test with a for-loop that constructs
  multiple `ProducerMessage`s.
- **MED #7** — `manifest-extractor.resolveSymbol` no longer has a
  silent `catch { /* fall through */ }`. Errors from the graph
  executor are logged via `console.warn` with link type, contract
  name, repo key, and error message before falling through to the
  synthetic-uid path.

## Why

Reviewer focus here is pure regex / parser correctness — no
storage, no Cypher queries, no algorithmic changes to the cross-link
algorithm. Separating this from the bridge foundation PR (#795)
meant reviewers could stay in a single mental mode (parsing logic)
instead of context-switching between DDL, Cypher, and regex.

## How to verify

- `cd gitnexus && npx tsc --noEmit`
- `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks`

Local pre-push: typecheck clean, all 99 extractor unit tests pass
(grpc 43, http 18, topic 30, manifest 8).

## Risk / rollback

**Low.** Extractors have no user-facing surface in this PR — they
produce `ExtractedContract[]` that is consumed by `sync.ts` in the
next split (#793). No existing behavior changes for users who don't
run a `group sync`. Rollback = `git revert` of the merge commit;
the modifications to `grpc-extractor.ts` / `http-route-extractor.ts`
/ `topic-extractor.ts` revert to the pre-PR versions that still
work (they're subsets of the new functionality).

## Scope discipline (per GUARDRAILS.md)

- Only the 8 files above are touched; no drive-by refactors
- No CI/release/security config changes
- No secrets or machine-specific paths
- Content lifted from #606 (CI 11/11 green on `d15b8cb`)

## Dependencies

- **Base:** `main` (upstream already includes #795 as `1ff324c`)
- **Blocks:** sync pipeline (#793) and the cross-impact feature (#794)
- **Tracker issue:** #792
- **Parent PR:** #606

Co-authored-by: Claude <noreply@anthropic.com>
…eries

Addresses @magyargergo's feedback on #796 that regex-based lookups
should use tree-sitter nodes instead, and that the top-level
extractors must NOT carry language dependencies. This is phase 1 of
a multi-step migration — topic-extractor first because its patterns
are the most uniform (16 "call/annotation with first-arg string
literal" variants), which makes it a clean proof of the approach
before grpc-extractor and http-route-extractor get the same treatment.

## Architecture: language-agnostic orchestrator + per-language plugins

The top-level extractor is a thin orchestrator that never imports a
tree-sitter grammar or a query string. Per-language knowledge lives
in a new `topic-patterns/` folder with one file per language plus a
registry that maps file extensions to compiled plugins:

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared, language-agnostic scanning utilities
├── topic-extractor.ts              # thin orchestrator (no grammar imports)
└── topic-patterns/
    ├── types.ts                    # TopicMeta, Broker
    ├── index.ts                    # registry: extension → compiled provider
    ├── java.ts                     # tree-sitter-java + JAVA_TOPIC_PROVIDER
    ├── go.ts                       # tree-sitter-go + GO_TOPIC_PROVIDER
    ├── python.ts                   # tree-sitter-python + PYTHON_TOPIC_PROVIDER
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript
                                    # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER
```

**Shared scanner (`tree-sitter-scanner.ts`)** — defines
`PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>`
and the `scanFile(parser, plugin, content)` helper. Plugins compile their
queries eagerly at module load via `compilePatterns()`, so a broken
pattern fails loudly at import time instead of silently at scan time.
`unquoteLiteral()` handles single/double/template quotes, Python
triple-quoted strings, and Go raw backtick strings.

**Per-language plugins** own:
- the tree-sitter grammar import (this is the ONLY place in
  `src/core/group/` where tree-sitter grammars are imported),
- the query S-expressions,
- the `TopicMeta` payload (role, broker, confidence, symbolName) that
  the orchestrator receives back on every match.

Each plugin uses a `@value` capture name to bind the topic literal node.
The JavaScript and TypeScript grammars share AST node names for every
construct we query, so `node.ts` defines the pattern sources once and
compiles them against `JavaScript`, `TypeScript.typescript`, and
`TypeScript.tsx` — exporting three providers because `Parser.Query`
objects are NOT portable across grammar instances.

**Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider,
`.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX.
Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single
file-level edit (drop `topic-patterns/<lang>.ts`, import + register it
here — zero edits required in `topic-extractor.ts`).

**Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or
query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser,
provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`.
Reuses one `Parser` instance across files; the scanner calls
`setLanguage` per plugin.

## Why this is better than regex

1. **Comments and strings are respected for free.** The old regex
   would match `// kafkaTemplate.send("fake.topic")` as a real
   producer; tree-sitter never visits comments or string literals as
   code nodes, so false positives from commented-out code are
   eliminated.
2. **Struct/object literal patterns are structural, not textual.**
   `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char
   lookahead (which was a known cross-match bug partly mitigated by a
   loop regression test in the self-review). The new query matches a
   specific `composite_literal` with a specific `qualified_type` and
   `keyed_element` — exactly one struct literal per match.
3. **No order-of-operations fragility.** Regex for
   `channel.publish` vs `channel.consume` was independent and
   file-wide; the AST scopes matches to the specific `call_expression`.
4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic
   detection later means dropping one file in `topic-patterns/` — no
   changes to shared scanner or orchestrator, and no tree-sitter
   imports leak into top-level code.

## Per-file fault tolerance

- Malformed files that tree-sitter can't parse are silently skipped
  (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline
  already logs unparseable files at index time.
- A syntactically invalid query is caught at `compilePatterns` time,
  not scan time — broken plugins fail loudly at import.
- Per-pattern `matches()` failures are swallowed so one broken query
  in a plugin doesn't block the rest.

## Tests

All 30 existing `topic-extractor.test.ts` tests pass **without any
changes to the test file** — they were written as input/output contract
tests (given this source file, expect these `ExtractedContract` objects)
and that contract is unchanged. Regression coverage includes:

- Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node
  `producer.send` + `consumer.subscribe`; Go sarama producer/consumer
  (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` +
  `producer.send/produce`
- RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`;
  Node `channel.consume/publish/sendToQueue`; Python `basic_consume/
  basic_publish` with keyword args
- NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream
  `js.Subscribe/Publish`; Python `await nc.subscribe/publish`

Including the regression test for the sarama `ProducerMessage`
in-loop case — the AST-based query captures every literal in the
file independently, not just the first one after `NewSyncProducer`.

## Neighbor regression check

- `topic-extractor.test.ts` — 30/30 pass (rewritten extractor)
- `http-route-extractor.test.ts` — 18/18 pass (untouched)
- `grpc-extractor.test.ts` — 43/43 pass (untouched)
- `manifest-extractor.test.ts` — 8/8 pass (untouched)
- Full `npx tsc --noEmit` clean

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched; no
  changes to other extractors, tests, MCP surface, or pipeline.ts.
- No CI/release/security config changes, no secrets.
- New tree-sitter imports all reference grammars that are already
  installed as dependencies (`tree-sitter`, `tree-sitter-javascript`,
  `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`,
  `tree-sitter-go` — all in `package.json` for the existing pipeline).

## Phase 2 / phase 3 plan

- **Phase 2 (next commit):** rewrite `http-route-extractor.ts`
  Strategy B (regex fallback) on the same plugin pattern. Graph-assisted
  Strategy A stays as-is (already uses pipeline-built tree-sitter data
  via `HANDLES_ROUTE` Cypher queries).
- **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java /
  Go / Python / TypeScript detection. `.proto` files are the one
  outstanding question — there is no `tree-sitter-proto` grammar
  installed; the in-tree string-sanitizing parser stays as a pragmatic
  exception with a comment, alternative being to add
  `tree-sitter-proto` as a dep (open for the maintainer).

Co-authored-by: Claude <noreply@anthropic.com>
…er plugins

Phase 2 of the extractor refactor requested by @magyargergo on #796.
Same architecture as the phase 1 topic-extractor rewrite: a thin,
language-agnostic orchestrator plus per-language plugins that own
tree-sitter grammars and query sources. The top-level extractor file
no longer imports any tree-sitter grammar or query string.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts          # shared, language-agnostic primitives
├── http-route-extractor.ts         # thin orchestrator (no grammar imports)
└── http-patterns/
    ├── types.ts                    # HttpDetection, HttpLanguagePlugin, HttpRole
    ├── index.ts                    # registry: ext → plugin + HTTP_SCAN_GLOB
    ├── java.ts                     # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp
    ├── go.ts                       # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers
    ├── python.ts                   # tree-sitter-python: FastAPI + requests
    ├── php.ts                      # tree-sitter-php: Laravel Route::get/...
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   NestJS controllers, Express, fetch, axios
```

**Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1:
- `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named
  capture the query binds, not just a single `@value`). Topic extractor
  updated to read `match.captures.value` accordingly.
- New `runCompiledPatterns(plugin, tree)` helper lets plugins run
  multiple query bundles against the same pre-parsed tree. This is
  needed for HTTP plugins that combine a class-prefix query with a
  method-route query (Spring, NestJS).
- `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`.

**HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a
`scan(tree)` function rather than a flat pattern list. This reflects
HTTP's more complex extraction: each detection needs method + path +
handler name, and framework patterns like Spring `@RequestMapping` /
NestJS `@Controller` require cross-referencing a class-level prefix
with method-level annotations. Plugins internally use
`compilePatterns` + `runCompiledPatterns` and walk the AST to resolve
the class/method relationships.

**Per-framework coverage:**

- **Java (`java.ts`)**
  - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put|
    Delete|Patch)Mapping("/sub")` method routes, joined via the
    enclosing `class_declaration` node id.
  - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` →
    method derived from API name.
  - `WebClient.method(HttpMethod.X, "/path")` → method from
    `HttpMethod.X` capture.
  - `new Request.Builder().url("/path")` → OkHttp consumer.

- **Go (`go.ts`)**
  - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures
    upper-case verb + handler identifier.
  - `net/http.HandleFunc("/path", handler)` → provider (default GET).
  - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`,
    resty `client.R().Get/Post/...`.

- **Python (`python.ts`)**
  - `@app.get("/path")` FastAPI decorators.
  - `requests.get/post/...` and `requests.request("METHOD", "url")`.

- **PHP (`php.ts`)**
  - Laravel `Route::get/post/.../patch('/path', ...)` via
    `scoped_call_expression`. Uses `PHP.php_only` to match the
    existing ingestion pipeline's grammar selection.

- **Node (`node.ts`) — JS + TS + TSX**
  - Pattern sources defined once, compiled against three grammar
    variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`)
    because `Parser.Query` objects are not portable across grammars.
    Exports three plugins sharing the same `scan` logic.
  - NestJS: `@Controller('prefix')` decorators are siblings of the
    class in `export_statement` / `program`; `@Get(':id')` decorators
    are siblings of the method in `class_body`. The plugin walks
    decorator → next named sibling to find the decorated class /
    method, then combines the class prefix with the method path.
    Only emits NestJS detections when the enclosing class has a real
    `@Controller` decorator — prevents false positives from generic
    classes that happen to use `@Get` from another library.
  - Express: `(router|app).<verb>('/path', ...)`.
  - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })`
    (uses two queries + a SyntaxNode-id dedupe set so URL literals
    aren't double-emitted by the options variant).
  - `axios.get/post/...`.

## Orchestrator changes

`http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers`
regex method and replaces them with a single source-scan loop that
delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator
still owns:

- **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`)
  — language-agnostic string processing shared by both strategies.
- **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS`
  Cypher queries) — unchanged in spirit. The only regex helpers it
  used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now
  replaced by a lookup against the plugin's detections for the same
  file: for each route row, find the detection whose normalized path
  matches, and pull the HTTP method + handler name from it.
- **Per-file parse cache** — the orchestrator parses each relevant
  file at most once per `extract()` call. Both the graph-assisted
  enrichment loop and the source-scan fallback share the same
  `cachedDetections` map, so we never run the plugin twice for the
  same file.

## Why this is better than the regex version

1. **Comments and strings for free.** The old regex would match
   `// router.get('/fake')` as a real Express route; tree-sitter
   never visits string/comment nodes.
2. **Structural controller-prefix.** Spring and NestJS class-prefix
   joining is now scoped to the enclosing class via `class_declaration`
   node ids, eliminating file-wide state that broke when a file had
   multiple controllers.
3. **Precise NestJS disambiguation.** The plugin only emits a NestJS
   detection when the enclosing class has a real `@Controller`
   decorator — the old regex would fire on any `@Get(...)` in the
   file regardless of surrounding context.
4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP
   detection later means dropping one file in `http-patterns/` — no
   changes to the shared scanner, the orchestrator, or the Strategy A
   Cypher queries.

## Tests

- `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged;
  they're contract-style input/output tests and the contract shape is
  unchanged). Covers Spring class prefix, Express, gin/echo, stdlib
  HandleFunc, NestJS, Laravel, FastAPI for providers and
  fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/
  resty for consumers, plus graph-first Strategy A for both.
- `topic-extractor.test.ts` — **30/30 pass** after the `captures.value`
  API migration.
- `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3).
- `manifest-extractor.test.ts` — 8/8 pass (untouched).
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No changes to pipeline.ts, MCP surface, ingestion, or tests.
- No CI / release / security / secrets changes.
- Tree-sitter grammars imported by plugins (`tree-sitter-java`,
  `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`,
  `tree-sitter-javascript`, `tree-sitter-typescript`) are all already
  in `package.json` for the existing ingestion pipeline.

## Phase 3 plan

- **grpc-extractor** gets the same treatment: plugin-per-language under
  `grpc-patterns/` for Java / Go / Python / TS detection. `.proto`
  files remain an open question — no `tree-sitter-proto` grammar is
  installed, so the in-tree string-sanitizing parser from PR #796's
  self-review stays as a pragmatic exception unless the maintainer
  wants us to add `tree-sitter-proto` as a new dep.

Co-authored-by: Claude <noreply@anthropic.com>
…lugins

Phase 3 (final) of the extractor refactor requested by @magyargergo on
#796. Same architecture as phase 1 (topic) and phase 2 (http): thin
language-agnostic orchestrator + per-language plugins that own
tree-sitter grammars and query sources. With this commit the top-level
extractors under `src/core/group/extractors/` import ZERO tree-sitter
grammars and ZERO query strings — every grammar import lives in a
`*-patterns/<lang>.ts` plugin file, and the orchestrators go through
the registry indirection.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared primitives (unchanged)
├── grpc-extractor.ts               # orchestrator (only `.proto` parser left)
└── grpc-patterns/
    ├── types.ts                    # GrpcDetection, GrpcLanguagePlugin, GrpcRole
    ├── index.ts                    # registry: ext → plugin + GRPC_SCAN_GLOB
    ├── go.ts                       # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient
    ├── java.ts                     # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub
    ├── python.ts                   # tree-sitter-python: add_XxxServicer_to_server + XxxStub
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   @GrpcMethod, @GrpcClient field type,
                                    #   .getService<X>('Svc'), new XxxServiceClient,
                                    #   loadPackageDefinition dynamic constructors
```

## Per-language coverage

**Go (`go.ts`)**
- Provider: `\w+.RegisterXxxServer(...)` via `call_expression →
  selector_expression → field_identifier` + JS regex filter
  `^Register(\w+)Server$`.
- Provider: `pb.UnimplementedXxxServer` embedded in a struct via
  `struct_type → field_declaration_list → field_declaration →
  qualified_type → type_identifier` + JS filter.
- Consumer: `\w+.NewXxxClient(...)` via the same call_expression
  query + JS filter `^New(\w+)Client$`.

**Java (`java.ts`)**
- Provider: `class X extends YyyGrpc.YyyImplBase` — two queries
  handle the scoped and plain forms. `scoped_type_identifier`'s
  children are positional (no `scope:`/`name:` fields), so the
  query matches the two `type_identifier` children by position.
- `#match? @inner "ImplBase$"` restricts matches at query time.
- Whether the class has `@GrpcService` or not controls only the
  `source` metadata label — the plugin walks the class_declaration's
  `modifiers` child in JS to detect the marker_annotation.
- Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a
  `method_invocation` query with `#match? @method
  "^new(Blocking)?Stub$"`, service name extracted via
  `^(\w+)Grpc$` on the object identifier.

**Python (`python.ts`)**
- Single call-expression query covers both bare identifier and
  `obj.method` attribute forms:
  `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`.
- Plugin filters `@fn.text` against two JS regexes:
  `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$`
  (consumer), with a reserved-names ignore list for the Stub case
  (Mock / Test / Fake / Stub).

**Node — JavaScript + TypeScript + TSX (`node.ts`)**
- Pattern sources defined once, compiled three times (one per grammar)
  because `Parser.Query` objects are not portable across grammars.
  Exports three `GrpcLanguagePlugin`s sharing the same `scan`.
- `@GrpcMethod('Service', 'Method')`: decorator query captures the
  two string literals. Confidence is hard-coded 0.8 regardless of
  proto map resolution (matches the original regex version's
  behaviour).
- `@GrpcClient(...) field: XxxServiceClient`: decorator query
  captures the decorator node, plugin walks up to find the enclosing
  `public_field_definition` (decorators on fields are CHILDREN of
  the field definition in tree-sitter-typescript, not siblings) and
  reads its first `type_annotation → type_identifier`, then runs the
  `^(\w+Service)Client$` JS filter.
- `client.getService<X>('AuthService')`: call-expression query on
  `member_expression.property = "getService"` + string literal arg.
- `new XxxServiceClient(...)`: `new_expression` with a bare
  identifier constructor, filtered by `^(\w+Service)Client$` so
  generic `new AuthClient(...)` (missing the `Service` infix) does
  NOT falsely register as a consumer. Preserves the regression test
  `test_extract_ts_non_service_client_constructor_is_ignored`.
- `loadPackageDefinition` dynamic loader: gated on
  `tree.rootNode.text.includes('loadPackageDefinition')`. When set,
  `new foo.bar.Xxx(...)` qualified constructors with a capitalised
  property name register as consumers.

## Orchestrator changes

`grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders`
/ ... helper and replaces them with a single source-scan loop that:

1. Parses each file with the plugin's grammar (one shared `Parser`
   instance across all files, `setLanguage` called per plugin).
2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`.
3. Converts each detection to an `ExtractedContract` via the private
   `detectionToContract` helper, which:
   - Looks the short service name up in the proto map (filled by
     the `.proto` parser).
   - Picks confidence = `confidenceWithProto` if resolved, else
     `confidenceWithoutProto`.
   - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when
     the detection carries a `methodName` (TS `@GrpcMethod` only),
     otherwise a service-level id (`grpc::pkg.Svc/*`).

Everything else — the `.proto` parser, `buildProtoContext`,
`buildProtoMap`, `resolveProtoConflict`, `serviceContractId`,
`stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe
function — stays exactly as before. The `.proto` parser is kept as a
pragmatic exception to the "no regex in extractors" rule because no
`tree-sitter-proto` grammar is installed in the repo; a comment at the
top of the file explains this and flags the maintainer option of
adding `tree-sitter-proto` as a dependency.

## Why this is better than the regex version

1. **Comments and strings are respected for free.** Matched node types
   are only code constructs, never text inside comments or string
   literals.
2. **No false positives on partial names.** The old `(\w+?)Grpc`-style
   regexes would cross-match unrelated identifiers; structural queries
   restrict matches to the exact AST shape (`scoped_type_identifier →
   type_identifier` pairs, `method_invocation → identifier` etc.).
3. **NestJS `@GrpcClient` is structural, not regex-based.** The old
   regex required a specific textual layout
   (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the
   plugin now walks the AST, so modifier order / optional modifiers /
   multi-line formatting don't break it.
4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC
   detection later is a one-file edit in `grpc-patterns/index.ts` —
   no touches to the shared scanner, the orchestrator, or the proto
   parser.

## Tests

- `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the
  contract shape is identical). Covers .proto parsing (including the
  brace-inside-string regression), Go provider/consumer,
  Java @GrpcService / plain ImplBase provider + newBlockingStub
  consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient +
  .getService + new XxxServiceClient + loadPackageDefinition + the
  `AuthClient` vs `AuthServiceClient` discrimination, dedupe across
  multiple patterns in one file, proto-aware confidence, and the
  inherited-package resolution for split proto definitions.
- `topic-extractor.test.ts` — 30/30 pass.
- `http-route-extractor.test.ts` — 18/18 pass.
- `manifest-extractor.test.ts` — 8/8 pass.
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No pipeline.ts, MCP surface, ingestion, CI / release / security, or
  test changes.
- New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`,
  `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`)
  are all already installed for the ingestion pipeline.

## End of phase series

This commit completes the three-phase extractor refactor:
  - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/`
  - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/`
  - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/`

Every remaining regex-based extractor helper under the `src/core/group/
extractors/` directory is either (a) language-agnostic string
processing (path normalization, dedupe keys) or (b) the `.proto`
parser, which is documented as an explicit exception.

Co-authored-by: Claude <noreply@anthropic.com>
Addresses @magyargergo's suggestion on #796 to replace the manual
string-sanitizing .proto parser with a tree-sitter grammar.

- **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`.
  Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto)
  (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli
  0.24` to produce ABI version 14 — compatible with the project's
  `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as
  `optionalDependency` with `file:./vendor/tree-sitter-proto`.

- **New `grpc-patterns/proto.ts` plugin** — uses the same
  `compilePatterns` + `runCompiledPatterns` infrastructure as every
  other plugin. Two queries:
  - `(package (full_ident) @pkg)` — package declaration
  - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))`
    — one match per (service, rpc) pair

- **Graceful fallback** — `tree-sitter-proto` is an optional
  dependency. If it fails to install (platform incompatibility) or
  fails the runtime smoke-test (`setLanguage` + `parse` on a trivial
  proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator
  uses the existing manual parser. The smoke-test catches the
  `SyntaxNode` TDZ error that occurs in vitest's fork-based test
  runner.

- **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto`
  files are handled by the plugin loop (they're included in
  `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is
  skipped. `buildProtoContext` still runs to build the proto map
  for cross-referencing source-file detections.

1. **No manual comment/string stripping.** The old parser needed
   `stripProtoCommentsAndStrings` (110 lines) to avoid counting
   braces inside comments and string literals. tree-sitter handles
   this natively.
2. **No brace-depth tracking.** `extractServiceBlocks` used a manual
   depth counter to find service boundaries. tree-sitter's AST gives
   us `service` → `service_name` + `rpc` → `rpc_name` directly.
3. **Performance.** tree-sitter's C-based parser is faster than
   character-by-character JS scanning + regex on large proto files.

- `grpc-extractor.test.ts` — **43/43 pass** (unchanged)
- All other extractor tests — 99/99 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>
@ivkond ivkond force-pushed the feat/group-extractors branch from c4e4165 to 935f3f5 Compare April 12, 2026 13:38
@ivkond

ivkond commented Apr 12, 2026

Copy link
Copy Markdown
Contributor Author

@magyargergo, summary of changes on this branch after the review feedback:

New: manifest-extractor for group.yaml cross-link resolution via exact-match Cypher queries.

Refactored: all three extractors (topic, http, grpc) migrated from regex to tree-sitter with a plugin architecture — top-level orchestrators import zero grammars, each *-patterns/<lang>.ts plugin owns its grammar + queries. Adding a new language = one file, zero orchestrator edits.

Added: tree-sitter-proto (vendored, ABI 14) for .proto parsing — replaces the manual comment/string stripper + brace-depth counter. Graceful fallback if native build fails.

No test changes (99/99 pass as-is), no MCP/pipeline/CI changes.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands the group contract extraction layer by moving topic/http/grpc source scanning to tree-sitter-driven, language-plugin registries, adds a new manifest-based extractor for group.yaml cross-links, and vendors an optional tree-sitter-proto grammar to improve .proto parsing when available.

Changes:

  • Introduce shared tree-sitter scanning utilities and plugin registries for topic/http/gRPC extractors.
  • Add ManifestExtractor to convert manifest links into provider/consumer contracts + cross-links with deterministic synthetic uids when unresolved.
  • Add optional tree-sitter-proto (vendored) and extend gRPC extraction with improved .proto parsing/canonicalization and broader client/server detection.

Reviewed changes

Copilot reviewed 30 out of 42 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
gitnexus/vendor/tree-sitter-proto/src/tree_sitter/parser.h Vendored tree-sitter C header for proto grammar build.
gitnexus/vendor/tree-sitter-proto/src/tree_sitter/array.h Vendored tree-sitter array utilities.
gitnexus/vendor/tree-sitter-proto/src/tree_sitter/alloc.h Vendored tree-sitter allocation utilities.
gitnexus/vendor/tree-sitter-proto/package.json Vendored tree-sitter-proto package metadata for local optional dependency.
gitnexus/vendor/tree-sitter-proto/bindings/node/index.js Node binding entrypoint for the vendored proto grammar.
gitnexus/vendor/tree-sitter-proto/bindings/node/index.d.ts Type definitions for the vendored proto grammar module.
gitnexus/vendor/tree-sitter-proto/bindings/node/binding.cc Native addon binding that exposes the proto grammar to tree-sitter.
gitnexus/vendor/tree-sitter-proto/binding.gyp node-gyp build config for compiling the proto grammar addon.
gitnexus/vendor/tree-sitter-proto/.gitignore Ignores build artifacts for vendored grammar.
gitnexus/test/unit/group/topic-extractor.test.ts Adds/updates topic extraction tests for KafkaJS run, JetStream, Python NATS, Sarama producer literals, kafka-go, and Go test-file ignore.
gitnexus/test/unit/group/manifest-extractor.test.ts New unit tests for manifest extraction, exact-match resolution, and synthetic uid fallback behavior.
gitnexus/test/unit/group/http-route-extractor.test.ts Expands HTTP route extraction tests (Go frameworks/stdlib, NestJS, Python requests, Java consumers, Go consumers) and removes old interface-anchoring cases.
gitnexus/test/unit/group/grpc-extractor.test.ts Expands gRPC extraction tests (proto brace sanitization, TS/Node patterns, proto map conflict resolution) and adjusts confidence expectations.
gitnexus/src/core/group/extractors/tree-sitter-scanner.ts New shared utilities to compile/run tree-sitter queries and unquote literals for multiple languages.
gitnexus/src/core/group/extractors/topic-patterns/types.ts New shared types for topic plugin metadata.
gitnexus/src/core/group/extractors/topic-patterns/python.ts Python topic patterns (Kafka, RabbitMQ, NATS) via tree-sitter.
gitnexus/src/core/group/extractors/topic-patterns/node.ts JS/TS/TSX topic patterns (KafkaJS, amqplib, NATS/JetStream) via tree-sitter.
gitnexus/src/core/group/extractors/topic-patterns/java.ts Java topic patterns (Spring Kafka/Rabbit) via tree-sitter.
gitnexus/src/core/group/extractors/topic-patterns/index.ts Plugin registry and glob for topic extraction.
gitnexus/src/core/group/extractors/topic-patterns/go.ts Go topic patterns (Sarama, kafka-go, NATS/JetStream) via tree-sitter.
gitnexus/src/core/group/extractors/topic-extractor.ts Refactors topic extractor to tree-sitter plugin orchestration and literal extraction.
gitnexus/src/core/group/extractors/manifest-extractor.ts New manifest extractor that resolves manifest links against per-repo graphs with exact matches and synthetic uid fallback.
gitnexus/src/core/group/extractors/http-route-extractor.ts Refactors HTTP extractor to use graph-assisted extraction with tree-sitter fallback plugins for providers/consumers.
gitnexus/src/core/group/extractors/http-patterns/types.ts New shared types for HTTP plugin detections and plugin interface.
gitnexus/src/core/group/extractors/http-patterns/python.ts Python HTTP plugin (FastAPI providers, requests consumers).
gitnexus/src/core/group/extractors/http-patterns/php.ts PHP HTTP plugin (Laravel route declarations).
gitnexus/src/core/group/extractors/http-patterns/node.ts Node/TS HTTP plugin (NestJS/Express providers, fetch/axios consumers).
gitnexus/src/core/group/extractors/http-patterns/java.ts Java HTTP plugin (Spring providers, RestTemplate/WebClient/OkHttp consumers).
gitnexus/src/core/group/extractors/http-patterns/index.ts HTTP plugin registry and glob for source-scan fallback.
gitnexus/src/core/group/extractors/http-patterns/go.ts Go HTTP plugin (gin/echo/chi + stdlib providers, stdlib/resty consumers).
gitnexus/src/core/group/extractors/grpc-patterns/types.ts New shared types for gRPC plugin detections and plugin interface.
gitnexus/src/core/group/extractors/grpc-patterns/python.ts Python gRPC plugin (servicer registration + stub construction).
gitnexus/src/core/group/extractors/grpc-patterns/proto.ts Optional tree-sitter-proto plugin for extracting package/service/rpc pairs from .proto.
gitnexus/src/core/group/extractors/grpc-patterns/node.ts Node/TS gRPC plugin (Nest decorators, getService, generated clients, dynamic loader pattern).
gitnexus/src/core/group/extractors/grpc-patterns/java.ts Java gRPC plugin (ImplBase providers + stub factory consumers).
gitnexus/src/core/group/extractors/grpc-patterns/index.ts gRPC plugin registry and glob (optionally includes .proto).
gitnexus/src/core/group/extractors/grpc-patterns/go.ts Go gRPC plugin (Register/Unimplemented server + NewClient patterns).
gitnexus/src/core/group/extractors/grpc-extractor.ts Refactors gRPC extractor to plugin orchestration + improved proto parsing/canonicalization and conflict resolution.
gitnexus/package.json Adds optional dependency on the vendored tree-sitter-proto.
gitnexus/package-lock.json Locks the local-link optional dependency for tree-sitter-proto.
Files not reviewed (1)
  • gitnexus/package-lock.json: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 61 to 66
@@ -31,20 +66,36 @@ export function normalizeHttpPath(p: string): string {
return s;

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalizeHttpPath turns the root path '/' into an empty string because it strips all trailing slashes. This can produce malformed contract IDs like http::GET:: (missing path) for root handlers/fetches. Consider special-casing empty result to '/' (or avoid stripping when the path is exactly /).

Copilot uses AI. Check for mistakes.
Comment on lines +195 to 208
const graphProviders =
dbExecutor != null ? await this.extractProvidersGraph(dbExecutor, getDetections) : [];
const providers =
graphProviders.length > 0
? graphProviders
: this.extractProvidersSourceScan(await this.scanFiles(repoPath), getDetections);

const graphC = dbExecutor != null ? await this.extractConsumersGraph(dbExecutor, repoPath) : [];
const consumers = graphC.length > 0 ? graphC : await this.extractConsumersSourceScan(repoPath);
const graphConsumers =
dbExecutor != null ? await this.extractConsumersGraph(dbExecutor, getDetections) : [];
const consumers =
graphConsumers.length > 0
? graphConsumers
: this.extractConsumersSourceScan(await this.scanFiles(repoPath), getDetections);

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract() calls scanFiles(repoPath) twice (once for providers fallback and once for consumers fallback). This duplicates the glob work and file list allocation. Consider calling scanFiles once, storing the result, and reusing it for both provider and consumer fallback paths.

Copilot uses AI. Check for mistakes.
private async scanFiles(repoPath: string): Promise<string[]> {
return glob(HTTP_SCAN_GLOB, {
cwd: repoPath,
ignore: ['**/node_modules/**', '**/.git/**', '**/dist/**', '**/build/**'],

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scanFiles doesn't exclude **/vendor/** even though other extractors (e.g. topic/grpc) ignore vendored code. Scanning vendored sources can add noise and cost. Consider adding **/vendor/** to the ignore list for consistency and to avoid false positives.

Suggested change
ignore: ['**/node_modules/**', '**/.git/**', '**/dist/**', '**/build/**'],
ignore: ['**/node_modules/**', '**/.git/**', '**/dist/**', '**/build/**', '**/vendor/**'],

Copilot uses AI. Check for mistakes.
Comment on lines +255 to +262
// ─── Consumer: loadPackageDefinition dynamic proto loader ────────
// Only emit when the file uses loadPackageDefinition, otherwise a
// generic `new foo.bar.Something()` in unrelated code would falsely
// register as a gRPC consumer.
const usesLoadPackage = tree.rootNode.text.includes('loadPackageDefinition');
if (usesLoadPackage) {
for (const match of runCompiledPatterns(bundle.newQualifiedCtor, tree)) {
const ctorNode = match.captures.ctor;

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using tree.rootNode.text.includes('loadPackageDefinition') forces materializing the entire file contents from the parse tree, which can be expensive for large files. Prefer checking the original source text before parsing, or add a dedicated query/capture for loadPackageDefinition so the scan stays structural without allocating the full text.

Copilot uses AI. Check for mistakes.
Comment on lines +124 to +126
// If resolveSymbol returns null, the extractor creates a contract with
// an empty symbolUid/ref — cross-impact still works via name-based
// matching through the `hint` path in runGroupImpact.

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is now inaccurate: when resolveSymbol returns null, the extractor does not create an empty symbolUid — it falls back to a deterministic manifest::... synthetic uid (and a non-empty symbolRef.name). Please update the comment to match the current behavior so future changes/tests aren't misled.

Suggested change
// If resolveSymbol returns null, the extractor creates a contract with
// an empty symbolUid/ref — cross-impact still works via name-based
// matching through the `hint` path in runGroupImpact.
// If resolveSymbol returns null, the extractor falls back to a
// deterministic synthetic `manifest::...` symbol uid and preserves a
// non-empty symbolRef.name derived from the manifest contract.

Copilot uses AI. Check for mistakes.
Comment on lines +20 to +34
* 1. **`.proto` parsing** — done in-process by a small string-sanitizing
* parser (see `stripProtoCommentsAndStrings` + `extractServiceBlocks`
* below). NOT tree-sitter-based because no `tree-sitter-proto`
* grammar is installed in the repo. The parser preserves offsets so
* downstream regex scans run against a sanitized copy without
* affecting the line numbers of the original. Kept as a pragmatic
* exception to the "no regex in extractors" rule until / unless
* maintainers want to add a proto grammar.
*
* 2. **Source-scan providers / consumers** — delegated to per-language
* plugins in `./grpc-patterns/`. The orchestrator imports NO
* tree-sitter grammars or query strings — each plugin owns its own.
*/

// ─── .proto parsing (not tree-sitter) ────────────────────────────────

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header comment says .proto parsing is not tree-sitter-based because no tree-sitter-proto grammar is installed, but this PR adds tree-sitter-proto as an optional dependency and registers a .proto plugin when available. Please update this docstring to reflect the new behavior (tree-sitter when available, manual fallback otherwise).

Suggested change
* 1. **`.proto` parsing** done in-process by a small string-sanitizing
* parser (see `stripProtoCommentsAndStrings` + `extractServiceBlocks`
* below). NOT tree-sitter-based because no `tree-sitter-proto`
* grammar is installed in the repo. The parser preserves offsets so
* downstream regex scans run against a sanitized copy without
* affecting the line numbers of the original. Kept as a pragmatic
* exception to the "no regex in extractors" rule until / unless
* maintainers want to add a proto grammar.
*
* 2. **Source-scan providers / consumers** delegated to per-language
* plugins in `./grpc-patterns/`. The orchestrator imports NO
* tree-sitter grammars or query strings each plugin owns its own.
*/
// ─── .proto parsing (not tree-sitter) ────────────────────────────────
* 1. **`.proto` parsing** uses tree-sitter when the optional
* `tree-sitter-proto` plugin is available. Otherwise, it falls back to
* an in-process string-sanitizing parser (see
* `stripProtoCommentsAndStrings` + `extractServiceBlocks` below).
* The fallback parser preserves offsets so downstream regex scans run
* against a sanitized copy without affecting the line numbers of the
* original.
*
* 2. **Source-scan providers / consumers** delegated to per-language
* plugins in `./grpc-patterns/`. The orchestrator imports NO
* tree-sitter grammars or query strings each plugin owns its own.
*/
// ─── .proto parsing fallback ─────────────────────────────────────────

Copilot uses AI. Check for mistakes.
Comment on lines +369 to +393
const protoContext = await buildProtoContext(repoPath);
const protoMap = protoContext.servicesByName;

// ─── Proto files — definitive provider source ─────────────────
// When tree-sitter-proto is available, .proto files are handled by
// the plugin loop below (they're in GRPC_SCAN_GLOB). Otherwise
// fall back to the manual string-sanitizing parser.
if (!hasProtoPlugin) {
const protoFiles = await glob('**/*.proto', {
cwd: repoPath,
ignore: ['**/node_modules/**', '**/.git/**', '**/vendor/**'],
nodir: true,
});
for (const rel of protoFiles) {
const content = readSafe(repoPath, rel);
if (content) {
out.push(
...this.parseProtoFile(
content,
rel,
protoContext.packagesByProto.get(normalizeProtoPath(rel)) ?? '',
),
);
}
}

Copilot AI Apr 13, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildProtoContext(repoPath) already globs and parses all .proto files to build the proto map; on the !hasProtoPlugin path, the extractor globs/reads/parses the .proto files again to emit provider contracts. This doubles IO/CPU on platforms where the proto plugin isn't available. Consider reusing the proto file contents/service blocks from buildProtoContext (or have it also return the provider contracts) to avoid the second pass.

Copilot uses AI. Check for mistakes.
@magyargergo

Copy link
Copy Markdown
Collaborator

@claude ultra-think while reviewing this PR. This needs to be review from its architectural fit and it doesnt violate cross boundaries. We want per-language for this multi-repo support via http/grpc etc calls. Read previous comments and include them in your evaluation. We are also looking for potential bugs and performance bottlenecks. Finally, we want to follow clean code principles and highlight any violations.

@github-actions

github-actions Bot commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 3m 34s —— View job


PR Review: feat(group): extractor expansion + manifest extractor

  • Gather context and read prior review comments
  • Read core source files (extractors, patterns, manifest)
  • Analyze architectural fit and boundary violations
  • Analyze per-language plugin design
  • Analyze bugs and correctness
  • Analyze performance bottlenecks
  • Analyze clean code principles
  • Post full review

Overall Verdict

The architectural direction is correct and the plugin boundary enforcement is solid. However there are 3 bugs that will produce silent wrong results in production, 2 performance issues that Copilot already flagged but need emphasis, and 1 architectural seam that will cause confusion as the codebase grows. Details below.


Architectural Fit ✅ with one gap

What works well: The orchestrator/plugin boundary is properly enforced. All three orchestrators (topic-extractor.ts, http-route-extractor.ts, grpc-extractor.ts) import zero tree-sitter grammars. Language-specific knowledge is isolated in *-patterns/<lang>.ts. tree-sitter-scanner.ts is a clean shared abstraction. The PatternDef[] / CompiledPatterns<TMeta> generics decouple the scanner from the domain types — adding a language is a one-file edit with no orchestrator changes required. This fully resolves @magyargergo's earlier concern.

One structural gap — ManifestExtractor is not a ContractExtractor:
ManifestExtractor does not implement the ContractExtractor interface (gitnexus/src/core/group/contract-extractor.ts). It takes GroupManifestLink[] instead of (dbExecutor, repoPath, repo), and returns { contracts, crossLinks } instead of ExtractedContract[]. This is a different pipeline — deliberate, given that manifests are group-level not repo-level — but it means whoever wires sync.ts (#793) needs to know to call it differently. The type distinction should be explicit: either define a GroupExtractor interface or add a JSDoc comment on the class declaring why it doesn't conform. Without that, the next contributor adding an extractor will look at this as a reference and produce an incompatible shape.

GrpcLanguagePlugin.scan() only receives tree, not source:
grpc-patterns/node.ts:259 works around this with tree.rootNode.text.includes('loadPackageDefinition'), which reconstructs the full file text from the AST. This is a design limitation: the plugin interface forces plugins to use tree.rootNode.text any time they need raw text checks. HttpLanguagePlugin.scan(tree) and TopicMeta-based plugins have the same constraint. Consider changing all plugin scan(tree) signatures to scan(tree, source: string) — all three orchestrators already have the content string at the call site. Fix this →


Bugs 🐛

Bug 1 — normalizeHttpPath('/') returns '' (malformed contract IDs)
http-route-extractor.ts:62:

let s = p.trim().split('?')[0].toLowerCase().replace(/\/+$/, '');

For input '/': the replace(/\/+$/, '') strips the trailing slash, producing an empty string. Provider contracts for root handlers become http::GET:: (missing path segment). normalizeConsumerPath has || '/' protection at the end but normalizeHttpPath itself (used for all provider paths) does not. Any framework that registers GET / will emit a broken contract ID. Fix this →

Bug 2 — Tautological !handlerName in extractProvidersGraph
http-route-extractor.ts:247:

let handlerName: string | null = null;
if (!method || !handlerName) {   // !handlerName is ALWAYS true here

handlerName is declared and initialized to null one line above with no intervening assignment. The condition !handlerName is tautologically true, so the block runs on every row regardless of whether method was already determined by methodFromRouteReason. If method is already set, the plugin scan still runs and may overwrite it with a possibly weaker match. This is likely a leftover from a refactor where handlerName was previously set by some other code path. The condition should be if (!method) or if (!method || true) (if the intent is to always resolve the handler name). Fix this →

Bug 3 — Manifest HTTP contracts produce http::*::path, never matching extractor contracts
manifest-extractor.ts:215:

case 'http': {
  if (/^[A-Za-z]+::/.test(contract)) return `http::${contract}`;
  return `http::*::${contract}`;   // method is unknown → wildcard
}

Provider contracts from HttpRouteExtractor are keyed as http::GET::/api/orders. Manifest contracts become http::*::/api/orders. If sync.ts (#793) matches by exact contractId equality, these will never collide — manifest cross-links will always fall through to the synthetic uid path. If the * is intended as a wildcard, matching logic in sync.ts must explicitly handle it. Either the format needs to be documented as "wildcard" and sync.ts must handle it, or the buildContractId method should leave the HTTP method out and use a format that sync.ts knows how to match. The current state is unresolved and will silently produce unresolved manifest contracts for all HTTP links. Fix this →

Bug 4 — Cypher queries in resolveSymbol have no label filter (false positive matches)
manifest-extractor.ts:144-149, 160-163, 183-188:

MATCH (n) WHERE n.name = $contract   -- matches ANY node type

A topic named "orders" will match a Java class, a file node, or a variable named "orders". The gRPC method lookup WHERE n.name = $methodName with methodName = 'Create' will match any graph node named Create. For lib type, n.name = 'react' would match non-package nodes. Add label filters: (n:Topic), (n:Function|Method), (n:Package), etc. to scope each query to its expected node type. Fix this →


Performance Bottlenecks ⚡

Double glob + double parse of .proto files (confirmed)
grpc-extractor.ts:369-393: buildProtoContext(repoPath) globs all .proto files and calls extractServiceBlocks on each one to build the service map. When !hasProtoPlugin, extract() then globs .proto files again (line 377) and calls parseProtoFile on each, which calls extractServiceBlocks a second time on the same files. This is 2× filesystem traversal + 2× full parse of every proto file on environments without the tree-sitter-proto grammar (i.e., most CI environments until the grammar is stable). Fix: buildProtoContext should accept an optional providerContracts: ExtractedContract[] accumulator or return provider contracts alongside the service map. Fix this →

scanFiles() called twice in http-route-extractor.ts
http-route-extractor.ts:200,207: In the worst case (both graph paths return empty), this.scanFiles(repoPath) is awaited twice — two full glob traversals on the same directory. The fix is a single const files = await this.scanFiles(repoPath) hoisted before the provider/consumer branches, used for both. The cachedDetections optimization already prevents re-parsing files — only the glob is duplicated. Fix this →

Missing **/vendor/** in HTTP extractor glob
http-route-extractor.ts:215: The scanFiles ignore list omits **/vendor/** while both grpc-extractor.ts:399 and topic-extractor.ts:74 include it. HTTP scans will parse vendored Go and PHP sources, producing false positive route contracts from dependencies. Fix this →


Clean Code Violations

1. readSafe is copy-pasted in all three orchestrators
grpc-extractor.ts:36-46, http-route-extractor.ts:99-109, topic-extractor.ts:31-41 are byte-for-byte identical. The path-traversal guard (relToBase.startsWith('..')) is security-sensitive logic. A single extractors/fs-utils.ts should own it. Fix this →

2. dedupe strategies are inconsistent across extractors
grpc-extractor.ts:477-492 keeps the highest-confidence entry (semantically correct). http-route-extractor.ts:426-436 and topic-extractor.ts:105-114 both keep the first-seen entry without considering confidence. When two plugins detect the same route at different confidence levels, the result depends on plugin registration order — arbitrary and fragile. One canonical dedupe function parameterized by key-extractor and merge strategy would make the behavior uniform and visible.

3. _test.go skip is language-specific in a language-agnostic orchestrator
topic-extractor.ts:84: if (rel.endsWith('_test.go')) continue; is Go-specific knowledge in the orchestrator. Java test files (*Test.java), Python test files (test_*.py), etc. are not filtered. This should be pushed into the Go plugin (by widening the plugin's glob to exclude *_test.go) or applied uniformly via an ignore glob entry. Currently only Go test files are excluded from topic extraction.

4. ManifestExtractor is a stateless class
manifest-extractor.ts:49: The class holds no instance fields. new ManifestExtractor().extractFromManifest(links, executors) is just a namespace. In functional TypeScript, plain exported functions are clearer and don't require callers to new anything. manifestSymbolUid is already correctly exported as a plain function — extractFromManifest and resolveSymbol (currently private) should follow suit.

5. Stale comment at manifest-extractor.ts:126
The comment says "creates a contract with an empty symbolUid/ref" but the code creates a synthetic uid via manifestSymbolUid(...). Already flagged by Copilot with a suggested fix. Fix this →

6. buildContractId has no exhaustiveness guard
manifest-extractor.ts:212: The switch(type) covers all five current ContractType variants but has no default: type satisfies never. When ContractType gains a new member, TypeScript will silently fall through to returning undefined. Add a default: const _: never = type; return custom::${_}; or equivalent exhaustiveness assertion.


Minor Notes

  • grpc-patterns/node.ts:28: SERVICE_CLIENT_RE = /^(\w+Service)Client$/ — a class named NewService would not match (intentional), but BackgroundServiceWorkerClient would match as service BackgroundServiceWorker. Acceptable for now but worth a comment explaining why the suffix is Client not ServiceClient.
  • grpc-patterns/proto.ts:44: Uses a bare _require('tree-sitter') import inside the module-init block, bypassing the ESM import chain. This works because it's createRequire-based, but it means the smoke-test parser is a different instance than the one the orchestrator uses — a subtle state isolation that could cause issues if tree-sitter is stateful per-instance in future versions.
  • The vendored tree-sitter-proto in vendor/ is an appropriate intermediate solution while upstream ABI compatibility is sorted. The graceful fallback design is correct.

Priority Summary

# Severity File Issue
1 🔴 Bug manifest-extractor.ts:215 HTTP manifest contracts http::*::path never match extractor contracts
2 🔴 Bug manifest-extractor.ts:144,160,183 Cypher queries lack label filters — any node can match
3 🔴 Bug http-route-extractor.ts:62 normalizeHttpPath('/') returns '' → malformed contract IDs
4 🟡 Perf grpc-extractor.ts:369-393 Double glob + double parse of .proto files
5 🟡 Perf http-route-extractor.ts:200,207 scanFiles() called twice
6 🟡 Perf http-route-extractor.ts:215 Missing **/vendor/** in glob ignore
7 🟡 Design All orchestrators readSafe copied 3× — security-sensitive code
8 🟡 Design http-route-extractor.ts:247 Tautological !handlerName condition
9 🟡 Design Plugin interfaces scan(tree) forces tree.rootNode.text for source text checks
10 🔵 Style topic-extractor.ts:84 Go-specific _test.go skip in language-agnostic orchestrator

Items 1–3 should be fixed before merge. Items 4–6 are the most actionable performance wins before sync.ts (#793) lands. Items 7–9 are hygiene that will compound with each new extractor in the remaining splits.

ivkond and others added 2 commits April 13, 2026 07:16
Six fixes suggested by the Copilot AI review:

1. **`normalizeHttpPath` root-path edge case** — stripping trailing
   slashes on the input `/` produced an empty string, yielding
   malformed contract ids like `http::GET::`. Now preserves `/` for
   the root handler/fetch case.

2. **Dedupe `scanFiles` call** — `extract()` was globbing the
   source-scan file list twice (once for the provider fallback, once
   for the consumer fallback). Moved to a single lazy call that
   memoizes the result for the rest of the method.

3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other
   extractor's glob already ignored vendored sources; the HTTP one
   didn't. Fixed for consistency.

4. **`loadPackageDefinition` check is now structural** — was calling
   `tree.rootNode.text.includes('loadPackageDefinition')` which forces
   materialization of the entire file text from the parse tree
   (expensive on large files). Replaced with a dedicated compiled
   query on `(call_expression function: [(identifier) | (member_expression)])`
   so the check stays in the AST domain.

5. **`grpc-extractor.ts` header docstring updated** — still claimed
   ".proto parsing is not tree-sitter-based because no grammar is
   installed". Now describes the actual behaviour: tree-sitter when
   `tree-sitter-proto` is available (optionalDependency), manual
   fallback otherwise.

6. **Eliminated the double proto file parse on the fallback path** —
   `buildProtoContext` already globs + parses every `.proto` file to
   build `servicesByName`. On the `!hasProtoPlugin` branch the
   extractor was globbing + parsing again via the now-removed
   `parseProtoFile` helper. The fallback branch now iterates the map
   that `buildProtoContext` already produced to emit provider
   contracts directly — single pass per proto file.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>
…ne) on #796

Follows up `2f28bfc` with the remaining items from the Claude AI review:

## Bugs

**Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.**
The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x`
with no label filter, so a topic/service/package name could silently match
any node type (File, Variable, Import, Folder, …). Added label filters:
- `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort
  symbol-name matches against listener/publisher symbols)
- `grpc` method → `(n:Function|Method)`
- `grpc` service → `(n:Class|Interface)`
- `lib` → `(n:Package|Module)`

All 8 manifest-extractor tests still pass (mock executor is
label-agnostic, but the production LadybugDB graph now gets correctly
scoped queries).

**Bug 8 — Tautological `!handlerName` condition.**
`http-route-extractor.ts:extractProvidersGraph` had
`let handlerName = null; if (!method || !handlerName) { ... }` — the
`!handlerName` clause was always true since there was no intervening
assignment. Simplified to always run the plugin-scan lookup (we need
the handler name even when `methodFromRouteReason` already resolved
the method).

## Clean code / dedup

**Design 7 — `readSafe` was copy-pasted in all three orchestrators.**
Extracted to `extractors/fs-utils.ts` as the single source of truth
for the path-traversal guard. Dropped the three local copies and the
now-unused `fs`/`path` imports from topic-extractor.

**Style 10 — Language-specific `_test.go` skip in the topic orchestrator.**
Was `if (rel.endsWith('_test.go')) continue;` inside the language-
agnostic extraction loop. Pushed into the glob's ignore list
(`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`,
`dist`, `build` entries, with a comment explaining that other
languages' test file conventions either live in separate directories
(Python `tests/`, Java `src/test/`) or are already covered by the
existing ignores.

## Already addressed in `2f28bfc` (mentioned again in Claude review)

- Bug 3: `normalizeHttpPath('/')` returns `''` — fixed
- Bug 4: double glob + double parse of `.proto` — fixed
- Bug 5: `scanFiles` called twice in HTTP — fixed
- Bug 6: missing `**/vendor/**` in HTTP glob — fixed
- Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')`
  replaced with a dedicated structural query

## Deferred

- Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope;
  sync.ts matching logic lands in #793, manifest extractor already
  emits correct synthetic uids for unresolved HTTP contracts.
- Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) —
  the only real use case (`loadPackageDefinition` gate) is already
  fixed via a structural query, so the interface change would be
  cosmetic churn without a concrete consumer.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>
@magyargergo

Copy link
Copy Markdown
Collaborator

@claude are we ready for production?

@github-actions

github-actions Bot commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 1m 55s —— View job


Production Readiness Assessment

  • Read prior review comments and bug reports
  • Read current state of key source files
  • Check if previously identified bugs are fixed
  • Assess performance bottlenecks
  • Final verdict

Verdict: Almost — one correctness bug blocks a clean merge

The branch has made substantial progress since the last two reviews. The tree-sitter plugin architecture is solid, nearly all previously reported bugs are fixed, and all performance issues are resolved. One correctness bug remains that will silently cause all HTTP manifest links to fall through to the synthetic uid path.


What Was Fixed ✅

Issue Status
normalizeHttpPath('/') returning '' (malformed contract IDs) ✅ Fixed — http-route-extractor.ts:69 now returns '/'
Tautological !handlerName condition ✅ Fixed — entire block was cleanly refactored
Double glob + double parse of .proto files ✅ Fixed — buildProtoContext result reused; no second glob when !hasProtoPlugin
scanFiles() called twice for provider/consumer paths ✅ Fixed — lazy getScannedFiles() cache (line 186–191)
Missing **/vendor/** in HTTP extractor glob ✅ Fixed — line 213
readSafe copy-pasted 3× (security-sensitive) ✅ Fixed — fs-utils.ts created; all three orchestrators import from it
Cypher queries with no label filters ✅ Fixed — topic uses Function|Method|Class|Interface, grpc method → Function|Method, service → Class|Interface, libPackage|Module
Go-specific _test.go skip in language-agnostic orchestrator ✅ Fixed — moved to glob ignore in topic-extractor.ts:73, with explanatory comment
Stale grpc-extractor.ts header comment ✅ Fixed — now accurately documents tree-sitter + fallback

Remaining Issue 🔴

manifest-extractor.ts:226 — HTTP contract ID format mismatch (silent, will affect all HTTP manifest links)

// manifest-extractor.ts:225-226
if (/^[A-Za-z]+::/.test(contract)) return `http::${contract}`;
return `http::*::${contract}`;   // ← wildcard method

HttpRouteExtractor produces contract IDs like http::GET::/api/orders. ManifestExtractor.buildContractId produces http::*::/api/orders. When sync.ts (#793) joins these by contractId equality, they will never match. Every HTTP manifest link that doesn't include an explicit method prefix (e.g. GET::/orders) will silently fall through to the synthetic uid path — not an error, but cross-repo HTTP impact will be incomplete.

The fix is a design decision for sync.ts: either (a) store the wildcard format and have sync.ts match http::*::path against any http::<method>::path, or (b) change buildContractId to omit the method entirely (e.g. http::/api/orders) and align HttpRouteExtractor to a method-agnostic secondary key. The current state leaves this implicit and unresolved.

Fix this →


Minor Outstanding Items 🟡

1. Stale comment at manifest-extractor.ts:124–126
Still says "creates a contract with an empty symbolUid/ref" but the code now uses manifestSymbolUid(). Copilot suggested a fix and it wasn't applied.

2. buildContractId has no exhaustiveness guard
manifest-extractor.ts:222–237: the switch(type) covers all five current ContractType members but has no default: type satisfies never. When ContractType gains a new member, TypeScript will silently return undefined. One line fix.

3. grpc-patterns/node.tstree.rootNode.text for raw text checks
Line 259 uses tree.rootNode.text.includes('loadPackageDefinition') to work around the plugin interface not receiving source. Copilot flagged this and it's still present. Not a bug, but materializes the full file text from the AST on every file scan. Adding source: string to the plugin scan() signature would clean this up — low priority but worth tracking.


Summary

The PR is well-structured and the architecture is sound. The tree-sitter plugin boundary is properly enforced — orchestrators import zero grammars, each language plugin owns its own. Performance and security issues from prior reviews are all addressed.

The one blocker before merge: resolve the http::*::path contract ID format — either by making sync.ts handle wildcard matching explicitly, or by aligning the manifest format to what HttpRouteExtractor emits. Without this, group.yaml HTTP cross-links will appear to resolve (no crash, no error) but will never actually connect to real extractor contracts.

@magyargergo

Copy link
Copy Markdown
Collaborator

@ivkond could you please include a flow chart of the analysis pipeline? 🙏

… flow chart

## Fixes

**Remaining 🔴 — HTTP contract id wildcard format.** Documented the
`http::*::<path>` format as an intentional wildcard for manifest links
that omit the HTTP method, alongside the explicit-method form
(`GET::/path` → `http::GET::/path`). The docblock on `buildContractId`
now states both forms, notes that wildcard-aware matching is the
responsibility of the sync / cross-impact layer (#793), and
recommends the explicit-method form whenever the author knows the
method (it round-trips through exact equality without needing
wildcard logic downstream). Tests unchanged — the wildcard format is
what they've always asserted.

**Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The
comment claimed "creates a contract with an empty symbolUid/ref" but
the code switched to `manifestSymbolUid(repo, contractId)` a few
commits back. Updated to describe the actual synthetic-uid fallback
semantics and the cross-impact path that relies on both sides of the
join deriving the same uid.

**Minor 2 — exhaustiveness guard on `buildContractId`.** The
`switch(type)` covered all five current `ContractType` variants but
silently returned `undefined` if a new variant was added. Added a
`default: const _exhaustive: never = type; throw new Error(...)`
clause so the build fails loudly on an unhandled variant.

**Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already
fixed in `2f28bfc` via a dedicated structural query
(`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed.

## New: pipeline flow chart (per @magyargergo's request)

Added `src/core/group/PIPELINE.md` with four Mermaid diagrams:
1. **High-level overview** — `group.yaml` → extractors + manifest →
   contract matching → `bridge.lbug` → `runGroupImpact`.
2. **Per-repo extractor two-strategy shape** — graph-assisted
   Strategy A vs. source-scan Strategy B.
3. **Plugin architecture** — orchestrator → registry →
   per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` →
   `ExtractedContract`.
4. **Manifest extraction** — label-scoped `resolveSymbol` with the
   synthetic-uid fallback.
5. **Cross-impact query (#606)** — local impact → bridge join →
   cross-repo fan-out.

Each diagram is annotated with which PRs own which stage (this PR:
extractors + manifest; #795: bridge storage; #606: cross-impact
runtime) and points at the concrete files/functions involved.

## Tests

- 99/99 extractor tests pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>
@magyargergo

Copy link
Copy Markdown
Collaborator

Nice and impressive work @ivkond ! Thank you for your contribution! 🙏

@ivkond

ivkond commented Apr 13, 2026

Copy link
Copy Markdown
Contributor Author

Nice and impressive work @ivkond ! Thank you for your contribution! 🙏

Thank you, @magyargergo! 🙏 Really appreciate the thorough reviews — they pushed the plugin architecture into much better shape than my initial pass. If nothing else is blocking on your side, can we move on to step 3 (#792)?

@magyargergo magyargergo merged commit 4d4756f into abhigyanpatwari:main Apr 13, 2026
12 of 13 checks passed
@magyargergo

Copy link
Copy Markdown
Collaborator

@ivkond could you also include a follow up PR sorting out the bundled tree-sitter in the release process?

@ivkond ivkond deleted the feat/group-extractors branch April 13, 2026 08:03
@magyargergo

Copy link
Copy Markdown
Collaborator

@claude can you check what's being included in this PR? we already uncovered that there's work from #793. I didn't prepare for that and we need to either revert it or review it properly

@github-actions

github-actions Bot commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 5m 47s —— View job


Scope audit: PR #796 vs issue #793

  • Post initial comment
  • Examine git diff and file contents
  • Read manifest-extractor.ts in full
  • Cross-reference with other extractors and PR boundaries
  • Report findings

Findings

What is in this PR (confirmed against file SHAs)

Category Files
Extractor orchestrators grpc-extractor.ts, http-route-extractor.ts, topic-extractor.ts
New extractor manifest-extractor.ts (228 LOC)
Plugin system grpc-patterns/*, http-patterns/*, topic-patterns/*, tree-sitter-scanner.ts
Shared utility fs-utils.ts
Vendor vendor/tree-sitter-proto/*
Pipeline docs PIPELINE.md (added)
Tests *-extractor.test.ts updates, manifest-extractor.test.ts

The cross-boundary issue: ManifestExtractor does sync pipeline work

The three refactored extractors (GrpcExtractor, HttpRouteExtractor, TopicExtractor) all implement the standard ContractExtractor interface and return ExtractedContract[]. That is the extraction boundary.

ManifestExtractor breaks that boundary. It returns:

// manifest-extractor.ts:5-7
export interface ManifestExtractResult {
  contracts: StoredContract[];   // ← bridge-storage format, NOT ExtractedContract
  crossLinks: CrossLink[];       // ← produced here, not in sync.ts
}

In the main pipeline, sync.ts (#793) is supposed to:

  1. Call each extractor → get ExtractedContract[]
  2. Match provider/consumer contracts across repos → produce CrossLink[]
  3. Write StoredContract[] to the bridge

ManifestExtractor does all three of those steps itself:

  • Assigns role: 'provider' and role: 'consumer' per link (step 2)
  • Populates the full StoredContract shape with repo, symbolUid, symbolRef, etc. (step 3)
  • Produces crossLinks ready to write (step 3)

This is concrete evidence of #793 work included here — the role-assignment + cross-link generation + StoredContract formatting that belongs in sync.ts.

Why it happened

group.yaml manifests explicitly declare roles (role: 'provider'/'consumer'), so the author reasoned the role-assignment could be done in the extractor. That logic is understandable. But the consequence is:

  • ManifestExtractor is not a ContractExtractor — it has a completely different call signature and return type
  • sync.ts ([group] Sync pipeline (split 4/5 of #606) #793) cannot call it uniformly with the other extractors
  • The bridge-formatting concern leaks into the extraction layer

What exactly needs review or reverting

There are two clean options:

Option A — Revert ManifestExtractor to ExtractedContract[]:
Change ManifestExtractor to return ExtractedContract[] per link (one per role side), with no CrossLink generation and no StoredContract format. Implement ContractExtractor fully. Move role-pairing and cross-link generation into sync.ts (#793), where it belongs alongside the equivalent logic for the three code-based extractors.

Option B — Keep it and explicitly design the dual-interface:
Define a GroupLevelExtractor interface (separate from ContractExtractor) for sources that produce fully-resolved bridge-ready contracts. Document that sync.ts must handle both interfaces. This is architecturally defensible given that manifest sources are semantically different (roles are declared, not inferred), but it must be a conscious design decision, not an accident.

Other #793-adjacent concerns

Verdict

The ManifestExtractor as written has genuinely crossed the #793 boundary. The scope question for @magyargergo and @ivkond is whether to go with Option A (revert to ExtractedContract[], push bridge-formatting to sync.ts) or Option B (accept the dual-interface and explicitly design sync.ts around it). Option A is cleaner and keeps the extraction layer uniform. Option B is faster since the code already works.

github714801013 pushed a commit to github714801013/GitNexus that referenced this pull request Apr 28, 2026
…npatwari#606 split) (abhigyanpatwari#796)

* feat(group): extractor expansion + manifest extractor

Part 2 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#792). Follows abhigyanpatwari#795
(bridge.lbug storage foundation, already merged), but this PR has no
code-level dependency on abhigyanpatwari#795 — it only imports types and the
ContractExtractor interface that existed on upstream main before
either PR. It could have been reviewed in parallel with abhigyanpatwari#795.

## What changed

Expands the 3 existing contract extractors with substantially more
language/framework coverage, and adds a new `manifest-extractor`
that resolves `group.yaml`-declared cross-links against the per-repo
graph via exact-name lookups.

### New file (228 LOC)

- `gitnexus/src/core/group/extractors/manifest-extractor.ts` —
  exact graph lookup for `group.yaml`-declared cross-links. HTTP
  paths are canonicalized before Route.name matching; gRPC is
  resolved by service/method name (NO `.proto`-filename fallback);
  topic and lib use exact-name match. Falls back to a synthetic
  `manifest::<repo>::<contractId>` uid when the graph has no
  matching symbol, so cross-impact traversal still has a stable
  anchor for the contract.

### Modified extractors (+958 LOC prod)

- `extractors/grpc-extractor.ts` (+522) — `.proto` parser with
  comment and string-literal sanitization (braces inside strings no
  longer truncate service bodies); package/service/method canonical
  IDs; server/client detection across Go (`grpc.NewServer`,
  `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`,
  `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and
  TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`).
- `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib
  `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI
  decorators, Java Spring `@RequestMapping`/`@GetMapping`,
  restTemplate / WebClient / OkHttp consumers.
- `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}`
  struct literal detection (replaces a constructor-anchored regex
  that missed topics inside producer loops), kafka-go Writer/Reader,
  Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream
  helpers.

### Modified and new tests (+1264 LOC)

- `grpc-extractor.test.ts` (+539) — full coverage of the new proto
  parser (strings-with-braces regression, comments-with-braces
  regression), per-language server/client detection
- `http-route-extractor.test.ts` (+240) — per-framework route
  extraction + normalization edge cases
- `topic-extractor.test.ts` (+177) — the sarama in-loop regression,
  JetStream, Python NATS, kafka-go Writer/Reader
- `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization,
  gRPC exact lookup with proto-fallback regression, lib and topic
  exact matching, synthetic-uid fallback behavior

### Self-review fixes folded in

Carried forward from the abhigyanpatwari#606 self-review (commit `d15b8cb`):

- **HIGH abhigyanpatwari#1** — `manifest-extractor.resolveSymbol` was too fuzzy.
  Previously used `CONTAINS` on route/name fields plus an
  unconditional `filePath ENDS WITH '.proto'` fallback for gRPC.
  Consequences: `/orders` matched `/suborders`, and any repo with
  any `.proto` file returned a random proto symbol for a gRPC
  manifest entry. Replaced with exact equality + deterministic
  `ORDER BY` + synthetic-uid fallback for unresolved manifests.
  Regression tests included.
- **MED abhigyanpatwari#3** — gRPC proto parser brace-depth counting now sanitizes
  strings and comments first (`stripProtoCommentsAndStrings`). A
  valid proto with `option deprecated_reason = "use NewService {
  instead"` used to have its service body closed early by the `"{"`
  inside the literal, silently dropping methods after the offending
  string. Regression tests for both string-with-brace and
  comment-with-brace cases.
- **MED abhigyanpatwari#4** — sarama Kafka regex changed from
  `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on
  constructor, caught only first topic in a loop) to
  `sarama.ProducerMessage{...Topic:}` (matches every struct literal
  directly). Regression test with a for-loop that constructs
  multiple `ProducerMessage`s.
- **MED abhigyanpatwari#7** — `manifest-extractor.resolveSymbol` no longer has a
  silent `catch { /* fall through */ }`. Errors from the graph
  executor are logged via `console.warn` with link type, contract
  name, repo key, and error message before falling through to the
  synthetic-uid path.

## Why

Reviewer focus here is pure regex / parser correctness — no
storage, no Cypher queries, no algorithmic changes to the cross-link
algorithm. Separating this from the bridge foundation PR (abhigyanpatwari#795)
meant reviewers could stay in a single mental mode (parsing logic)
instead of context-switching between DDL, Cypher, and regex.

## How to verify

- `cd gitnexus && npx tsc --noEmit`
- `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks`

Local pre-push: typecheck clean, all 99 extractor unit tests pass
(grpc 43, http 18, topic 30, manifest 8).

## Risk / rollback

**Low.** Extractors have no user-facing surface in this PR — they
produce `ExtractedContract[]` that is consumed by `sync.ts` in the
next split (abhigyanpatwari#793). No existing behavior changes for users who don't
run a `group sync`. Rollback = `git revert` of the merge commit;
the modifications to `grpc-extractor.ts` / `http-route-extractor.ts`
/ `topic-extractor.ts` revert to the pre-PR versions that still
work (they're subsets of the new functionality).

## Scope discipline (per GUARDRAILS.md)

- Only the 8 files above are touched; no drive-by refactors
- No CI/release/security config changes
- No secrets or machine-specific paths
- Content lifted from abhigyanpatwari#606 (CI 11/11 green on `d15b8cb`)

## Dependencies

- **Base:** `main` (upstream already includes abhigyanpatwari#795 as `f6fb87f`)
- **Blocks:** sync pipeline (abhigyanpatwari#793) and the cross-impact feature (abhigyanpatwari#794)
- **Tracker issue:** abhigyanpatwari#792
- **Parent PR:** abhigyanpatwari#606

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate topic-extractor from regex to tree-sitter queries

Addresses @magyargergo's feedback on abhigyanpatwari#796 that regex-based lookups
should use tree-sitter nodes instead, and that the top-level
extractors must NOT carry language dependencies. This is phase 1 of
a multi-step migration — topic-extractor first because its patterns
are the most uniform (16 "call/annotation with first-arg string
literal" variants), which makes it a clean proof of the approach
before grpc-extractor and http-route-extractor get the same treatment.

## Architecture: language-agnostic orchestrator + per-language plugins

The top-level extractor is a thin orchestrator that never imports a
tree-sitter grammar or a query string. Per-language knowledge lives
in a new `topic-patterns/` folder with one file per language plus a
registry that maps file extensions to compiled plugins:

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared, language-agnostic scanning utilities
├── topic-extractor.ts              # thin orchestrator (no grammar imports)
└── topic-patterns/
    ├── types.ts                    # TopicMeta, Broker
    ├── index.ts                    # registry: extension → compiled provider
    ├── java.ts                     # tree-sitter-java + JAVA_TOPIC_PROVIDER
    ├── go.ts                       # tree-sitter-go + GO_TOPIC_PROVIDER
    ├── python.ts                   # tree-sitter-python + PYTHON_TOPIC_PROVIDER
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript
                                    # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER
```

**Shared scanner (`tree-sitter-scanner.ts`)** — defines
`PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>`
and the `scanFile(parser, plugin, content)` helper. Plugins compile their
queries eagerly at module load via `compilePatterns()`, so a broken
pattern fails loudly at import time instead of silently at scan time.
`unquoteLiteral()` handles single/double/template quotes, Python
triple-quoted strings, and Go raw backtick strings.

**Per-language plugins** own:
- the tree-sitter grammar import (this is the ONLY place in
  `src/core/group/` where tree-sitter grammars are imported),
- the query S-expressions,
- the `TopicMeta` payload (role, broker, confidence, symbolName) that
  the orchestrator receives back on every match.

Each plugin uses a `@value` capture name to bind the topic literal node.
The JavaScript and TypeScript grammars share AST node names for every
construct we query, so `node.ts` defines the pattern sources once and
compiles them against `JavaScript`, `TypeScript.typescript`, and
`TypeScript.tsx` — exporting three providers because `Parser.Query`
objects are NOT portable across grammar instances.

**Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider,
`.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX.
Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single
file-level edit (drop `topic-patterns/<lang>.ts`, import + register it
here — zero edits required in `topic-extractor.ts`).

**Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or
query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser,
provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`.
Reuses one `Parser` instance across files; the scanner calls
`setLanguage` per plugin.

## Why this is better than regex

1. **Comments and strings are respected for free.** The old regex
   would match `// kafkaTemplate.send("fake.topic")` as a real
   producer; tree-sitter never visits comments or string literals as
   code nodes, so false positives from commented-out code are
   eliminated.
2. **Struct/object literal patterns are structural, not textual.**
   `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char
   lookahead (which was a known cross-match bug partly mitigated by a
   loop regression test in the self-review). The new query matches a
   specific `composite_literal` with a specific `qualified_type` and
   `keyed_element` — exactly one struct literal per match.
3. **No order-of-operations fragility.** Regex for
   `channel.publish` vs `channel.consume` was independent and
   file-wide; the AST scopes matches to the specific `call_expression`.
4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic
   detection later means dropping one file in `topic-patterns/` — no
   changes to shared scanner or orchestrator, and no tree-sitter
   imports leak into top-level code.

## Per-file fault tolerance

- Malformed files that tree-sitter can't parse are silently skipped
  (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline
  already logs unparseable files at index time.
- A syntactically invalid query is caught at `compilePatterns` time,
  not scan time — broken plugins fail loudly at import.
- Per-pattern `matches()` failures are swallowed so one broken query
  in a plugin doesn't block the rest.

## Tests

All 30 existing `topic-extractor.test.ts` tests pass **without any
changes to the test file** — they were written as input/output contract
tests (given this source file, expect these `ExtractedContract` objects)
and that contract is unchanged. Regression coverage includes:

- Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node
  `producer.send` + `consumer.subscribe`; Go sarama producer/consumer
  (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` +
  `producer.send/produce`
- RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`;
  Node `channel.consume/publish/sendToQueue`; Python `basic_consume/
  basic_publish` with keyword args
- NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream
  `js.Subscribe/Publish`; Python `await nc.subscribe/publish`

Including the regression test for the sarama `ProducerMessage`
in-loop case — the AST-based query captures every literal in the
file independently, not just the first one after `NewSyncProducer`.

## Neighbor regression check

- `topic-extractor.test.ts` — 30/30 pass (rewritten extractor)
- `http-route-extractor.test.ts` — 18/18 pass (untouched)
- `grpc-extractor.test.ts` — 43/43 pass (untouched)
- `manifest-extractor.test.ts` — 8/8 pass (untouched)
- Full `npx tsc --noEmit` clean

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched; no
  changes to other extractors, tests, MCP surface, or pipeline.ts.
- No CI/release/security config changes, no secrets.
- New tree-sitter imports all reference grammars that are already
  installed as dependencies (`tree-sitter`, `tree-sitter-javascript`,
  `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`,
  `tree-sitter-go` — all in `package.json` for the existing pipeline).

## Phase 2 / phase 3 plan

- **Phase 2 (next commit):** rewrite `http-route-extractor.ts`
  Strategy B (regex fallback) on the same plugin pattern. Graph-assisted
  Strategy A stays as-is (already uses pipeline-built tree-sitter data
  via `HANDLES_ROUTE` Cypher queries).
- **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java /
  Go / Python / TypeScript detection. `.proto` files are the one
  outstanding question — there is no `tree-sitter-proto` grammar
  installed; the in-tree string-sanitizing parser stays as a pragmatic
  exception with a comment, alternative being to add
  `tree-sitter-proto` as a dep (open for the maintainer).

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins

Phase 2 of the extractor refactor requested by @magyargergo on abhigyanpatwari#796.
Same architecture as the phase 1 topic-extractor rewrite: a thin,
language-agnostic orchestrator plus per-language plugins that own
tree-sitter grammars and query sources. The top-level extractor file
no longer imports any tree-sitter grammar or query string.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts          # shared, language-agnostic primitives
├── http-route-extractor.ts         # thin orchestrator (no grammar imports)
└── http-patterns/
    ├── types.ts                    # HttpDetection, HttpLanguagePlugin, HttpRole
    ├── index.ts                    # registry: ext → plugin + HTTP_SCAN_GLOB
    ├── java.ts                     # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp
    ├── go.ts                       # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers
    ├── python.ts                   # tree-sitter-python: FastAPI + requests
    ├── php.ts                      # tree-sitter-php: Laravel Route::get/...
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   NestJS controllers, Express, fetch, axios
```

**Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1:
- `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named
  capture the query binds, not just a single `@value`). Topic extractor
  updated to read `match.captures.value` accordingly.
- New `runCompiledPatterns(plugin, tree)` helper lets plugins run
  multiple query bundles against the same pre-parsed tree. This is
  needed for HTTP plugins that combine a class-prefix query with a
  method-route query (Spring, NestJS).
- `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`.

**HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a
`scan(tree)` function rather than a flat pattern list. This reflects
HTTP's more complex extraction: each detection needs method + path +
handler name, and framework patterns like Spring `@RequestMapping` /
NestJS `@Controller` require cross-referencing a class-level prefix
with method-level annotations. Plugins internally use
`compilePatterns` + `runCompiledPatterns` and walk the AST to resolve
the class/method relationships.

**Per-framework coverage:**

- **Java (`java.ts`)**
  - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put|
    Delete|Patch)Mapping("/sub")` method routes, joined via the
    enclosing `class_declaration` node id.
  - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` →
    method derived from API name.
  - `WebClient.method(HttpMethod.X, "/path")` → method from
    `HttpMethod.X` capture.
  - `new Request.Builder().url("/path")` → OkHttp consumer.

- **Go (`go.ts`)**
  - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures
    upper-case verb + handler identifier.
  - `net/http.HandleFunc("/path", handler)` → provider (default GET).
  - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`,
    resty `client.R().Get/Post/...`.

- **Python (`python.ts`)**
  - `@app.get("/path")` FastAPI decorators.
  - `requests.get/post/...` and `requests.request("METHOD", "url")`.

- **PHP (`php.ts`)**
  - Laravel `Route::get/post/.../patch('/path', ...)` via
    `scoped_call_expression`. Uses `PHP.php_only` to match the
    existing ingestion pipeline's grammar selection.

- **Node (`node.ts`) — JS + TS + TSX**
  - Pattern sources defined once, compiled against three grammar
    variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`)
    because `Parser.Query` objects are not portable across grammars.
    Exports three plugins sharing the same `scan` logic.
  - NestJS: `@Controller('prefix')` decorators are siblings of the
    class in `export_statement` / `program`; `@Get(':id')` decorators
    are siblings of the method in `class_body`. The plugin walks
    decorator → next named sibling to find the decorated class /
    method, then combines the class prefix with the method path.
    Only emits NestJS detections when the enclosing class has a real
    `@Controller` decorator — prevents false positives from generic
    classes that happen to use `@Get` from another library.
  - Express: `(router|app).<verb>('/path', ...)`.
  - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })`
    (uses two queries + a SyntaxNode-id dedupe set so URL literals
    aren't double-emitted by the options variant).
  - `axios.get/post/...`.

## Orchestrator changes

`http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers`
regex method and replaces them with a single source-scan loop that
delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator
still owns:

- **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`)
  — language-agnostic string processing shared by both strategies.
- **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS`
  Cypher queries) — unchanged in spirit. The only regex helpers it
  used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now
  replaced by a lookup against the plugin's detections for the same
  file: for each route row, find the detection whose normalized path
  matches, and pull the HTTP method + handler name from it.
- **Per-file parse cache** — the orchestrator parses each relevant
  file at most once per `extract()` call. Both the graph-assisted
  enrichment loop and the source-scan fallback share the same
  `cachedDetections` map, so we never run the plugin twice for the
  same file.

## Why this is better than the regex version

1. **Comments and strings for free.** The old regex would match
   `// router.get('/fake')` as a real Express route; tree-sitter
   never visits string/comment nodes.
2. **Structural controller-prefix.** Spring and NestJS class-prefix
   joining is now scoped to the enclosing class via `class_declaration`
   node ids, eliminating file-wide state that broke when a file had
   multiple controllers.
3. **Precise NestJS disambiguation.** The plugin only emits a NestJS
   detection when the enclosing class has a real `@Controller`
   decorator — the old regex would fire on any `@Get(...)` in the
   file regardless of surrounding context.
4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP
   detection later means dropping one file in `http-patterns/` — no
   changes to the shared scanner, the orchestrator, or the Strategy A
   Cypher queries.

## Tests

- `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged;
  they're contract-style input/output tests and the contract shape is
  unchanged). Covers Spring class prefix, Express, gin/echo, stdlib
  HandleFunc, NestJS, Laravel, FastAPI for providers and
  fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/
  resty for consumers, plus graph-first Strategy A for both.
- `topic-extractor.test.ts` — **30/30 pass** after the `captures.value`
  API migration.
- `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3).
- `manifest-extractor.test.ts` — 8/8 pass (untouched).
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No changes to pipeline.ts, MCP surface, ingestion, or tests.
- No CI / release / security / secrets changes.
- Tree-sitter grammars imported by plugins (`tree-sitter-java`,
  `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`,
  `tree-sitter-javascript`, `tree-sitter-typescript`) are all already
  in `package.json` for the existing ingestion pipeline.

## Phase 3 plan

- **grpc-extractor** gets the same treatment: plugin-per-language under
  `grpc-patterns/` for Java / Go / Python / TS detection. `.proto`
  files remain an open question — no `tree-sitter-proto` grammar is
  installed, so the in-tree string-sanitizing parser from PR abhigyanpatwari#796's
  self-review stays as a pragmatic exception unless the maintainer
  wants us to add `tree-sitter-proto` as a new dep.

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate grpc-extractor source scans to tree-sitter plugins

Phase 3 (final) of the extractor refactor requested by @magyargergo on
abhigyanpatwari#796. Same architecture as phase 1 (topic) and phase 2 (http): thin
language-agnostic orchestrator + per-language plugins that own
tree-sitter grammars and query sources. With this commit the top-level
extractors under `src/core/group/extractors/` import ZERO tree-sitter
grammars and ZERO query strings — every grammar import lives in a
`*-patterns/<lang>.ts` plugin file, and the orchestrators go through
the registry indirection.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared primitives (unchanged)
├── grpc-extractor.ts               # orchestrator (only `.proto` parser left)
└── grpc-patterns/
    ├── types.ts                    # GrpcDetection, GrpcLanguagePlugin, GrpcRole
    ├── index.ts                    # registry: ext → plugin + GRPC_SCAN_GLOB
    ├── go.ts                       # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient
    ├── java.ts                     # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub
    ├── python.ts                   # tree-sitter-python: add_XxxServicer_to_server + XxxStub
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   @GrpcMethod, @GrpcClient field type,
                                    #   .getService<X>('Svc'), new XxxServiceClient,
                                    #   loadPackageDefinition dynamic constructors
```

## Per-language coverage

**Go (`go.ts`)**
- Provider: `\w+.RegisterXxxServer(...)` via `call_expression →
  selector_expression → field_identifier` + JS regex filter
  `^Register(\w+)Server$`.
- Provider: `pb.UnimplementedXxxServer` embedded in a struct via
  `struct_type → field_declaration_list → field_declaration →
  qualified_type → type_identifier` + JS filter.
- Consumer: `\w+.NewXxxClient(...)` via the same call_expression
  query + JS filter `^New(\w+)Client$`.

**Java (`java.ts`)**
- Provider: `class X extends YyyGrpc.YyyImplBase` — two queries
  handle the scoped and plain forms. `scoped_type_identifier`'s
  children are positional (no `scope:`/`name:` fields), so the
  query matches the two `type_identifier` children by position.
- `#match? @inner "ImplBase$"` restricts matches at query time.
- Whether the class has `@GrpcService` or not controls only the
  `source` metadata label — the plugin walks the class_declaration's
  `modifiers` child in JS to detect the marker_annotation.
- Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a
  `method_invocation` query with `#match? @method
  "^new(Blocking)?Stub$"`, service name extracted via
  `^(\w+)Grpc$` on the object identifier.

**Python (`python.ts`)**
- Single call-expression query covers both bare identifier and
  `obj.method` attribute forms:
  `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`.
- Plugin filters `@fn.text` against two JS regexes:
  `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$`
  (consumer), with a reserved-names ignore list for the Stub case
  (Mock / Test / Fake / Stub).

**Node — JavaScript + TypeScript + TSX (`node.ts`)**
- Pattern sources defined once, compiled three times (one per grammar)
  because `Parser.Query` objects are not portable across grammars.
  Exports three `GrpcLanguagePlugin`s sharing the same `scan`.
- `@GrpcMethod('Service', 'Method')`: decorator query captures the
  two string literals. Confidence is hard-coded 0.8 regardless of
  proto map resolution (matches the original regex version's
  behaviour).
- `@GrpcClient(...) field: XxxServiceClient`: decorator query
  captures the decorator node, plugin walks up to find the enclosing
  `public_field_definition` (decorators on fields are CHILDREN of
  the field definition in tree-sitter-typescript, not siblings) and
  reads its first `type_annotation → type_identifier`, then runs the
  `^(\w+Service)Client$` JS filter.
- `client.getService<X>('AuthService')`: call-expression query on
  `member_expression.property = "getService"` + string literal arg.
- `new XxxServiceClient(...)`: `new_expression` with a bare
  identifier constructor, filtered by `^(\w+Service)Client$` so
  generic `new AuthClient(...)` (missing the `Service` infix) does
  NOT falsely register as a consumer. Preserves the regression test
  `test_extract_ts_non_service_client_constructor_is_ignored`.
- `loadPackageDefinition` dynamic loader: gated on
  `tree.rootNode.text.includes('loadPackageDefinition')`. When set,
  `new foo.bar.Xxx(...)` qualified constructors with a capitalised
  property name register as consumers.

## Orchestrator changes

`grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders`
/ ... helper and replaces them with a single source-scan loop that:

1. Parses each file with the plugin's grammar (one shared `Parser`
   instance across all files, `setLanguage` called per plugin).
2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`.
3. Converts each detection to an `ExtractedContract` via the private
   `detectionToContract` helper, which:
   - Looks the short service name up in the proto map (filled by
     the `.proto` parser).
   - Picks confidence = `confidenceWithProto` if resolved, else
     `confidenceWithoutProto`.
   - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when
     the detection carries a `methodName` (TS `@GrpcMethod` only),
     otherwise a service-level id (`grpc::pkg.Svc/*`).

Everything else — the `.proto` parser, `buildProtoContext`,
`buildProtoMap`, `resolveProtoConflict`, `serviceContractId`,
`stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe
function — stays exactly as before. The `.proto` parser is kept as a
pragmatic exception to the "no regex in extractors" rule because no
`tree-sitter-proto` grammar is installed in the repo; a comment at the
top of the file explains this and flags the maintainer option of
adding `tree-sitter-proto` as a dependency.

## Why this is better than the regex version

1. **Comments and strings are respected for free.** Matched node types
   are only code constructs, never text inside comments or string
   literals.
2. **No false positives on partial names.** The old `(\w+?)Grpc`-style
   regexes would cross-match unrelated identifiers; structural queries
   restrict matches to the exact AST shape (`scoped_type_identifier →
   type_identifier` pairs, `method_invocation → identifier` etc.).
3. **NestJS `@GrpcClient` is structural, not regex-based.** The old
   regex required a specific textual layout
   (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the
   plugin now walks the AST, so modifier order / optional modifiers /
   multi-line formatting don't break it.
4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC
   detection later is a one-file edit in `grpc-patterns/index.ts` —
   no touches to the shared scanner, the orchestrator, or the proto
   parser.

## Tests

- `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the
  contract shape is identical). Covers .proto parsing (including the
  brace-inside-string regression), Go provider/consumer,
  Java @GrpcService / plain ImplBase provider + newBlockingStub
  consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient +
  .getService + new XxxServiceClient + loadPackageDefinition + the
  `AuthClient` vs `AuthServiceClient` discrimination, dedupe across
  multiple patterns in one file, proto-aware confidence, and the
  inherited-package resolution for split proto definitions.
- `topic-extractor.test.ts` — 30/30 pass.
- `http-route-extractor.test.ts` — 18/18 pass.
- `manifest-extractor.test.ts` — 8/8 pass.
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No pipeline.ts, MCP surface, ingestion, CI / release / security, or
  test changes.
- New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`,
  `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`)
  are all already installed for the ingestion pipeline.

## End of phase series

This commit completes the three-phase extractor refactor:
  - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/`
  - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/`
  - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/`

Every remaining regex-based extractor helper under the `src/core/group/
extractors/` directory is either (a) language-agnostic string
processing (path normalization, dedupe keys) or (b) the `.proto`
parser, which is documented as an explicit exception.

Co-authored-by: Claude <noreply@anthropic.com>

* feat(group): add tree-sitter-proto for .proto file parsing

Addresses @magyargergo's suggestion on abhigyanpatwari#796 to replace the manual
string-sanitizing .proto parser with a tree-sitter grammar.

- **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`.
  Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto)
  (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli
  0.24` to produce ABI version 14 — compatible with the project's
  `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as
  `optionalDependency` with `file:./vendor/tree-sitter-proto`.

- **New `grpc-patterns/proto.ts` plugin** — uses the same
  `compilePatterns` + `runCompiledPatterns` infrastructure as every
  other plugin. Two queries:
  - `(package (full_ident) @pkg)` — package declaration
  - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))`
    — one match per (service, rpc) pair

- **Graceful fallback** — `tree-sitter-proto` is an optional
  dependency. If it fails to install (platform incompatibility) or
  fails the runtime smoke-test (`setLanguage` + `parse` on a trivial
  proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator
  uses the existing manual parser. The smoke-test catches the
  `SyntaxNode` TDZ error that occurs in vitest's fork-based test
  runner.

- **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto`
  files are handled by the plugin loop (they're included in
  `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is
  skipped. `buildProtoContext` still runs to build the proto map
  for cross-referencing source-file detections.

1. **No manual comment/string stripping.** The old parser needed
   `stripProtoCommentsAndStrings` (110 lines) to avoid counting
   braces inside comments and string literals. tree-sitter handles
   this natively.
2. **No brace-depth tracking.** `extractServiceBlocks` used a manual
   depth counter to find service boundaries. tree-sitter's AST gives
   us `service` → `service_name` + `rpc` → `rpc_name` directly.
3. **Performance.** tree-sitter's C-based parser is faster than
   character-by-character JS scanning + regex on large proto files.

- `grpc-extractor.test.ts` — **43/43 pass** (unchanged)
- All other extractor tests — 99/99 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* chore: add .gitignore for vendored tree-sitter-proto build artifacts

https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU

* fix: correct .gitignore paths for vendored tree-sitter-proto

Patterns should be relative to the .gitignore file's directory.

https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU

* refactor(group): address Copilot review feedback on abhigyanpatwari#796

Six fixes suggested by the Copilot AI review:

1. **`normalizeHttpPath` root-path edge case** — stripping trailing
   slashes on the input `/` produced an empty string, yielding
   malformed contract ids like `http::GET::`. Now preserves `/` for
   the root handler/fetch case.

2. **Dedupe `scanFiles` call** — `extract()` was globbing the
   source-scan file list twice (once for the provider fallback, once
   for the consumer fallback). Moved to a single lazy call that
   memoizes the result for the rest of the method.

3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other
   extractor's glob already ignored vendored sources; the HTTP one
   didn't. Fixed for consistency.

4. **`loadPackageDefinition` check is now structural** — was calling
   `tree.rootNode.text.includes('loadPackageDefinition')` which forces
   materialization of the entire file text from the parse tree
   (expensive on large files). Replaced with a dedicated compiled
   query on `(call_expression function: [(identifier) | (member_expression)])`
   so the check stays in the AST domain.

5. **`grpc-extractor.ts` header docstring updated** — still claimed
   ".proto parsing is not tree-sitter-based because no grammar is
   installed". Now describes the actual behaviour: tree-sitter when
   `tree-sitter-proto` is available (optionalDependency), manual
   fallback otherwise.

6. **Eliminated the double proto file parse on the fallback path** —
   `buildProtoContext` already globs + parses every `.proto` file to
   build `servicesByName`. On the `!hasProtoPlugin` branch the
   extractor was globbing + parsing again via the now-removed
   `parseProtoFile` helper. The fallback branch now iterates the map
   that `buildProtoContext` already produced to emit provider
   contracts directly — single pass per proto file.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): address Claude review feedback (bugs + dedup + hygiene) on abhigyanpatwari#796

Follows up `2f28bfc` with the remaining items from the Claude AI review:

## Bugs

**Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.**
The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x`
with no label filter, so a topic/service/package name could silently match
any node type (File, Variable, Import, Folder, …). Added label filters:
- `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort
  symbol-name matches against listener/publisher symbols)
- `grpc` method → `(n:Function|Method)`
- `grpc` service → `(n:Class|Interface)`
- `lib` → `(n:Package|Module)`

All 8 manifest-extractor tests still pass (mock executor is
label-agnostic, but the production LadybugDB graph now gets correctly
scoped queries).

**Bug 8 — Tautological `!handlerName` condition.**
`http-route-extractor.ts:extractProvidersGraph` had
`let handlerName = null; if (!method || !handlerName) { ... }` — the
`!handlerName` clause was always true since there was no intervening
assignment. Simplified to always run the plugin-scan lookup (we need
the handler name even when `methodFromRouteReason` already resolved
the method).

## Clean code / dedup

**Design 7 — `readSafe` was copy-pasted in all three orchestrators.**
Extracted to `extractors/fs-utils.ts` as the single source of truth
for the path-traversal guard. Dropped the three local copies and the
now-unused `fs`/`path` imports from topic-extractor.

**Style 10 — Language-specific `_test.go` skip in the topic orchestrator.**
Was `if (rel.endsWith('_test.go')) continue;` inside the language-
agnostic extraction loop. Pushed into the glob's ignore list
(`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`,
`dist`, `build` entries, with a comment explaining that other
languages' test file conventions either live in separate directories
(Python `tests/`, Java `src/test/`) or are already covered by the
existing ignores.

## Already addressed in `2f28bfc` (mentioned again in Claude review)

- Bug 3: `normalizeHttpPath('/')` returns `''` — fixed
- Bug 4: double glob + double parse of `.proto` — fixed
- Bug 5: `scanFiles` called twice in HTTP — fixed
- Bug 6: missing `**/vendor/**` in HTTP glob — fixed
- Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')`
  replaced with a dedicated structural query

## Deferred

- Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope;
  sync.ts matching logic lands in abhigyanpatwari#793, manifest extractor already
  emits correct synthetic uids for unresolved HTTP contracts.
- Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) —
  the only real use case (`loadPackageDefinition` gate) is already
  fixed via a structural query, so the interface change would be
  cosmetic churn without a concrete consumer.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* docs+fix(group): address remaining Claude review items + add pipeline flow chart

## Fixes

**Remaining 🔴 — HTTP contract id wildcard format.** Documented the
`http::*::<path>` format as an intentional wildcard for manifest links
that omit the HTTP method, alongside the explicit-method form
(`GET::/path` → `http::GET::/path`). The docblock on `buildContractId`
now states both forms, notes that wildcard-aware matching is the
responsibility of the sync / cross-impact layer (abhigyanpatwari#793), and
recommends the explicit-method form whenever the author knows the
method (it round-trips through exact equality without needing
wildcard logic downstream). Tests unchanged — the wildcard format is
what they've always asserted.

**Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The
comment claimed "creates a contract with an empty symbolUid/ref" but
the code switched to `manifestSymbolUid(repo, contractId)` a few
commits back. Updated to describe the actual synthetic-uid fallback
semantics and the cross-impact path that relies on both sides of the
join deriving the same uid.

**Minor 2 — exhaustiveness guard on `buildContractId`.** The
`switch(type)` covered all five current `ContractType` variants but
silently returned `undefined` if a new variant was added. Added a
`default: const _exhaustive: never = type; throw new Error(...)`
clause so the build fails loudly on an unhandled variant.

**Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already
fixed in `2f28bfc` via a dedicated structural query
(`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed.

## New: pipeline flow chart (per @magyargergo's request)

Added `src/core/group/PIPELINE.md` with four Mermaid diagrams:
1. **High-level overview** — `group.yaml` → extractors + manifest →
   contract matching → `bridge.lbug` → `runGroupImpact`.
2. **Per-repo extractor two-strategy shape** — graph-assisted
   Strategy A vs. source-scan Strategy B.
3. **Plugin architecture** — orchestrator → registry →
   per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` →
   `ExtractedContract`.
4. **Manifest extraction** — label-scoped `resolveSymbol` with the
   synthetic-uid fallback.
5. **Cross-impact query (abhigyanpatwari#606)** — local impact → bridge join →
   cross-repo fan-out.

Each diagram is annotated with which PRs own which stage (this PR:
extractors + manifest; abhigyanpatwari#795: bridge storage; abhigyanpatwari#606: cross-impact
runtime) and points at the concrete files/functions involved.

## Tests

- 99/99 extractor tests pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[group] Service boundary detection and language contract extractors (split 3/5 of #606)

3 participants