feat(group): extractor expansion + manifest extractor (2/4 of #606 split)#796
Conversation
|
@ivkond is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 6097 tests passed 97 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
|
@claude is this pr ready for production? Is this following clean code principles? Are there any performance bottlenecks? Is the code structured in a way that will unlock easy contributions in the future? |
|
Claude encountered an error —— View job PR Review
Verdict: Approve with notesCI is green, 99/99 extractor tests pass, self-review fixes are solid, and rollback risk is low. The code is mergeable. That said, there are a few things worth flagging now before the codebase grows further. Production Readiness ✅The PR is production-ready given its current scope. Extractors produce The self-review fixes carried in from #606 are exactly the right things to have caught: the Clean Code Issues1. 2. 3. 4. 5. Magic confidence numbers — values like Performance Bottlenecks1. Double glob + double parse for 2. Regex objects recreated per file in 3. Proto file reads are serial — 4. ExtensibilityThe overall structure unlocks contributions well. The Two things that could trip up future contributors: 1. The manifest resolver's topic/lib queries are label-unaware — 2. Minor Nits
Summary: Merge-ready. The double proto glob is the most actionable fix before the next split PR lands, since |
There was a problem hiding this comment.
@ivkond I don't think it's ready to be merged! I think, we should work on this a little more and try to find a better approach. Right now, we are doing regex based lookups, would it be possible to utilise tree-sitter nodes?
|
Can we make them language agnostic? The rule of thumb is we dont want in the top level processors/extractors to have langauge dependencies. |
115a591 to
ea06d11
Compare
There was a problem hiding this comment.
Could we add https://github.com/coder3101/tree-sitter-proto instead of manual parsing?
|
@magyargergo, checked tree-sitter-proto — the grammar peers on tree-sitter ^0.21 while we run 0.25, so the native binary won't load (ABI mismatch). The .proto parser we have is already sanitizing comments + strings and handles nested braces correctly (43 tests cover it including the braces-in-strings and braces-in-comments regressions). Happy to revisit once the grammar is updated to the 0.25 ABI, or if you'd prefer we fork + rebuild it ourselves — let us know. In the meantime, all three extractor source-scan paths (topic, http, grpc) have been migrated to the language-agnostic plugin architecture per your earlier feedback — the .proto in-tree parser is the only remaining non-tree-sitter code path and is documented as an explicit exception. |
|
I'm just concious about performance and that's why I'm pushing for it. Could we maybe spend today with looking for an alternatice fork that may have sorted this? 🙏 |
Part 2 of 4 in the split of #606 (ticket: #792). Follows #795 (bridge.lbug storage foundation, already merged), but this PR has no code-level dependency on #795 — it only imports types and the ContractExtractor interface that existed on upstream main before either PR. It could have been reviewed in parallel with #795. ## What changed Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups. ### New file (228 LOC) - `gitnexus/src/core/group/extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml`-declared cross-links. HTTP paths are canonicalized before Route.name matching; gRPC is resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::<repo>::<contractId>` uid when the graph has no matching symbol, so cross-impact traversal still has a stable anchor for the contract. ### Modified extractors (+958 LOC prod) - `extractors/grpc-extractor.ts` (+522) — `.proto` parser with comment and string-literal sanitization (braces inside strings no longer truncate service bodies); package/service/method canonical IDs; server/client detection across Go (`grpc.NewServer`, `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`, `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`). - `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI decorators, Java Spring `@RequestMapping`/`@GetMapping`, restTemplate / WebClient / OkHttp consumers. - `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection (replaces a constructor-anchored regex that missed topics inside producer loops), kafka-go Writer/Reader, Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream helpers. ### Modified and new tests (+1264 LOC) - `grpc-extractor.test.ts` (+539) — full coverage of the new proto parser (strings-with-braces regression, comments-with-braces regression), per-language server/client detection - `http-route-extractor.test.ts` (+240) — per-framework route extraction + normalization edge cases - `topic-extractor.test.ts` (+177) — the sarama in-loop regression, JetStream, Python NATS, kafka-go Writer/Reader - `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization, gRPC exact lookup with proto-fallback regression, lib and topic exact matching, synthetic-uid fallback behavior ### Self-review fixes folded in Carried forward from the #606 self-review (commit `d15b8cb`): - **HIGH #1** — `manifest-extractor.resolveSymbol` was too fuzzy. Previously used `CONTAINS` on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences: `/orders` matched `/suborders`, and any repo with any `.proto` file returned a random proto symbol for a gRPC manifest entry. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback for unresolved manifests. Regression tests included. - **MED #3** — gRPC proto parser brace-depth counting now sanitizes strings and comments first (`stripProtoCommentsAndStrings`). A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal, silently dropping methods after the offending string. Regression tests for both string-with-brace and comment-with-brace cases. - **MED #4** — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal directly). Regression test with a for-loop that constructs multiple `ProducerMessage`s. - **MED #7** — `manifest-extractor.resolveSymbol` no longer has a silent `catch { /* fall through */ }`. Errors from the graph executor are logged via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path. ## Why Reviewer focus here is pure regex / parser correctness — no storage, no Cypher queries, no algorithmic changes to the cross-link algorithm. Separating this from the bridge foundation PR (#795) meant reviewers could stay in a single mental mode (parsing logic) instead of context-switching between DDL, Cypher, and regex. ## How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks` Local pre-push: typecheck clean, all 99 extractor unit tests pass (grpc 43, http 18, topic 30, manifest 8). ## Risk / rollback **Low.** Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` that is consumed by `sync.ts` in the next split (#793). No existing behavior changes for users who don't run a `group sync`. Rollback = `git revert` of the merge commit; the modifications to `grpc-extractor.ts` / `http-route-extractor.ts` / `topic-extractor.ts` revert to the pre-PR versions that still work (they're subsets of the new functionality). ## Scope discipline (per GUARDRAILS.md) - Only the 8 files above are touched; no drive-by refactors - No CI/release/security config changes - No secrets or machine-specific paths - Content lifted from #606 (CI 11/11 green on `d15b8cb`) ## Dependencies - **Base:** `main` (upstream already includes #795 as `1ff324c`) - **Blocks:** sync pipeline (#793) and the cross-impact feature (#794) - **Tracker issue:** #792 - **Parent PR:** #606 Co-authored-by: Claude <noreply@anthropic.com>
…eries Addresses @magyargergo's feedback on #796 that regex-based lookups should use tree-sitter nodes instead, and that the top-level extractors must NOT carry language dependencies. This is phase 1 of a multi-step migration — topic-extractor first because its patterns are the most uniform (16 "call/annotation with first-arg string literal" variants), which makes it a clean proof of the approach before grpc-extractor and http-route-extractor get the same treatment. ## Architecture: language-agnostic orchestrator + per-language plugins The top-level extractor is a thin orchestrator that never imports a tree-sitter grammar or a query string. Per-language knowledge lives in a new `topic-patterns/` folder with one file per language plus a registry that maps file extensions to compiled plugins: ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic scanning utilities ├── topic-extractor.ts # thin orchestrator (no grammar imports) └── topic-patterns/ ├── types.ts # TopicMeta, Broker ├── index.ts # registry: extension → compiled provider ├── java.ts # tree-sitter-java + JAVA_TOPIC_PROVIDER ├── go.ts # tree-sitter-go + GO_TOPIC_PROVIDER ├── python.ts # tree-sitter-python + PYTHON_TOPIC_PROVIDER └── node.ts # tree-sitter-javascript + tree-sitter-typescript # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER ``` **Shared scanner (`tree-sitter-scanner.ts`)** — defines `PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>` and the `scanFile(parser, plugin, content)` helper. Plugins compile their queries eagerly at module load via `compilePatterns()`, so a broken pattern fails loudly at import time instead of silently at scan time. `unquoteLiteral()` handles single/double/template quotes, Python triple-quoted strings, and Go raw backtick strings. **Per-language plugins** own: - the tree-sitter grammar import (this is the ONLY place in `src/core/group/` where tree-sitter grammars are imported), - the query S-expressions, - the `TopicMeta` payload (role, broker, confidence, symbolName) that the orchestrator receives back on every match. Each plugin uses a `@value` capture name to bind the topic literal node. The JavaScript and TypeScript grammars share AST node names for every construct we query, so `node.ts` defines the pattern sources once and compiles them against `JavaScript`, `TypeScript.typescript`, and `TypeScript.tsx` — exporting three providers because `Parser.Query` objects are NOT portable across grammar instances. **Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider, `.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX. Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single file-level edit (drop `topic-patterns/<lang>.ts`, import + register it here — zero edits required in `topic-extractor.ts`). **Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser, provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`. Reuses one `Parser` instance across files; the scanner calls `setLanguage` per plugin. ## Why this is better than regex 1. **Comments and strings are respected for free.** The old regex would match `// kafkaTemplate.send("fake.topic")` as a real producer; tree-sitter never visits comments or string literals as code nodes, so false positives from commented-out code are eliminated. 2. **Struct/object literal patterns are structural, not textual.** `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char lookahead (which was a known cross-match bug partly mitigated by a loop regression test in the self-review). The new query matches a specific `composite_literal` with a specific `qualified_type` and `keyed_element` — exactly one struct literal per match. 3. **No order-of-operations fragility.** Regex for `channel.publish` vs `channel.consume` was independent and file-wide; the AST scopes matches to the specific `call_expression`. 4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic detection later means dropping one file in `topic-patterns/` — no changes to shared scanner or orchestrator, and no tree-sitter imports leak into top-level code. ## Per-file fault tolerance - Malformed files that tree-sitter can't parse are silently skipped (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline already logs unparseable files at index time. - A syntactically invalid query is caught at `compilePatterns` time, not scan time — broken plugins fail loudly at import. - Per-pattern `matches()` failures are swallowed so one broken query in a plugin doesn't block the rest. ## Tests All 30 existing `topic-extractor.test.ts` tests pass **without any changes to the test file** — they were written as input/output contract tests (given this source file, expect these `ExtractedContract` objects) and that contract is unchanged. Regression coverage includes: - Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node `producer.send` + `consumer.subscribe`; Go sarama producer/consumer (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` + `producer.send/produce` - RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`; Node `channel.consume/publish/sendToQueue`; Python `basic_consume/ basic_publish` with keyword args - NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream `js.Subscribe/Publish`; Python `await nc.subscribe/publish` Including the regression test for the sarama `ProducerMessage` in-loop case — the AST-based query captures every literal in the file independently, not just the first one after `NewSyncProducer`. ## Neighbor regression check - `topic-extractor.test.ts` — 30/30 pass (rewritten extractor) - `http-route-extractor.test.ts` — 18/18 pass (untouched) - `grpc-extractor.test.ts` — 43/43 pass (untouched) - `manifest-extractor.test.ts` — 8/8 pass (untouched) - Full `npx tsc --noEmit` clean ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched; no changes to other extractors, tests, MCP surface, or pipeline.ts. - No CI/release/security config changes, no secrets. - New tree-sitter imports all reference grammars that are already installed as dependencies (`tree-sitter`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`, `tree-sitter-go` — all in `package.json` for the existing pipeline). ## Phase 2 / phase 3 plan - **Phase 2 (next commit):** rewrite `http-route-extractor.ts` Strategy B (regex fallback) on the same plugin pattern. Graph-assisted Strategy A stays as-is (already uses pipeline-built tree-sitter data via `HANDLES_ROUTE` Cypher queries). - **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java / Go / Python / TypeScript detection. `.proto` files are the one outstanding question — there is no `tree-sitter-proto` grammar installed; the in-tree string-sanitizing parser stays as a pragmatic exception with a comment, alternative being to add `tree-sitter-proto` as a dep (open for the maintainer). Co-authored-by: Claude <noreply@anthropic.com>
…er plugins Phase 2 of the extractor refactor requested by @magyargergo on #796. Same architecture as the phase 1 topic-extractor rewrite: a thin, language-agnostic orchestrator plus per-language plugins that own tree-sitter grammars and query sources. The top-level extractor file no longer imports any tree-sitter grammar or query string. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic primitives ├── http-route-extractor.ts # thin orchestrator (no grammar imports) └── http-patterns/ ├── types.ts # HttpDetection, HttpLanguagePlugin, HttpRole ├── index.ts # registry: ext → plugin + HTTP_SCAN_GLOB ├── java.ts # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp ├── go.ts # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers ├── python.ts # tree-sitter-python: FastAPI + requests ├── php.ts # tree-sitter-php: Laravel Route::get/... └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # NestJS controllers, Express, fetch, axios ``` **Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1: - `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named capture the query binds, not just a single `@value`). Topic extractor updated to read `match.captures.value` accordingly. - New `runCompiledPatterns(plugin, tree)` helper lets plugins run multiple query bundles against the same pre-parsed tree. This is needed for HTTP plugins that combine a class-prefix query with a method-route query (Spring, NestJS). - `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`. **HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a `scan(tree)` function rather than a flat pattern list. This reflects HTTP's more complex extraction: each detection needs method + path + handler name, and framework patterns like Spring `@RequestMapping` / NestJS `@Controller` require cross-referencing a class-level prefix with method-level annotations. Plugins internally use `compilePatterns` + `runCompiledPatterns` and walk the AST to resolve the class/method relationships. **Per-framework coverage:** - **Java (`java.ts`)** - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put| Delete|Patch)Mapping("/sub")` method routes, joined via the enclosing `class_declaration` node id. - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` → method derived from API name. - `WebClient.method(HttpMethod.X, "/path")` → method from `HttpMethod.X` capture. - `new Request.Builder().url("/path")` → OkHttp consumer. - **Go (`go.ts`)** - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures upper-case verb + handler identifier. - `net/http.HandleFunc("/path", handler)` → provider (default GET). - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`, resty `client.R().Get/Post/...`. - **Python (`python.ts`)** - `@app.get("/path")` FastAPI decorators. - `requests.get/post/...` and `requests.request("METHOD", "url")`. - **PHP (`php.ts`)** - Laravel `Route::get/post/.../patch('/path', ...)` via `scoped_call_expression`. Uses `PHP.php_only` to match the existing ingestion pipeline's grammar selection. - **Node (`node.ts`) — JS + TS + TSX** - Pattern sources defined once, compiled against three grammar variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`) because `Parser.Query` objects are not portable across grammars. Exports three plugins sharing the same `scan` logic. - NestJS: `@Controller('prefix')` decorators are siblings of the class in `export_statement` / `program`; `@Get(':id')` decorators are siblings of the method in `class_body`. The plugin walks decorator → next named sibling to find the decorated class / method, then combines the class prefix with the method path. Only emits NestJS detections when the enclosing class has a real `@Controller` decorator — prevents false positives from generic classes that happen to use `@Get` from another library. - Express: `(router|app).<verb>('/path', ...)`. - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })` (uses two queries + a SyntaxNode-id dedupe set so URL literals aren't double-emitted by the options variant). - `axios.get/post/...`. ## Orchestrator changes `http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers` regex method and replaces them with a single source-scan loop that delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator still owns: - **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`) — language-agnostic string processing shared by both strategies. - **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS` Cypher queries) — unchanged in spirit. The only regex helpers it used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now replaced by a lookup against the plugin's detections for the same file: for each route row, find the detection whose normalized path matches, and pull the HTTP method + handler name from it. - **Per-file parse cache** — the orchestrator parses each relevant file at most once per `extract()` call. Both the graph-assisted enrichment loop and the source-scan fallback share the same `cachedDetections` map, so we never run the plugin twice for the same file. ## Why this is better than the regex version 1. **Comments and strings for free.** The old regex would match `// router.get('/fake')` as a real Express route; tree-sitter never visits string/comment nodes. 2. **Structural controller-prefix.** Spring and NestJS class-prefix joining is now scoped to the enclosing class via `class_declaration` node ids, eliminating file-wide state that broke when a file had multiple controllers. 3. **Precise NestJS disambiguation.** The plugin only emits a NestJS detection when the enclosing class has a real `@Controller` decorator — the old regex would fire on any `@Get(...)` in the file regardless of surrounding context. 4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP detection later means dropping one file in `http-patterns/` — no changes to the shared scanner, the orchestrator, or the Strategy A Cypher queries. ## Tests - `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged; they're contract-style input/output tests and the contract shape is unchanged). Covers Spring class prefix, Express, gin/echo, stdlib HandleFunc, NestJS, Laravel, FastAPI for providers and fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/ resty for consumers, plus graph-first Strategy A for both. - `topic-extractor.test.ts` — **30/30 pass** after the `captures.value` API migration. - `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3). - `manifest-extractor.test.ts` — 8/8 pass (untouched). - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No changes to pipeline.ts, MCP surface, ingestion, or tests. - No CI / release / security / secrets changes. - Tree-sitter grammars imported by plugins (`tree-sitter-java`, `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already in `package.json` for the existing ingestion pipeline. ## Phase 3 plan - **grpc-extractor** gets the same treatment: plugin-per-language under `grpc-patterns/` for Java / Go / Python / TS detection. `.proto` files remain an open question — no `tree-sitter-proto` grammar is installed, so the in-tree string-sanitizing parser from PR #796's self-review stays as a pragmatic exception unless the maintainer wants us to add `tree-sitter-proto` as a new dep. Co-authored-by: Claude <noreply@anthropic.com>
…lugins Phase 3 (final) of the extractor refactor requested by @magyargergo on #796. Same architecture as phase 1 (topic) and phase 2 (http): thin language-agnostic orchestrator + per-language plugins that own tree-sitter grammars and query sources. With this commit the top-level extractors under `src/core/group/extractors/` import ZERO tree-sitter grammars and ZERO query strings — every grammar import lives in a `*-patterns/<lang>.ts` plugin file, and the orchestrators go through the registry indirection. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared primitives (unchanged) ├── grpc-extractor.ts # orchestrator (only `.proto` parser left) └── grpc-patterns/ ├── types.ts # GrpcDetection, GrpcLanguagePlugin, GrpcRole ├── index.ts # registry: ext → plugin + GRPC_SCAN_GLOB ├── go.ts # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient ├── java.ts # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub ├── python.ts # tree-sitter-python: add_XxxServicer_to_server + XxxStub └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # @GrpcMethod, @GrpcClient field type, # .getService<X>('Svc'), new XxxServiceClient, # loadPackageDefinition dynamic constructors ``` ## Per-language coverage **Go (`go.ts`)** - Provider: `\w+.RegisterXxxServer(...)` via `call_expression → selector_expression → field_identifier` + JS regex filter `^Register(\w+)Server$`. - Provider: `pb.UnimplementedXxxServer` embedded in a struct via `struct_type → field_declaration_list → field_declaration → qualified_type → type_identifier` + JS filter. - Consumer: `\w+.NewXxxClient(...)` via the same call_expression query + JS filter `^New(\w+)Client$`. **Java (`java.ts`)** - Provider: `class X extends YyyGrpc.YyyImplBase` — two queries handle the scoped and plain forms. `scoped_type_identifier`'s children are positional (no `scope:`/`name:` fields), so the query matches the two `type_identifier` children by position. - `#match? @inner "ImplBase$"` restricts matches at query time. - Whether the class has `@GrpcService` or not controls only the `source` metadata label — the plugin walks the class_declaration's `modifiers` child in JS to detect the marker_annotation. - Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a `method_invocation` query with `#match? @method "^new(Blocking)?Stub$"`, service name extracted via `^(\w+)Grpc$` on the object identifier. **Python (`python.ts`)** - Single call-expression query covers both bare identifier and `obj.method` attribute forms: `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`. - Plugin filters `@fn.text` against two JS regexes: `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$` (consumer), with a reserved-names ignore list for the Stub case (Mock / Test / Fake / Stub). **Node — JavaScript + TypeScript + TSX (`node.ts`)** - Pattern sources defined once, compiled three times (one per grammar) because `Parser.Query` objects are not portable across grammars. Exports three `GrpcLanguagePlugin`s sharing the same `scan`. - `@GrpcMethod('Service', 'Method')`: decorator query captures the two string literals. Confidence is hard-coded 0.8 regardless of proto map resolution (matches the original regex version's behaviour). - `@GrpcClient(...) field: XxxServiceClient`: decorator query captures the decorator node, plugin walks up to find the enclosing `public_field_definition` (decorators on fields are CHILDREN of the field definition in tree-sitter-typescript, not siblings) and reads its first `type_annotation → type_identifier`, then runs the `^(\w+Service)Client$` JS filter. - `client.getService<X>('AuthService')`: call-expression query on `member_expression.property = "getService"` + string literal arg. - `new XxxServiceClient(...)`: `new_expression` with a bare identifier constructor, filtered by `^(\w+Service)Client$` so generic `new AuthClient(...)` (missing the `Service` infix) does NOT falsely register as a consumer. Preserves the regression test `test_extract_ts_non_service_client_constructor_is_ignored`. - `loadPackageDefinition` dynamic loader: gated on `tree.rootNode.text.includes('loadPackageDefinition')`. When set, `new foo.bar.Xxx(...)` qualified constructors with a capitalised property name register as consumers. ## Orchestrator changes `grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders` / ... helper and replaces them with a single source-scan loop that: 1. Parses each file with the plugin's grammar (one shared `Parser` instance across all files, `setLanguage` called per plugin). 2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`. 3. Converts each detection to an `ExtractedContract` via the private `detectionToContract` helper, which: - Looks the short service name up in the proto map (filled by the `.proto` parser). - Picks confidence = `confidenceWithProto` if resolved, else `confidenceWithoutProto`. - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when the detection carries a `methodName` (TS `@GrpcMethod` only), otherwise a service-level id (`grpc::pkg.Svc/*`). Everything else — the `.proto` parser, `buildProtoContext`, `buildProtoMap`, `resolveProtoConflict`, `serviceContractId`, `stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe function — stays exactly as before. The `.proto` parser is kept as a pragmatic exception to the "no regex in extractors" rule because no `tree-sitter-proto` grammar is installed in the repo; a comment at the top of the file explains this and flags the maintainer option of adding `tree-sitter-proto` as a dependency. ## Why this is better than the regex version 1. **Comments and strings are respected for free.** Matched node types are only code constructs, never text inside comments or string literals. 2. **No false positives on partial names.** The old `(\w+?)Grpc`-style regexes would cross-match unrelated identifiers; structural queries restrict matches to the exact AST shape (`scoped_type_identifier → type_identifier` pairs, `method_invocation → identifier` etc.). 3. **NestJS `@GrpcClient` is structural, not regex-based.** The old regex required a specific textual layout (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the plugin now walks the AST, so modifier order / optional modifiers / multi-line formatting don't break it. 4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC detection later is a one-file edit in `grpc-patterns/index.ts` — no touches to the shared scanner, the orchestrator, or the proto parser. ## Tests - `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the contract shape is identical). Covers .proto parsing (including the brace-inside-string regression), Go provider/consumer, Java @GrpcService / plain ImplBase provider + newBlockingStub consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient + .getService + new XxxServiceClient + loadPackageDefinition + the `AuthClient` vs `AuthServiceClient` discrimination, dedupe across multiple patterns in one file, proto-aware confidence, and the inherited-package resolution for split proto definitions. - `topic-extractor.test.ts` — 30/30 pass. - `http-route-extractor.test.ts` — 18/18 pass. - `manifest-extractor.test.ts` — 8/8 pass. - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No pipeline.ts, MCP surface, ingestion, CI / release / security, or test changes. - New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`, `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already installed for the ingestion pipeline. ## End of phase series This commit completes the three-phase extractor refactor: - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/` - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/` - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/` Every remaining regex-based extractor helper under the `src/core/group/ extractors/` directory is either (a) language-agnostic string processing (path normalization, dedupe keys) or (b) the `.proto` parser, which is documented as an explicit exception. Co-authored-by: Claude <noreply@anthropic.com>
Addresses @magyargergo's suggestion on #796 to replace the manual string-sanitizing .proto parser with a tree-sitter grammar. - **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`. Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto) (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli 0.24` to produce ABI version 14 — compatible with the project's `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as `optionalDependency` with `file:./vendor/tree-sitter-proto`. - **New `grpc-patterns/proto.ts` plugin** — uses the same `compilePatterns` + `runCompiledPatterns` infrastructure as every other plugin. Two queries: - `(package (full_ident) @pkg)` — package declaration - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))` — one match per (service, rpc) pair - **Graceful fallback** — `tree-sitter-proto` is an optional dependency. If it fails to install (platform incompatibility) or fails the runtime smoke-test (`setLanguage` + `parse` on a trivial proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator uses the existing manual parser. The smoke-test catches the `SyntaxNode` TDZ error that occurs in vitest's fork-based test runner. - **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto` files are handled by the plugin loop (they're included in `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is skipped. `buildProtoContext` still runs to build the proto map for cross-referencing source-file detections. 1. **No manual comment/string stripping.** The old parser needed `stripProtoCommentsAndStrings` (110 lines) to avoid counting braces inside comments and string literals. tree-sitter handles this natively. 2. **No brace-depth tracking.** `extractServiceBlocks` used a manual depth counter to find service boundaries. tree-sitter's AST gives us `service` → `service_name` + `rpc` → `rpc_name` directly. 3. **Performance.** tree-sitter's C-based parser is faster than character-by-character JS scanning + regex on large proto files. - `grpc-extractor.test.ts` — **43/43 pass** (unchanged) - All other extractor tests — 99/99 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com>
Patterns should be relative to the .gitignore file's directory. https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU
c4e4165 to
935f3f5
Compare
|
@magyargergo, summary of changes on this branch after the review feedback: New: manifest-extractor for Refactored: all three extractors (topic, http, grpc) migrated from regex to tree-sitter with a plugin architecture — top-level orchestrators import zero grammars, each Added: No test changes (99/99 pass as-is), no MCP/pipeline/CI changes. |
There was a problem hiding this comment.
Pull request overview
This PR expands the group contract extraction layer by moving topic/http/grpc source scanning to tree-sitter-driven, language-plugin registries, adds a new manifest-based extractor for group.yaml cross-links, and vendors an optional tree-sitter-proto grammar to improve .proto parsing when available.
Changes:
- Introduce shared tree-sitter scanning utilities and plugin registries for topic/http/gRPC extractors.
- Add
ManifestExtractorto convert manifest links into provider/consumer contracts + cross-links with deterministic synthetic uids when unresolved. - Add optional
tree-sitter-proto(vendored) and extend gRPC extraction with improved.protoparsing/canonicalization and broader client/server detection.
Reviewed changes
Copilot reviewed 30 out of 42 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| gitnexus/vendor/tree-sitter-proto/src/tree_sitter/parser.h | Vendored tree-sitter C header for proto grammar build. |
| gitnexus/vendor/tree-sitter-proto/src/tree_sitter/array.h | Vendored tree-sitter array utilities. |
| gitnexus/vendor/tree-sitter-proto/src/tree_sitter/alloc.h | Vendored tree-sitter allocation utilities. |
| gitnexus/vendor/tree-sitter-proto/package.json | Vendored tree-sitter-proto package metadata for local optional dependency. |
| gitnexus/vendor/tree-sitter-proto/bindings/node/index.js | Node binding entrypoint for the vendored proto grammar. |
| gitnexus/vendor/tree-sitter-proto/bindings/node/index.d.ts | Type definitions for the vendored proto grammar module. |
| gitnexus/vendor/tree-sitter-proto/bindings/node/binding.cc | Native addon binding that exposes the proto grammar to tree-sitter. |
| gitnexus/vendor/tree-sitter-proto/binding.gyp | node-gyp build config for compiling the proto grammar addon. |
| gitnexus/vendor/tree-sitter-proto/.gitignore | Ignores build artifacts for vendored grammar. |
| gitnexus/test/unit/group/topic-extractor.test.ts | Adds/updates topic extraction tests for KafkaJS run, JetStream, Python NATS, Sarama producer literals, kafka-go, and Go test-file ignore. |
| gitnexus/test/unit/group/manifest-extractor.test.ts | New unit tests for manifest extraction, exact-match resolution, and synthetic uid fallback behavior. |
| gitnexus/test/unit/group/http-route-extractor.test.ts | Expands HTTP route extraction tests (Go frameworks/stdlib, NestJS, Python requests, Java consumers, Go consumers) and removes old interface-anchoring cases. |
| gitnexus/test/unit/group/grpc-extractor.test.ts | Expands gRPC extraction tests (proto brace sanitization, TS/Node patterns, proto map conflict resolution) and adjusts confidence expectations. |
| gitnexus/src/core/group/extractors/tree-sitter-scanner.ts | New shared utilities to compile/run tree-sitter queries and unquote literals for multiple languages. |
| gitnexus/src/core/group/extractors/topic-patterns/types.ts | New shared types for topic plugin metadata. |
| gitnexus/src/core/group/extractors/topic-patterns/python.ts | Python topic patterns (Kafka, RabbitMQ, NATS) via tree-sitter. |
| gitnexus/src/core/group/extractors/topic-patterns/node.ts | JS/TS/TSX topic patterns (KafkaJS, amqplib, NATS/JetStream) via tree-sitter. |
| gitnexus/src/core/group/extractors/topic-patterns/java.ts | Java topic patterns (Spring Kafka/Rabbit) via tree-sitter. |
| gitnexus/src/core/group/extractors/topic-patterns/index.ts | Plugin registry and glob for topic extraction. |
| gitnexus/src/core/group/extractors/topic-patterns/go.ts | Go topic patterns (Sarama, kafka-go, NATS/JetStream) via tree-sitter. |
| gitnexus/src/core/group/extractors/topic-extractor.ts | Refactors topic extractor to tree-sitter plugin orchestration and literal extraction. |
| gitnexus/src/core/group/extractors/manifest-extractor.ts | New manifest extractor that resolves manifest links against per-repo graphs with exact matches and synthetic uid fallback. |
| gitnexus/src/core/group/extractors/http-route-extractor.ts | Refactors HTTP extractor to use graph-assisted extraction with tree-sitter fallback plugins for providers/consumers. |
| gitnexus/src/core/group/extractors/http-patterns/types.ts | New shared types for HTTP plugin detections and plugin interface. |
| gitnexus/src/core/group/extractors/http-patterns/python.ts | Python HTTP plugin (FastAPI providers, requests consumers). |
| gitnexus/src/core/group/extractors/http-patterns/php.ts | PHP HTTP plugin (Laravel route declarations). |
| gitnexus/src/core/group/extractors/http-patterns/node.ts | Node/TS HTTP plugin (NestJS/Express providers, fetch/axios consumers). |
| gitnexus/src/core/group/extractors/http-patterns/java.ts | Java HTTP plugin (Spring providers, RestTemplate/WebClient/OkHttp consumers). |
| gitnexus/src/core/group/extractors/http-patterns/index.ts | HTTP plugin registry and glob for source-scan fallback. |
| gitnexus/src/core/group/extractors/http-patterns/go.ts | Go HTTP plugin (gin/echo/chi + stdlib providers, stdlib/resty consumers). |
| gitnexus/src/core/group/extractors/grpc-patterns/types.ts | New shared types for gRPC plugin detections and plugin interface. |
| gitnexus/src/core/group/extractors/grpc-patterns/python.ts | Python gRPC plugin (servicer registration + stub construction). |
| gitnexus/src/core/group/extractors/grpc-patterns/proto.ts | Optional tree-sitter-proto plugin for extracting package/service/rpc pairs from .proto. |
| gitnexus/src/core/group/extractors/grpc-patterns/node.ts | Node/TS gRPC plugin (Nest decorators, getService, generated clients, dynamic loader pattern). |
| gitnexus/src/core/group/extractors/grpc-patterns/java.ts | Java gRPC plugin (ImplBase providers + stub factory consumers). |
| gitnexus/src/core/group/extractors/grpc-patterns/index.ts | gRPC plugin registry and glob (optionally includes .proto). |
| gitnexus/src/core/group/extractors/grpc-patterns/go.ts | Go gRPC plugin (Register/Unimplemented server + NewClient patterns). |
| gitnexus/src/core/group/extractors/grpc-extractor.ts | Refactors gRPC extractor to plugin orchestration + improved proto parsing/canonicalization and conflict resolution. |
| gitnexus/package.json | Adds optional dependency on the vendored tree-sitter-proto. |
| gitnexus/package-lock.json | Locks the local-link optional dependency for tree-sitter-proto. |
Files not reviewed (1)
- gitnexus/package-lock.json: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -31,20 +66,36 @@ export function normalizeHttpPath(p: string): string { | |||
| return s; | |||
There was a problem hiding this comment.
normalizeHttpPath turns the root path '/' into an empty string because it strips all trailing slashes. This can produce malformed contract IDs like http::GET:: (missing path) for root handlers/fetches. Consider special-casing empty result to '/' (or avoid stripping when the path is exactly /).
| const graphProviders = | ||
| dbExecutor != null ? await this.extractProvidersGraph(dbExecutor, getDetections) : []; | ||
| const providers = | ||
| graphProviders.length > 0 | ||
| ? graphProviders | ||
| : this.extractProvidersSourceScan(await this.scanFiles(repoPath), getDetections); | ||
|
|
||
| const graphC = dbExecutor != null ? await this.extractConsumersGraph(dbExecutor, repoPath) : []; | ||
| const consumers = graphC.length > 0 ? graphC : await this.extractConsumersSourceScan(repoPath); | ||
| const graphConsumers = | ||
| dbExecutor != null ? await this.extractConsumersGraph(dbExecutor, getDetections) : []; | ||
| const consumers = | ||
| graphConsumers.length > 0 | ||
| ? graphConsumers | ||
| : this.extractConsumersSourceScan(await this.scanFiles(repoPath), getDetections); | ||
|
|
There was a problem hiding this comment.
extract() calls scanFiles(repoPath) twice (once for providers fallback and once for consumers fallback). This duplicates the glob work and file list allocation. Consider calling scanFiles once, storing the result, and reusing it for both provider and consumer fallback paths.
| private async scanFiles(repoPath: string): Promise<string[]> { | ||
| return glob(HTTP_SCAN_GLOB, { | ||
| cwd: repoPath, | ||
| ignore: ['**/node_modules/**', '**/.git/**', '**/dist/**', '**/build/**'], |
There was a problem hiding this comment.
scanFiles doesn't exclude **/vendor/** even though other extractors (e.g. topic/grpc) ignore vendored code. Scanning vendored sources can add noise and cost. Consider adding **/vendor/** to the ignore list for consistency and to avoid false positives.
| ignore: ['**/node_modules/**', '**/.git/**', '**/dist/**', '**/build/**'], | |
| ignore: ['**/node_modules/**', '**/.git/**', '**/dist/**', '**/build/**', '**/vendor/**'], |
| // ─── Consumer: loadPackageDefinition dynamic proto loader ──────── | ||
| // Only emit when the file uses loadPackageDefinition, otherwise a | ||
| // generic `new foo.bar.Something()` in unrelated code would falsely | ||
| // register as a gRPC consumer. | ||
| const usesLoadPackage = tree.rootNode.text.includes('loadPackageDefinition'); | ||
| if (usesLoadPackage) { | ||
| for (const match of runCompiledPatterns(bundle.newQualifiedCtor, tree)) { | ||
| const ctorNode = match.captures.ctor; |
There was a problem hiding this comment.
Using tree.rootNode.text.includes('loadPackageDefinition') forces materializing the entire file contents from the parse tree, which can be expensive for large files. Prefer checking the original source text before parsing, or add a dedicated query/capture for loadPackageDefinition so the scan stays structural without allocating the full text.
| // If resolveSymbol returns null, the extractor creates a contract with | ||
| // an empty symbolUid/ref — cross-impact still works via name-based | ||
| // matching through the `hint` path in runGroupImpact. |
There was a problem hiding this comment.
This comment is now inaccurate: when resolveSymbol returns null, the extractor does not create an empty symbolUid — it falls back to a deterministic manifest::... synthetic uid (and a non-empty symbolRef.name). Please update the comment to match the current behavior so future changes/tests aren't misled.
| // If resolveSymbol returns null, the extractor creates a contract with | |
| // an empty symbolUid/ref — cross-impact still works via name-based | |
| // matching through the `hint` path in runGroupImpact. | |
| // If resolveSymbol returns null, the extractor falls back to a | |
| // deterministic synthetic `manifest::...` symbol uid and preserves a | |
| // non-empty symbolRef.name derived from the manifest contract. |
| * 1. **`.proto` parsing** — done in-process by a small string-sanitizing | ||
| * parser (see `stripProtoCommentsAndStrings` + `extractServiceBlocks` | ||
| * below). NOT tree-sitter-based because no `tree-sitter-proto` | ||
| * grammar is installed in the repo. The parser preserves offsets so | ||
| * downstream regex scans run against a sanitized copy without | ||
| * affecting the line numbers of the original. Kept as a pragmatic | ||
| * exception to the "no regex in extractors" rule until / unless | ||
| * maintainers want to add a proto grammar. | ||
| * | ||
| * 2. **Source-scan providers / consumers** — delegated to per-language | ||
| * plugins in `./grpc-patterns/`. The orchestrator imports NO | ||
| * tree-sitter grammars or query strings — each plugin owns its own. | ||
| */ | ||
|
|
||
| // ─── .proto parsing (not tree-sitter) ──────────────────────────────── |
There was a problem hiding this comment.
The header comment says .proto parsing is not tree-sitter-based because no tree-sitter-proto grammar is installed, but this PR adds tree-sitter-proto as an optional dependency and registers a .proto plugin when available. Please update this docstring to reflect the new behavior (tree-sitter when available, manual fallback otherwise).
| * 1. **`.proto` parsing** — done in-process by a small string-sanitizing | |
| * parser (see `stripProtoCommentsAndStrings` + `extractServiceBlocks` | |
| * below). NOT tree-sitter-based because no `tree-sitter-proto` | |
| * grammar is installed in the repo. The parser preserves offsets so | |
| * downstream regex scans run against a sanitized copy without | |
| * affecting the line numbers of the original. Kept as a pragmatic | |
| * exception to the "no regex in extractors" rule until / unless | |
| * maintainers want to add a proto grammar. | |
| * | |
| * 2. **Source-scan providers / consumers** — delegated to per-language | |
| * plugins in `./grpc-patterns/`. The orchestrator imports NO | |
| * tree-sitter grammars or query strings — each plugin owns its own. | |
| */ | |
| // ─── .proto parsing (not tree-sitter) ──────────────────────────────── | |
| * 1. **`.proto` parsing** — uses tree-sitter when the optional | |
| * `tree-sitter-proto` plugin is available. Otherwise, it falls back to | |
| * an in-process string-sanitizing parser (see | |
| * `stripProtoCommentsAndStrings` + `extractServiceBlocks` below). | |
| * The fallback parser preserves offsets so downstream regex scans run | |
| * against a sanitized copy without affecting the line numbers of the | |
| * original. | |
| * | |
| * 2. **Source-scan providers / consumers** — delegated to per-language | |
| * plugins in `./grpc-patterns/`. The orchestrator imports NO | |
| * tree-sitter grammars or query strings — each plugin owns its own. | |
| */ | |
| // ─── .proto parsing fallback ───────────────────────────────────────── |
| const protoContext = await buildProtoContext(repoPath); | ||
| const protoMap = protoContext.servicesByName; | ||
|
|
||
| // ─── Proto files — definitive provider source ───────────────── | ||
| // When tree-sitter-proto is available, .proto files are handled by | ||
| // the plugin loop below (they're in GRPC_SCAN_GLOB). Otherwise | ||
| // fall back to the manual string-sanitizing parser. | ||
| if (!hasProtoPlugin) { | ||
| const protoFiles = await glob('**/*.proto', { | ||
| cwd: repoPath, | ||
| ignore: ['**/node_modules/**', '**/.git/**', '**/vendor/**'], | ||
| nodir: true, | ||
| }); | ||
| for (const rel of protoFiles) { | ||
| const content = readSafe(repoPath, rel); | ||
| if (content) { | ||
| out.push( | ||
| ...this.parseProtoFile( | ||
| content, | ||
| rel, | ||
| protoContext.packagesByProto.get(normalizeProtoPath(rel)) ?? '', | ||
| ), | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
buildProtoContext(repoPath) already globs and parses all .proto files to build the proto map; on the !hasProtoPlugin path, the extractor globs/reads/parses the .proto files again to emit provider contracts. This doubles IO/CPU on platforms where the proto plugin isn't available. Consider reusing the proto file contents/service blocks from buildProtoContext (or have it also return the provider contracts) to avoid the second pass.
|
@claude ultra-think while reviewing this PR. This needs to be review from its architectural fit and it doesnt violate cross boundaries. We want per-language for this multi-repo support via http/grpc etc calls. Read previous comments and include them in your evaluation. We are also looking for potential bugs and performance bottlenecks. Finally, we want to follow clean code principles and highlight any violations. |
|
Claude finished @magyargergo's task in 3m 34s —— View job PR Review: feat(group): extractor expansion + manifest extractor
Overall VerdictThe architectural direction is correct and the plugin boundary enforcement is solid. However there are 3 bugs that will produce silent wrong results in production, 2 performance issues that Copilot already flagged but need emphasis, and 1 architectural seam that will cause confusion as the codebase grows. Details below. Architectural Fit ✅ with one gapWhat works well: The orchestrator/plugin boundary is properly enforced. All three orchestrators ( One structural gap —
Bugs 🐛Bug 1 — let s = p.trim().split('?')[0].toLowerCase().replace(/\/+$/, '');For input Bug 2 — Tautological let handlerName: string | null = null;
if (!method || !handlerName) { // !handlerName is ALWAYS true here
Bug 3 — Manifest HTTP contracts produce case 'http': {
if (/^[A-Za-z]+::/.test(contract)) return `http::${contract}`;
return `http::*::${contract}`; // method is unknown → wildcard
}Provider contracts from Bug 4 — Cypher queries in MATCH (n) WHERE n.name = $contract -- matches ANY node typeA topic named Performance Bottlenecks ⚡Double glob + double parse of
Missing Clean Code Violations1. 2. 3. 4. 5. Stale comment at 6. Minor Notes
Priority Summary
Items 1–3 should be fixed before merge. Items 4–6 are the most actionable performance wins before |
Six fixes suggested by the Copilot AI review:
1. **`normalizeHttpPath` root-path edge case** — stripping trailing
slashes on the input `/` produced an empty string, yielding
malformed contract ids like `http::GET::`. Now preserves `/` for
the root handler/fetch case.
2. **Dedupe `scanFiles` call** — `extract()` was globbing the
source-scan file list twice (once for the provider fallback, once
for the consumer fallback). Moved to a single lazy call that
memoizes the result for the rest of the method.
3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other
extractor's glob already ignored vendored sources; the HTTP one
didn't. Fixed for consistency.
4. **`loadPackageDefinition` check is now structural** — was calling
`tree.rootNode.text.includes('loadPackageDefinition')` which forces
materialization of the entire file text from the parse tree
(expensive on large files). Replaced with a dedicated compiled
query on `(call_expression function: [(identifier) | (member_expression)])`
so the check stays in the AST domain.
5. **`grpc-extractor.ts` header docstring updated** — still claimed
".proto parsing is not tree-sitter-based because no grammar is
installed". Now describes the actual behaviour: tree-sitter when
`tree-sitter-proto` is available (optionalDependency), manual
fallback otherwise.
6. **Eliminated the double proto file parse on the fallback path** —
`buildProtoContext` already globs + parses every `.proto` file to
build `servicesByName`. On the `!hasProtoPlugin` branch the
extractor was globbing + parsing again via the now-removed
`parseProtoFile` helper. The fallback branch now iterates the map
that `buildProtoContext` already produced to emit provider
contracts directly — single pass per proto file.
## Tests
- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean
Co-authored-by: Claude <noreply@anthropic.com>
…ne) on #796 Follows up `2f28bfc` with the remaining items from the Claude AI review: ## Bugs **Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.** The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x` with no label filter, so a topic/service/package name could silently match any node type (File, Variable, Import, Folder, …). Added label filters: - `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort symbol-name matches against listener/publisher symbols) - `grpc` method → `(n:Function|Method)` - `grpc` service → `(n:Class|Interface)` - `lib` → `(n:Package|Module)` All 8 manifest-extractor tests still pass (mock executor is label-agnostic, but the production LadybugDB graph now gets correctly scoped queries). **Bug 8 — Tautological `!handlerName` condition.** `http-route-extractor.ts:extractProvidersGraph` had `let handlerName = null; if (!method || !handlerName) { ... }` — the `!handlerName` clause was always true since there was no intervening assignment. Simplified to always run the plugin-scan lookup (we need the handler name even when `methodFromRouteReason` already resolved the method). ## Clean code / dedup **Design 7 — `readSafe` was copy-pasted in all three orchestrators.** Extracted to `extractors/fs-utils.ts` as the single source of truth for the path-traversal guard. Dropped the three local copies and the now-unused `fs`/`path` imports from topic-extractor. **Style 10 — Language-specific `_test.go` skip in the topic orchestrator.** Was `if (rel.endsWith('_test.go')) continue;` inside the language- agnostic extraction loop. Pushed into the glob's ignore list (`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`, `dist`, `build` entries, with a comment explaining that other languages' test file conventions either live in separate directories (Python `tests/`, Java `src/test/`) or are already covered by the existing ignores. ## Already addressed in `2f28bfc` (mentioned again in Claude review) - Bug 3: `normalizeHttpPath('/')` returns `''` — fixed - Bug 4: double glob + double parse of `.proto` — fixed - Bug 5: `scanFiles` called twice in HTTP — fixed - Bug 6: missing `**/vendor/**` in HTTP glob — fixed - Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')` replaced with a dedicated structural query ## Deferred - Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope; sync.ts matching logic lands in #793, manifest extractor already emits correct synthetic uids for unresolved HTTP contracts. - Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) — the only real use case (`loadPackageDefinition` gate) is already fixed via a structural query, so the interface change would be cosmetic churn without a concrete consumer. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com>
|
@claude are we ready for production? |
|
Claude finished @magyargergo's task in 1m 55s —— View job Production Readiness Assessment
Verdict: Almost — one correctness bug blocks a clean mergeThe branch has made substantial progress since the last two reviews. The tree-sitter plugin architecture is solid, nearly all previously reported bugs are fixed, and all performance issues are resolved. One correctness bug remains that will silently cause all HTTP manifest links to fall through to the synthetic uid path. What Was Fixed ✅
Remaining Issue 🔴
// manifest-extractor.ts:225-226
if (/^[A-Za-z]+::/.test(contract)) return `http::${contract}`;
return `http::*::${contract}`; // ← wildcard method
The fix is a design decision for Minor Outstanding Items 🟡1. Stale comment at 2. 3. SummaryThe PR is well-structured and the architecture is sound. The tree-sitter plugin boundary is properly enforced — orchestrators import zero grammars, each language plugin owns its own. Performance and security issues from prior reviews are all addressed. The one blocker before merge: resolve the |
|
@ivkond could you please include a flow chart of the analysis pipeline? 🙏 |
… flow chart ## Fixes **Remaining 🔴 — HTTP contract id wildcard format.** Documented the `http::*::<path>` format as an intentional wildcard for manifest links that omit the HTTP method, alongside the explicit-method form (`GET::/path` → `http::GET::/path`). The docblock on `buildContractId` now states both forms, notes that wildcard-aware matching is the responsibility of the sync / cross-impact layer (#793), and recommends the explicit-method form whenever the author knows the method (it round-trips through exact equality without needing wildcard logic downstream). Tests unchanged — the wildcard format is what they've always asserted. **Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The comment claimed "creates a contract with an empty symbolUid/ref" but the code switched to `manifestSymbolUid(repo, contractId)` a few commits back. Updated to describe the actual synthetic-uid fallback semantics and the cross-impact path that relies on both sides of the join deriving the same uid. **Minor 2 — exhaustiveness guard on `buildContractId`.** The `switch(type)` covered all five current `ContractType` variants but silently returned `undefined` if a new variant was added. Added a `default: const _exhaustive: never = type; throw new Error(...)` clause so the build fails loudly on an unhandled variant. **Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already fixed in `2f28bfc` via a dedicated structural query (`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed. ## New: pipeline flow chart (per @magyargergo's request) Added `src/core/group/PIPELINE.md` with four Mermaid diagrams: 1. **High-level overview** — `group.yaml` → extractors + manifest → contract matching → `bridge.lbug` → `runGroupImpact`. 2. **Per-repo extractor two-strategy shape** — graph-assisted Strategy A vs. source-scan Strategy B. 3. **Plugin architecture** — orchestrator → registry → per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` → `ExtractedContract`. 4. **Manifest extraction** — label-scoped `resolveSymbol` with the synthetic-uid fallback. 5. **Cross-impact query (#606)** — local impact → bridge join → cross-repo fan-out. Each diagram is annotated with which PRs own which stage (this PR: extractors + manifest; #795: bridge storage; #606: cross-impact runtime) and points at the concrete files/functions involved. ## Tests - 99/99 extractor tests pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com>
|
Nice and impressive work @ivkond ! Thank you for your contribution! 🙏 |
Thank you, @magyargergo! 🙏 Really appreciate the thorough reviews — they pushed the plugin architecture into much better shape than my initial pass. If nothing else is blocking on your side, can we move on to step 3 (#792)? |
|
@ivkond could you also include a follow up PR sorting out the bundled tree-sitter in the release process? |
|
Claude finished @magyargergo's task in 5m 47s —— View job Scope audit: PR #796 vs issue #793
FindingsWhat is in this PR (confirmed against file SHAs)
The cross-boundary issue:
|
…npatwari#606 split) (abhigyanpatwari#796) * feat(group): extractor expansion + manifest extractor Part 2 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#792). Follows abhigyanpatwari#795 (bridge.lbug storage foundation, already merged), but this PR has no code-level dependency on abhigyanpatwari#795 — it only imports types and the ContractExtractor interface that existed on upstream main before either PR. It could have been reviewed in parallel with abhigyanpatwari#795. ## What changed Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups. ### New file (228 LOC) - `gitnexus/src/core/group/extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml`-declared cross-links. HTTP paths are canonicalized before Route.name matching; gRPC is resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::<repo>::<contractId>` uid when the graph has no matching symbol, so cross-impact traversal still has a stable anchor for the contract. ### Modified extractors (+958 LOC prod) - `extractors/grpc-extractor.ts` (+522) — `.proto` parser with comment and string-literal sanitization (braces inside strings no longer truncate service bodies); package/service/method canonical IDs; server/client detection across Go (`grpc.NewServer`, `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`, `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`). - `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI decorators, Java Spring `@RequestMapping`/`@GetMapping`, restTemplate / WebClient / OkHttp consumers. - `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection (replaces a constructor-anchored regex that missed topics inside producer loops), kafka-go Writer/Reader, Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream helpers. ### Modified and new tests (+1264 LOC) - `grpc-extractor.test.ts` (+539) — full coverage of the new proto parser (strings-with-braces regression, comments-with-braces regression), per-language server/client detection - `http-route-extractor.test.ts` (+240) — per-framework route extraction + normalization edge cases - `topic-extractor.test.ts` (+177) — the sarama in-loop regression, JetStream, Python NATS, kafka-go Writer/Reader - `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization, gRPC exact lookup with proto-fallback regression, lib and topic exact matching, synthetic-uid fallback behavior ### Self-review fixes folded in Carried forward from the abhigyanpatwari#606 self-review (commit `d15b8cb`): - **HIGH abhigyanpatwari#1** — `manifest-extractor.resolveSymbol` was too fuzzy. Previously used `CONTAINS` on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences: `/orders` matched `/suborders`, and any repo with any `.proto` file returned a random proto symbol for a gRPC manifest entry. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback for unresolved manifests. Regression tests included. - **MED abhigyanpatwari#3** — gRPC proto parser brace-depth counting now sanitizes strings and comments first (`stripProtoCommentsAndStrings`). A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal, silently dropping methods after the offending string. Regression tests for both string-with-brace and comment-with-brace cases. - **MED abhigyanpatwari#4** — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal directly). Regression test with a for-loop that constructs multiple `ProducerMessage`s. - **MED abhigyanpatwari#7** — `manifest-extractor.resolveSymbol` no longer has a silent `catch { /* fall through */ }`. Errors from the graph executor are logged via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path. ## Why Reviewer focus here is pure regex / parser correctness — no storage, no Cypher queries, no algorithmic changes to the cross-link algorithm. Separating this from the bridge foundation PR (abhigyanpatwari#795) meant reviewers could stay in a single mental mode (parsing logic) instead of context-switching between DDL, Cypher, and regex. ## How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks` Local pre-push: typecheck clean, all 99 extractor unit tests pass (grpc 43, http 18, topic 30, manifest 8). ## Risk / rollback **Low.** Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` that is consumed by `sync.ts` in the next split (abhigyanpatwari#793). No existing behavior changes for users who don't run a `group sync`. Rollback = `git revert` of the merge commit; the modifications to `grpc-extractor.ts` / `http-route-extractor.ts` / `topic-extractor.ts` revert to the pre-PR versions that still work (they're subsets of the new functionality). ## Scope discipline (per GUARDRAILS.md) - Only the 8 files above are touched; no drive-by refactors - No CI/release/security config changes - No secrets or machine-specific paths - Content lifted from abhigyanpatwari#606 (CI 11/11 green on `d15b8cb`) ## Dependencies - **Base:** `main` (upstream already includes abhigyanpatwari#795 as `f6fb87f`) - **Blocks:** sync pipeline (abhigyanpatwari#793) and the cross-impact feature (abhigyanpatwari#794) - **Tracker issue:** abhigyanpatwari#792 - **Parent PR:** abhigyanpatwari#606 Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate topic-extractor from regex to tree-sitter queries Addresses @magyargergo's feedback on abhigyanpatwari#796 that regex-based lookups should use tree-sitter nodes instead, and that the top-level extractors must NOT carry language dependencies. This is phase 1 of a multi-step migration — topic-extractor first because its patterns are the most uniform (16 "call/annotation with first-arg string literal" variants), which makes it a clean proof of the approach before grpc-extractor and http-route-extractor get the same treatment. ## Architecture: language-agnostic orchestrator + per-language plugins The top-level extractor is a thin orchestrator that never imports a tree-sitter grammar or a query string. Per-language knowledge lives in a new `topic-patterns/` folder with one file per language plus a registry that maps file extensions to compiled plugins: ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic scanning utilities ├── topic-extractor.ts # thin orchestrator (no grammar imports) └── topic-patterns/ ├── types.ts # TopicMeta, Broker ├── index.ts # registry: extension → compiled provider ├── java.ts # tree-sitter-java + JAVA_TOPIC_PROVIDER ├── go.ts # tree-sitter-go + GO_TOPIC_PROVIDER ├── python.ts # tree-sitter-python + PYTHON_TOPIC_PROVIDER └── node.ts # tree-sitter-javascript + tree-sitter-typescript # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER ``` **Shared scanner (`tree-sitter-scanner.ts`)** — defines `PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>` and the `scanFile(parser, plugin, content)` helper. Plugins compile their queries eagerly at module load via `compilePatterns()`, so a broken pattern fails loudly at import time instead of silently at scan time. `unquoteLiteral()` handles single/double/template quotes, Python triple-quoted strings, and Go raw backtick strings. **Per-language plugins** own: - the tree-sitter grammar import (this is the ONLY place in `src/core/group/` where tree-sitter grammars are imported), - the query S-expressions, - the `TopicMeta` payload (role, broker, confidence, symbolName) that the orchestrator receives back on every match. Each plugin uses a `@value` capture name to bind the topic literal node. The JavaScript and TypeScript grammars share AST node names for every construct we query, so `node.ts` defines the pattern sources once and compiles them against `JavaScript`, `TypeScript.typescript`, and `TypeScript.tsx` — exporting three providers because `Parser.Query` objects are NOT portable across grammar instances. **Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider, `.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX. Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single file-level edit (drop `topic-patterns/<lang>.ts`, import + register it here — zero edits required in `topic-extractor.ts`). **Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser, provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`. Reuses one `Parser` instance across files; the scanner calls `setLanguage` per plugin. ## Why this is better than regex 1. **Comments and strings are respected for free.** The old regex would match `// kafkaTemplate.send("fake.topic")` as a real producer; tree-sitter never visits comments or string literals as code nodes, so false positives from commented-out code are eliminated. 2. **Struct/object literal patterns are structural, not textual.** `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char lookahead (which was a known cross-match bug partly mitigated by a loop regression test in the self-review). The new query matches a specific `composite_literal` with a specific `qualified_type` and `keyed_element` — exactly one struct literal per match. 3. **No order-of-operations fragility.** Regex for `channel.publish` vs `channel.consume` was independent and file-wide; the AST scopes matches to the specific `call_expression`. 4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic detection later means dropping one file in `topic-patterns/` — no changes to shared scanner or orchestrator, and no tree-sitter imports leak into top-level code. ## Per-file fault tolerance - Malformed files that tree-sitter can't parse are silently skipped (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline already logs unparseable files at index time. - A syntactically invalid query is caught at `compilePatterns` time, not scan time — broken plugins fail loudly at import. - Per-pattern `matches()` failures are swallowed so one broken query in a plugin doesn't block the rest. ## Tests All 30 existing `topic-extractor.test.ts` tests pass **without any changes to the test file** — they were written as input/output contract tests (given this source file, expect these `ExtractedContract` objects) and that contract is unchanged. Regression coverage includes: - Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node `producer.send` + `consumer.subscribe`; Go sarama producer/consumer (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` + `producer.send/produce` - RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`; Node `channel.consume/publish/sendToQueue`; Python `basic_consume/ basic_publish` with keyword args - NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream `js.Subscribe/Publish`; Python `await nc.subscribe/publish` Including the regression test for the sarama `ProducerMessage` in-loop case — the AST-based query captures every literal in the file independently, not just the first one after `NewSyncProducer`. ## Neighbor regression check - `topic-extractor.test.ts` — 30/30 pass (rewritten extractor) - `http-route-extractor.test.ts` — 18/18 pass (untouched) - `grpc-extractor.test.ts` — 43/43 pass (untouched) - `manifest-extractor.test.ts` — 8/8 pass (untouched) - Full `npx tsc --noEmit` clean ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched; no changes to other extractors, tests, MCP surface, or pipeline.ts. - No CI/release/security config changes, no secrets. - New tree-sitter imports all reference grammars that are already installed as dependencies (`tree-sitter`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`, `tree-sitter-go` — all in `package.json` for the existing pipeline). ## Phase 2 / phase 3 plan - **Phase 2 (next commit):** rewrite `http-route-extractor.ts` Strategy B (regex fallback) on the same plugin pattern. Graph-assisted Strategy A stays as-is (already uses pipeline-built tree-sitter data via `HANDLES_ROUTE` Cypher queries). - **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java / Go / Python / TypeScript detection. `.proto` files are the one outstanding question — there is no `tree-sitter-proto` grammar installed; the in-tree string-sanitizing parser stays as a pragmatic exception with a comment, alternative being to add `tree-sitter-proto` as a dep (open for the maintainer). Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins Phase 2 of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as the phase 1 topic-extractor rewrite: a thin, language-agnostic orchestrator plus per-language plugins that own tree-sitter grammars and query sources. The top-level extractor file no longer imports any tree-sitter grammar or query string. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic primitives ├── http-route-extractor.ts # thin orchestrator (no grammar imports) └── http-patterns/ ├── types.ts # HttpDetection, HttpLanguagePlugin, HttpRole ├── index.ts # registry: ext → plugin + HTTP_SCAN_GLOB ├── java.ts # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp ├── go.ts # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers ├── python.ts # tree-sitter-python: FastAPI + requests ├── php.ts # tree-sitter-php: Laravel Route::get/... └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # NestJS controllers, Express, fetch, axios ``` **Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1: - `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named capture the query binds, not just a single `@value`). Topic extractor updated to read `match.captures.value` accordingly. - New `runCompiledPatterns(plugin, tree)` helper lets plugins run multiple query bundles against the same pre-parsed tree. This is needed for HTTP plugins that combine a class-prefix query with a method-route query (Spring, NestJS). - `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`. **HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a `scan(tree)` function rather than a flat pattern list. This reflects HTTP's more complex extraction: each detection needs method + path + handler name, and framework patterns like Spring `@RequestMapping` / NestJS `@Controller` require cross-referencing a class-level prefix with method-level annotations. Plugins internally use `compilePatterns` + `runCompiledPatterns` and walk the AST to resolve the class/method relationships. **Per-framework coverage:** - **Java (`java.ts`)** - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put| Delete|Patch)Mapping("/sub")` method routes, joined via the enclosing `class_declaration` node id. - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` → method derived from API name. - `WebClient.method(HttpMethod.X, "/path")` → method from `HttpMethod.X` capture. - `new Request.Builder().url("/path")` → OkHttp consumer. - **Go (`go.ts`)** - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures upper-case verb + handler identifier. - `net/http.HandleFunc("/path", handler)` → provider (default GET). - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`, resty `client.R().Get/Post/...`. - **Python (`python.ts`)** - `@app.get("/path")` FastAPI decorators. - `requests.get/post/...` and `requests.request("METHOD", "url")`. - **PHP (`php.ts`)** - Laravel `Route::get/post/.../patch('/path', ...)` via `scoped_call_expression`. Uses `PHP.php_only` to match the existing ingestion pipeline's grammar selection. - **Node (`node.ts`) — JS + TS + TSX** - Pattern sources defined once, compiled against three grammar variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`) because `Parser.Query` objects are not portable across grammars. Exports three plugins sharing the same `scan` logic. - NestJS: `@Controller('prefix')` decorators are siblings of the class in `export_statement` / `program`; `@Get(':id')` decorators are siblings of the method in `class_body`. The plugin walks decorator → next named sibling to find the decorated class / method, then combines the class prefix with the method path. Only emits NestJS detections when the enclosing class has a real `@Controller` decorator — prevents false positives from generic classes that happen to use `@Get` from another library. - Express: `(router|app).<verb>('/path', ...)`. - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })` (uses two queries + a SyntaxNode-id dedupe set so URL literals aren't double-emitted by the options variant). - `axios.get/post/...`. ## Orchestrator changes `http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers` regex method and replaces them with a single source-scan loop that delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator still owns: - **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`) — language-agnostic string processing shared by both strategies. - **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS` Cypher queries) — unchanged in spirit. The only regex helpers it used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now replaced by a lookup against the plugin's detections for the same file: for each route row, find the detection whose normalized path matches, and pull the HTTP method + handler name from it. - **Per-file parse cache** — the orchestrator parses each relevant file at most once per `extract()` call. Both the graph-assisted enrichment loop and the source-scan fallback share the same `cachedDetections` map, so we never run the plugin twice for the same file. ## Why this is better than the regex version 1. **Comments and strings for free.** The old regex would match `// router.get('/fake')` as a real Express route; tree-sitter never visits string/comment nodes. 2. **Structural controller-prefix.** Spring and NestJS class-prefix joining is now scoped to the enclosing class via `class_declaration` node ids, eliminating file-wide state that broke when a file had multiple controllers. 3. **Precise NestJS disambiguation.** The plugin only emits a NestJS detection when the enclosing class has a real `@Controller` decorator — the old regex would fire on any `@Get(...)` in the file regardless of surrounding context. 4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP detection later means dropping one file in `http-patterns/` — no changes to the shared scanner, the orchestrator, or the Strategy A Cypher queries. ## Tests - `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged; they're contract-style input/output tests and the contract shape is unchanged). Covers Spring class prefix, Express, gin/echo, stdlib HandleFunc, NestJS, Laravel, FastAPI for providers and fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/ resty for consumers, plus graph-first Strategy A for both. - `topic-extractor.test.ts` — **30/30 pass** after the `captures.value` API migration. - `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3). - `manifest-extractor.test.ts` — 8/8 pass (untouched). - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No changes to pipeline.ts, MCP surface, ingestion, or tests. - No CI / release / security / secrets changes. - Tree-sitter grammars imported by plugins (`tree-sitter-java`, `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already in `package.json` for the existing ingestion pipeline. ## Phase 3 plan - **grpc-extractor** gets the same treatment: plugin-per-language under `grpc-patterns/` for Java / Go / Python / TS detection. `.proto` files remain an open question — no `tree-sitter-proto` grammar is installed, so the in-tree string-sanitizing parser from PR abhigyanpatwari#796's self-review stays as a pragmatic exception unless the maintainer wants us to add `tree-sitter-proto` as a new dep. Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate grpc-extractor source scans to tree-sitter plugins Phase 3 (final) of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as phase 1 (topic) and phase 2 (http): thin language-agnostic orchestrator + per-language plugins that own tree-sitter grammars and query sources. With this commit the top-level extractors under `src/core/group/extractors/` import ZERO tree-sitter grammars and ZERO query strings — every grammar import lives in a `*-patterns/<lang>.ts` plugin file, and the orchestrators go through the registry indirection. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared primitives (unchanged) ├── grpc-extractor.ts # orchestrator (only `.proto` parser left) └── grpc-patterns/ ├── types.ts # GrpcDetection, GrpcLanguagePlugin, GrpcRole ├── index.ts # registry: ext → plugin + GRPC_SCAN_GLOB ├── go.ts # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient ├── java.ts # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub ├── python.ts # tree-sitter-python: add_XxxServicer_to_server + XxxStub └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # @GrpcMethod, @GrpcClient field type, # .getService<X>('Svc'), new XxxServiceClient, # loadPackageDefinition dynamic constructors ``` ## Per-language coverage **Go (`go.ts`)** - Provider: `\w+.RegisterXxxServer(...)` via `call_expression → selector_expression → field_identifier` + JS regex filter `^Register(\w+)Server$`. - Provider: `pb.UnimplementedXxxServer` embedded in a struct via `struct_type → field_declaration_list → field_declaration → qualified_type → type_identifier` + JS filter. - Consumer: `\w+.NewXxxClient(...)` via the same call_expression query + JS filter `^New(\w+)Client$`. **Java (`java.ts`)** - Provider: `class X extends YyyGrpc.YyyImplBase` — two queries handle the scoped and plain forms. `scoped_type_identifier`'s children are positional (no `scope:`/`name:` fields), so the query matches the two `type_identifier` children by position. - `#match? @inner "ImplBase$"` restricts matches at query time. - Whether the class has `@GrpcService` or not controls only the `source` metadata label — the plugin walks the class_declaration's `modifiers` child in JS to detect the marker_annotation. - Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a `method_invocation` query with `#match? @method "^new(Blocking)?Stub$"`, service name extracted via `^(\w+)Grpc$` on the object identifier. **Python (`python.ts`)** - Single call-expression query covers both bare identifier and `obj.method` attribute forms: `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`. - Plugin filters `@fn.text` against two JS regexes: `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$` (consumer), with a reserved-names ignore list for the Stub case (Mock / Test / Fake / Stub). **Node — JavaScript + TypeScript + TSX (`node.ts`)** - Pattern sources defined once, compiled three times (one per grammar) because `Parser.Query` objects are not portable across grammars. Exports three `GrpcLanguagePlugin`s sharing the same `scan`. - `@GrpcMethod('Service', 'Method')`: decorator query captures the two string literals. Confidence is hard-coded 0.8 regardless of proto map resolution (matches the original regex version's behaviour). - `@GrpcClient(...) field: XxxServiceClient`: decorator query captures the decorator node, plugin walks up to find the enclosing `public_field_definition` (decorators on fields are CHILDREN of the field definition in tree-sitter-typescript, not siblings) and reads its first `type_annotation → type_identifier`, then runs the `^(\w+Service)Client$` JS filter. - `client.getService<X>('AuthService')`: call-expression query on `member_expression.property = "getService"` + string literal arg. - `new XxxServiceClient(...)`: `new_expression` with a bare identifier constructor, filtered by `^(\w+Service)Client$` so generic `new AuthClient(...)` (missing the `Service` infix) does NOT falsely register as a consumer. Preserves the regression test `test_extract_ts_non_service_client_constructor_is_ignored`. - `loadPackageDefinition` dynamic loader: gated on `tree.rootNode.text.includes('loadPackageDefinition')`. When set, `new foo.bar.Xxx(...)` qualified constructors with a capitalised property name register as consumers. ## Orchestrator changes `grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders` / ... helper and replaces them with a single source-scan loop that: 1. Parses each file with the plugin's grammar (one shared `Parser` instance across all files, `setLanguage` called per plugin). 2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`. 3. Converts each detection to an `ExtractedContract` via the private `detectionToContract` helper, which: - Looks the short service name up in the proto map (filled by the `.proto` parser). - Picks confidence = `confidenceWithProto` if resolved, else `confidenceWithoutProto`. - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when the detection carries a `methodName` (TS `@GrpcMethod` only), otherwise a service-level id (`grpc::pkg.Svc/*`). Everything else — the `.proto` parser, `buildProtoContext`, `buildProtoMap`, `resolveProtoConflict`, `serviceContractId`, `stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe function — stays exactly as before. The `.proto` parser is kept as a pragmatic exception to the "no regex in extractors" rule because no `tree-sitter-proto` grammar is installed in the repo; a comment at the top of the file explains this and flags the maintainer option of adding `tree-sitter-proto` as a dependency. ## Why this is better than the regex version 1. **Comments and strings are respected for free.** Matched node types are only code constructs, never text inside comments or string literals. 2. **No false positives on partial names.** The old `(\w+?)Grpc`-style regexes would cross-match unrelated identifiers; structural queries restrict matches to the exact AST shape (`scoped_type_identifier → type_identifier` pairs, `method_invocation → identifier` etc.). 3. **NestJS `@GrpcClient` is structural, not regex-based.** The old regex required a specific textual layout (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the plugin now walks the AST, so modifier order / optional modifiers / multi-line formatting don't break it. 4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC detection later is a one-file edit in `grpc-patterns/index.ts` — no touches to the shared scanner, the orchestrator, or the proto parser. ## Tests - `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the contract shape is identical). Covers .proto parsing (including the brace-inside-string regression), Go provider/consumer, Java @GrpcService / plain ImplBase provider + newBlockingStub consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient + .getService + new XxxServiceClient + loadPackageDefinition + the `AuthClient` vs `AuthServiceClient` discrimination, dedupe across multiple patterns in one file, proto-aware confidence, and the inherited-package resolution for split proto definitions. - `topic-extractor.test.ts` — 30/30 pass. - `http-route-extractor.test.ts` — 18/18 pass. - `manifest-extractor.test.ts` — 8/8 pass. - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No pipeline.ts, MCP surface, ingestion, CI / release / security, or test changes. - New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`, `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already installed for the ingestion pipeline. ## End of phase series This commit completes the three-phase extractor refactor: - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/` - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/` - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/` Every remaining regex-based extractor helper under the `src/core/group/ extractors/` directory is either (a) language-agnostic string processing (path normalization, dedupe keys) or (b) the `.proto` parser, which is documented as an explicit exception. Co-authored-by: Claude <noreply@anthropic.com> * feat(group): add tree-sitter-proto for .proto file parsing Addresses @magyargergo's suggestion on abhigyanpatwari#796 to replace the manual string-sanitizing .proto parser with a tree-sitter grammar. - **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`. Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto) (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli 0.24` to produce ABI version 14 — compatible with the project's `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as `optionalDependency` with `file:./vendor/tree-sitter-proto`. - **New `grpc-patterns/proto.ts` plugin** — uses the same `compilePatterns` + `runCompiledPatterns` infrastructure as every other plugin. Two queries: - `(package (full_ident) @pkg)` — package declaration - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))` — one match per (service, rpc) pair - **Graceful fallback** — `tree-sitter-proto` is an optional dependency. If it fails to install (platform incompatibility) or fails the runtime smoke-test (`setLanguage` + `parse` on a trivial proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator uses the existing manual parser. The smoke-test catches the `SyntaxNode` TDZ error that occurs in vitest's fork-based test runner. - **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto` files are handled by the plugin loop (they're included in `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is skipped. `buildProtoContext` still runs to build the proto map for cross-referencing source-file detections. 1. **No manual comment/string stripping.** The old parser needed `stripProtoCommentsAndStrings` (110 lines) to avoid counting braces inside comments and string literals. tree-sitter handles this natively. 2. **No brace-depth tracking.** `extractServiceBlocks` used a manual depth counter to find service boundaries. tree-sitter's AST gives us `service` → `service_name` + `rpc` → `rpc_name` directly. 3. **Performance.** tree-sitter's C-based parser is faster than character-by-character JS scanning + regex on large proto files. - `grpc-extractor.test.ts` — **43/43 pass** (unchanged) - All other extractor tests — 99/99 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * chore: add .gitignore for vendored tree-sitter-proto build artifacts https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * fix: correct .gitignore paths for vendored tree-sitter-proto Patterns should be relative to the .gitignore file's directory. https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * refactor(group): address Copilot review feedback on abhigyanpatwari#796 Six fixes suggested by the Copilot AI review: 1. **`normalizeHttpPath` root-path edge case** — stripping trailing slashes on the input `/` produced an empty string, yielding malformed contract ids like `http::GET::`. Now preserves `/` for the root handler/fetch case. 2. **Dedupe `scanFiles` call** — `extract()` was globbing the source-scan file list twice (once for the provider fallback, once for the consumer fallback). Moved to a single lazy call that memoizes the result for the rest of the method. 3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other extractor's glob already ignored vendored sources; the HTTP one didn't. Fixed for consistency. 4. **`loadPackageDefinition` check is now structural** — was calling `tree.rootNode.text.includes('loadPackageDefinition')` which forces materialization of the entire file text from the parse tree (expensive on large files). Replaced with a dedicated compiled query on `(call_expression function: [(identifier) | (member_expression)])` so the check stays in the AST domain. 5. **`grpc-extractor.ts` header docstring updated** — still claimed ".proto parsing is not tree-sitter-based because no grammar is installed". Now describes the actual behaviour: tree-sitter when `tree-sitter-proto` is available (optionalDependency), manual fallback otherwise. 6. **Eliminated the double proto file parse on the fallback path** — `buildProtoContext` already globs + parses every `.proto` file to build `servicesByName`. On the `!hasProtoPlugin` branch the extractor was globbing + parsing again via the now-removed `parseProtoFile` helper. The fallback branch now iterates the map that `buildProtoContext` already produced to emit provider contracts directly — single pass per proto file. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): address Claude review feedback (bugs + dedup + hygiene) on abhigyanpatwari#796 Follows up `2f28bfc` with the remaining items from the Claude AI review: ## Bugs **Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.** The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x` with no label filter, so a topic/service/package name could silently match any node type (File, Variable, Import, Folder, …). Added label filters: - `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort symbol-name matches against listener/publisher symbols) - `grpc` method → `(n:Function|Method)` - `grpc` service → `(n:Class|Interface)` - `lib` → `(n:Package|Module)` All 8 manifest-extractor tests still pass (mock executor is label-agnostic, but the production LadybugDB graph now gets correctly scoped queries). **Bug 8 — Tautological `!handlerName` condition.** `http-route-extractor.ts:extractProvidersGraph` had `let handlerName = null; if (!method || !handlerName) { ... }` — the `!handlerName` clause was always true since there was no intervening assignment. Simplified to always run the plugin-scan lookup (we need the handler name even when `methodFromRouteReason` already resolved the method). ## Clean code / dedup **Design 7 — `readSafe` was copy-pasted in all three orchestrators.** Extracted to `extractors/fs-utils.ts` as the single source of truth for the path-traversal guard. Dropped the three local copies and the now-unused `fs`/`path` imports from topic-extractor. **Style 10 — Language-specific `_test.go` skip in the topic orchestrator.** Was `if (rel.endsWith('_test.go')) continue;` inside the language- agnostic extraction loop. Pushed into the glob's ignore list (`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`, `dist`, `build` entries, with a comment explaining that other languages' test file conventions either live in separate directories (Python `tests/`, Java `src/test/`) or are already covered by the existing ignores. ## Already addressed in `2f28bfc` (mentioned again in Claude review) - Bug 3: `normalizeHttpPath('/')` returns `''` — fixed - Bug 4: double glob + double parse of `.proto` — fixed - Bug 5: `scanFiles` called twice in HTTP — fixed - Bug 6: missing `**/vendor/**` in HTTP glob — fixed - Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')` replaced with a dedicated structural query ## Deferred - Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope; sync.ts matching logic lands in abhigyanpatwari#793, manifest extractor already emits correct synthetic uids for unresolved HTTP contracts. - Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) — the only real use case (`loadPackageDefinition` gate) is already fixed via a structural query, so the interface change would be cosmetic churn without a concrete consumer. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * docs+fix(group): address remaining Claude review items + add pipeline flow chart ## Fixes **Remaining 🔴 — HTTP contract id wildcard format.** Documented the `http::*::<path>` format as an intentional wildcard for manifest links that omit the HTTP method, alongside the explicit-method form (`GET::/path` → `http::GET::/path`). The docblock on `buildContractId` now states both forms, notes that wildcard-aware matching is the responsibility of the sync / cross-impact layer (abhigyanpatwari#793), and recommends the explicit-method form whenever the author knows the method (it round-trips through exact equality without needing wildcard logic downstream). Tests unchanged — the wildcard format is what they've always asserted. **Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The comment claimed "creates a contract with an empty symbolUid/ref" but the code switched to `manifestSymbolUid(repo, contractId)` a few commits back. Updated to describe the actual synthetic-uid fallback semantics and the cross-impact path that relies on both sides of the join deriving the same uid. **Minor 2 — exhaustiveness guard on `buildContractId`.** The `switch(type)` covered all five current `ContractType` variants but silently returned `undefined` if a new variant was added. Added a `default: const _exhaustive: never = type; throw new Error(...)` clause so the build fails loudly on an unhandled variant. **Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already fixed in `2f28bfc` via a dedicated structural query (`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed. ## New: pipeline flow chart (per @magyargergo's request) Added `src/core/group/PIPELINE.md` with four Mermaid diagrams: 1. **High-level overview** — `group.yaml` → extractors + manifest → contract matching → `bridge.lbug` → `runGroupImpact`. 2. **Per-repo extractor two-strategy shape** — graph-assisted Strategy A vs. source-scan Strategy B. 3. **Plugin architecture** — orchestrator → registry → per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` → `ExtractedContract`. 4. **Manifest extraction** — label-scoped `resolveSymbol` with the synthetic-uid fallback. 5. **Cross-impact query (abhigyanpatwari#606)** — local impact → bridge join → cross-repo fan-out. Each diagram is annotated with which PRs own which stage (this PR: extractors + manifest; abhigyanpatwari#795: bridge storage; abhigyanpatwari#606: cross-impact runtime) and points at the concrete files/functions involved. ## Tests - 99/99 extractor tests pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
Part 2 of 4 in the split of #606 per @magyargergo's request. Follows #795 (already merged) but has no code-level dependency on it — both PRs could have been reviewed in parallel.
Closes #792.
What changed
Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups.
New (228 LOC prod + 308 LOC tests):
Modified (+794 LOC prod + 956 LOC tests):
Self-review fixes folded in
Carried forward from the original #606 self-review on `d15b8cb`:
How to verify
```bash
cd gitnexus
npx tsc --noEmit
npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks
npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks
npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks
npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks
```
Local: typecheck clean; 99/99 extractor tests pass (grpc 43, http 18, topic 30, manifest 8).
Risk / rollback
Low. Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` consumed by `sync.ts` in the next split (#793). No existing behavior changes for users who don't run `group sync`. Rollback is a clean `git revert`; the three modified extractors revert to their pre-PR versions which are subsets of the new functionality.
Scope discipline (per `GUARDRAILS.md`)
Dependencies
🤖 Co-authored with Claude Code — findings and fixes written by me, Claude ran the test/edit loop. I reviewed and signed off on every change before pushing.