[group] Cross-repo impact analysis via repository groups#606
Conversation
|
@ivkond is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
|
Hi @abhigyanpatwari! This is a proposal PR for cross-repository impact analysis (#256, #306, #77). I'd appreciate your feedback on:
This is a demo/proof-of-concept — BM25/embedding matching, gRPC/topic extractors, and multi-hop traversal are intentionally deferred. Happy to iterate based on your feedback. |
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 6022 tests passed 97 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
Code Review: Cross-Index Impact GroupsOverall: Impressive RFC-to-implementation PR. The design document is thorough and the implementation appears solid. This is a significant feature that addresses real gaps in multi-repo analysis. Issues1. Graph schema gaps not addressed in this PR (medium risk)
The PR implements the group infrastructure but these gaps mean the cross-link extraction for HTTP routes, gRPC, and topics will not work as designed. Consider:
2. Contract ID format could collide (low risk)
Consider using a less common delimiter or proper escaping. URL-encoding the discriminator could prevent collisions. 3. BM25 threshold hardcoded at 0.7 (minor)
4. Embedding vectors stored separately (question)
The doc mentions no automatic migration — this could leave stale vectors with incompatible dimensions. 5. Test coverage looks good Minor
VerdictApprove with suggestions. The design is sound and the implementation is solid. The main concern is the gap between designed capability (all contract types) and actual capability (limited by schema). Documenting this honestly would help users understand what works today vs what is planned. |
|
Thanks for the thorough review! Addressing each point: 1. Graph schema gapsThe RFC "Prerequisites and Current State" section describes the codebase as of the RFC writing date, not the current state. The implementation plan (
The gRPC, topic, and shared lib extractors are intentionally out of scope for Demo PR (documented in PR description). 2. Contract ID
|
|
@claude ultra-think while identifying cross-indexing impact analysis gaps. Perfom this analysis on this pull request and verify those findings. Act as a senior compiler font-end engineer and expert in static analysis tools and review this change from that perpective and review the architectural fit. Also check out previous comments and reason with their requests. |
|
@ivkond This is a great idea. I had been implementing something like this myself in the enterprise version of gitnexus. Our indexing system allows cross-repo analyses too but will need a bridge ( virtual graph ). Correct approach. But I would also look at intra repo rpc, grpc, kafka, rabbitmq, api endpoints, etc tracking. Once this can be done for intra repo we can move the same logic into building the virtual graph system so multi-repo support is bulletproof. A good test would be to see if microservice based repos can be indexed well. If that is done cross repo becomes easy |
|
@abhigyanpatwari Thanks for the direction! I've started implementing intra-repo service communication tracking on a separate branch (stacked on this PR): What's added (ivkond/GitNexus#1):
Proof of concept: Fixture monorepo with 3 services (auth, orders, gateway) connected via gRPC + Kafka + HTTP routes — all cross-links are discovered automatically. 47 new tests, zero regressions (2643 unit + 2016 integration). The Next: test on a real microservice repo, add more framework patterns, then wire into cross-repo virtual graph. |
|
Update: Following your suggestion, I've split the work into two PRs:
This way the intra-repo foundation lands first, proving that microservice monorepos index well, before cross-repo is layered on top. |
|
Update: Following your suggestion, I've split the work into two PRs:
This way the intra-repo foundation lands first, proving that microservice monorepos index well, before cross-repo is layered on top. (Supersedes #625 which was closed due to branch rename.) |
Cross-repo blast radius analysis through Contract Registry: - group_impact MCP tool and CLI command - ManifestExtractor for explicit cross-repo links - HTTP route extractor for auto-detected contracts - closeLbug() resource cleanup in CLI sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nonical ID Design spec (12 rounds of review): - Bridge.lbug: LadybugDB storage replacing contracts.json - gRPC canonical ID: proto-aware extraction with wildcard matching - Direction-dependent Cypher queries for cross-impact - Write-to-temp-then-rename atomic swap strategy - Backward compatibility via openBridgeOrFallback Implementation plan (10 tasks, TDD): - Bridge DB core, schema, write/read lifecycle - Proto map, source scanner resolution, confidence adjustments - Wildcard matching, sync migration, service layer integration Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
88fd9d6 to
de79d4b
Compare
|
Deployment failed with the following error: |
Bridge.lbug — contract storage in LadybugDB: - bridge-schema.ts: Contract, RepoSnapshot, ContractLink DDL - bridge-db.ts: open/close, writeBridge (atomic swap), queryBridge, openBridgeDbReadOnly (with version gate), readBridgeMeta, openBridgeOrFallback - BridgeHandle, BridgeMeta types; LegacyContractRegistry alias gRPC canonical ID — proto-aware extraction: - buildProtoMap: scan .proto files (excludes node_modules/vendor) - resolveProtoConflict: directory proximity disambiguation - serviceContractId: grpc::pkg.Service/* format - All 4 source scanners (Go/Java/Python/TS) resolve via proto map - Confidence adjustments: reduced without proto, boosted with Matching — wildcard support: - buildProviderIndex exported for reuse - runExactMatch skips gRPC /* contracts - runWildcardMatch: bare-name and FQ service matching Consumer migration: - sync.ts: writeBridge + wildcard pass, returns remaining after wildcard - cross-impact.ts: Cypher-based Phase 2 with direction-dependent queries, hint param for empty-UID fan-out, error-object guard - service.ts: openBridgeOrFallback with bridge/json/none branching, groupContracts bridge path returns flat contracts + crossLinks + unmatchedOnly - CLI: status/impact use openBridgeOrFallback, deprecation warnings - MCP tools: descriptions updated (contracts.json → bridge.lbug) - Old storage API removed (writeContractRegistry, readContractRegistry) - Cross-link insertion uses endpoint symbolRef lookup (not link.contractId) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
de79d4b to
ae68bfb
Compare
|
@abhigyanpatwari This PR is ready for re-review. I've addressed both items from your PR #626 review: 1. Contract storage → bridge.lbug
2. gRPC normalization mismatch
Tests: 239 in group module (11 new coverage tests), 5410 total in CI. Build: clean. Note on force-push: Rebased on current main (includes PR #626) and squashed 24 intermediate commits into 3 logical units — original feature, documentation (spec + plan), implementation. The intermediate commits were iterative spec review rounds and per-task TDD commits that added noise to the history without providing bisect value. |
|
@abhigyanpatwari hi! Friendly follow-up on this PR — I’d really appreciate your feedback when you have time, especially on whether this direction and architecture fit your vision. |
|
@claude do a deep review of it, especially focused on coverage of cross repo communication protocols. Also see that there are real world test to validate it against real multi repo and microsercice based repos. |
|
Claude finished @abhigyanpatwari's task in 2m 43s —— View job Deep Review: Cross-Repo Impact Analysis — Protocol Coverage & Real-World Test ValidationTodo list:
SummaryThe architecture is fundamentally sound — the hybrid lazy virtual graph via 1. Protocol Coverage: What Works vs. What's MissingHTTP (
|
| Missing pattern | Impact |
|---|---|
Go HTTP handlers (gin.GET, echo.GET, mux.HandleFunc, stdlib http.HandleFunc) |
Any Go-based API service goes undetected as provider |
NestJS @Controller/@Get/@Post decorators |
NestJS is the dominant Node.js microservice framework; not covered |
Python requests.get/post consumer |
Python consumer side is entirely absent |
Java RestTemplate.getForEntity, WebClient.get(), OkHttpClient |
Java inter-service HTTP calls not tracked |
Go http.Get, resty.R().Get() consumer patterns |
Go consumer side absent |
The fallback consumer detection source scan, lines 409–423 only scans .ts,.tsx,.js,.jsx,.vue,.svelte — Go, Python, Java, PHP consumers are never found without graph data. Fix this →
gRPC (grpc-extractor.ts)
Covered: Proto parsing (definitive), Go Register*Server/New*Client, Java @GrpcService/*ImplBase/*Grpc.newStub, Python add_*Servicer_to_server/*Stub, NestJS @GrpcMethod (server only).
Critical gap — TypeScript gRPC consumer is entirely missing:
// These are completely invisible to the extractor:
// grpc-js
const client = new proto.hr.UserService('host', credentials.createInsecure());
// NestJS @GrpcClient (consumer side)
@GrpcClient({ service: 'UserService', url: '...' })
private readonly userClient: ClientGrpc;
// NestJS ClientGrpc
const svc = this.userClient.getService<UserServiceClient>('UserService');scanTsProviders handles @GrpcMethod only (server side). There is no scanTsConsumers. Given TypeScript/NestJS is a primary gRPC consumer framework in Node.js microservices, this is a significant real-world gap. Fix this →
Other gaps: No C# gRPC (GrpcChannel.ForAddress, new Service.ServiceClient()), no Ruby (Grpc::ClientStub), no Rust tonic patterns. These matter less than TypeScript but are worth documenting.
Proto import resolution not implemented: When a .proto uses import "other.proto" to reference types from another package, the buildProtoMap doesn't follow imports. This means services that split their types across files may get wrong or missing package names.
Topic/Messaging (topic-extractor.ts)
Covered well: Java Spring Kafka/RabbitMQ annotations, basic Node.js producer/consumer, Go sarama ConsumePartition, Python Kafka. This is the most complete extractor.
Notable gaps:
| Missing | Framework | Impact |
|---|---|---|
consumer.run({ eachMessage: ... }) |
kafkajs (most popular Node Kafka lib) | Node.js Kafka consumers won't be detected as they use run() not subscribe() |
kafka.NewWriter / kafka.NewReader |
segmentio/kafka-go | Second most popular Go Kafka library |
sarama.NewSyncProducer / sarama.NewAsyncProducer |
sarama (most popular Go Kafka) | Go Kafka producers missing |
js.Publish / js.Subscribe |
NATS JetStream | JetStream is the modern NATS API, largely replacing core NATS |
nats.connect / nc.subscribe (Python) |
nats-py | NATS Python consumer absent |
| Redis Pub/Sub, AWS SQS/SNS, Azure Service Bus | — | Common in enterprise microservices |
The kafkajs gap is the most impactful — fix this →
2. Real-World Test Coverage: The Main Gap
The integration test (test/integration/group/group-impact.test.ts) is explicitly noted as "mocks localImpactFn / crossImpactFn. E2E with real graphs is a follow-up." This is the most important gap to close.
What exists today:
- Synthetic in-memory registry with hardcoded cross-links
- Single test scenario (upstream fan-out via UID match)
- Test-monorepo fixture (
test/fixtures/group/test-monorepo/) with 3 service stubs — not wired to any integration test
What's needed for real multi-repo / microservice validation:
A. Fixture-based integration test against real code
The test-monorepo fixture exists but isn't exercised end-to-end. A test like:
it('discovers gRPC link from gateway → auth across service boundaries', async () => {
// 1. Index each service directory with real indexer
// 2. Run group sync against the fixture
// 3. Assert bridge.lbug contains expected ContractLink
// 4. Run group impact on a gateway function
// 5. Assert auth service appears in cross-repo hits
});This requires running the real extractor pipeline — but it would catch bugs that the current mocked tests cannot.
B. Known-good open-source repo test
There are popular open-source microservice demos (e.g., Google's Online Boutique, Netflix's Conductor) that have well-documented service boundaries. A test that clones a specific commit and asserts specific cross-links would give confidence that the system works on real code.
C. Confidence degradation test
No test verifies that removing a .proto file degrades confidence from 0.8 to 0.65 for Go providers. No test exercises the wildcard→method match path end-to-end.
3. Correctness Bugs Found
Bug 1: outOfScope always empty in Cypher path (cross-impact.ts:417)
// runGroupImpact returns:
return {
...
outOfScope: [], // ← always empty, never populated
...
};The legacy runGroupImpactLegacy populates outOfScope when a matched repo falls outside the subgroup filter. The Cypher-based runGroupImpact uses a WHERE clause to filter, so matches never come back — but the caller has no way to know which repos were filtered. This is a silent information loss that affects UX when --subgroup is used. Fix this →
Bug 2: Manifest cross-links silently dropped for gRPC/lib (manifest-extractor.ts:92)
private async resolveSymbol(...): Promise<...> {
...
} else {
return null; // ← gRPC and lib types always return null
}When null is returned, symbolRef = { filePath: '', name: link.contract }. In bridge-db.ts the cross-link insert query matches nodes by c.filePath = $fp AND c.symbolName = $sn with fp = ''. Since no contract node will have an empty filePath, fromRows or toRows will be empty and the edge is silently skipped (line 209: if (!fromRows.length || !toRows.length) continue). Manifest-declared gRPC cross-links produce contracts but no ContractLink edges in the bridge DB, making group_impact miss them entirely.
Bug 3: resolveProtoConflict directory proximity heuristic fails for centralized proto repos
// Matches character-by-character, not path-segment-by-segment
for (let i = 0; i < min; i++) {
if (sourceDir[i] === protoDir[i]) shared++;
else break;
}For a monorepo with protos/ at root and services in services/auth/, services/orders/, the paths share the same repo prefix. A source file at services/auth/server.go and proto at protos/auth/auth.proto will score lower than proto at protos/orders/orders.proto if "protos/auth" lexically matches less of the path than another option. The heuristic should compare normalized path segments, not characters. Fix this →
4. Minor Issues
phase1Timeout ignores user intent for large timeouts (cross-impact.ts:129/312): Math.min(5000, timeout) means users who set timeout: 60000 still only get 5s for Phase 1. For large graphs this might truncate the local impact result. Consider Math.min(timeout * 0.3, 10000) or make it configurable.
Deduplication key in GrpcExtractor uses file path (grpc-extractor.ts:491): ${c.contractId}|${c.role}|${c.symbolRef.filePath}. A file that registers both UserService and UserService (via two different patterns) would only keep one. This is correct but worth a comment.
Topic extractor scans .go but Kafka/NATS patterns use Go-style calls without Go file detection — the glob pattern includes *.go but many Go patterns like ConsumePartition may appear in test files. Consider filtering *_test.go.
5. What's Working Well
- The bridge.lbug atomic write (temp→rename with
.bakrecovery) is solid for concurrent use - Wildcard gRPC matching (
runWildcardMatch) correctly handles the case where a consumer doesn't have proto context — this is a real-world pattern normalizeContractIdhandles HTTP method case and trailing slashes, gRPC package lowercasing correctly- Graph-first strategy in
HttpRouteExtractoris the right approach — when graph data is available, use it; fall back to source scan only when needed - Test naming convention (
test_verb_condition_expectation) is clean and consistent throughout
Recommended Next Steps (Prioritized)
- High — Add fixture-based end-to-end test using
test-monorepothat runs the real extractor pipeline and asserts specificContractLinkedges inbridge.lbug - High — Add TypeScript gRPC consumer detection (
@GrpcClient,ClientGrpc, grpc-jsnew ServiceClient()) - High — Fix manifest cross-link silently dropped for gRPC/lib types (Bug 2)
- Medium — Add Go HTTP provider patterns (gin, echo, stdlib mux)
- Medium — Add
kafkajsconsumer.run({ eachMessage })and sarama Go producer patterns - Medium — Fix
outOfScopealways empty in Cypher path (Bug 1) - Low — Fix directory proximity heuristic to use path segments (Bug 3)
- Low — Add NestJS
@Controller/@Getprovider detection
|
@ivkond can u look into these pls 🙏. Also great work, mostly clean implementation few easy bugs to bash |
daeac6a to
55089ae
Compare
Before this PR lands, I ran a systematic self-review of the cross-impact module and fixed 13 correctness / robustness issues caught during it. All changes target files added by this PR; nothing outside of group/* or the new MCP group_impact tool schema is touched. Everything below is in this single commit. The rest of the PR history is unchanged. Full group suite: all unit and integration test files green when run individually (bridge-db.test.ts hits a pre-existing native LadybugDB worker cleanup segfault that sometimes flakes the reported count — documented in vitest.config.ts, unrelated to these fixes). CI on this commit: 11/11 green across ubuntu/macos/windows + quality + typecheck + e2e. ═══════════════════════════════════════════════════════════════════════ HIGH severity — correctness bugs with regression tests ═══════════════════════════════════════════════════════════════════════ HIGH #1 — manifest-extractor.resolveSymbol was too fuzzy ─────────────────────────────────────────────────────────── resolveSymbol() used CONTAINS on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences in production: * HTTP: manifest "/orders" matched "/suborders", "/myorders/{id}" * gRPC: ANY repo with any .proto file returned a random proto symbol regardless of whether the service/method matched * topic/lib: similar CONTAINS false positives Now all four types use EXACT equality on the relevant name field plus deterministic ORDER BY before LIMIT 1. HTTP paths are canonicalized (ensureSlash + trailing-slash strip) before matching Route.name via a new normalizeRoutePath() helper. gRPC matches by structured service/method name with NO .proto fallback. To avoid cross-impact breakage when resolveSymbol now legitimately returns null (the indexer may not expose a node with the exact method name), unresolved manifest contracts now get a deterministic synthetic symbolUid `manifest::<repo>::<contractId>` instead of an empty string. This gives the bridge Cypher query a stable anchor on both provider and consumer sides. HIGH #2 — sync.ts per-extractor isolation ─────────────────────────────────────────────────────────── syncGroup() wrapped all three extractors (http, grpc, topic) plus service-boundary detection plus meta.json read in ONE try/catch around the whole repo loop. A single extractor throwing would mark the whole repo as `missingRepos` (looking like "repo not found") and prevent the other extractors from running. Now each step is isolated with per-failure reporting: * initLbug failure → repo marked missing, labeled `init` * detectServiceBoundaries failure → empty boundaries, contracts still extracted without service attribution * each extractor failure → recorded with {repo, extractor, message}; other extractors in the repo keep running * manifest extractor failure → recorded but doesn't fail the sync * bridge write partial failures → summarized as `bridge_write` HIGH #3 — bridge-db.writeBridge per-item tolerance ─────────────────────────────────────────────────────────── writeBridge() had no per-item try/catch on CREATE loops. A single malformed contract, snapshot, or link would abort the whole write. The atomic swap protected the OLD bridge but threw away all valid inserts from the current run. Each CREATE is now wrapped individually. writeBridge() returns a new WriteBridgeReport with insert/fail counts plus a capped sampleErrors array. sync.ts surfaces non-zero failure counts via extractorFailures. Dropped cross-links (findContractNode returning null) are counted separately in linksDroppedMissingNode — previously silently continued. HIGH #4 — writeBridge handle lifecycle ─────────────────────────────────────────────────────────── writeBridge() opened the tmp LadybugDB handle before the insert loops but only closed it at the end of the happy path. If ensureBridgeSchema, any queryBridge CREATE, findContractNode, or the contract/snapshot/link loops threw, the handle leaked — which holds the native file lock on `bridge.lbug.tmp`, leaking an FD and preventing the next sync from reusing the tmp slot. Wrapped everything from `ensureBridgeSchema(handle)` to the explicit `closeBridgeDb(handle)` in try/finally with a `handleClosed` sentinel so we never double-close on the happy path and always close on the error path. HIGH #5 — openBridgeDbReadOnly + openBridgeOrFallback leaks ─────────────────────────────────────────────────────────── Two leak paths: * openBridgeDbReadOnly: `new lbug.Database(...)` followed by `new lbug.Connection(db)` was wrapped in a single try/catch, but if Connection construction threw AFTER Database allocation, the Database was leaked (no teardown path for a partial handle). * openBridgeOrFallback (storage.ts): called `readBridgeMeta` AFTER the handle was opened; any error from meta.json parse would bubble up leaving the handle open on the caller. Fixed: * openBridgeDbReadOnly now tracks `db` and `conn` separately and closes each of them on the error path. * openBridgeOrFallback now wraps `readBridgeMeta` in try/catch and explicitly calls closeBridgeDb (with .catch() swallow) before rethrowing. HIGH #6 — groupContracts JSON.parse of contract meta ─────────────────────────────────────────────────────────── GroupService.groupContracts() (bridge path) reconstructed StoredContract from flat rows via `JSON.parse(r.meta)`. A single corrupted meta row (malformed JSON, non-string leftover from a schema migration) would abort the whole listing with an unhandled exception. Replaced with a `safeParseMeta` helper that returns `{}` on parse failure and increments a `metaParseFailures` counter, which is surfaced in the return value when non-zero so callers can log and debug. ═══════════════════════════════════════════════════════════════════════ MEDIUM severity — robustness gaps, input validation, logging ═══════════════════════════════════════════════════════════════════════ MEDIUM #1 — service.groupImpact input range validation + safeLocalImpact ─────────────────────────────────────────────────────────── groupImpact() accepted any finite number for maxDepth, minConfidence, timeout, crossDepth (including negative, zero, or 1e9). MCP is a public surface; unbounded values were a DoS vector. Additionally, `direction: 'upstreem'` (typo) silently became 'upstream'. All five params are now strictly validated with explicit error messages: * direction ∈ {'upstream', 'downstream'} * maxDepth ∈ [1, 10] integer * minConfidence ∈ [0, 1] * timeout ∈ [100, 300000] ms * crossDepth ∈ [0, 10] integer Additionally: the localImpactFn callback called resolveGroupRepo() which throws if the repo isn't in the group config. runPhase1WithTimeout only catches timeouts, not rejections, so a missing-repo error would crash the whole MCP handler. Wrapped in a safeLocalImpact helper that catches and returns an error-shaped stub instead of bubbling past the timeout guard. MEDIUM #2 — Phase-1 timeout reason field ─────────────────────────────────────────────────────────── runGroupImpact()/runGroupImpactLegacy() stub for Phase-1 timeout used impactedCount: 0, risk: 'LOW', byDepth: {} — indistinguishable from a verified empty walk. Consumers treating these as real could miss actual impact. Added: * GroupImpactResult.truncationReason ('phase1_timeout' | 'wall_deadline') * phase1TimedOut: true marker inside the `local` stub * source repoPath pushed into truncatedRepos immediately on Phase-1 timeout (was only done post-Phase-2) MEDIUM #3 — grpc-extractor proto parser ─────────────────────────────────────────────────────────── extractServiceBlocks() used brace-depth counting on the raw .proto content. Braces inside string literals or `// line` / `/* block */` comments were counted as real body braces — a valid proto with `option deprecated_reason = "use NewService { instead"` would have the service body closed early, dropping methods after the offending string. Previously marked as a known TODO but never fixed. Added stripProtoCommentsAndStrings() helper that returns a sanitized copy with offsets and line breaks preserved (comments/strings replaced by spaces). extractServiceBlocks now scans the sanitized copy for service headers and brace depth, but slices the returned body from the original content to keep source text intact. MEDIUM #4 — topic-extractor Kafka Go regex ─────────────────────────────────────────────────────────── Sarama patterns anchored on `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` had two failure modes: * In a loop with multiple ProducerMessage literals, only the FIRST Topic was captured — the engine's lastIndex advanced past the Topic match, and there was no second NewSyncProducer to re-anchor. * Multiple unrelated messages within 300 chars of the constructor would attribute the wrong topic to the wrong producer context. Changed the sarama patterns to match the struct literal directly: `sarama\.ProducerMessage\s*\{[\s\S]{0,200}?Topic:\s*"([^"]+)"`. This captures every ProducerMessage regardless of producer scope. MEDIUM #5 — cross-impact crossDepth clamp deduplication ─────────────────────────────────────────────────────────── `Math.min(1, …)` was duplicated at cross-impact.ts:149 (legacy) and :354 (bridge path) PLUS at service.ts:252. Extracted the clamp into MAX_SUPPORTED_CROSS_DEPTH with a comment pointing at service.ts so the two places don't drift when multi-hop traversal lands. MEDIUM #6 — matching.ts gRPC normalize comment ─────────────────────────────────────────────────────────── No behavior change. The original comment ("MVP: lowercase the whole token; differs from pkg/method split") falsely implied inconsistency between the no-slash and pkg/method branches. Rewritten to document the wire-protocol rationale (pkg/service case-insensitive, RPC method case-sensitive) and to explicitly note that package-only and pkg/method ids are intentionally distinct canonical forms. MEDIUM #7 — manifest-extractor silent catch ─────────────────────────────────────────────────────────── The Cypher executor try/catch dropped errors with `/* fall through */`. A broken graph query in one repo would silently degrade resolveSymbol to the synthetic-uid path with no diagnostic output. Added a `console.warn` with the link type, contract name, repo key, and error message. MEDIUM #8 — writeBridgeMeta uses retryRename ─────────────────────────────────────────────────────────── writeBridgeMeta was using `fsp.rename` directly while writeBridge used retryRename with exponential backoff for EBUSY/EPERM/EACCES. A concurrent Windows reader could cause the meta.json rename to fail transiently while the bridge.lbug swap survived. Pulled the rename through retryRename so the meta write is at least as robust as the bridge write it accompanies. MEDIUM #9 — sync.ts extractor failure labels ─────────────────────────────────────────────────────────── The `ExtractorFailure.extractor` enum labeled initLbug failures as 'boundaries' and bridge-write failures as 'manifest' — both confusing in production logs. Extended the type union with explicit 'init' and 'bridge_write' variants, moved initLbug failures to 'init', moved writeBridge failure summary to 'bridge_write', and documented each label kind on the new ExtractorKind type. MEDIUM #10 — MCP schema bounds for group_impact ─────────────────────────────────────────────────────────── group_impact tool schema declared crossDepth, maxDepth, minConfidence, timeout as bare `type: 'number'` — clients had no way to discover valid ranges without trial-and-error. Added JSON Schema draft-7 `minimum`/`maximum` bounds mirroring the server-side validation in GroupService.groupImpact(). Extended ToolDefinition interface to allow the new keys. MEDIUM #11 — typed GroupImpactResult.crossDepthWarning ─────────────────────────────────────────────────────────── service.ts's groupImpact was mutating the result via `(result as unknown as Record<string, unknown>).crossDepthWarning = …`. This cast defeated the type system. Added `crossDepthWarning?: string` to GroupImpactResult in types.ts and removed the double-unknown cast in both runGroupImpactLegacy and runGroupImpact code paths. MEDIUM #12 — retryRename unit tests ─────────────────────────────────────────────────────────── retryRename had zero test coverage — critical code for Windows CI, tested only implicitly via writeBridge on macOS/Linux (where EBUSY never fires). Exported the function and added four vi.spyOn-based tests: retries on EBUSY, rethrows non-retryable codes (ENOENT), gives up after attempt count, retries on EACCES. ═══════════════════════════════════════════════════════════════════════ Intentionally deferred to follow-up PRs ═══════════════════════════════════════════════════════════════════════ * AbortSignal plumbing for Phase-1 timeout — requires changing the port.impact / impactByUid interface across lbug-adapter and GroupToolPort. Too broad for a correctness fix batch; tracked as follow-up. * Duplication between runGroupImpact and runGroupImpactLegacy (~500 LOC that share ~70% of logic) — genuine maintenance burden but a dedicated refactor PR is the right tool, not this batch. * Low-impact items: CLI double-open of bridge, proto parser performance optimization, loadPackageDefinition scanner false positives, `catch(err: any)` → `unknown` cleanup. All safe to land separately. ═══════════════════════════════════════════════════════════════════════ Test deltas (all via --pool=forks, files run individually) ═══════════════════════════════════════════════════════════════════════ bridge-db.test.ts 15 → 21 (+6: report shape, dropped links, 4× retryRename) sync.test.ts 10 → 11 (+1: initLbug failure path) manifest-extractor.test.ts 5 → 8 (+3: exact match, synthetic uid, regression cases) service.test.ts 25 → 31 (+6: range validation, safeLocalImpact wrap) grpc-extractor.test.ts 41 → 43 (+2: proto strings, comments) topic-extractor.test.ts 29 → 30 (+1: sarama loop regression) matching.test.ts 28 → 28 (comment change only) ────────────────────────────────────── NEW TESTS: +19 GROUP INTEGRATION: 10/10 green (cli, impact, sync, monorepo) Co-authored-by: Claude <noreply@anthropic.com>
14c0e7d to
d15b8cb
Compare
Hi @abhigyanpatwari — heads-up and an offer. Apologies for the commit noise on this PR. 14 commits from my side, After the last round I also ran a systematic self-review of the Two options — whichever saves your review timeOption A — merge as-is with Squash & Merge. Option B — I close this and open a fresh PR with a clean 3–4
Same content, same CI result, reviewer-friendly diffs per logical step. No preference from my side — whichever works best for your review |
|
@ivkond Thank you for your contribution! You've made an amazing impact with you changes. We would like to ask you to split this up, thus we can iterate on it much confidentially. |
@magyargergo Thanks — happy to do this. Here's what I'd propose; let me Proposed split — 4 stacked PRsLinear dependency chain, each PR has one clear responsibility. Every
PR 1 and PR 2 have no code-level coupling (only a type import) and Alternatives I considered
Thanks for the quick review turnaround. |
|
Please proceed one by one. I can see they have linear deps on them? Is there a way you could split them up, so they can be done parallel? |
|
Good question. The only thing making PR 1 → PR 2 linear in my first Revised split — 5 PRs, with PR 1 and PR 2 truly parallel
So it's still "one at a time" in terms of review context-switching, I'll start on PR 0 as soon as you confirm the shape. |
|
Nice! This sounds like a good plan! Please create tickets for them so we can track this progress |
|
Done! Tickets for the split:
Each issue has full Dependency chain: #790 first, then #791 and #792 in parallel, then #793, then #794. #606 will stay open until #794 merges, at which point it will be closed with backlinks to all 5. Starting on #790 now. |
|
Thank you for doing it! Let me know when you are ready with the first PR and I'll review it ASAP! |
…#606 split) (#795) * feat(group): bridge.lbug storage + contract matching expansion Part 1 of 4 in the split of #606 (ticket: #791, closes #790 with a revised plan per @magyargergo's request). ## What changed Adds the LadybugDB-backed bridge storage infrastructure and extends the contract matching algorithm with wildcard support. All changes are additive: storage.ts, sync.ts, service.ts, cli/group.ts, mcp/tools.ts are left on their upstream main versions and will migrate to the new bridge in follow-up PRs (#792, #793, #794). ### Files **New (844 LOC prod):** - `gitnexus/src/core/group/bridge-db.ts` — atomic write-to-temp with `retryRename` for Windows EBUSY/EPERM, per-item write tolerance via `WriteBridgeReport`, `findContractNode` with three-tier symbol lookup (uid → filePath+name → filePath) - `gitnexus/src/core/group/bridge-schema.ts` — schema DDL - `gitnexus/src/core/group/normalization.ts` — contract ID canonicalization + `dedupeContracts` / `dedupeCrossLinks` helpers used by both matching and bridge write **Modified (+137 LOC prod):** - `gitnexus/src/core/group/matching.ts` — adds `runWildcardMatch` for `grpc::Service/*` wildcard consumers, `buildProviderIndex` helper, and canonical gRPC ID handling in `normalizeContractId` - `gitnexus/src/core/group/types.ts` — `MatchType` gains `'wildcard'`; new `BridgeHandle` and `BridgeMeta` interfaces **New tests (658 LOC):** - `gitnexus/test/unit/group/bridge-db.test.ts` — core write/read round trip, `WriteBridgeReport` shape, dropped-links counter, retryRename behavior on EBUSY/ENOENT/EPERM/EACCES - `gitnexus/test/unit/group/bridge-db-edge.test.ts` — edge cases (malformed meta, missing contract nodes, concurrent access) **Modified tests (+225 LOC):** - `gitnexus/test/unit/group/matching.test.ts` — wildcard consumer matching, gRPC canonical ID handling, same-service guard ### Self-review fixes folded in Carried forward from the original #606 self-review: - `writeBridge` try/finally handle lifecycle + `handleClosed` sentinel - `openBridgeDbReadOnly` partial-handle cleanup - `writeBridgeMeta` uses `retryRename` for Windows consistency - `retryRename` unit tests (was zero coverage) - Per-item try/catch around every CREATE loop so one malformed contract doesn't abort the whole write - Dropped cross-link counter (`linksDroppedMissingNode`) ### Why now magyargergo asked for the #606 PR to be split so we can iterate with confidence (#606 (comment)). This is the foundational layer — pure infra, no user-facing surface, no callers of the new APIs in this PR. Later PRs wire it in. ### How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/bridge-db.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/bridge-db-edge.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/matching.test.ts --pool=forks` - Pre-commit hook runs clean ### Risk / rollback **Low.** All new code sits under `src/core/group/` in new files plus a minimal `+16/-1` diff to `types.ts` and a `+136/-0` diff to `matching.ts` (both purely additive). No existing callers reference the new APIs (bridge-db, openBridgeOrFallback, runWildcardMatch) — the PRs that wire them in come later in the split chain. Rollback = `git revert` of the merge commit; no state introduced, no schema migration triggered. ### Scope discipline (per GUARDRAILS.md) - Only the 8 files listed above are touched; no drive-by refactors - No CI/release/security config changes - No secrets, tokens, or machine-specific paths - Content is lifted from the #606 branch which already passed CI 11/11 green on `d15b8cb` (before the split) ### Dependencies - **Base:** `main` (no dependencies on other split PRs) - **Blocks:** extractor expansion (#792), sync pipeline (#793), cross-impact feature (#794) - **Related ticket:** #791 Co-authored-by: Claude <noreply@anthropic.com> * fix(group): address @claude review on #795 Addresses the findings from the automated review on PR #795 (#795 (comment) — posted by @magyargergo / claude-code Action run). ### Medium severity (reviewer flagged as blockers) - **bridge-db.ts `openBridgeDbReadOnly` bak recovery** — the `.bak` recovery path used bare `fsp.rename(bakPath, dbPath)`, which is exactly the scenario most likely to hit Windows EBUSY/EPERM (an interrupted writer still holding the handle for a few ms). Switched to `retryRename` for consistency with the rest of the file's Windows-safe rename path. - **bridge-db.ts `ensureBridgeSchema` error detection** — the inline `msg.includes('already exists')` substring match has been lifted into a named constant `LBUG_ALREADY_EXISTS_MSG` with a comment documenting the coupling to LadybugDB's error message wording and why we can't use `IF NOT EXISTS` (LadybugDB DDL doesn't support it) or typed errors (LadybugDB's JS driver doesn't expose error codes). Also tightened the `catch (err: any)` to `catch (err: unknown)`. - **bridge-db.ts `findContractNode` — extracted out of writeBridge** — the 35-line async closure living inside `writeBridge` has been lifted to three module-level functions: `createContractLookupIndex`, `indexContract`, and `findContractNode`. `findContractNode` is now a pure synchronous function taking a prebuilt index instead of doing its own DB queries. The `writeBridge` cross-link loop is now ~25 lines instead of ~100. - **bridge-db.ts `findContractNode` — N+1 query elimination** — the old inner-closure version issued up to 6 DB round-trips per cross-link (2 endpoints × up to 3 tiers of fallback queries). For a group with 1000 cross-links, that's up to 6000 DB queries just to resolve endpoints. The new version consults an in-memory `ContractLookupIndex` built incrementally as contracts are inserted (`indexContract` called AFTER each successful insert so failed inserts don't poison the index). Cross-link resolution is now O(1) per link instead of O(3) DB queries per link, with zero DB round-trips during the cross-link loop. ### Minor severity - **bridge-db.ts `queryBridge` empty-array guard** — if LadybugDB ever returns an empty `QueryResult[]` at the top level (shouldn't happen with single-statement calls, but driver contract isn't explicit), the old code would call `.getAll()` on `undefined` and crash with a confusing stack. Added an `unwrapQueryResult` helper that throws an explicit `'empty QueryResult array'` error instead, making a potential driver regression visible immediately. - **normalization.ts `contractRichness` weights** — added a block-level comment documenting the weight ordering (+3 for symbolUid, +2 for each symbol-identifying field, +1 for service tag or non-manifest origin) and explicitly noting that the absolute numbers don't matter, only the relative ordering. Matches the "comment for contributors" suggestion in the review. - **bridge-schema.ts `BRIDGE_SCHEMA_VERSION` migration comment** — added a 4-point contract explaining what bumping the constant means ("discard and re-sync" strategy for V1, no in-place migration yet, new migration logic should live in a separate `bridge-migrations.ts` module when it becomes necessary). - **test/unit/group/fixtures.ts** — extracted the `makeContract` helper previously copy-pasted between `bridge-db.test.ts` and `bridge-db-edge.test.ts` into a shared fixtures module. Both test files now import from `./fixtures.js`. Kept the scope minimal: fixtures is NOT a general-purpose factory module, just the shared baseline contract builder. ### New tests Added 9 pure-function unit tests for the now-extracted `findContractNode` in `bridge-db.test.ts`: - returns null on empty index - tier 1 (symbolUid) match, including repo-scope and role-scope isolation - tier 2 (filePath + symbolName) fallback when symbolUid is empty or mismatches - tier 3 (filePath only) when exactly one contract lives in the file, and refusal when multiple do - priority ordering when multiple tiers could resolve These are fully isolated — no DB, no temp directories, no native LadybugDB binding — so they run in <10ms total and are immediately trustworthy as a regression safety net. ### Deliberately deferred (reviewer marked as "fine for now") - `BridgeHandle._db` / `._conn` typing to `unknown` with casts in `bridge-db.ts` — reviewer's note: "The typing is fine for now." - Batch inserts via `UNWIND` — needs LadybugDB support confirmation, tracked as a follow-up; the per-item pattern remains. - `queryBridge` prepared-statement lifecycle — the current pattern (prepare → execute → GC) relies on LadybugDB's internals, worth verifying against their docs in a separate audit. ### Scope discipline (per `GUARDRAILS.md`) - Only files touched by this PR (`bridge-db.ts`, `bridge-schema.ts`, `normalization.ts`, both bridge test files, new `fixtures.ts`) — no drive-by refactors - No CI/release/security config changes - No secrets ### Test + typecheck status - `npx tsc --noEmit` clean - `bridge-db.test.ts`: added 9 `findContractNode` tests, all pass in isolation. The full-file run still hits the pre-existing native LadybugDB cleanup segfault that flakes the reported count — same as every prior commit on this branch, not a regression. - `bridge-db-edge.test.ts`: 4/4 pass - `matching.test.ts`: 28/28 pass - `types.test.ts`: 5/5 pass - `retryRename` tests (4/4) and `findContractNode` tests (9/9) verified in isolation via `-t` filter Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
…lit) (#796) * feat(group): extractor expansion + manifest extractor Part 2 of 4 in the split of #606 (ticket: #792). Follows #795 (bridge.lbug storage foundation, already merged), but this PR has no code-level dependency on #795 — it only imports types and the ContractExtractor interface that existed on upstream main before either PR. It could have been reviewed in parallel with #795. ## What changed Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups. ### New file (228 LOC) - `gitnexus/src/core/group/extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml`-declared cross-links. HTTP paths are canonicalized before Route.name matching; gRPC is resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::<repo>::<contractId>` uid when the graph has no matching symbol, so cross-impact traversal still has a stable anchor for the contract. ### Modified extractors (+958 LOC prod) - `extractors/grpc-extractor.ts` (+522) — `.proto` parser with comment and string-literal sanitization (braces inside strings no longer truncate service bodies); package/service/method canonical IDs; server/client detection across Go (`grpc.NewServer`, `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`, `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`). - `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI decorators, Java Spring `@RequestMapping`/`@GetMapping`, restTemplate / WebClient / OkHttp consumers. - `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection (replaces a constructor-anchored regex that missed topics inside producer loops), kafka-go Writer/Reader, Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream helpers. ### Modified and new tests (+1264 LOC) - `grpc-extractor.test.ts` (+539) — full coverage of the new proto parser (strings-with-braces regression, comments-with-braces regression), per-language server/client detection - `http-route-extractor.test.ts` (+240) — per-framework route extraction + normalization edge cases - `topic-extractor.test.ts` (+177) — the sarama in-loop regression, JetStream, Python NATS, kafka-go Writer/Reader - `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization, gRPC exact lookup with proto-fallback regression, lib and topic exact matching, synthetic-uid fallback behavior ### Self-review fixes folded in Carried forward from the #606 self-review (commit `d15b8cb`): - **HIGH #1** — `manifest-extractor.resolveSymbol` was too fuzzy. Previously used `CONTAINS` on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences: `/orders` matched `/suborders`, and any repo with any `.proto` file returned a random proto symbol for a gRPC manifest entry. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback for unresolved manifests. Regression tests included. - **MED #3** — gRPC proto parser brace-depth counting now sanitizes strings and comments first (`stripProtoCommentsAndStrings`). A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal, silently dropping methods after the offending string. Regression tests for both string-with-brace and comment-with-brace cases. - **MED #4** — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal directly). Regression test with a for-loop that constructs multiple `ProducerMessage`s. - **MED #7** — `manifest-extractor.resolveSymbol` no longer has a silent `catch { /* fall through */ }`. Errors from the graph executor are logged via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path. ## Why Reviewer focus here is pure regex / parser correctness — no storage, no Cypher queries, no algorithmic changes to the cross-link algorithm. Separating this from the bridge foundation PR (#795) meant reviewers could stay in a single mental mode (parsing logic) instead of context-switching between DDL, Cypher, and regex. ## How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks` Local pre-push: typecheck clean, all 99 extractor unit tests pass (grpc 43, http 18, topic 30, manifest 8). ## Risk / rollback **Low.** Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` that is consumed by `sync.ts` in the next split (#793). No existing behavior changes for users who don't run a `group sync`. Rollback = `git revert` of the merge commit; the modifications to `grpc-extractor.ts` / `http-route-extractor.ts` / `topic-extractor.ts` revert to the pre-PR versions that still work (they're subsets of the new functionality). ## Scope discipline (per GUARDRAILS.md) - Only the 8 files above are touched; no drive-by refactors - No CI/release/security config changes - No secrets or machine-specific paths - Content lifted from #606 (CI 11/11 green on `d15b8cb`) ## Dependencies - **Base:** `main` (upstream already includes #795 as `1ff324c`) - **Blocks:** sync pipeline (#793) and the cross-impact feature (#794) - **Tracker issue:** #792 - **Parent PR:** #606 Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate topic-extractor from regex to tree-sitter queries Addresses @magyargergo's feedback on #796 that regex-based lookups should use tree-sitter nodes instead, and that the top-level extractors must NOT carry language dependencies. This is phase 1 of a multi-step migration — topic-extractor first because its patterns are the most uniform (16 "call/annotation with first-arg string literal" variants), which makes it a clean proof of the approach before grpc-extractor and http-route-extractor get the same treatment. ## Architecture: language-agnostic orchestrator + per-language plugins The top-level extractor is a thin orchestrator that never imports a tree-sitter grammar or a query string. Per-language knowledge lives in a new `topic-patterns/` folder with one file per language plus a registry that maps file extensions to compiled plugins: ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic scanning utilities ├── topic-extractor.ts # thin orchestrator (no grammar imports) └── topic-patterns/ ├── types.ts # TopicMeta, Broker ├── index.ts # registry: extension → compiled provider ├── java.ts # tree-sitter-java + JAVA_TOPIC_PROVIDER ├── go.ts # tree-sitter-go + GO_TOPIC_PROVIDER ├── python.ts # tree-sitter-python + PYTHON_TOPIC_PROVIDER └── node.ts # tree-sitter-javascript + tree-sitter-typescript # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER ``` **Shared scanner (`tree-sitter-scanner.ts`)** — defines `PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>` and the `scanFile(parser, plugin, content)` helper. Plugins compile their queries eagerly at module load via `compilePatterns()`, so a broken pattern fails loudly at import time instead of silently at scan time. `unquoteLiteral()` handles single/double/template quotes, Python triple-quoted strings, and Go raw backtick strings. **Per-language plugins** own: - the tree-sitter grammar import (this is the ONLY place in `src/core/group/` where tree-sitter grammars are imported), - the query S-expressions, - the `TopicMeta` payload (role, broker, confidence, symbolName) that the orchestrator receives back on every match. Each plugin uses a `@value` capture name to bind the topic literal node. The JavaScript and TypeScript grammars share AST node names for every construct we query, so `node.ts` defines the pattern sources once and compiles them against `JavaScript`, `TypeScript.typescript`, and `TypeScript.tsx` — exporting three providers because `Parser.Query` objects are NOT portable across grammar instances. **Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider, `.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX. Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single file-level edit (drop `topic-patterns/<lang>.ts`, import + register it here — zero edits required in `topic-extractor.ts`). **Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser, provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`. Reuses one `Parser` instance across files; the scanner calls `setLanguage` per plugin. ## Why this is better than regex 1. **Comments and strings are respected for free.** The old regex would match `// kafkaTemplate.send("fake.topic")` as a real producer; tree-sitter never visits comments or string literals as code nodes, so false positives from commented-out code are eliminated. 2. **Struct/object literal patterns are structural, not textual.** `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char lookahead (which was a known cross-match bug partly mitigated by a loop regression test in the self-review). The new query matches a specific `composite_literal` with a specific `qualified_type` and `keyed_element` — exactly one struct literal per match. 3. **No order-of-operations fragility.** Regex for `channel.publish` vs `channel.consume` was independent and file-wide; the AST scopes matches to the specific `call_expression`. 4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic detection later means dropping one file in `topic-patterns/` — no changes to shared scanner or orchestrator, and no tree-sitter imports leak into top-level code. ## Per-file fault tolerance - Malformed files that tree-sitter can't parse are silently skipped (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline already logs unparseable files at index time. - A syntactically invalid query is caught at `compilePatterns` time, not scan time — broken plugins fail loudly at import. - Per-pattern `matches()` failures are swallowed so one broken query in a plugin doesn't block the rest. ## Tests All 30 existing `topic-extractor.test.ts` tests pass **without any changes to the test file** — they were written as input/output contract tests (given this source file, expect these `ExtractedContract` objects) and that contract is unchanged. Regression coverage includes: - Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node `producer.send` + `consumer.subscribe`; Go sarama producer/consumer (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` + `producer.send/produce` - RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`; Node `channel.consume/publish/sendToQueue`; Python `basic_consume/ basic_publish` with keyword args - NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream `js.Subscribe/Publish`; Python `await nc.subscribe/publish` Including the regression test for the sarama `ProducerMessage` in-loop case — the AST-based query captures every literal in the file independently, not just the first one after `NewSyncProducer`. ## Neighbor regression check - `topic-extractor.test.ts` — 30/30 pass (rewritten extractor) - `http-route-extractor.test.ts` — 18/18 pass (untouched) - `grpc-extractor.test.ts` — 43/43 pass (untouched) - `manifest-extractor.test.ts` — 8/8 pass (untouched) - Full `npx tsc --noEmit` clean ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched; no changes to other extractors, tests, MCP surface, or pipeline.ts. - No CI/release/security config changes, no secrets. - New tree-sitter imports all reference grammars that are already installed as dependencies (`tree-sitter`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`, `tree-sitter-go` — all in `package.json` for the existing pipeline). ## Phase 2 / phase 3 plan - **Phase 2 (next commit):** rewrite `http-route-extractor.ts` Strategy B (regex fallback) on the same plugin pattern. Graph-assisted Strategy A stays as-is (already uses pipeline-built tree-sitter data via `HANDLES_ROUTE` Cypher queries). - **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java / Go / Python / TypeScript detection. `.proto` files are the one outstanding question — there is no `tree-sitter-proto` grammar installed; the in-tree string-sanitizing parser stays as a pragmatic exception with a comment, alternative being to add `tree-sitter-proto` as a dep (open for the maintainer). Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins Phase 2 of the extractor refactor requested by @magyargergo on #796. Same architecture as the phase 1 topic-extractor rewrite: a thin, language-agnostic orchestrator plus per-language plugins that own tree-sitter grammars and query sources. The top-level extractor file no longer imports any tree-sitter grammar or query string. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic primitives ├── http-route-extractor.ts # thin orchestrator (no grammar imports) └── http-patterns/ ├── types.ts # HttpDetection, HttpLanguagePlugin, HttpRole ├── index.ts # registry: ext → plugin + HTTP_SCAN_GLOB ├── java.ts # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp ├── go.ts # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers ├── python.ts # tree-sitter-python: FastAPI + requests ├── php.ts # tree-sitter-php: Laravel Route::get/... └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # NestJS controllers, Express, fetch, axios ``` **Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1: - `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named capture the query binds, not just a single `@value`). Topic extractor updated to read `match.captures.value` accordingly. - New `runCompiledPatterns(plugin, tree)` helper lets plugins run multiple query bundles against the same pre-parsed tree. This is needed for HTTP plugins that combine a class-prefix query with a method-route query (Spring, NestJS). - `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`. **HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a `scan(tree)` function rather than a flat pattern list. This reflects HTTP's more complex extraction: each detection needs method + path + handler name, and framework patterns like Spring `@RequestMapping` / NestJS `@Controller` require cross-referencing a class-level prefix with method-level annotations. Plugins internally use `compilePatterns` + `runCompiledPatterns` and walk the AST to resolve the class/method relationships. **Per-framework coverage:** - **Java (`java.ts`)** - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put| Delete|Patch)Mapping("/sub")` method routes, joined via the enclosing `class_declaration` node id. - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` → method derived from API name. - `WebClient.method(HttpMethod.X, "/path")` → method from `HttpMethod.X` capture. - `new Request.Builder().url("/path")` → OkHttp consumer. - **Go (`go.ts`)** - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures upper-case verb + handler identifier. - `net/http.HandleFunc("/path", handler)` → provider (default GET). - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`, resty `client.R().Get/Post/...`. - **Python (`python.ts`)** - `@app.get("/path")` FastAPI decorators. - `requests.get/post/...` and `requests.request("METHOD", "url")`. - **PHP (`php.ts`)** - Laravel `Route::get/post/.../patch('/path', ...)` via `scoped_call_expression`. Uses `PHP.php_only` to match the existing ingestion pipeline's grammar selection. - **Node (`node.ts`) — JS + TS + TSX** - Pattern sources defined once, compiled against three grammar variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`) because `Parser.Query` objects are not portable across grammars. Exports three plugins sharing the same `scan` logic. - NestJS: `@Controller('prefix')` decorators are siblings of the class in `export_statement` / `program`; `@Get(':id')` decorators are siblings of the method in `class_body`. The plugin walks decorator → next named sibling to find the decorated class / method, then combines the class prefix with the method path. Only emits NestJS detections when the enclosing class has a real `@Controller` decorator — prevents false positives from generic classes that happen to use `@Get` from another library. - Express: `(router|app).<verb>('/path', ...)`. - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })` (uses two queries + a SyntaxNode-id dedupe set so URL literals aren't double-emitted by the options variant). - `axios.get/post/...`. ## Orchestrator changes `http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers` regex method and replaces them with a single source-scan loop that delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator still owns: - **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`) — language-agnostic string processing shared by both strategies. - **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS` Cypher queries) — unchanged in spirit. The only regex helpers it used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now replaced by a lookup against the plugin's detections for the same file: for each route row, find the detection whose normalized path matches, and pull the HTTP method + handler name from it. - **Per-file parse cache** — the orchestrator parses each relevant file at most once per `extract()` call. Both the graph-assisted enrichment loop and the source-scan fallback share the same `cachedDetections` map, so we never run the plugin twice for the same file. ## Why this is better than the regex version 1. **Comments and strings for free.** The old regex would match `// router.get('/fake')` as a real Express route; tree-sitter never visits string/comment nodes. 2. **Structural controller-prefix.** Spring and NestJS class-prefix joining is now scoped to the enclosing class via `class_declaration` node ids, eliminating file-wide state that broke when a file had multiple controllers. 3. **Precise NestJS disambiguation.** The plugin only emits a NestJS detection when the enclosing class has a real `@Controller` decorator — the old regex would fire on any `@Get(...)` in the file regardless of surrounding context. 4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP detection later means dropping one file in `http-patterns/` — no changes to the shared scanner, the orchestrator, or the Strategy A Cypher queries. ## Tests - `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged; they're contract-style input/output tests and the contract shape is unchanged). Covers Spring class prefix, Express, gin/echo, stdlib HandleFunc, NestJS, Laravel, FastAPI for providers and fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/ resty for consumers, plus graph-first Strategy A for both. - `topic-extractor.test.ts` — **30/30 pass** after the `captures.value` API migration. - `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3). - `manifest-extractor.test.ts` — 8/8 pass (untouched). - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No changes to pipeline.ts, MCP surface, ingestion, or tests. - No CI / release / security / secrets changes. - Tree-sitter grammars imported by plugins (`tree-sitter-java`, `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already in `package.json` for the existing ingestion pipeline. ## Phase 3 plan - **grpc-extractor** gets the same treatment: plugin-per-language under `grpc-patterns/` for Java / Go / Python / TS detection. `.proto` files remain an open question — no `tree-sitter-proto` grammar is installed, so the in-tree string-sanitizing parser from PR #796's self-review stays as a pragmatic exception unless the maintainer wants us to add `tree-sitter-proto` as a new dep. Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate grpc-extractor source scans to tree-sitter plugins Phase 3 (final) of the extractor refactor requested by @magyargergo on #796. Same architecture as phase 1 (topic) and phase 2 (http): thin language-agnostic orchestrator + per-language plugins that own tree-sitter grammars and query sources. With this commit the top-level extractors under `src/core/group/extractors/` import ZERO tree-sitter grammars and ZERO query strings — every grammar import lives in a `*-patterns/<lang>.ts` plugin file, and the orchestrators go through the registry indirection. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared primitives (unchanged) ├── grpc-extractor.ts # orchestrator (only `.proto` parser left) └── grpc-patterns/ ├── types.ts # GrpcDetection, GrpcLanguagePlugin, GrpcRole ├── index.ts # registry: ext → plugin + GRPC_SCAN_GLOB ├── go.ts # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient ├── java.ts # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub ├── python.ts # tree-sitter-python: add_XxxServicer_to_server + XxxStub └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # @GrpcMethod, @GrpcClient field type, # .getService<X>('Svc'), new XxxServiceClient, # loadPackageDefinition dynamic constructors ``` ## Per-language coverage **Go (`go.ts`)** - Provider: `\w+.RegisterXxxServer(...)` via `call_expression → selector_expression → field_identifier` + JS regex filter `^Register(\w+)Server$`. - Provider: `pb.UnimplementedXxxServer` embedded in a struct via `struct_type → field_declaration_list → field_declaration → qualified_type → type_identifier` + JS filter. - Consumer: `\w+.NewXxxClient(...)` via the same call_expression query + JS filter `^New(\w+)Client$`. **Java (`java.ts`)** - Provider: `class X extends YyyGrpc.YyyImplBase` — two queries handle the scoped and plain forms. `scoped_type_identifier`'s children are positional (no `scope:`/`name:` fields), so the query matches the two `type_identifier` children by position. - `#match? @inner "ImplBase$"` restricts matches at query time. - Whether the class has `@GrpcService` or not controls only the `source` metadata label — the plugin walks the class_declaration's `modifiers` child in JS to detect the marker_annotation. - Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a `method_invocation` query with `#match? @method "^new(Blocking)?Stub$"`, service name extracted via `^(\w+)Grpc$` on the object identifier. **Python (`python.ts`)** - Single call-expression query covers both bare identifier and `obj.method` attribute forms: `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`. - Plugin filters `@fn.text` against two JS regexes: `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$` (consumer), with a reserved-names ignore list for the Stub case (Mock / Test / Fake / Stub). **Node — JavaScript + TypeScript + TSX (`node.ts`)** - Pattern sources defined once, compiled three times (one per grammar) because `Parser.Query` objects are not portable across grammars. Exports three `GrpcLanguagePlugin`s sharing the same `scan`. - `@GrpcMethod('Service', 'Method')`: decorator query captures the two string literals. Confidence is hard-coded 0.8 regardless of proto map resolution (matches the original regex version's behaviour). - `@GrpcClient(...) field: XxxServiceClient`: decorator query captures the decorator node, plugin walks up to find the enclosing `public_field_definition` (decorators on fields are CHILDREN of the field definition in tree-sitter-typescript, not siblings) and reads its first `type_annotation → type_identifier`, then runs the `^(\w+Service)Client$` JS filter. - `client.getService<X>('AuthService')`: call-expression query on `member_expression.property = "getService"` + string literal arg. - `new XxxServiceClient(...)`: `new_expression` with a bare identifier constructor, filtered by `^(\w+Service)Client$` so generic `new AuthClient(...)` (missing the `Service` infix) does NOT falsely register as a consumer. Preserves the regression test `test_extract_ts_non_service_client_constructor_is_ignored`. - `loadPackageDefinition` dynamic loader: gated on `tree.rootNode.text.includes('loadPackageDefinition')`. When set, `new foo.bar.Xxx(...)` qualified constructors with a capitalised property name register as consumers. ## Orchestrator changes `grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders` / ... helper and replaces them with a single source-scan loop that: 1. Parses each file with the plugin's grammar (one shared `Parser` instance across all files, `setLanguage` called per plugin). 2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`. 3. Converts each detection to an `ExtractedContract` via the private `detectionToContract` helper, which: - Looks the short service name up in the proto map (filled by the `.proto` parser). - Picks confidence = `confidenceWithProto` if resolved, else `confidenceWithoutProto`. - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when the detection carries a `methodName` (TS `@GrpcMethod` only), otherwise a service-level id (`grpc::pkg.Svc/*`). Everything else — the `.proto` parser, `buildProtoContext`, `buildProtoMap`, `resolveProtoConflict`, `serviceContractId`, `stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe function — stays exactly as before. The `.proto` parser is kept as a pragmatic exception to the "no regex in extractors" rule because no `tree-sitter-proto` grammar is installed in the repo; a comment at the top of the file explains this and flags the maintainer option of adding `tree-sitter-proto` as a dependency. ## Why this is better than the regex version 1. **Comments and strings are respected for free.** Matched node types are only code constructs, never text inside comments or string literals. 2. **No false positives on partial names.** The old `(\w+?)Grpc`-style regexes would cross-match unrelated identifiers; structural queries restrict matches to the exact AST shape (`scoped_type_identifier → type_identifier` pairs, `method_invocation → identifier` etc.). 3. **NestJS `@GrpcClient` is structural, not regex-based.** The old regex required a specific textual layout (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the plugin now walks the AST, so modifier order / optional modifiers / multi-line formatting don't break it. 4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC detection later is a one-file edit in `grpc-patterns/index.ts` — no touches to the shared scanner, the orchestrator, or the proto parser. ## Tests - `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the contract shape is identical). Covers .proto parsing (including the brace-inside-string regression), Go provider/consumer, Java @GrpcService / plain ImplBase provider + newBlockingStub consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient + .getService + new XxxServiceClient + loadPackageDefinition + the `AuthClient` vs `AuthServiceClient` discrimination, dedupe across multiple patterns in one file, proto-aware confidence, and the inherited-package resolution for split proto definitions. - `topic-extractor.test.ts` — 30/30 pass. - `http-route-extractor.test.ts` — 18/18 pass. - `manifest-extractor.test.ts` — 8/8 pass. - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No pipeline.ts, MCP surface, ingestion, CI / release / security, or test changes. - New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`, `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already installed for the ingestion pipeline. ## End of phase series This commit completes the three-phase extractor refactor: - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/` - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/` - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/` Every remaining regex-based extractor helper under the `src/core/group/ extractors/` directory is either (a) language-agnostic string processing (path normalization, dedupe keys) or (b) the `.proto` parser, which is documented as an explicit exception. Co-authored-by: Claude <noreply@anthropic.com> * feat(group): add tree-sitter-proto for .proto file parsing Addresses @magyargergo's suggestion on #796 to replace the manual string-sanitizing .proto parser with a tree-sitter grammar. - **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`. Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto) (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli 0.24` to produce ABI version 14 — compatible with the project's `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as `optionalDependency` with `file:./vendor/tree-sitter-proto`. - **New `grpc-patterns/proto.ts` plugin** — uses the same `compilePatterns` + `runCompiledPatterns` infrastructure as every other plugin. Two queries: - `(package (full_ident) @pkg)` — package declaration - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))` — one match per (service, rpc) pair - **Graceful fallback** — `tree-sitter-proto` is an optional dependency. If it fails to install (platform incompatibility) or fails the runtime smoke-test (`setLanguage` + `parse` on a trivial proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator uses the existing manual parser. The smoke-test catches the `SyntaxNode` TDZ error that occurs in vitest's fork-based test runner. - **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto` files are handled by the plugin loop (they're included in `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is skipped. `buildProtoContext` still runs to build the proto map for cross-referencing source-file detections. 1. **No manual comment/string stripping.** The old parser needed `stripProtoCommentsAndStrings` (110 lines) to avoid counting braces inside comments and string literals. tree-sitter handles this natively. 2. **No brace-depth tracking.** `extractServiceBlocks` used a manual depth counter to find service boundaries. tree-sitter's AST gives us `service` → `service_name` + `rpc` → `rpc_name` directly. 3. **Performance.** tree-sitter's C-based parser is faster than character-by-character JS scanning + regex on large proto files. - `grpc-extractor.test.ts` — **43/43 pass** (unchanged) - All other extractor tests — 99/99 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * chore: add .gitignore for vendored tree-sitter-proto build artifacts https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * fix: correct .gitignore paths for vendored tree-sitter-proto Patterns should be relative to the .gitignore file's directory. https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * refactor(group): address Copilot review feedback on #796 Six fixes suggested by the Copilot AI review: 1. **`normalizeHttpPath` root-path edge case** — stripping trailing slashes on the input `/` produced an empty string, yielding malformed contract ids like `http::GET::`. Now preserves `/` for the root handler/fetch case. 2. **Dedupe `scanFiles` call** — `extract()` was globbing the source-scan file list twice (once for the provider fallback, once for the consumer fallback). Moved to a single lazy call that memoizes the result for the rest of the method. 3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other extractor's glob already ignored vendored sources; the HTTP one didn't. Fixed for consistency. 4. **`loadPackageDefinition` check is now structural** — was calling `tree.rootNode.text.includes('loadPackageDefinition')` which forces materialization of the entire file text from the parse tree (expensive on large files). Replaced with a dedicated compiled query on `(call_expression function: [(identifier) | (member_expression)])` so the check stays in the AST domain. 5. **`grpc-extractor.ts` header docstring updated** — still claimed ".proto parsing is not tree-sitter-based because no grammar is installed". Now describes the actual behaviour: tree-sitter when `tree-sitter-proto` is available (optionalDependency), manual fallback otherwise. 6. **Eliminated the double proto file parse on the fallback path** — `buildProtoContext` already globs + parses every `.proto` file to build `servicesByName`. On the `!hasProtoPlugin` branch the extractor was globbing + parsing again via the now-removed `parseProtoFile` helper. The fallback branch now iterates the map that `buildProtoContext` already produced to emit provider contracts directly — single pass per proto file. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): address Claude review feedback (bugs + dedup + hygiene) on #796 Follows up `2f28bfc` with the remaining items from the Claude AI review: ## Bugs **Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.** The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x` with no label filter, so a topic/service/package name could silently match any node type (File, Variable, Import, Folder, …). Added label filters: - `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort symbol-name matches against listener/publisher symbols) - `grpc` method → `(n:Function|Method)` - `grpc` service → `(n:Class|Interface)` - `lib` → `(n:Package|Module)` All 8 manifest-extractor tests still pass (mock executor is label-agnostic, but the production LadybugDB graph now gets correctly scoped queries). **Bug 8 — Tautological `!handlerName` condition.** `http-route-extractor.ts:extractProvidersGraph` had `let handlerName = null; if (!method || !handlerName) { ... }` — the `!handlerName` clause was always true since there was no intervening assignment. Simplified to always run the plugin-scan lookup (we need the handler name even when `methodFromRouteReason` already resolved the method). ## Clean code / dedup **Design 7 — `readSafe` was copy-pasted in all three orchestrators.** Extracted to `extractors/fs-utils.ts` as the single source of truth for the path-traversal guard. Dropped the three local copies and the now-unused `fs`/`path` imports from topic-extractor. **Style 10 — Language-specific `_test.go` skip in the topic orchestrator.** Was `if (rel.endsWith('_test.go')) continue;` inside the language- agnostic extraction loop. Pushed into the glob's ignore list (`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`, `dist`, `build` entries, with a comment explaining that other languages' test file conventions either live in separate directories (Python `tests/`, Java `src/test/`) or are already covered by the existing ignores. ## Already addressed in `2f28bfc` (mentioned again in Claude review) - Bug 3: `normalizeHttpPath('/')` returns `''` — fixed - Bug 4: double glob + double parse of `.proto` — fixed - Bug 5: `scanFiles` called twice in HTTP — fixed - Bug 6: missing `**/vendor/**` in HTTP glob — fixed - Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')` replaced with a dedicated structural query ## Deferred - Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope; sync.ts matching logic lands in #793, manifest extractor already emits correct synthetic uids for unresolved HTTP contracts. - Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) — the only real use case (`loadPackageDefinition` gate) is already fixed via a structural query, so the interface change would be cosmetic churn without a concrete consumer. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * docs+fix(group): address remaining Claude review items + add pipeline flow chart ## Fixes **Remaining 🔴 — HTTP contract id wildcard format.** Documented the `http::*::<path>` format as an intentional wildcard for manifest links that omit the HTTP method, alongside the explicit-method form (`GET::/path` → `http::GET::/path`). The docblock on `buildContractId` now states both forms, notes that wildcard-aware matching is the responsibility of the sync / cross-impact layer (#793), and recommends the explicit-method form whenever the author knows the method (it round-trips through exact equality without needing wildcard logic downstream). Tests unchanged — the wildcard format is what they've always asserted. **Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The comment claimed "creates a contract with an empty symbolUid/ref" but the code switched to `manifestSymbolUid(repo, contractId)` a few commits back. Updated to describe the actual synthetic-uid fallback semantics and the cross-impact path that relies on both sides of the join deriving the same uid. **Minor 2 — exhaustiveness guard on `buildContractId`.** The `switch(type)` covered all five current `ContractType` variants but silently returned `undefined` if a new variant was added. Added a `default: const _exhaustive: never = type; throw new Error(...)` clause so the build fails loudly on an unhandled variant. **Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already fixed in `2f28bfc` via a dedicated structural query (`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed. ## New: pipeline flow chart (per @magyargergo's request) Added `src/core/group/PIPELINE.md` with four Mermaid diagrams: 1. **High-level overview** — `group.yaml` → extractors + manifest → contract matching → `bridge.lbug` → `runGroupImpact`. 2. **Per-repo extractor two-strategy shape** — graph-assisted Strategy A vs. source-scan Strategy B. 3. **Plugin architecture** — orchestrator → registry → per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` → `ExtractedContract`. 4. **Manifest extraction** — label-scoped `resolveSymbol` with the synthetic-uid fallback. 5. **Cross-impact query (#606)** — local impact → bridge join → cross-repo fan-out. Each diagram is annotated with which PRs own which stage (this PR: extractors + manifest; #795: bridge storage; #606: cross-impact runtime) and points at the concrete files/functions involved. ## Tests - 99/99 extractor tests pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
|
@abhigyanpatwari @magyargergo — closing this one. The split plan agreed in #issuecomment-4229689301 is complete. Everything from this PR landed piecemeal via 4 follow-up PRs against the 5 tracking issues:
Net effect on
Happy to open follow-up issues for any of the above if useful. Thanks for the patience with the split and for the reviews along the way 🙏 — closing without merge. |
|
@ivkond thank you for this amazing work! You made a big impact on GitNexus! We appreciate your hard work. We have open positions for maintainers if you are interested. Please join to our community on Discord. |
…abhigyanpatwari#606 split) (abhigyanpatwari#795) * feat(group): bridge.lbug storage + contract matching expansion Part 1 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#791, closes abhigyanpatwari#790 with a revised plan per @magyargergo's request). ## What changed Adds the LadybugDB-backed bridge storage infrastructure and extends the contract matching algorithm with wildcard support. All changes are additive: storage.ts, sync.ts, service.ts, cli/group.ts, mcp/tools.ts are left on their upstream main versions and will migrate to the new bridge in follow-up PRs (abhigyanpatwari#792, abhigyanpatwari#793, abhigyanpatwari#794). ### Files **New (844 LOC prod):** - `gitnexus/src/core/group/bridge-db.ts` — atomic write-to-temp with `retryRename` for Windows EBUSY/EPERM, per-item write tolerance via `WriteBridgeReport`, `findContractNode` with three-tier symbol lookup (uid → filePath+name → filePath) - `gitnexus/src/core/group/bridge-schema.ts` — schema DDL - `gitnexus/src/core/group/normalization.ts` — contract ID canonicalization + `dedupeContracts` / `dedupeCrossLinks` helpers used by both matching and bridge write **Modified (+137 LOC prod):** - `gitnexus/src/core/group/matching.ts` — adds `runWildcardMatch` for `grpc::Service/*` wildcard consumers, `buildProviderIndex` helper, and canonical gRPC ID handling in `normalizeContractId` - `gitnexus/src/core/group/types.ts` — `MatchType` gains `'wildcard'`; new `BridgeHandle` and `BridgeMeta` interfaces **New tests (658 LOC):** - `gitnexus/test/unit/group/bridge-db.test.ts` — core write/read round trip, `WriteBridgeReport` shape, dropped-links counter, retryRename behavior on EBUSY/ENOENT/EPERM/EACCES - `gitnexus/test/unit/group/bridge-db-edge.test.ts` — edge cases (malformed meta, missing contract nodes, concurrent access) **Modified tests (+225 LOC):** - `gitnexus/test/unit/group/matching.test.ts` — wildcard consumer matching, gRPC canonical ID handling, same-service guard ### Self-review fixes folded in Carried forward from the original abhigyanpatwari#606 self-review: - `writeBridge` try/finally handle lifecycle + `handleClosed` sentinel - `openBridgeDbReadOnly` partial-handle cleanup - `writeBridgeMeta` uses `retryRename` for Windows consistency - `retryRename` unit tests (was zero coverage) - Per-item try/catch around every CREATE loop so one malformed contract doesn't abort the whole write - Dropped cross-link counter (`linksDroppedMissingNode`) ### Why now magyargergo asked for the abhigyanpatwari#606 PR to be split so we can iterate with confidence (abhigyanpatwari#606 (comment)). This is the foundational layer — pure infra, no user-facing surface, no callers of the new APIs in this PR. Later PRs wire it in. ### How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/bridge-db.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/bridge-db-edge.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/matching.test.ts --pool=forks` - Pre-commit hook runs clean ### Risk / rollback **Low.** All new code sits under `src/core/group/` in new files plus a minimal `+16/-1` diff to `types.ts` and a `+136/-0` diff to `matching.ts` (both purely additive). No existing callers reference the new APIs (bridge-db, openBridgeOrFallback, runWildcardMatch) — the PRs that wire them in come later in the split chain. Rollback = `git revert` of the merge commit; no state introduced, no schema migration triggered. ### Scope discipline (per GUARDRAILS.md) - Only the 8 files listed above are touched; no drive-by refactors - No CI/release/security config changes - No secrets, tokens, or machine-specific paths - Content is lifted from the abhigyanpatwari#606 branch which already passed CI 11/11 green on `d15b8cb` (before the split) ### Dependencies - **Base:** `main` (no dependencies on other split PRs) - **Blocks:** extractor expansion (abhigyanpatwari#792), sync pipeline (abhigyanpatwari#793), cross-impact feature (abhigyanpatwari#794) - **Related ticket:** abhigyanpatwari#791 Co-authored-by: Claude <noreply@anthropic.com> * fix(group): address @claude review on abhigyanpatwari#795 Addresses the findings from the automated review on PR abhigyanpatwari#795 (abhigyanpatwari#795 (comment) — posted by @magyargergo / claude-code Action run). ### Medium severity (reviewer flagged as blockers) - **bridge-db.ts `openBridgeDbReadOnly` bak recovery** — the `.bak` recovery path used bare `fsp.rename(bakPath, dbPath)`, which is exactly the scenario most likely to hit Windows EBUSY/EPERM (an interrupted writer still holding the handle for a few ms). Switched to `retryRename` for consistency with the rest of the file's Windows-safe rename path. - **bridge-db.ts `ensureBridgeSchema` error detection** — the inline `msg.includes('already exists')` substring match has been lifted into a named constant `LBUG_ALREADY_EXISTS_MSG` with a comment documenting the coupling to LadybugDB's error message wording and why we can't use `IF NOT EXISTS` (LadybugDB DDL doesn't support it) or typed errors (LadybugDB's JS driver doesn't expose error codes). Also tightened the `catch (err: any)` to `catch (err: unknown)`. - **bridge-db.ts `findContractNode` — extracted out of writeBridge** — the 35-line async closure living inside `writeBridge` has been lifted to three module-level functions: `createContractLookupIndex`, `indexContract`, and `findContractNode`. `findContractNode` is now a pure synchronous function taking a prebuilt index instead of doing its own DB queries. The `writeBridge` cross-link loop is now ~25 lines instead of ~100. - **bridge-db.ts `findContractNode` — N+1 query elimination** — the old inner-closure version issued up to 6 DB round-trips per cross-link (2 endpoints × up to 3 tiers of fallback queries). For a group with 1000 cross-links, that's up to 6000 DB queries just to resolve endpoints. The new version consults an in-memory `ContractLookupIndex` built incrementally as contracts are inserted (`indexContract` called AFTER each successful insert so failed inserts don't poison the index). Cross-link resolution is now O(1) per link instead of O(3) DB queries per link, with zero DB round-trips during the cross-link loop. ### Minor severity - **bridge-db.ts `queryBridge` empty-array guard** — if LadybugDB ever returns an empty `QueryResult[]` at the top level (shouldn't happen with single-statement calls, but driver contract isn't explicit), the old code would call `.getAll()` on `undefined` and crash with a confusing stack. Added an `unwrapQueryResult` helper that throws an explicit `'empty QueryResult array'` error instead, making a potential driver regression visible immediately. - **normalization.ts `contractRichness` weights** — added a block-level comment documenting the weight ordering (+3 for symbolUid, +2 for each symbol-identifying field, +1 for service tag or non-manifest origin) and explicitly noting that the absolute numbers don't matter, only the relative ordering. Matches the "comment for contributors" suggestion in the review. - **bridge-schema.ts `BRIDGE_SCHEMA_VERSION` migration comment** — added a 4-point contract explaining what bumping the constant means ("discard and re-sync" strategy for V1, no in-place migration yet, new migration logic should live in a separate `bridge-migrations.ts` module when it becomes necessary). - **test/unit/group/fixtures.ts** — extracted the `makeContract` helper previously copy-pasted between `bridge-db.test.ts` and `bridge-db-edge.test.ts` into a shared fixtures module. Both test files now import from `./fixtures.js`. Kept the scope minimal: fixtures is NOT a general-purpose factory module, just the shared baseline contract builder. ### New tests Added 9 pure-function unit tests for the now-extracted `findContractNode` in `bridge-db.test.ts`: - returns null on empty index - tier 1 (symbolUid) match, including repo-scope and role-scope isolation - tier 2 (filePath + symbolName) fallback when symbolUid is empty or mismatches - tier 3 (filePath only) when exactly one contract lives in the file, and refusal when multiple do - priority ordering when multiple tiers could resolve These are fully isolated — no DB, no temp directories, no native LadybugDB binding — so they run in <10ms total and are immediately trustworthy as a regression safety net. ### Deliberately deferred (reviewer marked as "fine for now") - `BridgeHandle._db` / `._conn` typing to `unknown` with casts in `bridge-db.ts` — reviewer's note: "The typing is fine for now." - Batch inserts via `UNWIND` — needs LadybugDB support confirmation, tracked as a follow-up; the per-item pattern remains. - `queryBridge` prepared-statement lifecycle — the current pattern (prepare → execute → GC) relies on LadybugDB's internals, worth verifying against their docs in a separate audit. ### Scope discipline (per `GUARDRAILS.md`) - Only files touched by this PR (`bridge-db.ts`, `bridge-schema.ts`, `normalization.ts`, both bridge test files, new `fixtures.ts`) — no drive-by refactors - No CI/release/security config changes - No secrets ### Test + typecheck status - `npx tsc --noEmit` clean - `bridge-db.test.ts`: added 9 `findContractNode` tests, all pass in isolation. The full-file run still hits the pre-existing native LadybugDB cleanup segfault that flakes the reported count — same as every prior commit on this branch, not a regression. - `bridge-db-edge.test.ts`: 4/4 pass - `matching.test.ts`: 28/28 pass - `types.test.ts`: 5/5 pass - `retryRename` tests (4/4) and `findContractNode` tests (9/9) verified in isolation via `-t` filter Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
…npatwari#606 split) (abhigyanpatwari#796) * feat(group): extractor expansion + manifest extractor Part 2 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#792). Follows abhigyanpatwari#795 (bridge.lbug storage foundation, already merged), but this PR has no code-level dependency on abhigyanpatwari#795 — it only imports types and the ContractExtractor interface that existed on upstream main before either PR. It could have been reviewed in parallel with abhigyanpatwari#795. ## What changed Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups. ### New file (228 LOC) - `gitnexus/src/core/group/extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml`-declared cross-links. HTTP paths are canonicalized before Route.name matching; gRPC is resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::<repo>::<contractId>` uid when the graph has no matching symbol, so cross-impact traversal still has a stable anchor for the contract. ### Modified extractors (+958 LOC prod) - `extractors/grpc-extractor.ts` (+522) — `.proto` parser with comment and string-literal sanitization (braces inside strings no longer truncate service bodies); package/service/method canonical IDs; server/client detection across Go (`grpc.NewServer`, `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`, `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`). - `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI decorators, Java Spring `@RequestMapping`/`@GetMapping`, restTemplate / WebClient / OkHttp consumers. - `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection (replaces a constructor-anchored regex that missed topics inside producer loops), kafka-go Writer/Reader, Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream helpers. ### Modified and new tests (+1264 LOC) - `grpc-extractor.test.ts` (+539) — full coverage of the new proto parser (strings-with-braces regression, comments-with-braces regression), per-language server/client detection - `http-route-extractor.test.ts` (+240) — per-framework route extraction + normalization edge cases - `topic-extractor.test.ts` (+177) — the sarama in-loop regression, JetStream, Python NATS, kafka-go Writer/Reader - `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization, gRPC exact lookup with proto-fallback regression, lib and topic exact matching, synthetic-uid fallback behavior ### Self-review fixes folded in Carried forward from the abhigyanpatwari#606 self-review (commit `d15b8cb`): - **HIGH abhigyanpatwari#1** — `manifest-extractor.resolveSymbol` was too fuzzy. Previously used `CONTAINS` on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences: `/orders` matched `/suborders`, and any repo with any `.proto` file returned a random proto symbol for a gRPC manifest entry. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback for unresolved manifests. Regression tests included. - **MED abhigyanpatwari#3** — gRPC proto parser brace-depth counting now sanitizes strings and comments first (`stripProtoCommentsAndStrings`). A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal, silently dropping methods after the offending string. Regression tests for both string-with-brace and comment-with-brace cases. - **MED abhigyanpatwari#4** — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal directly). Regression test with a for-loop that constructs multiple `ProducerMessage`s. - **MED abhigyanpatwari#7** — `manifest-extractor.resolveSymbol` no longer has a silent `catch { /* fall through */ }`. Errors from the graph executor are logged via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path. ## Why Reviewer focus here is pure regex / parser correctness — no storage, no Cypher queries, no algorithmic changes to the cross-link algorithm. Separating this from the bridge foundation PR (abhigyanpatwari#795) meant reviewers could stay in a single mental mode (parsing logic) instead of context-switching between DDL, Cypher, and regex. ## How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks` Local pre-push: typecheck clean, all 99 extractor unit tests pass (grpc 43, http 18, topic 30, manifest 8). ## Risk / rollback **Low.** Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` that is consumed by `sync.ts` in the next split (abhigyanpatwari#793). No existing behavior changes for users who don't run a `group sync`. Rollback = `git revert` of the merge commit; the modifications to `grpc-extractor.ts` / `http-route-extractor.ts` / `topic-extractor.ts` revert to the pre-PR versions that still work (they're subsets of the new functionality). ## Scope discipline (per GUARDRAILS.md) - Only the 8 files above are touched; no drive-by refactors - No CI/release/security config changes - No secrets or machine-specific paths - Content lifted from abhigyanpatwari#606 (CI 11/11 green on `d15b8cb`) ## Dependencies - **Base:** `main` (upstream already includes abhigyanpatwari#795 as `f6fb87f`) - **Blocks:** sync pipeline (abhigyanpatwari#793) and the cross-impact feature (abhigyanpatwari#794) - **Tracker issue:** abhigyanpatwari#792 - **Parent PR:** abhigyanpatwari#606 Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate topic-extractor from regex to tree-sitter queries Addresses @magyargergo's feedback on abhigyanpatwari#796 that regex-based lookups should use tree-sitter nodes instead, and that the top-level extractors must NOT carry language dependencies. This is phase 1 of a multi-step migration — topic-extractor first because its patterns are the most uniform (16 "call/annotation with first-arg string literal" variants), which makes it a clean proof of the approach before grpc-extractor and http-route-extractor get the same treatment. ## Architecture: language-agnostic orchestrator + per-language plugins The top-level extractor is a thin orchestrator that never imports a tree-sitter grammar or a query string. Per-language knowledge lives in a new `topic-patterns/` folder with one file per language plus a registry that maps file extensions to compiled plugins: ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic scanning utilities ├── topic-extractor.ts # thin orchestrator (no grammar imports) └── topic-patterns/ ├── types.ts # TopicMeta, Broker ├── index.ts # registry: extension → compiled provider ├── java.ts # tree-sitter-java + JAVA_TOPIC_PROVIDER ├── go.ts # tree-sitter-go + GO_TOPIC_PROVIDER ├── python.ts # tree-sitter-python + PYTHON_TOPIC_PROVIDER └── node.ts # tree-sitter-javascript + tree-sitter-typescript # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER ``` **Shared scanner (`tree-sitter-scanner.ts`)** — defines `PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>` and the `scanFile(parser, plugin, content)` helper. Plugins compile their queries eagerly at module load via `compilePatterns()`, so a broken pattern fails loudly at import time instead of silently at scan time. `unquoteLiteral()` handles single/double/template quotes, Python triple-quoted strings, and Go raw backtick strings. **Per-language plugins** own: - the tree-sitter grammar import (this is the ONLY place in `src/core/group/` where tree-sitter grammars are imported), - the query S-expressions, - the `TopicMeta` payload (role, broker, confidence, symbolName) that the orchestrator receives back on every match. Each plugin uses a `@value` capture name to bind the topic literal node. The JavaScript and TypeScript grammars share AST node names for every construct we query, so `node.ts` defines the pattern sources once and compiles them against `JavaScript`, `TypeScript.typescript`, and `TypeScript.tsx` — exporting three providers because `Parser.Query` objects are NOT portable across grammar instances. **Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider, `.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX. Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single file-level edit (drop `topic-patterns/<lang>.ts`, import + register it here — zero edits required in `topic-extractor.ts`). **Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser, provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`. Reuses one `Parser` instance across files; the scanner calls `setLanguage` per plugin. ## Why this is better than regex 1. **Comments and strings are respected for free.** The old regex would match `// kafkaTemplate.send("fake.topic")` as a real producer; tree-sitter never visits comments or string literals as code nodes, so false positives from commented-out code are eliminated. 2. **Struct/object literal patterns are structural, not textual.** `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char lookahead (which was a known cross-match bug partly mitigated by a loop regression test in the self-review). The new query matches a specific `composite_literal` with a specific `qualified_type` and `keyed_element` — exactly one struct literal per match. 3. **No order-of-operations fragility.** Regex for `channel.publish` vs `channel.consume` was independent and file-wide; the AST scopes matches to the specific `call_expression`. 4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic detection later means dropping one file in `topic-patterns/` — no changes to shared scanner or orchestrator, and no tree-sitter imports leak into top-level code. ## Per-file fault tolerance - Malformed files that tree-sitter can't parse are silently skipped (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline already logs unparseable files at index time. - A syntactically invalid query is caught at `compilePatterns` time, not scan time — broken plugins fail loudly at import. - Per-pattern `matches()` failures are swallowed so one broken query in a plugin doesn't block the rest. ## Tests All 30 existing `topic-extractor.test.ts` tests pass **without any changes to the test file** — they were written as input/output contract tests (given this source file, expect these `ExtractedContract` objects) and that contract is unchanged. Regression coverage includes: - Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node `producer.send` + `consumer.subscribe`; Go sarama producer/consumer (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` + `producer.send/produce` - RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`; Node `channel.consume/publish/sendToQueue`; Python `basic_consume/ basic_publish` with keyword args - NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream `js.Subscribe/Publish`; Python `await nc.subscribe/publish` Including the regression test for the sarama `ProducerMessage` in-loop case — the AST-based query captures every literal in the file independently, not just the first one after `NewSyncProducer`. ## Neighbor regression check - `topic-extractor.test.ts` — 30/30 pass (rewritten extractor) - `http-route-extractor.test.ts` — 18/18 pass (untouched) - `grpc-extractor.test.ts` — 43/43 pass (untouched) - `manifest-extractor.test.ts` — 8/8 pass (untouched) - Full `npx tsc --noEmit` clean ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched; no changes to other extractors, tests, MCP surface, or pipeline.ts. - No CI/release/security config changes, no secrets. - New tree-sitter imports all reference grammars that are already installed as dependencies (`tree-sitter`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`, `tree-sitter-go` — all in `package.json` for the existing pipeline). ## Phase 2 / phase 3 plan - **Phase 2 (next commit):** rewrite `http-route-extractor.ts` Strategy B (regex fallback) on the same plugin pattern. Graph-assisted Strategy A stays as-is (already uses pipeline-built tree-sitter data via `HANDLES_ROUTE` Cypher queries). - **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java / Go / Python / TypeScript detection. `.proto` files are the one outstanding question — there is no `tree-sitter-proto` grammar installed; the in-tree string-sanitizing parser stays as a pragmatic exception with a comment, alternative being to add `tree-sitter-proto` as a dep (open for the maintainer). Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins Phase 2 of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as the phase 1 topic-extractor rewrite: a thin, language-agnostic orchestrator plus per-language plugins that own tree-sitter grammars and query sources. The top-level extractor file no longer imports any tree-sitter grammar or query string. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic primitives ├── http-route-extractor.ts # thin orchestrator (no grammar imports) └── http-patterns/ ├── types.ts # HttpDetection, HttpLanguagePlugin, HttpRole ├── index.ts # registry: ext → plugin + HTTP_SCAN_GLOB ├── java.ts # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp ├── go.ts # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers ├── python.ts # tree-sitter-python: FastAPI + requests ├── php.ts # tree-sitter-php: Laravel Route::get/... └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # NestJS controllers, Express, fetch, axios ``` **Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1: - `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named capture the query binds, not just a single `@value`). Topic extractor updated to read `match.captures.value` accordingly. - New `runCompiledPatterns(plugin, tree)` helper lets plugins run multiple query bundles against the same pre-parsed tree. This is needed for HTTP plugins that combine a class-prefix query with a method-route query (Spring, NestJS). - `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`. **HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a `scan(tree)` function rather than a flat pattern list. This reflects HTTP's more complex extraction: each detection needs method + path + handler name, and framework patterns like Spring `@RequestMapping` / NestJS `@Controller` require cross-referencing a class-level prefix with method-level annotations. Plugins internally use `compilePatterns` + `runCompiledPatterns` and walk the AST to resolve the class/method relationships. **Per-framework coverage:** - **Java (`java.ts`)** - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put| Delete|Patch)Mapping("/sub")` method routes, joined via the enclosing `class_declaration` node id. - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` → method derived from API name. - `WebClient.method(HttpMethod.X, "/path")` → method from `HttpMethod.X` capture. - `new Request.Builder().url("/path")` → OkHttp consumer. - **Go (`go.ts`)** - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures upper-case verb + handler identifier. - `net/http.HandleFunc("/path", handler)` → provider (default GET). - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`, resty `client.R().Get/Post/...`. - **Python (`python.ts`)** - `@app.get("/path")` FastAPI decorators. - `requests.get/post/...` and `requests.request("METHOD", "url")`. - **PHP (`php.ts`)** - Laravel `Route::get/post/.../patch('/path', ...)` via `scoped_call_expression`. Uses `PHP.php_only` to match the existing ingestion pipeline's grammar selection. - **Node (`node.ts`) — JS + TS + TSX** - Pattern sources defined once, compiled against three grammar variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`) because `Parser.Query` objects are not portable across grammars. Exports three plugins sharing the same `scan` logic. - NestJS: `@Controller('prefix')` decorators are siblings of the class in `export_statement` / `program`; `@Get(':id')` decorators are siblings of the method in `class_body`. The plugin walks decorator → next named sibling to find the decorated class / method, then combines the class prefix with the method path. Only emits NestJS detections when the enclosing class has a real `@Controller` decorator — prevents false positives from generic classes that happen to use `@Get` from another library. - Express: `(router|app).<verb>('/path', ...)`. - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })` (uses two queries + a SyntaxNode-id dedupe set so URL literals aren't double-emitted by the options variant). - `axios.get/post/...`. ## Orchestrator changes `http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers` regex method and replaces them with a single source-scan loop that delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator still owns: - **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`) — language-agnostic string processing shared by both strategies. - **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS` Cypher queries) — unchanged in spirit. The only regex helpers it used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now replaced by a lookup against the plugin's detections for the same file: for each route row, find the detection whose normalized path matches, and pull the HTTP method + handler name from it. - **Per-file parse cache** — the orchestrator parses each relevant file at most once per `extract()` call. Both the graph-assisted enrichment loop and the source-scan fallback share the same `cachedDetections` map, so we never run the plugin twice for the same file. ## Why this is better than the regex version 1. **Comments and strings for free.** The old regex would match `// router.get('/fake')` as a real Express route; tree-sitter never visits string/comment nodes. 2. **Structural controller-prefix.** Spring and NestJS class-prefix joining is now scoped to the enclosing class via `class_declaration` node ids, eliminating file-wide state that broke when a file had multiple controllers. 3. **Precise NestJS disambiguation.** The plugin only emits a NestJS detection when the enclosing class has a real `@Controller` decorator — the old regex would fire on any `@Get(...)` in the file regardless of surrounding context. 4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP detection later means dropping one file in `http-patterns/` — no changes to the shared scanner, the orchestrator, or the Strategy A Cypher queries. ## Tests - `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged; they're contract-style input/output tests and the contract shape is unchanged). Covers Spring class prefix, Express, gin/echo, stdlib HandleFunc, NestJS, Laravel, FastAPI for providers and fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/ resty for consumers, plus graph-first Strategy A for both. - `topic-extractor.test.ts` — **30/30 pass** after the `captures.value` API migration. - `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3). - `manifest-extractor.test.ts` — 8/8 pass (untouched). - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No changes to pipeline.ts, MCP surface, ingestion, or tests. - No CI / release / security / secrets changes. - Tree-sitter grammars imported by plugins (`tree-sitter-java`, `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already in `package.json` for the existing ingestion pipeline. ## Phase 3 plan - **grpc-extractor** gets the same treatment: plugin-per-language under `grpc-patterns/` for Java / Go / Python / TS detection. `.proto` files remain an open question — no `tree-sitter-proto` grammar is installed, so the in-tree string-sanitizing parser from PR abhigyanpatwari#796's self-review stays as a pragmatic exception unless the maintainer wants us to add `tree-sitter-proto` as a new dep. Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate grpc-extractor source scans to tree-sitter plugins Phase 3 (final) of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as phase 1 (topic) and phase 2 (http): thin language-agnostic orchestrator + per-language plugins that own tree-sitter grammars and query sources. With this commit the top-level extractors under `src/core/group/extractors/` import ZERO tree-sitter grammars and ZERO query strings — every grammar import lives in a `*-patterns/<lang>.ts` plugin file, and the orchestrators go through the registry indirection. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared primitives (unchanged) ├── grpc-extractor.ts # orchestrator (only `.proto` parser left) └── grpc-patterns/ ├── types.ts # GrpcDetection, GrpcLanguagePlugin, GrpcRole ├── index.ts # registry: ext → plugin + GRPC_SCAN_GLOB ├── go.ts # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient ├── java.ts # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub ├── python.ts # tree-sitter-python: add_XxxServicer_to_server + XxxStub └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # @GrpcMethod, @GrpcClient field type, # .getService<X>('Svc'), new XxxServiceClient, # loadPackageDefinition dynamic constructors ``` ## Per-language coverage **Go (`go.ts`)** - Provider: `\w+.RegisterXxxServer(...)` via `call_expression → selector_expression → field_identifier` + JS regex filter `^Register(\w+)Server$`. - Provider: `pb.UnimplementedXxxServer` embedded in a struct via `struct_type → field_declaration_list → field_declaration → qualified_type → type_identifier` + JS filter. - Consumer: `\w+.NewXxxClient(...)` via the same call_expression query + JS filter `^New(\w+)Client$`. **Java (`java.ts`)** - Provider: `class X extends YyyGrpc.YyyImplBase` — two queries handle the scoped and plain forms. `scoped_type_identifier`'s children are positional (no `scope:`/`name:` fields), so the query matches the two `type_identifier` children by position. - `#match? @inner "ImplBase$"` restricts matches at query time. - Whether the class has `@GrpcService` or not controls only the `source` metadata label — the plugin walks the class_declaration's `modifiers` child in JS to detect the marker_annotation. - Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a `method_invocation` query with `#match? @method "^new(Blocking)?Stub$"`, service name extracted via `^(\w+)Grpc$` on the object identifier. **Python (`python.ts`)** - Single call-expression query covers both bare identifier and `obj.method` attribute forms: `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`. - Plugin filters `@fn.text` against two JS regexes: `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$` (consumer), with a reserved-names ignore list for the Stub case (Mock / Test / Fake / Stub). **Node — JavaScript + TypeScript + TSX (`node.ts`)** - Pattern sources defined once, compiled three times (one per grammar) because `Parser.Query` objects are not portable across grammars. Exports three `GrpcLanguagePlugin`s sharing the same `scan`. - `@GrpcMethod('Service', 'Method')`: decorator query captures the two string literals. Confidence is hard-coded 0.8 regardless of proto map resolution (matches the original regex version's behaviour). - `@GrpcClient(...) field: XxxServiceClient`: decorator query captures the decorator node, plugin walks up to find the enclosing `public_field_definition` (decorators on fields are CHILDREN of the field definition in tree-sitter-typescript, not siblings) and reads its first `type_annotation → type_identifier`, then runs the `^(\w+Service)Client$` JS filter. - `client.getService<X>('AuthService')`: call-expression query on `member_expression.property = "getService"` + string literal arg. - `new XxxServiceClient(...)`: `new_expression` with a bare identifier constructor, filtered by `^(\w+Service)Client$` so generic `new AuthClient(...)` (missing the `Service` infix) does NOT falsely register as a consumer. Preserves the regression test `test_extract_ts_non_service_client_constructor_is_ignored`. - `loadPackageDefinition` dynamic loader: gated on `tree.rootNode.text.includes('loadPackageDefinition')`. When set, `new foo.bar.Xxx(...)` qualified constructors with a capitalised property name register as consumers. ## Orchestrator changes `grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders` / ... helper and replaces them with a single source-scan loop that: 1. Parses each file with the plugin's grammar (one shared `Parser` instance across all files, `setLanguage` called per plugin). 2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`. 3. Converts each detection to an `ExtractedContract` via the private `detectionToContract` helper, which: - Looks the short service name up in the proto map (filled by the `.proto` parser). - Picks confidence = `confidenceWithProto` if resolved, else `confidenceWithoutProto`. - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when the detection carries a `methodName` (TS `@GrpcMethod` only), otherwise a service-level id (`grpc::pkg.Svc/*`). Everything else — the `.proto` parser, `buildProtoContext`, `buildProtoMap`, `resolveProtoConflict`, `serviceContractId`, `stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe function — stays exactly as before. The `.proto` parser is kept as a pragmatic exception to the "no regex in extractors" rule because no `tree-sitter-proto` grammar is installed in the repo; a comment at the top of the file explains this and flags the maintainer option of adding `tree-sitter-proto` as a dependency. ## Why this is better than the regex version 1. **Comments and strings are respected for free.** Matched node types are only code constructs, never text inside comments or string literals. 2. **No false positives on partial names.** The old `(\w+?)Grpc`-style regexes would cross-match unrelated identifiers; structural queries restrict matches to the exact AST shape (`scoped_type_identifier → type_identifier` pairs, `method_invocation → identifier` etc.). 3. **NestJS `@GrpcClient` is structural, not regex-based.** The old regex required a specific textual layout (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the plugin now walks the AST, so modifier order / optional modifiers / multi-line formatting don't break it. 4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC detection later is a one-file edit in `grpc-patterns/index.ts` — no touches to the shared scanner, the orchestrator, or the proto parser. ## Tests - `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the contract shape is identical). Covers .proto parsing (including the brace-inside-string regression), Go provider/consumer, Java @GrpcService / plain ImplBase provider + newBlockingStub consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient + .getService + new XxxServiceClient + loadPackageDefinition + the `AuthClient` vs `AuthServiceClient` discrimination, dedupe across multiple patterns in one file, proto-aware confidence, and the inherited-package resolution for split proto definitions. - `topic-extractor.test.ts` — 30/30 pass. - `http-route-extractor.test.ts` — 18/18 pass. - `manifest-extractor.test.ts` — 8/8 pass. - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No pipeline.ts, MCP surface, ingestion, CI / release / security, or test changes. - New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`, `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already installed for the ingestion pipeline. ## End of phase series This commit completes the three-phase extractor refactor: - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/` - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/` - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/` Every remaining regex-based extractor helper under the `src/core/group/ extractors/` directory is either (a) language-agnostic string processing (path normalization, dedupe keys) or (b) the `.proto` parser, which is documented as an explicit exception. Co-authored-by: Claude <noreply@anthropic.com> * feat(group): add tree-sitter-proto for .proto file parsing Addresses @magyargergo's suggestion on abhigyanpatwari#796 to replace the manual string-sanitizing .proto parser with a tree-sitter grammar. - **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`. Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto) (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli 0.24` to produce ABI version 14 — compatible with the project's `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as `optionalDependency` with `file:./vendor/tree-sitter-proto`. - **New `grpc-patterns/proto.ts` plugin** — uses the same `compilePatterns` + `runCompiledPatterns` infrastructure as every other plugin. Two queries: - `(package (full_ident) @pkg)` — package declaration - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))` — one match per (service, rpc) pair - **Graceful fallback** — `tree-sitter-proto` is an optional dependency. If it fails to install (platform incompatibility) or fails the runtime smoke-test (`setLanguage` + `parse` on a trivial proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator uses the existing manual parser. The smoke-test catches the `SyntaxNode` TDZ error that occurs in vitest's fork-based test runner. - **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto` files are handled by the plugin loop (they're included in `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is skipped. `buildProtoContext` still runs to build the proto map for cross-referencing source-file detections. 1. **No manual comment/string stripping.** The old parser needed `stripProtoCommentsAndStrings` (110 lines) to avoid counting braces inside comments and string literals. tree-sitter handles this natively. 2. **No brace-depth tracking.** `extractServiceBlocks` used a manual depth counter to find service boundaries. tree-sitter's AST gives us `service` → `service_name` + `rpc` → `rpc_name` directly. 3. **Performance.** tree-sitter's C-based parser is faster than character-by-character JS scanning + regex on large proto files. - `grpc-extractor.test.ts` — **43/43 pass** (unchanged) - All other extractor tests — 99/99 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * chore: add .gitignore for vendored tree-sitter-proto build artifacts https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * fix: correct .gitignore paths for vendored tree-sitter-proto Patterns should be relative to the .gitignore file's directory. https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * refactor(group): address Copilot review feedback on abhigyanpatwari#796 Six fixes suggested by the Copilot AI review: 1. **`normalizeHttpPath` root-path edge case** — stripping trailing slashes on the input `/` produced an empty string, yielding malformed contract ids like `http::GET::`. Now preserves `/` for the root handler/fetch case. 2. **Dedupe `scanFiles` call** — `extract()` was globbing the source-scan file list twice (once for the provider fallback, once for the consumer fallback). Moved to a single lazy call that memoizes the result for the rest of the method. 3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other extractor's glob already ignored vendored sources; the HTTP one didn't. Fixed for consistency. 4. **`loadPackageDefinition` check is now structural** — was calling `tree.rootNode.text.includes('loadPackageDefinition')` which forces materialization of the entire file text from the parse tree (expensive on large files). Replaced with a dedicated compiled query on `(call_expression function: [(identifier) | (member_expression)])` so the check stays in the AST domain. 5. **`grpc-extractor.ts` header docstring updated** — still claimed ".proto parsing is not tree-sitter-based because no grammar is installed". Now describes the actual behaviour: tree-sitter when `tree-sitter-proto` is available (optionalDependency), manual fallback otherwise. 6. **Eliminated the double proto file parse on the fallback path** — `buildProtoContext` already globs + parses every `.proto` file to build `servicesByName`. On the `!hasProtoPlugin` branch the extractor was globbing + parsing again via the now-removed `parseProtoFile` helper. The fallback branch now iterates the map that `buildProtoContext` already produced to emit provider contracts directly — single pass per proto file. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): address Claude review feedback (bugs + dedup + hygiene) on abhigyanpatwari#796 Follows up `2f28bfc` with the remaining items from the Claude AI review: ## Bugs **Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.** The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x` with no label filter, so a topic/service/package name could silently match any node type (File, Variable, Import, Folder, …). Added label filters: - `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort symbol-name matches against listener/publisher symbols) - `grpc` method → `(n:Function|Method)` - `grpc` service → `(n:Class|Interface)` - `lib` → `(n:Package|Module)` All 8 manifest-extractor tests still pass (mock executor is label-agnostic, but the production LadybugDB graph now gets correctly scoped queries). **Bug 8 — Tautological `!handlerName` condition.** `http-route-extractor.ts:extractProvidersGraph` had `let handlerName = null; if (!method || !handlerName) { ... }` — the `!handlerName` clause was always true since there was no intervening assignment. Simplified to always run the plugin-scan lookup (we need the handler name even when `methodFromRouteReason` already resolved the method). ## Clean code / dedup **Design 7 — `readSafe` was copy-pasted in all three orchestrators.** Extracted to `extractors/fs-utils.ts` as the single source of truth for the path-traversal guard. Dropped the three local copies and the now-unused `fs`/`path` imports from topic-extractor. **Style 10 — Language-specific `_test.go` skip in the topic orchestrator.** Was `if (rel.endsWith('_test.go')) continue;` inside the language- agnostic extraction loop. Pushed into the glob's ignore list (`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`, `dist`, `build` entries, with a comment explaining that other languages' test file conventions either live in separate directories (Python `tests/`, Java `src/test/`) or are already covered by the existing ignores. ## Already addressed in `2f28bfc` (mentioned again in Claude review) - Bug 3: `normalizeHttpPath('/')` returns `''` — fixed - Bug 4: double glob + double parse of `.proto` — fixed - Bug 5: `scanFiles` called twice in HTTP — fixed - Bug 6: missing `**/vendor/**` in HTTP glob — fixed - Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')` replaced with a dedicated structural query ## Deferred - Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope; sync.ts matching logic lands in abhigyanpatwari#793, manifest extractor already emits correct synthetic uids for unresolved HTTP contracts. - Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) — the only real use case (`loadPackageDefinition` gate) is already fixed via a structural query, so the interface change would be cosmetic churn without a concrete consumer. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * docs+fix(group): address remaining Claude review items + add pipeline flow chart ## Fixes **Remaining 🔴 — HTTP contract id wildcard format.** Documented the `http::*::<path>` format as an intentional wildcard for manifest links that omit the HTTP method, alongside the explicit-method form (`GET::/path` → `http::GET::/path`). The docblock on `buildContractId` now states both forms, notes that wildcard-aware matching is the responsibility of the sync / cross-impact layer (abhigyanpatwari#793), and recommends the explicit-method form whenever the author knows the method (it round-trips through exact equality without needing wildcard logic downstream). Tests unchanged — the wildcard format is what they've always asserted. **Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The comment claimed "creates a contract with an empty symbolUid/ref" but the code switched to `manifestSymbolUid(repo, contractId)` a few commits back. Updated to describe the actual synthetic-uid fallback semantics and the cross-impact path that relies on both sides of the join deriving the same uid. **Minor 2 — exhaustiveness guard on `buildContractId`.** The `switch(type)` covered all five current `ContractType` variants but silently returned `undefined` if a new variant was added. Added a `default: const _exhaustive: never = type; throw new Error(...)` clause so the build fails loudly on an unhandled variant. **Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already fixed in `2f28bfc` via a dedicated structural query (`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed. ## New: pipeline flow chart (per @magyargergo's request) Added `src/core/group/PIPELINE.md` with four Mermaid diagrams: 1. **High-level overview** — `group.yaml` → extractors + manifest → contract matching → `bridge.lbug` → `runGroupImpact`. 2. **Per-repo extractor two-strategy shape** — graph-assisted Strategy A vs. source-scan Strategy B. 3. **Plugin architecture** — orchestrator → registry → per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` → `ExtractedContract`. 4. **Manifest extraction** — label-scoped `resolveSymbol` with the synthetic-uid fallback. 5. **Cross-impact query (abhigyanpatwari#606)** — local impact → bridge join → cross-repo fan-out. Each diagram is annotated with which PRs own which stage (this PR: extractors + manifest; abhigyanpatwari#795: bridge storage; abhigyanpatwari#606: cross-impact runtime) and points at the concrete files/functions involved. ## Tests - 99/99 extractor tests pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
Summary
Adds cross-repository impact analysis on top of the group infrastructure from #625. Enables blast radius analysis that traces through contract links (HTTP routes, gRPC, Kafka topics) across repository boundaries.
What changed:
cross-impact.ts— cross-repo blast radius traversal via contract registrymanifest-extractor.ts— manifest-declared cross-links fromgroup.yamlgroupImpact()method on GroupServicegroup impactCLI command with--target,--repo,--direction,--cross-depthgroup_impactMCP tooldocs/specs/2026-03-31-cross-index-impact-design.md)Why:
Completes the cross-repo analysis story started in #625. With intra-repo service tracking proven, this PR adds the virtual graph layer that connects repositories via their contract interfaces — enabling questions like "if I change this function, which other repos are affected?"
How to verify:
Risk / rollback:
cross-impact.tsandmanifest-extractor.ts— new files, no existing code modifiedsync.tsandservice.tswith manifest extraction path (additive)Test plan
🤖 Generated with Claude Code