Skip to content

[group] Cross-repo impact analysis via repository groups#606

Closed
ivkond wants to merge 13 commits into
abhigyanpatwari:mainfrom
ivkond:feat/cross-index-impact-groups
Closed

[group] Cross-repo impact analysis via repository groups#606
ivkond wants to merge 13 commits into
abhigyanpatwari:mainfrom
ivkond:feat/cross-index-impact-groups

Conversation

@ivkond

@ivkond ivkond commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

Summary

Depends on: #625 (intra-repo service tracking). This PR will be rebased once #625 merges.

Adds cross-repository impact analysis on top of the group infrastructure from #625. Enables blast radius analysis that traces through contract links (HTTP routes, gRPC, Kafka topics) across repository boundaries.

What changed:

  • cross-impact.ts — cross-repo blast radius traversal via contract registry
  • manifest-extractor.ts — manifest-declared cross-links from group.yaml
  • groupImpact() method on GroupService
  • group impact CLI command with --target, --repo, --direction, --cross-depth
  • group_impact MCP tool
  • RFC design document (docs/specs/2026-03-31-cross-index-impact-design.md)

Why:
Completes the cross-repo analysis story started in #625. With intra-repo service tracking proven, this PR adds the virtual graph layer that connects repositories via their contract interfaces — enabling questions like "if I change this function, which other repos are affected?"

How to verify:

cd gitnexus
npx tsc --noEmit
npm run test:unit
npm run test:integration
# Cross-repo impact test:
npx vitest run test/integration/group/group-impact.test.ts

Risk / rollback:

  • Adds cross-impact.ts and manifest-extractor.ts — new files, no existing code modified
  • Enhances sync.ts and service.ts with manifest extraction path (additive)
  • Multi-hop traversal intentionally capped at depth 1 (safe default)
  • Rollback: revert the PR

Test plan

  • Unit tests — cross-impact (risk scoring, fan-out, subgroup filtering)
  • Unit tests — manifest-extractor (provider/consumer contracts, cross-links)
  • Integration test — group-impact (local → cross-repo traversal)
  • Full regression: all existing tests pass

🤖 Generated with Claude Code

@vercel

vercel Bot commented Mar 31, 2026

Copy link
Copy Markdown

@ivkond is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@ivkond

ivkond commented Mar 31, 2026

Copy link
Copy Markdown
Contributor Author

Hi @abhigyanpatwari! This is a proposal PR for cross-repository impact analysis (#256, #306, #77). I'd appreciate your feedback on:

  • Whether this direction aligns with your vision for GitNexus
  • The architecture choice (hybrid lazy virtual graph vs unified super-graph)
  • Any concerns about the scope or implementation approach

This is a demo/proof-of-concept — BM25/embedding matching, gRPC/topic extractors, and multi-hop traversal are intentionally deferred. Happy to iterate based on your feedback.

@github-actions

github-actions Bot commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
6119 6022 0 97 237s

✅ All 6022 tests passed

97 test(s) skipped — expand for details
  • Swift MethodExtractor > isTypeDeclaration > recognizes class_declaration
  • Swift MethodExtractor > isTypeDeclaration > recognizes protocol_declaration
  • Swift MethodExtractor > isTypeDeclaration > rejects import_declaration
  • Swift MethodExtractor > visibility > extracts public method
  • Swift MethodExtractor > visibility > extracts private method
  • Swift MethodExtractor > visibility > defaults to internal when no modifier
  • Swift MethodExtractor > protocol methods > marks protocol method as abstract
  • Swift MethodExtractor > static and class methods > detects static func as isStatic
  • Swift MethodExtractor > static and class methods > detects class func as isStatic
  • Swift MethodExtractor > parameters > extracts parameters with types and default values
  • Swift MethodExtractor > return type > extracts return type from -> annotation
  • Swift MethodExtractor > annotations > extracts @objc attribute
  • Swift MethodExtractor > isFinal > detects final func
  • Swift MethodExtractor > isFinal > is false when not final
  • Swift MethodExtractor > isAsync > detects async func
  • Swift MethodExtractor > isOverride > detects override method
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift explicit init inference uses lookupClassByName
  • buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference does not bind plain functions
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
  • Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
  • Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
  • Swift self resolution > detects User and Repo classes, each with a save function
  • Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
  • Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
  • Swift parent resolution > emits EXTENDS edge: User → BaseModel
  • Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
  • Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
  • Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
  • Swift return type inference > detects User class and getUser function
  • Swift return type inference > detects save function on User (Swift class methods are Function nodes)
  • Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
  • Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
  • Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
  • Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()
  • Swift implicit imports (cross-file visibility) > detects UserService class in Models.swift
  • Swift implicit imports (cross-file visibility) > resolves UserService() constructor call across files (no explicit import)
  • Swift implicit imports (cross-file visibility) > resolves service.fetchUser() member call across files
  • Swift implicit imports (cross-file visibility) > creates IMPORTS edges between files in the same module
  • Swift extension deduplication > detects Product class
  • Swift extension deduplication > resolves Product() constructor despite extension creating duplicate class node
  • Swift extension deduplication > resolves product.save() to Product.swift (primary definition)
  • Swift constructor call fallback (no new keyword) > resolves OCRService() as constructor call across files
  • Swift constructor call fallback (no new keyword) > resolves ocr.recognize() member call via constructor-inferred type
  • Swift export visibility (internal vs private) > resolves PublicService() constructor across files
  • Swift export visibility (internal vs private) > resolves internalHelper() across files (internal = module-scoped)
  • Swift if let / guard let binding resolution > detects User and Repo classes
  • Swift if let / guard let binding resolution > resolves user.save() inside if-let to User#save
  • Swift if let / guard let binding resolution > resolves repo.save() inside guard-let to Repo#save
  • Swift if let / guard let binding resolution > user.save() in if-let does NOT resolve to Repo#save
  • Swift await / try expression unwrapping > resolves user.save() via await fetchUser() return type
  • Swift await / try expression unwrapping > resolves repo.save() via try parseRepo() return type
  • Swift await / try expression unwrapping > detects fetchUser and parseRepo as functions
  • Swift for-in loop element type inference > detects User and Repo classes
  • Swift for-in loop element type inference > creates implicit import edges between files
  • Swift field-type resolution > detects classes and their properties
  • Swift field-type resolution > emits HAS_PROPERTY edges from class to field
  • Swift field-type resolution > resolves field-chain call user.address.save() → Address#save
  • Swift field-type resolution > emits ACCESSES edges for field reads in chains
  • Swift field-type resolution > populates field metadata (visibility, declaredType) on Property nodes
  • Swift call-result binding > resolves call-result-bound method call user.save() → User#save
  • Swift call-result binding > getUser() is present as a defined function
  • Swift call-result binding > emits processUser -> getUser CALLS edge for let-assigned free function call
  • Swift method enrichment > detects Animal protocol and Dog class
  • Swift method enrichment > emits IMPLEMENTS edge Dog -> Animal
  • Swift method enrichment > emits HAS_METHOD edges for Dog methods
  • Swift method enrichment > marks protocol Animal.speak as isAbstract
  • Swift method enrichment > marks Dog.speak as NOT isAbstract
  • Swift method enrichment > marks breathe as isFinal
  • Swift method enrichment > marks classify as isStatic
  • Swift method enrichment > captures @objc annotation on breathe
  • Swift method enrichment > populates parameterTypes for classify(_ name: String)
  • Swift method enrichment > records parameterCount for classify
  • Swift method enrichment > records returnType for speak
  • Swift method enrichment > resolves dog.speak() CALLS edge
  • Swift method enrichment > resolves Dog.classify("dog") CALLS edge
  • Swift abstract dispatch > detects Repository protocol and SqlRepository class
  • Swift abstract dispatch > emits IMPLEMENTS edge SqlRepository -> Repository
  • Swift abstract dispatch > emits HAS_METHOD edges for Repository.find and Repository.save
  • Swift abstract dispatch > emits HAS_METHOD edges for SqlRepository.find and SqlRepository.save
  • Swift abstract dispatch > marks base Repository.find as isAbstract
  • Swift abstract dispatch > marks base Repository.save as isAbstract
  • Swift abstract dispatch > marks concrete SqlRepository.find as NOT isAbstract
  • Swift abstract dispatch > resolves repo.find(id: 42) CALLS edge
  • Swift abstract dispatch > resolves repo.save(entity: user) CALLS edge
  • Swift abstract dispatch > populates parameterTypes for Repository.find
  • Swift abstract dispatch > populates parameterTypes for Repository.save
  • Swift abstract dispatch > records returnType for SqlRepository.find
  • Swift abstract dispatch > emits METHOD_IMPLEMENTS edges from SqlRepository methods → Repository protocol methods
  • Swift overloaded method disambiguation > detects 2 distinct find Method nodes on SqlRepository
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edges for both find overloads
  • Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edge for save
  • Swift overloaded method disambiguation > emits exactly 3 METHOD_IMPLEMENTS edges total
  • Swift Child extends Parent — inherited method resolution (SM-9) > detects Parent and Child classes
  • Swift Child extends Parent — inherited method resolution (SM-9) > resolves c.parentMethod() to Parent.parentMethod via first-wins MRO walk

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 73.13% 16519/22588 72.2% 📈 +0.9 🟢 ██████████████░░░░░░
Branches 61.94% 10688/17254 61.29% 📈 +0.6 🟢 ████████████░░░░░░░░
Functions 77.92% 1525/1957 76.99% 📈 +0.9 🟢 ███████████████░░░░░
Lines 75.35% 15009/19919 74.37% 📈 +1.0 🟢 ███████████████░░░░░

📋 View full run · Generated by CI

@xkonjin

xkonjin commented Mar 31, 2026

Copy link
Copy Markdown

Code Review: Cross-Index Impact Groups

Overall: Impressive RFC-to-implementation PR. The design document is thorough and the implementation appears solid. This is a significant feature that addresses real gaps in multi-repo analysis.

Issues

1. Graph schema gaps not addressed in this PR (medium risk)
The RFC acknowledges several schema gaps that block full functionality:

  • No Route node label exists
  • No HANDLES_ROUTE or FETCHES relation types
  • Route data is ephemeral (reduced to CALLS edges with reason)
  • .proto files not a supported language
  • impact tool resolves symbols by name with LIMIT 1, not by UID

The PR implements the group infrastructure but these gaps mean the cross-link extraction for HTTP routes, gRPC, and topics will not work as designed. Consider:

  • Adding prerequisite schema changes first
  • Or reducing the scope to shared_lib and manifest contract types initially
  • Documenting which contract types are actually functional

2. Contract ID format could collide (low risk)
The format <type>::<discriminator> uses :: as a delimiter, which appears in:

  • gRPC package names: grpc::hr.UserService/GetUser
  • HTTP paths with empty segments (edge case)

Consider using a less common delimiter or proper escaping. URL-encoding the discriminator could prevent collisions.

3. BM25 threshold hardcoded at 0.7 (minor)
The matching cascade tuning in group.yaml has a BM25 threshold of 0.7. This seems arbitrary — was this tuned on real data? Consider:

  • Making the default data-driven
  • Or documenting how the threshold was chosen
  • Providing a calibration tool

4. Embedding vectors stored separately (question)
embeddings.bin is stored alongside contracts.json. Is this:

  • Generated fresh on each group sync?
  • Or incrementally updated?
  • What happens if the embedding model changes version?

The doc mentions no automatic migration — this could leave stale vectors with incompatible dimensions.

5. Test coverage looks good
The 31 new test files covering config parsing, contract extraction, matching, and sync are comprehensive. Nice work.

Minor

  • The RFC mentions group_impact and group_contracts tools, but I do not see them added to the MCP tools in the diff. Are they implemented?
  • Consider adding a group validate command to check group.yaml syntax before sync.

Verdict

Approve with suggestions. The design is sound and the implementation is solid. The main concern is the gap between designed capability (all contract types) and actual capability (limited by schema). Documenting this honestly would help users understand what works today vs what is planned.

@ivkond

ivkond commented Mar 31, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review! Addressing each point:

1. Graph schema gaps

The RFC "Prerequisites and Current State" section describes the codebase as of the RFC writing date, not the current state. The implementation plan (docs/plans/2026-03-31-cross-index-impact-plan.md) documents the divergences explicitly:

RFC claim Actual state in codebase
No Route node label Route exists in schema.ts
No HANDLES_ROUTE/FETCHES Present in VALID_RELATION_TYPES (13 types)
P2 prerequisite needed (4 relation types) Already 13 types in runtime filter
impact resolves by name with LIMIT 1 impactByUid added in this PR (UID-based resolution)

The HttpRouteExtractor uses a graph-first strategy — it queries Route/HANDLES_ROUTE/FETCHES nodes first, and falls back to source-scan only when graph data is absent.

gRPC, topic, and shared lib extractors are intentionally out of scope for Demo PR (documented in PR description).

2. Contract ID :: delimiter

By design. The :: delimiter separates the type prefix (http, grpc, topic, lib) from the discriminator. It does not appear inside real API paths, gRPC service names, or topic names in practice. URL-encoding would complicate readability of contracts.json and debug output without solving a real collision scenario.

3. BM25 threshold / 4. Embedding vectors

Both are out of scope for this Demo PR — the matching cascade is exact-match only. BM25 and embedding matching are designed in the spec but not implemented. The 0.7 default is configurable via group.yaml matching.bm25_threshold and will be tuned when BM25 is added.

Minor: MCP tools

All 6 group tools are in the diff — tools.ts lines 381-477: group_list, group_sync, group_contracts, group_impact, group_query, group_status. The tool count assertion is updated from 11 to 17 in tools.test.ts.

Minor: group validate

Good suggestion. parseGroupConfig already validates schema, required fields, link references, and roles — so group validate would be a thin CLI wrapper. Worth adding as a follow-up.

@ivkond

ivkond commented Mar 31, 2026

Copy link
Copy Markdown
Contributor Author

@claude ultra-think while identifying cross-indexing impact analysis gaps. Perfom this analysis on this pull request and verify those findings. Act as a senior compiler font-end engineer and expert in static analysis tools and review this change from that perpective and review the architectural fit. Also check out previous comments and reason with their requests.

@abhigyanpatwari

Copy link
Copy Markdown
Owner

@ivkond This is a great idea. I had been implementing something like this myself in the enterprise version of gitnexus. Our indexing system allows cross-repo analyses too but will need a bridge ( virtual graph ). Correct approach. But I would also look at intra repo rpc, grpc, kafka, rabbitmq, api endpoints, etc tracking. Once this can be done for intra repo we can move the same logic into building the virtual graph system so multi-repo support is bulletproof.

A good test would be to see if microservice based repos can be indexed well. If that is done cross repo becomes easy

@ivkond

ivkond commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

@abhigyanpatwari Thanks for the direction! I've started implementing intra-repo service communication tracking on a separate branch (stacked on this PR):

What's added (ivkond/GitNexus#1):

  • ServiceBoundaryDetector — auto-detects service boundaries within monorepos via markers (package.json, go.mod, Dockerfile, pom.xml, Cargo.toml, etc.)
  • GrpcExtractor — parses .proto files + detects Go/Java/Python/TS gRPC servers and clients (RegisterXxxServer, @GrpcService, add_XxxServicer_to_server, @GrpcMethod)
  • TopicExtractor — detects Kafka, RabbitMQ, NATS producers/consumers across Java/Node/Go/Python (@KafkaListener, producer.send, channel.publish, nc.Subscribe, etc.)
  • Intra-repo matching — relaxed the same-repo guard so contracts from different services within one repo can match (backward compatible — existing cross-repo matching unchanged)

Proof of concept: Fixture monorepo with 3 services (auth, orders, gateway) connected via gRPC + Kafka + HTTP routes — all cross-links are discovered automatically. 47 new tests, zero regressions (2643 unit + 2016 integration).

The service field on contracts is backward-compatible. Once this PR merges, the same extractors feed directly into the virtual graph for multi-repo support.

Next: test on a real microservice repo, add more framework patterns, then wire into cross-repo virtual graph.

@ivkond

ivkond commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

Update: Following your suggestion, I've split the work into two PRs:

  1. [group] Intra-repo service communication tracking #625 (→ main) — Intra-repo service tracking: group infrastructure, extractors (HTTP/gRPC/Kafka/RabbitMQ/NATS), service boundary detection, intra-repo matching. Self-contained, no cross-repo deps.

  2. This PR ([group] Cross-repo impact analysis via repository groups #606) — Will be rebased on top of [group] Intra-repo service communication tracking #625 once merged. Adds: cross-repo impact analysis (group_impact), manifest-declared links, virtual graph.

This way the intra-repo foundation lands first, proving that microservice monorepos index well, before cross-repo is layered on top.

@ivkond ivkond changed the title feat(group): cross-index impact analysis via repository groups [group] Cross-repo impact analysis via repository groups Apr 1, 2026
@ivkond

ivkond commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

Update: Following your suggestion, I've split the work into two PRs:

  1. [group] Intra-repo service communication tracking #626 (→ main) — Intra-repo service tracking: group infrastructure, extractors (HTTP/gRPC/Kafka/RabbitMQ/NATS), service boundary detection, intra-repo matching. Self-contained, no cross-repo deps.

  2. This PR ([group] Cross-repo impact analysis via repository groups #606) — Will be rebased on top of [group] Intra-repo service communication tracking #626 once merged. Adds: cross-repo impact analysis (group_impact), manifest-declared links, virtual graph.

This way the intra-repo foundation lands first, proving that microservice monorepos index well, before cross-repo is layered on top.

(Supersedes #625 which was closed due to branch rename.)

ivkond and others added 2 commits April 4, 2026 23:28
Cross-repo blast radius analysis through Contract Registry:
- group_impact MCP tool and CLI command
- ManifestExtractor for explicit cross-repo links
- HTTP route extractor for auto-detected contracts
- closeLbug() resource cleanup in CLI sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nonical ID

Design spec (12 rounds of review):
- Bridge.lbug: LadybugDB storage replacing contracts.json
- gRPC canonical ID: proto-aware extraction with wildcard matching
- Direction-dependent Cypher queries for cross-impact
- Write-to-temp-then-rename atomic swap strategy
- Backward compatibility via openBridgeOrFallback

Implementation plan (10 tasks, TDD):
- Bridge DB core, schema, write/read lifecycle
- Proto map, source scanner resolution, confidence adjustments
- Wildcard matching, sync migration, service layer integration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ivkond ivkond force-pushed the feat/cross-index-impact-groups branch from 88fd9d6 to de79d4b Compare April 4, 2026 21:03
@vercel

vercel Bot commented Apr 4, 2026

Copy link
Copy Markdown

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

Bridge.lbug — contract storage in LadybugDB:
- bridge-schema.ts: Contract, RepoSnapshot, ContractLink DDL
- bridge-db.ts: open/close, writeBridge (atomic swap), queryBridge,
  openBridgeDbReadOnly (with version gate), readBridgeMeta, openBridgeOrFallback
- BridgeHandle, BridgeMeta types; LegacyContractRegistry alias

gRPC canonical ID — proto-aware extraction:
- buildProtoMap: scan .proto files (excludes node_modules/vendor)
- resolveProtoConflict: directory proximity disambiguation
- serviceContractId: grpc::pkg.Service/* format
- All 4 source scanners (Go/Java/Python/TS) resolve via proto map
- Confidence adjustments: reduced without proto, boosted with

Matching — wildcard support:
- buildProviderIndex exported for reuse
- runExactMatch skips gRPC /* contracts
- runWildcardMatch: bare-name and FQ service matching

Consumer migration:
- sync.ts: writeBridge + wildcard pass, returns remaining after wildcard
- cross-impact.ts: Cypher-based Phase 2 with direction-dependent queries,
  hint param for empty-UID fan-out, error-object guard
- service.ts: openBridgeOrFallback with bridge/json/none branching,
  groupContracts bridge path returns flat contracts + crossLinks + unmatchedOnly
- CLI: status/impact use openBridgeOrFallback, deprecation warnings
- MCP tools: descriptions updated (contracts.json → bridge.lbug)
- Old storage API removed (writeContractRegistry, readContractRegistry)
- Cross-link insertion uses endpoint symbolRef lookup (not link.contractId)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ivkond ivkond force-pushed the feat/cross-index-impact-groups branch from de79d4b to ae68bfb Compare April 4, 2026 21:33
@ivkond

ivkond commented Apr 4, 2026

Copy link
Copy Markdown
Contributor Author

@abhigyanpatwari This PR is ready for re-review. I've addressed both items from your PR #626 review:

1. Contract storage → bridge.lbug

  • Contracts and cross-links are now stored in LadybugDB (bridge.lbug) as Cypher-queryable graph nodes and edges
  • group_impact Phase 2 uses direction-dependent Cypher queries instead of JS iteration
  • Atomic write via temp-file-then-rename with Windows EBUSY retry
  • Backward compat: automatic contracts.json fallback with deprecation warning

2. gRPC normalization mismatch

  • buildProtoMap() scans .proto files to resolve ServiceName → package.Service
  • Go/Java/Python/TS source scanners produce canonical grpc::pkg.Service/* IDs when proto is available
  • New runWildcardMatch() handles remaining unresolved wildcards as a separate matching pass
  • Confidence adjustments: reduced without proto (0.65/0.55), unchanged with proto

Tests: 239 in group module (11 new coverage tests), 5410 total in CI. Build: clean.

Note on force-push: Rebased on current main (includes PR #626) and squashed 24 intermediate commits into 3 logical units — original feature, documentation (spec + plan), implementation. The intermediate commits were iterative spec review rounds and per-task TDD commits that added noise to the history without providing bisect value.

@ivkond

ivkond commented Apr 8, 2026

Copy link
Copy Markdown
Contributor Author

@abhigyanpatwari hi! Friendly follow-up on this PR — I’d really appreciate your feedback when you have time, especially on whether this direction and architecture fit your vision.

@abhigyanpatwari

Copy link
Copy Markdown
Owner

@claude do a deep review of it, especially focused on coverage of cross repo communication protocols. Also see that there are real world test to validate it against real multi repo and microsercice based repos.

@github-actions

github-actions Bot commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Claude finished @abhigyanpatwari's task in 2m 43s —— View job


Deep Review: Cross-Repo Impact Analysis — Protocol Coverage & Real-World Test Validation

Todo list:

  • Post initial comment
  • Read core implementation files (cross-impact.ts, grpc-extractor.ts, matching.ts, service.ts)
  • Read manifest-extractor.ts and bridge-db.ts
  • Review protocol extractor coverage (HTTP, gRPC, Kafka, RabbitMQ, NATS)
  • Review test files for real-world coverage
  • Assess real-world multi-repo / microservice test gaps
  • Write comprehensive review

Summary

The architecture is fundamentally sound — the hybrid lazy virtual graph via bridge.lbug, direction-aware Cypher queries, and the extraction→matching→bridge pipeline are all well-designed. However, the protocol coverage has significant gaps for real-world polyglot microservice repos, and there are no true end-to-end tests against real multi-repo structures. Below is a detailed breakdown.


1. Protocol Coverage: What Works vs. What's Missing

HTTP (http-route-extractor.ts)

Covered providers: Spring @GetMapping/@PostMapping, Express router.get/post, Laravel Route::get, FastAPI @app.get.

Covered consumers: fetch() and axios.get/post in TS/JS.

Critical gaps:

Missing pattern Impact
Go HTTP handlers (gin.GET, echo.GET, mux.HandleFunc, stdlib http.HandleFunc) Any Go-based API service goes undetected as provider
NestJS @Controller/@Get/@Post decorators NestJS is the dominant Node.js microservice framework; not covered
Python requests.get/post consumer Python consumer side is entirely absent
Java RestTemplate.getForEntity, WebClient.get(), OkHttpClient Java inter-service HTTP calls not tracked
Go http.Get, resty.R().Get() consumer patterns Go consumer side absent

The fallback consumer detection source scan, lines 409–423 only scans .ts,.tsx,.js,.jsx,.vue,.svelte — Go, Python, Java, PHP consumers are never found without graph data. Fix this →

gRPC (grpc-extractor.ts)

Covered: Proto parsing (definitive), Go Register*Server/New*Client, Java @GrpcService/*ImplBase/*Grpc.newStub, Python add_*Servicer_to_server/*Stub, NestJS @GrpcMethod (server only).

Critical gap — TypeScript gRPC consumer is entirely missing:

// These are completely invisible to the extractor:
// grpc-js
const client = new proto.hr.UserService('host', credentials.createInsecure());
// NestJS @GrpcClient (consumer side)
@GrpcClient({ service: 'UserService', url: '...' })
private readonly userClient: ClientGrpc;
// NestJS ClientGrpc
const svc = this.userClient.getService<UserServiceClient>('UserService');

scanTsProviders handles @GrpcMethod only (server side). There is no scanTsConsumers. Given TypeScript/NestJS is a primary gRPC consumer framework in Node.js microservices, this is a significant real-world gap. Fix this →

Other gaps: No C# gRPC (GrpcChannel.ForAddress, new Service.ServiceClient()), no Ruby (Grpc::ClientStub), no Rust tonic patterns. These matter less than TypeScript but are worth documenting.

Proto import resolution not implemented: When a .proto uses import "other.proto" to reference types from another package, the buildProtoMap doesn't follow imports. This means services that split their types across files may get wrong or missing package names.

Topic/Messaging (topic-extractor.ts)

Covered well: Java Spring Kafka/RabbitMQ annotations, basic Node.js producer/consumer, Go sarama ConsumePartition, Python Kafka. This is the most complete extractor.

Notable gaps:

Missing Framework Impact
consumer.run({ eachMessage: ... }) kafkajs (most popular Node Kafka lib) Node.js Kafka consumers won't be detected as they use run() not subscribe()
kafka.NewWriter / kafka.NewReader segmentio/kafka-go Second most popular Go Kafka library
sarama.NewSyncProducer / sarama.NewAsyncProducer sarama (most popular Go Kafka) Go Kafka producers missing
js.Publish / js.Subscribe NATS JetStream JetStream is the modern NATS API, largely replacing core NATS
nats.connect / nc.subscribe (Python) nats-py NATS Python consumer absent
Redis Pub/Sub, AWS SQS/SNS, Azure Service Bus Common in enterprise microservices

The kafkajs gap is the most impactful — fix this →


2. Real-World Test Coverage: The Main Gap

The integration test (test/integration/group/group-impact.test.ts) is explicitly noted as "mocks localImpactFn / crossImpactFn. E2E with real graphs is a follow-up." This is the most important gap to close.

What exists today:

  • Synthetic in-memory registry with hardcoded cross-links
  • Single test scenario (upstream fan-out via UID match)
  • Test-monorepo fixture (test/fixtures/group/test-monorepo/) with 3 service stubs — not wired to any integration test

What's needed for real multi-repo / microservice validation:

A. Fixture-based integration test against real code

The test-monorepo fixture exists but isn't exercised end-to-end. A test like:

it('discovers gRPC link from gateway → auth across service boundaries', async () => {
  // 1. Index each service directory with real indexer
  // 2. Run group sync against the fixture
  // 3. Assert bridge.lbug contains expected ContractLink
  // 4. Run group impact on a gateway function
  // 5. Assert auth service appears in cross-repo hits
});

This requires running the real extractor pipeline — but it would catch bugs that the current mocked tests cannot.

B. Known-good open-source repo test

There are popular open-source microservice demos (e.g., Google's Online Boutique, Netflix's Conductor) that have well-documented service boundaries. A test that clones a specific commit and asserts specific cross-links would give confidence that the system works on real code.

C. Confidence degradation test

No test verifies that removing a .proto file degrades confidence from 0.8 to 0.65 for Go providers. No test exercises the wildcard→method match path end-to-end.


3. Correctness Bugs Found

Bug 1: outOfScope always empty in Cypher path (cross-impact.ts:417)

// runGroupImpact returns:
return {
  ...
  outOfScope: [],  // ← always empty, never populated
  ...
};

The legacy runGroupImpactLegacy populates outOfScope when a matched repo falls outside the subgroup filter. The Cypher-based runGroupImpact uses a WHERE clause to filter, so matches never come back — but the caller has no way to know which repos were filtered. This is a silent information loss that affects UX when --subgroup is used. Fix this →

Bug 2: Manifest cross-links silently dropped for gRPC/lib (manifest-extractor.ts:92)

private async resolveSymbol(...): Promise<...> {
  ...
  } else {
    return null;  // ← gRPC and lib types always return null
  }

When null is returned, symbolRef = { filePath: '', name: link.contract }. In bridge-db.ts the cross-link insert query matches nodes by c.filePath = $fp AND c.symbolName = $sn with fp = ''. Since no contract node will have an empty filePath, fromRows or toRows will be empty and the edge is silently skipped (line 209: if (!fromRows.length || !toRows.length) continue). Manifest-declared gRPC cross-links produce contracts but no ContractLink edges in the bridge DB, making group_impact miss them entirely.

Bug 3: resolveProtoConflict directory proximity heuristic fails for centralized proto repos

// Matches character-by-character, not path-segment-by-segment
for (let i = 0; i < min; i++) {
  if (sourceDir[i] === protoDir[i]) shared++;
  else break;
}

For a monorepo with protos/ at root and services in services/auth/, services/orders/, the paths share the same repo prefix. A source file at services/auth/server.go and proto at protos/auth/auth.proto will score lower than proto at protos/orders/orders.proto if "protos/auth" lexically matches less of the path than another option. The heuristic should compare normalized path segments, not characters. Fix this →


4. Minor Issues

phase1Timeout ignores user intent for large timeouts (cross-impact.ts:129/312): Math.min(5000, timeout) means users who set timeout: 60000 still only get 5s for Phase 1. For large graphs this might truncate the local impact result. Consider Math.min(timeout * 0.3, 10000) or make it configurable.

Deduplication key in GrpcExtractor uses file path (grpc-extractor.ts:491): ${c.contractId}|${c.role}|${c.symbolRef.filePath}. A file that registers both UserService and UserService (via two different patterns) would only keep one. This is correct but worth a comment.

Topic extractor scans .go but Kafka/NATS patterns use Go-style calls without Go file detection — the glob pattern includes *.go but many Go patterns like ConsumePartition may appear in test files. Consider filtering *_test.go.


5. What's Working Well

  • The bridge.lbug atomic write (temp→rename with .bak recovery) is solid for concurrent use
  • Wildcard gRPC matching (runWildcardMatch) correctly handles the case where a consumer doesn't have proto context — this is a real-world pattern
  • normalizeContractId handles HTTP method case and trailing slashes, gRPC package lowercasing correctly
  • Graph-first strategy in HttpRouteExtractor is the right approach — when graph data is available, use it; fall back to source scan only when needed
  • Test naming convention (test_verb_condition_expectation) is clean and consistent throughout

Recommended Next Steps (Prioritized)

  1. High — Add fixture-based end-to-end test using test-monorepo that runs the real extractor pipeline and asserts specific ContractLink edges in bridge.lbug
  2. High — Add TypeScript gRPC consumer detection (@GrpcClient, ClientGrpc, grpc-js new ServiceClient())
  3. High — Fix manifest cross-link silently dropped for gRPC/lib types (Bug 2)
  4. Medium — Add Go HTTP provider patterns (gin, echo, stdlib mux)
  5. Medium — Add kafkajs consumer.run({ eachMessage }) and sarama Go producer patterns
  6. Medium — Fix outOfScope always empty in Cypher path (Bug 1)
  7. Low — Fix directory proximity heuristic to use path segments (Bug 3)
  8. Low — Add NestJS @Controller/@Get provider detection

@abhigyanpatwari

Copy link
Copy Markdown
Owner

@ivkond can u look into these pls 🙏. Also great work, mostly clean implementation few easy bugs to bash

@ivkond ivkond force-pushed the feat/cross-index-impact-groups branch 2 times, most recently from daeac6a to 55089ae Compare April 11, 2026 08:35
Before this PR lands, I ran a systematic self-review of the cross-impact
module and fixed 13 correctness / robustness issues caught during it.
All changes target files added by this PR; nothing outside of group/*
or the new MCP group_impact tool schema is touched.

Everything below is in this single commit. The rest of the PR history
is unchanged.

Full group suite: all unit and integration test files green when run
individually (bridge-db.test.ts hits a pre-existing native LadybugDB
worker cleanup segfault that sometimes flakes the reported count —
documented in vitest.config.ts, unrelated to these fixes). CI on this
commit: 11/11 green across ubuntu/macos/windows + quality + typecheck
+ e2e.

═══════════════════════════════════════════════════════════════════════
HIGH severity — correctness bugs with regression tests
═══════════════════════════════════════════════════════════════════════

HIGH #1 — manifest-extractor.resolveSymbol was too fuzzy
───────────────────────────────────────────────────────────
resolveSymbol() used CONTAINS on route/name fields plus an unconditional
`filePath ENDS WITH '.proto'` fallback for gRPC. Consequences in
production:
  * HTTP: manifest "/orders" matched "/suborders", "/myorders/{id}"
  * gRPC: ANY repo with any .proto file returned a random proto symbol
    regardless of whether the service/method matched
  * topic/lib: similar CONTAINS false positives

Now all four types use EXACT equality on the relevant name field plus
deterministic ORDER BY before LIMIT 1. HTTP paths are canonicalized
(ensureSlash + trailing-slash strip) before matching Route.name via a
new normalizeRoutePath() helper. gRPC matches by structured
service/method name with NO .proto fallback.

To avoid cross-impact breakage when resolveSymbol now legitimately
returns null (the indexer may not expose a node with the exact method
name), unresolved manifest contracts now get a deterministic synthetic
symbolUid `manifest::<repo>::<contractId>` instead of an empty string.
This gives the bridge Cypher query a stable anchor on both provider
and consumer sides.

HIGH #2 — sync.ts per-extractor isolation
───────────────────────────────────────────────────────────
syncGroup() wrapped all three extractors (http, grpc, topic) plus
service-boundary detection plus meta.json read in ONE try/catch around
the whole repo loop. A single extractor throwing would mark the whole
repo as `missingRepos` (looking like "repo not found") and prevent the
other extractors from running.

Now each step is isolated with per-failure reporting:
  * initLbug failure → repo marked missing, labeled `init`
  * detectServiceBoundaries failure → empty boundaries, contracts still
    extracted without service attribution
  * each extractor failure → recorded with {repo, extractor, message};
    other extractors in the repo keep running
  * manifest extractor failure → recorded but doesn't fail the sync
  * bridge write partial failures → summarized as `bridge_write`

HIGH #3 — bridge-db.writeBridge per-item tolerance
───────────────────────────────────────────────────────────
writeBridge() had no per-item try/catch on CREATE loops. A single
malformed contract, snapshot, or link would abort the whole write. The
atomic swap protected the OLD bridge but threw away all valid inserts
from the current run.

Each CREATE is now wrapped individually. writeBridge() returns a new
WriteBridgeReport with insert/fail counts plus a capped sampleErrors
array. sync.ts surfaces non-zero failure counts via extractorFailures.
Dropped cross-links (findContractNode returning null) are counted
separately in linksDroppedMissingNode — previously silently continued.

HIGH #4 — writeBridge handle lifecycle
───────────────────────────────────────────────────────────
writeBridge() opened the tmp LadybugDB handle before the insert loops
but only closed it at the end of the happy path. If ensureBridgeSchema,
any queryBridge CREATE, findContractNode, or the contract/snapshot/link
loops threw, the handle leaked — which holds the native file lock on
`bridge.lbug.tmp`, leaking an FD and preventing the next sync from
reusing the tmp slot.

Wrapped everything from `ensureBridgeSchema(handle)` to the explicit
`closeBridgeDb(handle)` in try/finally with a `handleClosed` sentinel
so we never double-close on the happy path and always close on the
error path.

HIGH #5 — openBridgeDbReadOnly + openBridgeOrFallback leaks
───────────────────────────────────────────────────────────
Two leak paths:
  * openBridgeDbReadOnly: `new lbug.Database(...)` followed by
    `new lbug.Connection(db)` was wrapped in a single try/catch, but
    if Connection construction threw AFTER Database allocation, the
    Database was leaked (no teardown path for a partial handle).
  * openBridgeOrFallback (storage.ts): called `readBridgeMeta` AFTER
    the handle was opened; any error from meta.json parse would bubble
    up leaving the handle open on the caller.

Fixed:
  * openBridgeDbReadOnly now tracks `db` and `conn` separately and
    closes each of them on the error path.
  * openBridgeOrFallback now wraps `readBridgeMeta` in try/catch and
    explicitly calls closeBridgeDb (with .catch() swallow) before
    rethrowing.

HIGH #6 — groupContracts JSON.parse of contract meta
───────────────────────────────────────────────────────────
GroupService.groupContracts() (bridge path) reconstructed StoredContract
from flat rows via `JSON.parse(r.meta)`. A single corrupted meta row
(malformed JSON, non-string leftover from a schema migration) would
abort the whole listing with an unhandled exception.

Replaced with a `safeParseMeta` helper that returns `{}` on parse
failure and increments a `metaParseFailures` counter, which is
surfaced in the return value when non-zero so callers can log and
debug.

═══════════════════════════════════════════════════════════════════════
MEDIUM severity — robustness gaps, input validation, logging
═══════════════════════════════════════════════════════════════════════

MEDIUM #1 — service.groupImpact input range validation + safeLocalImpact
───────────────────────────────────────────────────────────
groupImpact() accepted any finite number for maxDepth, minConfidence,
timeout, crossDepth (including negative, zero, or 1e9). MCP is a
public surface; unbounded values were a DoS vector. Additionally,
`direction: 'upstreem'` (typo) silently became 'upstream'.

All five params are now strictly validated with explicit error messages:
  * direction ∈ {'upstream', 'downstream'}
  * maxDepth    ∈ [1, 10] integer
  * minConfidence ∈ [0, 1]
  * timeout     ∈ [100, 300000] ms
  * crossDepth  ∈ [0, 10] integer

Additionally: the localImpactFn callback called resolveGroupRepo()
which throws if the repo isn't in the group config. runPhase1WithTimeout
only catches timeouts, not rejections, so a missing-repo error would
crash the whole MCP handler. Wrapped in a safeLocalImpact helper that
catches and returns an error-shaped stub instead of bubbling past the
timeout guard.

MEDIUM #2 — Phase-1 timeout reason field
───────────────────────────────────────────────────────────
runGroupImpact()/runGroupImpactLegacy() stub for Phase-1 timeout used
impactedCount: 0, risk: 'LOW', byDepth: {} — indistinguishable from a
verified empty walk. Consumers treating these as real could miss
actual impact.

Added:
  * GroupImpactResult.truncationReason ('phase1_timeout' | 'wall_deadline')
  * phase1TimedOut: true marker inside the `local` stub
  * source repoPath pushed into truncatedRepos immediately on Phase-1
    timeout (was only done post-Phase-2)

MEDIUM #3 — grpc-extractor proto parser
───────────────────────────────────────────────────────────
extractServiceBlocks() used brace-depth counting on the raw .proto
content. Braces inside string literals or `// line` / `/* block */`
comments were counted as real body braces — a valid proto with
`option deprecated_reason = "use NewService { instead"` would have the
service body closed early, dropping methods after the offending string.
Previously marked as a known TODO but never fixed.

Added stripProtoCommentsAndStrings() helper that returns a sanitized
copy with offsets and line breaks preserved (comments/strings replaced
by spaces). extractServiceBlocks now scans the sanitized copy for
service headers and brace depth, but slices the returned body from
the original content to keep source text intact.

MEDIUM #4 — topic-extractor Kafka Go regex
───────────────────────────────────────────────────────────
Sarama patterns anchored on `sarama.NewSyncProducer[\s\S]{0,300}?Topic:`
had two failure modes:
  * In a loop with multiple ProducerMessage literals, only the FIRST
    Topic was captured — the engine's lastIndex advanced past the
    Topic match, and there was no second NewSyncProducer to re-anchor.
  * Multiple unrelated messages within 300 chars of the constructor
    would attribute the wrong topic to the wrong producer context.

Changed the sarama patterns to match the struct literal directly:
`sarama\.ProducerMessage\s*\{[\s\S]{0,200}?Topic:\s*"([^"]+)"`.
This captures every ProducerMessage regardless of producer scope.

MEDIUM #5 — cross-impact crossDepth clamp deduplication
───────────────────────────────────────────────────────────
`Math.min(1, …)` was duplicated at cross-impact.ts:149 (legacy) and
:354 (bridge path) PLUS at service.ts:252. Extracted the clamp into
MAX_SUPPORTED_CROSS_DEPTH with a comment pointing at service.ts so the
two places don't drift when multi-hop traversal lands.

MEDIUM #6 — matching.ts gRPC normalize comment
───────────────────────────────────────────────────────────
No behavior change. The original comment ("MVP: lowercase the whole
token; differs from pkg/method split") falsely implied inconsistency
between the no-slash and pkg/method branches. Rewritten to document
the wire-protocol rationale (pkg/service case-insensitive, RPC method
case-sensitive) and to explicitly note that package-only and
pkg/method ids are intentionally distinct canonical forms.

MEDIUM #7 — manifest-extractor silent catch
───────────────────────────────────────────────────────────
The Cypher executor try/catch dropped errors with `/* fall through */`.
A broken graph query in one repo would silently degrade resolveSymbol
to the synthetic-uid path with no diagnostic output. Added a
`console.warn` with the link type, contract name, repo key, and error
message.

MEDIUM #8 — writeBridgeMeta uses retryRename
───────────────────────────────────────────────────────────
writeBridgeMeta was using `fsp.rename` directly while writeBridge used
retryRename with exponential backoff for EBUSY/EPERM/EACCES. A
concurrent Windows reader could cause the meta.json rename to fail
transiently while the bridge.lbug swap survived. Pulled the rename
through retryRename so the meta write is at least as robust as the
bridge write it accompanies.

MEDIUM #9 — sync.ts extractor failure labels
───────────────────────────────────────────────────────────
The `ExtractorFailure.extractor` enum labeled initLbug failures as
'boundaries' and bridge-write failures as 'manifest' — both confusing
in production logs. Extended the type union with explicit 'init' and
'bridge_write' variants, moved initLbug failures to 'init', moved
writeBridge failure summary to 'bridge_write', and documented each
label kind on the new ExtractorKind type.

MEDIUM #10 — MCP schema bounds for group_impact
───────────────────────────────────────────────────────────
group_impact tool schema declared crossDepth, maxDepth, minConfidence,
timeout as bare `type: 'number'` — clients had no way to discover
valid ranges without trial-and-error. Added JSON Schema draft-7
`minimum`/`maximum` bounds mirroring the server-side validation in
GroupService.groupImpact(). Extended ToolDefinition interface to
allow the new keys.

MEDIUM #11 — typed GroupImpactResult.crossDepthWarning
───────────────────────────────────────────────────────────
service.ts's groupImpact was mutating the result via
`(result as unknown as Record<string, unknown>).crossDepthWarning = …`.
This cast defeated the type system. Added `crossDepthWarning?: string`
to GroupImpactResult in types.ts and removed the double-unknown cast
in both runGroupImpactLegacy and runGroupImpact code paths.

MEDIUM #12 — retryRename unit tests
───────────────────────────────────────────────────────────
retryRename had zero test coverage — critical code for Windows CI,
tested only implicitly via writeBridge on macOS/Linux (where EBUSY
never fires). Exported the function and added four vi.spyOn-based
tests: retries on EBUSY, rethrows non-retryable codes (ENOENT), gives
up after attempt count, retries on EACCES.

═══════════════════════════════════════════════════════════════════════
Intentionally deferred to follow-up PRs
═══════════════════════════════════════════════════════════════════════

  * AbortSignal plumbing for Phase-1 timeout — requires changing the
    port.impact / impactByUid interface across lbug-adapter and
    GroupToolPort. Too broad for a correctness fix batch; tracked
    as follow-up.
  * Duplication between runGroupImpact and runGroupImpactLegacy
    (~500 LOC that share ~70% of logic) — genuine maintenance burden
    but a dedicated refactor PR is the right tool, not this batch.
  * Low-impact items: CLI double-open of bridge, proto parser
    performance optimization, loadPackageDefinition scanner false
    positives, `catch(err: any)` → `unknown` cleanup. All safe to
    land separately.

═══════════════════════════════════════════════════════════════════════
Test deltas (all via --pool=forks, files run individually)
═══════════════════════════════════════════════════════════════════════

  bridge-db.test.ts          15 → 21   (+6: report shape, dropped
                                        links, 4× retryRename)
  sync.test.ts               10 → 11   (+1: initLbug failure path)
  manifest-extractor.test.ts  5 →  8   (+3: exact match, synthetic
                                        uid, regression cases)
  service.test.ts            25 → 31   (+6: range validation,
                                        safeLocalImpact wrap)
  grpc-extractor.test.ts     41 → 43   (+2: proto strings, comments)
  topic-extractor.test.ts    29 → 30   (+1: sarama loop regression)
  matching.test.ts           28 → 28   (comment change only)
  ──────────────────────────────────────
  NEW TESTS:                 +19
  GROUP INTEGRATION:         10/10 green (cli, impact, sync, monorepo)

Co-authored-by: Claude <noreply@anthropic.com>
@ivkond ivkond force-pushed the feat/cross-index-impact-groups branch from 14c0e7d to d15b8cb Compare April 11, 2026 13:34
@ivkond

ivkond commented Apr 11, 2026

Copy link
Copy Markdown
Contributor Author

@ivkond can u look into these pls 🙏. Also great work, mostly clean implementation few easy bugs to bash

Hi @abhigyanpatwari — heads-up and an offer.

Apologies for the commit noise on this PR. 14 commits from my side,
including two duplicated WIP iterations (restore out-of-scope reporting…
and expand grpc extraction… each landed twice during earlier review
rounds). I should have been squashing before each re-request for review.

After the last round I also ran a systematic self-review of the
cross-impact module and landed a batch of correctness / robustness
fixes in a single commit d15b8cb fix(group): self-review before merge.
The commit body has a severity-ordered breakdown with per-finding
rationale and test coverage — easy to skim or cherry-pick. Two larger
items are intentionally deferred to separate follow-up PRs:
AbortSignal plumbing for the Phase-1 timeout path, and the
runGroupImpact / runGroupImpactLegacy de-duplication.

Two options — whichever saves your review time

Option A — merge as-is with Squash & Merge. main ends up with
one clean commit regardless; the messy interim history stays only in
the PR.

Option B — I close this and open a fresh PR with a clean 3–4
commit history:

  1. feat(group): infrastructure, contract matching, service boundary detection
  2. feat(group): bridge.lbug storage + gRPC canonical IDs
  3. feat(group): http / grpc / topic extractor coverage across languages
  4. fix(group): self-review — correctness + robustness

Same content, same CI result, reviewer-friendly diffs per logical step.
The review thread here is lost but main stays clean. I can have it up
within an hour if that's easier for you.

No preference from my side — whichever works best for your review
cadence. Let me know.

@magyargergo

Copy link
Copy Markdown
Collaborator

@ivkond Thank you for your contribution! You've made an amazing impact with you changes. We would like to ask you to split this up, thus we can iterate on it much confidentially.

@ivkond

ivkond commented Apr 11, 2026

Copy link
Copy Markdown
Contributor Author

@ivkond Thank you for your contribution! You've made an amazing impact with you changes. We would like to ask you to split this up, thus we can iterate on it much confidentially.

@magyargergo Thanks — happy to do this. Here's what I'd propose; let me
know if the shape works before I start moving commits around.

Proposed split — 4 stacked PRs

Linear dependency chain, each PR has one clear responsibility. Every
self-review fix from d15b8cb travels with the code it patches, so
there's no orphan "fix" PR at the end.

  1. feat(group): bridge.lbug storage + contract matching foundation
    (~1.5k LOC + ~1.2k LOC tests)
    Files: bridge-db, matching, normalization, types, storage,
    config-parser, bridge-schema, contract-extractor interface.
    Pure infra, no user-facing surface yet.
    Base: main.

  2. feat(group): service boundary detection + per-language contract extractors
    (~2.5k LOC + ~2.5k LOC tests)
    Files: service-boundary-detector plus http-route-extractor,
    grpc-extractor, topic-extractor, manifest-extractor. Mostly
    regex/parser logic across Go, Java, Python, Node, TS.
    Base: PR 1 (for types only).

  3. feat(group): sync pipeline (~500 LOC + ~700 LOC tests)
    Files: sync.ts wiring extractors → matching → bridge write.
    Deliberately small and standalone so the pipeline wiring is easy
    to review in isolation from the algorithm.
    Base: PR 2.

  4. feat(group): cross-repo impact analysis + CLI + MCP tool
    (~1.5k LOC + ~1.5k LOC tests)
    Files: cross-impact.ts, GroupService (groupImpact +
    groupContracts + groupQuery), cli/group.ts, mcp/tools.ts
    (the group_impact tool entry), and the RFC doc
    2026-03-31-cross-index-impact-design.md. The user-visible feature.
    Base: PR 3.

PR 1 and PR 2 have no code-level coupling (only a type import) and
can be reviewed in parallel if you'd like.

Alternatives I considered

  • 3 PRs — merge 1+2 into a single "foundation" PR and 3+4 into
    a single "feature" PR. Fewer review rounds, but each PR ends up
    5k–8k LOC which largely defeats the point of splitting.
  • 5–7 PRs — further split the extractors per language or per
    broker. Adds a lot of ceremony without meaningfully reducing
    per-PR review load, and manifest-extractor imports types from the
    other extractors which would create awkward cross-dependencies.

Thanks for the quick review turnaround.

@magyargergo

Copy link
Copy Markdown
Collaborator

Please proceed one by one. I can see they have linear deps on them? Is there a way you could split them up, so they can be done parallel?

@ivkond

ivkond commented Apr 11, 2026

Copy link
Copy Markdown
Contributor Author

Good question. The only thing making PR 1 → PR 2 linear in my first
proposal is a type import; everything else is genuinely independent. I
can fix that by pulling a tiny "types + interface" PR ahead of them.
Here's the revised stack:

Revised split — 5 PRs, with PR 1 and PR 2 truly parallel

  • PR 0types.ts + contract-extractor.ts interface. Pure
    shared type definitions that everyone downstream imports. No logic,
    no user-facing surface. ~230 LOC.
    Base: main. Should be a trivial review.

  • PR 1bridge-db, bridge-schema, matching, normalization,
    storage, config-parser. Bridge storage infra + contract matching

    • group config parser. ~1.5k LOC + ~1.2k LOC tests.
      Base: PR 0. Parallel with PR 2.
  • PR 2service-boundary-detector + 4 contract extractors
    (HTTP / gRPC / topic / manifest). ~2.5k LOC + ~2.5k LOC tests.
    Base: PR 0. Parallel with PR 1.

  • PR 3sync.ts pipeline wiring extractors → matching → bridge
    write. ~500 LOC + ~700 LOC tests.
    Base: PR 1 + PR 2 (needs both merged).

  • PR 4cross-impact.ts, GroupService, cli/group.ts,
    mcp/tools.ts (group_impact), RFC doc
    2026-03-31-cross-index-impact-design.md. The user-visible feature.
    ~1.5k LOC + ~1.5k LOC tests.
    Base: PR 3.

So it's still "one at a time" in terms of review context-switching,
except for the PR 1 / PR 2 pair which you can take in whichever order. Each
PR has a clear, non-overlapping scope and none blocks on anything
other than its direct dependency.

I'll start on PR 0 as soon as you confirm the shape.

@magyargergo

Copy link
Copy Markdown
Collaborator

Nice! This sounds like a good plan! Please create tickets for them so we can track this progress

@ivkond

ivkond commented Apr 11, 2026

Copy link
Copy Markdown
Contributor Author

Done! Tickets for the split:

Each issue has full what / why / verify / risk sections per CONTRIBUTING.md, plus a scope-discipline note per GUARDRAILS.md confirming no drive-by refactors and no CI/release/security changes.

Dependency chain: #790 first, then #791 and #792 in parallel, then #793, then #794.

#606 will stay open until #794 merges, at which point it will be closed with backlinks to all 5.

Starting on #790 now.

@magyargergo

Copy link
Copy Markdown
Collaborator

Thank you for doing it! Let me know when you are ready with the first PR and I'll review it ASAP!

magyargergo pushed a commit that referenced this pull request Apr 11, 2026
…#606 split) (#795)

* feat(group): bridge.lbug storage + contract matching expansion

Part 1 of 4 in the split of #606 (ticket: #791, closes #790 with a
revised plan per @magyargergo's request).

## What changed

Adds the LadybugDB-backed bridge storage infrastructure and extends
the contract matching algorithm with wildcard support. All changes are
additive: storage.ts, sync.ts, service.ts, cli/group.ts, mcp/tools.ts
are left on their upstream main versions and will migrate to the new
bridge in follow-up PRs (#792, #793, #794).

### Files

**New (844 LOC prod):**
- `gitnexus/src/core/group/bridge-db.ts` — atomic write-to-temp with
  `retryRename` for Windows EBUSY/EPERM, per-item write tolerance via
  `WriteBridgeReport`, `findContractNode` with three-tier symbol
  lookup (uid → filePath+name → filePath)
- `gitnexus/src/core/group/bridge-schema.ts` — schema DDL
- `gitnexus/src/core/group/normalization.ts` — contract ID
  canonicalization + `dedupeContracts` / `dedupeCrossLinks` helpers
  used by both matching and bridge write

**Modified (+137 LOC prod):**
- `gitnexus/src/core/group/matching.ts` — adds `runWildcardMatch` for
  `grpc::Service/*` wildcard consumers, `buildProviderIndex` helper,
  and canonical gRPC ID handling in `normalizeContractId`
- `gitnexus/src/core/group/types.ts` — `MatchType` gains `'wildcard'`;
  new `BridgeHandle` and `BridgeMeta` interfaces

**New tests (658 LOC):**
- `gitnexus/test/unit/group/bridge-db.test.ts` — core write/read round
  trip, `WriteBridgeReport` shape, dropped-links counter, retryRename
  behavior on EBUSY/ENOENT/EPERM/EACCES
- `gitnexus/test/unit/group/bridge-db-edge.test.ts` — edge cases
  (malformed meta, missing contract nodes, concurrent access)

**Modified tests (+225 LOC):**
- `gitnexus/test/unit/group/matching.test.ts` — wildcard consumer
  matching, gRPC canonical ID handling, same-service guard

### Self-review fixes folded in

Carried forward from the original #606 self-review:
- `writeBridge` try/finally handle lifecycle + `handleClosed` sentinel
- `openBridgeDbReadOnly` partial-handle cleanup
- `writeBridgeMeta` uses `retryRename` for Windows consistency
- `retryRename` unit tests (was zero coverage)
- Per-item try/catch around every CREATE loop so one malformed contract
  doesn't abort the whole write
- Dropped cross-link counter (`linksDroppedMissingNode`)

### Why now

magyargergo asked for the #606 PR to be split so we can iterate with
confidence (#606 (comment)).
This is the foundational layer — pure infra, no user-facing surface,
no callers of the new APIs in this PR. Later PRs wire it in.

### How to verify

- `cd gitnexus && npx tsc --noEmit`
- `cd gitnexus && npx vitest run test/unit/group/bridge-db.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/bridge-db-edge.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/matching.test.ts --pool=forks`
- Pre-commit hook runs clean

### Risk / rollback

**Low.** All new code sits under `src/core/group/` in new files plus a
minimal `+16/-1` diff to `types.ts` and a `+136/-0` diff to `matching.ts`
(both purely additive). No existing callers reference the new APIs
(bridge-db, openBridgeOrFallback, runWildcardMatch) — the PRs that wire
them in come later in the split chain. Rollback = `git revert` of the
merge commit; no state introduced, no schema migration triggered.

### Scope discipline (per GUARDRAILS.md)

- Only the 8 files listed above are touched; no drive-by refactors
- No CI/release/security config changes
- No secrets, tokens, or machine-specific paths
- Content is lifted from the #606 branch which already passed CI 11/11
  green on `d15b8cb` (before the split)

### Dependencies

- **Base:** `main` (no dependencies on other split PRs)
- **Blocks:** extractor expansion (#792), sync pipeline (#793),
  cross-impact feature (#794)
- **Related ticket:** #791

Co-authored-by: Claude <noreply@anthropic.com>

* fix(group): address @claude review on #795

Addresses the findings from the automated review on PR #795
(#795 (comment)
— posted by @magyargergo / claude-code Action run).

### Medium severity (reviewer flagged as blockers)

- **bridge-db.ts `openBridgeDbReadOnly` bak recovery** — the `.bak`
  recovery path used bare `fsp.rename(bakPath, dbPath)`, which is
  exactly the scenario most likely to hit Windows EBUSY/EPERM (an
  interrupted writer still holding the handle for a few ms). Switched
  to `retryRename` for consistency with the rest of the file's
  Windows-safe rename path.
- **bridge-db.ts `ensureBridgeSchema` error detection** — the inline
  `msg.includes('already exists')` substring match has been lifted
  into a named constant `LBUG_ALREADY_EXISTS_MSG` with a comment
  documenting the coupling to LadybugDB's error message wording and
  why we can't use `IF NOT EXISTS` (LadybugDB DDL doesn't support it)
  or typed errors (LadybugDB's JS driver doesn't expose error codes).
  Also tightened the `catch (err: any)` to `catch (err: unknown)`.
- **bridge-db.ts `findContractNode` — extracted out of writeBridge**
  — the 35-line async closure living inside `writeBridge` has been
  lifted to three module-level functions: `createContractLookupIndex`,
  `indexContract`, and `findContractNode`. `findContractNode` is now
  a pure synchronous function taking a prebuilt index instead of
  doing its own DB queries. The `writeBridge` cross-link loop is now
  ~25 lines instead of ~100.
- **bridge-db.ts `findContractNode` — N+1 query elimination** — the
  old inner-closure version issued up to 6 DB round-trips per
  cross-link (2 endpoints × up to 3 tiers of fallback queries). For a
  group with 1000 cross-links, that's up to 6000 DB queries just to
  resolve endpoints. The new version consults an in-memory
  `ContractLookupIndex` built incrementally as contracts are inserted
  (`indexContract` called AFTER each successful insert so failed
  inserts don't poison the index). Cross-link resolution is now
  O(1) per link instead of O(3) DB queries per link, with zero DB
  round-trips during the cross-link loop.

### Minor severity

- **bridge-db.ts `queryBridge` empty-array guard** — if LadybugDB
  ever returns an empty `QueryResult[]` at the top level (shouldn't
  happen with single-statement calls, but driver contract isn't
  explicit), the old code would call `.getAll()` on `undefined` and
  crash with a confusing stack. Added an `unwrapQueryResult` helper
  that throws an explicit `'empty QueryResult array'` error instead,
  making a potential driver regression visible immediately.
- **normalization.ts `contractRichness` weights** — added a
  block-level comment documenting the weight ordering (+3 for
  symbolUid, +2 for each symbol-identifying field, +1 for service
  tag or non-manifest origin) and explicitly noting that the
  absolute numbers don't matter, only the relative ordering. Matches
  the "comment for contributors" suggestion in the review.
- **bridge-schema.ts `BRIDGE_SCHEMA_VERSION` migration comment** —
  added a 4-point contract explaining what bumping the constant
  means ("discard and re-sync" strategy for V1, no in-place
  migration yet, new migration logic should live in a separate
  `bridge-migrations.ts` module when it becomes necessary).
- **test/unit/group/fixtures.ts** — extracted the `makeContract`
  helper previously copy-pasted between `bridge-db.test.ts` and
  `bridge-db-edge.test.ts` into a shared fixtures module. Both test
  files now import from `./fixtures.js`. Kept the scope minimal:
  fixtures is NOT a general-purpose factory module, just the shared
  baseline contract builder.

### New tests

Added 9 pure-function unit tests for the now-extracted
`findContractNode` in `bridge-db.test.ts`:
  - returns null on empty index
  - tier 1 (symbolUid) match, including repo-scope and role-scope
    isolation
  - tier 2 (filePath + symbolName) fallback when symbolUid is empty
    or mismatches
  - tier 3 (filePath only) when exactly one contract lives in the
    file, and refusal when multiple do
  - priority ordering when multiple tiers could resolve

These are fully isolated — no DB, no temp directories, no native
LadybugDB binding — so they run in &lt;10ms total and are
immediately trustworthy as a regression safety net.

### Deliberately deferred (reviewer marked as "fine for now")

- `BridgeHandle._db` / `._conn` typing to `unknown` with casts in
  `bridge-db.ts` — reviewer's note: "The typing is fine for now."
- Batch inserts via `UNWIND` — needs LadybugDB support confirmation,
  tracked as a follow-up; the per-item pattern remains.
- `queryBridge` prepared-statement lifecycle — the current pattern
  (prepare → execute → GC) relies on LadybugDB's internals, worth
  verifying against their docs in a separate audit.

### Scope discipline (per `GUARDRAILS.md`)

- Only files touched by this PR (`bridge-db.ts`, `bridge-schema.ts`,
  `normalization.ts`, both bridge test files, new `fixtures.ts`) —
  no drive-by refactors
- No CI/release/security config changes
- No secrets

### Test + typecheck status

- `npx tsc --noEmit` clean
- `bridge-db.test.ts`: added 9 `findContractNode` tests, all pass in
  isolation. The full-file run still hits the pre-existing native
  LadybugDB cleanup segfault that flakes the reported count — same
  as every prior commit on this branch, not a regression.
- `bridge-db-edge.test.ts`: 4/4 pass
- `matching.test.ts`: 28/28 pass
- `types.test.ts`: 5/5 pass
- `retryRename` tests (4/4) and `findContractNode` tests (9/9)
  verified in isolation via `-t` filter

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
magyargergo pushed a commit that referenced this pull request Apr 13, 2026
…lit) (#796)

* feat(group): extractor expansion + manifest extractor

Part 2 of 4 in the split of #606 (ticket: #792). Follows #795
(bridge.lbug storage foundation, already merged), but this PR has no
code-level dependency on #795 — it only imports types and the
ContractExtractor interface that existed on upstream main before
either PR. It could have been reviewed in parallel with #795.

## What changed

Expands the 3 existing contract extractors with substantially more
language/framework coverage, and adds a new `manifest-extractor`
that resolves `group.yaml`-declared cross-links against the per-repo
graph via exact-name lookups.

### New file (228 LOC)

- `gitnexus/src/core/group/extractors/manifest-extractor.ts` —
  exact graph lookup for `group.yaml`-declared cross-links. HTTP
  paths are canonicalized before Route.name matching; gRPC is
  resolved by service/method name (NO `.proto`-filename fallback);
  topic and lib use exact-name match. Falls back to a synthetic
  `manifest::<repo>::<contractId>` uid when the graph has no
  matching symbol, so cross-impact traversal still has a stable
  anchor for the contract.

### Modified extractors (+958 LOC prod)

- `extractors/grpc-extractor.ts` (+522) — `.proto` parser with
  comment and string-literal sanitization (braces inside strings no
  longer truncate service bodies); package/service/method canonical
  IDs; server/client detection across Go (`grpc.NewServer`,
  `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`,
  `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and
  TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`).
- `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib
  `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI
  decorators, Java Spring `@RequestMapping`/`@GetMapping`,
  restTemplate / WebClient / OkHttp consumers.
- `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}`
  struct literal detection (replaces a constructor-anchored regex
  that missed topics inside producer loops), kafka-go Writer/Reader,
  Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream
  helpers.

### Modified and new tests (+1264 LOC)

- `grpc-extractor.test.ts` (+539) — full coverage of the new proto
  parser (strings-with-braces regression, comments-with-braces
  regression), per-language server/client detection
- `http-route-extractor.test.ts` (+240) — per-framework route
  extraction + normalization edge cases
- `topic-extractor.test.ts` (+177) — the sarama in-loop regression,
  JetStream, Python NATS, kafka-go Writer/Reader
- `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization,
  gRPC exact lookup with proto-fallback regression, lib and topic
  exact matching, synthetic-uid fallback behavior

### Self-review fixes folded in

Carried forward from the #606 self-review (commit `d15b8cb`):

- **HIGH #1** — `manifest-extractor.resolveSymbol` was too fuzzy.
  Previously used `CONTAINS` on route/name fields plus an
  unconditional `filePath ENDS WITH '.proto'` fallback for gRPC.
  Consequences: `/orders` matched `/suborders`, and any repo with
  any `.proto` file returned a random proto symbol for a gRPC
  manifest entry. Replaced with exact equality + deterministic
  `ORDER BY` + synthetic-uid fallback for unresolved manifests.
  Regression tests included.
- **MED #3** — gRPC proto parser brace-depth counting now sanitizes
  strings and comments first (`stripProtoCommentsAndStrings`). A
  valid proto with `option deprecated_reason = "use NewService {
  instead"` used to have its service body closed early by the `"{"`
  inside the literal, silently dropping methods after the offending
  string. Regression tests for both string-with-brace and
  comment-with-brace cases.
- **MED #4** — sarama Kafka regex changed from
  `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on
  constructor, caught only first topic in a loop) to
  `sarama.ProducerMessage{...Topic:}` (matches every struct literal
  directly). Regression test with a for-loop that constructs
  multiple `ProducerMessage`s.
- **MED #7** — `manifest-extractor.resolveSymbol` no longer has a
  silent `catch { /* fall through */ }`. Errors from the graph
  executor are logged via `console.warn` with link type, contract
  name, repo key, and error message before falling through to the
  synthetic-uid path.

## Why

Reviewer focus here is pure regex / parser correctness — no
storage, no Cypher queries, no algorithmic changes to the cross-link
algorithm. Separating this from the bridge foundation PR (#795)
meant reviewers could stay in a single mental mode (parsing logic)
instead of context-switching between DDL, Cypher, and regex.

## How to verify

- `cd gitnexus && npx tsc --noEmit`
- `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks`

Local pre-push: typecheck clean, all 99 extractor unit tests pass
(grpc 43, http 18, topic 30, manifest 8).

## Risk / rollback

**Low.** Extractors have no user-facing surface in this PR — they
produce `ExtractedContract[]` that is consumed by `sync.ts` in the
next split (#793). No existing behavior changes for users who don't
run a `group sync`. Rollback = `git revert` of the merge commit;
the modifications to `grpc-extractor.ts` / `http-route-extractor.ts`
/ `topic-extractor.ts` revert to the pre-PR versions that still
work (they're subsets of the new functionality).

## Scope discipline (per GUARDRAILS.md)

- Only the 8 files above are touched; no drive-by refactors
- No CI/release/security config changes
- No secrets or machine-specific paths
- Content lifted from #606 (CI 11/11 green on `d15b8cb`)

## Dependencies

- **Base:** `main` (upstream already includes #795 as `1ff324c`)
- **Blocks:** sync pipeline (#793) and the cross-impact feature (#794)
- **Tracker issue:** #792
- **Parent PR:** #606

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate topic-extractor from regex to tree-sitter queries

Addresses @magyargergo's feedback on #796 that regex-based lookups
should use tree-sitter nodes instead, and that the top-level
extractors must NOT carry language dependencies. This is phase 1 of
a multi-step migration — topic-extractor first because its patterns
are the most uniform (16 "call/annotation with first-arg string
literal" variants), which makes it a clean proof of the approach
before grpc-extractor and http-route-extractor get the same treatment.

## Architecture: language-agnostic orchestrator + per-language plugins

The top-level extractor is a thin orchestrator that never imports a
tree-sitter grammar or a query string. Per-language knowledge lives
in a new `topic-patterns/` folder with one file per language plus a
registry that maps file extensions to compiled plugins:

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared, language-agnostic scanning utilities
├── topic-extractor.ts              # thin orchestrator (no grammar imports)
└── topic-patterns/
    ├── types.ts                    # TopicMeta, Broker
    ├── index.ts                    # registry: extension → compiled provider
    ├── java.ts                     # tree-sitter-java + JAVA_TOPIC_PROVIDER
    ├── go.ts                       # tree-sitter-go + GO_TOPIC_PROVIDER
    ├── python.ts                   # tree-sitter-python + PYTHON_TOPIC_PROVIDER
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript
                                    # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER
```

**Shared scanner (`tree-sitter-scanner.ts`)** — defines
`PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>`
and the `scanFile(parser, plugin, content)` helper. Plugins compile their
queries eagerly at module load via `compilePatterns()`, so a broken
pattern fails loudly at import time instead of silently at scan time.
`unquoteLiteral()` handles single/double/template quotes, Python
triple-quoted strings, and Go raw backtick strings.

**Per-language plugins** own:
- the tree-sitter grammar import (this is the ONLY place in
  `src/core/group/` where tree-sitter grammars are imported),
- the query S-expressions,
- the `TopicMeta` payload (role, broker, confidence, symbolName) that
  the orchestrator receives back on every match.

Each plugin uses a `@value` capture name to bind the topic literal node.
The JavaScript and TypeScript grammars share AST node names for every
construct we query, so `node.ts` defines the pattern sources once and
compiles them against `JavaScript`, `TypeScript.typescript`, and
`TypeScript.tsx` — exporting three providers because `Parser.Query`
objects are NOT portable across grammar instances.

**Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider,
`.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX.
Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single
file-level edit (drop `topic-patterns/<lang>.ts`, import + register it
here — zero edits required in `topic-extractor.ts`).

**Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or
query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser,
provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`.
Reuses one `Parser` instance across files; the scanner calls
`setLanguage` per plugin.

## Why this is better than regex

1. **Comments and strings are respected for free.** The old regex
   would match `// kafkaTemplate.send("fake.topic")` as a real
   producer; tree-sitter never visits comments or string literals as
   code nodes, so false positives from commented-out code are
   eliminated.
2. **Struct/object literal patterns are structural, not textual.**
   `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char
   lookahead (which was a known cross-match bug partly mitigated by a
   loop regression test in the self-review). The new query matches a
   specific `composite_literal` with a specific `qualified_type` and
   `keyed_element` — exactly one struct literal per match.
3. **No order-of-operations fragility.** Regex for
   `channel.publish` vs `channel.consume` was independent and
   file-wide; the AST scopes matches to the specific `call_expression`.
4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic
   detection later means dropping one file in `topic-patterns/` — no
   changes to shared scanner or orchestrator, and no tree-sitter
   imports leak into top-level code.

## Per-file fault tolerance

- Malformed files that tree-sitter can't parse are silently skipped
  (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline
  already logs unparseable files at index time.
- A syntactically invalid query is caught at `compilePatterns` time,
  not scan time — broken plugins fail loudly at import.
- Per-pattern `matches()` failures are swallowed so one broken query
  in a plugin doesn't block the rest.

## Tests

All 30 existing `topic-extractor.test.ts` tests pass **without any
changes to the test file** — they were written as input/output contract
tests (given this source file, expect these `ExtractedContract` objects)
and that contract is unchanged. Regression coverage includes:

- Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node
  `producer.send` + `consumer.subscribe`; Go sarama producer/consumer
  (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` +
  `producer.send/produce`
- RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`;
  Node `channel.consume/publish/sendToQueue`; Python `basic_consume/
  basic_publish` with keyword args
- NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream
  `js.Subscribe/Publish`; Python `await nc.subscribe/publish`

Including the regression test for the sarama `ProducerMessage`
in-loop case — the AST-based query captures every literal in the
file independently, not just the first one after `NewSyncProducer`.

## Neighbor regression check

- `topic-extractor.test.ts` — 30/30 pass (rewritten extractor)
- `http-route-extractor.test.ts` — 18/18 pass (untouched)
- `grpc-extractor.test.ts` — 43/43 pass (untouched)
- `manifest-extractor.test.ts` — 8/8 pass (untouched)
- Full `npx tsc --noEmit` clean

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched; no
  changes to other extractors, tests, MCP surface, or pipeline.ts.
- No CI/release/security config changes, no secrets.
- New tree-sitter imports all reference grammars that are already
  installed as dependencies (`tree-sitter`, `tree-sitter-javascript`,
  `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`,
  `tree-sitter-go` — all in `package.json` for the existing pipeline).

## Phase 2 / phase 3 plan

- **Phase 2 (next commit):** rewrite `http-route-extractor.ts`
  Strategy B (regex fallback) on the same plugin pattern. Graph-assisted
  Strategy A stays as-is (already uses pipeline-built tree-sitter data
  via `HANDLES_ROUTE` Cypher queries).
- **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java /
  Go / Python / TypeScript detection. `.proto` files are the one
  outstanding question — there is no `tree-sitter-proto` grammar
  installed; the in-tree string-sanitizing parser stays as a pragmatic
  exception with a comment, alternative being to add
  `tree-sitter-proto` as a dep (open for the maintainer).

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins

Phase 2 of the extractor refactor requested by @magyargergo on #796.
Same architecture as the phase 1 topic-extractor rewrite: a thin,
language-agnostic orchestrator plus per-language plugins that own
tree-sitter grammars and query sources. The top-level extractor file
no longer imports any tree-sitter grammar or query string.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts          # shared, language-agnostic primitives
├── http-route-extractor.ts         # thin orchestrator (no grammar imports)
└── http-patterns/
    ├── types.ts                    # HttpDetection, HttpLanguagePlugin, HttpRole
    ├── index.ts                    # registry: ext → plugin + HTTP_SCAN_GLOB
    ├── java.ts                     # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp
    ├── go.ts                       # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers
    ├── python.ts                   # tree-sitter-python: FastAPI + requests
    ├── php.ts                      # tree-sitter-php: Laravel Route::get/...
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   NestJS controllers, Express, fetch, axios
```

**Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1:
- `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named
  capture the query binds, not just a single `@value`). Topic extractor
  updated to read `match.captures.value` accordingly.
- New `runCompiledPatterns(plugin, tree)` helper lets plugins run
  multiple query bundles against the same pre-parsed tree. This is
  needed for HTTP plugins that combine a class-prefix query with a
  method-route query (Spring, NestJS).
- `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`.

**HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a
`scan(tree)` function rather than a flat pattern list. This reflects
HTTP's more complex extraction: each detection needs method + path +
handler name, and framework patterns like Spring `@RequestMapping` /
NestJS `@Controller` require cross-referencing a class-level prefix
with method-level annotations. Plugins internally use
`compilePatterns` + `runCompiledPatterns` and walk the AST to resolve
the class/method relationships.

**Per-framework coverage:**

- **Java (`java.ts`)**
  - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put|
    Delete|Patch)Mapping("/sub")` method routes, joined via the
    enclosing `class_declaration` node id.
  - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` →
    method derived from API name.
  - `WebClient.method(HttpMethod.X, "/path")` → method from
    `HttpMethod.X` capture.
  - `new Request.Builder().url("/path")` → OkHttp consumer.

- **Go (`go.ts`)**
  - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures
    upper-case verb + handler identifier.
  - `net/http.HandleFunc("/path", handler)` → provider (default GET).
  - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`,
    resty `client.R().Get/Post/...`.

- **Python (`python.ts`)**
  - `@app.get("/path")` FastAPI decorators.
  - `requests.get/post/...` and `requests.request("METHOD", "url")`.

- **PHP (`php.ts`)**
  - Laravel `Route::get/post/.../patch('/path', ...)` via
    `scoped_call_expression`. Uses `PHP.php_only` to match the
    existing ingestion pipeline's grammar selection.

- **Node (`node.ts`) — JS + TS + TSX**
  - Pattern sources defined once, compiled against three grammar
    variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`)
    because `Parser.Query` objects are not portable across grammars.
    Exports three plugins sharing the same `scan` logic.
  - NestJS: `@Controller('prefix')` decorators are siblings of the
    class in `export_statement` / `program`; `@Get(':id')` decorators
    are siblings of the method in `class_body`. The plugin walks
    decorator → next named sibling to find the decorated class /
    method, then combines the class prefix with the method path.
    Only emits NestJS detections when the enclosing class has a real
    `@Controller` decorator — prevents false positives from generic
    classes that happen to use `@Get` from another library.
  - Express: `(router|app).<verb>('/path', ...)`.
  - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })`
    (uses two queries + a SyntaxNode-id dedupe set so URL literals
    aren't double-emitted by the options variant).
  - `axios.get/post/...`.

## Orchestrator changes

`http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers`
regex method and replaces them with a single source-scan loop that
delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator
still owns:

- **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`)
  — language-agnostic string processing shared by both strategies.
- **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS`
  Cypher queries) — unchanged in spirit. The only regex helpers it
  used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now
  replaced by a lookup against the plugin's detections for the same
  file: for each route row, find the detection whose normalized path
  matches, and pull the HTTP method + handler name from it.
- **Per-file parse cache** — the orchestrator parses each relevant
  file at most once per `extract()` call. Both the graph-assisted
  enrichment loop and the source-scan fallback share the same
  `cachedDetections` map, so we never run the plugin twice for the
  same file.

## Why this is better than the regex version

1. **Comments and strings for free.** The old regex would match
   `// router.get('/fake')` as a real Express route; tree-sitter
   never visits string/comment nodes.
2. **Structural controller-prefix.** Spring and NestJS class-prefix
   joining is now scoped to the enclosing class via `class_declaration`
   node ids, eliminating file-wide state that broke when a file had
   multiple controllers.
3. **Precise NestJS disambiguation.** The plugin only emits a NestJS
   detection when the enclosing class has a real `@Controller`
   decorator — the old regex would fire on any `@Get(...)` in the
   file regardless of surrounding context.
4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP
   detection later means dropping one file in `http-patterns/` — no
   changes to the shared scanner, the orchestrator, or the Strategy A
   Cypher queries.

## Tests

- `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged;
  they're contract-style input/output tests and the contract shape is
  unchanged). Covers Spring class prefix, Express, gin/echo, stdlib
  HandleFunc, NestJS, Laravel, FastAPI for providers and
  fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/
  resty for consumers, plus graph-first Strategy A for both.
- `topic-extractor.test.ts` — **30/30 pass** after the `captures.value`
  API migration.
- `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3).
- `manifest-extractor.test.ts` — 8/8 pass (untouched).
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No changes to pipeline.ts, MCP surface, ingestion, or tests.
- No CI / release / security / secrets changes.
- Tree-sitter grammars imported by plugins (`tree-sitter-java`,
  `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`,
  `tree-sitter-javascript`, `tree-sitter-typescript`) are all already
  in `package.json` for the existing ingestion pipeline.

## Phase 3 plan

- **grpc-extractor** gets the same treatment: plugin-per-language under
  `grpc-patterns/` for Java / Go / Python / TS detection. `.proto`
  files remain an open question — no `tree-sitter-proto` grammar is
  installed, so the in-tree string-sanitizing parser from PR #796's
  self-review stays as a pragmatic exception unless the maintainer
  wants us to add `tree-sitter-proto` as a new dep.

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate grpc-extractor source scans to tree-sitter plugins

Phase 3 (final) of the extractor refactor requested by @magyargergo on
#796. Same architecture as phase 1 (topic) and phase 2 (http): thin
language-agnostic orchestrator + per-language plugins that own
tree-sitter grammars and query sources. With this commit the top-level
extractors under `src/core/group/extractors/` import ZERO tree-sitter
grammars and ZERO query strings — every grammar import lives in a
`*-patterns/<lang>.ts` plugin file, and the orchestrators go through
the registry indirection.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared primitives (unchanged)
├── grpc-extractor.ts               # orchestrator (only `.proto` parser left)
└── grpc-patterns/
    ├── types.ts                    # GrpcDetection, GrpcLanguagePlugin, GrpcRole
    ├── index.ts                    # registry: ext → plugin + GRPC_SCAN_GLOB
    ├── go.ts                       # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient
    ├── java.ts                     # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub
    ├── python.ts                   # tree-sitter-python: add_XxxServicer_to_server + XxxStub
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   @GrpcMethod, @GrpcClient field type,
                                    #   .getService<X>('Svc'), new XxxServiceClient,
                                    #   loadPackageDefinition dynamic constructors
```

## Per-language coverage

**Go (`go.ts`)**
- Provider: `\w+.RegisterXxxServer(...)` via `call_expression →
  selector_expression → field_identifier` + JS regex filter
  `^Register(\w+)Server$`.
- Provider: `pb.UnimplementedXxxServer` embedded in a struct via
  `struct_type → field_declaration_list → field_declaration →
  qualified_type → type_identifier` + JS filter.
- Consumer: `\w+.NewXxxClient(...)` via the same call_expression
  query + JS filter `^New(\w+)Client$`.

**Java (`java.ts`)**
- Provider: `class X extends YyyGrpc.YyyImplBase` — two queries
  handle the scoped and plain forms. `scoped_type_identifier`'s
  children are positional (no `scope:`/`name:` fields), so the
  query matches the two `type_identifier` children by position.
- `#match? @inner "ImplBase$"` restricts matches at query time.
- Whether the class has `@GrpcService` or not controls only the
  `source` metadata label — the plugin walks the class_declaration's
  `modifiers` child in JS to detect the marker_annotation.
- Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a
  `method_invocation` query with `#match? @method
  "^new(Blocking)?Stub$"`, service name extracted via
  `^(\w+)Grpc$` on the object identifier.

**Python (`python.ts`)**
- Single call-expression query covers both bare identifier and
  `obj.method` attribute forms:
  `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`.
- Plugin filters `@fn.text` against two JS regexes:
  `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$`
  (consumer), with a reserved-names ignore list for the Stub case
  (Mock / Test / Fake / Stub).

**Node — JavaScript + TypeScript + TSX (`node.ts`)**
- Pattern sources defined once, compiled three times (one per grammar)
  because `Parser.Query` objects are not portable across grammars.
  Exports three `GrpcLanguagePlugin`s sharing the same `scan`.
- `@GrpcMethod('Service', 'Method')`: decorator query captures the
  two string literals. Confidence is hard-coded 0.8 regardless of
  proto map resolution (matches the original regex version's
  behaviour).
- `@GrpcClient(...) field: XxxServiceClient`: decorator query
  captures the decorator node, plugin walks up to find the enclosing
  `public_field_definition` (decorators on fields are CHILDREN of
  the field definition in tree-sitter-typescript, not siblings) and
  reads its first `type_annotation → type_identifier`, then runs the
  `^(\w+Service)Client$` JS filter.
- `client.getService<X>('AuthService')`: call-expression query on
  `member_expression.property = "getService"` + string literal arg.
- `new XxxServiceClient(...)`: `new_expression` with a bare
  identifier constructor, filtered by `^(\w+Service)Client$` so
  generic `new AuthClient(...)` (missing the `Service` infix) does
  NOT falsely register as a consumer. Preserves the regression test
  `test_extract_ts_non_service_client_constructor_is_ignored`.
- `loadPackageDefinition` dynamic loader: gated on
  `tree.rootNode.text.includes('loadPackageDefinition')`. When set,
  `new foo.bar.Xxx(...)` qualified constructors with a capitalised
  property name register as consumers.

## Orchestrator changes

`grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders`
/ ... helper and replaces them with a single source-scan loop that:

1. Parses each file with the plugin's grammar (one shared `Parser`
   instance across all files, `setLanguage` called per plugin).
2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`.
3. Converts each detection to an `ExtractedContract` via the private
   `detectionToContract` helper, which:
   - Looks the short service name up in the proto map (filled by
     the `.proto` parser).
   - Picks confidence = `confidenceWithProto` if resolved, else
     `confidenceWithoutProto`.
   - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when
     the detection carries a `methodName` (TS `@GrpcMethod` only),
     otherwise a service-level id (`grpc::pkg.Svc/*`).

Everything else — the `.proto` parser, `buildProtoContext`,
`buildProtoMap`, `resolveProtoConflict`, `serviceContractId`,
`stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe
function — stays exactly as before. The `.proto` parser is kept as a
pragmatic exception to the "no regex in extractors" rule because no
`tree-sitter-proto` grammar is installed in the repo; a comment at the
top of the file explains this and flags the maintainer option of
adding `tree-sitter-proto` as a dependency.

## Why this is better than the regex version

1. **Comments and strings are respected for free.** Matched node types
   are only code constructs, never text inside comments or string
   literals.
2. **No false positives on partial names.** The old `(\w+?)Grpc`-style
   regexes would cross-match unrelated identifiers; structural queries
   restrict matches to the exact AST shape (`scoped_type_identifier →
   type_identifier` pairs, `method_invocation → identifier` etc.).
3. **NestJS `@GrpcClient` is structural, not regex-based.** The old
   regex required a specific textual layout
   (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the
   plugin now walks the AST, so modifier order / optional modifiers /
   multi-line formatting don't break it.
4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC
   detection later is a one-file edit in `grpc-patterns/index.ts` —
   no touches to the shared scanner, the orchestrator, or the proto
   parser.

## Tests

- `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the
  contract shape is identical). Covers .proto parsing (including the
  brace-inside-string regression), Go provider/consumer,
  Java @GrpcService / plain ImplBase provider + newBlockingStub
  consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient +
  .getService + new XxxServiceClient + loadPackageDefinition + the
  `AuthClient` vs `AuthServiceClient` discrimination, dedupe across
  multiple patterns in one file, proto-aware confidence, and the
  inherited-package resolution for split proto definitions.
- `topic-extractor.test.ts` — 30/30 pass.
- `http-route-extractor.test.ts` — 18/18 pass.
- `manifest-extractor.test.ts` — 8/8 pass.
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No pipeline.ts, MCP surface, ingestion, CI / release / security, or
  test changes.
- New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`,
  `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`)
  are all already installed for the ingestion pipeline.

## End of phase series

This commit completes the three-phase extractor refactor:
  - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/`
  - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/`
  - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/`

Every remaining regex-based extractor helper under the `src/core/group/
extractors/` directory is either (a) language-agnostic string
processing (path normalization, dedupe keys) or (b) the `.proto`
parser, which is documented as an explicit exception.

Co-authored-by: Claude <noreply@anthropic.com>

* feat(group): add tree-sitter-proto for .proto file parsing

Addresses @magyargergo's suggestion on #796 to replace the manual
string-sanitizing .proto parser with a tree-sitter grammar.

- **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`.
  Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto)
  (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli
  0.24` to produce ABI version 14 — compatible with the project's
  `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as
  `optionalDependency` with `file:./vendor/tree-sitter-proto`.

- **New `grpc-patterns/proto.ts` plugin** — uses the same
  `compilePatterns` + `runCompiledPatterns` infrastructure as every
  other plugin. Two queries:
  - `(package (full_ident) @pkg)` — package declaration
  - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))`
    — one match per (service, rpc) pair

- **Graceful fallback** — `tree-sitter-proto` is an optional
  dependency. If it fails to install (platform incompatibility) or
  fails the runtime smoke-test (`setLanguage` + `parse` on a trivial
  proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator
  uses the existing manual parser. The smoke-test catches the
  `SyntaxNode` TDZ error that occurs in vitest's fork-based test
  runner.

- **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto`
  files are handled by the plugin loop (they're included in
  `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is
  skipped. `buildProtoContext` still runs to build the proto map
  for cross-referencing source-file detections.

1. **No manual comment/string stripping.** The old parser needed
   `stripProtoCommentsAndStrings` (110 lines) to avoid counting
   braces inside comments and string literals. tree-sitter handles
   this natively.
2. **No brace-depth tracking.** `extractServiceBlocks` used a manual
   depth counter to find service boundaries. tree-sitter's AST gives
   us `service` → `service_name` + `rpc` → `rpc_name` directly.
3. **Performance.** tree-sitter's C-based parser is faster than
   character-by-character JS scanning + regex on large proto files.

- `grpc-extractor.test.ts` — **43/43 pass** (unchanged)
- All other extractor tests — 99/99 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* chore: add .gitignore for vendored tree-sitter-proto build artifacts

https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU

* fix: correct .gitignore paths for vendored tree-sitter-proto

Patterns should be relative to the .gitignore file's directory.

https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU

* refactor(group): address Copilot review feedback on #796

Six fixes suggested by the Copilot AI review:

1. **`normalizeHttpPath` root-path edge case** — stripping trailing
   slashes on the input `/` produced an empty string, yielding
   malformed contract ids like `http::GET::`. Now preserves `/` for
   the root handler/fetch case.

2. **Dedupe `scanFiles` call** — `extract()` was globbing the
   source-scan file list twice (once for the provider fallback, once
   for the consumer fallback). Moved to a single lazy call that
   memoizes the result for the rest of the method.

3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other
   extractor's glob already ignored vendored sources; the HTTP one
   didn't. Fixed for consistency.

4. **`loadPackageDefinition` check is now structural** — was calling
   `tree.rootNode.text.includes('loadPackageDefinition')` which forces
   materialization of the entire file text from the parse tree
   (expensive on large files). Replaced with a dedicated compiled
   query on `(call_expression function: [(identifier) | (member_expression)])`
   so the check stays in the AST domain.

5. **`grpc-extractor.ts` header docstring updated** — still claimed
   ".proto parsing is not tree-sitter-based because no grammar is
   installed". Now describes the actual behaviour: tree-sitter when
   `tree-sitter-proto` is available (optionalDependency), manual
   fallback otherwise.

6. **Eliminated the double proto file parse on the fallback path** —
   `buildProtoContext` already globs + parses every `.proto` file to
   build `servicesByName`. On the `!hasProtoPlugin` branch the
   extractor was globbing + parsing again via the now-removed
   `parseProtoFile` helper. The fallback branch now iterates the map
   that `buildProtoContext` already produced to emit provider
   contracts directly — single pass per proto file.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): address Claude review feedback (bugs + dedup + hygiene) on #796

Follows up `2f28bfc` with the remaining items from the Claude AI review:

## Bugs

**Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.**
The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x`
with no label filter, so a topic/service/package name could silently match
any node type (File, Variable, Import, Folder, …). Added label filters:
- `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort
  symbol-name matches against listener/publisher symbols)
- `grpc` method → `(n:Function|Method)`
- `grpc` service → `(n:Class|Interface)`
- `lib` → `(n:Package|Module)`

All 8 manifest-extractor tests still pass (mock executor is
label-agnostic, but the production LadybugDB graph now gets correctly
scoped queries).

**Bug 8 — Tautological `!handlerName` condition.**
`http-route-extractor.ts:extractProvidersGraph` had
`let handlerName = null; if (!method || !handlerName) { ... }` — the
`!handlerName` clause was always true since there was no intervening
assignment. Simplified to always run the plugin-scan lookup (we need
the handler name even when `methodFromRouteReason` already resolved
the method).

## Clean code / dedup

**Design 7 — `readSafe` was copy-pasted in all three orchestrators.**
Extracted to `extractors/fs-utils.ts` as the single source of truth
for the path-traversal guard. Dropped the three local copies and the
now-unused `fs`/`path` imports from topic-extractor.

**Style 10 — Language-specific `_test.go` skip in the topic orchestrator.**
Was `if (rel.endsWith('_test.go')) continue;` inside the language-
agnostic extraction loop. Pushed into the glob's ignore list
(`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`,
`dist`, `build` entries, with a comment explaining that other
languages' test file conventions either live in separate directories
(Python `tests/`, Java `src/test/`) or are already covered by the
existing ignores.

## Already addressed in `2f28bfc` (mentioned again in Claude review)

- Bug 3: `normalizeHttpPath('/')` returns `''` — fixed
- Bug 4: double glob + double parse of `.proto` — fixed
- Bug 5: `scanFiles` called twice in HTTP — fixed
- Bug 6: missing `**/vendor/**` in HTTP glob — fixed
- Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')`
  replaced with a dedicated structural query

## Deferred

- Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope;
  sync.ts matching logic lands in #793, manifest extractor already
  emits correct synthetic uids for unresolved HTTP contracts.
- Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) —
  the only real use case (`loadPackageDefinition` gate) is already
  fixed via a structural query, so the interface change would be
  cosmetic churn without a concrete consumer.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* docs+fix(group): address remaining Claude review items + add pipeline flow chart

## Fixes

**Remaining 🔴 — HTTP contract id wildcard format.** Documented the
`http::*::<path>` format as an intentional wildcard for manifest links
that omit the HTTP method, alongside the explicit-method form
(`GET::/path` → `http::GET::/path`). The docblock on `buildContractId`
now states both forms, notes that wildcard-aware matching is the
responsibility of the sync / cross-impact layer (#793), and
recommends the explicit-method form whenever the author knows the
method (it round-trips through exact equality without needing
wildcard logic downstream). Tests unchanged — the wildcard format is
what they've always asserted.

**Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The
comment claimed "creates a contract with an empty symbolUid/ref" but
the code switched to `manifestSymbolUid(repo, contractId)` a few
commits back. Updated to describe the actual synthetic-uid fallback
semantics and the cross-impact path that relies on both sides of the
join deriving the same uid.

**Minor 2 — exhaustiveness guard on `buildContractId`.** The
`switch(type)` covered all five current `ContractType` variants but
silently returned `undefined` if a new variant was added. Added a
`default: const _exhaustive: never = type; throw new Error(...)`
clause so the build fails loudly on an unhandled variant.

**Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already
fixed in `2f28bfc` via a dedicated structural query
(`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed.

## New: pipeline flow chart (per @magyargergo's request)

Added `src/core/group/PIPELINE.md` with four Mermaid diagrams:
1. **High-level overview** — `group.yaml` → extractors + manifest →
   contract matching → `bridge.lbug` → `runGroupImpact`.
2. **Per-repo extractor two-strategy shape** — graph-assisted
   Strategy A vs. source-scan Strategy B.
3. **Plugin architecture** — orchestrator → registry →
   per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` →
   `ExtractedContract`.
4. **Manifest extraction** — label-scoped `resolveSymbol` with the
   synthetic-uid fallback.
5. **Cross-impact query (#606)** — local impact → bridge join →
   cross-repo fan-out.

Each diagram is annotated with which PRs own which stage (this PR:
extractors + manifest; #795: bridge storage; #606: cross-impact
runtime) and points at the concrete files/functions involved.

## Tests

- 99/99 extractor tests pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
@magyargergo magyargergo mentioned this pull request Apr 13, 2026
2 tasks
@ivkond

ivkond commented Apr 20, 2026

Copy link
Copy Markdown
Contributor Author

@abhigyanpatwari @magyargergo — closing this one.

The split plan agreed in #issuecomment-4229689301 is complete. Everything from this PR landed piecemeal via 4 follow-up PRs against the 5 tracking issues:

# Issue Merged via Scope
1 #790 #795 Shared types.ts + ContractExtractor interface (rolled into PR 2)
2 #791 #795 bridge.lbug storage + contract matching foundation
3 #792 #796 Service boundary detection + HTTP/gRPC/topic/manifest extractors
4 #793 #817 Sync pipeline + extractor contract-resolution fixes
5 #794 #984 group_impact / cross-repo traversal + @<group> MCP routing + group resources

Net effect on main matches the original intent of this PR, minus the items explicitly deferred along the way — tracked separately:

Happy to open follow-up issues for any of the above if useful.

Thanks for the patience with the split and for the reviews along the way 🙏 — closing without merge.

@ivkond ivkond closed this Apr 20, 2026
@ivkond ivkond deleted the feat/cross-index-impact-groups branch April 20, 2026 15:55
@magyargergo

Copy link
Copy Markdown
Collaborator

@ivkond thank you for this amazing work! You made a big impact on GitNexus! We appreciate your hard work. We have open positions for maintainers if you are interested.

Please join to our community on Discord.

github714801013 pushed a commit to github714801013/GitNexus that referenced this pull request Apr 28, 2026
…abhigyanpatwari#606 split) (abhigyanpatwari#795)

* feat(group): bridge.lbug storage + contract matching expansion

Part 1 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#791, closes abhigyanpatwari#790 with a
revised plan per @magyargergo's request).

## What changed

Adds the LadybugDB-backed bridge storage infrastructure and extends
the contract matching algorithm with wildcard support. All changes are
additive: storage.ts, sync.ts, service.ts, cli/group.ts, mcp/tools.ts
are left on their upstream main versions and will migrate to the new
bridge in follow-up PRs (abhigyanpatwari#792, abhigyanpatwari#793, abhigyanpatwari#794).

### Files

**New (844 LOC prod):**
- `gitnexus/src/core/group/bridge-db.ts` — atomic write-to-temp with
  `retryRename` for Windows EBUSY/EPERM, per-item write tolerance via
  `WriteBridgeReport`, `findContractNode` with three-tier symbol
  lookup (uid → filePath+name → filePath)
- `gitnexus/src/core/group/bridge-schema.ts` — schema DDL
- `gitnexus/src/core/group/normalization.ts` — contract ID
  canonicalization + `dedupeContracts` / `dedupeCrossLinks` helpers
  used by both matching and bridge write

**Modified (+137 LOC prod):**
- `gitnexus/src/core/group/matching.ts` — adds `runWildcardMatch` for
  `grpc::Service/*` wildcard consumers, `buildProviderIndex` helper,
  and canonical gRPC ID handling in `normalizeContractId`
- `gitnexus/src/core/group/types.ts` — `MatchType` gains `'wildcard'`;
  new `BridgeHandle` and `BridgeMeta` interfaces

**New tests (658 LOC):**
- `gitnexus/test/unit/group/bridge-db.test.ts` — core write/read round
  trip, `WriteBridgeReport` shape, dropped-links counter, retryRename
  behavior on EBUSY/ENOENT/EPERM/EACCES
- `gitnexus/test/unit/group/bridge-db-edge.test.ts` — edge cases
  (malformed meta, missing contract nodes, concurrent access)

**Modified tests (+225 LOC):**
- `gitnexus/test/unit/group/matching.test.ts` — wildcard consumer
  matching, gRPC canonical ID handling, same-service guard

### Self-review fixes folded in

Carried forward from the original abhigyanpatwari#606 self-review:
- `writeBridge` try/finally handle lifecycle + `handleClosed` sentinel
- `openBridgeDbReadOnly` partial-handle cleanup
- `writeBridgeMeta` uses `retryRename` for Windows consistency
- `retryRename` unit tests (was zero coverage)
- Per-item try/catch around every CREATE loop so one malformed contract
  doesn't abort the whole write
- Dropped cross-link counter (`linksDroppedMissingNode`)

### Why now

magyargergo asked for the abhigyanpatwari#606 PR to be split so we can iterate with
confidence (abhigyanpatwari#606 (comment)).
This is the foundational layer — pure infra, no user-facing surface,
no callers of the new APIs in this PR. Later PRs wire it in.

### How to verify

- `cd gitnexus && npx tsc --noEmit`
- `cd gitnexus && npx vitest run test/unit/group/bridge-db.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/bridge-db-edge.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/matching.test.ts --pool=forks`
- Pre-commit hook runs clean

### Risk / rollback

**Low.** All new code sits under `src/core/group/` in new files plus a
minimal `+16/-1` diff to `types.ts` and a `+136/-0` diff to `matching.ts`
(both purely additive). No existing callers reference the new APIs
(bridge-db, openBridgeOrFallback, runWildcardMatch) — the PRs that wire
them in come later in the split chain. Rollback = `git revert` of the
merge commit; no state introduced, no schema migration triggered.

### Scope discipline (per GUARDRAILS.md)

- Only the 8 files listed above are touched; no drive-by refactors
- No CI/release/security config changes
- No secrets, tokens, or machine-specific paths
- Content is lifted from the abhigyanpatwari#606 branch which already passed CI 11/11
  green on `d15b8cb` (before the split)

### Dependencies

- **Base:** `main` (no dependencies on other split PRs)
- **Blocks:** extractor expansion (abhigyanpatwari#792), sync pipeline (abhigyanpatwari#793),
  cross-impact feature (abhigyanpatwari#794)
- **Related ticket:** abhigyanpatwari#791

Co-authored-by: Claude <noreply@anthropic.com>

* fix(group): address @claude review on abhigyanpatwari#795

Addresses the findings from the automated review on PR abhigyanpatwari#795
(abhigyanpatwari#795 (comment)
— posted by @magyargergo / claude-code Action run).

### Medium severity (reviewer flagged as blockers)

- **bridge-db.ts `openBridgeDbReadOnly` bak recovery** — the `.bak`
  recovery path used bare `fsp.rename(bakPath, dbPath)`, which is
  exactly the scenario most likely to hit Windows EBUSY/EPERM (an
  interrupted writer still holding the handle for a few ms). Switched
  to `retryRename` for consistency with the rest of the file's
  Windows-safe rename path.
- **bridge-db.ts `ensureBridgeSchema` error detection** — the inline
  `msg.includes('already exists')` substring match has been lifted
  into a named constant `LBUG_ALREADY_EXISTS_MSG` with a comment
  documenting the coupling to LadybugDB's error message wording and
  why we can't use `IF NOT EXISTS` (LadybugDB DDL doesn't support it)
  or typed errors (LadybugDB's JS driver doesn't expose error codes).
  Also tightened the `catch (err: any)` to `catch (err: unknown)`.
- **bridge-db.ts `findContractNode` — extracted out of writeBridge**
  — the 35-line async closure living inside `writeBridge` has been
  lifted to three module-level functions: `createContractLookupIndex`,
  `indexContract`, and `findContractNode`. `findContractNode` is now
  a pure synchronous function taking a prebuilt index instead of
  doing its own DB queries. The `writeBridge` cross-link loop is now
  ~25 lines instead of ~100.
- **bridge-db.ts `findContractNode` — N+1 query elimination** — the
  old inner-closure version issued up to 6 DB round-trips per
  cross-link (2 endpoints × up to 3 tiers of fallback queries). For a
  group with 1000 cross-links, that's up to 6000 DB queries just to
  resolve endpoints. The new version consults an in-memory
  `ContractLookupIndex` built incrementally as contracts are inserted
  (`indexContract` called AFTER each successful insert so failed
  inserts don't poison the index). Cross-link resolution is now
  O(1) per link instead of O(3) DB queries per link, with zero DB
  round-trips during the cross-link loop.

### Minor severity

- **bridge-db.ts `queryBridge` empty-array guard** — if LadybugDB
  ever returns an empty `QueryResult[]` at the top level (shouldn't
  happen with single-statement calls, but driver contract isn't
  explicit), the old code would call `.getAll()` on `undefined` and
  crash with a confusing stack. Added an `unwrapQueryResult` helper
  that throws an explicit `'empty QueryResult array'` error instead,
  making a potential driver regression visible immediately.
- **normalization.ts `contractRichness` weights** — added a
  block-level comment documenting the weight ordering (+3 for
  symbolUid, +2 for each symbol-identifying field, +1 for service
  tag or non-manifest origin) and explicitly noting that the
  absolute numbers don't matter, only the relative ordering. Matches
  the "comment for contributors" suggestion in the review.
- **bridge-schema.ts `BRIDGE_SCHEMA_VERSION` migration comment** —
  added a 4-point contract explaining what bumping the constant
  means ("discard and re-sync" strategy for V1, no in-place
  migration yet, new migration logic should live in a separate
  `bridge-migrations.ts` module when it becomes necessary).
- **test/unit/group/fixtures.ts** — extracted the `makeContract`
  helper previously copy-pasted between `bridge-db.test.ts` and
  `bridge-db-edge.test.ts` into a shared fixtures module. Both test
  files now import from `./fixtures.js`. Kept the scope minimal:
  fixtures is NOT a general-purpose factory module, just the shared
  baseline contract builder.

### New tests

Added 9 pure-function unit tests for the now-extracted
`findContractNode` in `bridge-db.test.ts`:
  - returns null on empty index
  - tier 1 (symbolUid) match, including repo-scope and role-scope
    isolation
  - tier 2 (filePath + symbolName) fallback when symbolUid is empty
    or mismatches
  - tier 3 (filePath only) when exactly one contract lives in the
    file, and refusal when multiple do
  - priority ordering when multiple tiers could resolve

These are fully isolated — no DB, no temp directories, no native
LadybugDB binding — so they run in &lt;10ms total and are
immediately trustworthy as a regression safety net.

### Deliberately deferred (reviewer marked as "fine for now")

- `BridgeHandle._db` / `._conn` typing to `unknown` with casts in
  `bridge-db.ts` — reviewer's note: "The typing is fine for now."
- Batch inserts via `UNWIND` — needs LadybugDB support confirmation,
  tracked as a follow-up; the per-item pattern remains.
- `queryBridge` prepared-statement lifecycle — the current pattern
  (prepare → execute → GC) relies on LadybugDB's internals, worth
  verifying against their docs in a separate audit.

### Scope discipline (per `GUARDRAILS.md`)

- Only files touched by this PR (`bridge-db.ts`, `bridge-schema.ts`,
  `normalization.ts`, both bridge test files, new `fixtures.ts`) —
  no drive-by refactors
- No CI/release/security config changes
- No secrets

### Test + typecheck status

- `npx tsc --noEmit` clean
- `bridge-db.test.ts`: added 9 `findContractNode` tests, all pass in
  isolation. The full-file run still hits the pre-existing native
  LadybugDB cleanup segfault that flakes the reported count — same
  as every prior commit on this branch, not a regression.
- `bridge-db-edge.test.ts`: 4/4 pass
- `matching.test.ts`: 28/28 pass
- `types.test.ts`: 5/5 pass
- `retryRename` tests (4/4) and `findContractNode` tests (9/9)
  verified in isolation via `-t` filter

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
github714801013 pushed a commit to github714801013/GitNexus that referenced this pull request Apr 28, 2026
…npatwari#606 split) (abhigyanpatwari#796)

* feat(group): extractor expansion + manifest extractor

Part 2 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#792). Follows abhigyanpatwari#795
(bridge.lbug storage foundation, already merged), but this PR has no
code-level dependency on abhigyanpatwari#795 — it only imports types and the
ContractExtractor interface that existed on upstream main before
either PR. It could have been reviewed in parallel with abhigyanpatwari#795.

## What changed

Expands the 3 existing contract extractors with substantially more
language/framework coverage, and adds a new `manifest-extractor`
that resolves `group.yaml`-declared cross-links against the per-repo
graph via exact-name lookups.

### New file (228 LOC)

- `gitnexus/src/core/group/extractors/manifest-extractor.ts` —
  exact graph lookup for `group.yaml`-declared cross-links. HTTP
  paths are canonicalized before Route.name matching; gRPC is
  resolved by service/method name (NO `.proto`-filename fallback);
  topic and lib use exact-name match. Falls back to a synthetic
  `manifest::<repo>::<contractId>` uid when the graph has no
  matching symbol, so cross-impact traversal still has a stable
  anchor for the contract.

### Modified extractors (+958 LOC prod)

- `extractors/grpc-extractor.ts` (+522) — `.proto` parser with
  comment and string-literal sanitization (braces inside strings no
  longer truncate service bodies); package/service/method canonical
  IDs; server/client detection across Go (`grpc.NewServer`,
  `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`,
  `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and
  TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`).
- `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib
  `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI
  decorators, Java Spring `@RequestMapping`/`@GetMapping`,
  restTemplate / WebClient / OkHttp consumers.
- `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}`
  struct literal detection (replaces a constructor-anchored regex
  that missed topics inside producer loops), kafka-go Writer/Reader,
  Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream
  helpers.

### Modified and new tests (+1264 LOC)

- `grpc-extractor.test.ts` (+539) — full coverage of the new proto
  parser (strings-with-braces regression, comments-with-braces
  regression), per-language server/client detection
- `http-route-extractor.test.ts` (+240) — per-framework route
  extraction + normalization edge cases
- `topic-extractor.test.ts` (+177) — the sarama in-loop regression,
  JetStream, Python NATS, kafka-go Writer/Reader
- `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization,
  gRPC exact lookup with proto-fallback regression, lib and topic
  exact matching, synthetic-uid fallback behavior

### Self-review fixes folded in

Carried forward from the abhigyanpatwari#606 self-review (commit `d15b8cb`):

- **HIGH abhigyanpatwari#1** — `manifest-extractor.resolveSymbol` was too fuzzy.
  Previously used `CONTAINS` on route/name fields plus an
  unconditional `filePath ENDS WITH '.proto'` fallback for gRPC.
  Consequences: `/orders` matched `/suborders`, and any repo with
  any `.proto` file returned a random proto symbol for a gRPC
  manifest entry. Replaced with exact equality + deterministic
  `ORDER BY` + synthetic-uid fallback for unresolved manifests.
  Regression tests included.
- **MED abhigyanpatwari#3** — gRPC proto parser brace-depth counting now sanitizes
  strings and comments first (`stripProtoCommentsAndStrings`). A
  valid proto with `option deprecated_reason = "use NewService {
  instead"` used to have its service body closed early by the `"{"`
  inside the literal, silently dropping methods after the offending
  string. Regression tests for both string-with-brace and
  comment-with-brace cases.
- **MED abhigyanpatwari#4** — sarama Kafka regex changed from
  `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on
  constructor, caught only first topic in a loop) to
  `sarama.ProducerMessage{...Topic:}` (matches every struct literal
  directly). Regression test with a for-loop that constructs
  multiple `ProducerMessage`s.
- **MED abhigyanpatwari#7** — `manifest-extractor.resolveSymbol` no longer has a
  silent `catch { /* fall through */ }`. Errors from the graph
  executor are logged via `console.warn` with link type, contract
  name, repo key, and error message before falling through to the
  synthetic-uid path.

## Why

Reviewer focus here is pure regex / parser correctness — no
storage, no Cypher queries, no algorithmic changes to the cross-link
algorithm. Separating this from the bridge foundation PR (abhigyanpatwari#795)
meant reviewers could stay in a single mental mode (parsing logic)
instead of context-switching between DDL, Cypher, and regex.

## How to verify

- `cd gitnexus && npx tsc --noEmit`
- `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks`
- `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks`

Local pre-push: typecheck clean, all 99 extractor unit tests pass
(grpc 43, http 18, topic 30, manifest 8).

## Risk / rollback

**Low.** Extractors have no user-facing surface in this PR — they
produce `ExtractedContract[]` that is consumed by `sync.ts` in the
next split (abhigyanpatwari#793). No existing behavior changes for users who don't
run a `group sync`. Rollback = `git revert` of the merge commit;
the modifications to `grpc-extractor.ts` / `http-route-extractor.ts`
/ `topic-extractor.ts` revert to the pre-PR versions that still
work (they're subsets of the new functionality).

## Scope discipline (per GUARDRAILS.md)

- Only the 8 files above are touched; no drive-by refactors
- No CI/release/security config changes
- No secrets or machine-specific paths
- Content lifted from abhigyanpatwari#606 (CI 11/11 green on `d15b8cb`)

## Dependencies

- **Base:** `main` (upstream already includes abhigyanpatwari#795 as `f6fb87f`)
- **Blocks:** sync pipeline (abhigyanpatwari#793) and the cross-impact feature (abhigyanpatwari#794)
- **Tracker issue:** abhigyanpatwari#792
- **Parent PR:** abhigyanpatwari#606

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate topic-extractor from regex to tree-sitter queries

Addresses @magyargergo's feedback on abhigyanpatwari#796 that regex-based lookups
should use tree-sitter nodes instead, and that the top-level
extractors must NOT carry language dependencies. This is phase 1 of
a multi-step migration — topic-extractor first because its patterns
are the most uniform (16 "call/annotation with first-arg string
literal" variants), which makes it a clean proof of the approach
before grpc-extractor and http-route-extractor get the same treatment.

## Architecture: language-agnostic orchestrator + per-language plugins

The top-level extractor is a thin orchestrator that never imports a
tree-sitter grammar or a query string. Per-language knowledge lives
in a new `topic-patterns/` folder with one file per language plus a
registry that maps file extensions to compiled plugins:

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared, language-agnostic scanning utilities
├── topic-extractor.ts              # thin orchestrator (no grammar imports)
└── topic-patterns/
    ├── types.ts                    # TopicMeta, Broker
    ├── index.ts                    # registry: extension → compiled provider
    ├── java.ts                     # tree-sitter-java + JAVA_TOPIC_PROVIDER
    ├── go.ts                       # tree-sitter-go + GO_TOPIC_PROVIDER
    ├── python.ts                   # tree-sitter-python + PYTHON_TOPIC_PROVIDER
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript
                                    # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER
```

**Shared scanner (`tree-sitter-scanner.ts`)** — defines
`PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>`
and the `scanFile(parser, plugin, content)` helper. Plugins compile their
queries eagerly at module load via `compilePatterns()`, so a broken
pattern fails loudly at import time instead of silently at scan time.
`unquoteLiteral()` handles single/double/template quotes, Python
triple-quoted strings, and Go raw backtick strings.

**Per-language plugins** own:
- the tree-sitter grammar import (this is the ONLY place in
  `src/core/group/` where tree-sitter grammars are imported),
- the query S-expressions,
- the `TopicMeta` payload (role, broker, confidence, symbolName) that
  the orchestrator receives back on every match.

Each plugin uses a `@value` capture name to bind the topic literal node.
The JavaScript and TypeScript grammars share AST node names for every
construct we query, so `node.ts` defines the pattern sources once and
compiles them against `JavaScript`, `TypeScript.typescript`, and
`TypeScript.tsx` — exporting three providers because `Parser.Query`
objects are NOT portable across grammar instances.

**Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider,
`.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX.
Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single
file-level edit (drop `topic-patterns/<lang>.ts`, import + register it
here — zero edits required in `topic-extractor.ts`).

**Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or
query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser,
provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`.
Reuses one `Parser` instance across files; the scanner calls
`setLanguage` per plugin.

## Why this is better than regex

1. **Comments and strings are respected for free.** The old regex
   would match `// kafkaTemplate.send("fake.topic")` as a real
   producer; tree-sitter never visits comments or string literals as
   code nodes, so false positives from commented-out code are
   eliminated.
2. **Struct/object literal patterns are structural, not textual.**
   `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char
   lookahead (which was a known cross-match bug partly mitigated by a
   loop regression test in the self-review). The new query matches a
   specific `composite_literal` with a specific `qualified_type` and
   `keyed_element` — exactly one struct literal per match.
3. **No order-of-operations fragility.** Regex for
   `channel.publish` vs `channel.consume` was independent and
   file-wide; the AST scopes matches to the specific `call_expression`.
4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic
   detection later means dropping one file in `topic-patterns/` — no
   changes to shared scanner or orchestrator, and no tree-sitter
   imports leak into top-level code.

## Per-file fault tolerance

- Malformed files that tree-sitter can't parse are silently skipped
  (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline
  already logs unparseable files at index time.
- A syntactically invalid query is caught at `compilePatterns` time,
  not scan time — broken plugins fail loudly at import.
- Per-pattern `matches()` failures are swallowed so one broken query
  in a plugin doesn't block the rest.

## Tests

All 30 existing `topic-extractor.test.ts` tests pass **without any
changes to the test file** — they were written as input/output contract
tests (given this source file, expect these `ExtractedContract` objects)
and that contract is unchanged. Regression coverage includes:

- Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node
  `producer.send` + `consumer.subscribe`; Go sarama producer/consumer
  (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` +
  `producer.send/produce`
- RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`;
  Node `channel.consume/publish/sendToQueue`; Python `basic_consume/
  basic_publish` with keyword args
- NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream
  `js.Subscribe/Publish`; Python `await nc.subscribe/publish`

Including the regression test for the sarama `ProducerMessage`
in-loop case — the AST-based query captures every literal in the
file independently, not just the first one after `NewSyncProducer`.

## Neighbor regression check

- `topic-extractor.test.ts` — 30/30 pass (rewritten extractor)
- `http-route-extractor.test.ts` — 18/18 pass (untouched)
- `grpc-extractor.test.ts` — 43/43 pass (untouched)
- `manifest-extractor.test.ts` — 8/8 pass (untouched)
- Full `npx tsc --noEmit` clean

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched; no
  changes to other extractors, tests, MCP surface, or pipeline.ts.
- No CI/release/security config changes, no secrets.
- New tree-sitter imports all reference grammars that are already
  installed as dependencies (`tree-sitter`, `tree-sitter-javascript`,
  `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`,
  `tree-sitter-go` — all in `package.json` for the existing pipeline).

## Phase 2 / phase 3 plan

- **Phase 2 (next commit):** rewrite `http-route-extractor.ts`
  Strategy B (regex fallback) on the same plugin pattern. Graph-assisted
  Strategy A stays as-is (already uses pipeline-built tree-sitter data
  via `HANDLES_ROUTE` Cypher queries).
- **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java /
  Go / Python / TypeScript detection. `.proto` files are the one
  outstanding question — there is no `tree-sitter-proto` grammar
  installed; the in-tree string-sanitizing parser stays as a pragmatic
  exception with a comment, alternative being to add
  `tree-sitter-proto` as a dep (open for the maintainer).

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins

Phase 2 of the extractor refactor requested by @magyargergo on abhigyanpatwari#796.
Same architecture as the phase 1 topic-extractor rewrite: a thin,
language-agnostic orchestrator plus per-language plugins that own
tree-sitter grammars and query sources. The top-level extractor file
no longer imports any tree-sitter grammar or query string.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts          # shared, language-agnostic primitives
├── http-route-extractor.ts         # thin orchestrator (no grammar imports)
└── http-patterns/
    ├── types.ts                    # HttpDetection, HttpLanguagePlugin, HttpRole
    ├── index.ts                    # registry: ext → plugin + HTTP_SCAN_GLOB
    ├── java.ts                     # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp
    ├── go.ts                       # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers
    ├── python.ts                   # tree-sitter-python: FastAPI + requests
    ├── php.ts                      # tree-sitter-php: Laravel Route::get/...
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   NestJS controllers, Express, fetch, axios
```

**Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1:
- `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named
  capture the query binds, not just a single `@value`). Topic extractor
  updated to read `match.captures.value` accordingly.
- New `runCompiledPatterns(plugin, tree)` helper lets plugins run
  multiple query bundles against the same pre-parsed tree. This is
  needed for HTTP plugins that combine a class-prefix query with a
  method-route query (Spring, NestJS).
- `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`.

**HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a
`scan(tree)` function rather than a flat pattern list. This reflects
HTTP's more complex extraction: each detection needs method + path +
handler name, and framework patterns like Spring `@RequestMapping` /
NestJS `@Controller` require cross-referencing a class-level prefix
with method-level annotations. Plugins internally use
`compilePatterns` + `runCompiledPatterns` and walk the AST to resolve
the class/method relationships.

**Per-framework coverage:**

- **Java (`java.ts`)**
  - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put|
    Delete|Patch)Mapping("/sub")` method routes, joined via the
    enclosing `class_declaration` node id.
  - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` →
    method derived from API name.
  - `WebClient.method(HttpMethod.X, "/path")` → method from
    `HttpMethod.X` capture.
  - `new Request.Builder().url("/path")` → OkHttp consumer.

- **Go (`go.ts`)**
  - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures
    upper-case verb + handler identifier.
  - `net/http.HandleFunc("/path", handler)` → provider (default GET).
  - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`,
    resty `client.R().Get/Post/...`.

- **Python (`python.ts`)**
  - `@app.get("/path")` FastAPI decorators.
  - `requests.get/post/...` and `requests.request("METHOD", "url")`.

- **PHP (`php.ts`)**
  - Laravel `Route::get/post/.../patch('/path', ...)` via
    `scoped_call_expression`. Uses `PHP.php_only` to match the
    existing ingestion pipeline's grammar selection.

- **Node (`node.ts`) — JS + TS + TSX**
  - Pattern sources defined once, compiled against three grammar
    variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`)
    because `Parser.Query` objects are not portable across grammars.
    Exports three plugins sharing the same `scan` logic.
  - NestJS: `@Controller('prefix')` decorators are siblings of the
    class in `export_statement` / `program`; `@Get(':id')` decorators
    are siblings of the method in `class_body`. The plugin walks
    decorator → next named sibling to find the decorated class /
    method, then combines the class prefix with the method path.
    Only emits NestJS detections when the enclosing class has a real
    `@Controller` decorator — prevents false positives from generic
    classes that happen to use `@Get` from another library.
  - Express: `(router|app).<verb>('/path', ...)`.
  - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })`
    (uses two queries + a SyntaxNode-id dedupe set so URL literals
    aren't double-emitted by the options variant).
  - `axios.get/post/...`.

## Orchestrator changes

`http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers`
regex method and replaces them with a single source-scan loop that
delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator
still owns:

- **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`)
  — language-agnostic string processing shared by both strategies.
- **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS`
  Cypher queries) — unchanged in spirit. The only regex helpers it
  used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now
  replaced by a lookup against the plugin's detections for the same
  file: for each route row, find the detection whose normalized path
  matches, and pull the HTTP method + handler name from it.
- **Per-file parse cache** — the orchestrator parses each relevant
  file at most once per `extract()` call. Both the graph-assisted
  enrichment loop and the source-scan fallback share the same
  `cachedDetections` map, so we never run the plugin twice for the
  same file.

## Why this is better than the regex version

1. **Comments and strings for free.** The old regex would match
   `// router.get('/fake')` as a real Express route; tree-sitter
   never visits string/comment nodes.
2. **Structural controller-prefix.** Spring and NestJS class-prefix
   joining is now scoped to the enclosing class via `class_declaration`
   node ids, eliminating file-wide state that broke when a file had
   multiple controllers.
3. **Precise NestJS disambiguation.** The plugin only emits a NestJS
   detection when the enclosing class has a real `@Controller`
   decorator — the old regex would fire on any `@Get(...)` in the
   file regardless of surrounding context.
4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP
   detection later means dropping one file in `http-patterns/` — no
   changes to the shared scanner, the orchestrator, or the Strategy A
   Cypher queries.

## Tests

- `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged;
  they're contract-style input/output tests and the contract shape is
  unchanged). Covers Spring class prefix, Express, gin/echo, stdlib
  HandleFunc, NestJS, Laravel, FastAPI for providers and
  fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/
  resty for consumers, plus graph-first Strategy A for both.
- `topic-extractor.test.ts` — **30/30 pass** after the `captures.value`
  API migration.
- `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3).
- `manifest-extractor.test.ts` — 8/8 pass (untouched).
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No changes to pipeline.ts, MCP surface, ingestion, or tests.
- No CI / release / security / secrets changes.
- Tree-sitter grammars imported by plugins (`tree-sitter-java`,
  `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`,
  `tree-sitter-javascript`, `tree-sitter-typescript`) are all already
  in `package.json` for the existing ingestion pipeline.

## Phase 3 plan

- **grpc-extractor** gets the same treatment: plugin-per-language under
  `grpc-patterns/` for Java / Go / Python / TS detection. `.proto`
  files remain an open question — no `tree-sitter-proto` grammar is
  installed, so the in-tree string-sanitizing parser from PR abhigyanpatwari#796's
  self-review stays as a pragmatic exception unless the maintainer
  wants us to add `tree-sitter-proto` as a new dep.

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): migrate grpc-extractor source scans to tree-sitter plugins

Phase 3 (final) of the extractor refactor requested by @magyargergo on
abhigyanpatwari#796. Same architecture as phase 1 (topic) and phase 2 (http): thin
language-agnostic orchestrator + per-language plugins that own
tree-sitter grammars and query sources. With this commit the top-level
extractors under `src/core/group/extractors/` import ZERO tree-sitter
grammars and ZERO query strings — every grammar import lives in a
`*-patterns/<lang>.ts` plugin file, and the orchestrators go through
the registry indirection.

## Architecture

```
src/core/group/extractors/
├── tree-sitter-scanner.ts         # shared primitives (unchanged)
├── grpc-extractor.ts               # orchestrator (only `.proto` parser left)
└── grpc-patterns/
    ├── types.ts                    # GrpcDetection, GrpcLanguagePlugin, GrpcRole
    ├── index.ts                    # registry: ext → plugin + GRPC_SCAN_GLOB
    ├── go.ts                       # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient
    ├── java.ts                     # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub
    ├── python.ts                   # tree-sitter-python: add_XxxServicer_to_server + XxxStub
    └── node.ts                     # tree-sitter-javascript + tree-sitter-typescript:
                                    #   @GrpcMethod, @GrpcClient field type,
                                    #   .getService<X>('Svc'), new XxxServiceClient,
                                    #   loadPackageDefinition dynamic constructors
```

## Per-language coverage

**Go (`go.ts`)**
- Provider: `\w+.RegisterXxxServer(...)` via `call_expression →
  selector_expression → field_identifier` + JS regex filter
  `^Register(\w+)Server$`.
- Provider: `pb.UnimplementedXxxServer` embedded in a struct via
  `struct_type → field_declaration_list → field_declaration →
  qualified_type → type_identifier` + JS filter.
- Consumer: `\w+.NewXxxClient(...)` via the same call_expression
  query + JS filter `^New(\w+)Client$`.

**Java (`java.ts`)**
- Provider: `class X extends YyyGrpc.YyyImplBase` — two queries
  handle the scoped and plain forms. `scoped_type_identifier`'s
  children are positional (no `scope:`/`name:` fields), so the
  query matches the two `type_identifier` children by position.
- `#match? @inner "ImplBase$"` restricts matches at query time.
- Whether the class has `@GrpcService` or not controls only the
  `source` metadata label — the plugin walks the class_declaration's
  `modifiers` child in JS to detect the marker_annotation.
- Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a
  `method_invocation` query with `#match? @method
  "^new(Blocking)?Stub$"`, service name extracted via
  `^(\w+)Grpc$` on the object identifier.

**Python (`python.ts`)**
- Single call-expression query covers both bare identifier and
  `obj.method` attribute forms:
  `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`.
- Plugin filters `@fn.text` against two JS regexes:
  `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$`
  (consumer), with a reserved-names ignore list for the Stub case
  (Mock / Test / Fake / Stub).

**Node — JavaScript + TypeScript + TSX (`node.ts`)**
- Pattern sources defined once, compiled three times (one per grammar)
  because `Parser.Query` objects are not portable across grammars.
  Exports three `GrpcLanguagePlugin`s sharing the same `scan`.
- `@GrpcMethod('Service', 'Method')`: decorator query captures the
  two string literals. Confidence is hard-coded 0.8 regardless of
  proto map resolution (matches the original regex version's
  behaviour).
- `@GrpcClient(...) field: XxxServiceClient`: decorator query
  captures the decorator node, plugin walks up to find the enclosing
  `public_field_definition` (decorators on fields are CHILDREN of
  the field definition in tree-sitter-typescript, not siblings) and
  reads its first `type_annotation → type_identifier`, then runs the
  `^(\w+Service)Client$` JS filter.
- `client.getService<X>('AuthService')`: call-expression query on
  `member_expression.property = "getService"` + string literal arg.
- `new XxxServiceClient(...)`: `new_expression` with a bare
  identifier constructor, filtered by `^(\w+Service)Client$` so
  generic `new AuthClient(...)` (missing the `Service` infix) does
  NOT falsely register as a consumer. Preserves the regression test
  `test_extract_ts_non_service_client_constructor_is_ignored`.
- `loadPackageDefinition` dynamic loader: gated on
  `tree.rootNode.text.includes('loadPackageDefinition')`. When set,
  `new foo.bar.Xxx(...)` qualified constructors with a capitalised
  property name register as consumers.

## Orchestrator changes

`grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders`
/ ... helper and replaces them with a single source-scan loop that:

1. Parses each file with the plugin's grammar (one shared `Parser`
   instance across all files, `setLanguage` called per plugin).
2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`.
3. Converts each detection to an `ExtractedContract` via the private
   `detectionToContract` helper, which:
   - Looks the short service name up in the proto map (filled by
     the `.proto` parser).
   - Picks confidence = `confidenceWithProto` if resolved, else
     `confidenceWithoutProto`.
   - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when
     the detection carries a `methodName` (TS `@GrpcMethod` only),
     otherwise a service-level id (`grpc::pkg.Svc/*`).

Everything else — the `.proto` parser, `buildProtoContext`,
`buildProtoMap`, `resolveProtoConflict`, `serviceContractId`,
`stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe
function — stays exactly as before. The `.proto` parser is kept as a
pragmatic exception to the "no regex in extractors" rule because no
`tree-sitter-proto` grammar is installed in the repo; a comment at the
top of the file explains this and flags the maintainer option of
adding `tree-sitter-proto` as a dependency.

## Why this is better than the regex version

1. **Comments and strings are respected for free.** Matched node types
   are only code constructs, never text inside comments or string
   literals.
2. **No false positives on partial names.** The old `(\w+?)Grpc`-style
   regexes would cross-match unrelated identifiers; structural queries
   restrict matches to the exact AST shape (`scoped_type_identifier →
   type_identifier` pairs, `method_invocation → identifier` etc.).
3. **NestJS `@GrpcClient` is structural, not regex-based.** The old
   regex required a specific textual layout
   (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the
   plugin now walks the AST, so modifier order / optional modifiers /
   multi-line formatting don't break it.
4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC
   detection later is a one-file edit in `grpc-patterns/index.ts` —
   no touches to the shared scanner, the orchestrator, or the proto
   parser.

## Tests

- `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the
  contract shape is identical). Covers .proto parsing (including the
  brace-inside-string regression), Go provider/consumer,
  Java @GrpcService / plain ImplBase provider + newBlockingStub
  consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient +
  .getService + new XxxServiceClient + loadPackageDefinition + the
  `AuthClient` vs `AuthServiceClient` discrimination, dedupe across
  multiple patterns in one file, proto-aware confidence, and the
  inherited-package resolution for split proto definitions.
- `topic-extractor.test.ts` — 30/30 pass.
- `http-route-extractor.test.ts` — 18/18 pass.
- `manifest-extractor.test.ts` — 8/8 pass.
- `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass.
- `npx tsc -p tsconfig.json --noEmit` clean.

## Scope discipline (per GUARDRAILS.md)

- Only files under `src/core/group/extractors/` are touched.
- No pipeline.ts, MCP surface, ingestion, CI / release / security, or
  test changes.
- New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`,
  `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`)
  are all already installed for the ingestion pipeline.

## End of phase series

This commit completes the three-phase extractor refactor:
  - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/`
  - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/`
  - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/`

Every remaining regex-based extractor helper under the `src/core/group/
extractors/` directory is either (a) language-agnostic string
processing (path normalization, dedupe keys) or (b) the `.proto`
parser, which is documented as an explicit exception.

Co-authored-by: Claude <noreply@anthropic.com>

* feat(group): add tree-sitter-proto for .proto file parsing

Addresses @magyargergo's suggestion on abhigyanpatwari#796 to replace the manual
string-sanitizing .proto parser with a tree-sitter grammar.

- **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`.
  Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto)
  (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli
  0.24` to produce ABI version 14 — compatible with the project's
  `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as
  `optionalDependency` with `file:./vendor/tree-sitter-proto`.

- **New `grpc-patterns/proto.ts` plugin** — uses the same
  `compilePatterns` + `runCompiledPatterns` infrastructure as every
  other plugin. Two queries:
  - `(package (full_ident) @pkg)` — package declaration
  - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))`
    — one match per (service, rpc) pair

- **Graceful fallback** — `tree-sitter-proto` is an optional
  dependency. If it fails to install (platform incompatibility) or
  fails the runtime smoke-test (`setLanguage` + `parse` on a trivial
  proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator
  uses the existing manual parser. The smoke-test catches the
  `SyntaxNode` TDZ error that occurs in vitest's fork-based test
  runner.

- **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto`
  files are handled by the plugin loop (they're included in
  `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is
  skipped. `buildProtoContext` still runs to build the proto map
  for cross-referencing source-file detections.

1. **No manual comment/string stripping.** The old parser needed
   `stripProtoCommentsAndStrings` (110 lines) to avoid counting
   braces inside comments and string literals. tree-sitter handles
   this natively.
2. **No brace-depth tracking.** `extractServiceBlocks` used a manual
   depth counter to find service boundaries. tree-sitter's AST gives
   us `service` → `service_name` + `rpc` → `rpc_name` directly.
3. **Performance.** tree-sitter's C-based parser is faster than
   character-by-character JS scanning + regex on large proto files.

- `grpc-extractor.test.ts` — **43/43 pass** (unchanged)
- All other extractor tests — 99/99 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* chore: add .gitignore for vendored tree-sitter-proto build artifacts

https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU

* fix: correct .gitignore paths for vendored tree-sitter-proto

Patterns should be relative to the .gitignore file's directory.

https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU

* refactor(group): address Copilot review feedback on abhigyanpatwari#796

Six fixes suggested by the Copilot AI review:

1. **`normalizeHttpPath` root-path edge case** — stripping trailing
   slashes on the input `/` produced an empty string, yielding
   malformed contract ids like `http::GET::`. Now preserves `/` for
   the root handler/fetch case.

2. **Dedupe `scanFiles` call** — `extract()` was globbing the
   source-scan file list twice (once for the provider fallback, once
   for the consumer fallback). Moved to a single lazy call that
   memoizes the result for the rest of the method.

3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other
   extractor's glob already ignored vendored sources; the HTTP one
   didn't. Fixed for consistency.

4. **`loadPackageDefinition` check is now structural** — was calling
   `tree.rootNode.text.includes('loadPackageDefinition')` which forces
   materialization of the entire file text from the parse tree
   (expensive on large files). Replaced with a dedicated compiled
   query on `(call_expression function: [(identifier) | (member_expression)])`
   so the check stays in the AST domain.

5. **`grpc-extractor.ts` header docstring updated** — still claimed
   ".proto parsing is not tree-sitter-based because no grammar is
   installed". Now describes the actual behaviour: tree-sitter when
   `tree-sitter-proto` is available (optionalDependency), manual
   fallback otherwise.

6. **Eliminated the double proto file parse on the fallback path** —
   `buildProtoContext` already globs + parses every `.proto` file to
   build `servicesByName`. On the `!hasProtoPlugin` branch the
   extractor was globbing + parsing again via the now-removed
   `parseProtoFile` helper. The fallback branch now iterates the map
   that `buildProtoContext` already produced to emit provider
   contracts directly — single pass per proto file.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* refactor(group): address Claude review feedback (bugs + dedup + hygiene) on abhigyanpatwari#796

Follows up `2f28bfc` with the remaining items from the Claude AI review:

## Bugs

**Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.**
The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x`
with no label filter, so a topic/service/package name could silently match
any node type (File, Variable, Import, Folder, …). Added label filters:
- `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort
  symbol-name matches against listener/publisher symbols)
- `grpc` method → `(n:Function|Method)`
- `grpc` service → `(n:Class|Interface)`
- `lib` → `(n:Package|Module)`

All 8 manifest-extractor tests still pass (mock executor is
label-agnostic, but the production LadybugDB graph now gets correctly
scoped queries).

**Bug 8 — Tautological `!handlerName` condition.**
`http-route-extractor.ts:extractProvidersGraph` had
`let handlerName = null; if (!method || !handlerName) { ... }` — the
`!handlerName` clause was always true since there was no intervening
assignment. Simplified to always run the plugin-scan lookup (we need
the handler name even when `methodFromRouteReason` already resolved
the method).

## Clean code / dedup

**Design 7 — `readSafe` was copy-pasted in all three orchestrators.**
Extracted to `extractors/fs-utils.ts` as the single source of truth
for the path-traversal guard. Dropped the three local copies and the
now-unused `fs`/`path` imports from topic-extractor.

**Style 10 — Language-specific `_test.go` skip in the topic orchestrator.**
Was `if (rel.endsWith('_test.go')) continue;` inside the language-
agnostic extraction loop. Pushed into the glob's ignore list
(`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`,
`dist`, `build` entries, with a comment explaining that other
languages' test file conventions either live in separate directories
(Python `tests/`, Java `src/test/`) or are already covered by the
existing ignores.

## Already addressed in `2f28bfc` (mentioned again in Claude review)

- Bug 3: `normalizeHttpPath('/')` returns `''` — fixed
- Bug 4: double glob + double parse of `.proto` — fixed
- Bug 5: `scanFiles` called twice in HTTP — fixed
- Bug 6: missing `**/vendor/**` in HTTP glob — fixed
- Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')`
  replaced with a dedicated structural query

## Deferred

- Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope;
  sync.ts matching logic lands in abhigyanpatwari#793, manifest extractor already
  emits correct synthetic uids for unresolved HTTP contracts.
- Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) —
  the only real use case (`loadPackageDefinition` gate) is already
  fixed via a structural query, so the interface change would be
  cosmetic churn without a concrete consumer.

## Tests

- `topic-extractor.test.ts` — 30/30 pass
- `http-route-extractor.test.ts` — 18/18 pass
- `grpc-extractor.test.ts` — 43/43 pass
- `manifest-extractor.test.ts` — 8/8 pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

* docs+fix(group): address remaining Claude review items + add pipeline flow chart

## Fixes

**Remaining 🔴 — HTTP contract id wildcard format.** Documented the
`http::*::<path>` format as an intentional wildcard for manifest links
that omit the HTTP method, alongside the explicit-method form
(`GET::/path` → `http::GET::/path`). The docblock on `buildContractId`
now states both forms, notes that wildcard-aware matching is the
responsibility of the sync / cross-impact layer (abhigyanpatwari#793), and
recommends the explicit-method form whenever the author knows the
method (it round-trips through exact equality without needing
wildcard logic downstream). Tests unchanged — the wildcard format is
what they've always asserted.

**Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The
comment claimed "creates a contract with an empty symbolUid/ref" but
the code switched to `manifestSymbolUid(repo, contractId)` a few
commits back. Updated to describe the actual synthetic-uid fallback
semantics and the cross-impact path that relies on both sides of the
join deriving the same uid.

**Minor 2 — exhaustiveness guard on `buildContractId`.** The
`switch(type)` covered all five current `ContractType` variants but
silently returned `undefined` if a new variant was added. Added a
`default: const _exhaustive: never = type; throw new Error(...)`
clause so the build fails loudly on an unhandled variant.

**Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already
fixed in `2f28bfc` via a dedicated structural query
(`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed.

## New: pipeline flow chart (per @magyargergo's request)

Added `src/core/group/PIPELINE.md` with four Mermaid diagrams:
1. **High-level overview** — `group.yaml` → extractors + manifest →
   contract matching → `bridge.lbug` → `runGroupImpact`.
2. **Per-repo extractor two-strategy shape** — graph-assisted
   Strategy A vs. source-scan Strategy B.
3. **Plugin architecture** — orchestrator → registry →
   per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` →
   `ExtractedContract`.
4. **Manifest extraction** — label-scoped `resolveSymbol` with the
   synthetic-uid fallback.
5. **Cross-impact query (abhigyanpatwari#606)** — local impact → bridge join →
   cross-repo fan-out.

Each diagram is annotated with which PRs own which stage (this PR:
extractors + manifest; abhigyanpatwari#795: bridge storage; abhigyanpatwari#606: cross-impact
runtime) and points at the concrete files/functions involved.

## Tests

- 99/99 extractor tests pass
- `npx tsc -p tsconfig.json --noEmit` clean

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants