[group] Cross-repo impact analysis via repository groups by ivkond · Pull Request #606 · abhigyanpatwari/GitNexus

ivkond · 2026-03-31T14:58:50Z

Summary

Depends on: #625 (intra-repo service tracking). This PR will be rebased once #625 merges.

Adds cross-repository impact analysis on top of the group infrastructure from #625. Enables blast radius analysis that traces through contract links (HTTP routes, gRPC, Kafka topics) across repository boundaries.

What changed:

cross-impact.ts — cross-repo blast radius traversal via contract registry
manifest-extractor.ts — manifest-declared cross-links from group.yaml
groupImpact() method on GroupService
group impact CLI command with --target, --repo, --direction, --cross-depth
group_impact MCP tool
RFC design document (docs/specs/2026-03-31-cross-index-impact-design.md)

Why:
Completes the cross-repo analysis story started in #625. With intra-repo service tracking proven, this PR adds the virtual graph layer that connects repositories via their contract interfaces — enabling questions like "if I change this function, which other repos are affected?"

How to verify:

cd gitnexus
npx tsc --noEmit
npm run test:unit
npm run test:integration
# Cross-repo impact test:
npx vitest run test/integration/group/group-impact.test.ts

Risk / rollback:

Adds cross-impact.ts and manifest-extractor.ts — new files, no existing code modified
Enhances sync.ts and service.ts with manifest extraction path (additive)
Multi-hop traversal intentionally capped at depth 1 (safe default)
Rollback: revert the PR

Test plan

Unit tests — cross-impact (risk scoring, fan-out, subgroup filtering)
Unit tests — manifest-extractor (provider/consumer contracts, cross-links)
Integration test — group-impact (local → cross-repo traversal)
Full regression: all existing tests pass

🤖 Generated with Claude Code

vercel · 2026-03-31T14:59:39Z

@ivkond is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

ivkond · 2026-03-31T15:03:08Z

Hi @abhigyanpatwari! This is a proposal PR for cross-repository impact analysis (#256, #306, #77). I'd appreciate your feedback on:

Whether this direction aligns with your vision for GitNexus
The architecture choice (hybrid lazy virtual graph vs unified super-graph)
Any concerns about the scope or implementation approach

This is a demo/proof-of-concept — BM25/embedding matching, gRPC/topic extractors, and multi-hop traversal are intentionally deferred. Happy to iterate based on your feedback.

github-actions · 2026-03-31T15:07:52Z

CI Report

✅ All checks passed

Pipeline Status

Stage	Status	Details
✅ Typecheck	`success`	tsc --noEmit
✅ Tests	`success`	unit tests, 3 platforms
✅ E2E	`success`	gitnexus-web changes only

Test Results

Tests	Passed	Failed	Skipped	Duration
6119	6022	0	97	237s

✅ All 6022 tests passed

97 test(s) skipped — expand for details

Swift MethodExtractor > isTypeDeclaration > recognizes class_declaration
Swift MethodExtractor > isTypeDeclaration > recognizes protocol_declaration
Swift MethodExtractor > isTypeDeclaration > rejects import_declaration
Swift MethodExtractor > visibility > extracts public method
Swift MethodExtractor > visibility > extracts private method
Swift MethodExtractor > visibility > defaults to internal when no modifier
Swift MethodExtractor > protocol methods > marks protocol method as abstract
Swift MethodExtractor > static and class methods > detects static func as isStatic
Swift MethodExtractor > static and class methods > detects class func as isStatic
Swift MethodExtractor > parameters > extracts parameters with types and default values
Swift MethodExtractor > return type > extracts return type from -> annotation
Swift MethodExtractor > annotations > extracts @objc attribute
Swift MethodExtractor > isFinal > detects final func
Swift MethodExtractor > isFinal > is false when not final
Swift MethodExtractor > isAsync > detects async func
Swift MethodExtractor > isOverride > detects override method
buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference uses lookupClassByName
buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift explicit init inference uses lookupClassByName
buildTypeEnv > constructor inference (Tier 1 fallback) > lookupClassByName regression coverage > Swift lookupClassByName regression coverage > Swift cross-file constructor inference does not bind plain functions
buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
Swift self resolution > detects User and Repo classes, each with a save function
Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
Swift parent resolution > emits EXTENDS edge: User → BaseModel
Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
Swift return type inference > detects User class and getUser function
Swift return type inference > detects save function on User (Swift class methods are Function nodes)
Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()
Swift implicit imports (cross-file visibility) > detects UserService class in Models.swift
Swift implicit imports (cross-file visibility) > resolves UserService() constructor call across files (no explicit import)
Swift implicit imports (cross-file visibility) > resolves service.fetchUser() member call across files
Swift implicit imports (cross-file visibility) > creates IMPORTS edges between files in the same module
Swift extension deduplication > detects Product class
Swift extension deduplication > resolves Product() constructor despite extension creating duplicate class node
Swift extension deduplication > resolves product.save() to Product.swift (primary definition)
Swift constructor call fallback (no new keyword) > resolves OCRService() as constructor call across files
Swift constructor call fallback (no new keyword) > resolves ocr.recognize() member call via constructor-inferred type
Swift export visibility (internal vs private) > resolves PublicService() constructor across files
Swift export visibility (internal vs private) > resolves internalHelper() across files (internal = module-scoped)
Swift if let / guard let binding resolution > detects User and Repo classes
Swift if let / guard let binding resolution > resolves user.save() inside if-let to User#save
Swift if let / guard let binding resolution > resolves repo.save() inside guard-let to Repo#save
Swift if let / guard let binding resolution > user.save() in if-let does NOT resolve to Repo#save
Swift await / try expression unwrapping > resolves user.save() via await fetchUser() return type
Swift await / try expression unwrapping > resolves repo.save() via try parseRepo() return type
Swift await / try expression unwrapping > detects fetchUser and parseRepo as functions
Swift for-in loop element type inference > detects User and Repo classes
Swift for-in loop element type inference > creates implicit import edges between files
Swift field-type resolution > detects classes and their properties
Swift field-type resolution > emits HAS_PROPERTY edges from class to field
Swift field-type resolution > resolves field-chain call user.address.save() → Address#save
Swift field-type resolution > emits ACCESSES edges for field reads in chains
Swift field-type resolution > populates field metadata (visibility, declaredType) on Property nodes
Swift call-result binding > resolves call-result-bound method call user.save() → User#save
Swift call-result binding > getUser() is present as a defined function
Swift call-result binding > emits processUser -> getUser CALLS edge for let-assigned free function call
Swift method enrichment > detects Animal protocol and Dog class
Swift method enrichment > emits IMPLEMENTS edge Dog -> Animal
Swift method enrichment > emits HAS_METHOD edges for Dog methods
Swift method enrichment > marks protocol Animal.speak as isAbstract
Swift method enrichment > marks Dog.speak as NOT isAbstract
Swift method enrichment > marks breathe as isFinal
Swift method enrichment > marks classify as isStatic
Swift method enrichment > captures @objc annotation on breathe
Swift method enrichment > populates parameterTypes for classify(_ name: String)
Swift method enrichment > records parameterCount for classify
Swift method enrichment > records returnType for speak
Swift method enrichment > resolves dog.speak() CALLS edge
Swift method enrichment > resolves Dog.classify("dog") CALLS edge
Swift abstract dispatch > detects Repository protocol and SqlRepository class
Swift abstract dispatch > emits IMPLEMENTS edge SqlRepository -> Repository
Swift abstract dispatch > emits HAS_METHOD edges for Repository.find and Repository.save
Swift abstract dispatch > emits HAS_METHOD edges for SqlRepository.find and SqlRepository.save
Swift abstract dispatch > marks base Repository.find as isAbstract
Swift abstract dispatch > marks base Repository.save as isAbstract
Swift abstract dispatch > marks concrete SqlRepository.find as NOT isAbstract
Swift abstract dispatch > resolves repo.find(id: 42) CALLS edge
Swift abstract dispatch > resolves repo.save(entity: user) CALLS edge
Swift abstract dispatch > populates parameterTypes for Repository.find
Swift abstract dispatch > populates parameterTypes for Repository.save
Swift abstract dispatch > records returnType for SqlRepository.find
Swift abstract dispatch > emits METHOD_IMPLEMENTS edges from SqlRepository methods → Repository protocol methods
Swift overloaded method disambiguation > detects 2 distinct find Method nodes on SqlRepository
Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edges for both find overloads
Swift overloaded method disambiguation > emits METHOD_IMPLEMENTS edge for save
Swift overloaded method disambiguation > emits exactly 3 METHOD_IMPLEMENTS edges total
Swift Child extends Parent — inherited method resolution (SM-9) > detects Parent and Child classes
Swift Child extends Parent — inherited method resolution (SM-9) > resolves c.parentMethod() to Parent.parentMethod via first-wins MRO walk

Code Coverage

Tests

Metric	Coverage	Covered	Base	Delta	Status
Statements	73.13%	16519/22588	72.2%	📈 +0.9	🟢 ██████████████░░░░░░
Branches	61.94%	10688/17254	61.29%	📈 +0.6	🟢 ████████████░░░░░░░░
Functions	77.92%	1525/1957	76.99%	📈 +0.9	🟢 ███████████████░░░░░
Lines	75.35%	15009/19919	74.37%	📈 +1.0	🟢 ███████████████░░░░░

_{📋 View full run · Generated by CI}

xkonjin · 2026-03-31T20:11:52Z

Code Review: Cross-Index Impact Groups

Overall: Impressive RFC-to-implementation PR. The design document is thorough and the implementation appears solid. This is a significant feature that addresses real gaps in multi-repo analysis.

Issues

1. Graph schema gaps not addressed in this PR (medium risk)
The RFC acknowledges several schema gaps that block full functionality:

No Route node label exists
No HANDLES_ROUTE or FETCHES relation types
Route data is ephemeral (reduced to CALLS edges with reason)
.proto files not a supported language
impact tool resolves symbols by name with LIMIT 1, not by UID

The PR implements the group infrastructure but these gaps mean the cross-link extraction for HTTP routes, gRPC, and topics will not work as designed. Consider:

Adding prerequisite schema changes first
Or reducing the scope to shared_lib and manifest contract types initially
Documenting which contract types are actually functional

2. Contract ID format could collide (low risk)
The format <type>::<discriminator> uses :: as a delimiter, which appears in:

gRPC package names: grpc::hr.UserService/GetUser
HTTP paths with empty segments (edge case)

Consider using a less common delimiter or proper escaping. URL-encoding the discriminator could prevent collisions.

3. BM25 threshold hardcoded at 0.7 (minor)
The matching cascade tuning in group.yaml has a BM25 threshold of 0.7. This seems arbitrary — was this tuned on real data? Consider:

Making the default data-driven
Or documenting how the threshold was chosen
Providing a calibration tool

4. Embedding vectors stored separately (question)
embeddings.bin is stored alongside contracts.json. Is this:

Generated fresh on each group sync?
Or incrementally updated?
What happens if the embedding model changes version?

The doc mentions no automatic migration — this could leave stale vectors with incompatible dimensions.

5. Test coverage looks good
The 31 new test files covering config parsing, contract extraction, matching, and sync are comprehensive. Nice work.

Minor

The RFC mentions group_impact and group_contracts tools, but I do not see them added to the MCP tools in the diff. Are they implemented?
Consider adding a group validate command to check group.yaml syntax before sync.

Verdict

Approve with suggestions. The design is sound and the implementation is solid. The main concern is the gap between designed capability (all contract types) and actual capability (limited by schema). Documenting this honestly would help users understand what works today vs what is planned.

ivkond · 2026-03-31T20:18:22Z

Thanks for the thorough review! Addressing each point:

1. Graph schema gaps

The RFC "Prerequisites and Current State" section describes the codebase as of the RFC writing date, not the current state. The implementation plan (docs/plans/2026-03-31-cross-index-impact-plan.md) documents the divergences explicitly:

RFC claim	Actual state in codebase
No `Route` node label	`Route` exists in `schema.ts`
No `HANDLES_ROUTE`/`FETCHES`	Present in `VALID_RELATION_TYPES` (13 types)
P2 prerequisite needed (4 relation types)	Already 13 types in runtime filter
`impact` resolves by name with `LIMIT 1`	`impactByUid` added in this PR (UID-based resolution)

The HttpRouteExtractor uses a graph-first strategy — it queries Route/HANDLES_ROUTE/FETCHES nodes first, and falls back to source-scan only when graph data is absent.

gRPC, topic, and shared lib extractors are intentionally out of scope for Demo PR (documented in PR description).

2. Contract ID `::` delimiter

By design. The :: delimiter separates the type prefix (http, grpc, topic, lib) from the discriminator. It does not appear inside real API paths, gRPC service names, or topic names in practice. URL-encoding would complicate readability of contracts.json and debug output without solving a real collision scenario.

3. BM25 threshold / 4. Embedding vectors

Both are out of scope for this Demo PR — the matching cascade is exact-match only. BM25 and embedding matching are designed in the spec but not implemented. The 0.7 default is configurable via group.yaml matching.bm25_threshold and will be tuned when BM25 is added.

Minor: MCP tools

All 6 group tools are in the diff — tools.ts lines 381-477: group_list, group_sync, group_contracts, group_impact, group_query, group_status. The tool count assertion is updated from 11 to 17 in tools.test.ts.

Minor: `group validate`

Good suggestion. parseGroupConfig already validates schema, required fields, link references, and roles — so group validate would be a thin CLI wrapper. Worth adding as a follow-up.

ivkond · 2026-03-31T20:27:30Z

@claude ultra-think while identifying cross-indexing impact analysis gaps. Perfom this analysis on this pull request and verify those findings. Act as a senior compiler font-end engineer and expert in static analysis tools and review this change from that perpective and review the architectural fit. Also check out previous comments and reason with their requests.

abhigyanpatwari · 2026-04-01T05:37:49Z

@ivkond This is a great idea. I had been implementing something like this myself in the enterprise version of gitnexus. Our indexing system allows cross-repo analyses too but will need a bridge ( virtual graph ). Correct approach. But I would also look at intra repo rpc, grpc, kafka, rabbitmq, api endpoints, etc tracking. Once this can be done for intra repo we can move the same logic into building the virtual graph system so multi-repo support is bulletproof.

A good test would be to see if microservice based repos can be indexed well. If that is done cross repo becomes easy

ivkond · 2026-04-01T20:21:45Z

@abhigyanpatwari Thanks for the direction! I've started implementing intra-repo service communication tracking on a separate branch (stacked on this PR):

What's added (ivkond/GitNexus#1):

ServiceBoundaryDetector — auto-detects service boundaries within monorepos via markers (package.json, go.mod, Dockerfile, pom.xml, Cargo.toml, etc.)
GrpcExtractor — parses .proto files + detects Go/Java/Python/TS gRPC servers and clients (RegisterXxxServer, @GrpcService, add_XxxServicer_to_server, @GrpcMethod)
TopicExtractor — detects Kafka, RabbitMQ, NATS producers/consumers across Java/Node/Go/Python (@KafkaListener, producer.send, channel.publish, nc.Subscribe, etc.)
Intra-repo matching — relaxed the same-repo guard so contracts from different services within one repo can match (backward compatible — existing cross-repo matching unchanged)

Proof of concept: Fixture monorepo with 3 services (auth, orders, gateway) connected via gRPC + Kafka + HTTP routes — all cross-links are discovered automatically. 47 new tests, zero regressions (2643 unit + 2016 integration).

The service field on contracts is backward-compatible. Once this PR merges, the same extractors feed directly into the virtual graph for multi-repo support.

Next: test on a real microservice repo, add more framework patterns, then wire into cross-repo virtual graph.

ivkond · 2026-04-01T20:43:04Z

Update: Following your suggestion, I've split the work into two PRs:

[group] Intra-repo service communication tracking #625 (→ main) — Intra-repo service tracking: group infrastructure, extractors (HTTP/gRPC/Kafka/RabbitMQ/NATS), service boundary detection, intra-repo matching. Self-contained, no cross-repo deps.
This PR ([group] Cross-repo impact analysis via repository groups #606) — Will be rebased on top of [group] Intra-repo service communication tracking #625 once merged. Adds: cross-repo impact analysis (group_impact), manifest-declared links, virtual graph.

This way the intra-repo foundation lands first, proving that microservice monorepos index well, before cross-repo is layered on top.

ivkond · 2026-04-01T21:43:46Z

Update: Following your suggestion, I've split the work into two PRs:

[group] Intra-repo service communication tracking #626 (→ main) — Intra-repo service tracking: group infrastructure, extractors (HTTP/gRPC/Kafka/RabbitMQ/NATS), service boundary detection, intra-repo matching. Self-contained, no cross-repo deps.
This PR ([group] Cross-repo impact analysis via repository groups #606) — Will be rebased on top of [group] Intra-repo service communication tracking #626 once merged. Adds: cross-repo impact analysis (group_impact), manifest-declared links, virtual graph.

This way the intra-repo foundation lands first, proving that microservice monorepos index well, before cross-repo is layered on top.

(Supersedes #625 which was closed due to branch rename.)

Cross-repo blast radius analysis through Contract Registry: - group_impact MCP tool and CLI command - ManifestExtractor for explicit cross-repo links - HTTP route extractor for auto-detected contracts - closeLbug() resource cleanup in CLI sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nonical ID Design spec (12 rounds of review): - Bridge.lbug: LadybugDB storage replacing contracts.json - gRPC canonical ID: proto-aware extraction with wildcard matching - Direction-dependent Cypher queries for cross-impact - Write-to-temp-then-rename atomic swap strategy - Backward compatibility via openBridgeOrFallback Implementation plan (10 tasks, TDD): - Bridge DB core, schema, write/read lifecycle - Proto map, source scanner resolution, confidence adjustments - Wildcard matching, sync migration, service layer integration Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vercel · 2026-04-04T21:03:26Z

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

Bridge.lbug — contract storage in LadybugDB: - bridge-schema.ts: Contract, RepoSnapshot, ContractLink DDL - bridge-db.ts: open/close, writeBridge (atomic swap), queryBridge, openBridgeDbReadOnly (with version gate), readBridgeMeta, openBridgeOrFallback - BridgeHandle, BridgeMeta types; LegacyContractRegistry alias gRPC canonical ID — proto-aware extraction: - buildProtoMap: scan .proto files (excludes node_modules/vendor) - resolveProtoConflict: directory proximity disambiguation - serviceContractId: grpc::pkg.Service/* format - All 4 source scanners (Go/Java/Python/TS) resolve via proto map - Confidence adjustments: reduced without proto, boosted with Matching — wildcard support: - buildProviderIndex exported for reuse - runExactMatch skips gRPC /* contracts - runWildcardMatch: bare-name and FQ service matching Consumer migration: - sync.ts: writeBridge + wildcard pass, returns remaining after wildcard - cross-impact.ts: Cypher-based Phase 2 with direction-dependent queries, hint param for empty-UID fan-out, error-object guard - service.ts: openBridgeOrFallback with bridge/json/none branching, groupContracts bridge path returns flat contracts + crossLinks + unmatchedOnly - CLI: status/impact use openBridgeOrFallback, deprecation warnings - MCP tools: descriptions updated (contracts.json → bridge.lbug) - Old storage API removed (writeContractRegistry, readContractRegistry) - Cross-link insertion uses endpoint symbolRef lookup (not link.contractId) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ivkond · 2026-04-04T21:34:12Z

@abhigyanpatwari This PR is ready for re-review. I've addressed both items from your PR #626 review:

1. Contract storage → bridge.lbug

Contracts and cross-links are now stored in LadybugDB (bridge.lbug) as Cypher-queryable graph nodes and edges
group_impact Phase 2 uses direction-dependent Cypher queries instead of JS iteration
Atomic write via temp-file-then-rename with Windows EBUSY retry
Backward compat: automatic contracts.json fallback with deprecation warning

2. gRPC normalization mismatch

buildProtoMap() scans .proto files to resolve ServiceName → package.Service
Go/Java/Python/TS source scanners produce canonical grpc::pkg.Service/* IDs when proto is available
New runWildcardMatch() handles remaining unresolved wildcards as a separate matching pass
Confidence adjustments: reduced without proto (0.65/0.55), unchanged with proto

Tests: 239 in group module (11 new coverage tests), 5410 total in CI. Build: clean.

Note on force-push: Rebased on current main (includes PR #626) and squashed 24 intermediate commits into 3 logical units — original feature, documentation (spec + plan), implementation. The intermediate commits were iterative spec review rounds and per-task TDD commits that added noise to the history without providing bisect value.

ivkond · 2026-04-08T20:53:35Z

@abhigyanpatwari hi! Friendly follow-up on this PR — I’d really appreciate your feedback when you have time, especially on whether this direction and architecture fit your vision.

abhigyanpatwari · 2026-04-08T20:56:33Z

@claude do a deep review of it, especially focused on coverage of cross repo communication protocols. Also see that there are real world test to validate it against real multi repo and microsercice based repos.

github-actions · 2026-04-08T20:56:47Z

Claude finished @abhigyanpatwari's task in 2m 43s —— View job

Deep Review: Cross-Repo Impact Analysis — Protocol Coverage & Real-World Test Validation

Todo list:

Post initial comment
Read core implementation files (cross-impact.ts, grpc-extractor.ts, matching.ts, service.ts)
Read manifest-extractor.ts and bridge-db.ts
Review protocol extractor coverage (HTTP, gRPC, Kafka, RabbitMQ, NATS)
Review test files for real-world coverage
Assess real-world multi-repo / microservice test gaps
Write comprehensive review

Summary

The architecture is fundamentally sound — the hybrid lazy virtual graph via bridge.lbug, direction-aware Cypher queries, and the extraction→matching→bridge pipeline are all well-designed. However, the protocol coverage has significant gaps for real-world polyglot microservice repos, and there are no true end-to-end tests against real multi-repo structures. Below is a detailed breakdown.

1. Protocol Coverage: What Works vs. What's Missing

HTTP (`http-route-extractor.ts`)

Covered providers: Spring @GetMapping/@PostMapping, Express router.get/post, Laravel Route::get, FastAPI @app.get.

Covered consumers: fetch() and axios.get/post in TS/JS.

Critical gaps:

Missing pattern	Impact
Go HTTP handlers (`gin.GET`, `echo.GET`, `mux.HandleFunc`, stdlib `http.HandleFunc`)	Any Go-based API service goes undetected as provider
NestJS `@Controller`/`@Get`/`@Post` decorators	NestJS is the dominant Node.js microservice framework; not covered
Python `requests.get/post` consumer	Python consumer side is entirely absent
Java `RestTemplate.getForEntity`, `WebClient.get()`, `OkHttpClient`	Java inter-service HTTP calls not tracked
Go `http.Get`, `resty.R().Get()` consumer patterns	Go consumer side absent

The fallback consumer detection source scan, lines 409–423 only scans .ts,.tsx,.js,.jsx,.vue,.svelte — Go, Python, Java, PHP consumers are never found without graph data. Fix this →

gRPC (`grpc-extractor.ts`)

Covered: Proto parsing (definitive), Go Register*Server/New*Client, Java @GrpcService/*ImplBase/*Grpc.newStub, Python add_*Servicer_to_server/*Stub, NestJS @GrpcMethod (server only).

Critical gap — TypeScript gRPC consumer is entirely missing:

// These are completely invisible to the extractor:
// grpc-js
const client = new proto.hr.UserService('host', credentials.createInsecure());
// NestJS @GrpcClient (consumer side)
@GrpcClient({ service: 'UserService', url: '...' })
private readonly userClient: ClientGrpc;
// NestJS ClientGrpc
const svc = this.userClient.getService<UserServiceClient>('UserService');

scanTsProviders handles @GrpcMethod only (server side). There is no scanTsConsumers. Given TypeScript/NestJS is a primary gRPC consumer framework in Node.js microservices, this is a significant real-world gap. Fix this →

Other gaps: No C# gRPC (GrpcChannel.ForAddress, new Service.ServiceClient()), no Ruby (Grpc::ClientStub), no Rust tonic patterns. These matter less than TypeScript but are worth documenting.

Proto import resolution not implemented: When a .proto uses import "other.proto" to reference types from another package, the buildProtoMap doesn't follow imports. This means services that split their types across files may get wrong or missing package names.

Topic/Messaging (`topic-extractor.ts`)

Covered well: Java Spring Kafka/RabbitMQ annotations, basic Node.js producer/consumer, Go sarama ConsumePartition, Python Kafka. This is the most complete extractor.

Notable gaps:

Missing	Framework	Impact
`consumer.run({ eachMessage: ... })`	kafkajs (most popular Node Kafka lib)	Node.js Kafka consumers won't be detected as they use `run()` not `subscribe()`
`kafka.NewWriter` / `kafka.NewReader`	segmentio/kafka-go	Second most popular Go Kafka library
`sarama.NewSyncProducer` / `sarama.NewAsyncProducer`	sarama (most popular Go Kafka)	Go Kafka producers missing
`js.Publish` / `js.Subscribe`	NATS JetStream	JetStream is the modern NATS API, largely replacing core NATS
`nats.connect` / `nc.subscribe` (Python)	nats-py	NATS Python consumer absent
Redis Pub/Sub, AWS SQS/SNS, Azure Service Bus	—	Common in enterprise microservices

The kafkajs gap is the most impactful — fix this →

2. Real-World Test Coverage: The Main Gap

The integration test (test/integration/group/group-impact.test.ts) is explicitly noted as "mocks localImpactFn / crossImpactFn. E2E with real graphs is a follow-up." This is the most important gap to close.

What exists today:

Synthetic in-memory registry with hardcoded cross-links
Single test scenario (upstream fan-out via UID match)
Test-monorepo fixture (test/fixtures/group/test-monorepo/) with 3 service stubs — not wired to any integration test

What's needed for real multi-repo / microservice validation:

A. Fixture-based integration test against real code

The test-monorepo fixture exists but isn't exercised end-to-end. A test like:

it('discovers gRPC link from gateway → auth across service boundaries', async () => {
  // 1. Index each service directory with real indexer
  // 2. Run group sync against the fixture
  // 3. Assert bridge.lbug contains expected ContractLink
  // 4. Run group impact on a gateway function
  // 5. Assert auth service appears in cross-repo hits
});

This requires running the real extractor pipeline — but it would catch bugs that the current mocked tests cannot.

B. Known-good open-source repo test

There are popular open-source microservice demos (e.g., Google's Online Boutique, Netflix's Conductor) that have well-documented service boundaries. A test that clones a specific commit and asserts specific cross-links would give confidence that the system works on real code.

C. Confidence degradation test

No test verifies that removing a .proto file degrades confidence from 0.8 to 0.65 for Go providers. No test exercises the wildcard→method match path end-to-end.

3. Correctness Bugs Found

Bug 1: `outOfScope` always empty in Cypher path (`cross-impact.ts:417`)

// runGroupImpact returns:
return {
  ...
  outOfScope: [],  // ← always empty, never populated
  ...
};

The legacy runGroupImpactLegacy populates outOfScope when a matched repo falls outside the subgroup filter. The Cypher-based runGroupImpact uses a WHERE clause to filter, so matches never come back — but the caller has no way to know which repos were filtered. This is a silent information loss that affects UX when --subgroup is used. Fix this →

Bug 2: Manifest cross-links silently dropped for gRPC/lib (`manifest-extractor.ts:92`)

private async resolveSymbol(...): Promise<...> {
  ...
  } else {
    return null;  // ← gRPC and lib types always return null
  }

When null is returned, symbolRef = { filePath: '', name: link.contract }. In bridge-db.ts the cross-link insert query matches nodes by c.filePath = $fp AND c.symbolName = $sn with fp = ''. Since no contract node will have an empty filePath, fromRows or toRows will be empty and the edge is silently skipped (line 209: if (!fromRows.length || !toRows.length) continue). Manifest-declared gRPC cross-links produce contracts but no ContractLink edges in the bridge DB, making group_impact miss them entirely.

Bug 3: `resolveProtoConflict` directory proximity heuristic fails for centralized proto repos

// Matches character-by-character, not path-segment-by-segment
for (let i = 0; i < min; i++) {
  if (sourceDir[i] === protoDir[i]) shared++;
  else break;
}

For a monorepo with protos/ at root and services in services/auth/, services/orders/, the paths share the same repo prefix. A source file at services/auth/server.go and proto at protos/auth/auth.proto will score lower than proto at protos/orders/orders.proto if "protos/auth" lexically matches less of the path than another option. The heuristic should compare normalized path segments, not characters. Fix this →

4. Minor Issues

phase1Timeout ignores user intent for large timeouts (cross-impact.ts:129/312): Math.min(5000, timeout) means users who set timeout: 60000 still only get 5s for Phase 1. For large graphs this might truncate the local impact result. Consider Math.min(timeout * 0.3, 10000) or make it configurable.

Deduplication key in GrpcExtractor uses file path (grpc-extractor.ts:491): ${c.contractId}|${c.role}|${c.symbolRef.filePath}. A file that registers both UserService and UserService (via two different patterns) would only keep one. This is correct but worth a comment.

Topic extractor scans .go but Kafka/NATS patterns use Go-style calls without Go file detection — the glob pattern includes *.go but many Go patterns like ConsumePartition may appear in test files. Consider filtering *_test.go.

5. What's Working Well

The bridge.lbug atomic write (temp→rename with .bak recovery) is solid for concurrent use
Wildcard gRPC matching (runWildcardMatch) correctly handles the case where a consumer doesn't have proto context — this is a real-world pattern
normalizeContractId handles HTTP method case and trailing slashes, gRPC package lowercasing correctly
Graph-first strategy in HttpRouteExtractor is the right approach — when graph data is available, use it; fall back to source scan only when needed
Test naming convention (test_verb_condition_expectation) is clean and consistent throughout

Recommended Next Steps (Prioritized)

High — Add fixture-based end-to-end test using test-monorepo that runs the real extractor pipeline and asserts specific ContractLink edges in bridge.lbug
High — Add TypeScript gRPC consumer detection (@GrpcClient, ClientGrpc, grpc-js new ServiceClient())
High — Fix manifest cross-link silently dropped for gRPC/lib types (Bug 2)
Medium — Add Go HTTP provider patterns (gin, echo, stdlib mux)
Medium — Add kafkajs consumer.run({ eachMessage }) and sarama Go producer patterns
Medium — Fix outOfScope always empty in Cypher path (Bug 1)
Low — Fix directory proximity heuristic to use path segments (Bug 3)
Low — Add NestJS @Controller/@Get provider detection

abhigyanpatwari · 2026-04-08T21:03:00Z

@ivkond can u look into these pls 🙏. Also great work, mostly clean implementation few easy bugs to bash

… and Java

Before this PR lands, I ran a systematic self-review of the cross-impact module and fixed 13 correctness / robustness issues caught during it. All changes target files added by this PR; nothing outside of group/* or the new MCP group_impact tool schema is touched. Everything below is in this single commit. The rest of the PR history is unchanged. Full group suite: all unit and integration test files green when run individually (bridge-db.test.ts hits a pre-existing native LadybugDB worker cleanup segfault that sometimes flakes the reported count — documented in vitest.config.ts, unrelated to these fixes). CI on this commit: 11/11 green across ubuntu/macos/windows + quality + typecheck + e2e. ═══════════════════════════════════════════════════════════════════════ HIGH severity — correctness bugs with regression tests ═══════════════════════════════════════════════════════════════════════ HIGH #1 — manifest-extractor.resolveSymbol was too fuzzy ─────────────────────────────────────────────────────────── resolveSymbol() used CONTAINS on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences in production: * HTTP: manifest "/orders" matched "/suborders", "/myorders/{id}" * gRPC: ANY repo with any .proto file returned a random proto symbol regardless of whether the service/method matched * topic/lib: similar CONTAINS false positives Now all four types use EXACT equality on the relevant name field plus deterministic ORDER BY before LIMIT 1. HTTP paths are canonicalized (ensureSlash + trailing-slash strip) before matching Route.name via a new normalizeRoutePath() helper. gRPC matches by structured service/method name with NO .proto fallback. To avoid cross-impact breakage when resolveSymbol now legitimately returns null (the indexer may not expose a node with the exact method name), unresolved manifest contracts now get a deterministic synthetic symbolUid `manifest::<repo>::<contractId>` instead of an empty string. This gives the bridge Cypher query a stable anchor on both provider and consumer sides. HIGH #2 — sync.ts per-extractor isolation ─────────────────────────────────────────────────────────── syncGroup() wrapped all three extractors (http, grpc, topic) plus service-boundary detection plus meta.json read in ONE try/catch around the whole repo loop. A single extractor throwing would mark the whole repo as `missingRepos` (looking like "repo not found") and prevent the other extractors from running. Now each step is isolated with per-failure reporting: * initLbug failure → repo marked missing, labeled `init` * detectServiceBoundaries failure → empty boundaries, contracts still extracted without service attribution * each extractor failure → recorded with {repo, extractor, message}; other extractors in the repo keep running * manifest extractor failure → recorded but doesn't fail the sync * bridge write partial failures → summarized as `bridge_write` HIGH #3 — bridge-db.writeBridge per-item tolerance ─────────────────────────────────────────────────────────── writeBridge() had no per-item try/catch on CREATE loops. A single malformed contract, snapshot, or link would abort the whole write. The atomic swap protected the OLD bridge but threw away all valid inserts from the current run. Each CREATE is now wrapped individually. writeBridge() returns a new WriteBridgeReport with insert/fail counts plus a capped sampleErrors array. sync.ts surfaces non-zero failure counts via extractorFailures. Dropped cross-links (findContractNode returning null) are counted separately in linksDroppedMissingNode — previously silently continued. HIGH #4 — writeBridge handle lifecycle ─────────────────────────────────────────────────────────── writeBridge() opened the tmp LadybugDB handle before the insert loops but only closed it at the end of the happy path. If ensureBridgeSchema, any queryBridge CREATE, findContractNode, or the contract/snapshot/link loops threw, the handle leaked — which holds the native file lock on `bridge.lbug.tmp`, leaking an FD and preventing the next sync from reusing the tmp slot. Wrapped everything from `ensureBridgeSchema(handle)` to the explicit `closeBridgeDb(handle)` in try/finally with a `handleClosed` sentinel so we never double-close on the happy path and always close on the error path. HIGH #5 — openBridgeDbReadOnly + openBridgeOrFallback leaks ─────────────────────────────────────────────────────────── Two leak paths: * openBridgeDbReadOnly: `new lbug.Database(...)` followed by `new lbug.Connection(db)` was wrapped in a single try/catch, but if Connection construction threw AFTER Database allocation, the Database was leaked (no teardown path for a partial handle). * openBridgeOrFallback (storage.ts): called `readBridgeMeta` AFTER the handle was opened; any error from meta.json parse would bubble up leaving the handle open on the caller. Fixed: * openBridgeDbReadOnly now tracks `db` and `conn` separately and closes each of them on the error path. * openBridgeOrFallback now wraps `readBridgeMeta` in try/catch and explicitly calls closeBridgeDb (with .catch() swallow) before rethrowing. HIGH #6 — groupContracts JSON.parse of contract meta ─────────────────────────────────────────────────────────── GroupService.groupContracts() (bridge path) reconstructed StoredContract from flat rows via `JSON.parse(r.meta)`. A single corrupted meta row (malformed JSON, non-string leftover from a schema migration) would abort the whole listing with an unhandled exception. Replaced with a `safeParseMeta` helper that returns `{}` on parse failure and increments a `metaParseFailures` counter, which is surfaced in the return value when non-zero so callers can log and debug. ═══════════════════════════════════════════════════════════════════════ MEDIUM severity — robustness gaps, input validation, logging ═══════════════════════════════════════════════════════════════════════ MEDIUM #1 — service.groupImpact input range validation + safeLocalImpact ─────────────────────────────────────────────────────────── groupImpact() accepted any finite number for maxDepth, minConfidence, timeout, crossDepth (including negative, zero, or 1e9). MCP is a public surface; unbounded values were a DoS vector. Additionally, `direction: 'upstreem'` (typo) silently became 'upstream'. All five params are now strictly validated with explicit error messages: * direction ∈ {'upstream', 'downstream'} * maxDepth ∈ [1, 10] integer * minConfidence ∈ [0, 1] * timeout ∈ [100, 300000] ms * crossDepth ∈ [0, 10] integer Additionally: the localImpactFn callback called resolveGroupRepo() which throws if the repo isn't in the group config. runPhase1WithTimeout only catches timeouts, not rejections, so a missing-repo error would crash the whole MCP handler. Wrapped in a safeLocalImpact helper that catches and returns an error-shaped stub instead of bubbling past the timeout guard. MEDIUM #2 — Phase-1 timeout reason field ─────────────────────────────────────────────────────────── runGroupImpact()/runGroupImpactLegacy() stub for Phase-1 timeout used impactedCount: 0, risk: 'LOW', byDepth: {} — indistinguishable from a verified empty walk. Consumers treating these as real could miss actual impact. Added: * GroupImpactResult.truncationReason ('phase1_timeout' | 'wall_deadline') * phase1TimedOut: true marker inside the `local` stub * source repoPath pushed into truncatedRepos immediately on Phase-1 timeout (was only done post-Phase-2) MEDIUM #3 — grpc-extractor proto parser ─────────────────────────────────────────────────────────── extractServiceBlocks() used brace-depth counting on the raw .proto content. Braces inside string literals or `// line` / `/* block */` comments were counted as real body braces — a valid proto with `option deprecated_reason = "use NewService { instead"` would have the service body closed early, dropping methods after the offending string. Previously marked as a known TODO but never fixed. Added stripProtoCommentsAndStrings() helper that returns a sanitized copy with offsets and line breaks preserved (comments/strings replaced by spaces). extractServiceBlocks now scans the sanitized copy for service headers and brace depth, but slices the returned body from the original content to keep source text intact. MEDIUM #4 — topic-extractor Kafka Go regex ─────────────────────────────────────────────────────────── Sarama patterns anchored on `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` had two failure modes: * In a loop with multiple ProducerMessage literals, only the FIRST Topic was captured — the engine's lastIndex advanced past the Topic match, and there was no second NewSyncProducer to re-anchor. * Multiple unrelated messages within 300 chars of the constructor would attribute the wrong topic to the wrong producer context. Changed the sarama patterns to match the struct literal directly: `sarama\.ProducerMessage\s*\{[\s\S]{0,200}?Topic:\s*"([^"]+)"`. This captures every ProducerMessage regardless of producer scope. MEDIUM #5 — cross-impact crossDepth clamp deduplication ─────────────────────────────────────────────────────────── `Math.min(1, …)` was duplicated at cross-impact.ts:149 (legacy) and :354 (bridge path) PLUS at service.ts:252. Extracted the clamp into MAX_SUPPORTED_CROSS_DEPTH with a comment pointing at service.ts so the two places don't drift when multi-hop traversal lands. MEDIUM #6 — matching.ts gRPC normalize comment ─────────────────────────────────────────────────────────── No behavior change. The original comment ("MVP: lowercase the whole token; differs from pkg/method split") falsely implied inconsistency between the no-slash and pkg/method branches. Rewritten to document the wire-protocol rationale (pkg/service case-insensitive, RPC method case-sensitive) and to explicitly note that package-only and pkg/method ids are intentionally distinct canonical forms. MEDIUM #7 — manifest-extractor silent catch ─────────────────────────────────────────────────────────── The Cypher executor try/catch dropped errors with `/* fall through */`. A broken graph query in one repo would silently degrade resolveSymbol to the synthetic-uid path with no diagnostic output. Added a `console.warn` with the link type, contract name, repo key, and error message. MEDIUM #8 — writeBridgeMeta uses retryRename ─────────────────────────────────────────────────────────── writeBridgeMeta was using `fsp.rename` directly while writeBridge used retryRename with exponential backoff for EBUSY/EPERM/EACCES. A concurrent Windows reader could cause the meta.json rename to fail transiently while the bridge.lbug swap survived. Pulled the rename through retryRename so the meta write is at least as robust as the bridge write it accompanies. MEDIUM #9 — sync.ts extractor failure labels ─────────────────────────────────────────────────────────── The `ExtractorFailure.extractor` enum labeled initLbug failures as 'boundaries' and bridge-write failures as 'manifest' — both confusing in production logs. Extended the type union with explicit 'init' and 'bridge_write' variants, moved initLbug failures to 'init', moved writeBridge failure summary to 'bridge_write', and documented each label kind on the new ExtractorKind type. MEDIUM #10 — MCP schema bounds for group_impact ─────────────────────────────────────────────────────────── group_impact tool schema declared crossDepth, maxDepth, minConfidence, timeout as bare `type: 'number'` — clients had no way to discover valid ranges without trial-and-error. Added JSON Schema draft-7 `minimum`/`maximum` bounds mirroring the server-side validation in GroupService.groupImpact(). Extended ToolDefinition interface to allow the new keys. MEDIUM #11 — typed GroupImpactResult.crossDepthWarning ─────────────────────────────────────────────────────────── service.ts's groupImpact was mutating the result via `(result as unknown as Record<string, unknown>).crossDepthWarning = …`. This cast defeated the type system. Added `crossDepthWarning?: string` to GroupImpactResult in types.ts and removed the double-unknown cast in both runGroupImpactLegacy and runGroupImpact code paths. MEDIUM #12 — retryRename unit tests ─────────────────────────────────────────────────────────── retryRename had zero test coverage — critical code for Windows CI, tested only implicitly via writeBridge on macOS/Linux (where EBUSY never fires). Exported the function and added four vi.spyOn-based tests: retries on EBUSY, rethrows non-retryable codes (ENOENT), gives up after attempt count, retries on EACCES. ═══════════════════════════════════════════════════════════════════════ Intentionally deferred to follow-up PRs ═══════════════════════════════════════════════════════════════════════ * AbortSignal plumbing for Phase-1 timeout — requires changing the port.impact / impactByUid interface across lbug-adapter and GroupToolPort. Too broad for a correctness fix batch; tracked as follow-up. * Duplication between runGroupImpact and runGroupImpactLegacy (~500 LOC that share ~70% of logic) — genuine maintenance burden but a dedicated refactor PR is the right tool, not this batch. * Low-impact items: CLI double-open of bridge, proto parser performance optimization, loadPackageDefinition scanner false positives, `catch(err: any)` → `unknown` cleanup. All safe to land separately. ═══════════════════════════════════════════════════════════════════════ Test deltas (all via --pool=forks, files run individually) ═══════════════════════════════════════════════════════════════════════ bridge-db.test.ts 15 → 21 (+6: report shape, dropped links, 4× retryRename) sync.test.ts 10 → 11 (+1: initLbug failure path) manifest-extractor.test.ts 5 → 8 (+3: exact match, synthetic uid, regression cases) service.test.ts 25 → 31 (+6: range validation, safeLocalImpact wrap) grpc-extractor.test.ts 41 → 43 (+2: proto strings, comments) topic-extractor.test.ts 29 → 30 (+1: sarama loop regression) matching.test.ts 28 → 28 (comment change only) ────────────────────────────────────── NEW TESTS: +19 GROUP INTEGRATION: 10/10 green (cli, impact, sync, monorepo) Co-authored-by: Claude <noreply@anthropic.com>

ivkond · 2026-04-11T14:40:46Z

@ivkond can u look into these pls 🙏. Also great work, mostly clean implementation few easy bugs to bash

Hi @abhigyanpatwari — heads-up and an offer.

Apologies for the commit noise on this PR. 14 commits from my side,
including two duplicated WIP iterations (restore out-of-scope reporting…
and expand grpc extraction… each landed twice during earlier review
rounds). I should have been squashing before each re-request for review.

After the last round I also ran a systematic self-review of the
cross-impact module and landed a batch of correctness / robustness
fixes in a single commit d15b8cb fix(group): self-review before merge.
The commit body has a severity-ordered breakdown with per-finding
rationale and test coverage — easy to skim or cherry-pick. Two larger
items are intentionally deferred to separate follow-up PRs:
AbortSignal plumbing for the Phase-1 timeout path, and the
runGroupImpact / runGroupImpactLegacy de-duplication.

Two options — whichever saves your review time

Option A — merge as-is with Squash & Merge. main ends up with
one clean commit regardless; the messy interim history stays only in
the PR.

Option B — I close this and open a fresh PR with a clean 3–4
commit history:

feat(group): infrastructure, contract matching, service boundary detection
feat(group): bridge.lbug storage + gRPC canonical IDs
feat(group): http / grpc / topic extractor coverage across languages
fix(group): self-review — correctness + robustness

Same content, same CI result, reviewer-friendly diffs per logical step.
The review thread here is lost but main stays clean. I can have it up
within an hour if that's easier for you.

No preference from my side — whichever works best for your review
cadence. Let me know.

magyargergo · 2026-04-11T14:47:02Z

@ivkond Thank you for your contribution! You've made an amazing impact with you changes. We would like to ask you to split this up, thus we can iterate on it much confidentially.

ivkond · 2026-04-11T15:15:11Z

@ivkond Thank you for your contribution! You've made an amazing impact with you changes. We would like to ask you to split this up, thus we can iterate on it much confidentially.

@magyargergo Thanks — happy to do this. Here's what I'd propose; let me
know if the shape works before I start moving commits around.

Proposed split — 4 stacked PRs

Linear dependency chain, each PR has one clear responsibility. Every
self-review fix from d15b8cb travels with the code it patches, so
there's no orphan "fix" PR at the end.

feat(group): bridge.lbug storage + contract matching foundation
(~1.5k LOC + ~1.2k LOC tests)
Files: bridge-db, matching, normalization, types, storage,
config-parser, bridge-schema, contract-extractor interface.
Pure infra, no user-facing surface yet.
Base: main.
feat(group): service boundary detection + per-language contract extractors
(~2.5k LOC + ~2.5k LOC tests)
Files: service-boundary-detector plus http-route-extractor,
grpc-extractor, topic-extractor, manifest-extractor. Mostly
regex/parser logic across Go, Java, Python, Node, TS.
Base: PR 1 (for types only).
feat(group): sync pipeline (~500 LOC + ~700 LOC tests)
Files: sync.ts wiring extractors → matching → bridge write.
Deliberately small and standalone so the pipeline wiring is easy
to review in isolation from the algorithm.
Base: PR 2.
feat(group): cross-repo impact analysis + CLI + MCP tool
(~1.5k LOC + ~1.5k LOC tests)
Files: cross-impact.ts, GroupService (groupImpact +
groupContracts + groupQuery), cli/group.ts, mcp/tools.ts
(the group_impact tool entry), and the RFC doc
2026-03-31-cross-index-impact-design.md. The user-visible feature.
Base: PR 3.

PR 1 and PR 2 have no code-level coupling (only a type import) and
can be reviewed in parallel if you'd like.

Alternatives I considered

3 PRs — merge 1+2 into a single "foundation" PR and 3+4 into
a single "feature" PR. Fewer review rounds, but each PR ends up
5k–8k LOC which largely defeats the point of splitting.
5–7 PRs — further split the extractors per language or per
broker. Adds a lot of ceremony without meaningfully reducing
per-PR review load, and manifest-extractor imports types from the
other extractors which would create awkward cross-dependencies.

Thanks for the quick review turnaround.

magyargergo · 2026-04-11T15:17:25Z

Please proceed one by one. I can see they have linear deps on them? Is there a way you could split them up, so they can be done parallel?

ivkond · 2026-04-11T15:39:36Z

Good question. The only thing making PR 1 → PR 2 linear in my first
proposal is a type import; everything else is genuinely independent. I
can fix that by pulling a tiny "types + interface" PR ahead of them.
Here's the revised stack:

Revised split — 5 PRs, with PR 1 and PR 2 truly parallel

PR 0 — types.ts + contract-extractor.ts interface. Pure
shared type definitions that everyone downstream imports. No logic,
no user-facing surface. ~230 LOC.
Base: main. Should be a trivial review.
PR 1 — bridge-db, bridge-schema, matching, normalization,
storage, config-parser. Bridge storage infra + contract matching
- group config parser. ~1.5k LOC + ~1.2k LOC tests.
  Base: PR 0. Parallel with PR 2.
PR 2 — service-boundary-detector + 4 contract extractors
(HTTP / gRPC / topic / manifest). ~2.5k LOC + ~2.5k LOC tests.
Base: PR 0. Parallel with PR 1.
PR 3 — sync.ts pipeline wiring extractors → matching → bridge
write. ~500 LOC + ~700 LOC tests.
Base: PR 1 + PR 2 (needs both merged).
PR 4 — cross-impact.ts, GroupService, cli/group.ts,
mcp/tools.ts (group_impact), RFC doc
2026-03-31-cross-index-impact-design.md. The user-visible feature.
~1.5k LOC + ~1.5k LOC tests.
Base: PR 3.

So it's still "one at a time" in terms of review context-switching,
except for the PR 1 / PR 2 pair which you can take in whichever order. Each
PR has a clear, non-overlapping scope and none blocks on anything
other than its direct dependency.

I'll start on PR 0 as soon as you confirm the shape.

magyargergo · 2026-04-11T15:40:30Z

Nice! This sounds like a good plan! Please create tickets for them so we can track this progress

ivkond · 2026-04-11T16:32:11Z

Done! Tickets for the split:

[group] Shared types and ContractExtractor interface (split 1/5 of #606) #790 — shared types + ContractExtractor interface (bootstrap)
[group] Bridge storage and contract matching foundation (split 2/5 of #606) #791 — bridge storage + contract matching foundation
[group] Service boundary detection and language contract extractors (split 3/5 of #606) #792 — service boundary detection + language contract extractors
[group] Sync pipeline (split 4/5 of #606) #793 — sync pipeline
[group] Cross-repo impact analysis, CLI and MCP tool (split 5/5 of #606) #794 — cross-repo impact analysis + CLI + MCP tool

Each issue has full what / why / verify / risk sections per CONTRIBUTING.md, plus a scope-discipline note per GUARDRAILS.md confirming no drive-by refactors and no CI/release/security changes.

Dependency chain: #790 first, then #791 and #792 in parallel, then #793, then #794.

#606 will stay open until #794 merges, at which point it will be closed with backlinks to all 5.

Starting on #790 now.

magyargergo · 2026-04-11T16:52:33Z

Thank you for doing it! Let me know when you are ready with the first PR and I'll review it ASAP!

@magyargergo

…#606 split) (#795) * feat(group): bridge.lbug storage + contract matching expansion Part 1 of 4 in the split of #606 (ticket: #791, closes #790 with a revised plan per @magyargergo's request). ## What changed Adds the LadybugDB-backed bridge storage infrastructure and extends the contract matching algorithm with wildcard support. All changes are additive: storage.ts, sync.ts, service.ts, cli/group.ts, mcp/tools.ts are left on their upstream main versions and will migrate to the new bridge in follow-up PRs (#792, #793, #794). ### Files **New (844 LOC prod):** - `gitnexus/src/core/group/bridge-db.ts` — atomic write-to-temp with `retryRename` for Windows EBUSY/EPERM, per-item write tolerance via `WriteBridgeReport`, `findContractNode` with three-tier symbol lookup (uid → filePath+name → filePath) - `gitnexus/src/core/group/bridge-schema.ts` — schema DDL - `gitnexus/src/core/group/normalization.ts` — contract ID canonicalization + `dedupeContracts` / `dedupeCrossLinks` helpers used by both matching and bridge write **Modified (+137 LOC prod):** - `gitnexus/src/core/group/matching.ts` — adds `runWildcardMatch` for `grpc::Service/*` wildcard consumers, `buildProviderIndex` helper, and canonical gRPC ID handling in `normalizeContractId` - `gitnexus/src/core/group/types.ts` — `MatchType` gains `'wildcard'`; new `BridgeHandle` and `BridgeMeta` interfaces **New tests (658 LOC):** - `gitnexus/test/unit/group/bridge-db.test.ts` — core write/read round trip, `WriteBridgeReport` shape, dropped-links counter, retryRename behavior on EBUSY/ENOENT/EPERM/EACCES - `gitnexus/test/unit/group/bridge-db-edge.test.ts` — edge cases (malformed meta, missing contract nodes, concurrent access) **Modified tests (+225 LOC):** - `gitnexus/test/unit/group/matching.test.ts` — wildcard consumer matching, gRPC canonical ID handling, same-service guard ### Self-review fixes folded in Carried forward from the original #606 self-review: - `writeBridge` try/finally handle lifecycle + `handleClosed` sentinel - `openBridgeDbReadOnly` partial-handle cleanup - `writeBridgeMeta` uses `retryRename` for Windows consistency - `retryRename` unit tests (was zero coverage) - Per-item try/catch around every CREATE loop so one malformed contract doesn't abort the whole write - Dropped cross-link counter (`linksDroppedMissingNode`) ### Why now magyargergo asked for the #606 PR to be split so we can iterate with confidence (#606 (comment)). This is the foundational layer — pure infra, no user-facing surface, no callers of the new APIs in this PR. Later PRs wire it in. ### How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/bridge-db.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/bridge-db-edge.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/matching.test.ts --pool=forks` - Pre-commit hook runs clean ### Risk / rollback **Low.** All new code sits under `src/core/group/` in new files plus a minimal `+16/-1` diff to `types.ts` and a `+136/-0` diff to `matching.ts` (both purely additive). No existing callers reference the new APIs (bridge-db, openBridgeOrFallback, runWildcardMatch) — the PRs that wire them in come later in the split chain. Rollback = `git revert` of the merge commit; no state introduced, no schema migration triggered. ### Scope discipline (per GUARDRAILS.md) - Only the 8 files listed above are touched; no drive-by refactors - No CI/release/security config changes - No secrets, tokens, or machine-specific paths - Content is lifted from the #606 branch which already passed CI 11/11 green on `d15b8cb` (before the split) ### Dependencies - **Base:** `main` (no dependencies on other split PRs) - **Blocks:** extractor expansion (#792), sync pipeline (#793), cross-impact feature (#794) - **Related ticket:** #791 Co-authored-by: Claude <noreply@anthropic.com> * fix(group): address @claude review on #795 Addresses the findings from the automated review on PR #795 (#795 (comment) — posted by @magyargergo / claude-code Action run). ### Medium severity (reviewer flagged as blockers) - **bridge-db.ts `openBridgeDbReadOnly` bak recovery** — the `.bak` recovery path used bare `fsp.rename(bakPath, dbPath)`, which is exactly the scenario most likely to hit Windows EBUSY/EPERM (an interrupted writer still holding the handle for a few ms). Switched to `retryRename` for consistency with the rest of the file's Windows-safe rename path. - **bridge-db.ts `ensureBridgeSchema` error detection** — the inline `msg.includes('already exists')` substring match has been lifted into a named constant `LBUG_ALREADY_EXISTS_MSG` with a comment documenting the coupling to LadybugDB's error message wording and why we can't use `IF NOT EXISTS` (LadybugDB DDL doesn't support it) or typed errors (LadybugDB's JS driver doesn't expose error codes). Also tightened the `catch (err: any)` to `catch (err: unknown)`. - **bridge-db.ts `findContractNode` — extracted out of writeBridge** — the 35-line async closure living inside `writeBridge` has been lifted to three module-level functions: `createContractLookupIndex`, `indexContract`, and `findContractNode`. `findContractNode` is now a pure synchronous function taking a prebuilt index instead of doing its own DB queries. The `writeBridge` cross-link loop is now ~25 lines instead of ~100. - **bridge-db.ts `findContractNode` — N+1 query elimination** — the old inner-closure version issued up to 6 DB round-trips per cross-link (2 endpoints × up to 3 tiers of fallback queries). For a group with 1000 cross-links, that's up to 6000 DB queries just to resolve endpoints. The new version consults an in-memory `ContractLookupIndex` built incrementally as contracts are inserted (`indexContract` called AFTER each successful insert so failed inserts don't poison the index). Cross-link resolution is now O(1) per link instead of O(3) DB queries per link, with zero DB round-trips during the cross-link loop. ### Minor severity - **bridge-db.ts `queryBridge` empty-array guard** — if LadybugDB ever returns an empty `QueryResult[]` at the top level (shouldn't happen with single-statement calls, but driver contract isn't explicit), the old code would call `.getAll()` on `undefined` and crash with a confusing stack. Added an `unwrapQueryResult` helper that throws an explicit `'empty QueryResult array'` error instead, making a potential driver regression visible immediately. - **normalization.ts `contractRichness` weights** — added a block-level comment documenting the weight ordering (+3 for symbolUid, +2 for each symbol-identifying field, +1 for service tag or non-manifest origin) and explicitly noting that the absolute numbers don't matter, only the relative ordering. Matches the "comment for contributors" suggestion in the review. - **bridge-schema.ts `BRIDGE_SCHEMA_VERSION` migration comment** — added a 4-point contract explaining what bumping the constant means ("discard and re-sync" strategy for V1, no in-place migration yet, new migration logic should live in a separate `bridge-migrations.ts` module when it becomes necessary). - **test/unit/group/fixtures.ts** — extracted the `makeContract` helper previously copy-pasted between `bridge-db.test.ts` and `bridge-db-edge.test.ts` into a shared fixtures module. Both test files now import from `./fixtures.js`. Kept the scope minimal: fixtures is NOT a general-purpose factory module, just the shared baseline contract builder. ### New tests Added 9 pure-function unit tests for the now-extracted `findContractNode` in `bridge-db.test.ts`: - returns null on empty index - tier 1 (symbolUid) match, including repo-scope and role-scope isolation - tier 2 (filePath + symbolName) fallback when symbolUid is empty or mismatches - tier 3 (filePath only) when exactly one contract lives in the file, and refusal when multiple do - priority ordering when multiple tiers could resolve These are fully isolated — no DB, no temp directories, no native LadybugDB binding — so they run in <10ms total and are immediately trustworthy as a regression safety net. ### Deliberately deferred (reviewer marked as "fine for now") - `BridgeHandle._db` / `._conn` typing to `unknown` with casts in `bridge-db.ts` — reviewer's note: "The typing is fine for now." - Batch inserts via `UNWIND` — needs LadybugDB support confirmation, tracked as a follow-up; the per-item pattern remains. - `queryBridge` prepared-statement lifecycle — the current pattern (prepare → execute → GC) relies on LadybugDB's internals, worth verifying against their docs in a separate audit. ### Scope discipline (per `GUARDRAILS.md`) - Only files touched by this PR (`bridge-db.ts`, `bridge-schema.ts`, `normalization.ts`, both bridge test files, new `fixtures.ts`) — no drive-by refactors - No CI/release/security config changes - No secrets ### Test + typecheck status - `npx tsc --noEmit` clean - `bridge-db.test.ts`: added 9 `findContractNode` tests, all pass in isolation. The full-file run still hits the pre-existing native LadybugDB cleanup segfault that flakes the reported count — same as every prior commit on this branch, not a regression. - `bridge-db-edge.test.ts`: 4/4 pass - `matching.test.ts`: 28/28 pass - `types.test.ts`: 5/5 pass - `retryRename` tests (4/4) and `findContractNode` tests (9/9) verified in isolation via `-t` filter Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>

@magyargergo

…lit) (#796) * feat(group): extractor expansion + manifest extractor Part 2 of 4 in the split of #606 (ticket: #792). Follows #795 (bridge.lbug storage foundation, already merged), but this PR has no code-level dependency on #795 — it only imports types and the ContractExtractor interface that existed on upstream main before either PR. It could have been reviewed in parallel with #795. ## What changed Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups. ### New file (228 LOC) - `gitnexus/src/core/group/extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml`-declared cross-links. HTTP paths are canonicalized before Route.name matching; gRPC is resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::<repo>::<contractId>` uid when the graph has no matching symbol, so cross-impact traversal still has a stable anchor for the contract. ### Modified extractors (+958 LOC prod) - `extractors/grpc-extractor.ts` (+522) — `.proto` parser with comment and string-literal sanitization (braces inside strings no longer truncate service bodies); package/service/method canonical IDs; server/client detection across Go (`grpc.NewServer`, `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`, `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`). - `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI decorators, Java Spring `@RequestMapping`/`@GetMapping`, restTemplate / WebClient / OkHttp consumers. - `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection (replaces a constructor-anchored regex that missed topics inside producer loops), kafka-go Writer/Reader, Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream helpers. ### Modified and new tests (+1264 LOC) - `grpc-extractor.test.ts` (+539) — full coverage of the new proto parser (strings-with-braces regression, comments-with-braces regression), per-language server/client detection - `http-route-extractor.test.ts` (+240) — per-framework route extraction + normalization edge cases - `topic-extractor.test.ts` (+177) — the sarama in-loop regression, JetStream, Python NATS, kafka-go Writer/Reader - `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization, gRPC exact lookup with proto-fallback regression, lib and topic exact matching, synthetic-uid fallback behavior ### Self-review fixes folded in Carried forward from the #606 self-review (commit `d15b8cb`): - **HIGH #1** — `manifest-extractor.resolveSymbol` was too fuzzy. Previously used `CONTAINS` on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences: `/orders` matched `/suborders`, and any repo with any `.proto` file returned a random proto symbol for a gRPC manifest entry. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback for unresolved manifests. Regression tests included. - **MED #3** — gRPC proto parser brace-depth counting now sanitizes strings and comments first (`stripProtoCommentsAndStrings`). A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal, silently dropping methods after the offending string. Regression tests for both string-with-brace and comment-with-brace cases. - **MED #4** — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal directly). Regression test with a for-loop that constructs multiple `ProducerMessage`s. - **MED #7** — `manifest-extractor.resolveSymbol` no longer has a silent `catch { /* fall through */ }`. Errors from the graph executor are logged via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path. ## Why Reviewer focus here is pure regex / parser correctness — no storage, no Cypher queries, no algorithmic changes to the cross-link algorithm. Separating this from the bridge foundation PR (#795) meant reviewers could stay in a single mental mode (parsing logic) instead of context-switching between DDL, Cypher, and regex. ## How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks` Local pre-push: typecheck clean, all 99 extractor unit tests pass (grpc 43, http 18, topic 30, manifest 8). ## Risk / rollback **Low.** Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` that is consumed by `sync.ts` in the next split (#793). No existing behavior changes for users who don't run a `group sync`. Rollback = `git revert` of the merge commit; the modifications to `grpc-extractor.ts` / `http-route-extractor.ts` / `topic-extractor.ts` revert to the pre-PR versions that still work (they're subsets of the new functionality). ## Scope discipline (per GUARDRAILS.md) - Only the 8 files above are touched; no drive-by refactors - No CI/release/security config changes - No secrets or machine-specific paths - Content lifted from #606 (CI 11/11 green on `d15b8cb`) ## Dependencies - **Base:** `main` (upstream already includes #795 as `1ff324c`) - **Blocks:** sync pipeline (#793) and the cross-impact feature (#794) - **Tracker issue:** #792 - **Parent PR:** #606 Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate topic-extractor from regex to tree-sitter queries Addresses @magyargergo's feedback on #796 that regex-based lookups should use tree-sitter nodes instead, and that the top-level extractors must NOT carry language dependencies. This is phase 1 of a multi-step migration — topic-extractor first because its patterns are the most uniform (16 "call/annotation with first-arg string literal" variants), which makes it a clean proof of the approach before grpc-extractor and http-route-extractor get the same treatment. ## Architecture: language-agnostic orchestrator + per-language plugins The top-level extractor is a thin orchestrator that never imports a tree-sitter grammar or a query string. Per-language knowledge lives in a new `topic-patterns/` folder with one file per language plus a registry that maps file extensions to compiled plugins: ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic scanning utilities ├── topic-extractor.ts # thin orchestrator (no grammar imports) └── topic-patterns/ ├── types.ts # TopicMeta, Broker ├── index.ts # registry: extension → compiled provider ├── java.ts # tree-sitter-java + JAVA_TOPIC_PROVIDER ├── go.ts # tree-sitter-go + GO_TOPIC_PROVIDER ├── python.ts # tree-sitter-python + PYTHON_TOPIC_PROVIDER └── node.ts # tree-sitter-javascript + tree-sitter-typescript # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER ``` **Shared scanner (`tree-sitter-scanner.ts`)** — defines `PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>` and the `scanFile(parser, plugin, content)` helper. Plugins compile their queries eagerly at module load via `compilePatterns()`, so a broken pattern fails loudly at import time instead of silently at scan time. `unquoteLiteral()` handles single/double/template quotes, Python triple-quoted strings, and Go raw backtick strings. **Per-language plugins** own: - the tree-sitter grammar import (this is the ONLY place in `src/core/group/` where tree-sitter grammars are imported), - the query S-expressions, - the `TopicMeta` payload (role, broker, confidence, symbolName) that the orchestrator receives back on every match. Each plugin uses a `@value` capture name to bind the topic literal node. The JavaScript and TypeScript grammars share AST node names for every construct we query, so `node.ts` defines the pattern sources once and compiles them against `JavaScript`, `TypeScript.typescript`, and `TypeScript.tsx` — exporting three providers because `Parser.Query` objects are NOT portable across grammar instances. **Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider, `.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX. Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single file-level edit (drop `topic-patterns/<lang>.ts`, import + register it here — zero edits required in `topic-extractor.ts`). **Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser, provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`. Reuses one `Parser` instance across files; the scanner calls `setLanguage` per plugin. ## Why this is better than regex 1. **Comments and strings are respected for free.** The old regex would match `// kafkaTemplate.send("fake.topic")` as a real producer; tree-sitter never visits comments or string literals as code nodes, so false positives from commented-out code are eliminated. 2. **Struct/object literal patterns are structural, not textual.** `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char lookahead (which was a known cross-match bug partly mitigated by a loop regression test in the self-review). The new query matches a specific `composite_literal` with a specific `qualified_type` and `keyed_element` — exactly one struct literal per match. 3. **No order-of-operations fragility.** Regex for `channel.publish` vs `channel.consume` was independent and file-wide; the AST scopes matches to the specific `call_expression`. 4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic detection later means dropping one file in `topic-patterns/` — no changes to shared scanner or orchestrator, and no tree-sitter imports leak into top-level code. ## Per-file fault tolerance - Malformed files that tree-sitter can't parse are silently skipped (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline already logs unparseable files at index time. - A syntactically invalid query is caught at `compilePatterns` time, not scan time — broken plugins fail loudly at import. - Per-pattern `matches()` failures are swallowed so one broken query in a plugin doesn't block the rest. ## Tests All 30 existing `topic-extractor.test.ts` tests pass **without any changes to the test file** — they were written as input/output contract tests (given this source file, expect these `ExtractedContract` objects) and that contract is unchanged. Regression coverage includes: - Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node `producer.send` + `consumer.subscribe`; Go sarama producer/consumer (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` + `producer.send/produce` - RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`; Node `channel.consume/publish/sendToQueue`; Python `basic_consume/ basic_publish` with keyword args - NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream `js.Subscribe/Publish`; Python `await nc.subscribe/publish` Including the regression test for the sarama `ProducerMessage` in-loop case — the AST-based query captures every literal in the file independently, not just the first one after `NewSyncProducer`. ## Neighbor regression check - `topic-extractor.test.ts` — 30/30 pass (rewritten extractor) - `http-route-extractor.test.ts` — 18/18 pass (untouched) - `grpc-extractor.test.ts` — 43/43 pass (untouched) - `manifest-extractor.test.ts` — 8/8 pass (untouched) - Full `npx tsc --noEmit` clean ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched; no changes to other extractors, tests, MCP surface, or pipeline.ts. - No CI/release/security config changes, no secrets. - New tree-sitter imports all reference grammars that are already installed as dependencies (`tree-sitter`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`, `tree-sitter-go` — all in `package.json` for the existing pipeline). ## Phase 2 / phase 3 plan - **Phase 2 (next commit):** rewrite `http-route-extractor.ts` Strategy B (regex fallback) on the same plugin pattern. Graph-assisted Strategy A stays as-is (already uses pipeline-built tree-sitter data via `HANDLES_ROUTE` Cypher queries). - **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java / Go / Python / TypeScript detection. `.proto` files are the one outstanding question — there is no `tree-sitter-proto` grammar installed; the in-tree string-sanitizing parser stays as a pragmatic exception with a comment, alternative being to add `tree-sitter-proto` as a dep (open for the maintainer). Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins Phase 2 of the extractor refactor requested by @magyargergo on #796. Same architecture as the phase 1 topic-extractor rewrite: a thin, language-agnostic orchestrator plus per-language plugins that own tree-sitter grammars and query sources. The top-level extractor file no longer imports any tree-sitter grammar or query string. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic primitives ├── http-route-extractor.ts # thin orchestrator (no grammar imports) └── http-patterns/ ├── types.ts # HttpDetection, HttpLanguagePlugin, HttpRole ├── index.ts # registry: ext → plugin + HTTP_SCAN_GLOB ├── java.ts # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp ├── go.ts # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers ├── python.ts # tree-sitter-python: FastAPI + requests ├── php.ts # tree-sitter-php: Laravel Route::get/... └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # NestJS controllers, Express, fetch, axios ``` **Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1: - `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named capture the query binds, not just a single `@value`). Topic extractor updated to read `match.captures.value` accordingly. - New `runCompiledPatterns(plugin, tree)` helper lets plugins run multiple query bundles against the same pre-parsed tree. This is needed for HTTP plugins that combine a class-prefix query with a method-route query (Spring, NestJS). - `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`. **HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a `scan(tree)` function rather than a flat pattern list. This reflects HTTP's more complex extraction: each detection needs method + path + handler name, and framework patterns like Spring `@RequestMapping` / NestJS `@Controller` require cross-referencing a class-level prefix with method-level annotations. Plugins internally use `compilePatterns` + `runCompiledPatterns` and walk the AST to resolve the class/method relationships. **Per-framework coverage:** - **Java (`java.ts`)** - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put| Delete|Patch)Mapping("/sub")` method routes, joined via the enclosing `class_declaration` node id. - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` → method derived from API name. - `WebClient.method(HttpMethod.X, "/path")` → method from `HttpMethod.X` capture. - `new Request.Builder().url("/path")` → OkHttp consumer. - **Go (`go.ts`)** - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures upper-case verb + handler identifier. - `net/http.HandleFunc("/path", handler)` → provider (default GET). - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`, resty `client.R().Get/Post/...`. - **Python (`python.ts`)** - `@app.get("/path")` FastAPI decorators. - `requests.get/post/...` and `requests.request("METHOD", "url")`. - **PHP (`php.ts`)** - Laravel `Route::get/post/.../patch('/path', ...)` via `scoped_call_expression`. Uses `PHP.php_only` to match the existing ingestion pipeline's grammar selection. - **Node (`node.ts`) — JS + TS + TSX** - Pattern sources defined once, compiled against three grammar variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`) because `Parser.Query` objects are not portable across grammars. Exports three plugins sharing the same `scan` logic. - NestJS: `@Controller('prefix')` decorators are siblings of the class in `export_statement` / `program`; `@Get(':id')` decorators are siblings of the method in `class_body`. The plugin walks decorator → next named sibling to find the decorated class / method, then combines the class prefix with the method path. Only emits NestJS detections when the enclosing class has a real `@Controller` decorator — prevents false positives from generic classes that happen to use `@Get` from another library. - Express: `(router|app).<verb>('/path', ...)`. - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })` (uses two queries + a SyntaxNode-id dedupe set so URL literals aren't double-emitted by the options variant). - `axios.get/post/...`. ## Orchestrator changes `http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers` regex method and replaces them with a single source-scan loop that delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator still owns: - **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`) — language-agnostic string processing shared by both strategies. - **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS` Cypher queries) — unchanged in spirit. The only regex helpers it used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now replaced by a lookup against the plugin's detections for the same file: for each route row, find the detection whose normalized path matches, and pull the HTTP method + handler name from it. - **Per-file parse cache** — the orchestrator parses each relevant file at most once per `extract()` call. Both the graph-assisted enrichment loop and the source-scan fallback share the same `cachedDetections` map, so we never run the plugin twice for the same file. ## Why this is better than the regex version 1. **Comments and strings for free.** The old regex would match `// router.get('/fake')` as a real Express route; tree-sitter never visits string/comment nodes. 2. **Structural controller-prefix.** Spring and NestJS class-prefix joining is now scoped to the enclosing class via `class_declaration` node ids, eliminating file-wide state that broke when a file had multiple controllers. 3. **Precise NestJS disambiguation.** The plugin only emits a NestJS detection when the enclosing class has a real `@Controller` decorator — the old regex would fire on any `@Get(...)` in the file regardless of surrounding context. 4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP detection later means dropping one file in `http-patterns/` — no changes to the shared scanner, the orchestrator, or the Strategy A Cypher queries. ## Tests - `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged; they're contract-style input/output tests and the contract shape is unchanged). Covers Spring class prefix, Express, gin/echo, stdlib HandleFunc, NestJS, Laravel, FastAPI for providers and fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/ resty for consumers, plus graph-first Strategy A for both. - `topic-extractor.test.ts` — **30/30 pass** after the `captures.value` API migration. - `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3). - `manifest-extractor.test.ts` — 8/8 pass (untouched). - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No changes to pipeline.ts, MCP surface, ingestion, or tests. - No CI / release / security / secrets changes. - Tree-sitter grammars imported by plugins (`tree-sitter-java`, `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already in `package.json` for the existing ingestion pipeline. ## Phase 3 plan - **grpc-extractor** gets the same treatment: plugin-per-language under `grpc-patterns/` for Java / Go / Python / TS detection. `.proto` files remain an open question — no `tree-sitter-proto` grammar is installed, so the in-tree string-sanitizing parser from PR #796's self-review stays as a pragmatic exception unless the maintainer wants us to add `tree-sitter-proto` as a new dep. Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate grpc-extractor source scans to tree-sitter plugins Phase 3 (final) of the extractor refactor requested by @magyargergo on #796. Same architecture as phase 1 (topic) and phase 2 (http): thin language-agnostic orchestrator + per-language plugins that own tree-sitter grammars and query sources. With this commit the top-level extractors under `src/core/group/extractors/` import ZERO tree-sitter grammars and ZERO query strings — every grammar import lives in a `*-patterns/<lang>.ts` plugin file, and the orchestrators go through the registry indirection. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared primitives (unchanged) ├── grpc-extractor.ts # orchestrator (only `.proto` parser left) └── grpc-patterns/ ├── types.ts # GrpcDetection, GrpcLanguagePlugin, GrpcRole ├── index.ts # registry: ext → plugin + GRPC_SCAN_GLOB ├── go.ts # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient ├── java.ts # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub ├── python.ts # tree-sitter-python: add_XxxServicer_to_server + XxxStub └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # @GrpcMethod, @GrpcClient field type, # .getService<X>('Svc'), new XxxServiceClient, # loadPackageDefinition dynamic constructors ``` ## Per-language coverage **Go (`go.ts`)** - Provider: `\w+.RegisterXxxServer(...)` via `call_expression → selector_expression → field_identifier` + JS regex filter `^Register(\w+)Server$`. - Provider: `pb.UnimplementedXxxServer` embedded in a struct via `struct_type → field_declaration_list → field_declaration → qualified_type → type_identifier` + JS filter. - Consumer: `\w+.NewXxxClient(...)` via the same call_expression query + JS filter `^New(\w+)Client$`. **Java (`java.ts`)** - Provider: `class X extends YyyGrpc.YyyImplBase` — two queries handle the scoped and plain forms. `scoped_type_identifier`'s children are positional (no `scope:`/`name:` fields), so the query matches the two `type_identifier` children by position. - `#match? @inner "ImplBase$"` restricts matches at query time. - Whether the class has `@GrpcService` or not controls only the `source` metadata label — the plugin walks the class_declaration's `modifiers` child in JS to detect the marker_annotation. - Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a `method_invocation` query with `#match? @method "^new(Blocking)?Stub$"`, service name extracted via `^(\w+)Grpc$` on the object identifier. **Python (`python.ts`)** - Single call-expression query covers both bare identifier and `obj.method` attribute forms: `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`. - Plugin filters `@fn.text` against two JS regexes: `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$` (consumer), with a reserved-names ignore list for the Stub case (Mock / Test / Fake / Stub). **Node — JavaScript + TypeScript + TSX (`node.ts`)** - Pattern sources defined once, compiled three times (one per grammar) because `Parser.Query` objects are not portable across grammars. Exports three `GrpcLanguagePlugin`s sharing the same `scan`. - `@GrpcMethod('Service', 'Method')`: decorator query captures the two string literals. Confidence is hard-coded 0.8 regardless of proto map resolution (matches the original regex version's behaviour). - `@GrpcClient(...) field: XxxServiceClient`: decorator query captures the decorator node, plugin walks up to find the enclosing `public_field_definition` (decorators on fields are CHILDREN of the field definition in tree-sitter-typescript, not siblings) and reads its first `type_annotation → type_identifier`, then runs the `^(\w+Service)Client$` JS filter. - `client.getService<X>('AuthService')`: call-expression query on `member_expression.property = "getService"` + string literal arg. - `new XxxServiceClient(...)`: `new_expression` with a bare identifier constructor, filtered by `^(\w+Service)Client$` so generic `new AuthClient(...)` (missing the `Service` infix) does NOT falsely register as a consumer. Preserves the regression test `test_extract_ts_non_service_client_constructor_is_ignored`. - `loadPackageDefinition` dynamic loader: gated on `tree.rootNode.text.includes('loadPackageDefinition')`. When set, `new foo.bar.Xxx(...)` qualified constructors with a capitalised property name register as consumers. ## Orchestrator changes `grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders` / ... helper and replaces them with a single source-scan loop that: 1. Parses each file with the plugin's grammar (one shared `Parser` instance across all files, `setLanguage` called per plugin). 2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`. 3. Converts each detection to an `ExtractedContract` via the private `detectionToContract` helper, which: - Looks the short service name up in the proto map (filled by the `.proto` parser). - Picks confidence = `confidenceWithProto` if resolved, else `confidenceWithoutProto`. - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when the detection carries a `methodName` (TS `@GrpcMethod` only), otherwise a service-level id (`grpc::pkg.Svc/*`). Everything else — the `.proto` parser, `buildProtoContext`, `buildProtoMap`, `resolveProtoConflict`, `serviceContractId`, `stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe function — stays exactly as before. The `.proto` parser is kept as a pragmatic exception to the "no regex in extractors" rule because no `tree-sitter-proto` grammar is installed in the repo; a comment at the top of the file explains this and flags the maintainer option of adding `tree-sitter-proto` as a dependency. ## Why this is better than the regex version 1. **Comments and strings are respected for free.** Matched node types are only code constructs, never text inside comments or string literals. 2. **No false positives on partial names.** The old `(\w+?)Grpc`-style regexes would cross-match unrelated identifiers; structural queries restrict matches to the exact AST shape (`scoped_type_identifier → type_identifier` pairs, `method_invocation → identifier` etc.). 3. **NestJS `@GrpcClient` is structural, not regex-based.** The old regex required a specific textual layout (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the plugin now walks the AST, so modifier order / optional modifiers / multi-line formatting don't break it. 4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC detection later is a one-file edit in `grpc-patterns/index.ts` — no touches to the shared scanner, the orchestrator, or the proto parser. ## Tests - `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the contract shape is identical). Covers .proto parsing (including the brace-inside-string regression), Go provider/consumer, Java @GrpcService / plain ImplBase provider + newBlockingStub consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient + .getService + new XxxServiceClient + loadPackageDefinition + the `AuthClient` vs `AuthServiceClient` discrimination, dedupe across multiple patterns in one file, proto-aware confidence, and the inherited-package resolution for split proto definitions. - `topic-extractor.test.ts` — 30/30 pass. - `http-route-extractor.test.ts` — 18/18 pass. - `manifest-extractor.test.ts` — 8/8 pass. - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No pipeline.ts, MCP surface, ingestion, CI / release / security, or test changes. - New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`, `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already installed for the ingestion pipeline. ## End of phase series This commit completes the three-phase extractor refactor: - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/` - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/` - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/` Every remaining regex-based extractor helper under the `src/core/group/ extractors/` directory is either (a) language-agnostic string processing (path normalization, dedupe keys) or (b) the `.proto` parser, which is documented as an explicit exception. Co-authored-by: Claude <noreply@anthropic.com> * feat(group): add tree-sitter-proto for .proto file parsing Addresses @magyargergo's suggestion on #796 to replace the manual string-sanitizing .proto parser with a tree-sitter grammar. - **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`. Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto) (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli 0.24` to produce ABI version 14 — compatible with the project's `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as `optionalDependency` with `file:./vendor/tree-sitter-proto`. - **New `grpc-patterns/proto.ts` plugin** — uses the same `compilePatterns` + `runCompiledPatterns` infrastructure as every other plugin. Two queries: - `(package (full_ident) @pkg)` — package declaration - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))` — one match per (service, rpc) pair - **Graceful fallback** — `tree-sitter-proto` is an optional dependency. If it fails to install (platform incompatibility) or fails the runtime smoke-test (`setLanguage` + `parse` on a trivial proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator uses the existing manual parser. The smoke-test catches the `SyntaxNode` TDZ error that occurs in vitest's fork-based test runner. - **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto` files are handled by the plugin loop (they're included in `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is skipped. `buildProtoContext` still runs to build the proto map for cross-referencing source-file detections. 1. **No manual comment/string stripping.** The old parser needed `stripProtoCommentsAndStrings` (110 lines) to avoid counting braces inside comments and string literals. tree-sitter handles this natively. 2. **No brace-depth tracking.** `extractServiceBlocks` used a manual depth counter to find service boundaries. tree-sitter's AST gives us `service` → `service_name` + `rpc` → `rpc_name` directly. 3. **Performance.** tree-sitter's C-based parser is faster than character-by-character JS scanning + regex on large proto files. - `grpc-extractor.test.ts` — **43/43 pass** (unchanged) - All other extractor tests — 99/99 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * chore: add .gitignore for vendored tree-sitter-proto build artifacts https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * fix: correct .gitignore paths for vendored tree-sitter-proto Patterns should be relative to the .gitignore file's directory. https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * refactor(group): address Copilot review feedback on #796 Six fixes suggested by the Copilot AI review: 1. **`normalizeHttpPath` root-path edge case** — stripping trailing slashes on the input `/` produced an empty string, yielding malformed contract ids like `http::GET::`. Now preserves `/` for the root handler/fetch case. 2. **Dedupe `scanFiles` call** — `extract()` was globbing the source-scan file list twice (once for the provider fallback, once for the consumer fallback). Moved to a single lazy call that memoizes the result for the rest of the method. 3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other extractor's glob already ignored vendored sources; the HTTP one didn't. Fixed for consistency. 4. **`loadPackageDefinition` check is now structural** — was calling `tree.rootNode.text.includes('loadPackageDefinition')` which forces materialization of the entire file text from the parse tree (expensive on large files). Replaced with a dedicated compiled query on `(call_expression function: [(identifier) | (member_expression)])` so the check stays in the AST domain. 5. **`grpc-extractor.ts` header docstring updated** — still claimed ".proto parsing is not tree-sitter-based because no grammar is installed". Now describes the actual behaviour: tree-sitter when `tree-sitter-proto` is available (optionalDependency), manual fallback otherwise. 6. **Eliminated the double proto file parse on the fallback path** — `buildProtoContext` already globs + parses every `.proto` file to build `servicesByName`. On the `!hasProtoPlugin` branch the extractor was globbing + parsing again via the now-removed `parseProtoFile` helper. The fallback branch now iterates the map that `buildProtoContext` already produced to emit provider contracts directly — single pass per proto file. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): address Claude review feedback (bugs + dedup + hygiene) on #796 Follows up `2f28bfc` with the remaining items from the Claude AI review: ## Bugs **Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.** The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x` with no label filter, so a topic/service/package name could silently match any node type (File, Variable, Import, Folder, …). Added label filters: - `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort symbol-name matches against listener/publisher symbols) - `grpc` method → `(n:Function|Method)` - `grpc` service → `(n:Class|Interface)` - `lib` → `(n:Package|Module)` All 8 manifest-extractor tests still pass (mock executor is label-agnostic, but the production LadybugDB graph now gets correctly scoped queries). **Bug 8 — Tautological `!handlerName` condition.** `http-route-extractor.ts:extractProvidersGraph` had `let handlerName = null; if (!method || !handlerName) { ... }` — the `!handlerName` clause was always true since there was no intervening assignment. Simplified to always run the plugin-scan lookup (we need the handler name even when `methodFromRouteReason` already resolved the method). ## Clean code / dedup **Design 7 — `readSafe` was copy-pasted in all three orchestrators.** Extracted to `extractors/fs-utils.ts` as the single source of truth for the path-traversal guard. Dropped the three local copies and the now-unused `fs`/`path` imports from topic-extractor. **Style 10 — Language-specific `_test.go` skip in the topic orchestrator.** Was `if (rel.endsWith('_test.go')) continue;` inside the language- agnostic extraction loop. Pushed into the glob's ignore list (`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`, `dist`, `build` entries, with a comment explaining that other languages' test file conventions either live in separate directories (Python `tests/`, Java `src/test/`) or are already covered by the existing ignores. ## Already addressed in `2f28bfc` (mentioned again in Claude review) - Bug 3: `normalizeHttpPath('/')` returns `''` — fixed - Bug 4: double glob + double parse of `.proto` — fixed - Bug 5: `scanFiles` called twice in HTTP — fixed - Bug 6: missing `**/vendor/**` in HTTP glob — fixed - Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')` replaced with a dedicated structural query ## Deferred - Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope; sync.ts matching logic lands in #793, manifest extractor already emits correct synthetic uids for unresolved HTTP contracts. - Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) — the only real use case (`loadPackageDefinition` gate) is already fixed via a structural query, so the interface change would be cosmetic churn without a concrete consumer. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * docs+fix(group): address remaining Claude review items + add pipeline flow chart ## Fixes **Remaining 🔴 — HTTP contract id wildcard format.** Documented the `http::*::<path>` format as an intentional wildcard for manifest links that omit the HTTP method, alongside the explicit-method form (`GET::/path` → `http::GET::/path`). The docblock on `buildContractId` now states both forms, notes that wildcard-aware matching is the responsibility of the sync / cross-impact layer (#793), and recommends the explicit-method form whenever the author knows the method (it round-trips through exact equality without needing wildcard logic downstream). Tests unchanged — the wildcard format is what they've always asserted. **Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The comment claimed "creates a contract with an empty symbolUid/ref" but the code switched to `manifestSymbolUid(repo, contractId)` a few commits back. Updated to describe the actual synthetic-uid fallback semantics and the cross-impact path that relies on both sides of the join deriving the same uid. **Minor 2 — exhaustiveness guard on `buildContractId`.** The `switch(type)` covered all five current `ContractType` variants but silently returned `undefined` if a new variant was added. Added a `default: const _exhaustive: never = type; throw new Error(...)` clause so the build fails loudly on an unhandled variant. **Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already fixed in `2f28bfc` via a dedicated structural query (`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed. ## New: pipeline flow chart (per @magyargergo's request) Added `src/core/group/PIPELINE.md` with four Mermaid diagrams: 1. **High-level overview** — `group.yaml` → extractors + manifest → contract matching → `bridge.lbug` → `runGroupImpact`. 2. **Per-repo extractor two-strategy shape** — graph-assisted Strategy A vs. source-scan Strategy B. 3. **Plugin architecture** — orchestrator → registry → per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` → `ExtractedContract`. 4. **Manifest extraction** — label-scoped `resolveSymbol` with the synthetic-uid fallback. 5. **Cross-impact query (#606)** — local impact → bridge join → cross-repo fan-out. Each diagram is annotated with which PRs own which stage (this PR: extractors + manifest; #795: bridge storage; #606: cross-impact runtime) and points at the concrete files/functions involved. ## Tests - 99/99 extractor tests pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>

ivkond · 2026-04-20T15:54:30Z

@abhigyanpatwari @magyargergo — closing this one.

The split plan agreed in #issuecomment-4229689301 is complete. Everything from this PR landed piecemeal via 4 follow-up PRs against the 5 tracking issues:

#	Issue	Merged via	Scope
1	#790	#795	Shared `types.ts` + `ContractExtractor` interface (rolled into PR 2)
2	#791	#795	`bridge.lbug` storage + contract matching foundation
3	#792	#796	Service boundary detection + HTTP/gRPC/topic/manifest extractors
4	#793	#817	Sync pipeline + extractor contract-resolution fixes
5	#794	#984	`group_impact` / cross-repo traversal + `@<group>` MCP routing + group resources

Net effect on main matches the original intent of this PR, minus the items explicitly deferred along the way — tracked separately:

AbortSignal plumbing for the Phase-1 timeout path.
runGroupImpact / runGroupImpactLegacy unification.
NITs from the feat: cross-repo impact analysis (#794) — @repo MCP routing + group resources #984 second-pass review (crossDepth tool description, groupStatus Promise.all, ensuredPoolFTS invalidation on closeLbug).

Happy to open follow-up issues for any of the above if useful.

Thanks for the patience with the split and for the reviews along the way 🙏 — closing without merge.

magyargergo · 2026-04-20T15:59:46Z

@ivkond thank you for this amazing work! You made a big impact on GitNexus! We appreciate your hard work. We have open positions for maintainers if you are interested.

Please join to our community on Discord.

@magyargergo

…abhigyanpatwari#606 split) (abhigyanpatwari#795) * feat(group): bridge.lbug storage + contract matching expansion Part 1 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#791, closes abhigyanpatwari#790 with a revised plan per @magyargergo's request). ## What changed Adds the LadybugDB-backed bridge storage infrastructure and extends the contract matching algorithm with wildcard support. All changes are additive: storage.ts, sync.ts, service.ts, cli/group.ts, mcp/tools.ts are left on their upstream main versions and will migrate to the new bridge in follow-up PRs (abhigyanpatwari#792, abhigyanpatwari#793, abhigyanpatwari#794). ### Files **New (844 LOC prod):** - `gitnexus/src/core/group/bridge-db.ts` — atomic write-to-temp with `retryRename` for Windows EBUSY/EPERM, per-item write tolerance via `WriteBridgeReport`, `findContractNode` with three-tier symbol lookup (uid → filePath+name → filePath) - `gitnexus/src/core/group/bridge-schema.ts` — schema DDL - `gitnexus/src/core/group/normalization.ts` — contract ID canonicalization + `dedupeContracts` / `dedupeCrossLinks` helpers used by both matching and bridge write **Modified (+137 LOC prod):** - `gitnexus/src/core/group/matching.ts` — adds `runWildcardMatch` for `grpc::Service/*` wildcard consumers, `buildProviderIndex` helper, and canonical gRPC ID handling in `normalizeContractId` - `gitnexus/src/core/group/types.ts` — `MatchType` gains `'wildcard'`; new `BridgeHandle` and `BridgeMeta` interfaces **New tests (658 LOC):** - `gitnexus/test/unit/group/bridge-db.test.ts` — core write/read round trip, `WriteBridgeReport` shape, dropped-links counter, retryRename behavior on EBUSY/ENOENT/EPERM/EACCES - `gitnexus/test/unit/group/bridge-db-edge.test.ts` — edge cases (malformed meta, missing contract nodes, concurrent access) **Modified tests (+225 LOC):** - `gitnexus/test/unit/group/matching.test.ts` — wildcard consumer matching, gRPC canonical ID handling, same-service guard ### Self-review fixes folded in Carried forward from the original abhigyanpatwari#606 self-review: - `writeBridge` try/finally handle lifecycle + `handleClosed` sentinel - `openBridgeDbReadOnly` partial-handle cleanup - `writeBridgeMeta` uses `retryRename` for Windows consistency - `retryRename` unit tests (was zero coverage) - Per-item try/catch around every CREATE loop so one malformed contract doesn't abort the whole write - Dropped cross-link counter (`linksDroppedMissingNode`) ### Why now magyargergo asked for the abhigyanpatwari#606 PR to be split so we can iterate with confidence (abhigyanpatwari#606 (comment)). This is the foundational layer — pure infra, no user-facing surface, no callers of the new APIs in this PR. Later PRs wire it in. ### How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/bridge-db.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/bridge-db-edge.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/matching.test.ts --pool=forks` - Pre-commit hook runs clean ### Risk / rollback **Low.** All new code sits under `src/core/group/` in new files plus a minimal `+16/-1` diff to `types.ts` and a `+136/-0` diff to `matching.ts` (both purely additive). No existing callers reference the new APIs (bridge-db, openBridgeOrFallback, runWildcardMatch) — the PRs that wire them in come later in the split chain. Rollback = `git revert` of the merge commit; no state introduced, no schema migration triggered. ### Scope discipline (per GUARDRAILS.md) - Only the 8 files listed above are touched; no drive-by refactors - No CI/release/security config changes - No secrets, tokens, or machine-specific paths - Content is lifted from the abhigyanpatwari#606 branch which already passed CI 11/11 green on `d15b8cb` (before the split) ### Dependencies - **Base:** `main` (no dependencies on other split PRs) - **Blocks:** extractor expansion (abhigyanpatwari#792), sync pipeline (abhigyanpatwari#793), cross-impact feature (abhigyanpatwari#794) - **Related ticket:** abhigyanpatwari#791 Co-authored-by: Claude <noreply@anthropic.com> * fix(group): address @claude review on abhigyanpatwari#795 Addresses the findings from the automated review on PR abhigyanpatwari#795 (abhigyanpatwari#795 (comment) — posted by @magyargergo / claude-code Action run). ### Medium severity (reviewer flagged as blockers) - **bridge-db.ts `openBridgeDbReadOnly` bak recovery** — the `.bak` recovery path used bare `fsp.rename(bakPath, dbPath)`, which is exactly the scenario most likely to hit Windows EBUSY/EPERM (an interrupted writer still holding the handle for a few ms). Switched to `retryRename` for consistency with the rest of the file's Windows-safe rename path. - **bridge-db.ts `ensureBridgeSchema` error detection** — the inline `msg.includes('already exists')` substring match has been lifted into a named constant `LBUG_ALREADY_EXISTS_MSG` with a comment documenting the coupling to LadybugDB's error message wording and why we can't use `IF NOT EXISTS` (LadybugDB DDL doesn't support it) or typed errors (LadybugDB's JS driver doesn't expose error codes). Also tightened the `catch (err: any)` to `catch (err: unknown)`. - **bridge-db.ts `findContractNode` — extracted out of writeBridge** — the 35-line async closure living inside `writeBridge` has been lifted to three module-level functions: `createContractLookupIndex`, `indexContract`, and `findContractNode`. `findContractNode` is now a pure synchronous function taking a prebuilt index instead of doing its own DB queries. The `writeBridge` cross-link loop is now ~25 lines instead of ~100. - **bridge-db.ts `findContractNode` — N+1 query elimination** — the old inner-closure version issued up to 6 DB round-trips per cross-link (2 endpoints × up to 3 tiers of fallback queries). For a group with 1000 cross-links, that's up to 6000 DB queries just to resolve endpoints. The new version consults an in-memory `ContractLookupIndex` built incrementally as contracts are inserted (`indexContract` called AFTER each successful insert so failed inserts don't poison the index). Cross-link resolution is now O(1) per link instead of O(3) DB queries per link, with zero DB round-trips during the cross-link loop. ### Minor severity - **bridge-db.ts `queryBridge` empty-array guard** — if LadybugDB ever returns an empty `QueryResult[]` at the top level (shouldn't happen with single-statement calls, but driver contract isn't explicit), the old code would call `.getAll()` on `undefined` and crash with a confusing stack. Added an `unwrapQueryResult` helper that throws an explicit `'empty QueryResult array'` error instead, making a potential driver regression visible immediately. - **normalization.ts `contractRichness` weights** — added a block-level comment documenting the weight ordering (+3 for symbolUid, +2 for each symbol-identifying field, +1 for service tag or non-manifest origin) and explicitly noting that the absolute numbers don't matter, only the relative ordering. Matches the "comment for contributors" suggestion in the review. - **bridge-schema.ts `BRIDGE_SCHEMA_VERSION` migration comment** — added a 4-point contract explaining what bumping the constant means ("discard and re-sync" strategy for V1, no in-place migration yet, new migration logic should live in a separate `bridge-migrations.ts` module when it becomes necessary). - **test/unit/group/fixtures.ts** — extracted the `makeContract` helper previously copy-pasted between `bridge-db.test.ts` and `bridge-db-edge.test.ts` into a shared fixtures module. Both test files now import from `./fixtures.js`. Kept the scope minimal: fixtures is NOT a general-purpose factory module, just the shared baseline contract builder. ### New tests Added 9 pure-function unit tests for the now-extracted `findContractNode` in `bridge-db.test.ts`: - returns null on empty index - tier 1 (symbolUid) match, including repo-scope and role-scope isolation - tier 2 (filePath + symbolName) fallback when symbolUid is empty or mismatches - tier 3 (filePath only) when exactly one contract lives in the file, and refusal when multiple do - priority ordering when multiple tiers could resolve These are fully isolated — no DB, no temp directories, no native LadybugDB binding — so they run in <10ms total and are immediately trustworthy as a regression safety net. ### Deliberately deferred (reviewer marked as "fine for now") - `BridgeHandle._db` / `._conn` typing to `unknown` with casts in `bridge-db.ts` — reviewer's note: "The typing is fine for now." - Batch inserts via `UNWIND` — needs LadybugDB support confirmation, tracked as a follow-up; the per-item pattern remains. - `queryBridge` prepared-statement lifecycle — the current pattern (prepare → execute → GC) relies on LadybugDB's internals, worth verifying against their docs in a separate audit. ### Scope discipline (per `GUARDRAILS.md`) - Only files touched by this PR (`bridge-db.ts`, `bridge-schema.ts`, `normalization.ts`, both bridge test files, new `fixtures.ts`) — no drive-by refactors - No CI/release/security config changes - No secrets ### Test + typecheck status - `npx tsc --noEmit` clean - `bridge-db.test.ts`: added 9 `findContractNode` tests, all pass in isolation. The full-file run still hits the pre-existing native LadybugDB cleanup segfault that flakes the reported count — same as every prior commit on this branch, not a regression. - `bridge-db-edge.test.ts`: 4/4 pass - `matching.test.ts`: 28/28 pass - `types.test.ts`: 5/5 pass - `retryRename` tests (4/4) and `findContractNode` tests (9/9) verified in isolation via `-t` filter Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>

@magyargergo

…npatwari#606 split) (abhigyanpatwari#796) * feat(group): extractor expansion + manifest extractor Part 2 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#792). Follows abhigyanpatwari#795 (bridge.lbug storage foundation, already merged), but this PR has no code-level dependency on abhigyanpatwari#795 — it only imports types and the ContractExtractor interface that existed on upstream main before either PR. It could have been reviewed in parallel with abhigyanpatwari#795. ## What changed Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups. ### New file (228 LOC) - `gitnexus/src/core/group/extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml`-declared cross-links. HTTP paths are canonicalized before Route.name matching; gRPC is resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::<repo>::<contractId>` uid when the graph has no matching symbol, so cross-impact traversal still has a stable anchor for the contract. ### Modified extractors (+958 LOC prod) - `extractors/grpc-extractor.ts` (+522) — `.proto` parser with comment and string-literal sanitization (braces inside strings no longer truncate service bodies); package/service/method canonical IDs; server/client detection across Go (`grpc.NewServer`, `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`, `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`). - `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI decorators, Java Spring `@RequestMapping`/`@GetMapping`, restTemplate / WebClient / OkHttp consumers. - `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection (replaces a constructor-anchored regex that missed topics inside producer loops), kafka-go Writer/Reader, Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream helpers. ### Modified and new tests (+1264 LOC) - `grpc-extractor.test.ts` (+539) — full coverage of the new proto parser (strings-with-braces regression, comments-with-braces regression), per-language server/client detection - `http-route-extractor.test.ts` (+240) — per-framework route extraction + normalization edge cases - `topic-extractor.test.ts` (+177) — the sarama in-loop regression, JetStream, Python NATS, kafka-go Writer/Reader - `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization, gRPC exact lookup with proto-fallback regression, lib and topic exact matching, synthetic-uid fallback behavior ### Self-review fixes folded in Carried forward from the abhigyanpatwari#606 self-review (commit `d15b8cb`): - **HIGH abhigyanpatwari#1** — `manifest-extractor.resolveSymbol` was too fuzzy. Previously used `CONTAINS` on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences: `/orders` matched `/suborders`, and any repo with any `.proto` file returned a random proto symbol for a gRPC manifest entry. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback for unresolved manifests. Regression tests included. - **MED abhigyanpatwari#3** — gRPC proto parser brace-depth counting now sanitizes strings and comments first (`stripProtoCommentsAndStrings`). A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal, silently dropping methods after the offending string. Regression tests for both string-with-brace and comment-with-brace cases. - **MED abhigyanpatwari#4** — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal directly). Regression test with a for-loop that constructs multiple `ProducerMessage`s. - **MED abhigyanpatwari#7** — `manifest-extractor.resolveSymbol` no longer has a silent `catch { /* fall through */ }`. Errors from the graph executor are logged via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path. ## Why Reviewer focus here is pure regex / parser correctness — no storage, no Cypher queries, no algorithmic changes to the cross-link algorithm. Separating this from the bridge foundation PR (abhigyanpatwari#795) meant reviewers could stay in a single mental mode (parsing logic) instead of context-switching between DDL, Cypher, and regex. ## How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks` Local pre-push: typecheck clean, all 99 extractor unit tests pass (grpc 43, http 18, topic 30, manifest 8). ## Risk / rollback **Low.** Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` that is consumed by `sync.ts` in the next split (abhigyanpatwari#793). No existing behavior changes for users who don't run a `group sync`. Rollback = `git revert` of the merge commit; the modifications to `grpc-extractor.ts` / `http-route-extractor.ts` / `topic-extractor.ts` revert to the pre-PR versions that still work (they're subsets of the new functionality). ## Scope discipline (per GUARDRAILS.md) - Only the 8 files above are touched; no drive-by refactors - No CI/release/security config changes - No secrets or machine-specific paths - Content lifted from abhigyanpatwari#606 (CI 11/11 green on `d15b8cb`) ## Dependencies - **Base:** `main` (upstream already includes abhigyanpatwari#795 as `f6fb87f`) - **Blocks:** sync pipeline (abhigyanpatwari#793) and the cross-impact feature (abhigyanpatwari#794) - **Tracker issue:** abhigyanpatwari#792 - **Parent PR:** abhigyanpatwari#606 Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate topic-extractor from regex to tree-sitter queries Addresses @magyargergo's feedback on abhigyanpatwari#796 that regex-based lookups should use tree-sitter nodes instead, and that the top-level extractors must NOT carry language dependencies. This is phase 1 of a multi-step migration — topic-extractor first because its patterns are the most uniform (16 "call/annotation with first-arg string literal" variants), which makes it a clean proof of the approach before grpc-extractor and http-route-extractor get the same treatment. ## Architecture: language-agnostic orchestrator + per-language plugins The top-level extractor is a thin orchestrator that never imports a tree-sitter grammar or a query string. Per-language knowledge lives in a new `topic-patterns/` folder with one file per language plus a registry that maps file extensions to compiled plugins: ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic scanning utilities ├── topic-extractor.ts # thin orchestrator (no grammar imports) └── topic-patterns/ ├── types.ts # TopicMeta, Broker ├── index.ts # registry: extension → compiled provider ├── java.ts # tree-sitter-java + JAVA_TOPIC_PROVIDER ├── go.ts # tree-sitter-go + GO_TOPIC_PROVIDER ├── python.ts # tree-sitter-python + PYTHON_TOPIC_PROVIDER └── node.ts # tree-sitter-javascript + tree-sitter-typescript # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER ``` **Shared scanner (`tree-sitter-scanner.ts`)** — defines `PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>` and the `scanFile(parser, plugin, content)` helper. Plugins compile their queries eagerly at module load via `compilePatterns()`, so a broken pattern fails loudly at import time instead of silently at scan time. `unquoteLiteral()` handles single/double/template quotes, Python triple-quoted strings, and Go raw backtick strings. **Per-language plugins** own: - the tree-sitter grammar import (this is the ONLY place in `src/core/group/` where tree-sitter grammars are imported), - the query S-expressions, - the `TopicMeta` payload (role, broker, confidence, symbolName) that the orchestrator receives back on every match. Each plugin uses a `@value` capture name to bind the topic literal node. The JavaScript and TypeScript grammars share AST node names for every construct we query, so `node.ts` defines the pattern sources once and compiles them against `JavaScript`, `TypeScript.typescript`, and `TypeScript.tsx` — exporting three providers because `Parser.Query` objects are NOT portable across grammar instances. **Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider, `.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX. Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single file-level edit (drop `topic-patterns/<lang>.ts`, import + register it here — zero edits required in `topic-extractor.ts`). **Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser, provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`. Reuses one `Parser` instance across files; the scanner calls `setLanguage` per plugin. ## Why this is better than regex 1. **Comments and strings are respected for free.** The old regex would match `// kafkaTemplate.send("fake.topic")` as a real producer; tree-sitter never visits comments or string literals as code nodes, so false positives from commented-out code are eliminated. 2. **Struct/object literal patterns are structural, not textual.** `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char lookahead (which was a known cross-match bug partly mitigated by a loop regression test in the self-review). The new query matches a specific `composite_literal` with a specific `qualified_type` and `keyed_element` — exactly one struct literal per match. 3. **No order-of-operations fragility.** Regex for `channel.publish` vs `channel.consume` was independent and file-wide; the AST scopes matches to the specific `call_expression`. 4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic detection later means dropping one file in `topic-patterns/` — no changes to shared scanner or orchestrator, and no tree-sitter imports leak into top-level code. ## Per-file fault tolerance - Malformed files that tree-sitter can't parse are silently skipped (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline already logs unparseable files at index time. - A syntactically invalid query is caught at `compilePatterns` time, not scan time — broken plugins fail loudly at import. - Per-pattern `matches()` failures are swallowed so one broken query in a plugin doesn't block the rest. ## Tests All 30 existing `topic-extractor.test.ts` tests pass **without any changes to the test file** — they were written as input/output contract tests (given this source file, expect these `ExtractedContract` objects) and that contract is unchanged. Regression coverage includes: - Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node `producer.send` + `consumer.subscribe`; Go sarama producer/consumer (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` + `producer.send/produce` - RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`; Node `channel.consume/publish/sendToQueue`; Python `basic_consume/ basic_publish` with keyword args - NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream `js.Subscribe/Publish`; Python `await nc.subscribe/publish` Including the regression test for the sarama `ProducerMessage` in-loop case — the AST-based query captures every literal in the file independently, not just the first one after `NewSyncProducer`. ## Neighbor regression check - `topic-extractor.test.ts` — 30/30 pass (rewritten extractor) - `http-route-extractor.test.ts` — 18/18 pass (untouched) - `grpc-extractor.test.ts` — 43/43 pass (untouched) - `manifest-extractor.test.ts` — 8/8 pass (untouched) - Full `npx tsc --noEmit` clean ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched; no changes to other extractors, tests, MCP surface, or pipeline.ts. - No CI/release/security config changes, no secrets. - New tree-sitter imports all reference grammars that are already installed as dependencies (`tree-sitter`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`, `tree-sitter-go` — all in `package.json` for the existing pipeline). ## Phase 2 / phase 3 plan - **Phase 2 (next commit):** rewrite `http-route-extractor.ts` Strategy B (regex fallback) on the same plugin pattern. Graph-assisted Strategy A stays as-is (already uses pipeline-built tree-sitter data via `HANDLES_ROUTE` Cypher queries). - **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java / Go / Python / TypeScript detection. `.proto` files are the one outstanding question — there is no `tree-sitter-proto` grammar installed; the in-tree string-sanitizing parser stays as a pragmatic exception with a comment, alternative being to add `tree-sitter-proto` as a dep (open for the maintainer). Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins Phase 2 of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as the phase 1 topic-extractor rewrite: a thin, language-agnostic orchestrator plus per-language plugins that own tree-sitter grammars and query sources. The top-level extractor file no longer imports any tree-sitter grammar or query string. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic primitives ├── http-route-extractor.ts # thin orchestrator (no grammar imports) └── http-patterns/ ├── types.ts # HttpDetection, HttpLanguagePlugin, HttpRole ├── index.ts # registry: ext → plugin + HTTP_SCAN_GLOB ├── java.ts # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp ├── go.ts # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers ├── python.ts # tree-sitter-python: FastAPI + requests ├── php.ts # tree-sitter-php: Laravel Route::get/... └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # NestJS controllers, Express, fetch, axios ``` **Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1: - `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named capture the query binds, not just a single `@value`). Topic extractor updated to read `match.captures.value` accordingly. - New `runCompiledPatterns(plugin, tree)` helper lets plugins run multiple query bundles against the same pre-parsed tree. This is needed for HTTP plugins that combine a class-prefix query with a method-route query (Spring, NestJS). - `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`. **HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a `scan(tree)` function rather than a flat pattern list. This reflects HTTP's more complex extraction: each detection needs method + path + handler name, and framework patterns like Spring `@RequestMapping` / NestJS `@Controller` require cross-referencing a class-level prefix with method-level annotations. Plugins internally use `compilePatterns` + `runCompiledPatterns` and walk the AST to resolve the class/method relationships. **Per-framework coverage:** - **Java (`java.ts`)** - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put| Delete|Patch)Mapping("/sub")` method routes, joined via the enclosing `class_declaration` node id. - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` → method derived from API name. - `WebClient.method(HttpMethod.X, "/path")` → method from `HttpMethod.X` capture. - `new Request.Builder().url("/path")` → OkHttp consumer. - **Go (`go.ts`)** - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures upper-case verb + handler identifier. - `net/http.HandleFunc("/path", handler)` → provider (default GET). - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`, resty `client.R().Get/Post/...`. - **Python (`python.ts`)** - `@app.get("/path")` FastAPI decorators. - `requests.get/post/...` and `requests.request("METHOD", "url")`. - **PHP (`php.ts`)** - Laravel `Route::get/post/.../patch('/path', ...)` via `scoped_call_expression`. Uses `PHP.php_only` to match the existing ingestion pipeline's grammar selection. - **Node (`node.ts`) — JS + TS + TSX** - Pattern sources defined once, compiled against three grammar variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`) because `Parser.Query` objects are not portable across grammars. Exports three plugins sharing the same `scan` logic. - NestJS: `@Controller('prefix')` decorators are siblings of the class in `export_statement` / `program`; `@Get(':id')` decorators are siblings of the method in `class_body`. The plugin walks decorator → next named sibling to find the decorated class / method, then combines the class prefix with the method path. Only emits NestJS detections when the enclosing class has a real `@Controller` decorator — prevents false positives from generic classes that happen to use `@Get` from another library. - Express: `(router|app).<verb>('/path', ...)`. - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })` (uses two queries + a SyntaxNode-id dedupe set so URL literals aren't double-emitted by the options variant). - `axios.get/post/...`. ## Orchestrator changes `http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers` regex method and replaces them with a single source-scan loop that delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator still owns: - **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`) — language-agnostic string processing shared by both strategies. - **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS` Cypher queries) — unchanged in spirit. The only regex helpers it used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now replaced by a lookup against the plugin's detections for the same file: for each route row, find the detection whose normalized path matches, and pull the HTTP method + handler name from it. - **Per-file parse cache** — the orchestrator parses each relevant file at most once per `extract()` call. Both the graph-assisted enrichment loop and the source-scan fallback share the same `cachedDetections` map, so we never run the plugin twice for the same file. ## Why this is better than the regex version 1. **Comments and strings for free.** The old regex would match `// router.get('/fake')` as a real Express route; tree-sitter never visits string/comment nodes. 2. **Structural controller-prefix.** Spring and NestJS class-prefix joining is now scoped to the enclosing class via `class_declaration` node ids, eliminating file-wide state that broke when a file had multiple controllers. 3. **Precise NestJS disambiguation.** The plugin only emits a NestJS detection when the enclosing class has a real `@Controller` decorator — the old regex would fire on any `@Get(...)` in the file regardless of surrounding context. 4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP detection later means dropping one file in `http-patterns/` — no changes to the shared scanner, the orchestrator, or the Strategy A Cypher queries. ## Tests - `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged; they're contract-style input/output tests and the contract shape is unchanged). Covers Spring class prefix, Express, gin/echo, stdlib HandleFunc, NestJS, Laravel, FastAPI for providers and fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/ resty for consumers, plus graph-first Strategy A for both. - `topic-extractor.test.ts` — **30/30 pass** after the `captures.value` API migration. - `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3). - `manifest-extractor.test.ts` — 8/8 pass (untouched). - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No changes to pipeline.ts, MCP surface, ingestion, or tests. - No CI / release / security / secrets changes. - Tree-sitter grammars imported by plugins (`tree-sitter-java`, `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already in `package.json` for the existing ingestion pipeline. ## Phase 3 plan - **grpc-extractor** gets the same treatment: plugin-per-language under `grpc-patterns/` for Java / Go / Python / TS detection. `.proto` files remain an open question — no `tree-sitter-proto` grammar is installed, so the in-tree string-sanitizing parser from PR abhigyanpatwari#796's self-review stays as a pragmatic exception unless the maintainer wants us to add `tree-sitter-proto` as a new dep. Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate grpc-extractor source scans to tree-sitter plugins Phase 3 (final) of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as phase 1 (topic) and phase 2 (http): thin language-agnostic orchestrator + per-language plugins that own tree-sitter grammars and query sources. With this commit the top-level extractors under `src/core/group/extractors/` import ZERO tree-sitter grammars and ZERO query strings — every grammar import lives in a `*-patterns/<lang>.ts` plugin file, and the orchestrators go through the registry indirection. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared primitives (unchanged) ├── grpc-extractor.ts # orchestrator (only `.proto` parser left) └── grpc-patterns/ ├── types.ts # GrpcDetection, GrpcLanguagePlugin, GrpcRole ├── index.ts # registry: ext → plugin + GRPC_SCAN_GLOB ├── go.ts # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient ├── java.ts # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub ├── python.ts # tree-sitter-python: add_XxxServicer_to_server + XxxStub └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # @GrpcMethod, @GrpcClient field type, # .getService<X>('Svc'), new XxxServiceClient, # loadPackageDefinition dynamic constructors ``` ## Per-language coverage **Go (`go.ts`)** - Provider: `\w+.RegisterXxxServer(...)` via `call_expression → selector_expression → field_identifier` + JS regex filter `^Register(\w+)Server$`. - Provider: `pb.UnimplementedXxxServer` embedded in a struct via `struct_type → field_declaration_list → field_declaration → qualified_type → type_identifier` + JS filter. - Consumer: `\w+.NewXxxClient(...)` via the same call_expression query + JS filter `^New(\w+)Client$`. **Java (`java.ts`)** - Provider: `class X extends YyyGrpc.YyyImplBase` — two queries handle the scoped and plain forms. `scoped_type_identifier`'s children are positional (no `scope:`/`name:` fields), so the query matches the two `type_identifier` children by position. - `#match? @inner "ImplBase$"` restricts matches at query time. - Whether the class has `@GrpcService` or not controls only the `source` metadata label — the plugin walks the class_declaration's `modifiers` child in JS to detect the marker_annotation. - Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a `method_invocation` query with `#match? @method "^new(Blocking)?Stub$"`, service name extracted via `^(\w+)Grpc$` on the object identifier. **Python (`python.ts`)** - Single call-expression query covers both bare identifier and `obj.method` attribute forms: `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`. - Plugin filters `@fn.text` against two JS regexes: `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$` (consumer), with a reserved-names ignore list for the Stub case (Mock / Test / Fake / Stub). **Node — JavaScript + TypeScript + TSX (`node.ts`)** - Pattern sources defined once, compiled three times (one per grammar) because `Parser.Query` objects are not portable across grammars. Exports three `GrpcLanguagePlugin`s sharing the same `scan`. - `@GrpcMethod('Service', 'Method')`: decorator query captures the two string literals. Confidence is hard-coded 0.8 regardless of proto map resolution (matches the original regex version's behaviour). - `@GrpcClient(...) field: XxxServiceClient`: decorator query captures the decorator node, plugin walks up to find the enclosing `public_field_definition` (decorators on fields are CHILDREN of the field definition in tree-sitter-typescript, not siblings) and reads its first `type_annotation → type_identifier`, then runs the `^(\w+Service)Client$` JS filter. - `client.getService<X>('AuthService')`: call-expression query on `member_expression.property = "getService"` + string literal arg. - `new XxxServiceClient(...)`: `new_expression` with a bare identifier constructor, filtered by `^(\w+Service)Client$` so generic `new AuthClient(...)` (missing the `Service` infix) does NOT falsely register as a consumer. Preserves the regression test `test_extract_ts_non_service_client_constructor_is_ignored`. - `loadPackageDefinition` dynamic loader: gated on `tree.rootNode.text.includes('loadPackageDefinition')`. When set, `new foo.bar.Xxx(...)` qualified constructors with a capitalised property name register as consumers. ## Orchestrator changes `grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders` / ... helper and replaces them with a single source-scan loop that: 1. Parses each file with the plugin's grammar (one shared `Parser` instance across all files, `setLanguage` called per plugin). 2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`. 3. Converts each detection to an `ExtractedContract` via the private `detectionToContract` helper, which: - Looks the short service name up in the proto map (filled by the `.proto` parser). - Picks confidence = `confidenceWithProto` if resolved, else `confidenceWithoutProto`. - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when the detection carries a `methodName` (TS `@GrpcMethod` only), otherwise a service-level id (`grpc::pkg.Svc/*`). Everything else — the `.proto` parser, `buildProtoContext`, `buildProtoMap`, `resolveProtoConflict`, `serviceContractId`, `stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe function — stays exactly as before. The `.proto` parser is kept as a pragmatic exception to the "no regex in extractors" rule because no `tree-sitter-proto` grammar is installed in the repo; a comment at the top of the file explains this and flags the maintainer option of adding `tree-sitter-proto` as a dependency. ## Why this is better than the regex version 1. **Comments and strings are respected for free.** Matched node types are only code constructs, never text inside comments or string literals. 2. **No false positives on partial names.** The old `(\w+?)Grpc`-style regexes would cross-match unrelated identifiers; structural queries restrict matches to the exact AST shape (`scoped_type_identifier → type_identifier` pairs, `method_invocation → identifier` etc.). 3. **NestJS `@GrpcClient` is structural, not regex-based.** The old regex required a specific textual layout (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the plugin now walks the AST, so modifier order / optional modifiers / multi-line formatting don't break it. 4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC detection later is a one-file edit in `grpc-patterns/index.ts` — no touches to the shared scanner, the orchestrator, or the proto parser. ## Tests - `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the contract shape is identical). Covers .proto parsing (including the brace-inside-string regression), Go provider/consumer, Java @GrpcService / plain ImplBase provider + newBlockingStub consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient + .getService + new XxxServiceClient + loadPackageDefinition + the `AuthClient` vs `AuthServiceClient` discrimination, dedupe across multiple patterns in one file, proto-aware confidence, and the inherited-package resolution for split proto definitions. - `topic-extractor.test.ts` — 30/30 pass. - `http-route-extractor.test.ts` — 18/18 pass. - `manifest-extractor.test.ts` — 8/8 pass. - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No pipeline.ts, MCP surface, ingestion, CI / release / security, or test changes. - New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`, `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already installed for the ingestion pipeline. ## End of phase series This commit completes the three-phase extractor refactor: - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/` - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/` - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/` Every remaining regex-based extractor helper under the `src/core/group/ extractors/` directory is either (a) language-agnostic string processing (path normalization, dedupe keys) or (b) the `.proto` parser, which is documented as an explicit exception. Co-authored-by: Claude <noreply@anthropic.com> * feat(group): add tree-sitter-proto for .proto file parsing Addresses @magyargergo's suggestion on abhigyanpatwari#796 to replace the manual string-sanitizing .proto parser with a tree-sitter grammar. - **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`. Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto) (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli 0.24` to produce ABI version 14 — compatible with the project's `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as `optionalDependency` with `file:./vendor/tree-sitter-proto`. - **New `grpc-patterns/proto.ts` plugin** — uses the same `compilePatterns` + `runCompiledPatterns` infrastructure as every other plugin. Two queries: - `(package (full_ident) @pkg)` — package declaration - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))` — one match per (service, rpc) pair - **Graceful fallback** — `tree-sitter-proto` is an optional dependency. If it fails to install (platform incompatibility) or fails the runtime smoke-test (`setLanguage` + `parse` on a trivial proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator uses the existing manual parser. The smoke-test catches the `SyntaxNode` TDZ error that occurs in vitest's fork-based test runner. - **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto` files are handled by the plugin loop (they're included in `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is skipped. `buildProtoContext` still runs to build the proto map for cross-referencing source-file detections. 1. **No manual comment/string stripping.** The old parser needed `stripProtoCommentsAndStrings` (110 lines) to avoid counting braces inside comments and string literals. tree-sitter handles this natively. 2. **No brace-depth tracking.** `extractServiceBlocks` used a manual depth counter to find service boundaries. tree-sitter's AST gives us `service` → `service_name` + `rpc` → `rpc_name` directly. 3. **Performance.** tree-sitter's C-based parser is faster than character-by-character JS scanning + regex on large proto files. - `grpc-extractor.test.ts` — **43/43 pass** (unchanged) - All other extractor tests — 99/99 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * chore: add .gitignore for vendored tree-sitter-proto build artifacts https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * fix: correct .gitignore paths for vendored tree-sitter-proto Patterns should be relative to the .gitignore file's directory. https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * refactor(group): address Copilot review feedback on abhigyanpatwari#796 Six fixes suggested by the Copilot AI review: 1. **`normalizeHttpPath` root-path edge case** — stripping trailing slashes on the input `/` produced an empty string, yielding malformed contract ids like `http::GET::`. Now preserves `/` for the root handler/fetch case. 2. **Dedupe `scanFiles` call** — `extract()` was globbing the source-scan file list twice (once for the provider fallback, once for the consumer fallback). Moved to a single lazy call that memoizes the result for the rest of the method. 3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other extractor's glob already ignored vendored sources; the HTTP one didn't. Fixed for consistency. 4. **`loadPackageDefinition` check is now structural** — was calling `tree.rootNode.text.includes('loadPackageDefinition')` which forces materialization of the entire file text from the parse tree (expensive on large files). Replaced with a dedicated compiled query on `(call_expression function: [(identifier) | (member_expression)])` so the check stays in the AST domain. 5. **`grpc-extractor.ts` header docstring updated** — still claimed ".proto parsing is not tree-sitter-based because no grammar is installed". Now describes the actual behaviour: tree-sitter when `tree-sitter-proto` is available (optionalDependency), manual fallback otherwise. 6. **Eliminated the double proto file parse on the fallback path** — `buildProtoContext` already globs + parses every `.proto` file to build `servicesByName`. On the `!hasProtoPlugin` branch the extractor was globbing + parsing again via the now-removed `parseProtoFile` helper. The fallback branch now iterates the map that `buildProtoContext` already produced to emit provider contracts directly — single pass per proto file. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): address Claude review feedback (bugs + dedup + hygiene) on abhigyanpatwari#796 Follows up `2f28bfc` with the remaining items from the Claude AI review: ## Bugs **Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.** The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x` with no label filter, so a topic/service/package name could silently match any node type (File, Variable, Import, Folder, …). Added label filters: - `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort symbol-name matches against listener/publisher symbols) - `grpc` method → `(n:Function|Method)` - `grpc` service → `(n:Class|Interface)` - `lib` → `(n:Package|Module)` All 8 manifest-extractor tests still pass (mock executor is label-agnostic, but the production LadybugDB graph now gets correctly scoped queries). **Bug 8 — Tautological `!handlerName` condition.** `http-route-extractor.ts:extractProvidersGraph` had `let handlerName = null; if (!method || !handlerName) { ... }` — the `!handlerName` clause was always true since there was no intervening assignment. Simplified to always run the plugin-scan lookup (we need the handler name even when `methodFromRouteReason` already resolved the method). ## Clean code / dedup **Design 7 — `readSafe` was copy-pasted in all three orchestrators.** Extracted to `extractors/fs-utils.ts` as the single source of truth for the path-traversal guard. Dropped the three local copies and the now-unused `fs`/`path` imports from topic-extractor. **Style 10 — Language-specific `_test.go` skip in the topic orchestrator.** Was `if (rel.endsWith('_test.go')) continue;` inside the language- agnostic extraction loop. Pushed into the glob's ignore list (`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`, `dist`, `build` entries, with a comment explaining that other languages' test file conventions either live in separate directories (Python `tests/`, Java `src/test/`) or are already covered by the existing ignores. ## Already addressed in `2f28bfc` (mentioned again in Claude review) - Bug 3: `normalizeHttpPath('/')` returns `''` — fixed - Bug 4: double glob + double parse of `.proto` — fixed - Bug 5: `scanFiles` called twice in HTTP — fixed - Bug 6: missing `**/vendor/**` in HTTP glob — fixed - Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')` replaced with a dedicated structural query ## Deferred - Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope; sync.ts matching logic lands in abhigyanpatwari#793, manifest extractor already emits correct synthetic uids for unresolved HTTP contracts. - Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) — the only real use case (`loadPackageDefinition` gate) is already fixed via a structural query, so the interface change would be cosmetic churn without a concrete consumer. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * docs+fix(group): address remaining Claude review items + add pipeline flow chart ## Fixes **Remaining 🔴 — HTTP contract id wildcard format.** Documented the `http::*::<path>` format as an intentional wildcard for manifest links that omit the HTTP method, alongside the explicit-method form (`GET::/path` → `http::GET::/path`). The docblock on `buildContractId` now states both forms, notes that wildcard-aware matching is the responsibility of the sync / cross-impact layer (abhigyanpatwari#793), and recommends the explicit-method form whenever the author knows the method (it round-trips through exact equality without needing wildcard logic downstream). Tests unchanged — the wildcard format is what they've always asserted. **Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The comment claimed "creates a contract with an empty symbolUid/ref" but the code switched to `manifestSymbolUid(repo, contractId)` a few commits back. Updated to describe the actual synthetic-uid fallback semantics and the cross-impact path that relies on both sides of the join deriving the same uid. **Minor 2 — exhaustiveness guard on `buildContractId`.** The `switch(type)` covered all five current `ContractType` variants but silently returned `undefined` if a new variant was added. Added a `default: const _exhaustive: never = type; throw new Error(...)` clause so the build fails loudly on an unhandled variant. **Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already fixed in `2f28bfc` via a dedicated structural query (`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed. ## New: pipeline flow chart (per @magyargergo's request) Added `src/core/group/PIPELINE.md` with four Mermaid diagrams: 1. **High-level overview** — `group.yaml` → extractors + manifest → contract matching → `bridge.lbug` → `runGroupImpact`. 2. **Per-repo extractor two-strategy shape** — graph-assisted Strategy A vs. source-scan Strategy B. 3. **Plugin architecture** — orchestrator → registry → per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` → `ExtractedContract`. 4. **Manifest extraction** — label-scoped `resolveSymbol` with the synthetic-uid fallback. 5. **Cross-impact query (abhigyanpatwari#606)** — local impact → bridge join → cross-repo fan-out. Each diagram is annotated with which PRs own which stage (this PR: extractors + manifest; abhigyanpatwari#795: bridge storage; abhigyanpatwari#606: cross-impact runtime) and points at the concrete files/functions involved. ## Tests - 99/99 extractor tests pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>

ivkond mentioned this pull request Apr 1, 2026

[group] Intra-repo service communication tracking #625

Closed

9 tasks

ivkond changed the title ~~feat(group): cross-index impact analysis via repository groups~~ [group] Cross-repo impact analysis via repository groups Apr 1, 2026

ivkond mentioned this pull request Apr 1, 2026

[group] Intra-repo service communication tracking #626

Merged

9 tasks

ivkond and others added 2 commits April 4, 2026 23:28

ivkond force-pushed the feat/cross-index-impact-groups branch from 88fd9d6 to de79d4b Compare April 4, 2026 21:03

ivkond force-pushed the feat/cross-index-impact-groups branch from de79d4b to ae68bfb Compare April 4, 2026 21:33

JaysonAlbert mentioned this pull request Apr 7, 2026

feat(group): support HTTP contract linking when gateway paths differ from backend routes #701

Open

ivkond added 5 commits April 9, 2026 02:36

fix(group): restore out-of-scope reporting and phase1 timeout semantics

149e8ee

fix(group): restore out-of-scope reporting and phase1 timeout semantics

fd2a4d5

fix(group): persist manifest grpc and lib links through bridge

fa80ce6

feat(group): expand http extractor coverage across Go, NestJS, Python…

f614578

… and Java

feat(group): expand grpc extraction and proto resolution coverage

54d681a

fix(group): harden bridge sync and add cross-repo coverage

ca81e87

ivkond force-pushed the feat/cross-index-impact-groups branch 2 times, most recently from daeac6a to 55089ae Compare April 11, 2026 08:35

ivkond force-pushed the feat/cross-index-impact-groups branch from 14c0e7d to d15b8cb Compare April 11, 2026 13:34

ivkond mentioned this pull request Apr 11, 2026

feat(group): bridge.lbug storage + contract matching expansion (1/4 of #606 split) #795

Merged

ivkond mentioned this pull request Apr 11, 2026

feat(group): extractor expansion + manifest extractor (2/4 of #606 split) #796

Merged

magyargergo mentioned this pull request Apr 13, 2026

chore: release v1.6.1 #815

Merged

2 tasks

ivkond closed this Apr 20, 2026

ivkond deleted the feat/cross-index-impact-groups branch April 20, 2026 15:55

magyargergo mentioned this pull request May 24, 2026

fix(group): move manifest/workspace extraction before closeLbug (#1802) #1807

Merged

6 tasks

Uh oh!

Conversation

ivkond commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

vercel Bot commented Mar 31, 2026

Uh oh!

ivkond commented Mar 31, 2026

Uh oh!

github-actions Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Report

Pipeline Status

Test Results

Code Coverage

Tests

Uh oh!

xkonjin commented Mar 31, 2026

Code Review: Cross-Index Impact Groups

Issues

Minor

Verdict

Uh oh!

ivkond commented Mar 31, 2026

1. Graph schema gaps

2. Contract ID :: delimiter

3. BM25 threshold / 4. Embedding vectors

Minor: MCP tools

Minor: group validate

Uh oh!

ivkond commented Mar 31, 2026

Uh oh!

abhigyanpatwari commented Apr 1, 2026

Uh oh!

ivkond commented Apr 1, 2026

Uh oh!

ivkond commented Apr 1, 2026

Uh oh!

ivkond commented Apr 1, 2026

Uh oh!

vercel Bot commented Apr 4, 2026

Uh oh!

ivkond commented Apr 4, 2026

Uh oh!

ivkond commented Apr 8, 2026

Uh oh!

abhigyanpatwari commented Apr 8, 2026

Uh oh!

github-actions Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review: Cross-Repo Impact Analysis — Protocol Coverage & Real-World Test Validation

Summary

1. Protocol Coverage: What Works vs. What's Missing

HTTP (http-route-extractor.ts)

gRPC (grpc-extractor.ts)

Topic/Messaging (topic-extractor.ts)

2. Real-World Test Coverage: The Main Gap

A. Fixture-based integration test against real code

B. Known-good open-source repo test

C. Confidence degradation test

3. Correctness Bugs Found

Bug 1: outOfScope always empty in Cypher path (cross-impact.ts:417)

Bug 2: Manifest cross-links silently dropped for gRPC/lib (manifest-extractor.ts:92)

Bug 3: resolveProtoConflict directory proximity heuristic fails for centralized proto repos

4. Minor Issues

5. What's Working Well

Recommended Next Steps (Prioritized)

Uh oh!

abhigyanpatwari commented Apr 8, 2026

Uh oh!

ivkond commented Apr 11, 2026

Two options — whichever saves your review time

Uh oh!

magyargergo commented Apr 11, 2026

Uh oh!

ivkond commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed split — 4 stacked PRs

ivkond commented Mar 31, 2026 •

edited

Loading

github-actions Bot commented Mar 31, 2026 •

edited

Loading

2. Contract ID `::` delimiter

Minor: `group validate`

github-actions Bot commented Apr 8, 2026 •

edited

Loading

HTTP (`http-route-extractor.ts`)

gRPC (`grpc-extractor.ts`)

Topic/Messaging (`topic-extractor.ts`)

Bug 1: `outOfScope` always empty in Cypher path (`cross-impact.ts:417`)

Bug 2: Manifest cross-links silently dropped for gRPC/lib (`manifest-extractor.ts:92`)

Bug 3: `resolveProtoConflict` directory proximity heuristic fails for centralized proto repos

ivkond commented Apr 11, 2026 •

edited

Loading