
[group] Intra-repo service communication tracking #626

Merged
abhigyanpatwari merged 5 commits into abhigyanpatwari:main from ivkond:feat/intra-repo-service-tracking-clean
Apr 3, 2026

@ivkond ivkond commented Apr 1, 2026

Contributor

Summary

Adds a group system for repository analysis with intra-repo service communication tracking for microservice monorepos. This is the foundation for cross-repo impact analysis (follow-up PR).

What changed:

  • Group infrastructure: types, config parser, contract registry, exact matching engine
  • Service boundary detection for monorepos (package.json, go.mod, Dockerfile, pom.xml, Cargo.toml markers)
  • HTTP route extractor (Spring, Express, Laravel, FastAPI + fetch/axios consumers)
  • gRPC extractor (proto parsing + Go/Java/Python/TS server/client detection)
  • Topic extractor (Kafka, RabbitMQ, NATS producers/consumers across 4 languages)
  • Intra-repo matching between services within the same monorepo
  • CLI commands: group create/add/remove/list/sync/contracts/query/status
  • MCP tools: group_list, group_sync, group_contracts, group_query, group_status
  • Documentation for new CLI commands and MCP tools in both READMEs

Why:
The GitNexus author requested intra-repo service tracking as a prerequisite for cross-repo analysis (#606). Proving that microservice monorepos index well is the foundation — once intra-repo works, the same extractors feed directly into the virtual graph for multi-repo support.

Backward compatibility & roadmap

Groups are fully opt-in: gitnexus analyze continues to work exactly as before. For small/medium monorepos (2-20 services), a single analyze on the whole repo already captures everything in one graph, and impact/query tools see inter-service relationships through existing CALLS/IMPORTS edges.

Groups become valuable for:

  • Large monorepos (20+ services) where per-service indexing is faster and less noisy
  • Multi-repo setups where services live in separate git repositories

The gRPC, Kafka/RabbitMQ/NATS, and HTTP extractors currently run only inside group sync. A natural next step is integrating them into the standard analyze pipeline so that inter-service communication edges (gRPC calls, topic pub/sub) are captured automatically for all repos — no group setup required. This would make service communication tracking zero-config for monorepos of any size, while groups remain available for cross-repo and advanced use cases.

How to verify:

cd gitnexus
npx tsc --noEmit                    # typecheck
npm run test:unit                   # 2658 unit tests
npm run test:integration            # 1973 integration tests
# Monorepo fixture test:
npx vitest run test/integration/group/monorepo-sync.test.ts

Risk / rollback:

  • All new code is in src/core/group/ — isolated from existing functionality
  • No existing APIs changed; service field is optional and backward-compatible
  • gitnexus analyze behavior is completely unchanged
  • Rollback: revert the PR (no migrations, no schema changes)

Not in this PR (follow-up):

Test plan

  • 15 unit tests — ServiceBoundaryDetector
  • 16 unit tests — matching (incl. 4 intra-repo cases)
  • 13 unit tests — GrpcExtractor
  • 19 unit tests — TopicExtractor
  • 14 unit tests — GroupService
  • 9 unit tests — sync pipeline + stableRepoPoolId
  • Config parser, storage, types, HTTP extractor tests
  • 3 integration tests — monorepo sync (fixture with auth/orders/gateway)
  • Full regression: all existing tests pass

🤖 Generated with Claude Code

ivkond and others added 3 commits April 2, 2026 00:39
Core foundation for repository group analysis:
- Type system: ContractType, ExtractedContract, StoredContract, CrossLink
  with optional `service` field for intra-repo matching
- Config parser for group.yaml (repos, detection flags, matching thresholds)
- Contract registry storage with atomic writes
- Exact matching engine with per-type normalization (HTTP, gRPC, topic)
  and intra-repo support (different services within same repo can match)
- Extract LadybugDB pool-adapter from MCP backend for reuse by sync pipeline
- Git staleness checker for group status reporting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Service communication detection for microservice monorepos:

- ServiceBoundaryDetector: auto-detects service boundaries via markers
  (package.json, go.mod, Dockerfile, pom.xml, Cargo.toml, build.gradle,
  pyproject.toml, etc.)
- HttpRouteExtractor: graph-assisted (Strategy A) with source-scan
  fallback (Strategy B) for Spring, Express, Laravel, FastAPI providers
  and fetch/axios consumers
- GrpcExtractor: parses .proto files, detects Go/Java/Python/TS gRPC
  servers (RegisterXxxServer, @GrpcService, add_XxxServicer_to_server,
  @GrpcMethod) and clients (NewXxxClient, newBlockingStub, XxxStub)
- TopicExtractor: Kafka (@KafkaListener, producer.send), RabbitMQ
  (@RabbitListener, channel.publish/consume), NATS (nc.Subscribe/Publish)
  across Java, Node, Go, and Python

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wire extractors into the sync pipeline with service boundary detection.
GroupService provides high-level API for all group operations.

- Sync pipeline: orchestrates extraction (HTTP, gRPC, topics) with
  service boundary assignment and exact matching
- GroupService: groupList, groupSync, groupContracts, groupQuery,
  groupStatus (groupImpact deferred to cross-repo follow-up PR)
- CLI: group create/add/remove/list/sync/contracts/query/status
- MCP tools: group_list, group_sync, group_contracts, group_query,
  group_status
- Monorepo fixture: 3 services (auth/orders/gateway) connected via
  gRPC + Kafka + HTTP — all intra-repo cross-links discovered
- Documentation: CLI commands and MCP tools added to both READMEs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vercel Bot commented Apr 1, 2026


@ivkond is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.


github-actions Bot commented Apr 1, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
| --- | --- | --- |
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
| --- | --- | --- | --- | --- |
| 5023 | 4977 | 0 | 46 | 183s |

✅ All 4977 tests passed

46 test(s) skipped:
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature
  • Swift constructor-inferred type resolution > detects User and Repo classes, both with save methods
  • Swift constructor-inferred type resolution > resolves user.save() to Models/User.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > resolves repo.save() to Models/Repo.swift via constructor-inferred type
  • Swift constructor-inferred type resolution > emits exactly 2 save() CALLS edges (one per receiver type)
  • Swift self resolution > detects User and Repo classes, each with a save function
  • Swift self resolution > resolves self.save() inside User.process to User.save, not Repo.save
  • Swift parent resolution > detects BaseModel and User classes plus Serializable protocol
  • Swift parent resolution > emits EXTENDS edge: User → BaseModel
  • Swift parent resolution > emits IMPLEMENTS edge: User → Serializable (protocol conformance)
  • Swift cross-file User.init() inference > resolves user.save() via User.init(name:) inference
  • Swift cross-file User.init() inference > resolves user.greet() via User.init(name:) inference
  • Swift return type inference > detects User class and getUser function
  • Swift return type inference > detects save function on User (Swift class methods are Function nodes)
  • Swift return type inference > resolves user.save() to User#save via return type of getUser() -> User
  • Swift return-type inference via function return type > resolves user.save() to User#save via return type of getUser()
  • Swift return-type inference via function return type > user.save() does NOT resolve to Repo#save
  • Swift return-type inference via function return type > resolves repo.save() to Repo#save via return type of getRepo()
  • Swift implicit imports (cross-file visibility) > detects UserService class in Models.swift
  • Swift implicit imports (cross-file visibility) > resolves UserService() constructor call across files (no explicit import)
  • Swift implicit imports (cross-file visibility) > resolves service.fetchUser() member call across files
  • Swift implicit imports (cross-file visibility) > creates IMPORTS edges between files in the same module
  • Swift extension deduplication > detects Product class
  • Swift extension deduplication > resolves Product() constructor despite extension creating duplicate class node
  • Swift extension deduplication > resolves product.save() to Product.swift (primary definition)
  • Swift constructor call fallback (no new keyword) > resolves OCRService() as constructor call across files
  • Swift constructor call fallback (no new keyword) > resolves ocr.recognize() member call via constructor-inferred type
  • Swift export visibility (internal vs private) > resolves PublicService() constructor across files
  • Swift export visibility (internal vs private) > resolves internalHelper() across files (internal = module-scoped)
  • Swift if let / guard let binding resolution > detects User and Repo classes
  • Swift if let / guard let binding resolution > resolves user.save() inside if-let to User#save
  • Swift if let / guard let binding resolution > resolves repo.save() inside guard-let to Repo#save
  • Swift if let / guard let binding resolution > user.save() in if-let does NOT resolve to Repo#save
  • Swift await / try expression unwrapping > resolves user.save() via await fetchUser() return type
  • Swift await / try expression unwrapping > resolves repo.save() via try parseRepo() return type
  • Swift await / try expression unwrapping > detects fetchUser and parseRepo as functions
  • Swift for-in loop element type inference > detects User and Repo classes
  • Swift for-in loop element type inference > creates implicit import edges between files
  • Swift field-type resolution > detects classes and their properties
  • Swift field-type resolution > emits HAS_PROPERTY edges from class to field
  • Swift field-type resolution > resolves field-chain call user.address.save() → Address#save
  • Swift field-type resolution > emits ACCESSES edges for field reads in chains
  • Swift field-type resolution > populates field metadata (visibility, declaredType) on Property nodes
  • Swift call-result binding > resolves call-result-bound method call user.save() → User#save
  • Swift call-result binding > getUser() is present as a defined function
  • Swift call-result binding > emits processUser -> getUser CALLS edge for let-assigned free function call

Code Coverage

Tests

| Metric | Coverage | Covered | Base | Delta | Status |
| --- | --- | --- | --- | --- | --- |
| Statements | 71.02% | 13959/19655 | 70.85% | 📈 +0.2 | 🟢 ██████████████░░░░░░ |
| Branches | 60.18% | 9071/15072 | 60.09% | 📈 +0.1 | 🟢 ████████████░░░░░░░░ |
| Functions | 75.35% | 1263/1676 | 75.59% | 📉 -0.2 | 🔴 ███████████████░░░░░ |
| Lines | 73.17% | 12703/17359 | 73.06% | 📈 +0.1 | 🟢 ██████████████░░░░░░ |

📋 View full run · Generated by CI

@xkonjin xkonjin left a comment

Code Review: Intra-repo Service Communication Tracking

Overall: Well-structured foundational PR for group analysis. The contract matching system shows good architectural thinking.

Strengths

  • Clean type system with ContractType, ExtractedContract, StoredContract, CrossLink — the optional service field for intra-repo matching is a thoughtful addition
  • Atomic writes in storage layer — writeContractRegistry with temp file + rename pattern prevents corruption during concurrent writes
  • Per-type normalization for contract IDs — the HTTP (method + path normalization), gRPC (package lowercase), topic (trim + lowercase), and lib handling shows attention to protocol-specific requirements
  • Intra-repo matching logic correctly skips same-service matches while allowing cross-service matching within the same repository
  • Pool adapter extraction from MCP backend to core layer is proper refactoring for reuse
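The intra-repo rule praised above (skip same-service pairs, allow cross-service pairs inside one repository) can be sketched roughly as follows. The Contract and Link shapes, the role names, and the exactMatch function are illustrative assumptions, not the PR's actual types or API.

```typescript
// Hypothetical, simplified shapes; the PR's real ExtractedContract has more fields.
type Role = "provider" | "consumer";

interface Contract {
  contractId: string; // assumed to be normalized already
  role: Role;
  repo: string;
  service?: string; // optional, mirroring the optional `service` field in the PR
}

interface Link {
  consumer: Contract;
  provider: Contract;
}

// Two contracts belong to the same unit when repo and service both agree.
// Assumption: a missing service means "the whole repo".
function sameUnit(a: Contract, b: Contract): boolean {
  return a.repo === b.repo && (a.service ?? "") === (b.service ?? "");
}

function exactMatch(contracts: Contract[]): Link[] {
  // Index providers by contract ID so each consumer lookup is a map access.
  const providers = new Map<string, Contract[]>();
  for (const c of contracts) {
    if (c.role !== "provider") continue;
    const bucket = providers.get(c.contractId) ?? [];
    bucket.push(c);
    providers.set(c.contractId, bucket);
  }

  const links: Link[] = [];
  for (const consumer of contracts) {
    if (consumer.role !== "consumer") continue;
    for (const provider of providers.get(consumer.contractId) ?? []) {
      if (sameUnit(consumer, provider)) continue; // a service never links to itself
      links.push({ consumer, provider });
    }
  }
  return links;
}
```

With one provider in an orders service and consumers in both orders and gateway, only the gateway consumer produces a link under these assumptions.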

Issues

1. Missing error handling in checkStaleness
The function catches errors and returns { isStale: false, commitsBehind: 0 }. This could mask real git failures and report stale indexes as fresh. Consider at least logging the error or returning a tri-state (fresh | stale | unknown).
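A minimal sketch of the tri-state idea, assuming the commit count comes from an injected function so the example stays independent of how git is actually invoked. The Staleness type and checkStalenessTriState are hypothetical names, not the existing checkStaleness signature.

```typescript
// Hypothetical result type: "unknown" keeps git failures distinct from "fresh".
type Staleness =
  | { state: "fresh"; commitsBehind: 0 }
  | { state: "stale"; commitsBehind: number }
  | { state: "unknown"; error: string };

// countCommitsBehind stands in for whatever runs git (an assumption of this sketch).
function checkStalenessTriState(countCommitsBehind: () => number): Staleness {
  try {
    const behind = countCommitsBehind();
    if (!Number.isInteger(behind) || behind < 0) {
      return { state: "unknown", error: `unexpected commit count: ${behind}` };
    }
    if (behind === 0) {
      return { state: "fresh", commitsBehind: 0 };
    }
    return { state: "stale", commitsBehind: behind };
  } catch (err) {
    // A failing git call is reported as unknown instead of silently passing as fresh.
    const message = err instanceof Error ? err.message : String(err);
    return { state: "unknown", error: message };
  }
}
```

Callers that only need a boolean could treat "unknown" as stale, which keeps the conservative default while still surfacing the error text.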

2. Race condition in contract registry writes
While the atomic rename is good, concurrent calls to writeContractRegistry for the same group could interleave temp file creation. Consider including a unique suffix or timestamp in the temp filename.

3. gRPC normalization edge case
The normalizeContractId function has inconsistent behavior: when there is no slash, the whole string is lowercased; when there is a leading slash, case is preserved. This could cause non-deterministic matching for malformed IDs.

4. MAX_HTTP_METHOD_LEN = 16 seems arbitrary
HTTP methods like SEARCH or custom methods could exceed this. Consider referencing RFC 7231 or using a more generous limit (32?).

5. Missing test coverage for malformed YAML scenarios
The config parser tests validate happy paths well, but missing coverage for: duplicate repo paths, circular links (a->b and b->a), and YAML anchors/aliases.

Security

  • No direct security concerns — all paths are validated through path.normalize before file operations
  • The git staleness check uses execFileSync with controlled arguments (no shell injection risk)

Verdict

LGTM with minor issues noted. The foundation is solid for building the extractor pipeline on top.

@abhigyanpatwari

Owner

Good work on this, the intra-repo service tracking direction is exactly right. The extractors and service boundary detection fill a real gap — this is the foundation for cross-repo analysis.

On storage/architecture: The JSON contract registry is fine for this PR — it's already large enough. But in #606, the cross-repo layer will need to use LadybugDB (bridge.lbug) instead of JSON — Cypher queries can't run against flat files, and impact traversal across repos needs a proper graph. So keep the JSON here, but plan for #606 to introduce the virtual bridge graph and migrate contract storage into it.

Issues to address before merge:

1. Path traversal via group name (HIGH)
storage.ts:getGroupDir does path.join(gitnexusDir, 'groups', groupName) with no sanitization. A group name like ../../etc creates directories outside the intended path. Validate group names to [a-zA-Z0-9_-] in createGroupDir or parseGroupConfig.
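As a rough illustration of the suggested validation, assuming it runs before any path is built from the name. GROUP_NAME_RE and this exact validateGroupName signature are a sketch, not code from storage.ts.

```typescript
// Only letters, digits, underscore and hyphen; anchored so the whole name must match.
const GROUP_NAME_RE = /^[a-zA-Z0-9_-]+$/;

// Returns the name unchanged when valid and throws otherwise, so it can wrap
// the value passed to path.join at the call site.
function validateGroupName(name: string): string {
  if (!GROUP_NAME_RE.test(name)) {
    throw new Error(`Invalid group name: ${JSON.stringify(name)}`);
  }
  return name;
}
```

Because "." and "/" are outside the allowed set, names such as ../../etc are rejected before they reach the file system. Requiring a leading alphanumeric character would be a reasonable tightening.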

2. gRPC proto regex can't handle google.api.http annotations (HIGH)
grpc-extractor.ts — the serviceRe regex uses [^}]* which stops at the first }. Proto services using option (google.api.http) = { get: "/v1/..." }; inside RPCs will have their methods cut short. You'll need to handle nested braces or switch to a proper brace-depth counter.
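A brace-depth counter along these lines might look like the sketch below. The function name and return shape are assumptions for illustration; braces inside comments or unbalanced braces inside string literals are deliberately not handled here.

```typescript
// Returns the body of every `service Name { ... }` block, tracking nesting depth
// so that inner `{ ... }` option blocks do not end the service early.
function extractServiceBlocks(proto: string): { name: string; body: string }[] {
  const blocks: { name: string; body: string }[] = [];
  const headerRe = /\bservice\s+(\w+)\s*\{/g;
  let header: RegExpExecArray | null;

  while ((header = headerRe.exec(proto)) !== null) {
    const bodyStart = headerRe.lastIndex; // first character after the opening brace
    let depth = 1;
    let i = bodyStart;
    while (i < proto.length && depth > 0) {
      const ch = proto[i];
      if (ch === "{") depth++;
      else if (ch === "}") depth--;
      i++;
    }
    if (depth !== 0) break; // unbalanced braces: ignore the malformed remainder
    blocks.push({ name: header[1] ?? "", body: proto.slice(bodyStart, i - 1) });
    headerRe.lastIndex = i; // resume scanning after the closing brace
  }
  return blocks;
}
```

On a service whose first RPC carries a google.api.http option, the returned body still includes the RPCs declared after that option, which a [^}]* pattern would drop.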

3. Service boundary detection needs directory exclusions (HIGH)
service-boundary-detector.ts walks the entire repo tree, only skipping dotfiles and node_modules. On Go repos with vendor/, Java with target/, or Python with __pycache__/.venv — this will be extremely slow. Add exclusions for vendor, target, build, dist, __pycache__, .venv, venv.
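The exclusion list could be applied during the walk roughly as below. To keep the sketch free of file-system calls it uses a hypothetical in-memory DirNode tree; SERVICE_MARKERS and findServiceRoots are likewise illustrative names, and the marker list is only a subset of what the PR describes.

```typescript
// Directory names skipped during the walk; the list follows the review comment.
const EXCLUDED_DIRS = new Set([
  "node_modules", "vendor", "target", "build", "dist", "__pycache__", ".venv", "venv",
]);

// Subset of the marker files named in the PR description.
const SERVICE_MARKERS = new Set(["package.json", "go.mod", "pom.xml", "Cargo.toml", "Dockerfile"]);

// Hypothetical in-memory stand-in for a directory, so the sketch needs no file system.
interface DirNode {
  name: string;
  files: string[];
  dirs: DirNode[];
}

// Collects the paths of directories that contain a service marker,
// never descending into dot-directories or excluded directories.
function findServiceRoots(root: DirNode, prefix = ""): string[] {
  const here = prefix === "" ? root.name : `${prefix}/${root.name}`;
  const found: string[] = [];
  if (root.files.some((f) => SERVICE_MARKERS.has(f))) found.push(here);
  for (const dir of root.dirs) {
    if (dir.name.startsWith(".") || EXCLUDED_DIRS.has(dir.name)) continue;
    found.push(...findServiceRoots(dir, here));
  }
  return found;
}
```

Pruning at the directory level means a vendored go.mod or a package.json under node_modules is never visited, which addresses both the speed concern and false service boundaries.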

4. Double-close of LadybugDB pools (HIGH)
sync.ts finally block calls closeLbug(id) per repo, then cli/group.ts calls closeLbug() (no arg = close all). In an MCP server context, this tears down ALL active pools including unrelated ones. The CLI should not call a blanket closeLbug() — sync's per-id cleanup is sufficient.

5. checkStaleness swallows git errors (LOW but easy fix)
git-staleness.ts catches errors and returns { isStale: false } — masking real failures as "fresh." Should default to isStale: true on errors, or at minimum include a warning field.

6. gRPC normalization mismatch (MEDIUM)
matching.ts:normalizeContractId — gRPC IDs with a slash lowercase only the package prefix, but IDs without a slash lowercase everything. Proto-derived grpc::pkg.Service/Method and source-scan-derived grpc::Service/* will normalize differently and fail to match.
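One possible way to make the two ID shapes comparable is to parse every gRPC ID into the same three parts before normalizing. This is a sketch of one policy under stated assumptions (a missing package matches any package, a missing method becomes a wildcard); it is not what matching.ts currently does, and parseGrpcId, normalizeGrpcId and grpcIdsMatch are hypothetical helpers.

```typescript
interface GrpcId {
  pkg: string;     // lowercased package, "" when absent
  service: string; // case preserved
  method: string;  // "*" when absent
}

function parseGrpcId(id: string): GrpcId {
  // Strip the "grpc::" prefix and any leading slashes so both shapes start the same way.
  const raw = id.trim().replace(/^grpc::/, "").replace(/^\/+/, "");
  const slash = raw.indexOf("/");
  const qualified = slash === -1 ? raw : raw.slice(0, slash);
  const method = slash === -1 ? "*" : (raw.slice(slash + 1) || "*");
  const dot = qualified.lastIndexOf(".");
  return {
    pkg: dot === -1 ? "" : qualified.slice(0, dot).toLowerCase(),
    service: dot === -1 ? qualified : qualified.slice(dot + 1),
    method,
  };
}

// One rule for every shape: lowercase the package only, keep service and method case.
function normalizeGrpcId(id: string): string {
  const { pkg, service, method } = parseGrpcId(id);
  const qualified = pkg === "" ? service : `${pkg}.${service}`;
  return `grpc::${qualified}/${method}`;
}

// Assumption: an absent package or a "*" method on either side acts as a wildcard.
function grpcIdsMatch(a: string, b: string): boolean {
  const x = parseGrpcId(a);
  const y = parseGrpcId(b);
  const pkgOk = x.pkg === "" || y.pkg === "" || x.pkg === y.pkg;
  const methodOk = x.method === "*" || y.method === "*" || x.method === y.method;
  return pkgOk && x.service === y.service && methodOk;
}
```

Wildcard matching is looser than the exact matching this PR implements, so links found this way would probably deserve a lower confidence than exact ones.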

7. writeContractRegistry temp file race (MEDIUM)
The reviewer already flagged this — concurrent syncs for the same group can collide on the temp filename. Use crypto.randomUUID() or process.pid + Date.now() as the suffix instead of just the timestamp.

Test gaps worth closing:

  • createGroupDir is completely untested (force flag, existing group error)
  • No test for mixed contract types in runExactMatch (http + grpc + topic together)
  • No test for gRPC proto files with multiple services or google.api.http annotations
  • CLI integration tests only cover create + list — missing add, remove, sync
  • types.test.ts is pure compile-time checks with zero behavioral coverage — consider removing or replacing with validation tests

Items 1-4 should be fixed before merge. 5-7 can be follow-up if you prefer.

@xkonjin xkonjin left a comment

Code Review: Intra-repo service communication tracking

Overall: This is a substantial feature addition with good architectural separation (extractors, matching engine, service boundary detection). Found a few issues worth addressing:

Bugs / Logic Issues

1. topic-extractor.ts: Missing dedupe after pattern scanning
The dedupe() method uses a key of ${c.contractId}|${c.role}|${c.symbolRef.filePath} but the contractId for topics is topic::${topicName}. If the same topic is referenced multiple times in the same file with the same role, only one will be kept - which is correct. However, if the same topic is consumed AND produced in the same file, they should both be tracked. The current logic handles this correctly since role differs.
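The key described above can be illustrated with a small sketch. TopicContract is a hypothetical, reduced shape and the role values are assumptions; only the key format is taken from the comment.

```typescript
// Reduced, hypothetical contract shape: just the fields that feed the dedupe key.
interface TopicContract {
  contractId: string; // e.g. "topic::orders.created"
  role: "producer" | "consumer";
  symbolRef: { filePath: string };
}

// Keeps the first contract for each (contractId, role, filePath) combination.
function dedupe(contracts: TopicContract[]): TopicContract[] {
  const seen = new Set<string>();
  const unique: TopicContract[] = [];
  for (const c of contracts) {
    const key = `${c.contractId}|${c.role}|${c.symbolRef.filePath}`;
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(c);
  }
  return unique;
}
```

A file that produces a topic twice and consumes it once therefore yields two entries, one per role, which matches the behaviour the comment concludes is correct.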

2. service-boundary-detector.ts: Recursive subdirectory scanning may miss deeply nested source files
In hasSourceFilesInSubdirs(), the function recurses only one level deeper when it finds a directory. This could miss source files at depth > 2. Consider using a breadth-first approach or increasing recursion depth.

3. grpc-extractor.ts: Pattern for Go Register...Server may produce false positives
The regex /\bRegister(\w+)Server\b/g could match strings that are not actual gRPC registrations (e.g., in comments or strings). Consider requiring \s*\( after the pattern to ensure it is a function call.
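A sketch of the tightened pattern, assuming the extractor only needs the service name from each call site. findGoGrpcServers is a hypothetical helper; a commented-out call would still match, so this reduces false positives without removing them.

```typescript
// Collects Xxx from RegisterXxxServer( ... ) call sites in Go source text.
function findGoGrpcServers(source: string): string[] {
  // Created per call because global regexes carry lastIndex state between uses.
  const re = /\bRegister(\w+)Server\s*\(/g;
  const names: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(source)) !== null) {
    names.push(m[1] ?? "");
  }
  return names;
}
```

Prose mentions such as "RegisterLegacyServer is deprecated" no longer match, because the name must be followed by an opening parenthesis.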

Security / Input Handling

4. http-route-extractor.ts: No length limit on path normalization
The normalizeHttpPath() function does not limit input size. While not a direct security issue, malformed input with extremely long paths could cause performance issues.

5. Topic extractor patterns could match malicious topic names
The topic patterns extract any string between quotes. If source code contains something like:

// Do not use: kafka.send('user.${malicious}') 

The extractor would incorrectly extract the partial template string.

Test Coverage Gaps

6. Missing tests for edge cases in service-boundary-detector.ts:

  • No test for repos with only requirements.txt (no source files)
  • No test for empty subdirectory traversal
  • No test for circular symlinks (could cause infinite recursion)

7. Missing integration test for full sync with real LadybugDB
The tests use extractorOverride which bypasses the actual DB integration. The comments acknowledge this, but there is no tracking issue or TODO to add these tests.

8. sync.ts: Error handling for DB operations
In syncGroup(), if initLbug() succeeds but the graph query fails, the error is caught and the repo is added to missingRepos. However, closeLbug() is only called in the finally block which iterates over openPoolIds. If init fails partway through, some pools may not be tracked. This appears handled correctly but warrants a closer look.

Style / Maintainability

9. group.yaml template in storage.ts hardcodes defaults
The template string creates a default config, but any changes to the schema require updating this string manually. Consider using the same defaults object used elsewhere.

10. grpc-extractor.ts: Duplicate proto file scanning
The extractProtoFiles() method scans for **/*.proto files, then scanProtoFile() re-reads each file. This is fine for typical repo sizes but could be optimized by passing the already-read content.

Suggestions

  1. Consider adding rate limiting or max file size checks in the source scanners
  2. Add a metric/counter for how many contracts were extracted per type
  3. Document the confidence scoring methodology (why 0.8 for some patterns, 0.7 for others?)

None of these are blockers - the PR is well-structured and the core logic is sound. The matching engine's handling of same-repo/different-service boundaries is particularly well done.

ivkond and others added 2 commits April 2, 2026 12:55
Spec covers 4 HIGH-priority issues from review: path traversal via
group name, gRPC proto regex nested braces, service boundary detector
directory exclusions, double-close of LadybugDB pools.

Plan: 6 tasks with TDD, ordered by complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Path traversal via group name — add validateGroupName() with regex
   [a-zA-Z0-9][a-zA-Z0-9_-]*, called in getGroupDir (defense in depth)

2. gRPC proto regex can't handle nested braces — replace serviceRe with
   extractServiceBlocks() brace-depth counter (init depth=1, skip
   malformed protos)

3. Service boundary detector directory exclusions — add EXCLUDED_DIRS
   set (vendor, target, build, dist, __pycache__, .venv, venv, .tox,
   .mypy_cache, .gradle, .mvn, out, bin) replacing inline node_modules

4. Double-close of LadybugDB pools — remove blanket closeLbug() from
   cli/group.ts; sync.ts per-id cleanup is sufficient

Tests: 22 new tests across 5 files. Full suite: 4706 passed, 0 failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ivkond commented Apr 2, 2026

Contributor Author

Thanks for the thorough review — all 4 HIGH items are addressed in the latest commits (each atomic, tests-first).

Items 5-7 (staleness error handling, gRPC normalization, temp file race) — I'll fix these as a small follow-up PR right after merge, so this one doesn't grow further.

Proposed next steps toward #606

  1. Bridge graph layer — migrate contract storage from JSON into LadybugDB (bridge.lbug), so Cypher queries can traverse cross-repo edges. This is the prerequisite for everything below.

  2. Cross-repo matching — extend syncGroup to match contracts across different repos (not just intra-repo services). The extractors are already repo-agnostic, so this is mostly wiring the matching engine to the bridge graph.

  3. group_impact MCP tool — given a changed file in repo A, traverse the bridge graph to find affected services/contracts in repos B, C. This is the core deliverable of [group] Cross-repo impact analysis via repository groups #606.

  4. Zero-config extractor integration — run HTTP/gRPC/topic extractors during regular gitnexus analyze (no group setup needed). Makes service communication edges available for all monorepos automatically. Groups remain for multi-repo setups.

Happy to start with (1) once this lands. Does this sequence match your vision for #606, or would you prioritize differently?

@ivkond ivkond requested a review from xkonjin April 3, 2026 08:06

@abhigyanpatwari abhigyanpatwari left a comment

Owner

Pulled the branch and verified — all 4 HIGHs are properly fixed, typecheck clean, 150/150 tests pass. Good work.

Merge approved. Go ahead with #606 — the roadmap you proposed (bridge graph layer → cross-repo matching → group_impact → zero-config extractors) is the right sequence. One note for #606: contract storage should move into bridge.lbug (LadybugDB) instead of JSON — the virtual bridge graph needs Cypher-queryable edges for cross-repo impact traversal.

For the follow-up PR on items 5-7: prioritize the gRPC normalization mismatch (#6) — grpc::ServiceName/* vs grpc::pkg.Service/Method normalize differently and will cause silent matching failures once cross-repo is in play. The staleness and temp-file race are lower priority.

@abhigyanpatwari abhigyanpatwari merged commit 5c4fca2 into abhigyanpatwari:main Apr 3, 2026
12 of 13 checks passed
@ivkond ivkond deleted the feat/intra-repo-service-tracking-clean branch April 3, 2026 11:55
motolese pushed a commit to motolese/datamoto-gitnexus that referenced this pull request Apr 23, 2026
 HIGH fixes

motolese pushed a commit to motolese/datamoto-gitnexus that referenced this pull request Apr 23, 2026
…review

motolese pushed a commit to motolese/datamoto-gitnexus that referenced this pull request Apr 23, 2026
…rvice-tracking-clean

[group] Intra-repo service communication tracking