[group] Intra-repo service communication tracking#626
Conversation
Core foundation for repository group analysis: - Type system: ContractType, ExtractedContract, StoredContract, CrossLink with optional `service` field for intra-repo matching - Config parser for group.yaml (repos, detection flags, matching thresholds) - Contract registry storage with atomic writes - Exact matching engine with per-type normalization (HTTP, gRPC, topic) and intra-repo support (different services within same repo can match) - Extract LadybugDB pool-adapter from MCP backend for reuse by sync pipeline - Git staleness checker for group status reporting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Service communication detection for microservice monorepos: - ServiceBoundaryDetector: auto-detects service boundaries via markers (package.json, go.mod, Dockerfile, pom.xml, Cargo.toml, build.gradle, pyproject.toml, etc.) - HttpRouteExtractor: graph-assisted (Strategy A) with source-scan fallback (Strategy B) for Spring, Express, Laravel, FastAPI providers and fetch/axios consumers - GrpcExtractor: parses .proto files, detects Go/Java/Python/TS gRPC servers (RegisterXxxServer, @GrpcService, add_XxxServicer_to_server, @GrpcMethod) and clients (NewXxxClient, newBlockingStub, XxxStub) - TopicExtractor: Kafka (@KafkaListener, producer.send), RabbitMQ (@RabbitListener, channel.publish/consume), NATS (nc.Subscribe/Publish) across Java, Node, Go, and Python Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire extractors into the sync pipeline with service boundary detection. GroupService provides high-level API for all group operations. - Sync pipeline: orchestrates extraction (HTTP, gRPC, topics) with service boundary assignment and exact matching - GroupService: groupList, groupSync, groupContracts, groupQuery, groupStatus (groupImpact deferred to cross-repo follow-up PR) - CLI: group create/add/remove/list/sync/contracts/query/status - MCP tools: group_list, group_sync, group_contracts, group_query, group_status - Monorepo fixture: 3 services (auth/orders/gateway) connected via gRPC + Kafka + HTTP — all intra-repo cross-links discovered - Documentation: CLI commands and MCP tools added to both READMEs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@ivkond is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 4977 tests passed 46 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: Intra-repo Service Communication Tracking
Overall: Well-structured foundational PR for group analysis. The contract matching system shows good architectural thinking.
Strengths
- Clean type system with ContractType, ExtractedContract, StoredContract, CrossLink — the optional
servicefield for intra-repo matching is a thoughtful addition - Atomic writes in storage layer — writeContractRegistry with temp file + rename pattern prevents corruption during concurrent writes
- Per-type normalization for contract IDs — the HTTP (method + path normalization), gRPC (package lowercase), topic (trim + lowercase), and lib handling shows attention to protocol-specific requirements
- Intra-repo matching logic correctly skips same-service matches while allowing cross-service matching within the same repository
- Pool adapter extraction from MCP backend to core layer is proper refactoring for reuse
Issues
1. Missing error handling in checkStaleness
The function catches errors and returns { isStale: false, commitsBehind: 0 }. This could mask real git failures and report stale indexes as fresh. Consider at least logging the error or returning a tri-state (fresh | stale | unknown).
2. Race condition in contract registry writes
While the atomic rename is good, concurrent calls to writeContractRegistry for the same group could interleave temp file creation. Consider including a unique suffix or timestamp in the temp filename.
3. gRPC normalization edge case
The normalizeContractId function has inconsistent behavior: when there is no slash, the whole string is lowercased; when there is a leading slash, case is preserved. This could cause non-deterministic matching for malformed IDs.
4. MAX_HTTP_METHOD_LEN = 16 seems arbitrary
HTTP methods like SEARCH or custom methods could exceed this. Consider referencing RFC 7231 or using a more generous limit (32?).
5. Missing test coverage for malformed YAML scenarios
The config parser tests validate happy paths well, but missing coverage for: duplicate repo paths, circular links (a->b and b->a), and YAML anchors/aliases.
Security
- No direct security concerns — all paths are validated through path.normalize before file operations
- The git staleness check uses execFileSync with controlled arguments (no shell injection risk)
Verdict
LGTM with minor issues noted. The foundation is solid for building the extractor pipeline on top.
|
Good work on this, the intra-repo service tracking direction is exactly right. The extractors and service boundary detection fill a real gap — this is the foundation for cross-repo analysis. On storage/architecture: The JSON contract registry is fine for this PR — it's already large enough. But in #606, the cross-repo layer will need to use LadybugDB ( Issues to address before merge: 1. Path traversal via group name (HIGH) 2. gRPC proto regex can't handle 3. Service boundary detection needs directory exclusions (HIGH) 4. Double-close of LadybugDB pools (HIGH) 5. 6. gRPC normalization mismatch (MEDIUM) 7. Test gaps worth closing:
Items 1-4 should be fixed before merge. 5-7 can be follow-up if you prefer. |
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: Intra-repo service communication tracking
Overall: This is a substantial feature addition with good architectural separation (extractors, matching engine, service boundary detection). Found a few issues worth addressing:
Bugs / Logic Issues
1. topic-extractor.ts: Missing dedupe after pattern scanning
The dedupe() method uses a key of ${c.contractId}|${c.role}|${c.symbolRef.filePath} but the contractId for topics is topic::${topicName}. If the same topic is referenced multiple times in the same file with the same role, only one will be kept - which is correct. However, if the same topic is consumed AND produced in the same file, they should both be tracked. The current logic handles this correctly since role differs.
2. service-boundary-detector.ts: Recursive subdirectory scanning may miss deeply nested source files
In hasSourceFilesInSubdirs(), the function recurses only one level deeper when it finds a directory. This could miss source files at depth > 2. Consider using a breadth-first approach or increasing recursion depth.
3. grpc-extractor.ts: Pattern for Go Register...Server may produce false positives
The regex /\bRegister(\w+)Server\b/g could match strings that are not actual gRPC registrations (e.g., in comments or strings). Consider requiring \s*\( after the pattern to ensure it is a function call.
Security / Input Handling
4. http-route-extractor.ts: No length limit on path normalization
The normalizeHttpPath() function does not limit input size. While not a direct security issue, malformed input with extremely long paths could cause performance issues.
5. Topic extractor patterns could match malicious topic names
The topic patterns extract any string between quotes. If source code contains something like:
// Do not use: kafka.send('user.${malicious}') The extractor would incorrectly extract the partial template string.
Test Coverage Gaps
6. Missing tests for edge cases in service-boundary-detector.ts:
- No test for repos with only
requirements.txt(no source files) - No test for empty subdirectory traversal
- No test for circular symlinks (could cause infinite recursion)
7. Missing integration test for full sync with real LadybugDB
The tests use extractorOverride which bypasses the actual DB integration. The comments acknowledge this, but there is no tracking issue or TODO to add these tests.
8. sync.ts: Error handling for DB operations
In syncGroup(), if initLbug() succeeds but the graph query fails, the error is caught and the repo is added to missingRepos. However, closeLbug() is only called in the finally block which iterates over openPoolIds. If init fails partway through, some pools may not be tracked. This appears handled correctly but warrants a closer look.
Style / Maintainability
9. group.yaml template in storage.ts hardcodes defaults
The template string creates a default config, but any changes to the schema require updating this string manually. Consider using the same defaults object used elsewhere.
10. grpc-extractor.ts: Duplicate proto file scanning
The extractProtoFiles() method scans for **/*.proto files, then scanProtoFile() re-reads each file. This is fine for typical repo sizes but could be optimized by passing the already-read content.
Suggestions
- Consider adding rate limiting or max file size checks in the source scanners
- Add a metric/counter for how many contracts were extracted per type
- Document the confidence scoring methodology (why 0.8 for some patterns, 0.7 for others?)
None of these are blockers - the PR is well-structured and the core logic is sound. The matching engine's handling of same-repo/different-service boundaries is particularly well done.
Spec covers 4 HIGH-priority issues from review: path traversal via group name, gRPC proto regex nested braces, service boundary detector directory exclusions, double-close of LadybugDB pools. Plan: 6 tasks with TDD, ordered by complexity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Path traversal via group name — add validateGroupName() with regex [a-zA-Z0-9][a-zA-Z0-9_-]*, called in getGroupDir (defense in depth) 2. gRPC proto regex can't handle nested braces — replace serviceRe with extractServiceBlocks() brace-depth counter (init depth=1, skip malformed protos) 3. Service boundary detector directory exclusions — add EXCLUDED_DIRS set (vendor, target, build, dist, __pycache__, .venv, venv, .tox, .mypy_cache, .gradle, .mvn, out, bin) replacing inline node_modules 4. Double-close of LadybugDB pools — remove blanket closeLbug() from cli/group.ts; sync.ts per-id cleanup is sufficient Tests: 22 new tests across 5 files. Full suite: 4706 passed, 0 failed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review — all 4 HIGH items are addressed in the latest commits (each atomic, tests-first). Items 5-7 (staleness error handling, gRPC normalization, temp file race) — I'll fix these as a small follow-up PR right after merge, so this one doesn't grow further. Proposed next steps toward #606
Happy to start with (1) once this lands. Does this sequence match your vision for #606, or would you prioritize differently? |
abhigyanpatwari
left a comment
There was a problem hiding this comment.
Pulled the branch and verified — all 4 HIGHs are properly fixed, typecheck clean, 150/150 tests pass. Good work.
Merge approved. Go ahead with #606 — the roadmap you proposed (bridge graph layer → cross-repo matching → group_impact → zero-config extractors) is the right sequence. One note for #606: contract storage should move into bridge.lbug (LadybugDB) instead of JSON — the virtual bridge graph needs Cypher-queryable edges for cross-repo impact traversal.
For the follow-up PR on items 5-7: prioritize the gRPC normalization mismatch (#6) — grpc::ServiceName/* vs grpc::pkg.Service/Method normalize differently and will cause silent matching failures once cross-repo is in play. The staleness and temp-file race are lower priority.
HIGH fixes Spec covers 4 HIGH-priority issues from review: path traversal via group name, gRPC proto regex nested braces, service boundary detector directory exclusions, double-close of LadybugDB pools. Plan: 6 tasks with TDD, ordered by complexity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…review 1. Path traversal via group name — add validateGroupName() with regex [a-zA-Z0-9][a-zA-Z0-9_-]*, called in getGroupDir (defense in depth) 2. gRPC proto regex can't handle nested braces — replace serviceRe with extractServiceBlocks() brace-depth counter (init depth=1, skip malformed protos) 3. Service boundary detector directory exclusions — add EXCLUDED_DIRS set (vendor, target, build, dist, __pycache__, .venv, venv, .tox, .mypy_cache, .gradle, .mvn, out, bin) replacing inline node_modules 4. Double-close of LadybugDB pools — remove blanket closeLbug() from cli/group.ts; sync.ts per-id cleanup is sufficient Tests: 22 new tests across 5 files. Full suite: 4706 passed, 0 failed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rvice-tracking-clean [group] Intra-repo service communication tracking
Summary
Adds a group system for repository analysis with intra-repo service communication tracking for microservice monorepos. This is the foundation for cross-repo impact analysis (follow-up PR).
What changed:
Why:
The GitNexus author requested intra-repo service tracking as a prerequisite for cross-repo analysis (#606). Proving that microservice monorepos index well is the foundation — once intra-repo works, the same extractors feed directly into the virtual graph for multi-repo support.
Backward compatibility & roadmap
Groups are fully opt-in —
gitnexus analyzecontinues to work exactly as before. For small/medium monorepos (2-20 services), a singleanalyzeon the whole repo already captures everything in one graph, andimpact/querytools see inter-service relationships through existing CALLS/IMPORTS edges.Groups become valuable for:
The gRPC, Kafka/RabbitMQ/NATS, and HTTP extractors currently run only inside
group sync. A natural next step is integrating them into the standardanalyzepipeline so that inter-service communication edges (gRPC calls, topic pub/sub) are captured automatically for all repos — no group setup required. This would make service communication tracking zero-config for monorepos of any size, while groups remain available for cross-repo and advanced use cases.How to verify:
Risk / rollback:
src/core/group/— isolated from existing functionalityservicefield is optional and backward-compatiblegitnexus analyzebehavior is completely unchangedNot in this PR (follow-up):
group_impacttool) — [group] Cross-repo impact analysis via repository groups #606linksin group.yaml)analyzepipeline (zero-config mode)Test plan
🤖 Generated with Claude Code