feat: add IncludeExtractor for C++ cross-repo include tracking (group)#1156

Merged
magyargergo merged 29 commits into
abhigyanpatwari:mainfrom
SZU-WenjieHuang:feat/group-include-extractor
May 9, 2026

Conversation

@SZU-WenjieHuang

Contributor

Summary

Closes #1142

Add cross-repo #include dependency tracking to the Group feature, enabling impact analysis across C/C++ repositories that share header files (e.g., an InterfaceMap repo providing .h headers consumed by a VMap repo).

What changed

New: IncludeExtractor (gitnexus/src/core/group/extractors/include-extractor.ts)

  • Provider side: registers every .h/.hpp/.hxx/.hh file in a repo as a provider contract (include::<relative-path>)
  • Consumer side: parses all C/C++ source/header files for #include "…" directives via tree-sitter, resolves against the repo's own file list (reusing suffixResolve from the ingestion pipeline), and emits unresolved paths as consumer contracts
  • Regex fallback: when tree-sitter's native binding fails on files > 32 KB (parser.parse() throws Invalid argument), falls back to regex extraction (/^[ \t]*#\s*include\s*"([^"]+)"/gm) so large files are not silently skipped
  • Filters angle-bracket includes, well-known system/stdlib headers, and system path prefixes
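The regex fallback described above can be sketched roughly as follows. The pattern is the one quoted in this description; the helper name `extractQuotedIncludes` and the sample input are illustrative assumptions, not code from the PR.

```typescript
// Pattern quoted from the PR description. The `g` flag makes it stateful,
// and the `m` flag lets `^` match at the start of every line.
const INCLUDE_REGEX = /^[ \t]*#\s*include\s*"([^"]+)"/gm;

// Hypothetical helper: returns the path of every quoted #include in `source`.
function extractQuotedIncludes(source: string): string[] {
  const paths: string[] = [];
  INCLUDE_REGEX.lastIndex = 0; // reset state before each scan
  let match: RegExpExecArray | null;
  while ((match = INCLUDE_REGEX.exec(source)) !== null) {
    const path = match[1];
    if (path !== undefined) paths.push(path);
  }
  return paths;
}

const sample = [
  '#include <vector>',
  '#include "map/base/dice_map_view.h"',
  '  #  include "util/log.h"',
  'int x; // #include "not/at/line/start.h"',
].join('\n');

console.log(extractQuotedIncludes(sample));
// → [ 'map/base/dice_map_view.h', 'util/log.h' ]
```

Angle-bracket includes never match because the pattern requires a literal double quote, and the trailing line comment is skipped only because the directive is not at the start of its line.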

Integration into Group sync (sync.ts)

  • IncludeExtractor runs when config.detect.includes is enabled (default: true)
  • writeBridge() is now called after writeContractRegistry() to generate bridge.lbug for cross-repo impact queries

Type system & matching (types.ts, config-parser.ts, matching.ts, manifest-extractor.ts)

  • Added 'include' to ContractType union and DetectConfig
  • Added include:: normalization in normalizeContractId
  • Added include type graph-lookup branch in ManifestExtractor

Key design decisions

  1. Suffix-based local resolution: reuses the existing buildSuffixIndex / suffixResolve from the single-repo ingestion pipeline to distinguish local includes from cross-repo ones — only unresolved includes become consumer contracts
  2. tree-sitter + regex dual strategy: tree-sitter provides accurate AST-based extraction for files ≤ 32 KB; regex handles the long tail of large files (33 files in a 415-file VMap repo exceeded the limit)
  3. Contract ID format: include::map/base/dice_map_view.h — path-normalized, lowercase, matched via runExactMatch

Validation (on two of my real repos)

Metric Before After
Total contracts 1191
Cross-links 141
Matched IF headers vs grep 72/72 (100%)
VMap source files covered 51

Test plan

  • TypeScript compiles cleanly (npx tsc --noEmit)
  • All 7248 existing tests pass (no regressions)
  • Unit tests added (include-extractor.test.ts, 262 lines)
  • End-to-end validation: gitnexus group sync on real C++ repos produces 100% header coverage vs independent grep scan
  • Regex fallback verified: 33 files > 32 KB correctly extracted via regex (8 contained cross-repo includes)
  • MCP impact({repo: "@group", target: "header.h"}) returns non-empty cross field with correct downstream VMap files

@vercel

vercel Bot commented Apr 28, 2026

Someone is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@SZU-WenjieHuang

Contributor Author

Hi @abhigyanpatwari @magyargergo, this PR is ready for review when you have a moment.

What it does: Adds IncludeExtractor for C/C++ cross-repo #include dependency tracking in Group sync — addresses the gap where header-only interface repos currently show 0 cross-links with their consumers.

Validation done:

  • ✅ All 7248 existing tests pass (no regressions)
  • ✅ New unit tests added (262 lines)
  • ✅ End-to-end verified on two real C++ repos — 100% header coverage
  • ✅ TypeScript compiles cleanly

The branch is up to date with main and has no conflicts. Happy to address any feedback :)

Best regards,
Wenjie

Comment thread gitnexus/test/unit/group/include-extractor.test.ts Fixed
Comment thread gitnexus/test/unit/group/include-extractor.test.ts Fixed
Comment thread gitnexus/test/unit/group/include-extractor.test.ts Fixed
Comment thread gitnexus/src/core/group/extractors/include-extractor.ts Fixed
- Remove unused HEADER_GLOB constant in include-extractor.ts
- Use fs.mkdtempSync for secure temp dir creation in tests
  (CodeQL: 'Insecure temporary file')
@SZU-WenjieHuang

Contributor Author

I've pushed a fix commit fcc4319 addressing the 4 CodeQL warnings (1 unused variable + 3 insecure temp files). Thanks!

@github-actions

github-actions Bot commented May 7, 2026

Contributor

Claude finished @magyargergo's task in 5m 44s


PR #1156 Deep Review — IncludeExtractor for C++ cross-repo include tracking


Review bar for this PR

  • IncludeExtractor registers provider contracts for header files only — provider IDs must be path-normalized, deterministic, repo-relative, and cross-platform.
  • Consumer extraction must suppress local includes via suffix resolution before emitting cross-repo contracts, with deterministic behavior under ambiguity.
  • include:: normalization must be byte-for-byte consistent between provider and consumer emit paths, normalizeContractId, and ManifestExtractor graph lookup.
  • writeBridge() must be called after writeContractRegistry() with valid data and must not regress the existing group sync lifecycle.
  • Tests must cover the real runtime path including extraction, normalization, local suppression, filtering, regex fallback, and a provider/consumer round-trip matching scenario.
  • CI, typecheck, ESLint, and Prettier must all be green. Hidden Unicode and GHAS findings must be resolved or explicitly justified.
  • Default detect.includes: true must not make group sync unexpectedly expensive for non-C++ repos.

Current PR state

| Attribute | Value |
| --- | --- |
| Head SHA | fcc4319 (latest, post-GHAS fixes) |
| Status | Open, not draft |
| Commits | 3 |
| Changed files | 7 |
| Gitleaks | ✅ pass |
| Dependency Review | ✅ pass |
| CodeQL | ✅ pass (GHAS issues resolved in latest commit) |
| CI (quality/lint) | FAIL: ESLint Parsing error: Argument expression expected at manifest-extractor.ts:283 |
| CI (quality/format) | FAIL: Prettier SyntaxError: Argument expression expected at manifest-extractor.ts:283 |
| CI (tests/ubuntu, windows, macos) | FAIL: setup-gitnexus action fails (root cause: syntax error prevents compilation) |
| GHAS comments | Resolved in commit fcc4319 (HEADER_GLOB unused var removed, insecure temp file fixed) |
| Hidden Unicode | None found (checked with grep for non-ASCII in both new files) |
| Vercel | Irrelevant to this review |
| PR body accuracy | Partially: describes correct intent, but current HEAD has a blocking syntax error introduced in the manifest-extractor change |

Understanding of the change

Issue #1142 reports that header-only interface repos show 0 cross-repo links with their consumers because the Group sync pipeline has no concept of C/C++ #include dependencies. This PR adds IncludeExtractor which scans every repo for header files (provider contracts include::<path>) and for quoted #include directives in source/header files (consumer contracts, after filtering local matches via suffixResolve). Unresolved quoted includes become cross-repo consumer contracts. The extractor is wired into syncGroup behind config.detect.includes (default true), and writeBridge() is now called after writeContractRegistry() to make cross-repo include links available to the MCP impact engine. The include contract type is also added to types.ts, config-parser.ts, matching.ts, and a new lookup branch in ManifestExtractor.resolveSymbol.


Findings


[BLOCKER] Syntax error in manifest-extractor.ts — missing closing parenthesis

  • Category: Correctness / Build integrity
  • Files: gitnexus/src/core/group/extractors/manifest-extractor.ts:276–283
  • Issue: The include branch inside resolveSymbol calls await executor(...) but the closing ) is missing. The code ends the argument list with { contract: link.contract }, and then flows directly into } else if (link.type === 'custom') without closing the function call:
} else if (link.type === 'include') {
  rows = await executor(
    `MATCH (f:File) WHERE f.filePath = $contract
     ...
     LIMIT 1`,
    { contract: link.contract },   // ← closing ) is missing here
} else if (link.type === 'custom') {

ESLint reports error Parsing error: Argument expression expected at line 283:6. Prettier reports SyntaxError: Argument expression expected at line 283:7. This prevents TypeScript compilation, which cascades to all test/typecheck/setup-gitnexus CI jobs failing.

  • Why it matters: This is a hard syntax error in a file that is on the hot path of every group sync and manifest-link resolution call. The code cannot compile in its current state. All CI checks that depend on compilation are failing because of this single missing character.
  • Recommended fix: Add the missing ) to close the executor(...) call before the } else if (link.type === 'custom') line.
  • Blocks merge: YES

[BLOCKER] CI Gate is failing — not production-ready in current state

  • Category: CI / Validation baseline
  • Files: CI runs linked above
  • Issue: The CI gate job concludes failure. Specifically: ESLint (error, not warning), Prettier (syntax error), and all test/typecheck jobs fail. The quality/lint failure is an error-level ESLint finding, not a pre-existing warning. All test job failures trace back to the same root cause (syntax error prevents setup-gitnexus compilation).
  • Why it matters: DoD §4.2 requires npx tsc --noEmit and npm test to pass. DoD §6 explicitly lists "Pre-commit, prettier, or typecheck was bypassed" and CI failure as Not-Done signals.
  • Recommended fix: Fix the syntax error. Re-run CI.
  • Blocks merge: YES

[HIGH] Path lowercasing can cause collisions on case-sensitive Linux filesystems

  • Category: C/C++ include semantics / Correctness
  • Files: include-extractor.ts:215
  • Issue: normalizeIncludePath lowercases the entire path:
return raw.replace(/\\/g, '/').replace(/^\.\//, '').replace(/\/+/g, '/').toLowerCase();

This means include/Map/Base/View.h and include/map/base/view.h produce the same contract ID include::map/base/view.h. On Linux (case-sensitive), these are distinct files. Two distinct headers can produce identical provider contract IDs, causing one to silently shadow the other in buildProviderIndex. Conversely, a consumer using #include "Map/Base/View.h" and one using #include "map/base/view.h" will both normalize to the same contract, which is correct for matching purposes — but only if the provider also normalizes consistently, which it does here. The real risk is provider collision: if a repo contains Foo.h and foo.h, only one gets registered in the provider index.

  • Why it matters: C++ projects on Linux frequently distinguish case in header paths (e.g. Qt, Boost, kernel headers). Provider collision is a correctness problem for repositories that have mixed-case headers. No test covers Foo.h vs foo.h provider behavior.
  • Recommended fix: Document the case-folding assumption explicitly. Add a unit test asserting behavior when both Foo.h and foo.h exist. Consider emitting a warning when a collision is detected during deduplication.
  • Blocks merge: no (as an isolated finding, unlikely in typical cross-repo header-sharing scenarios, but should be documented and tested)
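A collision warning of the kind recommended here could look roughly like the sketch below. `normalizeIncludePath` is copied from the line quoted in this finding; `findProviderCollisions` and the sample paths are hypothetical and only illustrate the idea.

```typescript
// Normalization as quoted in this finding.
function normalizeIncludePath(raw: string): string {
  return raw.replace(/\\/g, '/').replace(/^\.\//, '').replace(/\/+/g, '/').toLowerCase();
}

// Hypothetical helper: groups header paths by their normalized contract ID
// and returns only the IDs that more than one distinct file maps to.
function findProviderCollisions(headerPaths: string[]): Map<string, string[]> {
  const byId = new Map<string, string[]>();
  for (const headerPath of headerPaths) {
    const id = `include::${normalizeIncludePath(headerPath)}`;
    const group = byId.get(id) ?? [];
    group.push(headerPath);
    byId.set(id, group);
  }
  const collisions = new Map<string, string[]>();
  byId.forEach((paths, id) => {
    if (paths.length > 1) collisions.set(id, paths);
  });
  return collisions;
}

const collisions = findProviderCollisions(['include/Foo.h', 'include/foo.h', 'include/bar.h']);
collisions.forEach((paths, id) => {
  console.warn(`provider ID ${id} is shared by: ${paths.join(', ')}`);
});
```

Running the check before deduplication keeps the case-folding behavior unchanged while making the shadowing visible in the sync log.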

[HIGH] suffixResolve ambiguity can silently suppress true cross-repo includes

  • Category: Local include resolution / False negatives
  • Files: include-extractor.ts:417–419, utils.ts:150–185
  • Issue: suffixResolve is used to classify an include as local. It tries all suffix prefixes (e.g., for "map/base/view.h" it tries "map/base/view.h", then "base/view.h", then "view.h") AND appends all EXTENSIONS from utils.ts (including .tsx, .ts, etc.) to each suffix. If the local repo has any file ending in view.h under any directory — even internal/view.h while the include is "map/base/view.h" — the shorter suffix "view.h" matches, and the include is suppressed as local. The cross-repo contract is silently dropped.
  • Why it matters: This is a correctness risk for large repos with many headers. If a consumer repo happens to have any local file named types.h (extremely common), every #include "something/types.h" in that repo — including genuine cross-repo includes — would be suppressed. No test covers this ambiguity case.
  • Recommended fix: Use only the full include path for local resolution (i.e., call suffixResolve with only pathParts from the complete include string, not allowing truncated suffix matching). Or add an ambiguity signal: when the suffix is shorter than the full include path and multiple local files match, emit the consumer contract anyway (conservative: prefer false positives over false negatives for impact analysis). At minimum, add a test where a consumer has #include "ext/api.h" and the local repo has local/api.h — assert that the consumer contract IS emitted.
  • Blocks merge: no (the existing two-repo validation worked correctly for the author's repos, but this is a systematic risk for many real C++ codebases)
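The first recommendation (match only on the full include path) can be illustrated with a small stand-alone check. This is a sketch under assumptions: `isLocalByFullPath` is a hypothetical helper, not the project's `suffixResolve`, and it ignores extension expansion entirely.

```typescript
// Hypothetical stricter check: an include counts as local only when some repo
// file equals the full include path or ends with "/" + the full include path.
// Shorter suffixes such as the bare file name are never tried.
function isLocalByFullPath(includePath: string, repoFiles: string[]): boolean {
  const wanted = includePath.replace(/\\/g, '/').toLowerCase();
  return repoFiles.some((file) => {
    const candidate = file.replace(/\\/g, '/').toLowerCase();
    return candidate === wanted || candidate.endsWith('/' + wanted);
  });
}

const repoFiles = ['src/local/api.h', 'src/map/base/view.h'];

console.log(isLocalByFullPath('ext/api.h', repoFiles)); // false: only the basename matches
console.log(isLocalByFullPath('map/base/view.h', repoFiles)); // true: full path is a suffix
```

With this rule, `#include "ext/api.h"` stays a cross-repo consumer contract even though the repo has its own `api.h`, which is the behavior the suggested test would assert.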

[MEDIUM] Regex fallback matches #include inside block comments — false positive contracts

  • Category: Extraction correctness / False positives
  • Files: include-extractor.ts:403–409
  • Issue: The fallback regex INCLUDE_REGEX = /^[ \t]*#\s*include\s*"([^"]+)"/gm runs on the raw file content when tree-sitter fails. The m flag makes ^ match the start of any line, so an include inside a block comment like:
/*
#include "cross-repo/old-api.h"    ← this line matches the regex
*/

...will emit a spurious consumer contract. Tree-sitter handles this correctly by walking the AST. The regex fallback has no comment awareness.

  • Why it matters: Header files for large repos often contain commented-out includes or historical includes in comment blocks. These would produce false cross-repo contracts, creating phantom cross-links. The PR body notes 33 files used the regex fallback, and 8 contained cross-repo includes. If any of those 8 had commented includes, false cross-links exist in the end-to-end validation data.
  • Recommended fix: Pre-strip block comments before applying the regex. A minimal approach strips /* ... */ spans using a non-greedy replace before scanning. Alternatively, annotate the limitation and add a test with a block-commented include asserting it is NOT emitted.
  • Blocks merge: no (acceptable risk given the fallback is last-resort for files > 32KB, but should be documented and ideally tested)
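The pre-strip approach might be sketched as below. `stripBlockComments` and `extractOutsideBlockComments` are hypothetical names; the sketch assumes comment delimiters never appear inside string literals and does not handle `//` line comments, so it narrows the false-positive window rather than closing it.

```typescript
// Replace each /* ... */ span with just its newlines, so line starts and
// line numbers in the remaining text are preserved.
function stripBlockComments(source: string): string {
  return source.replace(/\/\*[\s\S]*?\*\//g, (comment) => comment.replace(/[^\n]/g, ''));
}

// Hypothetical helper: applies the quoted-include pattern after stripping.
function extractOutsideBlockComments(source: string): string[] {
  const quotedInclude = /^[ \t]*#\s*include\s*"([^"]+)"/gm;
  const stripped = stripBlockComments(source);
  const paths: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = quotedInclude.exec(stripped)) !== null) {
    const path = match[1];
    if (path !== undefined) paths.push(path);
  }
  return paths;
}

const commented = '/*\n#include "cross-repo/old-api.h"\n*/\n#include "cross-repo/new-api.h"\n';

console.log(extractOutsideBlockComments(commented));
// → [ 'cross-repo/new-api.h' ]
```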

[MEDIUM] meta.source always reports 'tree_sitter' even for regex fallback path

  • Category: Observability / DoD §2.8
  • Files: include-extractor.ts:431
  • Issue: The consumer contract metadata hardcodes source: 'tree_sitter' regardless of which extraction path ran:
meta: {
  source: 'tree_sitter',   // ← wrong when regex fallback was used
  includePath: cleaned,
},

When the regex fallback runs (the entire catch branch starting at line 403), the contracts it produces are indistinguishable from tree-sitter-extracted contracts in the stored registry.

  • Why it matters: This makes it impossible to audit which contracts came from the fallback path when debugging false positives. DoD §2.8 requires errors/decisions to be observable. The author stated 8 regex-fallback files contained cross-repo includes — none of those contracts are traceable after the fact.
  • Recommended fix: Pass a source: 'regex_fallback' value when building contracts from the regex branch. Track whether the current iteration used tree-sitter or regex.
  • Blocks merge: no (observability gap, not a correctness bug)
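One way to track the extraction path is sketched here. Everything in it is hypothetical: `parseWithTreeSitter` is a stand-in for the AST path (injected so the sketch runs without the native binding), and `ExtractedInclude` is not a type from the PR.

```typescript
type ExtractionSource = 'tree_sitter' | 'regex_fallback';

interface ExtractedInclude {
  includePath: string;
  source: ExtractionSource;
}

// Hypothetical wrapper: `parseWithTreeSitter` stands in for the AST path and may throw.
function extractIncludes(
  content: string,
  parseWithTreeSitter: (content: string) => string[],
): ExtractedInclude[] {
  let source: ExtractionSource = 'tree_sitter';
  let paths: string[];
  try {
    paths = parseWithTreeSitter(content);
  } catch {
    // Remember that the fallback ran so the metadata can say so.
    source = 'regex_fallback';
    paths = [];
    const quotedInclude = /^[ \t]*#\s*include\s*"([^"]+)"/gm;
    let match: RegExpExecArray | null;
    while ((match = quotedInclude.exec(content)) !== null) {
      const path = match[1];
      if (path !== undefined) paths.push(path);
    }
  }
  return paths.map((includePath) => ({ includePath, source }));
}

// Simulate the native binding failing on a large file.
const failingParser = (): string[] => {
  throw new Error('Invalid argument');
};

console.log(extractIncludes('#include "ext/api.h"\n', failingParser));
// → [ { includePath: 'ext/api.h', source: 'regex_fallback' } ]
```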

[MEDIUM] No integration test covering group sync → matching → bridge → impact path

  • Category: Tests / DoD §2.7
  • Files: include-extractor.test.ts
  • Issue: All tests exercise IncludeExtractor.extract() directly. The cross-repo matching scenario test (lines 163–196) verifies that contract IDs match between provider and consumer runs, but does NOT exercise:
    • buildProviderIndex → runExactMatch → CrossLink generation
    • writeContractRegistry + writeBridge lifecycle
    • MCP impact({repo: "@group", target: "header.h"}) traversal
  • The PR body claims end-to-end MCP validation on private repos. DoD §2.7 requires tests to cover the real runtime path and to fail if wiring is broken.
  • Why it matters: The writeBridge() call in sync.ts and the include lookup branch in ManifestExtractor.resolveSymbol are on the real production path but are completely untested. A regression in matching normalization (e.g., a slight format change in normalizeContractId) would produce 0 CrossLinks with no test failure.
  • Recommended fix: Add an integration test using syncGroup() directly with two fixture repo paths (one provider, one consumer), asserting that crossLinks contains at least one entry linking the correct provider header to the consumer file. This can be done without a real DB by passing dbExecutor: null.
  • Blocks merge: no (the unit tests plus normalization round-trip test provide reasonable confidence, but this is a clear gap)

[MEDIUM] writeBridge() error not handled — sync failure propagates uncaught

  • Category: Group sync lifecycle / Observability
  • Files: sync.ts:284–292
  • Issue:
if (opts?.groupDir && !opts.skipWrite) {
  await writeContractRegistry(opts.groupDir, registry);
  await writeBridge(opts.groupDir, { ... });  // ← no try/catch
}

If writeBridge fails (e.g., LadybugDB creation fails, disk full, permission error), the entire syncGroup call rejects. Callers may not handle this gracefully.

  • Why it matters: writeContractRegistry persists a JSON file that succeeds; writeBridge writes a binary DB. If the bridge write fails, the JSON registry is written but the bridge is stale. The sync appears to succeed from the registry perspective but impact queries against @group will use stale bridge data.
  • Recommended fix: Wrap writeBridge in a try/catch, log a warning, and continue. Impact queries degrade gracefully on a missing/stale bridge — the contract registry is still valid.
  • Blocks merge: no (acceptable for initial integration, but a robustness gap)
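The recommended guard could take roughly this shape. The writer signatures here are simplified assumptions (the real `writeContractRegistry` and `writeBridge` take more arguments), and `persistGroupState` is a hypothetical wrapper used only for illustration.

```typescript
// Simplified, hypothetical writer signature for the sketch.
type Writer = (groupDir: string) => Promise<void>;

// Writes the registry first, then attempts the bridge. A bridge failure is
// logged and reported to the caller instead of rejecting the whole sync.
async function persistGroupState(
  groupDir: string,
  writeRegistry: Writer,
  writeBridge: Writer,
  warn: (message: string) => void = console.warn,
): Promise<{ bridgeWritten: boolean }> {
  await writeRegistry(groupDir); // registry errors still propagate
  try {
    await writeBridge(groupDir);
    return { bridgeWritten: true };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`bridge write failed, impact queries may use stale data: ${reason}`);
    return { bridgeWritten: false };
  }
}
```

Returning `bridgeWritten` lets the caller surface a stale bridge in the sync result rather than relying on the log alone.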

[LOW] canExtract() always returns true — non-C++ repos scanned unconditionally

  • Category: Performance / DoD §2.6
  • Files: include-extractor.ts:258–260
  • Issue: canExtract() returns true for all repos. This triggers extract() for every repo in the group, including JavaScript, Python, Go, and Ruby repos. The extract method does two glob calls (one for all files, one for SOURCE_GLOB) on the filesystem before finding nothing. For groups with many non-C++ repos, this adds N × 2 filesystem traversals on every group sync.
  • Why it matters: The default detect.includes: true means this overhead is active by default. For groups with 10+ non-C++ repos, this is measurable overhead with no benefit.
  • Recommended fix: Add a quick pre-check in canExtract() or at the start of extract() that checks for any C/C++ file existence before proceeding (e.g., a fast glob limited to *.{h,hpp,hxx,hh,c,cpp,cc,cxx} at depth 1). Or check if SOURCE_GLOB matches produce any results before building the suffix index.
  • Blocks merge: no
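A cheap pre-check along these lines might look like the sketch below. It assumes the caller already has a shallow listing of file names (for example, the top-level directory entries); how that listing is obtained is left out, and `looksLikeCppRepo` is a hypothetical name.

```typescript
// Extensions taken from the glob suggested in this finding.
const CPP_EXTENSIONS = new Set(['.h', '.hpp', '.hxx', '.hh', '.c', '.cpp', '.cc', '.cxx']);

// Hypothetical pre-check: true as soon as one file name has a C/C++ extension.
// `some` stops at the first hit, so a C++ repo is recognised without a full scan.
function looksLikeCppRepo(fileNames: string[]): boolean {
  return fileNames.some((name) => {
    const dot = name.lastIndexOf('.');
    return dot >= 0 && CPP_EXTENSIONS.has(name.slice(dot).toLowerCase());
  });
}

console.log(looksLikeCppRepo(['package.json', 'index.ts', 'README.md'])); // false
console.log(looksLikeCppRepo(['CMakeLists.txt', 'main.cpp'])); // true
```

A shallow listing can miss repos that keep all sources in nested directories, so a negative result is safer as a hint to skip expensive work than as a hard gate.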

[LOW] Deduplication key allows O(N_files × M_headers) CrossLinks

  • Category: Performance / Graph explosion
  • Files: include-extractor.ts:444–453
  • Issue: The dedup key is ${contractId}|${role}|${symbolRef.filePath}. This means 100 source files each including the same external header produce 100 distinct consumer contracts (different symbolRef.filePath), each generating a CrossLink to the single provider. For a C++ project with 500 source files including a widely-used interface header, this produces 500 CrossLinks per external header. With 72 matched headers (as in the PR's validation), this could be ~36,000 CrossLinks.
  • Why it matters: The dedupeCrossLinks function in sync.ts deduplicates by from.symbolUid, which is per-file. The bridge graph and impact response would contain thousands of CrossLinks for heavily-included headers, potentially causing MCP response size and traversal cost issues.
  • Recommended fix: Consider deduplication at the contract level (file-to-contract), not file-to-file level, for include contracts. Or apply a cap per (consumer_repo, provider_contract) pair.
  • Blocks merge: no (manageable at current repo sizes, needs monitoring at scale)
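Contract-level deduplication could be sketched as follows. The `ConsumerContract` and `AggregatedConsumer` shapes are simplified assumptions for illustration and do not mirror the project's contract types.

```typescript
interface ConsumerContract {
  repo: string;
  contractId: string;
  filePath: string;
}

interface AggregatedConsumer {
  repo: string;
  contractId: string;
  filePaths: string[];
}

// Hypothetical aggregation: one entry per (repo, contractId) pair, with the
// including files kept as a list instead of one contract per file.
function aggregateByContract(contracts: ConsumerContract[]): AggregatedConsumer[] {
  const byKey = new Map<string, AggregatedConsumer>();
  const result: AggregatedConsumer[] = [];
  for (const contract of contracts) {
    const key = `${contract.repo}|${contract.contractId}`;
    let entry = byKey.get(key);
    if (!entry) {
      entry = { repo: contract.repo, contractId: contract.contractId, filePaths: [] };
      byKey.set(key, entry);
      result.push(entry);
    }
    if (entry.filePaths.indexOf(contract.filePath) < 0) entry.filePaths.push(contract.filePath);
  }
  return result;
}

const aggregated = aggregateByContract([
  { repo: 'vmap', contractId: 'include::ext/api.h', filePath: 'src/a.cpp' },
  { repo: 'vmap', contractId: 'include::ext/api.h', filePath: 'src/b.cpp' },
  { repo: 'vmap', contractId: 'include::ext/types.h', filePath: 'src/a.cpp' },
]);

console.log(aggregated.length); // 2
```

Under this shape the worst case from the finding (500 files × 72 headers) yields at most 72 entries per consumer repo, each carrying its file list.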

C/C++ include semantics assessment

Quoted includes: Correctly captured. Tree-sitter preproc_include node with path: (_) @import.source correctly identifies #include "..." nodes at the AST level.

Angle-bracket includes: Correctly filtered via isAngleBracketInclude() which checks the raw node text for <...> wrapping. The tree-sitter path handles this at the AST level. The regex fallback INCLUDE_REGEX only captures quoted form ("...") — angle brackets are not matched. ✅

Whitespace variants (# include "foo.h", #include "foo.h"): Tree-sitter handles whitespace in the preprocessor grammar correctly. The regex ^[ \t]*#\s*include\s*"([^"]+)" handles leading whitespace and space between # and include and between include and ". ✅

Comments: Tree-sitter correctly ignores // #include "fake.h" and /* #include "fake.h" */ via AST context. The regex fallback is line-anchored but cannot detect block comments spanning multiple lines (see finding above). ⚠️

Macro includes (#include HEADER_PATH): Tree-sitter query targets path: (_) which would capture the macro node. However, isAngleBracketInclude() checks for <...> and the cleaned = rawText.replace(/['"<>]/g, '') call would produce the macro name without filtering it. The macro name would then fail isSystemHeader and suffixResolve, producing a spurious consumer contract like include::header_path. This is a false positive risk for macro-include patterns. The regex fallback correctly ignores macro includes since it requires the literal "..." quote form.
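A guard against the macro case could classify the operand before stripping delimiters. This sketch assumes `rawText` is the text of the captured path node, as described above; `classifyIncludeOperand` is a hypothetical helper.

```typescript
type IncludeOperand = 'quoted' | 'angle' | 'macro';

// Hypothetical classifier: anything that is not wrapped in "..." or <...>
// (for example a bare identifier such as HEADER_PATH) is treated as a macro.
function classifyIncludeOperand(rawText: string): IncludeOperand {
  const text = rawText.trim();
  if (text.length >= 2) {
    const first = text.charAt(0);
    const last = text.charAt(text.length - 1);
    if (first === '"' && last === '"') return 'quoted';
    if (first === '<' && last === '>') return 'angle';
  }
  return 'macro';
}

console.log(classifyIncludeOperand('"map/base/view.h"')); // quoted
console.log(classifyIncludeOperand('<vector>')); // angle
console.log(classifyIncludeOperand('HEADER_PATH')); // macro
```

Emitting consumer contracts only for the `quoted` class would keep `include::header_path` out of the registry and bring the tree-sitter path in line with the regex fallback.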

File extensions: .c, .h, .cpp, .cc, .cxx, .hpp, .hxx, .hh are all handled via getLanguageForFile. ✅

Cross-platform separators: Backslashes are converted to forward slashes in normalizeIncludePath. ✅

Case sensitivity: Lowercasing applied uniformly — consistent for matching but risks collision on Linux. ⚠️ (see Finding above)


Compiler frontend / extraction assessment

Tree-sitter query: (preproc_include path: (_) @import.source) @import is a correct S-expression query for the tree-sitter C/C++ grammar. The preproc_include node has a path child that is either a string_literal (for quoted form) or a system_lib_string (for angle-bracket form). Capturing (_) handles both, with isAngleBracketInclude filtering the angle-bracket form afterward. ✅

Parser failure handling: Caught in the outer try/catch at line 383. Fallback to regex is triggered. Parse errors don't crash group sync. ✅

Regex fallback: INCLUDE_REGEX.lastIndex = 0 is correctly reset before each exec loop. The regex is line-anchored (^ with m), not ReDoS-prone (the [^"]+ character class is safe). ✅ Block comment false positives remain (see finding).

Query cache: Language objects (C, Cpp) are module-level singletons, making them safe Map keys. Queries are compiled once per language, not per file. ✅

Deduplication: Provider contracts are deduped by contractId|role|filePath. Consumer contracts are not deduped per-include across files by design (different symbolRef). ✅ for providers, intentional for consumers.

Source location metadata: Not preserved per include directive — only the file path is tracked. This is acceptable for the current model.


Local resolution assessment

suffix index: buildSuffixIndex builds an O(N) index over all files in the repo. Used O(1) per lookup. ✅ for performance.

Local include suppression: Correctly implemented — suffixResolve result !== null means local, skip. ✅ for the happy path.

Ambiguity under suffix truncation: The suffixResolve function tries progressively shorter suffixes. A 3-part include path a/b/c.h will match against local x/y/c.h when c.h is a shared suffix. This is the false negative risk documented in the finding above. ⚠️

Include search paths: Only suffix-based matching is used — no concept of -I include directories or relative-to-including-file resolution. For typical cross-repo scenarios this is fine; for complex build systems with custom include paths this may miss resolutions or incorrectly classify. Acceptable for the stated use case.

Performance: Two glob calls per repo + one buildSuffixIndex. For a 10K-file repo, the index build is O(10K × avg_path_depth) ≈ fast. ✅


Provider/consumer contract assessment

Provider side: Only .h, .hpp, .hxx, .hh files (guarded by HEADER_EXTENSIONS). Source files .c, .cpp, etc. are NOT registered as providers. ✅

Contract ID format: include::<normalized-relative-path>. The graph strategy reads f.filePath from the DB, and extractProvidersGraph does not strip the repo root prefix from it, so if the graph stores absolute paths, the graph-based contract ID would contain an absolute path and would not match the consumer's relative-path contract ID. The filesystem fallback (which runs when dbExecutor is null, as happens when group sync's pool init fails) uses repo-relative paths correctly. In normal group sync the DB strategy runs first and would produce mismatched IDs whenever it returns rows. This is a potential latent bug if the graph strategy is ever used, but it is moot if extractProvidersGraph typically returns 0 results. That seems likely: the filter WHERE f.filePath =~ '.*\\.(h|hpp|hxx|hh)$' depends on \\. surviving both the JavaScript and the Cypher string layers as an escaped dot, and that escaping may not behave as intended.

System header filtering: Comprehensive list of C and C++ standard headers plus system path prefixes. Adequate for most practical use cases. Some embedded/platform headers (POSIX, OpenBSD sys/queue.h) are filtered via prefix. Project headers with names similar to stdlib (e.g., a project's own string.h) could be incorrectly suppressed, but this edge case is documented.

Duplicate contracts: The dedupe() method handles duplicates per contractId|role|filePath. ✅


Group sync / matching assessment

detect.includes gate: Correctly implemented at sync.ts:173. IncludeExtractor only runs when config.detect.includes === true. ✅

Default includes: true: Set in config-parser.ts:DEFAULT_DETECT. Intentional per PR design. Non-C++ repos pay the overhead of two extra glob calls (see performance finding).

normalizeContractId for include: include::${rest.replace(/\\/g, '/').replace(/^\.\//, '').replace(/\/+/g, '/').toLowerCase()} — identical to normalizeIncludePath. Idempotent: applying twice produces the same result. ✅

Cross-type matching: buildNoisyContractFilter only filters HTTP contracts. findMatchingKeys checks exact normalized key equality — no cross-type collisions possible since the type prefix (include::) differs from all other contract types. ✅

ManifestExtractor include branch: Present at line 276, but it contains the syntax error described above. Additionally, the query WHERE f.filePath = $contract uses the raw link.contract value from group.yaml (not normalized), and filePath in the graph is likely an absolute path. This means a manifest-declared include link would always fail to resolve a symbol and fall back to the synthetic UID. This is acceptable (synthetic UIDs work for cross-impact), but means the resolveSymbol include branch is effectively dead code for graph-based resolution. Consider documenting this.

writeBridge lifecycle: Called after writeContractRegistry in the same if (opts?.groupDir && !opts.skipWrite) block. Order is correct: the registry is written first, and the bridge is generated from the same in-memory data. No error handling (see finding). ✅ for lifecycle correctness.

Idempotency: Each sync completely replaces the contract list (no incremental merge). Running sync twice produces the same output. ✅


MCP / impact assessment

Cross-links visible to impact: writeBridge() is called with crossLinks that includes include-type CrossLinks produced by runExactMatch. Impact queries traverse the bridge DB. ✅ for wiring.

Target ambiguity: impact({repo: "@group", target: "header.h"}) — if target is a basename, multiple provider headers from multiple repos could match. This is a pre-existing concern in the impact engine, not introduced by this PR. The review of actual MCP tools/resources handling is out of scope per the brief.

Cycle prevention: Not analyzed in the include path specifically. The existing group impact depth/visited controls apply to all CrossLink types including include. ✅ for existing safeguards.

Existing impact regression: No changes to runExactMatch, runWildcardMatch, or buildProviderIndex logic. The include type follows the same matching path as HTTP/gRPC/thrift contracts. ✅


Test assessment

Test area Coverage
Provider header registration (.h, .hpp)
.cpp not registered as provider
Unresolved includes → consumer contracts
Local include suppression
Angle-bracket includes filtered
System headers filtered
System path prefixes filtered
Provider/consumer ID round-trip matching
normalizeContractId for include (backslash, casing, leading ./, double slashes)
Deduplication (same include in multiple files) ✅ (though semantics allow N consumer contracts)
Block comment false positive (regex fallback) ❌ Missing
Ambiguous suffix resolution (local view.h suppresses cross-repo map/base/view.h) ❌ Missing
Macro includes (#include MACRO) ❌ Missing
Case collision (Foo.h vs foo.h) ❌ Missing
detect.includes: false disables extraction ❌ Missing
Group sync integration (syncGroup → crossLinks) ❌ Missing
regex fallback path (via large file or forced parse failure) ❌ Missing (tree-sitter path always succeeds on small test files)

The test for deduplication (lines 202–214) has a semantic confusion: the test comment says "Both files include ext/api.h — each should produce a separate consumer contract (different symbolRef.filePath)" with toHaveLength(2). This is intentional but contradicts the section title "deduplication". The actual dedup behavior (same file including the same header twice) is not tested.

Also: expect(consumers[0].confidence).toBe(0.85) — DoD §2.7 warns against toBeGreaterThanOrEqual but also cautions that exact assertions should be meaningful. A hardcoded 0.85 confidence is a magic number; its business meaning is not tested.


Performance assessment

Default-on scanning: Non-C++ repos still trigger IncludeExtractor.extract() because canExtract() always returns true. Each repo pays for 2 glob calls. For a group with 20 repos, this is 40 extra filesystem scans per sync even if none are C++ repos. Acceptable at current scale, but detect.includes: false provides an escape hatch.

Parser reuse: new Parser() is created once per extractConsumers() call, shared across all files. Parser.Query is cached per language. ✅

Suffix index: Built once in extract(), reused across all source files. O(N_files) construction. ✅

Regex fallback performance: INCLUDE_REGEX is stateful (g flag), correctly reset with lastIndex = 0. Each fallback call scans the file linearly. For a 500KB file (large C++ source), this is fast. ✅

Memory: All files are read into memory via readSafe() before parsing. A repo with 200 large .cpp files could transiently hold significant memory if tree-sitter fails on all of them. For typical repos this is acceptable.


GHAS / hidden Unicode / hygiene assessment

GHAS findings: The fcc4319 commit addresses all 4 GHAS findings: the unused HEADER_GLOB constant was removed, and the three insecure temp-file creations in tests were replaced with fs.mkdtempSync. CodeQL check is now ✅ green.

Hidden Unicode: No non-ASCII characters found in include-extractor.ts or include-extractor.test.ts. Bidi controls absent. ✅

Lint errors introduced by this PR: The ESLint error (not warning) at manifest-extractor.ts:283 was introduced by this PR's include branch addition. The file's other warnings are pre-existing.

Scope: The diff is appropriately scoped. No unrelated reformatting. sync.ts additions are minimal and proportionate. ✅


Elegance / maintainability assessment

The overall design is clean and follows the existing extractor pattern (ContractExtractor interface). The dual strategy (graph-assisted providers → filesystem fallback) mirrors other extractors in the pipeline. The SOURCE_GLOB constant cleanly bounds what gets scanned. The SYSTEM_HEADERS set is comprehensive and readable.

One design concern: the provider graph strategy (extractProvidersGraph) filters with the Cypher regex '.*\\.(h|hpp|hxx|hh)$'. The pattern itself is sound for forward-slash paths such as /repo/path/include/foo.h, because the intended \. matches a literal dot. The risk lies in the escaping layers: the backslash must survive both the JavaScript string (or template literal) and the Cypher string literal, which needs careful escaping. In practice extractProvidersGraph returns 0 results and falls through to the filesystem fallback, so this is latent rather than active.

Future extensibility: The extractor follows the ContractExtractor interface cleanly. Adding support for -I include directories or conditional compilation blocks would be natural extensions without requiring architectural changes. ✅


Final verdict

❌ Not production-ready

The PR introduces a syntax error in manifest-extractor.ts (missing closing parenthesis in the include branch of resolveSymbol) that causes ESLint to report a parse error, prevents Prettier from formatting the file, and fails TypeScript compilation. This cascades to all test, typecheck, and CI gate jobs failing. A PR that cannot compile is not production-ready by any reasonable bar.

Beyond the syntax error, there are two high-priority design issues that stop short of blocking on their own but together suggest the implementation needs one more iteration: (1) suffixResolve ambiguity can silently suppress true cross-repo includes when a local file shares a suffix with the external header, a systematic false-negative risk in large C++ repos; and (2) the regex fallback can produce false-positive contracts from #include directives inside block comments. Neither has a test. The missing integration test for the syncGroup → matching → CrossLink → writeBridge path means the end-to-end contract matching and bridge generation code is untested in CI.

Required before merge: Fix the syntax error in manifest-extractor.ts:283. Once CI is green, the remaining issues (suffix ambiguity test, regex fallback comment test, source metadata, integration test) should be addressed in a follow-up commit or treated as required follow-ups in this PR depending on the team's bar.

HuangWenjie added 4 commits May 7, 2026 19:06
The 'include' branch in ManifestExtractor.resolveSymbol was missing
the closing ); for the executor() call, causing a syntax error that
broke ESLint, Prettier, and the full test CI on all platforms.

Reported by Claude PR review on abhigyanpatwari#1156.

Upstream removed these in commit 3f0c74f (ladybugdb 0.16.0 upgrade).
Commit 3f5d21c accidentally restored them during a rebase dance.

Adding 'include' pushed the array over prettier's 100-char limit,
so prettier prefers multi-line. Apply the reformat to unbreak
ci-quality/format job.

… findings #3-#7

Claude Deep Review raised 7 findings on the IncludeExtractor. #1/#2
(BLOCKERs) were fixed earlier. This commit closes the remaining five.

#3 HIGH  case-sensitive FS -> provider contract-id collision
  Document the deliberate case-folding trade-off on normalizeIncludePath
  (matches C/C++ convention on Windows/macOS; collapses Foo.h & foo.h on
  Linux). Add a unit test pinning the behavior.

#4 HIGH  suffixResolve short-suffix match silently drops cross-repo include
  When a local file ends with the same basename as an external include
  (e.g. local internal/api.h vs. #include "ext/api.h"), suffixResolve
  returned a bogus local hit and suppressed the cross-repo consumer.
  Replace the suffixResolve lookup inside include-extractor with a
  strict isLocalInclude() that only accepts full-path hits via
  SuffixIndex.get / getInsensitive. Callers of suffixResolve elsewhere
  are unaffected. Add 3 unit tests covering the regression.

#5 MEDIUM regex fallback matched #include inside /* ... */
  Strip block comments before running the fallback regex scan.
  Add a unit test.

#6 MEDIUM meta.source was hard-coded to 'tree_sitter'
  Track the actual extraction path with an extractionSource local and
  write it into meta.source so downstream audits can distinguish
  tree-sitter parses from regex fallbacks. Add 2 unit tests.

#7 MEDIUM missing end-to-end coverage
  Add test/integration/group/include-extractor-sync.test.ts with 3
  cases exercising extractor -> syncGroup -> CrossLink (mocked
  contracts, mixed-case/backslash normalization, real temp repos).

Tests: 21 unit + 3 integration, all green.
@SZU-WenjieHuang

Contributor Author

Review response + CI status

Thanks @magyargergo @claude for the deep review. All 7 findings are addressed below, and I want to flag one CI signal that is unrelated to this PR.


Claude review findings — status

| # | Severity | Finding | Status | Commit |
|---|----------|---------|--------|--------|
| 1 | BLOCKER | CodeQL alerts on path handling | ✅ Fixed | fcc4319a |
| 2 | BLOCKER | manifest-extractor.ts syntax error from earlier patch | ✅ Fixed | 3f5d21c5 |
| 3 | HIGH | Case-sensitive FS → provider contract-id collision (Foo.h vs foo.h on Linux) | ✅ Documented + test | 298c0674 |
| 4 | HIGH | suffixResolve short-suffix match silently drops cross-repo #include | ✅ Fixed + 3 tests | 298c0674 |
| 5 | MEDIUM | Regex fallback matched #include inside /* ... */ block comments | ✅ Fixed + test | 298c0674 |
| 6 | MEDIUM | meta.source hard-coded to 'tree_sitter' | ✅ Fixed + 2 tests | 298c0674 |
| 7 | MEDIUM | No end-to-end test for extractor → syncGroup → CrossLink | ✅ Added | 298c0674 |

Verification

  • npx prettier --check . — clean
  • npx eslint . — 0 errors, 0 warnings
  • npx tsc --noEmit — only the pre-existing run.ts(160,9) TS2322 remains (not introduced by this PR)
  • New tests: 21 unit + 3 integration = 24/24 green locally
  • Full group-test regression: 482 passed (remaining flake is a pre-existing bridge-db.test.ts worker-exit, unrelated)

CI status note: tests / windows-latest failure is pre-existing

The Windows job reports 10 failing tests. None of them touch IncludeExtractor, group/sync, or any file modified by this PR. All 10 are Kotlin-parser related and reproduce on main HEAD before this PR:

| Failure | Root cause |
|---------|------------|
| test/integration/resolvers/kotlin.test.ts (3 cases) | Kotlin fixtures produce empty results on Windows |
| test/unit/type-env.test.ts:21, test/unit/call-form.test.ts:13 | Cannot find package 'tree-sitter-kotlin': optionalDependencies fails to build on the Windows runner, so the subsequent static import 'tree-sitter-kotlin' throws ERR_MODULE_NOT_FOUND |
| test/integration/cross-file-binding.test.ts (5 cases) | Cascaded: the fixture relies on Kotlin resolution |

All other platforms are green:

  • tests / ubuntu-latest
  • tests / macos-latest
  • quality (prettier / eslint / tsc / tsc-web / workflow-convention)
  • e2e
  • scope-parity

Happy to iterate further — let me know if any of the above needs adjustment. 🙏

@github-actions

github-actions Bot commented May 7, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
|-------|--------|---------|
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
|-------|--------|--------|---------|----------|
| 8344 | 8343 | 0 | 1 | 373s |

✅ All 8343 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

| Metric | Coverage | Covered | Base | Delta | Status |
|--------|----------|---------|------|-------|--------|
| Statements | 77.97% | 25340/32499 | N/A% | | 🟢 ███████████████░░░░░ |
| Branches | 66.44% | 15998/24077 | N/A% | | 🟢 █████████████░░░░░░░ |
| Functions | 83.19% | 2554/3070 | N/A% | | 🟢 ████████████████░░░░ |
| Lines | 81% | 22897/28265 | N/A% | | 🟢 ████████████████░░░░ |

📋 View full run · Generated by CI

magyargergo and others added 11 commits May 7, 2026 22:36
LadybugDB's `new Database()` raises `Could not set lock on file` from
local_file_system.cpp synchronously inside the constructor — before any
query is issued, so `withLbugDb`'s query-time retry never sees it. On
Windows CI this surfaces as flaky integration tests due to AV-scanner
holds, libuv handle-release lag, and stale `.wal` sidecars from aborted
prior runs.

This change closes the gap at *open time*:

- `openLbugConnection` now wraps `new lbug.Database()` in a bounded
  busy-retry (5x100ms back-off) inside `lbug-config.ts`. Errors that
  exhaust the budget are tagged via `LBUG_OPEN_RETRY_EXHAUSTED` so
  `withLbugDb`'s outer 3x retry skips re-retrying a freshly-exhausted
  path (eliminates the 3x5=15-attempt / ~6s tail latency).
- For recognized test fixtures only (immediate-parent dir matches a
  known prefix AND resolves under `os.tmpdir()`), one final stale-
  sidecar sweep removes `.wal`/`.lock` and retries once. Production
  paths never enter this branch.
- `safeClose` on Windows runs a bounded `fs.open` probe to absorb
  native handle-release lag; logs a warning if the probe exhausts so
  operators can spot AV interference.
- `isDbBusyError` is now defined in `lbug-config.ts` as the single
  source of truth, re-exported from `lbug-adapter.ts` for compatibility.
- New tests cover open-time retry (happy/retry/exhaust/non-busy/tag),
  stale-sidecar sweep (test-fixture-only, production-rejection,
  preserves-original-error), `isTestFixturePath` direct unit suite
  (accept/reject/traversal/nested/trailing-sep), and
  `waitForWindowsHandleRelease` (openable/ENOENT/no-leak).
- The two new test files are added to vitest's existing serialized
  `lbug-db` project (already `fileParallelism: false`).

Closes the chronic Windows CI flake on lbug-touching integration tests
while preserving the existing single-writable-Database-per-process
LadybugDB contract. No public API surface changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… directly

The re-export from lbug-adapter.ts was a transitional convenience — with
the matcher now living in lbug-config.ts, having two import paths for the
same symbol invites future drift. Updated the two real consumers
(lbug-lock-retry.test.ts, lbug-open-retry.test.ts) to import from
lbug-config directly, removed the re-export equality test (now vacuous),
and refreshed the explanatory comment so it no longer references a
re-export pattern that doesn't exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… on Windows

doInitLbug logs "⚠️ Schema creation warning: ... Could not set lock on
file" on every CREATE NODE TABLE call after the first init on a given
dbPath, on Windows. The lock is internal to LadybugDB v0.16.1 and is
resolved before the table is created — same tolerance pattern as the
existing "already exists" filter. Genuine cross-process lock contention
still surfaces on the next operation through withLbugDb's retry, so
filtering at the schema-init catch only suppresses noise, not signal.

Also extend the safeClose Windows handle-release probe to cover the
.wal sidecar (the previous Database's WAL handle was the slowest to
release, surfacing as the schema-query lock contention) and switch the
probe back to 'r+' so it actually detects exclusive locks.

Test loop in lbug-close-handle-release.test.ts simplified to 10 plain
iterations now that the underlying noise is filtered upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop redundant `could not set lock` term — already subsumed by `lock`.
- Document the intentionally-broad matcher: graph-DB lock-shaped errors
  ("deadlock", "unlock failed", "lock contention", "could not open lock
  file") are all treated as transient. If a non-transient surfaces,
  tighten the matcher rather than raise the retry budget.
- Add positive test cases covering those lock-shaped strings so the
  intent is visible and a future tightening would deliberately break
  these.
- Fix the open-retry back-off comment: max sleep is 100+200+300+400 =
  1000ms (no sleep after the final attempt), not 1.5s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 8, 2026

Contributor

Claude finished @magyargergo's task in 8m 41s —— View job


PR #1156 Deep Review — IncludeExtractor for C++ cross-repo include tracking

Review tasks:

  • Post initial comment
  • Gather repo context and read all changed files
  • Read related extractor/pipeline/utils files
  • Check CI status and download failure logs
  • Run hidden Unicode checks
  • Verify branch hygiene and git log
  • Analyze Windows failure root cause
  • Full review written

Review bar for this PR

  • include must be consistent across ContractType, DetectConfig, normalizeContractId, ManifestExtractor, runExactMatch, writeContractRegistry, and writeBridge.
  • Provider contracts must use repo-relative paths so normalizeContractId produces matching IDs across repos.
  • Consumer contracts must suppress local includes using exact full-path suffix lookup, not truncated suffix matching.
  • System/stdlib includes, angle-bracket includes, and system path prefixes must be filtered before emitting consumer contracts.
  • detect.includes: false must completely disable extraction.
  • writeBridge() must be called after writeContractRegistry() and must not break existing group sync tests on any CI platform.
  • CI (format, lint, typecheck, tests on all platforms), CodeQL, and Gitleaks must be green on the latest head.

Current PR state

| Attribute | Value |
|-----------|-------|
| Head SHA | 28a3d99 (latest merge commit) |
| Feature commits | 2cc66c9, fcc4319, 3f5d21c, de6904c, 8df93ee, 298c067 |
| Merge commits | 6 (c61ceff, 658f56b, 8170c79, ed4b738, f6a9d5f, 28a3d99) |
| Status | Open, not draft |
| Changed files (PR-specific diff) | 8 TypeScript files |
| quality/format | ✅ PASS |
| quality/lint | ✅ PASS |
| quality/typecheck | ✅ PASS |
| tests/ubuntu | ✅ PASS |
| tests/macos | ✅ PASS |
| tests/windows | FAIL: sync.test.ts > syncGroup > writes registry to groupDir when skipWrite is false (ENOTEMPTY) |
| CI Gate | FAIL |
| CodeQL | ✅ PASS |
| Gitleaks | ✅ PASS |
| Dependency Review | ✅ PASS |
| Hidden Unicode | ⚠️ U+2026 (horizontal ellipsis) and U+2014 (em dash) in comments only; not in executable code, no bidi controls. Acceptable. |
| Vercel | Irrelevant |

Branch hygiene assessment

merge-from-main commits present but harmless and merge-safe — with one caveat that warrants explanation.

The 6 merge commits are standard GitHub merge-branch operations that bring upstream fixes in. The de6904c chore: drop test/global-setup.ts + test/vitest.d.ts commit is a direct consequence of the PR's rebase dance: commit 3f5d21c accidentally restored files that upstream had deleted; de6904c corrects this. The final diff (8 files) is correctly scoped to Group include extraction and tests. No lockfile churn, workflow changes, or unrelated package version bumps were introduced by the PR-specific commits. The merge commits do bring in unrelated upstream changes, but they do not modify any of the PR's changed files.


Understanding of the change

Intra-repo C/C++ include resolution in the single-repo ingestion pipeline creates IMPORTS edges for files within the same repo boundary. Group sync cannot see these cross-repo boundaries because it has no concept of header contracts. A header-only interface repo (map/base/view.h) and a consumer repo (src/controller.cpp containing #include "map/base/view.h") produce 0 cross-links today.

This PR adds IncludeExtractor which:

  1. Provider side: registers every .h/.hpp/.hxx/.hh file in a repo as include::<relative-path>.
  2. Consumer side: parses C/C++ source/header files for #include "..." directives, resolves them against the repo's own file list using exact-full-path suffix lookup (fixing the truncated-suffix ambiguity from the previous review), and emits unresolved paths as consumer contracts.
  3. Matching: normalizeContractId handles the include:: type identically on both sides (lowercase, forward slashes, leading ./ stripped), so exact matching via runExactMatch produces cross-links.
  4. Bridge: writeBridge is called after writeContractRegistry, making include cross-links available to the MCP impact engine.
  5. Fallback: tree-sitter parses AST-correctly; for files that exceed the 32 KB parse limit, a regex fallback strips block comments before scanning to avoid false positives.
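
The matching step (item 3) depends on both sides normalizing paths the same way. A minimal sketch of that normalization follows, assuming only the rules stated in this review (lowercase, forward slashes, leading ./ stripped) plus collapsing of repeated slashes, which is an assumption drawn from the "//" case in the test list. The names normalizeIncludePathSketch and toContractId are illustrative, not the real helpers.

```typescript
// Hypothetical stand-in for the normalization described above.
function normalizeIncludePathSketch(p: string): string {
  return p
    .replace(/\\/g, '/')      // backslashes to forward slashes
    .replace(/\/{2,}/g, '/')  // collapse repeated slashes (assumed rule)
    .replace(/^(\.\/)+/, '')  // strip leading "./"
    .toLowerCase();           // case-fold (collapses Foo.h and foo.h)
}

// Illustrative contract-ID builder using the include:: prefix from the PR.
const toContractId = (p: string): string => `include::${normalizeIncludePathSketch(p)}`;

const providerId = toContractId('Map\\Base\\View.h');
const consumerId = toContractId('./map//base/view.h');
```

Because both IDs reduce to the same string, an exact-match lookup is enough to link them. Applying the function twice gives the same result as applying it once, which is the idempotency property noted later in this review.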

Findings


[BLOCKER] Windows CI regression — sync.test.ts ENOTEMPTY from LadybugDB file lock

  • Category: CI regression / Windows correctness
  • Files: gitnexus/src/core/group/sync.ts:285–292, gitnexus/test/unit/group/sync.test.ts:672–694
  • Issue: The test writes registry to groupDir when skipWrite is false creates a temp directory, calls syncGroup({ groupDir: tmpDir, skipWrite: false, extractorOverride: async () => [] }), then cleans up with fs.rmSync(tmpDir, { recursive: true, force: true }). Before this PR, syncGroup only wrote contracts.json (a plain JSON file) — cleanup was trivial. This PR adds await writeBridge(opts.groupDir, ...) to sync.ts. writeBridge creates a LadybugDB file (bridge.lbug) plus WAL/shadow sidecars inside tmpDir. On Windows, LadybugDB's background checkpoint thread can briefly outlive closeBridgeDb(), holding a Win32 lock on the renamed file. When rmSync runs immediately after, Windows refuses to delete the locked file and throws ENOTEMPTY. The CI log confirms: Error: ENOTEMPTY: directory not empty, rmdir 'C:\Users\RUNNER~1\AppData\Local\Temp\gitnexus-sync-write-dom61t'. This failure was NOT present on main before this PR's writeBridge call was added.
  • Why it matters: The CI Gate explicitly fails because of this. A PR that regresses Windows tests is not production-ready under DoD §4.2 regardless of whether it passes on other platforms.
  • Recommended fix: Either (a) add skipWrite: true to the test (it only checks the registry, not the bridge) or (b) add the test to the known-flaky list for Windows with justification, or (c) add a brief delay after writeBridge before cleanup in tests. Option (a) is cleanest: change skipWrite: false to skipWrite: true and check existsSync(registryPath) separately if needed. Or alternatively wrap the writeBridge call in sync.ts in a try/catch to prevent propagation — but that would not fix the cleanup race.
  • Blocks merge: YES
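
One way to realize the "retry on cleanup" idea is Node's built-in retry support in fs.rmSync, which retries on ENOTEMPTY, EBUSY, and EPERM when recursive is true. The sketch below is an assumption about how a test helper could look, not code from sync.test.ts; removeTempDir is a hypothetical name, and whether five retries are enough for LadybugDB to release its handles on Windows has not been established.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical test helper: delete a temp dir, retrying on transient
// Windows errors such as ENOTEMPTY/EBUSY/EPERM with linear back-off.
function removeTempDir(dir: string): void {
  fs.rmSync(dir, { recursive: true, force: true, maxRetries: 5, retryDelay: 100 });
}

// Usage mirroring the shape of the failing test's setup and teardown.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'gitnexus-sync-write-'));
fs.writeFileSync(path.join(tmpDir, 'contracts.json'), '{}');
removeTempDir(tmpDir);
```

This keeps skipWrite: false in the test, so the registry write is still exercised, and only the teardown becomes tolerant of a briefly held file lock.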

[HIGH] Graph provider strategy emits absolute-path contract IDs

  • Category: Correctness / Latent bug
  • Files: gitnexus/src/core/group/extractors/include-extractor.ts:346–371
  • Issue: extractProvidersGraph queries f.filePath from the LadybugDB graph. GitNexus stores absolute paths during gitnexus analyze (e.g., /home/user/repos/interface/map/base/view.h). normalizeIncludePath converts slashes and lowercases but does NOT strip the repo root prefix. So the resulting contract ID becomes include::/home/user/repos/interface/map/base/view.h. Consumers emit include::map/base/view.h (repo-relative). These IDs never match via runExactMatch. The guard if (graphProviders.length > 0) return graphProviders prevents the correct filesystem fallback from running. Consequence: for any indexed C++ repo where initLbug succeeds and the DB has File nodes with header extensions, the graph strategy returns results with wrong IDs → 0 cross-links → feature silently broken.
  • Why it matters: The primary stated use case is indexed C++ repos running group sync. If the repos have been analyzed (which is the normal pre-condition for group sync to be useful), the graph strategy fires and breaks matching. The author's 141 cross-link validation apparently used the filesystem fallback (perhaps because initLbug failed for their repos, or the Cypher regex didn't match, or File nodes weren't created for headers in their graph). This latent bug would surface in a clean production deployment.
  • Recommended fix: In extractProvidersGraph, after fetching f.filePath, strip the repo root prefix: const rel = path.relative(repoPath, r.filePath as string).replace(/\\/g, '/'). This requires passing repoPath into extractProvidersGraph. Alternatively, remove extractProvidersGraph entirely: the filesystem fallback is correct and fast (O(N) glob), and the graph strategy adds no value when paths are absolute.
  • Blocks merge: YES (breaks the feature for the primary use case when repos are indexed)
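
A dependency-free sketch of the prefix-stripping step is shown below. The real fix would more likely use path.relative as recommended; toRepoRelative is a hypothetical helper that does plain string handling, so it does not account for case-insensitive filesystems or symlinks.

```typescript
// Hypothetical helper: convert an absolute graph path into a repo-relative,
// forward-slash path. Returns null when the file is not under repoPath.
function toRepoRelative(repoPath: string, filePath: string): string | null {
  // Normalize separators and drop trailing slashes before comparing.
  const norm = (p: string): string => p.replace(/\\/g, '/').replace(/\/+$/, '');
  const root = norm(repoPath);
  const file = norm(filePath);
  // Require a separator after the root so "/repo2/x.h" is not treated
  // as being inside "/repo".
  if (!file.startsWith(root + '/')) return null;
  return file.slice(root.length + 1);
}

const rel = toRepoRelative(
  '/home/user/repos/interface',
  '/home/user/repos/interface/map/base/view.h',
);
```

Feeding the relative result into the usual normalization would give include::map/base/view.h, the same ID the consumer side emits. A null result could be used to skip the row instead of emitting an unmatched absolute ID.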

[MEDIUM] ../ relative includes produce spurious consumer contracts

  • Category: Extraction correctness / Minor noise
  • Files: gitnexus/src/core/group/extractors/include-extractor.ts:223–225, 465–478
  • Issue: normalizeIncludePath strips only leading ./, not ../ segments. So #include "../include/foo.h" normalizes to include::../include/foo.h. The suffix index (built from repo-relative paths) will return undefined for ../include/foo.h (since no suffix of a repo-relative path starts with ../), so isLocalInclude returns false and this becomes a consumer contract. No provider can have ../ in their ID, so the contract never matches — it silently pollutes the registry.
  • Why it matters: ../ includes are almost always intra-repo references. Emitting them as cross-repo consumer contracts adds noise to the registry and could confuse impact queries. No test covers this case.
  • Recommended fix: In extractConsumers, skip includes that start with ../ before calling isLocalInclude: if (cleaned.startsWith('../')) continue;. Or add a .. segment strip to normalizeIncludePath.
  • Blocks merge: no (harmless noise, no false positive cross-links)
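
A minimal sketch of the suggested guard follows. isParentRelativeInclude is a hypothetical helper name; it assumes the cleaned include text may still carry backslashes or a leading ./ and tolerates both before checking for ../.

```typescript
// Hypothetical guard: true when an include path climbs out of the
// including file's directory, e.g. "../include/foo.h".
function isParentRelativeInclude(cleaned: string): boolean {
  const p = cleaned.replace(/\\/g, '/').replace(/^(\.\/)+/, '');
  return p.startsWith('../');
}

// Usage: drop parent-relative includes before any local-resolution check.
const rawIncludes = ['../include/foo.h', 'ext/api.h', './../util.h', '..hidden/x.h'];
const candidates = rawIncludes.filter((inc) => !isParentRelativeInclude(inc));
```

Here only ext/api.h and ..hidden/x.h stay as candidates. The second is kept on purpose, since a directory name that merely begins with two dots is not a parent reference.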

[MEDIUM] writeBridge error in sync.ts propagates uncaught

  • Category: Robustness / Observability
  • Files: gitnexus/src/core/group/sync.ts:285–292
  • Issue: If writeBridge fails (disk full, LadybugDB schema error, permission denied), syncGroup rejects. At that point contracts.json has already been written successfully — the registry is valid. Callers (e.g., the CLI group sync command) would surface a fatal error even though the contract data is intact. Impact queries would silently use a stale bridge.
  • Recommended fix: Wrap writeBridge in a try/catch, log a warning, and continue. Bridge staleness degrades gracefully (impact queries return empty results) rather than masking a registry that is actually correct.
  • Blocks merge: no
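
The try/catch recommendation can be sketched as a small wrapper. Everything here is illustrative: BridgeWriter stands in for the real writeBridge call, whose full argument list is not shown in this review, and writeBridgeBestEffort is a hypothetical name.

```typescript
// Stand-in for the real bridge writer; the actual signature takes more arguments.
type BridgeWriter = (groupDir: string) => Promise<void>;

// Hypothetical wrapper: attempt the bridge write, log on failure, never throw.
async function writeBridgeBestEffort(
  groupDir: string,
  writeBridge: BridgeWriter,
  warn: (msg: string) => void = console.warn,
): Promise<boolean> {
  try {
    await writeBridge(groupDir);
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // The registry JSON was already written, so the sync result is still usable.
    warn(`bridge write failed for ${groupDir}; registry is intact: ${reason}`);
    return false;
  }
}
```

Returning a boolean lets the caller record in the sync result that the bridge may be stale, which keeps the failure visible without rejecting the whole sync.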

[LOW] canExtract() always returns true; detect.includes is the only gate

  • Category: Observability / Minor design
  • Files: gitnexus/src/core/group/extractors/include-extractor.ts:303–305
  • Issue: canExtract() returns true for all repos. However, sync.ts does NOT call canExtract() — it only checks config.detect.includes. So the canExtract method is dead code in the current wiring. Non-C++ repos do pay for 2 glob calls (all-files and source-glob), but this is bounded and cheap for typical repos. The dead method could mislead future contributors into thinking it's being used.
  • Recommended fix: Either remove canExtract() if unused, or document that it's not called by sync.ts.
  • Blocks merge: no

Group contract model assessment

include is added to ContractType in types.ts:1, VALID_CONTRACT_TYPES in config-parser.ts:14, DetectConfig.includes in types.ts:31, normalizeContractId in matching.ts:110, and ManifestExtractor.resolveSymbol in manifest-extractor.ts:277. The default DEFAULT_DETECT.includes: true is intentional and documented. Existing group YAML files without includes continue to parse (spread-merge with defaults). detect.includes: false correctly gates the extractor at sync.ts:174.

normalizeContractId for include:: is byte-for-byte identical to normalizeIncludePath (confirmed by reading both). Idempotent: applying twice produces the same result. ✓

The ManifestExtractor include branch at line 277–284 is syntactically correct (the missing ); was fixed by 3f5d21c). Semantically, the query WHERE f.filePath = $contract uses the raw link.contract from group.yaml — in practice this would need an absolute path to match, making the graph lookup effectively dead code. The synthetic UID fallback is used, which works for impact traversal. Not a blocker.

Backward compatibility is preserved: no existing extractor behaviour changes.


Include extraction assessment

Provider side: Correctly registers only .h/.hpp/.hxx/.hh extensions via HEADER_EXTENSIONS (line 29). Paths are relative to repo root. Contract ID format include::normalized-path.h is stable. Path normalization (forward slashes, lowercase, leading ./ stripped) is consistent with normalizeContractId. The one exception is the graph strategy's absolute-path bug described above; the filesystem fallback is correct.

STANDARD_IGNORES excludes node_modules, .git, vendor, dist, build, .gitnexus, third_party, 3rdparty, external. Generated files (.pb.h, .moc.h) are still included, which may add some noise for projects using protobuf/Qt. Acceptable for a first implementation.

Consumer side: Scans SOURCE_GLOB = '**/*.{c,cpp,cc,cxx,h,hpp,hxx,hh}'. Quotes-only includes captured. Angle-bracket includes filtered via isAngleBracketInclude() — this is intentional and explicitly documented. System headers filtered via SYSTEM_HEADERS (comprehensive C and C++ stdlib list). System path prefixes (sys/, linux/, bits/, etc.) filtered. Local includes suppressed via isLocalInclude() using exact-full-path suffix lookup (finding #4 fix verified as correct).

../ includes: these produce consumer contracts that never match a provider and only add noise to the registry (see the finding above).

Parser failure: Any parser.parse() or parser.setLanguage() throw falls back to regex. stripBlockComments is applied before the regex scan. INCLUDE_REGEX.lastIndex = 0 reset on each call. CRLF handled (regex m flag, ^ matches after \r\n). Linear, ReDoS-safe.
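
The fallback path described here can be sketched as follows. This is an approximation for illustration, not the code in include-extractor.ts: stripBlockCommentsSketch and fallbackIncludes are hypothetical names, and the comment stripper does not try to handle /* sequences inside string literals or unterminated comments.

```typescript
// Pattern as quoted in the PR summary.
const INCLUDE_REGEX = /^[ \t]*#\s*include\s*"([^"]+)"/gm;

// Blank out /* ... */ spans but keep newlines so line structure is unchanged.
function stripBlockCommentsSketch(source: string): string {
  return source.replace(/\/\*[\s\S]*?\*\//g, (comment) => comment.replace(/[^\n]/g, ' '));
}

// Hypothetical regex fallback: strip block comments, then scan for quoted includes.
function fallbackIncludes(source: string): string[] {
  const text = stripBlockCommentsSketch(source);
  INCLUDE_REGEX.lastIndex = 0; // the regex is global, so reset before each scan
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = INCLUDE_REGEX.exec(text)) !== null) {
    found.push(match[1]);
  }
  return found;
}

const src = '/*\n#include "fake.h"\n*/\n#include "real.h"\r\n#include "ext/api.h"\n';
const includes = fallbackIncludes(src);
```

The commented-out fake.h is skipped, and the CRLF line ending does not prevent the following directive from matching, because ^ in multiline mode matches after either \r or \n.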

Macro includes (#include HEADER_NAME): tree-sitter captures the macro node via path: (_). isAngleBracketInclude() checks for <...>. The cleaned = rawText.replace(/['"<>]/g, '') would strip nothing from HEADER_NAME, producing include::header_name as a consumer contract. This is a false positive. The regex fallback correctly ignores macro includes (requires literal "..."). No test covers this.
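
One possible guard for the tree-sitter path is to classify the raw operand text before stripping delimiters, and to emit a consumer contract only for the quoted form. The sketch below is an assumption about how that could look; classifyIncludeOperand and IncludeKind are hypothetical names.

```typescript
type IncludeKind = 'quoted' | 'angle' | 'macro';

// Hypothetical classifier for the raw text of an #include operand.
function classifyIncludeOperand(rawText: string): IncludeKind {
  const t = rawText.trim();
  if (t.length >= 2 && t.startsWith('"') && t.endsWith('"')) return 'quoted';
  if (t.length >= 2 && t.startsWith('<') && t.endsWith('>')) return 'angle';
  // Anything else (e.g. HEADER_NAME) is a macro or computed include.
  return 'macro';
}

// Usage: only quoted operands become consumer-contract candidates.
const operands = ['"ext/api.h"', '<vector>', 'HEADER_NAME'];
const quotedOnly = operands.filter((op) => classifyIncludeOperand(op) === 'quoted');
```

This makes the tree-sitter path agree with the regex fallback, which already ignores macro includes because it requires a literal quoted path.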


Matching / normalization assessment

normalizeContractId for include:: at matching.ts:110–111 is identical to normalizeIncludePath. Provider and consumer emit normalized IDs. buildProviderIndex normalizes all provider IDs before indexing. findMatchingKeys normalizes consumer IDs before lookup. The round-trip provider → normalize → index, consumer → normalize → lookup is consistent. ✓

buildNoisyContractFilter only filters http:: contracts — include:: passes through unfiltered. ✓ No cross-type collisions possible (type prefix is part of the ID). ✓

Multi-provider ambiguity: If two provider repos both expose include::map/base/view.h, buildProviderIndex accumulates both in the same key. runExactMatch creates CrossLinks to BOTH providers (lines 198–226). This is the correct behavior: both repos are potential providers. Callers (impact engine) would see both cross-links. ✓
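
The multi-provider behavior can be sketched with a map from normalized contract ID to a list of providers. The types and function names below are simplified stand-ins for illustration; the real buildProviderIndex and runExactMatch carry more fields.

```typescript
// Simplified stand-in for a provider contract.
interface ProviderContract {
  repo: string;
  contractId: string; // assumed to be already normalized
}

// Hypothetical index: one bucket per contract ID, possibly holding several repos.
function buildProviderIndexSketch(providers: ProviderContract[]): Map<string, ProviderContract[]> {
  const index = new Map<string, ProviderContract[]>();
  for (const p of providers) {
    const bucket = index.get(p.contractId);
    if (bucket) bucket.push(p);
    else index.set(p.contractId, [p]);
  }
  return index;
}

// Exact-match lookup: every provider in the bucket yields a cross-link target.
function matchConsumerSketch(index: Map<string, ProviderContract[]>, consumerId: string): string[] {
  return (index.get(consumerId) ?? []).map((p) => p.repo);
}

const index = buildProviderIndexSketch([
  { repo: 'interface-a', contractId: 'include::map/base/view.h' },
  { repo: 'interface-b', contractId: 'include::map/base/view.h' },
]);
const targets = matchConsumerSketch(index, 'include::map/base/view.h');
```

The consumer is linked to both repos, and each lookup is a single hash-map access regardless of how many providers exist.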


Group sync / bridge / impact assessment

IncludeExtractor is registered at sync.ts:105 and gated at sync.ts:174 (if (config.detect.includes)). Order within the extractor loop is correct — runs after HTTP/gRPC/thrift/topic. writeBridge is called after writeContractRegistry at sync.ts:286–293. This order is correct: the registry JSON is written first (fast), then the bridge DB (slower, can fail independently).

The integration tests at include-extractor-sync.test.ts use extractorOverride: async () => mockContracts (zero-argument function, matched by eo.length === 0 in sync.ts:95). This correctly bypasses real extraction and exercises the matching + bridge pipeline. Three tests: direct mock, normalization round-trip, and real-filesystem extractor → syncGroup. The third test proves the actual extraction code + matching pipeline together. ✓

group impact traversal is not directly exercised, but the bridge writer is the existing mechanism used by all extractor types. CrossLinks produced by include matching follow the same path to the bridge DB as HTTP/gRPC CrossLinks. The CrossLink.type: 'include' propagates through writeBridge correctly.

Direction: from = consumer (source file), to = provider (header) — correct for downstream impact queries: "if this header changes, what source files are affected?" ✓

Idempotency: Each sync completely replaces the registry. Running sync twice produces identical output. ✓


Performance / scalability assessment

canExtract() is never called by sync.ts, so the true return doesn't actually cause unnecessary work. Non-C++ repos still pay for 2 glob calls (all-files + source-glob). For groups with 20 non-C++ repos this is 40 extra filesystem traversals. Cheap on SSDs, measurable on slow NFS. detect.includes: false provides an opt-out.

Suffix index: built once in extract(), O(N_files) construction, O(1) lookup per include. ✓

Parser reuse: one Parser() instance per extractConsumers() call, shared across files. Query compiled once per language via queryCache. ✓

No obvious O(providers × consumers) explosion in matching: buildProviderIndex is a hash map, findMatchingKeys is O(1) for exact matches. ✓


Security / path safety assessment

isLocalInclude and contract ID generation use only the cleaned include path string — no filesystem operations on the include path itself. ✓

Provider paths are from glob results relative to repoPath — all are repo-relative. Contract IDs never contain absolute paths from the filesystem fallback path. The graph strategy does produce absolute paths — see blocker finding.

GHAS / CodeQL: all 4 findings resolved in commit fcc4319 (HEADER_GLOB unused variable removed, insecure temp file replaced with fs.mkdtempSync). CodeQL is ✅ green on latest head.

Hidden Unicode: U+2026 (horizontal ellipsis) and U+2014 (em dash) appear in code comments only. No bidi/directional controls. No impact on executable code, regexes, or contract IDs. ✓

Test helpers use fs.mkdtempSync (cryptographically safe temp directory). ✓


Test assessment

| Test area | Status |
|-----------|--------|
| Provider .h/.hpp/.hxx/.hh | |
| Provider ignores .cpp | |
| Consumer emits unresolved includes | |
| Local include resolution suppresses consumer | |
| System headers (quoted form) filtered | |
| System path prefixes filtered | |
| Angle-bracket filtered | |
| Provider/consumer round-trip normalization | |
| normalizeContractId (backslash, case, ./, //) | |
| Suffix ambiguity fix (finding #4) | ✅ 3 tests |
| Block comment false positive (finding #5) | |
| meta.source reflects path (finding #6) | ✅ 2 tests |
| Case-folding collision documented (finding #3) | |
| Deduplication semantics | |
| syncGroup → CrossLink (finding #7) | ✅ 3 integration tests |
| detect.includes: false | ❌ Missing |
| Macro includes #include HEADER | ❌ Missing |
| ../ relative include noise | ❌ Missing |
| Graph strategy absolute path | ❌ Missing |

The integration tests properly exercise the extraction → matching → CrossLink pipeline via extractorOverride. The real-filesystem round-trip test (third test) is the strongest signal. The test for detect.includes: false is notably absent.


GHAS / hidden Unicode assessment

GHAS: All 4 CodeQL findings (1 unused var HEADER_GLOB, 3 insecure temp file) were addressed in fcc4319. CodeQL is green on latest head. The integration test file does not have the same insecure temp file pattern — it uses fs.mkdtempSync correctly.

Hidden Unicode: U+2026 (horizontal ellipsis) and U+2014 (em dash) appear in comments only. No bidi controls. No impact on parser behavior, regex semantics, or contract IDs.


Elegance / maintainability assessment

The extractor follows the existing ContractExtractor interface pattern cleanly. The dual-strategy (graph → filesystem fallback) mirrors other extractors. The STANDARD_IGNORES constant is readable. The isLocalInclude function is well-documented with a reference to the review finding. stripBlockComments is a clean, focused helper. The meta.source tracking is a nice observability touch.

The extractProvidersGraph strategy has a latent bug (absolute paths) but is otherwise well-structured. The separation of extractProviders / extractConsumers / dedupe is clean.

The commit history is a bit noisy (multiple rounds of fixes) but the final state of the code is readable. Comments are on-point and reference review findings rather than being noise.

Future extensibility: adding -I include directories or selective vendor exclusion would fit naturally into the existing structure. The ContractExtractor interface is preserved correctly.


Final verdict

❌ Not production-ready

Two blockers require fixes:

  1. Windows CI regression (BLOCKER): Adding writeBridge to sync.ts causes the existing test sync.test.ts > writes registry to groupDir when skipWrite is false to fail on tests/windows-latest with ENOTEMPTY. LadybugDB's background checkpoint thread briefly holds a Win32 lock after closeBridgeDb() returns; when test cleanup runs rmSync immediately after, Windows refuses to delete the locked file. This is a direct regression introduced by this PR — the test passed on main before writeBridge was added. The fix is straightforward: add skipWrite: true to that specific test (it only needs to verify the registry JSON, not the bridge) or wrap the test's cleanup in a retry/delay.

  2. Graph provider absolute-path bug (BLOCKER): extractProvidersGraph returns rows where f.filePath is an absolute path from the DB (e.g., /home/user/repos/interface/map/base/view.h). normalizeIncludePath does not strip the repo root, so provider contract IDs include the absolute path and never match consumer's repo-relative IDs. The guard if (graphProviders.length > 0) return graphProviders prevents the correct filesystem fallback from running. For indexed C++ repos (the primary use case), this silently produces 0 cross-links. The fix is either to pass repoPath into extractProvidersGraph and apply path.relative(), or to remove the graph strategy entirely and always use the filesystem fallback.
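As a rough illustration of the first option in blocker 2, the sketch below relativizes absolute graph paths against the repo root and drops anything that resolves outside it. toProviderIds is a hypothetical helper, not the extractor's actual code, and path.posix is used only to keep the example deterministic across platforms (the review itself suggests path.relative()).

```typescript
import * as path from "node:path";

// Hypothetical helper: turn absolute graph paths into repo-relative provider IDs.
function toProviderIds(repoPath: string, absolutePaths: string[]): string[] {
  const ids: string[] = [];
  for (const abs of absolutePaths) {
    const rel = path.posix.relative(repoPath, abs);
    // Rows that resolve outside the repo (stale paths, system headers) are dropped.
    if (rel === "" || rel.startsWith("..") || path.posix.isAbsolute(rel)) continue;
    ids.push(`include::${rel.toLowerCase()}`);
  }
  return ids;
}

const providerIds = toProviderIds("/home/user/repos/interface", [
  "/home/user/repos/interface/map/base/view.h",
  "/opt/other/machine/foo.h",
]);
console.log(providerIds); // ["include::map/base/view.h"]
```

With IDs in this shape, a consumer that emits include::map/base/view.h can match the provider through an exact lookup.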

Beyond these two blockers, the overall design and implementation quality is high. The finding #4 fix (exact-suffix local resolution) is elegant and correct. The block-comment stripping for the regex fallback is a good defensive improvement. The three integration tests cover the key wiring. With the two fixes applied and CI green on all platforms, this PR should be in good shape for final approval.


@magyargergo

Collaborator

I'll continue this! Thanks for your PR

@SZU-WenjieHuang

Contributor Author

@magyargergo Thanks for continuing with this — much appreciated! :)

@magyargergo

Collaborator

@SZU-WenjieHuang would you be interested in migrating C/C++ to scope resolution?

Addresses two blockers and two mediums from the deep review.

BLOCKER 1: Windows CI ENOTEMPTY in sync.test.ts
  After this PR added writeBridge() to syncGroup, the existing test
  "writes registry to groupDir when skipWrite is false" fails on
  windows-latest. LadybugDB's checkpoint thread briefly outlives
  closeBridgeDb, holding a Win32 lock on bridge.lbug; the test's
  fs.rmSync then fails with ENOTEMPTY. Switched the test cleanup to
  cleanupTempDir from test/helpers/test-db.ts which already tolerates
  EBUSY/EPERM/EACCES/ENOTEMPTY with bounded retries — same pattern
  used elsewhere for LadybugDB-touching tests.

BLOCKER 2: Graph provider absolute-path bug
  extractProvidersGraph queried File.filePath from the LadybugDB graph
  but never stripped the repo root, so provider contract IDs ended up
  as include::/abs/path/foo.h while consumers emitted include::foo.h.
  These never matched through runExactMatch — silently producing 0
  cross-links for any indexed C++ repo (the primary use case).
  Now passes repoPath into extractProvidersGraph and applies
  path.relative(); rows that resolve outside repoPath (stale absolute
  paths from another machine, system headers somehow indexed) are
  dropped instead of polluting the registry.

MEDIUM: `../` relative includes produce spurious noise
  `#include "../foo.h"` is almost always intra-repo, but the suffix
  index can never match a `..`-prefixed path so it became a consumer
  contract no provider could satisfy. Now skipped before matching;
  covers both forward-slash and backslash forms.

MEDIUM: writeBridge error in sync.ts propagates uncaught
  contracts.json is the canonical source of truth and was just written
  successfully when writeBridge runs. A bridge-only failure (disk full,
  schema error, permission denied) shouldn't mask the registry. Wrapped
  writeBridge in try/catch with a logger.warn surfacing the path and
  recovery instructions.

Tests added:
  - extractProvidersGraph repo-relative ID generation (stub Cypher
    executor returns absolute paths)
  - extractProvidersGraph drops rows whose path resolves outside repo
  - `../foo.h` forward-slash skip
  - `..\foo.h` backslash-form skip

Skipped findings:
  - canExtract() removal (abhigyanpatwari#5, low): canExtract is part of the
    ContractExtractor interface; every other extractor implements the
    same `return true` shape. Removing it from IncludeExtractor would
    break the interface contract — keeping for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
magyargergo and others added 4 commits May 8, 2026 13:32
Two HIGH findings from the Codex adversarial review on
feat/group-include-extractor:

1. Default-on extraction silently changes existing groups (BLOCKER)
   DEFAULT_DETECT.includes was true, so any pre-existing group.yaml
   that omits the new field would gain a wave of include::* contracts
   on the next sync after upgrade. Flipped to false (opt-in). The
   integration test already declares includes: true explicitly so it
   survives unchanged; the unit extractor tests bypass parseGroupConfig
   entirely; the sync test uses extractorOverride. Only config-parser
   needed regression tests covering omitted/explicit/false variants.

2. IncludeExtractor scans outside the indexed file universe (BLOCKER)
   The extractor was running glob('**/*', { ignore: STANDARD_IGNORES })
   twice with a hand-rolled 9-pattern list, no .gitignore/.gitnexusignore
   honoring, and no max-file-size cap. That meant File:<path> contracts
   could appear for files ingestion would never index, producing
   cross-links group impact cannot fan out to (silent false-negatives).
   Refactored to a single discoverIndexableFiles() helper that mirrors
   walkRepositoryPaths exactly: createIgnoreFilter + getMaxFileSizeBytes,
   one discovery pass shared by provider and consumer paths. Dropped
   STANDARD_IGNORES and SOURCE_GLOB entirely.

   third_party and 3rdparty (the C/C++ vendored-deps conventions) were
   in the local ignore list but not in the canonical DEFAULT_IGNORE_LIST
   used by ingestion. Folded both into the canonical set rather than
   keep a parallel list — the whole point of the Codex finding is that
   two file-discovery implementations drift. Single source of truth.

Tests: 5 new regression tests for the discovery alignment (.gitignore,
.gitnexusignore, max-file-size on both provider and consumer paths)
plus 4 for the opt-in default. All 30 include-extractor tests + the
494-test group suite + ignore-service tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce-code-review surfaced 6 safe_auto findings on commit a9936a9:

- T1 (testing, P2): the sync.ts:174 gate was untested with includes:false.
  Added a sync-level test mirroring the existing thrift-off pattern at
  sync.test.ts:545, asserting zero include contracts when the gate is
  disabled in a real syncGroup call.

- T3 (testing, P3): third_party and 3rdparty entries in DEFAULT_IGNORE_LIST
  had no regression test. Added both to ignore-service.test.ts's
  dependency-directories it.each block.

- M1 (maintainability, P3): discoverIndexableFiles JSDoc lacked a
  fork-warning relative to walkRepositoryPaths. Added a MAINTENANCE
  note explaining why the duplication is tolerated and the contract
  the two implementations must keep.

- M2 (maintainability, P3): thrift-extractor still hand-rolls its
  ignore array with no signal that DEFAULT_IGNORE_LIST additions
  silently do not apply there. Added TODO(abhigyanpatwari#1156-followup) comments
  above both call sites.

- M3 (maintainability, P3): SOURCE_EXTENSIONS duplicated the four
  HEADER_EXTENSIONS entries with no expressed subset relationship.
  Spread HEADER_EXTENSIONS into SOURCE_EXTENSIONS so future header-
  extension additions propagate.

- C1+T4 (correctness+testing, P3, cross-reviewer corroborated):
  discoverIndexableFiles swallowed all fs.stat errors silently,
  including EACCES/EMFILE/EIO. Narrowed the catch to ENOENT (the
  documented benign glob/stat race) and added a logger.warn for
  any other code so operators can spot permission/resource issues.

All 629 tests pass; typecheck + prettier clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s EPERM

`storage.ts:62` used raw `fsp.rename` for the contracts.json atomic swap.
On Windows, AV scanners and concurrent renames briefly hold the
destination handle between rename calls, surfacing as EPERM/EBUSY.
The `insecure-tempfile.test.ts > concurrent writes do not collide`
test was flaking with `EPERM: operation not permitted, rename` on
windows-latest CI.

`bridge-db.ts` already has a battle-tested `retryRename(src, dst, 3)`
helper used at six call sites for exactly this pattern. Reusing it
here keeps the Windows-rename policy single-source-of-truth across
the group package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…closes Windows lbug lock flakes on cli-e2e

Brings open-time retry for , post-close handle-release
probe (main file + .wal sidecar) on Windows, schema-init lock-warning
filter, and serialized lbug-db vitest project. Without this, the
cli-e2e tests for cypher/query/impact intermittently fail on
windows-latest because the CLI's read-only lbug open hits the
documented Windows lock-acquisition race.
@SZU-WenjieHuang

Contributor Author

Hi @magyargergo, thanks for thinking of me! I'm definitely interested in tackling both #935 and #936. My plan would be:

Looking forward to contributing!

@github-actions

github-actions Bot commented May 9, 2026

Contributor

Claude finished @magyargergo's task in 6m 48s


PR #1156 Deep Review — IncludeExtractor for C++ cross-repo include tracking

Review tasks:

  • Post initial comment
  • Read DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, ARCHITECTURE.md
  • Read all changed files (include-extractor.ts, sync.ts, types.ts, matching.ts, config-parser.ts, manifest-extractor.ts)
  • Read related extractor/pipeline/utils files (suffixIndex, ingestion, http/grpc/thrift extractors)
  • Check CI status, git log, branch hygiene
  • Run hidden Unicode checks
  • Full review written

Review bar for this PR

  • IncludeExtractor file discovery must use the same rules as the ingestion pipeline so File:<rel> UIDs in cross-links correspond to actual graph File nodes.
  • Provider contract IDs must be repo-relative (not absolute paths from the DB) so they match consumer contract IDs through runExactMatch.
  • Consumer extraction must suppress local includes via exact full-path suffix lookup, not truncated suffix matching.
  • System/stdlib includes, angle-bracket includes, and ../ relative includes must be filtered before emitting consumer contracts.
  • detect.includes must default to false (opt-in) so existing group.yaml files that omit the field are not silently changed on next sync.
  • writeBridge() must be called after writeContractRegistry() with error isolation so a bridge failure cannot mask a valid contract registry.
  • CI (format, lint, typecheck, tests on all platforms), CodeQL, Gitleaks must be green on latest head.

Current PR state

Attribute Value
Head SHA c2fac47 (merge of upstream main into feature branch)
Latest feature commit a462feb fix(group): use retryRename in writeContractRegistry to absorb Windows EPERM
Key fix commits fadbb32 (two previous blockers), a9936a9 (Codex adversarial findings), 72f854f (autofix feedback)
Status Open, not draft
Merge commits 7 (bringing upstream main changes into branch)
Changed files (PR-specific) 14 TypeScript files
Net additions +1505 / -4
quality/format ✅ PASS (last CI report 2026-05-07)
quality/lint ✅ PASS
quality/typecheck ✅ PASS
tests/ubuntu ✅ PASS (8334 tests per May 7 CI)
tests/macos ✅ PASS
tests/windows ⚠️ Unknown on latest head (last passed after 298c067; fadbb32+a9936a9+72f854f+a462feb have not been reported as CI-green yet)
CodeQL ✅ PASS (fcc4319 resolved all 4 findings)
Gitleaks ✅ PASS
Hidden Unicode U+2026 (horizontal ellipsis) and U+2014 (em dash) in code comments only; not in executable code, regexes, contract IDs, or test strings. No bidi/directional controls. Acceptable.
Vercel Irrelevant to this review

CI coverage note: The last public CI report covers commit 298c067. Since then, three fix commits (fadbb32, a9936a9, 72f854f) and one Windows-resilience commit (a462feb) were applied. CI has not been publicly confirmed green on those commits. The commit messages claim "All 629 tests pass; typecheck + prettier clean" and "All 30 include-extractor tests + the 494-test group suite + ignore-service tests pass", but CI confirmation on the Windows runner for the Windows-specific blocker fix in fadbb32 is the outstanding signal.


Branch hygiene assessment

Verdict: merge-from-main commits present but harmless and merge-safe.

Seven Merge branch 'main' into feat/group-include-extractor commits are present, bringing in upstream changes (lbug lock-retry improvements, FTS warning, onnxruntime bump, etc.). None of those merged changes touch any file modified by this PR. The de6904c chore: drop test/global-setup.ts + test/vitest.d.ts commit is PR-related cleanup: commit 3f5d21c accidentally restored those files during a rebase dance; de6904c corrects this. The final PR-specific diff is correctly scoped to Group include extraction, config, matching, sync wiring, and tests. No lockfile churn, workflow changes, or unrelated package versions are introduced by the feature commits themselves. Squash-merging or standard merging would produce a clean net diff.


Understanding of the change

Intra-repo C/C++ include resolution in the single-repo ingestion pipeline creates IMPORTS edges for files within the same repo boundary, but Group sync had no concept of header contracts. A header-only interface repo (map/base/dice_map_view.h) and a consumer repo (src/controller.cpp containing #include "map/base/dice_map_view.h") produced 0 cross-links.

This PR adds IncludeExtractor which:

  1. Provider side: registers every .h/.hpp/.hxx/.hh file found via discoverIndexableFiles (aligned with walkRepositoryPaths: same createIgnoreFilter + getMaxFileSizeBytes rules) as include::<repo-relative-path>.
  2. Consumer side: parses C/C++ source/header files for #include "..." directives via tree-sitter, with regex fallback (stripped of block comments) for files exceeding the 32 KB parse limit. Emits unresolved includes as consumer contracts after suppressing: locally resolvable includes (exact full-path suffix matching), system/stdlib headers, ../-prefixed relative includes, and angle-bracket includes.
  3. Matching: normalizeContractId lowercases and forward-slashes both sides identically; runExactMatch produces CrossLinks.
  4. Bridge: writeBridge is called after writeContractRegistry in a try/catch, so a bridge failure cannot mask a valid contract registry.
  5. Default: detect.includes: false (opt-in) so existing group configs are not silently changed.

Key differences from lib/custom/manifest links: file-level granularity (not repo-level), extraction is fully automatic (no manual group.yaml entries needed), and the cross-link direction is consumer source file → provider header file.


Findings


[MEDIUM] Macro include #include HEADER_NAME emits orphaned consumer contracts

  • Category: Extraction correctness / Registry noise
  • Files: include-extractor.ts:496–519
  • Issue: The tree-sitter query (preproc_include path: (_) @import.source) captures macro-style includes like #include PLATFORM_HEADER. The (_) wildcard matches any node type including identifiers. For #include PLATFORM_HEADER, sourceNode.node.text = "PLATFORM_HEADER". isAngleBracketInclude returns false. replace(/['"<>]/g, '') leaves the string unchanged. The cleaned value "PLATFORM_HEADER" is not in SYSTEM_HEADERS, does not start with ../, and would fail isLocalInclude (no file named PLATFORM_HEADER exists). Result: include::platform_header is emitted as a consumer contract. No provider can ever have that contract ID — no file is named PLATFORM_HEADER. The contract is permanently orphaned in the registry.
  • Why it matters: Macro includes are common in embedded C, platform-abstraction layers, and build-system-generated headers. While the orphaned consumer contract never produces a false cross-link (no provider matches it), it pollutes the unmatched contract list and inflates the registry. No test covers this case, leaving the behavior undocumented.
  • Recommended fix: Add a guard, if (!/[./\\]/.test(cleaned)) continue;, that skips cleaned strings with no path separator and no extension dot. This covers virtually all macro identifiers (PLATFORM_HEADER, ASSERT_H_, etc.) while preserving all real header paths (foo.h, dir/foo.h). Add a test: a file containing #include SOME_MACRO should produce no consumer contract.
  • Blocks merge: no (orphaned consumers never produce false cross-links; registry noise only)
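The recommended guard can be tried in isolation. The following is a minimal sketch: the regex is the one quoted in the finding, while looksLikeHeaderPath and the sample inputs are hypothetical names chosen for illustration.

```typescript
// Hypothetical predicate wrapping the guard regex from the finding above.
// A real header path contains a dot, a slash, or a backslash; a bare macro
// identifier such as PLATFORM_HEADER contains none of them.
function looksLikeHeaderPath(cleaned: string): boolean {
  return /[./\\]/.test(cleaned);
}

const includeCandidates = ["PLATFORM_HEADER", "ASSERT_H_", "foo.h", "dir/foo.h"];
const keptCandidates = includeCandidates.filter(looksLikeHeaderPath);
console.log(keptCandidates); // ["foo.h", "dir/foo.h"]
```

One limit of this shape check: an extensionless quoted include in the repo root (for example "config") would also be skipped, which may or may not be acceptable for a given codebase.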

[LOW] CI not confirmed green on latest feature commits

  • Category: CI / Validation baseline
  • Files: N/A
  • Issue: The last public CI report covers commit 298c067 (May 7). Since then, fadbb32 (Windows ENOTEMPTY fix, graph path fix), a9936a9 (Codex adversarial fixes), 72f854f (autofix feedback), and a462feb (retryRename for storage.ts) were applied. The author's commit messages claim local clean runs, but the CI status on the Windows runner for the fadbb32 cleanup test fix (cleanupTempDir) has not been publicly confirmed in CI comments.
  • Why it matters: The previous Windows blocker was specifically in the sync test. The fix in fadbb32 changes the test cleanup to use cleanupTempDir from test/helpers/test-db.ts. This fix is correct by inspection, but a CI confirmation on windows-latest would close the loop definitively.
  • Recommended fix: Push any no-op commit (or re-push the branch tip) to trigger a fresh CI run. All platforms should be confirmed green.
  • Blocks merge: no (the fix is correct by inspection; CI confirmation is a hygiene requirement)

[LOW] canExtract() is dead code in current sync wiring

  • Category: Maintainability
  • Files: include-extractor.ts:299–301, sync.ts:174
  • Issue: sync.ts gates the extractor via if (config.detect.includes) — it does NOT call canExtract(). Every other extractor also returns true from canExtract(), but the method is at least called consistently elsewhere. The commit message of fadbb32 explicitly documents this was kept for interface consistency. This is a dead method that misleads future contributors into thinking it influences extraction.
  • Why it matters: Low impact, but a dead interface method with always-true semantics adds confusion. A comment noting it is not called by sync.ts would clarify.
  • Recommended fix: Add a brief JSDoc comment above canExtract(): /** Always returns true — unused by sync.ts which gates via config.detect.includes. Kept for interface consistency with other extractors. */
  • Blocks merge: no

Group contract model assessment

include is consistently added to ContractType (types.ts:1), VALID_CONTRACT_TYPES (config-parser.ts:14), DetectConfig.includes (types.ts:31), normalizeContractId (matching.ts:110), and ManifestExtractor.resolveSymbol (manifest-extractor.ts:277). The include branch in resolveSymbol is syntactically correct (the missing ); that caused CI to fail was fixed by 3f5d21c). The Cypher query WHERE f.filePath = $contract uses the raw link.contract from group.yaml, which would need to be an absolute DB path to match — making this branch effectively dead code for graph-based resolution. However, the synthetic UID fallback still generates working CrossLinks for impact traversal.

DEFAULT_DETECT.includes: false is the right call. A comment in config-parser.ts:22–26 explicitly explains the opt-in rationale. Existing group YAML files without includes: inherit false and are not affected. detect.includes: false is tested (sync.test.ts:587–626). detect.includes: true produces include contracts. ✓

Config parser tests cover omitted/explicit-false/explicit-true variants. ✓ VALID_CONTRACT_TYPES includes 'include' so manifest links of type include parse correctly. ✓
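The opt-in behavior can be pictured as a defaults merge. This is a sketch under assumptions: the DetectConfig shape is reduced to three fields, the http and grpc defaults are invented for the example, and resolveDetect is a hypothetical stand-in for whatever parseGroupConfig does internally. Only includes: false as the default comes from the review.

```typescript
// Reduced, illustrative config shape; only `includes` is taken from the PR.
interface DetectConfig {
  http: boolean;
  grpc: boolean;
  includes: boolean;
}

// Assumed defaults for http/grpc; `includes: false` is the opt-in default.
const DEFAULT_DETECT: DetectConfig = { http: true, grpc: true, includes: false };

// Hypothetical merge: fields present in the YAML override the defaults.
function resolveDetect(raw: Partial<DetectConfig> | undefined): DetectConfig {
  return { ...DEFAULT_DETECT, ...(raw ?? {}) };
}

console.log(resolveDetect(undefined).includes); // false
console.log(resolveDetect({ includes: true }).includes); // true
```

A group.yaml written before the feature existed takes the first path, so its next sync emits no include contracts.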


Include extraction assessment

Provider side: discoverIndexableFiles uses createIgnoreFilter (respects .gitignore, .gitnexusignore, DEFAULT_IGNORE_LIST) and getMaxFileSizeBytes, exactly mirroring walkRepositoryPaths. third_party and 3rdparty are now in DEFAULT_IGNORE_LIST. Tests confirm .gitignore, .gitnexusignore, and GITNEXUS_MAX_FILE_SIZE are all honored for both providers and consumers. Provider paths are repo-relative. Contract ID include::normalized/path.h is stable. ✓

Consumer side: Source extensions via SOURCE_EXTENSIONS = new Set([...HEADER_EXTENSIONS, '.c', '.cpp', '.cc', '.cxx']) — the spread ensures future additions to HEADER_EXTENSIONS propagate automatically. Tree-sitter correctly handles quoted includes. Angle-bracket includes (<vector>, <stdio.h>) correctly filtered via isAngleBracketInclude(). System headers filtered via SYSTEM_HEADERS set (comprehensive C+C++ stdlib list). System path prefixes (sys/, linux/, bits/, etc.) filtered. ../ includes correctly skipped (startsWith('../') || startsWith('..\\')) before the local resolution check. ✓
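The consumer-side filters described above amount to a short chain of early exits. The sketch below is illustrative only: shouldEmitConsumer is a hypothetical name, the system header and prefix lists are tiny subsets, and the angle-bracket check operates on the raw directive text rather than on a tree-sitter node.

```typescript
// Tiny illustrative subsets; the extractor's real lists are much longer.
const SYSTEM_HEADERS = new Set(["stdio.h", "stdlib.h", "string.h"]);
const SYSTEM_PREFIXES = ["sys/", "linux/", "bits/"];

// Hypothetical filter: returns true when an include should become a consumer contract.
function shouldEmitConsumer(rawInclude: string): boolean {
  if (rawInclude.startsWith("<")) return false; // angle-bracket include
  const cleaned = rawInclude.replace(/['"<>]/g, "");
  if (SYSTEM_HEADERS.has(cleaned)) return false; // quoted stdlib header
  if (SYSTEM_PREFIXES.some((prefix) => cleaned.startsWith(prefix))) return false;
  if (cleaned.startsWith("../") || cleaned.startsWith("..\\")) return false; // relative, treated as intra-repo
  return true;
}

console.log(shouldEmitConsumer('"map/base/view.h"')); // true
console.log(shouldEmitConsumer('"../foo.h"')); // false
```

Local-include suppression is a separate step and is not part of this chain.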

Local resolution: isLocalInclude uses the exact full path as the lookup key in the suffix index (suffixIndex.get(cleaned)) rather than the generic suffixResolve() which iterates truncated suffixes. This directly fixes finding #4: #include "ext/api.h" is NOT suppressed when the local repo has internal/api.h — only when it has a file actually ending in ext/api.h. Three tests cover this. ✓
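To make the exact-lookup idea concrete, here is a minimal sketch. It assumes the suffix index maps every directory-aligned suffix of each repo file to the files that end with it; buildSuffixIndex and this isLocalInclude are simplified stand-ins, not the repository's implementations.

```typescript
// Index every directory-aligned suffix of every file path.
function buildSuffixIndex(files: string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const file of files) {
    const parts = file.split("/");
    for (let i = 0; i < parts.length; i++) {
      const suffix = parts.slice(i).join("/");
      const bucket = index.get(suffix);
      if (bucket) bucket.push(file);
      else index.set(suffix, [file]);
    }
  }
  return index;
}

// An include is local only if its full path is itself a known suffix.
// No shorter suffix of the include (for example just "api.h") is tried.
function isLocalInclude(cleaned: string, index: Map<string, string[]>): boolean {
  return index.has(cleaned);
}

const suffixIndex = buildSuffixIndex(["src/internal/api.h", "vendor/ext/types.h"]);
console.log(isLocalInclude("ext/api.h", suffixIndex)); // false: only internal/api.h exists
console.log(isLocalInclude("internal/api.h", suffixIndex)); // true
```

The first call is the finding #4 scenario: the include stays a consumer contract because no local file ends in ext/api.h.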

Regex fallback: stripBlockComments is applied before scanning, so block-commented #include directives are correctly suppressed. INCLUDE_REGEX.lastIndex = 0 is reset per file. The m flag anchors ^ at each line start and the pattern has no end-of-line anchor, so CRLF line endings do not break matching. Regex is linear/ReDoS-safe. meta.source is set to 'regex_fallback' (not hardcoded 'tree_sitter'). ✓
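A compact sketch of the fallback path follows. The regex is the one quoted in the PR description; the stripBlockComments body and the extractQuotedIncludes wrapper are assumptions written for this example, not copies of the extractor's code.

```typescript
// Regex quoted in the PR description for the fallback path.
const INCLUDE_REGEX = /^[ \t]*#\s*include\s*"([^"]+)"/gm;

// Assumed implementation: blank out /* ... */ spans but keep newlines,
// so commented-out directives cannot match and line structure is preserved.
function stripBlockComments(source: string): string {
  return source.replace(/\/\*[\s\S]*?\*\//g, (comment) => comment.replace(/[^\n]/g, " "));
}

function extractQuotedIncludes(source: string): string[] {
  const text = stripBlockComments(source);
  const found: string[] = [];
  INCLUDE_REGEX.lastIndex = 0; // a global regex keeps state between uses
  let match: RegExpExecArray | null;
  while ((match = INCLUDE_REGEX.exec(text)) !== null) found.push(match[1]);
  return found;
}

const sampleSource = [
  '#include "map/base/view.h"',
  '/* #include "dead.h" */',
  "/*",
  '#include "also_dead.h"',
  "*/",
  '  #  include "util/log.h"\r',
  "#include <vector>",
].join("\n");
console.log(extractQuotedIncludes(sampleSource)); // ["map/base/view.h", "util/log.h"]
```

Line comments (//) are not stripped in this sketch, so a directive after // mid-line is only excluded because the pattern requires # to be the first non-blank character on its line.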

Macro includes: #include MACRO_NAME emits an orphaned consumer contract. See finding above. ⚠️


Matching / normalization assessment

normalizeContractId for include:: at matching.ts:110: include::${rest.replace(/\\/g, '/').replace(/^\.\//, '').replace(/\/+/g, '/').toLowerCase()} — byte-for-byte identical to normalizeIncludePath. Idempotent. ✓
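The replace chain quoted above can be exercised directly. In this sketch the chain is copied from the review text, while the function wrappers and the toContractId name are written for the example.

```typescript
// Same replace chain as quoted above: backslashes to slashes, strip a leading "./",
// collapse repeated slashes, lowercase.
function normalizeIncludePath(includePath: string): string {
  return includePath
    .replace(/\\/g, "/")
    .replace(/^\.\//, "")
    .replace(/\/+/g, "/")
    .toLowerCase();
}

// Hypothetical wrapper producing the include:: contract ID.
function toContractId(includePath: string): string {
  return `include::${normalizeIncludePath(includePath)}`;
}

console.log(toContractId(".\\Map\\Base//View.h")); // include::map/base/view.h
console.log(toContractId("map/base/view.h")); // include::map/base/view.h
```

Because both spellings collapse to the same ID, a Windows-style consumer path and a POSIX-style provider path meet in the exact-match step.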

Provider and consumer emit normalized IDs. buildProviderIndex normalizes before indexing. findMatchingKeys normalizes before lookup. Round-trip is consistent. ✓

buildNoisyContractFilter only filters http:: contracts — include:: passes unfiltered. ✓ No cross-type collisions possible (the include:: prefix is distinct from all other type prefixes). ✓

Multiple provider ambiguity: If two provider repos both expose include::common/types.h, buildProviderIndex accumulates both. runExactMatch emits CrossLinks to BOTH providers (correct behavior — both repos are potential providers). ✓
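The multi-provider behavior falls out of keeping a list per contract ID. The sketch below reuses the names buildProviderIndex and runExactMatch from the review, but the Contract and CrossLink shapes and both function bodies are simplified assumptions, and IDs are taken to be normalized already.

```typescript
// Simplified, hypothetical shapes for illustration.
interface Contract {
  repo: string;
  contractId: string;
  file: string;
}
interface CrossLink {
  from: Contract; // consumer source file
  to: Contract; // provider header
}

function buildProviderIndex(providers: Contract[]): Map<string, Contract[]> {
  const index = new Map<string, Contract[]>();
  for (const provider of providers) {
    const bucket = index.get(provider.contractId);
    if (bucket) bucket.push(provider);
    else index.set(provider.contractId, [provider]);
  }
  return index;
}

function runExactMatch(consumers: Contract[], providers: Contract[]): CrossLink[] {
  const index = buildProviderIndex(providers);
  const links: CrossLink[] = [];
  for (const consumer of consumers) {
    for (const provider of index.get(consumer.contractId) ?? []) {
      links.push({ from: consumer, to: provider });
    }
  }
  return links;
}

const crossLinks = runExactMatch(
  [{ repo: "vmap", contractId: "include::common/types.h", file: "src/controller.cpp" }],
  [
    { repo: "iface-a", contractId: "include::common/types.h", file: "common/types.h" },
    { repo: "iface-b", contractId: "include::common/types.h", file: "common/types.h" },
  ],
);
console.log(crossLinks.length); // 2
```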

Lowercasing caveat: Documented in a comment block at normalizeIncludePath. Two distinct headers Foo.h and foo.h on Linux produce the same contract ID; dedupe() keeps first-seen. Test covers this documented behavior. Acceptable trade-off for cross-platform matching. ✓


Group sync / bridge / impact assessment

IncludeExtractor registered at sync.ts:105, gated at sync.ts:174 (if (config.detect.includes)). Order correct — runs after HTTP/gRPC/thrift/topic. ✓

writeBridge is called after writeContractRegistry in the same if (opts?.groupDir && !opts.skipWrite) block, wrapped in try/catch with logger.warn. A bridge failure no longer rejects syncGroup. ✓
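The error-isolation pattern can be sketched as follows. persistGroupState and its callback parameters are hypothetical, and the example is synchronous for brevity; a real implementation may well be asynchronous.

```typescript
// Hypothetical stand-in for the write step at the end of a sync.
function persistGroupState(
  writeRegistry: () => void,
  writeBridge: () => void,
  warn: (message: string) => void,
): void {
  // Registry failures still propagate: contracts.json is the source of truth.
  writeRegistry();
  try {
    writeBridge();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`bridge write failed (${reason}); the registry was written, re-run sync to rebuild the bridge`);
  }
}

const writeCounts = { registry: 0 };
const bridgeWarnings: string[] = [];
persistGroupState(
  () => { writeCounts.registry += 1; },
  () => { throw new Error("disk full"); },
  (message) => { bridgeWarnings.push(message); },
);
console.log(writeCounts.registry, bridgeWarnings.length); // 1 1
```

The asymmetry is deliberate: a bridge failure is downgraded to a warning, while a registry failure is not caught at all.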

retryRename in storage.ts (commit a462feb) hardens the contracts.json atomic rename against Windows AV scanner EPERM — same pattern used by bridge-db.ts at six call sites. ✓
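The bounded-retry idea behind both retryRename and cleanupTempDir can be sketched generically. withBoundedRetries is a hypothetical helper, not the signature of either function; the set of transient codes is taken from the commit messages, and the delay a real helper would normally insert between attempts is omitted.

```typescript
// Error codes the commit messages associate with transient Windows file locks.
const TRANSIENT_CODES = new Set(["EPERM", "EBUSY", "EACCES", "ENOTEMPTY"]);

// Hypothetical helper: run an operation, retrying only on transient codes.
function withBoundedRetries<T>(operation: () => T, maxAttempts: number): T {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation();
    } catch (err) {
      const code = (err as { code?: string } | null)?.code;
      if (code === undefined || !TRANSIENT_CODES.has(code)) throw err; // not a lock race
      lastError = err;
    }
  }
  throw lastError;
}

let renameAttempts = 0;
const renameResult = withBoundedRetries(() => {
  renameAttempts++;
  if (renameAttempts < 3) throw Object.assign(new Error("locked"), { code: "EPERM" });
  return "renamed";
}, 3);
console.log(renameResult, renameAttempts); // renamed 3
```

Errors outside the transient set are rethrown on the first attempt, so a genuine failure such as ENOENT is not hidden behind retries.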

Integration tests (include-extractor-sync.test.ts): three tests — mock contracts via extractorOverride, normalization round-trip, and real two-repo extraction → matching pipeline. All use skipWrite: true (no bridge file written), focusing on the CrossLink generation path. The bridge write path is tested indirectly via other sync tests; the bridge itself is the pre-existing writeBridge mechanism trusted for all extractor types.

Impact direction: from = consumer (source file), to = provider (header) — correct for downstream queries: "if this header changes, what source files are affected?" ✓ group impact traversal uses the bridge.lbug graph; include CrossLinks follow the same bridge path as HTTP/gRPC CrossLinks. ✓

Idempotency: Each sync completely replaces the registry — running twice is deterministic. ✓


Performance / scalability assessment

canExtract() returns true for all repos but is never called by sync.ts (config.detect.includes is the gate, and it defaults to false). Non-C++ groups that explicitly set detect.includes: true would pay for discoverIndexableFiles() once per repo. The glob is bounded by createIgnoreFilter and the file size cap. Acceptable.

discoverIndexableFiles uses sequential stat (documented in the MAINTENANCE comment). For a 10K-file repo this adds O(N) stat syscalls sequentially. Slower than the batched Promise.allSettled in walkRepositoryPaths, but acceptable for group sync which runs at startup/on-demand. ✓
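A simplified picture of the stat pass, including the narrowed ENOENT handling described in the commit history, might look like this. filterIndexable is a hypothetical function; the injected statSize callback stands in for a filesystem stat call, and skipping a file after a non-ENOENT warning is an assumption.

```typescript
// Hypothetical filter: keep files under the size cap, staying silent only on ENOENT.
function filterIndexable(
  files: string[],
  statSize: (file: string) => number, // stand-in for an fs stat call
  maxBytes: number,
  warn: (message: string) => void,
): string[] {
  const kept: string[] = [];
  for (const file of files) {
    try {
      if (statSize(file) <= maxBytes) kept.push(file);
    } catch (err) {
      const code = (err as { code?: string } | null)?.code;
      // ENOENT is the benign race where a file vanishes between glob and stat.
      if (code !== "ENOENT") warn(`stat failed for ${file}: ${code ?? "unknown error"}`);
      // Assumption: the file is skipped in either case.
    }
  }
  return kept;
}

const fakeSizes: Record<string, number> = { "a.h": 100, "big.h": 10000 };
const fakeErrors: Record<string, string> = { "gone.h": "ENOENT", "locked.h": "EACCES" };
const fakeStat = (file: string): number => {
  const code = fakeErrors[file];
  if (code !== undefined) throw Object.assign(new Error(code), { code });
  return fakeSizes[file] ?? 0;
};

const statWarnings: string[] = [];
const indexable = filterIndexable(
  ["a.h", "big.h", "gone.h", "locked.h"],
  fakeStat,
  1024,
  (message) => { statWarnings.push(message); },
);
console.log(indexable, statWarnings.length); // ["a.h"] 1
```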

Parser reuse: one new Parser() per extractConsumers() call, queryCache per language. ✓ Suffix index built once per repo, O(N) construction, O(1) per lookup. ✓


Security / path safety assessment

discoverIndexableFiles is scoped to cwd: repoPath with nodir: true — no path traversal. Provider paths are repo-relative via glob output. Consumer contract IDs are derived from the include string, not from filesystem paths. ../ includes are filtered before any contract emission. ✓

extractProvidersGraph now applies path.relative(normalizedRepoPath, absolute) and explicitly drops rows where rel.startsWith('..') or path.isAbsolute(rel) — guarding against stale absolute paths from another machine. ✓

No absolute local paths appear in contract IDs or registry output. ✓

isLocalInclude operates only on strings (no filesystem operations on include paths). ✓

Hidden Unicode: U+2026 (horizontal ellipsis) and U+2014 (em dash) appear in comments only, not in executable code, regex patterns, contract ID construction, or test strings. No bidi controls. ✓


Test assessment

Test area Status
Provider .h/.hpp/.hxx/.hh
Provider ignores .cpp
Consumer emits unresolved includes
Local include (exact full-path) suppression
System headers (quoted) filtered
System path prefix filtered
Angle-bracket filtered
Provider/consumer round-trip normalization
normalizeContractId (backslash, case, ./, //)
Suffix ambiguity — cross-repo not suppressed by shorter local suffix ✅ — 3 tests
Block comment false positive (regex fallback)
meta.source reflects path ✅ — 2 tests
Case-folding collision documented
../ forward-slash and backslash skip ✅ — 2 tests
Graph provider absolute-path stripping ✅ — 2 tests
File discovery honors .gitignore
File discovery honors .gitnexusignore
File discovery honors GITNEXUS_MAX_FILE_SIZE (provider + consumer) ✅ — 2 tests
detect.includes: false gate
syncGroup → CrossLink integration ✅ — 3 integration tests
normalizeContractId config parser tests
Macro includes #include HEADER_NAME ❌ Missing
.inc/.ipp/.tpp/.inl explicitly out of scope ✅ by omission (not in SOURCE_EXTENSIONS)

GitHub Advanced Security / hidden Unicode assessment

GHAS/CodeQL: All 4 CodeQL findings were resolved in fcc4319 (HEADER_GLOB unused variable removed, 3 insecure temp file patterns replaced with fs.mkdtempSync). CodeQL is ✅ green on latest head. The integration test file uses fs.mkdtempSync correctly — no new insecure temp file pattern introduced. ✓

Hidden Unicode: No bidi controls (\u202a–\u202e, \u2066–\u2069) in any changed file. U+2026 (HORIZONTAL ELLIPSIS) and U+2014 (EM DASH) appear in code comments only, not in executable code, regex literals, contract IDs, or test fixture strings. These are conventional typographic characters in prose comments and do not affect runtime behavior. Acceptable under the repo's existing non-ASCII-in-comments tolerance. ✓


Elegance / maintainability assessment

The extractor follows the ContractExtractor interface cleanly and consistently with HttpRouteExtractor, GrpcExtractor, ThriftExtractor, and TopicExtractor. The dual strategy (graph-assisted providers → filesystem fallback) matches the existing pattern. discoverIndexableFiles is well-commented with a MAINTENANCE note explaining why duplication with walkRepositoryPaths is tolerated and what the two implementations must agree on.

SOURCE_EXTENSIONS = new Set([...HEADER_EXTENSIONS, ...]): the spread is elegant, since a future contributor adding a new extension to HEADER_EXTENSIONS gets it in SOURCE_EXTENSIONS automatically.

stripBlockComments and isLocalInclude are small, focused helpers with inline references to the review findings that motivated them. Comments explain non-obvious semantics without being noise.

One design note: The extractProvidersGraph method has a complex guard chain (rel.startsWith('..') || path.isAbsolute(rel)) that exists only because the DB stores absolute paths. If extractProvidersGraph ever becomes the primary path (currently it's always overridden by the filesystem fallback for non-indexed repos), the regex '.*\\\\.(h|hpp|hxx|hh)$' embedded in the Cypher query string should be verified on both Linux and Windows DB paths. The current double-backslash escaping in the JS template literal produces the correct Cypher regex .*\\.(h|...)$ (escaped dot), but this is subtle and worth a unit test if the graph path becomes primary.

Future extensibility: adding -I include directory awareness, conditional compilation branch filtering, or .inc/.ipp support would fit naturally without architectural changes. ✓


Final verdict

✅ production-ready with minor follow-ups

All previous blockers from the earlier review rounds are resolved:

  • Windows CI ENOTEMPTY fixed via cleanupTempDir in sync test (fadbb32).
  • Graph provider absolute-path bug fixed via path.relative() stripping (fadbb32).
  • Syntax error in manifest-extractor.ts fixed (3f5d21c).
  • CodeQL findings fixed (fcc4319).
  • suffixResolve ambiguity replaced with exact full-path matching (298c067).
  • Block comment false positive in regex fallback fixed (298c067).
  • meta.source hardcoding fixed (298c067).
  • detect.includes: false default (opt-in) corrected (a9936a9).
  • File discovery aligned with ingestion pipeline (a9936a9).

The remaining issues are all LOW or MEDIUM in severity. The macro include false positive (orphaned consumer contracts) adds registry noise but never produces false cross-links — it is the single actionable follow-up worth addressing before a public release. The detect.includes: false default means the feature is opt-in, so existing users are not affected until they explicitly enable it. With a fresh CI run confirming the Windows runner is clean on the latest commits, this PR meets the DoD for an initial release of the IncludeExtractor feature.

Recommended follow-up (not blocking merge): Add a one-line guard in extractConsumers to skip macro-style includes (if (!/[./\\]/.test(cleaned)) continue;) and a covering unit test.

magyargergo and others added 5 commits May 9, 2026 06:57
Tree-sitter's `(_) @import.source` wildcard matches the identifier node
of `#include PLATFORM_HEADER`, so the cleaned value `PLATFORM_HEADER`
slipped past the system-header / `..` filters and was emitted as a
permanently orphaned consumer contract (no file is named after a macro
identifier, so no provider can ever match). Add a shape guard that
skips cleaned values lacking both a path separator and an extension
dot, plus regression tests for single and multi-macro files.

Also document `IncludeExtractor.canExtract()` as unused by sync.ts
(gated via `config.detect.includes` instead) and kept solely for
ContractExtractor interface uniformity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@magyargergo

Collaborator

@SZU-WenjieHuang fyi, for backward compatibility this feature is turned off in the default config.

@magyargergo magyargergo merged commit 32b5c0e into abhigyanpatwari:main May 9, 2026
24 of 25 checks passed
@SZU-WenjieHuang

Contributor Author

@magyargergo Got it. Thanks for the clarification:)



Development

Successfully merging this pull request may close these issues.

[group] Add IncludeExtractor for C/C++ cross-repo #include dependency tracking

3 participants