test#4
Closed
zander-raycraft wants to merge 3 commits into
Closed
Conversation
calm fix 4 adding skills to repo [ISSUE abhigyanpatwari#140]
Owner
Author
|
@claude can yo review this |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 2843 tests passed 20 test(s) skipped — expand for detailsIntegration:
Unit:
Code CoverageCombined (Unit + Integration)
Coverage breakdown by test suiteUnit Tests
Integration Tests
📋 View full run · Generated by CI |
Owner
Author
|
@claude can you review this |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
Owner
Author
|
@claude can you look at this |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
zander-raycraft
pushed a commit
that referenced
this pull request
Apr 6, 2026
…gyanpatwari#498) * feat: add COBOL language support with regex extraction pipeline Standalone COBOL processor following the markdown-processor.ts pattern: - No LanguageProvider modification — COBOL uses regex, not tree-sitter - No SupportedLanguages enum change — standalone processor pattern New files: - cobol-processor.ts — orchestrator (processCobol, isCobolFile, isJclFile) - cobol/cobol-preprocessor.ts — regex state machine extraction (~888 LOC) - cobol/cobol-copy-expander.ts — COPY statement expansion with circular detection - cobol/jcl-parser.ts — JCL job/step/DD extraction - cobol/jcl-processor.ts — JCL graph node creation Extraction produces: - Module nodes (PROGRAM-ID) - Function nodes (paragraphs) - Namespace nodes (sections) - Property nodes (data items) - CALLS edges (PERFORM intra-file, CALL cross-program) - IMPORTS edges (COPY statements) - CONTAINS edges (section → paragraph hierarchy) Pipeline integration: single processCobol() call in Phase 2.6 54 new tests (33 COBOL + 21 JCL), all 3889 tests pass. * docs: document custom processor pattern in pipeline.ts Add comment block at the custom processor integration point documenting the pattern for future non-tree-sitter language additions. * feat(cobol): enrich graph with EXEC SQL/CICS, ENTRY points, MOVE data flow, PERFORM THRU Maps the remaining 60% of CobolRegexResults to the graph: - EXEC SQL blocks → CodeElement nodes + ACCESSES edges to DB tables - EXEC CICS LINK/XCTL → CodeElement nodes + cross-program CALLS edges - ENTRY points → Constructor nodes (registered for cross-program resolution) - MOVE statements → ACCESSES edges (read/write data flow tracking) - PERFORM THRU → expanded CALLS edges for range targets - File declarations → Record nodes with assignment metadata - Cross-program CALL 2nd pass: resolves unresolved targets after all programs processed * test(cobol): add 26 integration tests with exact assertions + fix CICS resolution bug Integration tests (test/integration/resolvers/cobol.test.ts): - 26 tests covering full COBOL system extraction - ALL assertions use exact toBe(N) — zero fuzzy assertions - Fixtures: CUSTUPDT.cbl, AUDITLOG.cbl, CUSTDAT.cpy, RPTGEN.cbl, RUNJOBS.jcl Bug fix (cobol-processor.ts): - CICS LINK/XCTL cross-program resolution was broken — edges were created with "resolved" reason but pointing to <unresolved> targets - Fix: use cics-link-unresolved / cics-xctl-unresolved suffix pattern matching the existing cobol-call-unresolved pattern - Second-pass resolver now patches both CALL and CICS unresolved edges All 3915 tests pass, 0 failures. * test(cobol): exhaustive 57-test suite with strict exact assertions Complete rewrite of COBOL integration tests using ground-truth approach: dump the full graph, then assert EVERY node and EVERY edge. 57 tests across 9 sections: - Node completeness: Module(3), Function(13), Namespace(2), Property(21), Record(1), CodeElement(8), Constructor(1) — exact sorted arrays - Edge completeness: 22 tests covering every type+reason combination with exact source→target pairs - Cross-program resolution: 6 tests verifying CALL, CICS LINK/XCTL, JCL - COPY expansion: copybook data items in RPTGEN - Section hierarchy: exact paragraph membership per section - Data item ownership: exact per-module breakdown - MOVE data flow: exact read/write pairs - JCL integration: job/step/dataset containment - Grand totals: CALLS(22), CONTAINS(48), IMPORTS(1), ACCESSES(7) Fixture enhancements: - CUSTUPDT.cbl: added INIT-SECTION + PROCESSING-SECTION, PERFORM THRU - AUDITLOG.cbl: added ENTRY "AUDITLOG-BATCH" - RPTGEN.cbl: added EXEC CICS XCTL Zero fuzzy assertions — every expect uses toBe(N) or toEqual([...sorted]). * fix(cobol): add removeRelationship API + single-quote CALL/COPY/ENTRY, PERFORM keyword skip Phase 0A: Add removeRelationship(id) to KnowledgeGraph interface and implementation (trivial Map.delete wrapper). Required for orphan edge cleanup in next commit. Phase 1A (from PR abhigyanpatwari#500 review, modified): - RE_CALL and RE_COPY_QUOTED now match both "double" and 'single' quotes - parseSingleCopyStatement in copy-expander updated for single quotes - PERFORM_KEYWORD_SKIP set prevents UNTIL/VARYING/WITH/TEST/FOREVER from being stored as false-positive perform targets - Sequence number stripping uses /[^0-9 ]/ (preserves numeric seq numbers unlike PR abhigyanpatwari#500's /\S/ which stripped them) - Normalized || to ?? for regex group extraction in copy-expander 5 new graph unit tests, all 57 COBOL integration tests pass. * fix(cobol): RE_ENTRY single-quote + remove orphan unresolved CALLS edges Phase 1B: RE_ENTRY regex now supports both "double" and 'single' quoted ENTRY targets. Uses named intermediates (entryName, usingClause) with ?? operator. USING capture group shifted from [2] to [3]. Phase 1C: Second-pass resolution now collects resolved orphan edge IDs during iteration and removes them after the loop completes, using the new graph.removeRelationship() API. Graph no longer contains phantom <unresolved>: edges alongside their resolved replacements. CALLS count drops from 22 to 18 (4 orphan edges removed). * fix(cobol): Property ID collisions + O(1) Map lookup for MOVE edges Phase 1D+3C (atomic): Property node IDs now use composite key filePath:section:level:name instead of filePath:name. This prevents duplicate data item names in different sections (e.g., STATUS in both WORKING-STORAGE and LINKAGE) from silently colliding. New generatePropertyId() helper ensures both node creation and MOVE edge lookup use the identical key formula. buildDataItemMap() replaces the O(n) findDataItemNode linear scan with O(1) Map lookup, built once per file before MOVE processing. * feat(cobol): MOVE multi-target extraction with OF/IN qualifier filtering MOVE X TO A B C now produces write edges for all targets, not just the first. extractMoveTargets() helper handles OF/IN qualified names (WS-NAME OF WS-RECORD -> target is WS-NAME), subscript stripping (WS-TABLE(I) -> WS-TABLE), and MOVE_SKIP filtering on targets. Data model: CobolRegexResults.moves.to:string -> targets:string[] MOVE CORRESPONDING stays single-target per COBOL standard. Processor MOVE loop now iterates move.targets. * feat(cobol): COPY IN/OF library, pseudotext REPLACING, dynamic CALL, PERFORM TIMES, CICS MAP unquoted Phase 2B: COPY ... IN/OF library-name now captured as metadata in CopyResolution (IN and OF are synonyms per COBOL-85 standard). Phase 2C: COPY REPLACING ==pseudotext== support. Tokenizer handles ==...== delimiters alongside "quoted" strings. Pseudotext forces EXACT type. Two-pass applyReplacing: first pass handles space-containing/ non-identifier pseudotext via global string replace; second pass handles identifier-level LEADING/TRAILING/EXACT. New test file cobol-copy-expander.test.ts with 10 tests. Phase 2E: PERFORM WS-COUNT TIMES no longer produces a false-positive perform target (checks for TIMES keyword after captured identifier). Phase 2F: Dynamic CALL via data item (CALL WS-PROG-NAME without quotes) now emits a CodeElement annotation node with description 'dynamic-call' instead of silently ignoring. Adds isQuoted:boolean to call results. Phase 3A: CICS MAP(WS-MAP-NAME) unquoted identifiers now captured. Phase 3B: Normalized || to ?? in copy-expander (done in Phase 1A). * feat(cobol): nested program support — capture multiple PROGRAM-IDs per file Phase 2D: The state machine now captures all PROGRAM-IDs, not just the first. The primary program name stays in programName; additional nested programs go into nestedPrograms[]. The processor creates separate Module nodes for each nested program, contained by the outer module, and registers them in moduleNodeIds for cross-program CALL resolution. Paragraphs/data items are not yet scoped per-program (attributed to the outer module) — full per-program scoping is a future enhancement that requires END PROGRAM boundary tracking in the state machine. * test(cobol): expand integration tests for all new language features New fixtures: - NESTED.cbl — two PROGRAM-IDs (OUTER-PROG, INNER-PROG) for nested program support testing - COPYLIB.cpy — copybook for pseudotext REPLACING test target Modified fixtures: - CUSTUPDT.cbl — single-quoted ENTRY 'ALTENTRY', multi-target MOVE (WS-AMT TO FIELD-A FIELD-B), dynamic CALL WS-PROG-NAME, COPY COPYLIB with pseudotext REPLACING, LINKAGE SECTION with LS-PARAM - RPTGEN.cbl — PERFORM WS-COUNT TIMES (false-positive guard), unquoted MAP(WS-MAP-NAME), additional data items WS-COUNT WS-MAP-NAME Integration test rewritten with 62 exact assertions covering: - 5 Module, 17 Function, 33 Property, 9 CodeElement, 2 Constructor nodes - Nested program containment (OUTER-PROG -> INNER-PROG) - Dynamic CALL annotation (CodeElement with cobol-dynamic-call) - Multi-target MOVE (UPDATE-BALANCE: 2 reads, 3 writes) - Single-quoted ENTRY (ALTENTRY under CUSTUPDT) - PERFORM TIMES guard (WS-COUNT not in CALLS) - Orphan unresolved edge removal (zero -unresolved edges) - Grand totals: 21 CALLS, 68 CONTAINS, 2 IMPORTS, 10 ACCESSES * fix(cobol): pseudotext REPLACING now applies correctly via isPseudotext flag Root cause: ==PREFIX-== matched /^[A-Z][A-Z0-9-]*$/i (trailing hyphens allowed), routing it to the second-pass EXACT identifier match where PREFIX-RECORD !== PREFIX- failed silently. Fix: Propagate isPseudotext from parseReplacingClause to CopyReplacing interface, then use it in applyReplacing first-pass condition to force global string replacement for all pseudotext entries regardless of whether the content looks like an identifier. Result: COPY COPYLIB REPLACING ==PREFIX-== BY ==WS-==. now correctly transforms PREFIX-RECORD → WS-RECORD, PREFIX-CODE → WS-CODE, etc. * refactor(cobol): per-program scoping via boundary tracking + line-range grouping State machine changes (minimal, ~30 lines): - Add RE_END_PROGRAM regex for END PROGRAM program-name. detection - Replace nestedPrograms[] with programs[] containing startLine/endLine/ nestingDepth metadata for each PROGRAM-ID in the file - Reset division/section/paragraph state on new PROGRAM-ID boundary - EOF finalization flushes remaining stack entries (single-program files) - Programs sorted by startLine (outer before inner) Processor changes: - Uses programs[] with line-range containment to find enclosing parent Module for nested programs (replaces hardcoded nestedParent logic) - programModuleIds Map tracks Module node IDs per program name Fixture: NESTED.cbl now includes END PROGRAM lines for both programs. Integration test: PREFIX-* Property nodes now correctly appear as WS-* after the pseudotext REPLACING fix from the previous commit. * feat(cobol): free-format COBOL support (>>source free) Auto-detects >>SOURCE FREE directive in the first 500 chars and switches to free-format line processing: - No column-position rules (cols 1-6 are program text, not sequence area) - Comments use *> prefix instead of col 7 indicator - No continuation line indicator - Strip inline *> comments - Skip >>SOURCE directive lines preprocessCobolSource() skips col-1-6 stripping for free-format files. Paragraph/section regexes relaxed from fixed 7-space prefix to flexible whitespace with case-insensitivity (/^\s*([A-Z][A-Z0-9-]+)\.\s*$/i). EXCLUDED_PARA_NAMES expanded with COBOL verbs (GOBACK, END-READ, etc.) to prevent false-positive paragraph detection in free-format. Also fixes: entry-point-scoring.ts crash when language is 'cobol' (MERGED_ENTRY_POINT_PATTERNS[language] was undefined → optional chaining). Benchmark on ACAS 3.01 (268 GnuCOBOL free-format programs, 10MB): - Before: 407 nodes, 393 edges (near-empty, only file nodes) - After: 4,297 nodes, 3,612 edges, 542 clusters, 11 flows * fix(cobol): relax data item regexes for free-format (^\s+ to ^\s*) RE_FD, RE_DATA_ITEM, RE_ANONYMOUS_REDEFINES, and RE_88_LEVEL all used ^\s+ which requires at least 1 leading space. In free-format mode, lines are trimmed before processing, so data items like "01 WS-FIELD PIC X." have no leading whitespace after trimming. Changed to ^\s* (zero or more spaces) which works for both fixed-format (indented lines still have spaces) and free-format (trimmed lines). ACAS benchmark (268 GnuCOBOL programs): - Before: 4,297 nodes, 3,612 edges (paragraphs only) - After: 13,832 nodes, 8,615 edges (+ data items, FDs, 88-levels) * feat(cobol): 100% structural feature coverage — GO TO, SCREEN, SD/RD, SORT, SEARCH, CANCEL, Level 66 New extractions: GO TO (CALLS edges), SCREEN SECTION data items, SD/RD alongside FD (Record nodes), SORT/MERGE USING/GIVING (ACCESSES), SEARCH (ACCESSES), CANCEL (CALLS), Level 66 RENAMES (Property), IS EXTERNAL/IS GLOBAL (Property description enrichment). ACAS: 13,951 nodes | 13,193 edges | 685 clusters | 150 flows (+53% edges from new GO TO/SORT/SEARCH/CANCEL extractions) * feat(cobol): enriched CICS extraction — file I/O, dynamic PROGRAM, queues, HANDLE ABEND EXEC CICS blocks now extract: - FILE/DATASET clause: captures VSAM file name (literal or data item ref) for READ/WRITE/REWRITE/DELETE/STARTBR/READNEXT/READPREV → ACCESSES edges - PROGRAM clause: now handles unquoted variable references (dynamic CICS program transfer) → CodeElement annotation with cics-dynamic-program reason - QUEUE clause: captures TS/TD queue names from WRITEQ/READQ → ACCESSES edges - LABEL clause: captures HANDLE ABEND error handler targets → CALLS edges - TRANSID: now handles unquoted variable references CodeElement descriptions enriched with all captured fields (map, program, transid, file, queue, label). CardDemo benchmark: +49 nodes, +33 edges from enriched CICS extraction. * feat(cobol): complete CICS command extraction — all 7 expert recommendations From COBOL expert agent analysis: 1. ENDBR added to isRead file command list 2. LOAD added to PROGRAM edge commands (alongside LINK/XCTL) 3. Two-word commands expanded: WRITEQ/READQ/DELETEQ TS/TD, HANDLE ABEND/AID/CONDITION, START TRANSID 4. Queue reason differentiated: cics-queue-read/-write/-delete 5. RETURN/START TRANSID → CALLS edges to synthetic <transid> target 6. MAP → ACCESSES edges for screen traceability 7. INTO/FROM data fields extracted → ACCESSES edges to data items Also: dataItemMap built before CICS block processing (was declared after), CodeElement descriptions enriched with all captured CICS fields. * test(cobol): strict exhaustive integration tests with exact edgeSet assertions Every edge reason has exact sorted pair assertions via edgeSet(), not just counts. Any change to extraction that adds, removes, or reorders edges will produce a precise, descriptive failure. Updated RPTGEN.cbl fixture with: - GO TO EXIT-PARAGRAPH, SORT USING/GIVING, SEARCH table - EXEC CICS READ FILE INTO, WRITEQ TS QUEUE FROM, SEND MAP FROM - EXEC CICS HANDLE ABEND LABEL, RETURN TRANSID, XCTL PROGRAM(variable) - ABEND-HANDLER and EXIT-PARAGRAPH paragraphs 46 tests covering 24 CALLS + 79 CONTAINS + 18 ACCESSES + 2 IMPORTS edges across 15 distinct edge reason codes, all with exact sorted pair lists. * fix(cobol): address 5 findings from second Claude review (compiler front-end perspective) Finding #2: Numeric sequence numbers now stripped (changed /[^0-9 ]/ to /\S/ in preprocessCobolSource). Lines like "000100 MAIN-PARAGRAPH." now have cols 1-6 blanked so paragraph regex matches correctly. Finding #11: JCL in-stream PROC ordering fixed — pre-register all PROCs into moduleNames before step processing. Steps that EXEC a PROC defined later in the same file now get CALLS edges. Finding #A: PROCEDURE DIVISION USING no longer captures calling-convention keywords (BY, VALUE, REFERENCE, CONTENT, ADDRESS, OF) as parameter names. Finding #C: SORT/MERGE USING/GIVING now captures ALL file references (multi-file), not just the first. Changed from single-match to section extraction with split. Finding #D: Section headers no longer set currentParagraph, preventing PERFORM caller misattribution to Namespace instead of Function nodes. * fix(cobol): address code review findings — ReDoS fix, perf, cleanup P1 CRITICAL — ReDoS in SORT USING/GIVING: Replaced nested-quantifier regex with safe indexOf+substring+split approach. No backtracking possible on crafted input. P2 — readCopy O(M) linear scan: Added copybookByPath reverse Map for O(1) path-to-content lookup. P3 — Dead code removal: Deleted unused RE_SORT_USING and RE_SORT_GIVING constants. P3 — EXCLUDED_PARA_NAMES simplification: Replaced 20 END-* entries with startsWith('END-') prefix check. Auto-covers future END-* verbs. P3 — Misplaced JSDoc on removeRelationship: Fixed comment that described removeNodesByFile instead. Added missing JSDoc to removeNodesByFile. Review agents: architecture-strategist, performance-oracle, security-sentinel, code-simplicity-reviewer. * refactor: add Cobol to SupportedLanguages with parseStrategy: standalone New languages/cobol.ts — standalone regex processor provider with no-op tree-sitter fields. Declares parseStrategy: 'standalone' to distinguish from tree-sitter-based languages. Added parseStrategy: 'tree-sitter' | 'standalone' to LanguageProviderConfig for languages that use their own processor instead of tree-sitter. Removed all 11 'cobol' as any casts — now uses SupportedLanguages.Cobol. Added empty Cobol entries to entry-point-scoring and framework-detection. * fix(cobol): 5 fixes from third Claude review + 3 regression tests Fixes: - Line numbers now 1-indexed in fixed-format (was 0-indexed, off-by-one in jump-to-definition links) - Copybook content preprocessed before COPY expansion (sequence numbers and patch markers in copybooks no longer survive into expanded source) - ENTRY USING filters calling-convention keywords (BY, VALUE, REFERENCE, CONTENT, ADDRESS, OF) — same fix as PROCEDURE DIVISION USING - SORT/MERGE trailing period stripped from USING/GIVING file tokens - Paragraph exclusion uses exact match for SECTION/DIVISION (was substring match that excluded valid names like CROSS-SECTION-ANALYSIS) USING_KEYWORDS moved to module scope for reuse by both PROCEDURE DIVISION USING and ENTRY USING handlers. New unit tests: - ENTRY USING BY VALUE filtering - Paragraph names containing SECTION not excluded - Numeric sequence numbers stripped enabling paragraph detection * fix(cobol): address 6 findings from fourth Claude review + tests Fourth review findings fixed: - New #IV: PERFORM TIMES guard uses perfMatch.index instead of line.indexOf (prevents wrong match when target appears earlier in line) - New #V: 88-level condition values now handle single-quoted literals ('Y' no longer stored with embedded quotes) - New #I: CANCEL edges use two-pass resolution like CALL (no longer silently dropped when target indexed after source) - New #3: Multi-line SORT/MERGE accumulation — sortAccum state variable accumulates lines until period, then extracts USING/GIVING from full statement (95% of production SORT statements span multiple lines) - New #II: PROCEDURE DIVISION USING on split lines — pendingProcUsing flag defers parameter capture to next line if USING not on same line - New #6 (prior): EXCLUDED_PARA_NAMES exact match for SECTION/DIVISION Updated fixture: RPTGEN.cbl SORT now uses multi-line format with GIVING on separate line (period-terminated). New sort-giving integration test. ACCESSES total: 18 → 19 (new sort-giving edge from multi-line capture). * fix(cobol): address 4 findings from fifth Claude review Finding #B (5 reviews old): Section/paragraph node IDs now include enclosing program name to prevent collision when nested programs share section/paragraph names. New findOwningProgramName() helper uses programs[] line ranges to find the innermost enclosing program. Finding #α: pendingProcUsing now reset in the if(procUsingMatch) branch (was only set in else branch, could leak across nested programs). Finding #β: RE_CALL_DYNAMIC uses negative lookbehind (?<![A-Z0-9-]) to prevent false-positive on compound identifiers like WS-CALL OCCURS. Finding #γ: sortAccum flushed at EOF (parallel to flushSelect and pendingFdName EOF cleanup). Prevents silent loss of SORT USING/GIVING relationships in truncated files. * fix(cobol): address findings from reviews 5+6 with full test coverage Review 5 fixes: - #α: pendingProcUsing reset in if(procUsingMatch) branch - #β: RE_CALL_DYNAMIC negative lookbehind prevents WS-CALL false positive - #γ: sortAccum flushed at EOF for truncated files - #B: Section/paragraph IDs include owning program name Review 6 fixes: - #P: sectionNodeIds/paraNodeIds maps use program-scoped keys (PROGNAME:NAME). New scopedParaLookup/scopedCallerLookup helpers. findContainingSection updated with programs parameter. - #Q: RETURNING added to USING_KEYWORDS for COBOL 2002+ - #R: RE_PERFORM matches both THRU and THROUGH via alternation New unit tests (6): - PERFORM THROUGH captures thruTarget - PROCEDURE DIVISION USING RETURNING filters keyword - RE_CALL_DYNAMIC no false-match on WS-CALL compound identifier - Multi-line SORT captures USING/GIVING from continuation lines - PROCEDURE DIVISION USING on split line via pendingProcUsing - Copybook preprocessing strips sequence numbers * fix(cobol): address findings from seventh Claude review + 3 tests Review 7 fixes: - #i: findContainingSection only updates best when lookup succeeds (prevents undefined overwriting valid parent section) - #ii: RE_PROC_SECTION handles segment numbers (SECTION 30.) - #III: procedureUsing now stored per-program on boundary stack entries, propagated to programs[] output. Inner programs no longer overwrite outer program's parameters. - #δ: Dynamic CANCEL (CANCEL variable) now creates CodeElement annotation node, matching dynamic CALL behavior. RE_CANCEL_DYNAMIC with negative lookbehind. cancels[] gains isQuoted field. - #Q: RETURNING added to USING_KEYWORDS (already in prev commit) - #R: PERFORM THROUGH already fixed (THRU|THROUGH alternation) New unit tests: - Nested programs carry per-program procedureUsing - SECTION with segment number detected - Dynamic CANCEL via data item captured with isQuoted=false * feat(cobol): link PROCEDURE DIVISION USING to LINKAGE data items + close 4 findings Finding #10 FIXED: procedureUsing parameters now create ACCESSES edges with reason 'cobol-procedure-using' from Module to matching LINKAGE SECTION Property nodes. This exposes the program's parameter contract in the graph (e.g., AUDITLOG → LS-CUST-ID, AUDITLOG → LS-AMOUNT). Findings closed by expert agent consensus: - #6 COPY IN library: WONTFIX — captured metadata, no universal library-to-directory mapping exists. Field costs nothing and is useful for library queries. - abhigyanpatwari#14 SQL DELETE: WONTFIX — DB2 requires FROM; existing FROM pattern handles it. Bare DELETE would risk false positives. - #E OCCURS DEPENDING ON: WONTFIX — runtime sizing concern, not structural. The static occurs count is sufficient for indexing. All 39 findings from 7 Claude reviews now resolved or closed. * fix(cobol): resolve 48 review findings across 9 review cycles Ninth deep review resolved all remaining COBOL parser gaps identified by 5 specialist agents (COBOL expert, architecture strategist, TypeScript reviewer, security sentinel, code simplicity reviewer). Fixes (P1 — critical): - SELECT OPTIONAL now correctly skips OPTIONAL keyword (C1) - RETURNING params excluded from PROCEDURE DIVISION USING list (C7) - SORT GIVING no longer captures clause keywords as file names (C5) - Extract flushSort() helper eliminating 40-line duplication (S2) - Flush unclosed EXEC blocks at EOF matching SORT/SELECT pattern (S3) - Guard undefined map key in jcl-processor moduleNames (S1) - Add MAX_TOTAL_EXPANSIONS=500 to prevent exponential COPY breadth (S4) Fixes (P2 — important): - Quote-aware stripInlineComment for | and *> in string literals (C2+C3) - Fixed-format literal continuation now handles quoted strings (C6) - PROGRAM-ID detected regardless of division state for siblings (C9) Fixes (P3 — cleanup): - EXEC SQL INTO restricted to INSERT INTO to avoid FETCH false-pos (C8) - Copy expander line numbers fixed from 0-based to 1-based (C11) - Remove dead code: inInStreamProc, fileIsLiteral, expansionDepth (S7-S10) Also fixes 8th-review findings: nested program CONTAINS attribution, multi-PERFORM on same line, INPUT/OUTPUT PROCEDURE IS in SORT, GO TO DEPENDING ON multi-target, MOVE CORR abbreviation, per-program procedureUsing ACCESSES edges. Tests: 145 COBOL tests passing (59 integration + 86 unit) Benchmarks: CardDemo 12,323 nodes/8,893 edges (7.4s) ACAS 14,016 nodes/15,452 edges (9.3s, -9% faster) * docs(cobol): update documentation for ninth review cycle fixes Update all 4 COBOL documentation files to reflect the 16 fixes from the ninth review cycle: - regex-extraction.md: quote-aware comment stripping, SELECT OPTIONAL, RETURNING exclusion, SORT_CLAUSE_NOISE filter, flushSort() helper, GO TO multi-target, PROGRAM-ID division-independent detection - copy-expansion.md: MAX_TOTAL_EXPANSIONS=500 breadth guard, 1-based line numbers, removed expansionDepth/warnedCircular param - deep-indexing.md: GO TO DEPENDING ON, INPUT/OUTPUT PROCEDURE IS, MOVE CORR edge reasons, INSERT INTO restriction, literal continuation - performance.md: updated benchmarks (CardDemo 12,323n/8,893e/7.4s, ACAS 14,016n/15,452e/9.3s), COPY breadth guard * fix(cobol): resolve 10th review findings — nested program edge attribution Fix 6 findings from the 10th review (PR abhigyanpatwari#498 comment #4132201110): #A+#F: All CALL/CANCEL/CICS/ENTRY/SQL/SEARCH/file-declaration edges now use owningModuleId() for nested program attribution instead of the outer program's parentId. Added helper function owningModuleId() to centralize the pattern. #B: Added USING and GIVING to SORT_CLAUSE_NOISE set to prevent MERGE USING + OUTPUT PROCEDURE from capturing clause keywords as file names. #C: INPUT/OUTPUT PROCEDURE regex now captures optional THRU/THROUGH range end paragraph, mirroring RE_PERFORM's THRU support. #D: scopedCallerLookup fallback now uses programModuleIds.get(pgm) instead of parentId, so PERFORM/MOVE/GOTO in nested programs with unresolvable paragraphs fall back to the correct inner module. #E: pendingProcUsing only set when PROCEDURE DIVISION line is NOT period-terminated, preventing false USING expectation. Tests: 145 passing | TypeScript clean * fix(cobol): resolve 10th review findings — nested program edge attribution Fix 6 findings from the 10th review (PR abhigyanpatwari#498 comment #4132201110): #A+#F: All CALL/CANCEL/CICS/ENTRY/SQL/SEARCH/file-declaration edges now use owningModuleId() for nested program attribution instead of the outer program's parentId. Added helper function owningModuleId() to centralize the pattern. #B: Added USING and GIVING to SORT_CLAUSE_NOISE set to prevent MERGE USING + OUTPUT PROCEDURE from capturing clause keywords as file names. #C: INPUT/OUTPUT PROCEDURE regex now captures optional THRU/THROUGH range end paragraph, mirroring RE_PERFORM's THRU support. #D: scopedCallerLookup fallback now uses programModuleIds.get(pgm) instead of parentId, so PERFORM/MOVE/GOTO in nested programs with unresolvable paragraphs fall back to the correct inner module. #E: pendingProcUsing only set when PROCEDURE DIVISION line is NOT period-terminated, preventing false USING expectation. Tests: 145 passing | TypeScript clean * fix(cobol): resolve 11th review findings — final nested program + multi-CALL gaps #1: scopedCallerLookup(null) now uses owningModuleId(lineNum) instead of parentId, fixing PERFORM/MOVE/GOTO before first paragraph in nested programs. #2+#3: CALL and CANCEL extraction now uses matchAll (global flag) to capture multiple occurrences on the same line. Dynamic CALL/CANCEL checked independently instead of in else branch. #4: SORT/MERGE ACCESSES edge IDs now use owningModuleId(sort.line) instead of parentId for nested program correctness. #5: preprocessCobolSource free-format detection now uses first 10 lines (consistent with extractCobolSymbolsWithRegex threshold). #6: EXCLUDED_PARA_NAMES expanded with DISPLAY, ACCEPT, WRITE, READ, REWRITE, DELETE, OPEN, CLOSE, RETURN, RELEASE, SORT, MERGE to prevent false-positive paragraph detection on isolated verbs. Also removed unused GraphNode import from cobol-processor.ts. Tests: 145 passing | TypeScript clean * docs(cobol): deepened full language coverage plan with research findings 3 research agents analyzed Phase 1-2 features and graph value ranking. Key findings: cobol-call-using is #1 edge type (9.2/10); multi-line accumulation is dominant challenge; DECLARATIVES is lowest-risk Phase 2 item; SET TO TRUE covers 80-90% of SET usage. * feat(cobol): implement Phase 1 — high-value data flow edges 4 new extraction features that create new ACCESSES and IMPORTS edges: 1.1: EXEC SQL INCLUDE -> IMPORTS edges with reason 'sql-include' Handles unquoted (SQLCA), quoted ('DBRMLIB.MEMBER'), and underscored (CUST_TBL_DCL) member names. 1.2: CALL USING parameter extraction -> ACCESSES edges Extracts parameters from CALL USING clause, filtering BY/REFERENCE/ CONTENT/VALUE/ADDRESS/OF/LENGTH/OMITTED keywords. Creates 'cobol-call-using' ACCESSES edges (graph value: 9.2/10). 1.4: OCCURS DEPENDING ON -> ACCESSES edges with reason 'cobol-depends-on' Extended OCCURS regex captures DEPENDING ON field with subscript stripping. Creates dependency edge from table to controlling field. 1.5: VALUE clause for standard data items Extracts VALUE from data item clauses: quoted strings with type prefix (X/N/G/B), ALL literals, numerics (incl negative/decimal), and figurative constants. Populates Property node values. Tests: 145 passing (+2 ACCESSES from CALL USING) | TypeScript clean * feat(cobol): implement Phase 2 — DECLARATIVES, SET, INSPECT, EXEC DLI 4 new extraction features for error handling, data flow, and IMS/DB: 2.1: EXEC DLI (IMS/DB) -> CodeElement + ACCESSES edges Accumulates EXEC DLI blocks like EXEC SQL. Parses DLI verbs (GU, GN, ISRT, REPL, DLET, CHKP, SCHD, TERM). Extracts SEGMENT, PCB, INTO/FROM, PSB. Creates dli-{verb} ACCESSES edges to <ims>:segment Record nodes. 2.2: DECLARATIVES / USE AFTER EXCEPTION -> ACCESSES edges Tracks inDeclaratives state. Detects USE AFTER STANDARD EXCEPTION ON file-name. Creates cobol-error-handler ACCESSES edge from handler section to file Record. 2.3: SET statement -> ACCESSES edges Detects SET TO TRUE (80-90% of SET usage) and SET index TO/UP BY/DOWN BY. Creates cobol-set-condition / cobol-set-index write edges + cobol-set-read for identifier values. 2.4: INSPECT -> ACCESSES edges with multi-line accumulator Accumulates INSPECT until period (like SORT). Extracts inspected field + tally counters. Creates cobol-inspect-read/write/tally edges. Form detection: tallying/replacing/converting/combined. Preprocessor: 1398 -> 1597 LOC (+199). Tests: 145 passing. * feat(cobol): implement Phase 3 — completeness fixes 6 partial features fixed to first-class support: 3.1: CALL RETURNING -> ACCESSES write edge (cobol-call-returning) 3.2: SELECT OPTIONAL flag preserved in FileDeclaration + Record node 3.3: ALTERNATE RECORD KEY extraction (matchAll for multiple keys) 3.4: COMMON attribute on nested programs (RE_PROGRAM_ID extended) 3.5: IS EXTERNAL / IS GLOBAL as first-class boolean properties (removed usage string hack) 3.6: AUTHOR / DATE-WRITTEN mapped to Module node description Tests: 145 passing | TypeScript clean * feat(cobol): implement Phase 4 — INITIALIZE + metadata completeness 4.1: INITIALIZE statement -> ACCESSES write edge (cobol-initialize) 4.2: DATE-COMPILED and INSTALLATION paragraphs extracted and mapped to Module node description alongside existing AUTHOR/DATE-WRITTEN All 4 plan phases complete. Coverage: ~95% (up from 71.9%). Tests: 145 passing | TypeScript clean * test(cobol): add 24 unit tests for Phase 1-4 features Coverage for all new extraction features: Phase 1 (8 tests): - EXEC SQL INCLUDE (unquoted, quoted, underscored) - CALL USING (simple, mixed modes, ADDRESS OF, OMITTED) - CALL RETURNING - OCCURS DEPENDING ON - VALUE clause (string, numeric, figurative constant) Phase 2 (10 tests): - EXEC DLI GU/ISRT/SCHD (verb, segment, PCB, INTO, FROM, PSB) - DECLARATIVES USE AFTER EXCEPTION (single + multiple sections) - SET TO TRUE, SET index UP BY - INSPECT TALLYING, INSPECT REPLACING Phase 3-4 (6 tests): - SELECT OPTIONAL flag - ALTERNATE RECORD KEY - PROGRAM-ID IS COMMON - IS EXTERNAL / IS GLOBAL booleans - INITIALIZE extraction - Full programMetadata (AUTHOR, DATE-WRITTEN, DATE-COMPILED, INSTALLATION) Total: 168 tests passing (145 + 24 - 1 removed duplicate) * fix(cobol): use /\r?\n/ split for Windows CRLF compatibility All 4 COBOL source files now split on /\r?\n/ instead of '\n' to handle CRLF line endings on Windows. Previously, trailing \r in lines caused RE_GOTO's $ anchor to fail on multi-line GO TO DEPENDING ON statements, producing only 1 goto edge instead of 4. Files fixed: cobol-preprocessor.ts (2 sites), cobol-processor.ts, jcl-parser.ts, cobol-copy-expander.ts Tests: 168 passing | TypeScript clean * fix(cobol): resolve 12th review — dynamic CALL/CANCEL dedup + trailing anchors #1+#2: Removed incorrect hasQuotedCall/hasQuotedCancel deduplication guards. RE_CALL_DYNAMIC and RE_CANCEL_DYNAMIC require [A-Z] after CALL/CANCEL, so they CANNOT match quoted targets — the guards were both unnecessary and actively harmful, suppressing dynamic CALL/CANCEL in ON EXCEPTION patterns. #3+#5: Changed RE_CALL_DYNAMIC and RE_CANCEL_DYNAMIC trailing anchor from (?:\s|\.) to (?=\s|\.|$) (lookahead). The consuming anchor failed when the identifier was the last token on a physical line. Tests: 168 passing | TypeScript clean * feat(cobol): add CALL accumulator + fix SORT double-statement (#4, #6) Finding #4: Multi-line CALL USING accumulator Added callAccum state variable that accumulates CALL statements spanning multiple physical lines until period or END-CALL is found. Uses flushCallAccum() to re-extract CALL target + USING parameters from the full accumulated statement. This fixes the silent loss of ACCESSES parameter edges when USING appears on lines after CALL. Finding #6: SORT double-statement on same line After flushSort(), the code now falls through to re-check the current line for a new SORT/MERGE start (was previously blocked by the sortAccum === null check evaluating before flushSort ran). Also fixed: used non-global regex for CALL detection test to avoid the classic global regex .test() lastIndex bug. Tests: 168 passing (+1 ACCESSES from multi-line CALL USING) * fix(cobol): resolve 13th review — CICS LOAD, USING extraction, file scoping #1: CICS LOAD unresolved edge no longer silently deleted in second pass. Changed narrow cics-link/cics-xctl check to catch-all pattern: rel.reason?.startsWith('cics-') && rel.reason.endsWith('-unresolved') #2: flushCallAccum USING extraction now stops before COBOL statement verbs (INSPECT, SEARCH, SORT, MERGE, DISPLAY, ACCEPT, MOVE, PERFORM, GO TO, CALL, IF, EVALUATE). Prevents absorbing adjacent statements as false USING parameters in legacy pre-COBOL-85 code without END-CALL. #3: CICS FILE Record nodes now globally-scoped (<cics-file>:FILENAME) instead of per-file-scoped. Enables cross-program CICS file access analysis, consistent with SQL table scoping (<db>:TABLE). #4: callAccum pre-check regex now has (?<![A-Z0-9-]) lookbehind to prevent false activation on compound identifiers like WS-CALL-FLAG. Tests: 168 passing | TypeScript clean * fix(cobol): resolve 14th review — callAccum false paragraph + Area A guard #1: callAccum continuation lines now check for COBOL statement verb starts (GO TO, PERFORM, MOVE, etc.) and paragraph/section headers. If detected, the CALL is flushed as-is and the line processed normally — prevents false paragraph detection and currentParagraph corruption from lines like "WS-ADDR." being treated as paragraphs. #4: callAccum pre-check now guarded by currentDivision === 'procedure' to prevent unnecessary activations in DATA DIVISION. #5: Fixed-format paragraph detection now rejects lines with >7 leading spaces (Area B indentation) as paragraph candidates. Paragraph names in fixed-format must start in Area A (col 8-11, max 7 spaces). Free-format mode is unaffected. Tests: 168 passing | TypeScript clean * fix(cobol): resolve 15th review — callAccum Area A + verb boundary fixes #A: Column-position-aware paragraph detection in callAccum flush. #B: inspectAccum early-flush on paragraph/section/verb headers. #C: Verb boundary \b → (?:\s|$) prevents MOVE-COUNT false flush. * test(cobol): add 17 edge-case regression tests + fix USING verb boundary 17 new tests covering all recurring review patterns: Multi-line CALL USING (7 tests): - Parameters on separate continuation lines (IBM mainframe style) - No absorption of INSPECT/GO TO/paragraphs following CALL - END-CALL scope terminator - Hyphenated identifiers (MOVE-COUNT) not triggering false flush - Dual quoted+dynamic CALL on same line (ON EXCEPTION) Nested program attribution (2 tests): - CALL in inner program within inner line range - PERFORM before first paragraph has null caller CRLF compatibility (1 test): - GO TO DEPENDING ON with \r\n line endings Area A paragraph detection (2 tests): - Area B (>7 spaces) rejected; Area A (7 spaces) accepted SORT/MERGE (1 test): COLLATING SEQUENCE keywords not captured PROCEDURE USING (2 tests): RETURNING excluded, period-terminated Comment stripping (1 test): pipe in quoted string preserved SELECT OPTIONAL (1 test): correct file name, not OPTIONAL keyword Bug fix: USING extraction regex verb terminators changed from \bVERB\b to \bVERB(?=\s|$) in flushCallAccum — prevents truncation on hyphenated identifiers like MOVE-COUNT, PERFORM-LIMIT. Total: 185 tests passing * test(cobol): add 32 comprehensive edge-case regression tests 13 new describe blocks covering all extraction features: - EXEC DLI: no-SEGMENT, multi-line accumulation (2 tests) - SET: multiple targets, DOWN BY, TO numeric (3 tests) - INSPECT: CONVERTING, multiple counters, tallying-replacing, paragraph flush during accumulation (4 tests) - DECLARATIVES: no-STANDARD keyword, I-O mode, post-END paragraphs (3) - COPY REPLACING: pseudotext deletion ==OLD== BY ==== (1 test) - VALUE: hex literal, negative numeric, ALL literal (3 tests) - OCCURS: TO range, fixed-size without DEPENDING ON (2 tests) - Dynamic CALL/CANCEL: end-of-line, multiple CANCELs (3 tests) - EXEC SQL: INCLUDE skips tables, SELECT INTO host vars, host variable extraction (3 tests) - INITIALIZE: target and caller context (1 test) - Nested programs: sibling scoping, PROGRAM-ID without ID DIV (2) - EXEC EOF flush: unclosed EXEC SQL flushed (1 test) - Multi-PERFORM: IF/ELSE dual PERFORM on single line (1 test) - IS EXTERNAL: USAGE not polluted by external flag (1 test) Total: 215 tests passing * fix(cobol): resolve 16th review — CANCEL in CALL block + USING boundary #1: flushCallAccum now extracts CANCEL statements from within CALL ON EXCEPTION blocks. Adds RE_CANCEL + RE_CANCEL_DYNAMIC matchAll passes alongside existing CALL extraction. #2: Added \bCANCEL(?=\s|$) to USING lookahead regex to prevent CANCEL keyword being captured as false USING parameter. #3: Multi-line CALL start now returns immediately to prevent the CALL start line from simultaneously feeding sortAccum/inspectAccum. #6: Division transitions now flush all active accumulators (callAccum, sortAccum, inspectAccum) to prevent state leakage across programs. Also added CANCEL to callAccum flush trigger verb list. Tests: 215 passing | TypeScript clean * refactor(cobol): extract shared verb constants + resolve 17th review Extract COBOL_STATEMENT_VERBS, RE_STATEMENT_VERB_START, and RE_USING_PARAMS as shared constants — eliminates 4 duplicated 25-verb regex patterns. 17th review: #1 flushCallAccum before EXEC entry, #2 inspectAccum verb parity via shared constant. Tests: 215 passing | TypeScript clean * test(cobol): replace all fuzzy assertions with exact toBe checks Replaced 7 toBeGreaterThan/toBeLessThan/toBeGreaterThanOrEqual assertions with exact toBe values: - dataItems.length: >= 3 → toBe(3) - calls.length: >= 1 → toBe(1) - calls[0].line: range check → toBe(10) - programs[].startLine/endLine: comparison → exact values - innerA.endLine/innerB.startLine: comparison → exact values Also added 11 new edge-case tests (accumulator flush on EXEC/division transitions, free-format, CANCEL in CALL block, SORT THRU, verb flush, integration). 226 tests passing — zero fuzzy assertions remain. * fix(cobol): resolve 19th review + 15 accumulator flush tests Fixes: #1: END PROGRAM flushes callAccum/sortAccum/inspectAccum #2: PROGRAM-ID sibling path flushes all accumulators #3: Added COMPUTE/ADD/SUBTRACT/MULTIPLY/DIVIDE/STRING/UNSTRING to COBOL_STATEMENT_VERBS (now 32 verbs) Tests (15 new): - END PROGRAM flush: single + nested programs (2) - PROGRAM-ID sibling flush (1) - Arithmetic verb flush: COMPUTE/ADD/SUBTRACT/MULTIPLY/DIVIDE (5) - String verb flush: STRING/UNSTRING (2) - Arithmetic not captured as false USING params (1) - SORT flushed at END PROGRAM (1) - INSPECT flushed at END PROGRAM (1) - All with exact toBe assertions (2) Total: 239 tests passing | Zero fuzzy assertions * fix(cobol): resolve 20th review — INITIALIZE multi-target + 2 tests Finding 1: INITIALIZE now captures multiple targets with REPLACING clause keyword filtering. Regex changed to lazy match stopping at REPLACING/WITH/period boundary. Targets split on whitespace and filtered against INITIALIZE_CLAUSE_KEYWORDS set. Tests (2 new): - INITIALIZE multi-target: WS-CUSTOMER WS-ORDER WS-LINE-ITEM → 3 - INITIALIZE with REPLACING: only WS-RECORD captured, not keywords Total: 241 tests passing | TypeScript clean
zander-raycraft
pushed a commit
that referenced
this pull request
May 7, 2026
…atwari#756) * Initial plan * feat(SM-13): extract resolveFreeCall from resolveCallTarget Extract the free-function call resolution path into a dedicated `resolveFreeCall(calledName, filePath, ctx)` function that uses `lookupExact` + import-scoped resolution via `ctx.resolve()`. - Free function calls (foo()) now route through `resolveFreeCall` - Swift/Kotlin implicit constructors (User()) delegate to `resolveStaticCall` within `resolveFreeCall` - `resolveCallTarget` dispatches `callForm === 'free'` early, removing the inline freeFormHasClassTarget logic - S0 block simplified to only handle `callForm === 'constructor'` - Global (Tier 3) fallthrough preserved via ctx.resolve() until Phase 5 - 9 new unit tests for resolveFreeCall - All 163 unit tests pass, all 1199 integration resolver tests pass Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/c5f2e73a-259a-438c-b5c8-286b82e3c215 Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com> * chore: revert unrelated package-lock.json change Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/c5f2e73a-259a-438c-b5c8-286b82e3c215 Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com> * fix(SM-13): address PR abhigyanpatwari#756 review findings on resolveFreeCall Addresses all 7 findings from the PR abhigyanpatwari#756 review comment. Code (R1, finding #1) - Replace the literal `'Class' | 'Struct' | 'Record'` check in `hasClassTarget` with `INSTANTIABLE_CLASS_TYPES.has(c.type)`. Converts an invariant that was previously comment-enforced ("keep this list aligned with INSTANTIABLE_CLASS_TYPES") into one enforced structurally. Any future extension of the set propagates here automatically. The narrower Swift extension dedup block below still uses literal `'Class' | 'Struct'` by design — Swift extensions only produce Class duplicates in practice, Record is deliberately excluded there, and the inline comment now documents that asymmetry. Tests (+12 regression scenarios) Finding #2 — language coverage - Go free function (doStuff()) - Python free function (def helper(): ... helper()) - Rust free function outside any impl block - Java statically-imported function - JavaScript module-level function Each exercises `_resolveCallTargetForTesting` with `callForm='free'` and the language-specific file extension. `resolveFreeCall` has no file-extension branching, so these guard the dispatch chain per language without assuming extractor-specific symbol shapes. Finding #3 — argCount threading - 2-arg overload selected when argCount=2 - 0-arg overload selected when argCount=0 Finding #5 — Tier 3 (global) resolution - Function globally visible but not imported. Asserts exact `TIER_CONFIDENCE.global === 0.5` and `reason === 'global'` to catch silent drift if the tier table is ever refactored. Finding #6 — preComputedArgTypes worker path - String overload matched via preComputedArgTypes=['String'] - Int overload matched via preComputedArgTypes=['int'] (lowercase, mirroring the parse-worker's inferred-literal shape; stored 'Int' is normalized via normalizeJvmTypeName at comparison time) Finding #7 — Enum null-route documentation - Enum-only free call asserts `toBeNull()` with an explanatory comment linking to the INSTANTIABLE_CLASS_TYPES rationale. NOT marked skipped — current behavior is intentional, not broken. Finding #4 — Swift extension dedup guard - Two same-name Class entries at different path lengths; exercises the full dispatch chain: 1. filterCallableCandidates with 'free' strips Class → length 0 2. hasClassTarget triggers resolveStaticCall 3. Homonym ambiguity null-routes per SM-12 round-1 contract 4. Constructor-form retry repopulates with both Classes 5. Dedup block sorts by filePath.length → shortest path wins Verification - `tsc --noEmit` clean - 3064 unit tests pass (+12) - 1766 integration tests pass - Zero regressions Plan: docs/plans/2026-04-09-003-fix-sm13-resolve-free-call-review-findings-plan.md Review: abhigyanpatwari#756 (comment) * refactor(SM-13): extract dedupSwiftExtensionCandidates shared helper Follow-up to the PR abhigyanpatwari#756 review fix. SM-13 duplicated the Swift extension same-name collision dedup block between `resolveCallTarget` and `resolveFreeCall` — two copies of identical 15-line logic with the same heuristic (`filePath.length` sort, Class/Struct-only, `length > 1` guard). Extract a single shared helper so the two sites cannot drift. Changes - New `dedupSwiftExtensionCandidates(candidates, tier)` helper defined alongside `tryOverloadDisambiguation`, with JSDoc documenting: - The Swift extension scenario it addresses - Why it is intentionally narrower than INSTANTIABLE_CLASS_TYPES (Class/Struct only, not Record — C#/Kotlin records don't exhibit the multi-file definition pattern, widening risks accidental dedup of legitimately distinct record types) - The return-null-on-no-match contract so callers can fall through - `resolveCallTarget` tail dedup (was lines 1593-1610): replaced with a single `dedupSwiftExtensionCandidates` call - `resolveFreeCall` tail dedup (was lines 1994-2012): same replacement - Net line count: -32 insertions, -9 deletions in the consumer sites, +36 for the shared helper + JSDoc Verification - `tsc --noEmit` clean - 3064 unit tests pass (including the R7 Swift dedup guard test added in the previous commit that exercises the full free-form retry chain through this helper) - 1766 integration tests pass - Zero regressions Follows-up on: abhigyanpatwari#756 * docs(SM-13): address PR abhigyanpatwari#756 final review — comment cleanup only Three documentation-only findings from the approval review. No behavior change, no new tests, no code path modifications. Finding #1 — stale line-number comment - The comment inside `resolveFreeCall` at the `hasClassTarget` site referenced "lines ~1994-2008" for the Swift extension dedup block. Those lines were the inlined pre-SM-13 version; the block has since been extracted to `dedupSwiftExtensionCandidates`. Replaced the line reference with the helper name so future readers don't chase dead line numbers. Finding #2 — fuzzy-widening asymmetry undocumented - `resolveFreeCall` intentionally has no `widenCache` parameter and no D2 fuzzy-widening pass (unlike `resolveCallTarget`'s member-call path). Added an explicit "Asymmetry vs `resolveCallTarget`" paragraph to the JSDoc so a caller comparing the two signatures knows the skipped pass is deliberate and tied to Phase 5. Finding #3 — constructor-form retry reasons undocumented - `resolveStaticCall` can return null for three distinct reasons (empty instantiable pool, homonym ambiguity, ownerless Constructor nodes). The retry below it unconditionally re-filters with `'constructor'` form, which is correct for all three but not obvious. Added a structured three-case comment enumerating each reason and linking (a) to the SM-12 null-route contract, (b) to the R7 dedup test, and (c) to the currently-uncovered ownerless- Constructor path (noted as a future test candidate). Verification - `tsc --noEmit` clean - 175 `resolveFreeCall` + `resolveStaticCall` + sibling tests pass (sanity check — no behavior change expected) - No regressions Follows-up on: abhigyanpatwari#756 (comment) * test(SM-13): cover ownerless-Constructor retry + PHP free function Two low-severity test gaps from PR abhigyanpatwari#756 review comment 4215739052 — previously addressed doc-only, now have concrete test coverage. Finding #3 low — ownerless-Constructor retry path (previously comment-only) - The retry after resolveStaticCall returns null handles three distinct null-return reasons. Cases (a) and (b) were already tested (Interface/ Trait null-route from SM-12, Swift shadowing dedup from R7). Case (c) — resolveStaticCall step-4 bailout when the tiered pool contains ownerless Constructor nodes — was only covered by a comment. - New test: Class + ownerless Constructor in tiered pool, callForm='free'. Exercises the full chain: 1. resolveStaticCall step 3 walks classCandidates via lookupMethodByOwner — ownerless Constructor not in methodByOwner, nothing found. 2. Step 4 detects Constructor in tiered pool, bails with null. 3. resolveFreeCall retry re-runs filterCallableCandidates with 'constructor' form, which prefers Constructor over Class per CONSTRUCTOR_TARGET_TYPES ordering. 4. Single survivor returned. - Asserts the Constructor node (not the Class) is the resolved target. Low — PHP free function coverage gap - The language coverage table in the same review flagged PHP free functions (top-level `function helper()` outside any class) as uncovered. Added a test mirroring the existing Go/Python/Rust/Java/ JS language tests — exercises the `.php` dispatch path for free calls. Ruby and C/C++ remain uncovered; deferred to a future round since those languages also have other gaps in the broader test file. Verification - `tsc --noEmit` clean - 3066 unit tests pass (+2 new regression tests) - 1766 integration tests pass - Zero regressions Follows-up on: abhigyanpatwari#756 (comment) --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com> Co-authored-by: Gergo Magyar <gergomagyar@icloud.com>
zander-raycraft
pushed a commit
that referenced
this pull request
May 7, 2026
…yanpatwari#770) * Initial plan * SM-19: Replace resolveCallTarget with thin dispatcher Delete the monolithic resolveCallTarget function (~200 lines) and replace it with a 15-line thin dispatcher that routes to resolveMemberCall, resolveStaticCall, or resolveFreeCall. Extract module-alias resolution and file-based member-call fallback into dedicated helper functions. - resolveCallTarget body reduced from ~200 lines to ~15 lines - Extract resolveModuleAliasedCall helper (Python/Ruby module imports) - Extract resolveMemberCallByFile helper (trait dispatch, overload disambiguation) - Extract singleCandidate helper (constructor alias fallback, name-based fallback) - Update unit tests for new dispatcher semantics - Update doc comments referencing deleted D0-D4 paths Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/469eac38-b0c0-4a26-a2ff-3eb06299730b Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com> * SM-19: Add singleCandidate tail fallback for member calls with unresolvable receiver type Agent-Logs-Url: https://github.com/abhigyanpatwari/GitNexus/sessions/469eac38-b0c0-4a26-a2ff-3eb06299730b Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com> * fix(SM-19): address all PR abhigyanpatwari#770 review findings + fix CI Fixes all 5 test failures (2 unit + 3 integration) and addresses 10 review findings from comment 4225312416. Critical fix — singleCandidate null-route guard The SM-19 dispatcher chained singleCandidate as an unconditional tail fallback for member calls with receiverTypeName. This bypassed the SM-10 R3 null-route contract: when the receiver type IS in the index but file/owner filtering produced zero matches, the old code returned null (genuine miss), but the new code fell through to singleCandidate (false-positive CALLS edge). Root cause: resolveMemberCallByFile returns null for two semantically different reasons — (1) type not found in the index at all, and (2) type found but no candidate matched after narrowing. The dispatcher treated both as "try the next fallback." The old resolveCallTarget exited the entire function on case 2. Fix: after the scoped resolvers both return null, check whether the receiver type resolves in the index. If it does (case 2), null-route — the scoped resolvers made the right decision. If it doesn't (case 1, e.g. PHP 'mixed', dynamic types), singleCandidate is the correct last resort. ctx.resolve is cached so the check is free. This fixes: - Unit: no heritageMap null-route test (was getting 1 edge, expects 0) - Integration: Rust c.trait_only() negative test - Integration: 3 PHP heritage + alias tests (singleCandidate correctly fires when the receiver type is not in the index) Performance (findings #1, #2, #3) - Thread pre-computed tiered result into resolveModuleAliasedCall via new tieredOverride parameter — eliminates the duplicate ctx.resolve call on every module-alias path. - Add countCallableCandidates helper that short-circuits at threshold without allocating an intermediate array — replaces the filterCallableCandidates(...).length > 1 allocation in skipMember. - resolveMemberCallByFile lookupCallableByName caching deferred to a follow-up (finding #2) — the fix requires threading widenCache through the file-scoped resolver which is a larger change. Code quality (findings #4, #5) - Remove dead code: redundant conditional in resolveMemberCallByFile where both branches returned null. - Move WidenCache type declaration from mid-file (between JSDoc blocks) to adjacent to CONSTRUCTOR_TARGET_TYPES with other type declarations. Formatting - Applied prettier to call-processor.ts (CI format check was failing). Verification - tsc --noEmit clean - 3188 unit tests pass (0 skipped real tests) - 1766 resolver integration tests pass - Zero regressions — all PHP, Rust, and no-heritageMap tests green Review: abhigyanpatwari#770 (comment) * fix(SM-19): restore module-alias narrowing and constructor disambiguation Codex adversarial review on PR abhigyanpatwari#770 surfaced two silent regressions in the SM-19 thin dispatcher: Finding 1 [high] — Typed member calls bypassed module-alias narrowing. When two homonym receiver types are both imported by the caller, the import-scoped tier no longer narrows and the owner/file resolvers see genuine ambiguity. The dispatcher null-routed silently, dropping valid CALLS edges. Fix: consult `resolveModuleAliasedCall` at the top of the typed-member branch so an active alias on `call.receiverName` picks the aliased file before the generic resolvers run. Finding 2 [medium] — Constructor dispatch lost overload disambiguation. When `resolveStaticCall` bails (ambiguous or ownerless Constructor pool) and the caller supplied `overloadHints` / `preComputedArgTypes`, the branch fell straight through to `singleCandidate` — which also bails on multiple same-arity survivors. Fix: between `resolveStaticCall` and `singleCandidate`, run constructor-filtered overload disambiguation on the tiered pool. Only engages when a narrowing signal is present; preserves SM-10 R3 null-route for genuinely ambiguous cases. Tests: - call-processor.test.ts: 3 new dispatcher-level regression tests covering real-homonym alias narrowing, constructor overload disambiguation with `argTypes`, and null-route control - symbol-table.test.ts: update `module alias homonyms` test which previously codified the Finding 1 regression; now asserts resolution to the aliased file's method Verification: 3191 unit + 2398 integration tests pass; tsc --noEmit clean; prettier clean. * refactor(SM-19): address code review findings with clean-code pass Code review on commit f424685 surfaced one P1 correctness regression and two P2 maintainability concerns. This commit closes all ten findings: P1 — Alias helper placement regression - resolveModuleAliasedCall now runs as a FALLBACK in the typed-member branch, after resolveMemberCall/resolveMemberCallByFile return null. Previously it short-circuited BEFORE scoped resolvers, leaking unrelated homonyms from the aliased file when a local var coincidentally matched a module alias. - Added type-file verification guard: alias narrowing only fires when the alias target file is among the receiver type's defining files. Prevents cross-type false positives and hardens SM-10 R3. P2 — Thin-dispatcher drift (roadmap Phase 3) - Extracted disambiguateByOverloadOrArgTypes shared helper. Centralizes the overloadHints → preComputedArgTypes precedence rule used by both member and constructor resolvers. - Folded constructor overload disambiguation into resolveStaticCall as step 4.5 (between the ambiguous-pool bail and the instantiable-class fallback). resolveStaticCall now accepts optional overloadHints / preComputedArgTypes symmetric with resolveMemberCallByFile. - Dispatcher's constructor branch returns to a 2-line delegation. - resolveMemberCallByFile now calls the shared helper instead of inlining the ternary. P2 — Missing test coverage - owner-scoped wins over alias narrowing (alias with unrelated target class must not override unique owner-scoped answer) - alias narrowing rejects unrelated target type (type-file guard) - alias fallthrough: receiverName not in alias map - alias fallthrough: alias target file has no matching method (overloadHints-for-constructor variant transitively covered via the extracted helper's member-path tests; direct dispatcher test deferred as it requires real OverloadHints fixture parsing) P3 — Clarity and durability - Stripped "Codex SM-19 Finding N" prefixes from comments. Replaced with durable explanations of WHY each guarded branch exists. - Added cross-reference comment at the tail-branch resolveModuleAliasedCall call site pointing to the typed-member branch usage. Verification: 3195 unit + 1766 resolver integration + 2398 full integration tests pass. tsc --noEmit clean. prettier clean. Plan: docs/plans/2026-04-11-002-fix-sm19-code-review-findings-plan.md --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: magyargergo <11230420+magyargergo@users.noreply.github.com> Co-authored-by: Gergo Magyar <gergomagyar@icloud.com>
zander-raycraft
pushed a commit
that referenced
this pull request
May 7, 2026
…npatwari#606 split) (abhigyanpatwari#796) * feat(group): extractor expansion + manifest extractor Part 2 of 4 in the split of abhigyanpatwari#606 (ticket: abhigyanpatwari#792). Follows abhigyanpatwari#795 (bridge.lbug storage foundation, already merged), but this PR has no code-level dependency on abhigyanpatwari#795 — it only imports types and the ContractExtractor interface that existed on upstream main before either PR. It could have been reviewed in parallel with abhigyanpatwari#795. ## What changed Expands the 3 existing contract extractors with substantially more language/framework coverage, and adds a new `manifest-extractor` that resolves `group.yaml`-declared cross-links against the per-repo graph via exact-name lookups. ### New file (228 LOC) - `gitnexus/src/core/group/extractors/manifest-extractor.ts` — exact graph lookup for `group.yaml`-declared cross-links. HTTP paths are canonicalized before Route.name matching; gRPC is resolved by service/method name (NO `.proto`-filename fallback); topic and lib use exact-name match. Falls back to a synthetic `manifest::<repo>::<contractId>` uid when the graph has no matching symbol, so cross-impact traversal still has a stable anchor for the contract. ### Modified extractors (+958 LOC prod) - `extractors/grpc-extractor.ts` (+522) — `.proto` parser with comment and string-literal sanitization (braces inside strings no longer truncate service bodies); package/service/method canonical IDs; server/client detection across Go (`grpc.NewServer`, `RegisterXxxServer`, `XxxGrpc.XxxImplBase`), Java (`@GrpcService`, `BlockingStub`), Python (`servicer_to_server`, `XxxStub`), and TypeScript/Node (`@GrpcMethod`, `ClientGrpc`, `loadPackageDefinition`). - `extractors/http-route-extractor.ts` (+174) — Go gin/echo/stdlib `HandleFunc`, NestJS `@Controller`+`@Get`/etc, Python FastAPI decorators, Java Spring `@RequestMapping`/`@GetMapping`, restTemplate / WebClient / OkHttp consumers. - `extractors/topic-extractor.ts` (+98) — sarama `ProducerMessage{}` struct literal detection (replaces a constructor-anchored regex that missed topics inside producer loops), kafka-go Writer/Reader, Python NATS (`await nc.subscribe`/`await nc.publish`), JetStream helpers. ### Modified and new tests (+1264 LOC) - `grpc-extractor.test.ts` (+539) — full coverage of the new proto parser (strings-with-braces regression, comments-with-braces regression), per-language server/client detection - `http-route-extractor.test.ts` (+240) — per-framework route extraction + normalization edge cases - `topic-extractor.test.ts` (+177) — the sarama in-loop regression, JetStream, Python NATS, kafka-go Writer/Reader - `manifest-extractor.test.ts` (+308 NEW) — HTTP path normalization, gRPC exact lookup with proto-fallback regression, lib and topic exact matching, synthetic-uid fallback behavior ### Self-review fixes folded in Carried forward from the abhigyanpatwari#606 self-review (commit `d15b8cb`): - **HIGH #1** — `manifest-extractor.resolveSymbol` was too fuzzy. Previously used `CONTAINS` on route/name fields plus an unconditional `filePath ENDS WITH '.proto'` fallback for gRPC. Consequences: `/orders` matched `/suborders`, and any repo with any `.proto` file returned a random proto symbol for a gRPC manifest entry. Replaced with exact equality + deterministic `ORDER BY` + synthetic-uid fallback for unresolved manifests. Regression tests included. - **MED #3** — gRPC proto parser brace-depth counting now sanitizes strings and comments first (`stripProtoCommentsAndStrings`). A valid proto with `option deprecated_reason = "use NewService { instead"` used to have its service body closed early by the `"{"` inside the literal, silently dropping methods after the offending string. Regression tests for both string-with-brace and comment-with-brace cases. - **MED #4** — sarama Kafka regex changed from `sarama.NewSyncProducer[\s\S]{0,300}?Topic:` (anchored on constructor, caught only first topic in a loop) to `sarama.ProducerMessage{...Topic:}` (matches every struct literal directly). Regression test with a for-loop that constructs multiple `ProducerMessage`s. - **MED #7** — `manifest-extractor.resolveSymbol` no longer has a silent `catch { /* fall through */ }`. Errors from the graph executor are logged via `console.warn` with link type, contract name, repo key, and error message before falling through to the synthetic-uid path. ## Why Reviewer focus here is pure regex / parser correctness — no storage, no Cypher queries, no algorithmic changes to the cross-link algorithm. Separating this from the bridge foundation PR (abhigyanpatwari#795) meant reviewers could stay in a single mental mode (parsing logic) instead of context-switching between DDL, Cypher, and regex. ## How to verify - `cd gitnexus && npx tsc --noEmit` - `cd gitnexus && npx vitest run test/unit/group/grpc-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/http-route-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/topic-extractor.test.ts --pool=forks` - `cd gitnexus && npx vitest run test/unit/group/manifest-extractor.test.ts --pool=forks` Local pre-push: typecheck clean, all 99 extractor unit tests pass (grpc 43, http 18, topic 30, manifest 8). ## Risk / rollback **Low.** Extractors have no user-facing surface in this PR — they produce `ExtractedContract[]` that is consumed by `sync.ts` in the next split (abhigyanpatwari#793). No existing behavior changes for users who don't run a `group sync`. Rollback = `git revert` of the merge commit; the modifications to `grpc-extractor.ts` / `http-route-extractor.ts` / `topic-extractor.ts` revert to the pre-PR versions that still work (they're subsets of the new functionality). ## Scope discipline (per GUARDRAILS.md) - Only the 8 files above are touched; no drive-by refactors - No CI/release/security config changes - No secrets or machine-specific paths - Content lifted from abhigyanpatwari#606 (CI 11/11 green on `d15b8cb`) ## Dependencies - **Base:** `main` (upstream already includes abhigyanpatwari#795 as `1ff324c`) - **Blocks:** sync pipeline (abhigyanpatwari#793) and the cross-impact feature (abhigyanpatwari#794) - **Tracker issue:** abhigyanpatwari#792 - **Parent PR:** abhigyanpatwari#606 Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate topic-extractor from regex to tree-sitter queries Addresses @magyargergo's feedback on abhigyanpatwari#796 that regex-based lookups should use tree-sitter nodes instead, and that the top-level extractors must NOT carry language dependencies. This is phase 1 of a multi-step migration — topic-extractor first because its patterns are the most uniform (16 "call/annotation with first-arg string literal" variants), which makes it a clean proof of the approach before grpc-extractor and http-route-extractor get the same treatment. ## Architecture: language-agnostic orchestrator + per-language plugins The top-level extractor is a thin orchestrator that never imports a tree-sitter grammar or a query string. Per-language knowledge lives in a new `topic-patterns/` folder with one file per language plus a registry that maps file extensions to compiled plugins: ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic scanning utilities ├── topic-extractor.ts # thin orchestrator (no grammar imports) └── topic-patterns/ ├── types.ts # TopicMeta, Broker ├── index.ts # registry: extension → compiled provider ├── java.ts # tree-sitter-java + JAVA_TOPIC_PROVIDER ├── go.ts # tree-sitter-go + GO_TOPIC_PROVIDER ├── python.ts # tree-sitter-python + PYTHON_TOPIC_PROVIDER └── node.ts # tree-sitter-javascript + tree-sitter-typescript # → JAVASCRIPT_/TYPESCRIPT_/TSX_TOPIC_PROVIDER ``` **Shared scanner (`tree-sitter-scanner.ts`)** — defines `PatternSpec<TMeta>`, `LanguagePatterns<TMeta>`, `CompiledPatterns<TMeta>` and the `scanFile(parser, plugin, content)` helper. Plugins compile their queries eagerly at module load via `compilePatterns()`, so a broken pattern fails loudly at import time instead of silently at scan time. `unquoteLiteral()` handles single/double/template quotes, Python triple-quoted strings, and Go raw backtick strings. **Per-language plugins** own: - the tree-sitter grammar import (this is the ONLY place in `src/core/group/` where tree-sitter grammars are imported), - the query S-expressions, - the `TopicMeta` payload (role, broker, confidence, symbolName) that the orchestrator receives back on every match. Each plugin uses a `@value` capture name to bind the topic literal node. The JavaScript and TypeScript grammars share AST node names for every construct we query, so `node.ts` defines the pattern sources once and compiles them against `JavaScript`, `TypeScript.typescript`, and `TypeScript.tsx` — exporting three providers because `Parser.Query` objects are NOT portable across grammar instances. **Registry (`topic-patterns/index.ts`)** — maps `.java` → Java provider, `.go` → Go, `.py` → Python, `.js`/`.jsx` → JS, `.ts` → TS, `.tsx` → TSX. Also exports `TOPIC_SCAN_GLOB` so adding a new language is a single file-level edit (drop `topic-patterns/<lang>.ts`, import + register it here — zero edits required in `topic-extractor.ts`). **Orchestrator (`topic-extractor.ts`)** — ~110 lines, no grammar or query imports. Per file: `getProviderForFile(rel)` → `scanFile(parser, provider, content)` → `unquoteLiteral(valueText)` → `makeContract(...)`. Reuses one `Parser` instance across files; the scanner calls `setLanguage` per plugin. ## Why this is better than regex 1. **Comments and strings are respected for free.** The old regex would match `// kafkaTemplate.send("fake.topic")` as a real producer; tree-sitter never visits comments or string literals as code nodes, so false positives from commented-out code are eliminated. 2. **Struct/object literal patterns are structural, not textual.** `sarama.ProducerMessage{Topic: "..."}` no longer needs a 300-char lookahead (which was a known cross-match bug partly mitigated by a loop regression test in the self-review). The new query matches a specific `composite_literal` with a specific `qualified_type` and `keyed_element` — exactly one struct literal per match. 3. **No order-of-operations fragility.** Regex for `channel.publish` vs `channel.consume` was independent and file-wide; the AST scopes matches to the specific `call_expression`. 4. **Language-agnostic extension.** Adding Ruby, Rust, or C# topic detection later means dropping one file in `topic-patterns/` — no changes to shared scanner or orchestrator, and no tree-sitter imports leak into top-level code. ## Per-file fault tolerance - Malformed files that tree-sitter can't parse are silently skipped (`parser.parse` is wrapped by `scanFile`). The ingestion pipeline already logs unparseable files at index time. - A syntactically invalid query is caught at `compilePatterns` time, not scan time — broken plugins fail loudly at import. - Per-pattern `matches()` failures are swallowed so one broken query in a plugin doesn't block the rest. ## Tests All 30 existing `topic-extractor.test.ts` tests pass **without any changes to the test file** — they were written as input/output contract tests (given this source file, expect these `ExtractedContract` objects) and that contract is unchanged. Regression coverage includes: - Kafka: Java `@KafkaListener` + `kafkaTemplate.send`; Node `producer.send` + `consumer.subscribe`; Go sarama producer/consumer (sync and async); kafka-go Writer/Reader; Python `KafkaConsumer` + `producer.send/produce` - RabbitMQ: Java `@RabbitListener` + `rabbitTemplate.convertAndSend`; Node `channel.consume/publish/sendToQueue`; Python `basic_consume/ basic_publish` with keyword args - NATS: Go and Node `nc.Subscribe/Publish`; Go and Node JetStream `js.Subscribe/Publish`; Python `await nc.subscribe/publish` Including the regression test for the sarama `ProducerMessage` in-loop case — the AST-based query captures every literal in the file independently, not just the first one after `NewSyncProducer`. ## Neighbor regression check - `topic-extractor.test.ts` — 30/30 pass (rewritten extractor) - `http-route-extractor.test.ts` — 18/18 pass (untouched) - `grpc-extractor.test.ts` — 43/43 pass (untouched) - `manifest-extractor.test.ts` — 8/8 pass (untouched) - Full `npx tsc --noEmit` clean ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched; no changes to other extractors, tests, MCP surface, or pipeline.ts. - No CI/release/security config changes, no secrets. - New tree-sitter imports all reference grammars that are already installed as dependencies (`tree-sitter`, `tree-sitter-javascript`, `tree-sitter-typescript`, `tree-sitter-python`, `tree-sitter-java`, `tree-sitter-go` — all in `package.json` for the existing pipeline). ## Phase 2 / phase 3 plan - **Phase 2 (next commit):** rewrite `http-route-extractor.ts` Strategy B (regex fallback) on the same plugin pattern. Graph-assisted Strategy A stays as-is (already uses pipeline-built tree-sitter data via `HANDLES_ROUTE` Cypher queries). - **Phase 3 (commit after):** rewrite `grpc-extractor.ts` for Java / Go / Python / TypeScript detection. `.proto` files are the one outstanding question — there is no `tree-sitter-proto` grammar installed; the in-tree string-sanitizing parser stays as a pragmatic exception with a comment, alternative being to add `tree-sitter-proto` as a dep (open for the maintainer). Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate http-route-extractor Strategy B to tree-sitter plugins Phase 2 of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as the phase 1 topic-extractor rewrite: a thin, language-agnostic orchestrator plus per-language plugins that own tree-sitter grammars and query sources. The top-level extractor file no longer imports any tree-sitter grammar or query string. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared, language-agnostic primitives ├── http-route-extractor.ts # thin orchestrator (no grammar imports) └── http-patterns/ ├── types.ts # HttpDetection, HttpLanguagePlugin, HttpRole ├── index.ts # registry: ext → plugin + HTTP_SCAN_GLOB ├── java.ts # tree-sitter-java: Spring + RestTemplate/WebClient/OkHttp ├── go.ts # tree-sitter-go: gin/echo/HandleFunc + http/resty consumers ├── python.ts # tree-sitter-python: FastAPI + requests ├── php.ts # tree-sitter-php: Laravel Route::get/... └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # NestJS controllers, Express, fetch, axios ``` **Shared scanner (`tree-sitter-scanner.ts`)** — generalised from phase 1: - `ScanMatch<TMeta>.captures` is now a full `CaptureMap` (every named capture the query binds, not just a single `@value`). Topic extractor updated to read `match.captures.value` accordingly. - New `runCompiledPatterns(plugin, tree)` helper lets plugins run multiple query bundles against the same pre-parsed tree. This is needed for HTTP plugins that combine a class-prefix query with a method-route query (Spring, NestJS). - `scanFile` becomes a thin wrapper over `parser.parse + runCompiledPatterns`. **HTTP plugin shape** — unlike topic plugins, HTTP plugins expose a `scan(tree)` function rather than a flat pattern list. This reflects HTTP's more complex extraction: each detection needs method + path + handler name, and framework patterns like Spring `@RequestMapping` / NestJS `@Controller` require cross-referencing a class-level prefix with method-level annotations. Plugins internally use `compilePatterns` + `runCompiledPatterns` and walk the AST to resolve the class/method relationships. **Per-framework coverage:** - **Java (`java.ts`)** - Spring: `@RequestMapping("/api/v2")` class prefix + `@(Get|Post|Put| Delete|Patch)Mapping("/sub")` method routes, joined via the enclosing `class_declaration` node id. - `RestTemplate.getForObject/postForEntity/put/delete/patchForObject` → method derived from API name. - `WebClient.method(HttpMethod.X, "/path")` → method from `HttpMethod.X` capture. - `new Request.Builder().url("/path")` → OkHttp consumer. - **Go (`go.ts`)** - gin / echo / chi frameworks: `\w+.GET("/path", handler)` captures upper-case verb + handler identifier. - `net/http.HandleFunc("/path", handler)` → provider (default GET). - `http.Get/Post/Head` consumer, `http.NewRequest("METHOD", ...)`, resty `client.R().Get/Post/...`. - **Python (`python.ts`)** - `@app.get("/path")` FastAPI decorators. - `requests.get/post/...` and `requests.request("METHOD", "url")`. - **PHP (`php.ts`)** - Laravel `Route::get/post/.../patch('/path', ...)` via `scoped_call_expression`. Uses `PHP.php_only` to match the existing ingestion pipeline's grammar selection. - **Node (`node.ts`) — JS + TS + TSX** - Pattern sources defined once, compiled against three grammar variants (`JavaScript`, `TypeScript.typescript`, `TypeScript.tsx`) because `Parser.Query` objects are not portable across grammars. Exports three plugins sharing the same `scan` logic. - NestJS: `@Controller('prefix')` decorators are siblings of the class in `export_statement` / `program`; `@Get(':id')` decorators are siblings of the method in `class_body`. The plugin walks decorator → next named sibling to find the decorated class / method, then combines the class prefix with the method path. Only emits NestJS detections when the enclosing class has a real `@Controller` decorator — prevents false positives from generic classes that happen to use `@Get` from another library. - Express: `(router|app).<verb>('/path', ...)`. - `fetch(url)` (default GET) + `fetch(url, { method: 'X' })` (uses two queries + a SyntaxNode-id dedupe set so URL literals aren't double-emitted by the options variant). - `axios.get/post/...`. ## Orchestrator changes `http-route-extractor.ts` drops every `scanXxxProviders` / `scanXxxConsumers` regex method and replaces them with a single source-scan loop that delegates to `getPluginForFile(rel).scan(tree)`. The orchestrator still owns: - **Path normalization** (`normalizeHttpPath`, `normalizeConsumerPath`) — language-agnostic string processing shared by both strategies. - **Graph-assisted Strategy A** (`HANDLES_ROUTE` / `FETCHES` / `CONTAINS` Cypher queries) — unchanged in spirit. The only regex helpers it used (`inferMethodFromFileScan`, `pickJavaHandlerName`) are now replaced by a lookup against the plugin's detections for the same file: for each route row, find the detection whose normalized path matches, and pull the HTTP method + handler name from it. - **Per-file parse cache** — the orchestrator parses each relevant file at most once per `extract()` call. Both the graph-assisted enrichment loop and the source-scan fallback share the same `cachedDetections` map, so we never run the plugin twice for the same file. ## Why this is better than the regex version 1. **Comments and strings for free.** The old regex would match `// router.get('/fake')` as a real Express route; tree-sitter never visits string/comment nodes. 2. **Structural controller-prefix.** Spring and NestJS class-prefix joining is now scoped to the enclosing class via `class_declaration` node ids, eliminating file-wide state that broke when a file had multiple controllers. 3. **Precise NestJS disambiguation.** The plugin only emits a NestJS detection when the enclosing class has a real `@Controller` decorator — the old regex would fire on any `@Get(...)` in the file regardless of surrounding context. 4. **Language-agnostic extension.** Adding Ruby / Rust / Kotlin HTTP detection later means dropping one file in `http-patterns/` — no changes to the shared scanner, the orchestrator, or the Strategy A Cypher queries. ## Tests - `http-route-extractor.test.ts` — **18/18 pass** (tests unchanged; they're contract-style input/output tests and the contract shape is unchanged). Covers Spring class prefix, Express, gin/echo, stdlib HandleFunc, NestJS, Laravel, FastAPI for providers and fetch/axios/python-requests/rest-template/webClient/okhttp/go-stdlib/ resty for consumers, plus graph-first Strategy A for both. - `topic-extractor.test.ts` — **30/30 pass** after the `captures.value` API migration. - `grpc-extractor.test.ts` — 43/43 pass (untouched; phase 3). - `manifest-extractor.test.ts` — 8/8 pass (untouched). - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No changes to pipeline.ts, MCP surface, ingestion, or tests. - No CI / release / security / secrets changes. - Tree-sitter grammars imported by plugins (`tree-sitter-java`, `tree-sitter-go`, `tree-sitter-python`, `tree-sitter-php`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already in `package.json` for the existing ingestion pipeline. ## Phase 3 plan - **grpc-extractor** gets the same treatment: plugin-per-language under `grpc-patterns/` for Java / Go / Python / TS detection. `.proto` files remain an open question — no `tree-sitter-proto` grammar is installed, so the in-tree string-sanitizing parser from PR abhigyanpatwari#796's self-review stays as a pragmatic exception unless the maintainer wants us to add `tree-sitter-proto` as a new dep. Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): migrate grpc-extractor source scans to tree-sitter plugins Phase 3 (final) of the extractor refactor requested by @magyargergo on abhigyanpatwari#796. Same architecture as phase 1 (topic) and phase 2 (http): thin language-agnostic orchestrator + per-language plugins that own tree-sitter grammars and query sources. With this commit the top-level extractors under `src/core/group/extractors/` import ZERO tree-sitter grammars and ZERO query strings — every grammar import lives in a `*-patterns/<lang>.ts` plugin file, and the orchestrators go through the registry indirection. ## Architecture ``` src/core/group/extractors/ ├── tree-sitter-scanner.ts # shared primitives (unchanged) ├── grpc-extractor.ts # orchestrator (only `.proto` parser left) └── grpc-patterns/ ├── types.ts # GrpcDetection, GrpcLanguagePlugin, GrpcRole ├── index.ts # registry: ext → plugin + GRPC_SCAN_GLOB ├── go.ts # tree-sitter-go: RegisterXxxServer, Unimplemented, NewXxxClient ├── java.ts # tree-sitter-java: @GrpcService + XxxImplBase + newBlockingStub ├── python.ts # tree-sitter-python: add_XxxServicer_to_server + XxxStub └── node.ts # tree-sitter-javascript + tree-sitter-typescript: # @GrpcMethod, @GrpcClient field type, # .getService<X>('Svc'), new XxxServiceClient, # loadPackageDefinition dynamic constructors ``` ## Per-language coverage **Go (`go.ts`)** - Provider: `\w+.RegisterXxxServer(...)` via `call_expression → selector_expression → field_identifier` + JS regex filter `^Register(\w+)Server$`. - Provider: `pb.UnimplementedXxxServer` embedded in a struct via `struct_type → field_declaration_list → field_declaration → qualified_type → type_identifier` + JS filter. - Consumer: `\w+.NewXxxClient(...)` via the same call_expression query + JS filter `^New(\w+)Client$`. **Java (`java.ts`)** - Provider: `class X extends YyyGrpc.YyyImplBase` — two queries handle the scoped and plain forms. `scoped_type_identifier`'s children are positional (no `scope:`/`name:` fields), so the query matches the two `type_identifier` children by position. - `#match? @inner "ImplBase$"` restricts matches at query time. - Whether the class has `@GrpcService` or not controls only the `source` metadata label — the plugin walks the class_declaration's `modifiers` child in JS to detect the marker_annotation. - Consumer: `YyyGrpc.newStub(ch)` / `newBlockingStub(ch)` via a `method_invocation` query with `#match? @method "^new(Blocking)?Stub$"`, service name extracted via `^(\w+)Grpc$` on the object identifier. **Python (`python.ts`)** - Single call-expression query covers both bare identifier and `obj.method` attribute forms: `(call function: [(identifier) @fn (attribute attribute: (identifier) @fn)])`. - Plugin filters `@fn.text` against two JS regexes: `^add_(\w+)Servicer_to_server$` (provider) and `^(\w+)Stub$` (consumer), with a reserved-names ignore list for the Stub case (Mock / Test / Fake / Stub). **Node — JavaScript + TypeScript + TSX (`node.ts`)** - Pattern sources defined once, compiled three times (one per grammar) because `Parser.Query` objects are not portable across grammars. Exports three `GrpcLanguagePlugin`s sharing the same `scan`. - `@GrpcMethod('Service', 'Method')`: decorator query captures the two string literals. Confidence is hard-coded 0.8 regardless of proto map resolution (matches the original regex version's behaviour). - `@GrpcClient(...) field: XxxServiceClient`: decorator query captures the decorator node, plugin walks up to find the enclosing `public_field_definition` (decorators on fields are CHILDREN of the field definition in tree-sitter-typescript, not siblings) and reads its first `type_annotation → type_identifier`, then runs the `^(\w+Service)Client$` JS filter. - `client.getService<X>('AuthService')`: call-expression query on `member_expression.property = "getService"` + string literal arg. - `new XxxServiceClient(...)`: `new_expression` with a bare identifier constructor, filtered by `^(\w+Service)Client$` so generic `new AuthClient(...)` (missing the `Service` infix) does NOT falsely register as a consumer. Preserves the regression test `test_extract_ts_non_service_client_constructor_is_ignored`. - `loadPackageDefinition` dynamic loader: gated on `tree.rootNode.text.includes('loadPackageDefinition')`. When set, `new foo.bar.Xxx(...)` qualified constructors with a capitalised property name register as consumers. ## Orchestrator changes `grpc-extractor.ts` loses every `scanGoProviders` / `scanJavaProviders` / ... helper and replaces them with a single source-scan loop that: 1. Parses each file with the plugin's grammar (one shared `Parser` instance across all files, `setLanguage` called per plugin). 2. Calls `plugin.scan(tree)` to get `GrpcDetection[]`. 3. Converts each detection to an `ExtractedContract` via the private `detectionToContract` helper, which: - Looks the short service name up in the proto map (filled by the `.proto` parser). - Picks confidence = `confidenceWithProto` if resolved, else `confidenceWithoutProto`. - Builds a method-level contract id (`grpc::pkg.Svc/Method`) when the detection carries a `methodName` (TS `@GrpcMethod` only), otherwise a service-level id (`grpc::pkg.Svc/*`). Everything else — the `.proto` parser, `buildProtoContext`, `buildProtoMap`, `resolveProtoConflict`, `serviceContractId`, `stripProtoCommentsAndStrings`, `extractServiceBlocks`, the dedupe function — stays exactly as before. The `.proto` parser is kept as a pragmatic exception to the "no regex in extractors" rule because no `tree-sitter-proto` grammar is installed in the repo; a comment at the top of the file explains this and flags the maintainer option of adding `tree-sitter-proto` as a dependency. ## Why this is better than the regex version 1. **Comments and strings are respected for free.** Matched node types are only code constructs, never text inside comments or string literals. 2. **No false positives on partial names.** The old `(\w+?)Grpc`-style regexes would cross-match unrelated identifiers; structural queries restrict matches to the exact AST shape (`scoped_type_identifier → type_identifier` pairs, `method_invocation → identifier` etc.). 3. **NestJS `@GrpcClient` is structural, not regex-based.** The old regex required a specific textual layout (`@GrpcClient(...) private readonly foo!: XxxServiceClient`); the plugin now walks the AST, so modifier order / optional modifiers / multi-line formatting don't break it. 4. **Language-agnostic extension.** Adding Kotlin / Rust / C# gRPC detection later is a one-file edit in `grpc-patterns/index.ts` — no touches to the shared scanner, the orchestrator, or the proto parser. ## Tests - `grpc-extractor.test.ts` — **43/43 pass** (tests unchanged; the contract shape is identical). Covers .proto parsing (including the brace-inside-string regression), Go provider/consumer, Java @GrpcService / plain ImplBase provider + newBlockingStub consumer, Python servicer + stub, TS @GrpcMethod + @GrpcClient + .getService + new XxxServiceClient + loadPackageDefinition + the `AuthClient` vs `AuthServiceClient` discrimination, dedupe across multiple patterns in one file, proto-aware confidence, and the inherited-package resolution for split proto definitions. - `topic-extractor.test.ts` — 30/30 pass. - `http-route-extractor.test.ts` — 18/18 pass. - `manifest-extractor.test.ts` — 8/8 pass. - `service.test.ts`, `sync.test.ts`, `storage.test.ts` — 41/41 pass. - `npx tsc -p tsconfig.json --noEmit` clean. ## Scope discipline (per GUARDRAILS.md) - Only files under `src/core/group/extractors/` are touched. - No pipeline.ts, MCP surface, ingestion, CI / release / security, or test changes. - New tree-sitter grammar imports (`tree-sitter-go`, `tree-sitter-java`, `tree-sitter-python`, `tree-sitter-javascript`, `tree-sitter-typescript`) are all already installed for the ingestion pipeline. ## End of phase series This commit completes the three-phase extractor refactor: - **Phase 1** (`ea06d11`): topic-extractor → `topic-patterns/` - **Phase 2** (`b6015f6`): http-route-extractor → `http-patterns/` - **Phase 3** (this commit): grpc-extractor → `grpc-patterns/` Every remaining regex-based extractor helper under the `src/core/group/ extractors/` directory is either (a) language-agnostic string processing (path normalization, dedupe keys) or (b) the `.proto` parser, which is documented as an explicit exception. Co-authored-by: Claude <noreply@anthropic.com> * feat(group): add tree-sitter-proto for .proto file parsing Addresses @magyargergo's suggestion on abhigyanpatwari#796 to replace the manual string-sanitizing .proto parser with a tree-sitter grammar. - **Vendored `tree-sitter-proto`** in `vendor/tree-sitter-proto/`. Grammar source from [coder3101/tree-sitter-proto](https://github.com/coder3101/tree-sitter-proto) (latest `grammar.js`), parser.c regenerated with `tree-sitter-cli 0.24` to produce ABI version 14 — compatible with the project's `tree-sitter 0.25` runtime (which supports ABI ≤ 14). Added as `optionalDependency` with `file:./vendor/tree-sitter-proto`. - **New `grpc-patterns/proto.ts` plugin** — uses the same `compilePatterns` + `runCompiledPatterns` infrastructure as every other plugin. Two queries: - `(package (full_ident) @pkg)` — package declaration - `(service (service_name) @service_name (rpc (rpc_name) @rpc_name))` — one match per (service, rpc) pair - **Graceful fallback** — `tree-sitter-proto` is an optional dependency. If it fails to install (platform incompatibility) or fails the runtime smoke-test (`setLanguage` + `parse` on a trivial proto), `PROTO_GRPC_PLUGIN` stays `null` and the orchestrator uses the existing manual parser. The smoke-test catches the `SyntaxNode` TDZ error that occurs in vitest's fork-based test runner. - **Orchestrator updated** — when `hasProtoPlugin` is true, `.proto` files are handled by the plugin loop (they're included in `GRPC_SCAN_GLOB`), and the manual `parseProtoFile` loop is skipped. `buildProtoContext` still runs to build the proto map for cross-referencing source-file detections. 1. **No manual comment/string stripping.** The old parser needed `stripProtoCommentsAndStrings` (110 lines) to avoid counting braces inside comments and string literals. tree-sitter handles this natively. 2. **No brace-depth tracking.** `extractServiceBlocks` used a manual depth counter to find service boundaries. tree-sitter's AST gives us `service` → `service_name` + `rpc` → `rpc_name` directly. 3. **Performance.** tree-sitter's C-based parser is faster than character-by-character JS scanning + regex on large proto files. - `grpc-extractor.test.ts` — **43/43 pass** (unchanged) - All other extractor tests — 99/99 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * chore: add .gitignore for vendored tree-sitter-proto build artifacts https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * fix: correct .gitignore paths for vendored tree-sitter-proto Patterns should be relative to the .gitignore file's directory. https://claude.ai/code/session_01SFUCxgKMMQ8EgRHYw91xPU * refactor(group): address Copilot review feedback on abhigyanpatwari#796 Six fixes suggested by the Copilot AI review: 1. **`normalizeHttpPath` root-path edge case** — stripping trailing slashes on the input `/` produced an empty string, yielding malformed contract ids like `http::GET::`. Now preserves `/` for the root handler/fetch case. 2. **Dedupe `scanFiles` call** — `extract()` was globbing the source-scan file list twice (once for the provider fallback, once for the consumer fallback). Moved to a single lazy call that memoizes the result for the rest of the method. 3. **HTTP `scanFiles` now ignores `**/vendor/**`** — every other extractor's glob already ignored vendored sources; the HTTP one didn't. Fixed for consistency. 4. **`loadPackageDefinition` check is now structural** — was calling `tree.rootNode.text.includes('loadPackageDefinition')` which forces materialization of the entire file text from the parse tree (expensive on large files). Replaced with a dedicated compiled query on `(call_expression function: [(identifier) | (member_expression)])` so the check stays in the AST domain. 5. **`grpc-extractor.ts` header docstring updated** — still claimed ".proto parsing is not tree-sitter-based because no grammar is installed". Now describes the actual behaviour: tree-sitter when `tree-sitter-proto` is available (optionalDependency), manual fallback otherwise. 6. **Eliminated the double proto file parse on the fallback path** — `buildProtoContext` already globs + parses every `.proto` file to build `servicesByName`. On the `!hasProtoPlugin` branch the extractor was globbing + parsing again via the now-removed `parseProtoFile` helper. The fallback branch now iterates the map that `buildProtoContext` already produced to emit provider contracts directly — single pass per proto file. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * refactor(group): address Claude review feedback (bugs + dedup + hygiene) on abhigyanpatwari#796 Follows up `2f28bfc` with the remaining items from the Claude AI review: ## Bugs **Bug 2 — Label-unaware Cypher queries in `resolveSymbol`.** The manifest-extractor's lookup queries were `MATCH (n) WHERE n.name = $x` with no label filter, so a topic/service/package name could silently match any node type (File, Variable, Import, Folder, …). Added label filters: - `topic` → `(n:Function|Method|Class|Interface)` (topics are best-effort symbol-name matches against listener/publisher symbols) - `grpc` method → `(n:Function|Method)` - `grpc` service → `(n:Class|Interface)` - `lib` → `(n:Package|Module)` All 8 manifest-extractor tests still pass (mock executor is label-agnostic, but the production LadybugDB graph now gets correctly scoped queries). **Bug 8 — Tautological `!handlerName` condition.** `http-route-extractor.ts:extractProvidersGraph` had `let handlerName = null; if (!method || !handlerName) { ... }` — the `!handlerName` clause was always true since there was no intervening assignment. Simplified to always run the plugin-scan lookup (we need the handler name even when `methodFromRouteReason` already resolved the method). ## Clean code / dedup **Design 7 — `readSafe` was copy-pasted in all three orchestrators.** Extracted to `extractors/fs-utils.ts` as the single source of truth for the path-traversal guard. Dropped the three local copies and the now-unused `fs`/`path` imports from topic-extractor. **Style 10 — Language-specific `_test.go` skip in the topic orchestrator.** Was `if (rel.endsWith('_test.go')) continue;` inside the language- agnostic extraction loop. Pushed into the glob's ignore list (`'**/*_test.go'`) alongside the existing `node_modules`, `vendor`, `dist`, `build` entries, with a comment explaining that other languages' test file conventions either live in separate directories (Python `tests/`, Java `src/test/`) or are already covered by the existing ignores. ## Already addressed in `2f28bfc` (mentioned again in Claude review) - Bug 3: `normalizeHttpPath('/')` returns `''` — fixed - Bug 4: double glob + double parse of `.proto` — fixed - Bug 5: `scanFiles` called twice in HTTP — fixed - Bug 6: missing `**/vendor/**` in HTTP glob — fixed - Design 9 partially: `tree.rootNode.text.includes('loadPackageDefinition')` replaced with a dedicated structural query ## Deferred - Bug 1 (`http::*::path` vs `http::GET::path` matching) — out of scope; sync.ts matching logic lands in abhigyanpatwari#793, manifest extractor already emits correct synthetic uids for unresolved HTTP contracts. - Design 9 full (change plugin `scan(tree)` → `scan(tree, source)`) — the only real use case (`loadPackageDefinition` gate) is already fixed via a structural query, so the interface change would be cosmetic churn without a concrete consumer. ## Tests - `topic-extractor.test.ts` — 30/30 pass - `http-route-extractor.test.ts` — 18/18 pass - `grpc-extractor.test.ts` — 43/43 pass - `manifest-extractor.test.ts` — 8/8 pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> * docs+fix(group): address remaining Claude review items + add pipeline flow chart ## Fixes **Remaining 🔴 — HTTP contract id wildcard format.** Documented the `http::*::<path>` format as an intentional wildcard for manifest links that omit the HTTP method, alongside the explicit-method form (`GET::/path` → `http::GET::/path`). The docblock on `buildContractId` now states both forms, notes that wildcard-aware matching is the responsibility of the sync / cross-impact layer (abhigyanpatwari#793), and recommends the explicit-method form whenever the author knows the method (it round-trips through exact equality without needing wildcard logic downstream). Tests unchanged — the wildcard format is what they've always asserted. **Minor 1 — stale comment at `manifest-extractor.ts:124-126`.** The comment claimed "creates a contract with an empty symbolUid/ref" but the code switched to `manifestSymbolUid(repo, contractId)` a few commits back. Updated to describe the actual synthetic-uid fallback semantics and the cross-impact path that relies on both sides of the join deriving the same uid. **Minor 2 — exhaustiveness guard on `buildContractId`.** The `switch(type)` covered all five current `ContractType` variants but silently returned `undefined` if a new variant was added. Added a `default: const _exhaustive: never = type; throw new Error(...)` clause so the build fails loudly on an unhandled variant. **Minor 3 — `tree.rootNode.text` in `grpc-patterns/node.ts`.** Already fixed in `2f28bfc` via a dedicated structural query (`LOAD_PACKAGE_DEFINITION_SPEC`). No action needed. ## New: pipeline flow chart (per @magyargergo's request) Added `src/core/group/PIPELINE.md` with four Mermaid diagrams: 1. **High-level overview** — `group.yaml` → extractors + manifest → contract matching → `bridge.lbug` → `runGroupImpact`. 2. **Per-repo extractor two-strategy shape** — graph-assisted Strategy A vs. source-scan Strategy B. 3. **Plugin architecture** — orchestrator → registry → per-language `*-patterns/<lang>.ts` → `tree-sitter-scanner.ts` → `ExtractedContract`. 4. **Manifest extraction** — label-scoped `resolveSymbol` with the synthetic-uid fallback. 5. **Cross-impact query (abhigyanpatwari#606)** — local impact → bridge join → cross-repo fan-out. Each diagram is annotated with which PRs own which stage (this PR: extractors + manifest; abhigyanpatwari#795: bridge storage; abhigyanpatwari#606: cross-impact runtime) and points at the concrete files/functions involved. ## Tests - 99/99 extractor tests pass - `npx tsc -p tsconfig.json --noEmit` clean Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
zander-raycraft
pushed a commit
that referenced
this pull request
May 7, 2026
abhigyanpatwari#1050) * feat(ingestion): TypeScript registry-primary scope resolution (Ring 3) - Add TypeScript ScopeResolver stack (query/captures/interpret, import decomposition, hooks, arity, merge, receiver binding) and register in SCOPE_RESOLVERS. - Harden shared compound receiver and receiver-bound CALLS pass for map for-of tuple bindings, dotted typeRef shapes, and callable-alias fallbacks. - Flip TypeScript into MIGRATED_LANGUAGES; refresh AGENTS.md and type-resolution-system.md. - Shared finalize-algorithm updates for cross-file scope parity. - Tests: TS scope-resolution unit suite; legacy call-processor suite forces REGISTRY_PRIMARY_TYPESCRIPT=0; registry-primary flag test opts out TS in override scenario. Made-with: Cursor * fix(ingestion): SCC-ordered cross-file return-type propagation + multi-hop re-export resolution Fix CI failures on PR abhigyanpatwari#1050 (TypeScript registry-primary migration) by making `propagateImportedReturnTypes` deterministic via reverse- topological SCC ordering and updating the multi-hop re-export contract to match `followReexportChain` behavior. Why: the legacy pass mirrored an intermediate ref instead of the terminal type when an importer was processed before its source module had its own typeBindings chain-followed (4-file alias chain regression in `ts-simple` fixture: `models.User -> service.user -> app.user` collapsed to `getUser` instead of `User`). Reverse-topological walk of `indexes.sccs` (leaves first) lets every importer see the source's already-followed terminal type in a single pass. Changes: - `imported-return-types.ts`: rewrite to walk SCCs leaves-first, chain- follow the source module's typeBindings BEFORE mirroring, and chain- follow the importer's typeBindings AFTER mirroring. Cyclic SCCs reach a partial fixpoint (no convergence guarantee, ts-circular only asserts no-throw). - `finalize-algorithm.ts`: docstring update on `FinalizeFile.localDefs` to reflect that `followReexportChain` resolves multi-hop re-exports through barrels even when intermediates do not surface the name - surfacing is now a static optimization, not a correctness requirement. - `contract/scope-resolver.ts` Invariant I3: explicitly document the SCC ordering requirement. - `pipeline/run.ts`: split PROF timer into `finalize` and `propagate` so the pass's cost is observable independently. - `ARCHITECTURE.md` Performance notes: describe SCC-ordered propagation. - `imported-return-types.ts`: expand chain-depth comment (2x effective depth from pre/post follow), add multi-ref break rationale, add `ts-simple` motivating-fixture pointer. Tests: - `finalize-algorithm.test.ts`: add 4 cases (3-hop chain, cyclic re-export visited-set guard, wildcard re-export fall-through, multi-source first-match-wins); fix misleading shared nodeId in the thick variant; rename and update the multi-hop test for the new contract (transitiveVia assertion on the thin variant). - `imported-return-types.test.ts` (NEW): unit tests for the SCC pass pinning topological collapse, local-annotation guard, missing-source skip, and cyclic-SCC no-throw. - `cross-file-binding.test.ts` + `ts-deep-alias-chain` fixture (NEW): 5-file integration regression guard for SCC-ordered propagation through 4 module boundaries. Validation: 865 scope-resolution + cross-file tests pass on Windows; typecheck clean across both packages; only pre-existing Swift overload failures remain (verified on PR base commit, environmental). Made-with: Cursor * fix(ingestion): address PR abhigyanpatwari#1050 review findings — side-effect imports, resolve-cache perf, adapter signature Three independent fixes surfaced by the production-readiness review of the TypeScript registry-primary scope-resolution migration (RFC abhigyanpatwari#909 Ring 3). All three pass under both REGISTRY_PRIMARY_TYPESCRIPT=0 and =1. 1. Side-effect imports were silently dropped (correctness regression). The legacy DAG emitted IMPORTS edges for `import './polyfill'` because its tree-sitter query matches `(import_statement source: (string))` regardless of clause. The new registry-primary path returned `[]` from `splitImportStatement()` for clause-less imports, so no ParsedImport / ImportEdge was ever produced — silent file-level edge loss. Add a generic 'side-effect' variant to `ParsedImport` and `ImportEdge['kind']` in `gitnexus-shared`; finalize resolves the target file and pre-finalizes the edge (no `targetDefId`, no `BindingRef`) so the SCC fixpoint loop skips it. The TypeScript provider now emits + interprets the new kind end-to-end. The variant is intentionally generic so other languages (Rust `use foo as _`, Python module-init) can adopt it. 2. Per-import re-derivation in `resolveImportTarget` (perf regression). The TS adapter built `new Set(allFilePaths)` on every call and let `resolveTsImportTarget` re-derive `allFileList` / `normalizedFileList` and discard the `resolveCache`. For a workspace with N files and M imports that's O(N × M) work per pass. Wrap the adapter in a closure that memoizes all five derived values keyed on the orchestrator's `ReadonlySet` identity; reset only when the set reference changes (start of new pass). New cost: O(N + M). 3. Misleading fake `ParsedImport` in the adapter (architecture). The adapter constructed `{ kind: 'named', localName: '_', importedName: '_', targetRaw }` to call `resolveTsImportTarget`, even though only `targetRaw` and the structural-typed context are read. Extract `resolveTsTarget(targetRaw, ctx)` so the adapter has an honest signature; `resolveTsImportTarget` still works for other callers. Also extract `narrowTsContext` for the type narrowing. Tests: - New 4-file fixture `typescript-side-effect-imports` with two side-effect imports + one named import. - New "TypeScript side-effect imports" describe in `test/integration/resolvers/typescript.test.ts` (parity-gated by `ci-scope-parity.yml` — runs under both flag states). - Updated 2 unit tests to expect 1 side-effect ParsedImport and 4 `@import.statement` matches (was 0 / 3). - 785 / 785 TS scope-resolution tests pass under both REGISTRY_PRIMARY_TYPESCRIPT=0 and =1. Made-with: Cursor * fix(scope): address Codex adversarial review findings on PR abhigyanpatwari#1050 Four findings from the Codex adversarial review broke registry-primary TypeScript resolution for common patterns. All four now have unit and integration regression coverage that pass under both `REGISTRY_PRIMARY_TYPESCRIPT=0` (legacy DAG) and the default registry-primary path. [high] tsconfig path aliases dropped: Threaded `tsconfigPaths` through ScopeResolver via a new opaque `resolutionConfig` parameter and a `loadResolutionConfig(repoPath)` hook. The orchestrator (`scopeResolutionPhase` + `runScopeResolution`) loads it once per workspace pass and forwards into every `resolveImportTarget` call. TypeScript resolver now resolves `@/services/user` style imports through the standard resolver's alias branch. [high] TSX parsed with the wrong grammar: `emitTsScopeCaptures` now picks the parser/query by `filePath` (`.tsx` -> TSX grammar) and validates cached trees against the expected grammar via the new exported `tsCachedTreeMatchesGrammar` helper. Stale TS-grammar trees for `.tsx` files no longer leak through the scope query. [medium] Literal dynamic imports never linked: Added `kind: 'dynamic-resolved'` to `ParsedImport` and `ImportEdge`. The decomposer emits a synthetic `@import.literal` capture for string-literal dynamic imports; the interpreter maps that to `dynamic-resolved`; finalize pre-finalizes it as a file-level terminal (same shape as `side-effect`). `import('./feature')` now produces a real IMPORTS edge under the registry-primary path. Legacy DAG keeps its existing behavior — the new integration assertion is gated behind the flag. [medium] Namespace re-exports invisible from barrels: The decomposer now emits TWO captures for `export * as ns from './m'` — the existing `reexport-namespace` import draft AND a synthetic `@declaration.namespace` capture (via `buildNamespaceDeclarationMatch`). The latter creates a Namespace `SymbolDefinition` in the barrel's `localDefs`, so downstream `import { ns } from './barrel'` resolves through `findExportByName`. Regression fixtures under `gitnexus/test/fixtures/lang-resolution/`: - typescript-tsconfig-aliases (`@/` alias) - typescript-tsx-jsx (Button.tsx + App.tsx with JSX) - typescript-dynamic-import (`await import('./feature')`) - typescript-reexport-namespace (`export * as Models from './base'`) Validation: - gitnexus-shared builds clean - gitnexus typecheck clean - 385/385 TS scope-resolution tests pass under both `REGISTRY_PRIMARY_TYPESCRIPT=0` and default Made-with: Cursor * perf(scope): O(1) defById lookup + bounded re-export depth (PR abhigyanpatwari#1050 round 3) Addresses the round-3 PR abhigyanpatwari#1050 reviews (Claude adversarial + xkonjin): both flagged the existing O(N²) `findDefById` linear scan in `materializeBindings` and the unbounded recursion in `followReexportChain` as production-readiness blockers for TypeScript monorepos. Both fixes land alongside their regression tests under both `REGISTRY_PRIMARY_TYPESCRIPT=0` and the default registry-primary path. [high] materializeBindings O(N_files × N_defs × N_edges) → O(N_defs + N_edges): Build a `nodeId → SymbolDefinition` index map once at the top of `materializeBindings` (one O(N_defs) pass), then replace the per-edge `findDefById(files, edge.targetDefId)` linear scan with an O(1) `defById.get(edge.targetDefId)` lookup. Also drop the now-unused `findDefById` helper. At realistic TypeScript monorepo scale (~5k files × ~50 defs/file × ~100k linked import edges) this is the difference between ~25 s and a few ms inside finalize. Regression test in `finalize-algorithm.test.ts` builds 200 leaf files + 1 consumer importing one symbol from each, asserts every binding materializes correctly. [medium] followReexportChain unbounded recursion: The existing `visited` set caps depth at `O(N_files)` but allows recursion proportional to barrel-chain depth, mismatching the explicit "Iterative DFS to avoid stack overflow" policy in `tarjanSccs`. Added a `MAX_REEXPORT_DEPTH = 100` constant and a `depth` parameter to `followReexportChain` (defaults to 0); each recursive call passes `depth + 1` and the function returns `null` when the cap is exceeded. 100 is comfortably above any realistic hand-authored barrel chain (typical depth 1-5; auto-generated barrels rarely exceed 20) while staying well below JS engine call stack limits. Regression test wires a 200-link reexport chain and verifies the crawl terminates cleanly with `linkStatus: 'unresolved'` (no terminal def reachable within the budget). [low] synthesizeInstanceofNarrowings bare-identifier-only limitation: xkonjin's review #4 noted that the LHS narrowing only handles bare identifiers (`if (x instanceof Foo)`), not member expressions (`if (user.address instanceof Address)`). Added a JSDoc note explaining the constraint and pointing readers at field-type resolution as the workaround for member-chain receivers. Validation: - gitnexus-shared builds clean - gitnexus typecheck clean - 413/413 tests pass under both flag states for finalize-algorithm + TS unit + TS integration suites - 972/972 tests pass across full scope-resolution + Python + C# integration smoke (no cross-language regression) Made-with: Cursor * refactor(finalize): replace recursive followReexportChain with SCC-condensed iterative closure The legacy `followReexportChain` walked re-export drafts via mutual recursion guarded by a per-call visited set + a `MAX_REEXPORT_DEPTH` ceiling. Recursion is fragile (call-stack ceiling, no bound on depth that's actually meaningful), so this replaces it with a structurally better algorithm: a precomputed per-file re-export closure built by running Tarjan SCC over the re-export sub-graph and propagating names in reverse-topological order with a bounded intra-SCC fixpoint. Algorithm (`buildReexportClosures` in finalize-algorithm.ts): 1. Sub-graph: build the directed graph of `reexport` + `wildcard` drafts only (regular/namespace/dynamic imports do not contribute). 2. SCC condensation: run the same iterative `tarjanSccs` already used for the file-level import graph; output is in reverse-topo order so out-of-SCC neighbors are always already-finalized. 3. Per-SCC propagation: - Acyclic singleton: one pass populates from neighbors' closures. - Cyclic SCC: bounded fixpoint capped at |SCC|+1 iterations. With first-wins precedence the closure map is monotone, so each name needs at most |SCC| hops to traverse the cycle. Precedence (preserved from the recursive crawl): - Named re-exports take precedence over wildcards. - Within each kind, declaration order wins. Lookup at finalize time becomes O(1) (`lookupReexportedName`), down from O(chain_depth × drafts) per consult and recursive at that. Properties vs the legacy implementation: - Stack-safe by construction; no `MAX_REEXPORT_DEPTH` guard needed. - 1000-hop barrel chains now resolve in full (legacy capped at 100 and surfaced anything deeper as `unresolved`). - Cycles handled structurally via SCC, not via per-call visited set. - Same observable semantics: every existing test passes unchanged. Tests: - Replace the obsolete `MAX_REEXPORT_DEPTH (200-hop chain stops cleanly without stack overflow)` test (which asserted the OLD bug — that deep chains failed to resolve) with a positive 1000-hop test that asserts full resolution + accurate `transitiveVia`. Proves both the recursion is gone AND the closure correctly inherits the leaf def across all hops. - Update commentary on adjacent re-export tests to reference the closure mechanism. - Update `FinalizeFile.localDefs` JSDoc + import-decomposer.ts inline doc to point at `buildReexportClosures` instead of the removed function name. Validation: - gitnexus-shared builds cleanly. - gitnexus typechecks cleanly. - 28/28 finalize-algorithm.test.ts tests pass (incl. new 1000-hop). - 801/801 TypeScript scope-resolution tests pass under default (registry-primary) AND `REGISTRY_PRIMARY_TYPESCRIPT=0` (legacy DAG). - 404/404 Python + C# integration tests pass — no regression in cross-language consumers of the shared `finalize`. Made-with: Cursor * fix(scope): remove non-null assertions from scope resolution Made-with: Cursor * fix(scope): address TypeScript review follow-ups Made-with: Cursor * fix(scope): address TypeScript import review follow-ups Add regression coverage for non-binding import edges and circular TypeScript bindings so PR abhigyanpatwari#1050 review concerns stay visible without changing runtime semantics. Made-with: Cursor
zander-raycraft
pushed a commit
that referenced
this pull request
May 7, 2026
…mo / useCallback / useMemo / observer) (abhigyanpatwari#1261) * fix(typescript): name HOC-wrapped const declarations (forwardRef / memo / useCallback / useMemo / observer / debounce) Follow-up to issue abhigyanpatwari#1166 / PR abhigyanpatwari#1175. After fixing HOF callbacks (Promise fan-out, queryFn pair-arrows, multi-action Zustand stores) and JSX-as-call, the dominant residual 0%-capture pattern in real React UI codebases was the HOC-wrapped variable declaration: const Button = React.forwardRef((props, ref) => { ... }) const Card = memo((props) => { ... }) const handleClick = useCallback(() => { ... }, []) const computed = useMemo(() => { ... }, []) const debouncedSearch = debounce((q) => { ... }, 250) All share the AST shape `lexical_declaration > variable_declarator > call_expression > arguments > arrow_function`. Pre-fix, neither the registry-primary `query.ts` nor the legacy `tree-sitter-queries.ts` had a `@declaration.function` pattern matching this shape, and the legacy DAG's `tsExtractFunctionName` only walked `variable_declarator` and `pair` parents — `arguments` parents fell through with `funcName = null`. Result: every shadcn/Radix component, every memoised React component, and every `useCallback` / `useMemo` callback bound to a const registered as anonymous; calls inside attributed to the file. Sourcerer-fe audit: ~296 declarations affected (~57 forwardRef + ~21 memo + ~161 useCallback + ~57 useMemo). Fix: - 4 new tree-sitter patterns in `languages/typescript/query.ts` (registry-primary), anchored on the inner arrow_function / function_expression — same anchor discipline as the existing `lexical_declaration` and `pair` patterns from PR abhigyanpatwari#1175. - 8 mirrored patterns in `tree-sitter-queries.ts` (4 in TYPESCRIPT_QUERIES, 4 in JAVASCRIPT_QUERIES) for the legacy DAG and the CI parity gate. - New `arguments`-parent branch in `tsExtractFunctionName` that walks `arguments → call_expression → variable_declarator` and returns the const's name. Three guards keep it strictly scoped to HOC-wrapped declarations; bare statement-level HOC calls fall through anonymous. Tests: - 11 integration tests + 9 minimal TS/TSX fixtures exercising forwardRef / memo / useCallback / useMemo / observer / debounce, with positive (named-Function + correct CALLS edge), negative (no phantom Functions for unbound HOCs, no phantom self-loops, no first-sibling-wins leakage), and cross-pollination assertions. - 8 new unit tests in `call-attribution-issue-1166.test.ts` pinning the legacy-DAG path: 6 attribution tests + 2 @definition.function capture tests. Trade-off documented inline: chained array-method declarations (`const x = arr.find((y) => p(y))`) match the same shape and produce a mostly-harmless phantom `Function:x` with one outgoing edge. The false-positive cost is negligible vs. the React UI coverage gain. Verification: - 11/11 typescript-hoc-wrapped (registry-primary) - 26/26 call-attribution-issue-1166 (8 new + 18 pre-existing) - 266/266 across all 4 typescript resolver test files (registry) - 236/236 typescript.test.ts on legacy DAG (CI parity gate) - 1693/1693 across all non-Kotlin/Swift resolver test files - tsc --noEmit clean; prettier clean; eslint clean (no new warnings) Co-authored-by: Cursor <cursoragent@cursor.com> * test(typescript): pin documented HOC trade-offs and close var-form parity gap Addresses the four findings on PR abhigyanpatwari#1261 (Claude bot review for abhigyanpatwari#1261). All findings flagged missing assertion tests for behaviour already documented in code comments — none reported a real bug. The verdict was "production-ready with minor follow-ups"; these tests strengthen the documentation-to-test contract. [medium #1] Array-method false-positive Pin `const found = items.find((item) => predicate(item))` → `predicate.attributedTo === 'found'` as an accepted FP. The const is a value, never invoked, so no incoming CALLS edge ever points at it; the outgoing edge is a minor mis-attribution we accept rather than maintain a HOC allowlist. [medium #2] Nested HOCs (`memo(forwardRef(...))`) — no phantom Function:Wrapped Two integration tests in `typescript-hoc-wrapped.test.ts`: 1. `Wrapped` is NOT a Function node (the outer call's first arg is a call_expression, not an arrow — no @declaration.function pattern matches the outer shape). 2. The deepest arrow's `helper()` call is NOT attributed to Function:Wrapped (the deepest arrow is anonymous because call_expression.parent is `arguments`, not `variable_declarator`), and no Function-sourced CALLS originate from `nested.tsx`. [medium #3] Multi-arrow argument dedup Pin `const x = call(() => first(), () => second())` — both arrows share the same `arguments → call_expression → variable_declarator` ancestor chain on the legacy DAG, so both attribute to "x". Documents the registry-primary dedup story alongside. [low #4] `var X = HOC(...)` parity gap Registry-primary `query.ts` had `(variable_declaration ...)` HOC patterns but legacy `tree-sitter-queries.ts` (TS + JS) did not. Closes the gap by mirroring two `(variable_declaration ...)` HOC patterns into both legacy sections so the parity gate stays tight even if a codebase mixes `var X = HOC(...)` with `const X = HOC(...)`. Validation - Targeted: 41/41 (28 unit + 13 integration) on registry-primary. - Broader TS suite: 60/60 across 4 resolver test files. - CI parity gate (`typescript.test.ts`): 236/236 on legacy DAG and 236/236 on registry-primary. - Prettier clean. ESLint clean (5 pre-existing non-null-assertion warnings in the test file, unrelated). tsc --noEmit clean. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.