diff --git a/docs/code-indexing/cobol/README.md b/docs/code-indexing/cobol/README.md new file mode 100644 index 0000000000..c96eb4626f --- /dev/null +++ b/docs/code-indexing/cobol/README.md @@ -0,0 +1,100 @@ +# COBOL Code Indexing + +GitNexus indexes COBOL codebases using a **regex-only extraction** strategy, bypassing tree-sitter entirely. This document explains why, how the pipeline works, and links to detailed sub-documents. + +## Why Regex-Only? + +The tree-sitter-cobol grammar (v0.0.1) has three critical limitations that make it unusable for production indexing: + +| Issue | Impact | Severity | +|-------|--------|----------| +| External scanner hangs on ~5% of files | No timeout mechanism exists for the C scanner; the process blocks indefinitely | **Blocking** | +| Only ~15% of paragraph headers detected | Most procedure-division paragraphs are invisible to the grammar | High | +| Patch markers in cols 1-6 cause parse errors | Enterprise COBOL uses non-standard sequence area content (e.g., `mzADD`, `estero`, `#FIX`) | High | + +Because the external scanner hang cannot be interrupted (there is no `setTimeoutMicros` equivalent for tree-sitter), using tree-sitter-cobol would hang the indexing pipeline on a non-trivial fraction of real-world files. + +The regex-only approach provides: + +- **Speed**: ~1ms per file average extraction time +- **Reliability**: zero hangs, zero crashes across 13,000+ files +- **Coverage**: captures all critical symbols -- program name, paragraphs, sections, CALL, PERFORM, COPY, data items (01-77, 88-level), file declarations, FD entries, EXEC SQL/CICS blocks, ENTRY points, and MOVE statements + +## Architecture + +```mermaid +flowchart TD + A[Repository Scan] --> B{File Detection} + B -->|Extension match| C[COBOL file] + B -->|GITNEXUS_COBOL_DIRS match| C + B -->|No match| Z[Skip] + + C --> D{Copybook?} + D -->|Yes| E[Add to Copybook Map] + D -->|No| F[Source Program] + + E --> G[COPY Expansion Engine] + F --> G + + G -->|Inline copybook content| H[Expanded Source] + H --> I[Patch Marker Cleanup] + I --> J[Regex State Machine] + + J --> K[Extracted Symbols] + K --> L[Graph Model Builder] + L --> M[Knowledge Graph] + + subgraph "Per-Chunk Processing" + G + H + I + J + K + L + end + + subgraph "Post-Processing" + M --> N[Community Detection] + M --> O[Process Detection] + M --> P[Contract Detection] + end + + style J fill:#e8f5e9,stroke:#2e7d32 + style G fill:#e3f2fd,stroke:#1565c0 +``` + +## COBOL vs Tree-Sitter Languages + +| Feature | COBOL (Regex) | Tree-Sitter Languages | +|---------|--------------|----------------------| +| Parser | Single-pass regex state machine | tree-sitter grammar + queries | +| Speed | ~1ms/file | ~5ms/file | +| AST available | No | Yes | +| COPY expansion | Yes (pre-processing step) | N/A | +| Deep indexing | Data items, SQL, CICS, FD, ENTRY | Type annotations, generics, etc. | +| Call extraction | PERFORM (intra-file) + CALL (cross-program) | AST-based call site detection | +| Import extraction | COPY statements | `import`/`require`/`use`/`#include` | +| Coverage | All critical symbols | Language-dependent query coverage | +| Failure mode | Never hangs | External scanner can hang (COBOL only) | + +## Sub-Documents + +| Document | Description | +|----------|-------------| +| [File Detection](./file-detection.md) | Extension mapping, `GITNEXUS_COBOL_DIRS`, copybook classification | +| [COPY Expansion](./copy-expansion.md) | Copybook inlining, REPLACING transformations, cycle detection | +| [Regex Extraction](./regex-extraction.md) | State machine, regex patterns, line processing | +| [Deep Indexing](./deep-indexing.md) | Data items, EXEC SQL/CICS, file declarations, FD, ENTRY, MOVE | +| [Graph Model](./graph-model.md) | COBOL-specific node types, edge types, full annotated example | +| [Performance](./performance.md) | Benchmarks, worker pool tuning, caps, troubleshooting | + +## Key Source Files + +| File | Purpose | +|------|---------| +| `gitnexus/src/core/ingestion/cobol-preprocessor.ts` | Patch marker cleanup + regex extraction engine | +| `gitnexus/src/core/ingestion/cobol-copy-expander.ts` | COPY statement expansion with REPLACING | +| `gitnexus/src/core/ingestion/utils.ts` | `getLanguageFromPath`, `getLanguageFromFilename` | +| `gitnexus/src/core/ingestion/pipeline.ts` | `isCobolCopybook`, `expandCobolCopies`, `detectCrossProgamContracts` | +| `gitnexus/src/core/ingestion/workers/parse-worker.ts` | `processCobolRegexOnly` -- graph model builder | +| `gitnexus/src/core/ingestion/workers/worker-pool.ts` | Configurable sub-batch size for COBOL | diff --git a/docs/code-indexing/cobol/copy-expansion.md b/docs/code-indexing/cobol/copy-expansion.md new file mode 100644 index 0000000000..7c6aaa2a3e --- /dev/null +++ b/docs/code-indexing/cobol/copy-expansion.md @@ -0,0 +1,157 @@ +# COBOL COPY Expansion + +The COPY statement is COBOL's include mechanism -- analogous to `#include` in C or `import` in modern languages. GitNexus expands COPY statements **before** regex extraction so that symbols defined inside copybooks (data items, paragraphs, etc.) are visible in the program's extracted graph. + +## Supported Syntax + +### Basic COPY + +```cobol +COPY CPSESP. +COPY "WORKGRID.CPY". +``` + +Inlines the content of the named copybook, replacing the COPY line(s). + +### COPY with REPLACING + +```cobol +COPY CPSESP REPLACING "ANAZI-KEY" BY "LK-KEY". +COPY CPSESP REPLACING LEADING "ESP-" BY "LK-ESP-" + LEADING "KPSESPL" BY "LK-KPSESPL". +COPY LINKAGE REPLACING TRAILING "-IN" BY "-OUT". +``` + +Three REPLACING types are supported: + +| Type | Syntax | Behavior | Example | +| ------------ | ------------------------------------ | --------------------------------------- | -------------------------------- | +| **EXACT** | `REPLACING "OLD" BY "NEW"` | Replace exact identifier matches | `ANAZI-KEY` becomes `LK-KEY` | +| **LEADING** | `REPLACING LEADING "PFX-" BY "NEW-"` | Replace prefix on all COBOL identifiers | `ESP-NAME` becomes `LK-ESP-NAME` | +| **TRAILING** | `REPLACING TRAILING "-IN" BY "-OUT"` | Replace suffix on all COBOL identifiers | `DATA-IN` becomes `DATA-OUT` | + +Multiple REPLACING clauses can appear in a single COPY statement. They are applied in order to each COBOL identifier in the copybook content. + +### Multi-Line COPY + +COPY statements can span multiple lines (standard COBOL continuation rules apply): + +```cobol + COPY CPSESP REPLACING + - LEADING "ESP-" BY "LK-ESP-" + - LEADING "KPSESPL" BY "LK-KPSESPL". +``` + +Continuation lines (indicator `-` in column 7) are merged before COPY statement scanning. + +## Expansion Flow + +```mermaid +sequenceDiagram + participant Pipeline + participant Expander as COPY Expander + participant Resolver + participant Reader + + Pipeline->>Pipeline: Identify all COBOL files + Pipeline->>Pipeline: Classify copybooks vs programs + Pipeline->>Reader: Read all copybook content upfront + Reader-->>Pipeline: Copybook content map (name -> content) + + loop For each source file in chunk + Pipeline->>Expander: expandCopies(content, filePath, resolveFile, readFile) + Expander->>Expander: Merge continuation lines + Expander->>Expander: Detect COPY statements via regex + + loop For each COPY statement (reverse order) + Expander->>Resolver: resolveFile(copyTarget) + Resolver-->>Expander: Copybook key or null + + alt Resolved successfully + Expander->>Reader: readFile(resolvedKey) + Reader-->>Expander: Copybook content + + Expander->>Expander: Apply REPLACING transformations + Expander->>Expander: Recurse for nested COPYs (depth + 1) + Expander->>Expander: Splice expanded content into output + else Not resolved + Expander->>Expander: Keep original COPY line + end + end + + Expander-->>Pipeline: Expanded content + resolution metadata + Pipeline->>Pipeline: Replace file content with expanded content + end +``` + +The return type `CopyExpansionResult` contains `expandedContent` and `copyResolutions`. The `expansionDepth` field has been removed from the return type (it was unused by callers). + +COPY statement line numbers in `CopyResolution` are 1-based (consistent with the preprocessor's line numbering). The splice operation that replaces COPY lines with expanded content adjusts for 0-based array indexing internally. + +## Cycle Detection + +Circular COPY references (e.g., copybook A includes copybook B which includes copybook A) are detected and handled: + +1. Each expansion chain maintains a `visited` set of resolved copybook paths +2. If a copybook path is already in the visited set, the expansion is skipped +3. A `warnedCircular` set (internal to `expandCopies()`, not a parameter) deduplicates warning messages within a single file expansion + +Known circular copybooks in PROJECT-NAME: `ANAZI`, `ANDIP`, `QDIPE` (self-referential includes). + +## Max Depth + +Nested COPY expansion is limited to **10 levels** (`DEFAULT_MAX_DEPTH`). If a COPY chain exceeds this depth, a warning is logged and the remaining COPY statements are left unexpanded. + +## Max Total Expansions + +A breadth amplification guard caps the total number of COPY expansions across all branches within a single file to **500** (`MAX_TOTAL_EXPANSIONS`). This prevents exponential blowup from diamond-shaped COPY graphs where N copybooks each include N other copybooks. Once the limit is reached, further COPY statements in that file are left unexpanded and a single warning is logged. + +## REPLACING Application Detail + +The REPLACING engine works by scanning all COBOL identifiers (matching `\b[A-Z][A-Z0-9-]*\b`) in the copybook content and applying each replacement rule: + +``` +Original copybook content: + 05 ESP-NAME PIC X(30). + 05 ESP-CODE PIC X(10). + 05 KPSESPL-FLAG PIC X(01). + +After REPLACING LEADING "ESP-" BY "LK-ESP-" LEADING "KPSESPL" BY "LK-KPSESPL": + 05 LK-ESP-NAME PIC X(30). + 05 LK-ESP-CODE PIC X(10). + 05 LK-KPSESPL-FLAG PIC X(01). +``` + +For LEADING replacements, the engine checks if each identifier starts with the `from` prefix (case-insensitive) and replaces only the prefix portion, preserving the rest of the identifier. + +For TRAILING replacements, the same logic applies to suffixes. + +For EXACT replacements, only identifiers that match the `from` value exactly (case-insensitive) are replaced. + +## Copybook Resolution + +The resolver tries multiple strategies to match a COPY target name to a copybook file: + +1. **Exact match**: `COPY CPSESP` resolves to copybook named `CPSESP` +2. **Strip extension**: `COPY WORKGRID.CPY` strips `.CPY` and resolves to `WORKGRID` +3. **Add extension**: `COPY CPSESP` tries `CPSESP.CPY` and `CPSESP.COPY` + +If no match is found, the COPY statement is left in place (unexpanded) and a resolution record with `resolvedPath: null` is created. + +## Pipeline Integration + +The expansion runs **per chunk**, after file content is read but before dispatch to worker threads: + +1. All copybook files are read upfront (they are typically small, collectively under 100MB) +2. Per chunk, the copybook map is merged with chunk content (in case a chunk contains copybooks) +3. Only programs (not copybooks themselves) undergo expansion +4. The expanded content replaces the original content in-place before worker dispatch + +## Inline Comment Handling + +The copy expander's `stripInlineComment()` helper is quote-aware: pipe characters (`|`) inside single- or double-quoted strings are preserved. This matches the same quote-aware logic used by the preprocessor. + +## Source Files + +- `gitnexus/src/core/ingestion/cobol-copy-expander.ts` -- `expandCopies()`, `parseReplacingClause()`, `applyReplacing()` +- `gitnexus/src/core/ingestion/pipeline.ts` -- `expandCobolCopies()`, copybook map construction, chunk integration diff --git a/docs/code-indexing/cobol/deep-indexing.md b/docs/code-indexing/cobol/deep-indexing.md new file mode 100644 index 0000000000..f283767820 --- /dev/null +++ b/docs/code-indexing/cobol/deep-indexing.md @@ -0,0 +1,312 @@ +# COBOL Deep Indexing + +Beyond basic symbol extraction (program name, paragraphs, CALL, PERFORM, COPY), GitNexus performs deep indexing of COBOL-specific constructs: data items, EXEC SQL/CICS blocks, file declarations, FD entries, ENTRY points, and MOVE statements. + +## Data Items + +### Level Numbers + +| Level Range | Meaning | Graph Node Type | +|-------------|---------|-----------------| +| 01 | Record (group item) | `Record` | +| 02-49 | Elementary/group items | `Property` | +| 66 | RENAMES | `Property` | +| 77 | Independent item | `Property` | +| 88 | Condition name | `Const` | + +FILLER items are skipped (no useful name for the graph). + +### Clauses Parsed + +The `parseDataItemClauses()` function extracts these clauses from the trailing text of a data item declaration: + +| Clause | Pattern | Example | +|--------|---------|---------| +| `PIC` / `PICTURE` | `\bPIC(?:TURE)?\s+(?:IS\s+)?(\S+)` | `PIC X(30)`, `PICTURE IS 9(5)V99` | +| `USAGE` | `\bUSAGE\s+(?:IS\s+)?(COMP\|BINARY\|...)` | `USAGE IS COMP-3`, `BINARY` | +| `REDEFINES` | `\bREDEFINES\s+([A-Z][A-Z0-9-]+)` | `REDEFINES WK-DATE-NUM` | +| `OCCURS` | `\bOCCURS\s+(\d+)` | `OCCURS 12 TIMES` | + +Standalone COMP variants (without the `USAGE` keyword) are also detected: `COMP`, `COMP-1` through `COMP-6`, `COMP-X`, `BINARY`, `PACKED-DECIMAL`. + +### Data Hierarchy + +Data items form a hierarchical structure based on level numbers. The extractor uses a **stack algorithm**: + +``` +Processing order: + 01 WK-RECORD -> push {01, WK-RECORD} -> parent: Module + 05 WK-NAME -> push {05, WK-NAME} -> parent: WK-RECORD (01 < 05) + 10 WK-FIRST -> push {10, WK-FIRST} -> parent: WK-NAME (05 < 10) + 10 WK-LAST -> pop WK-FIRST, push -> parent: WK-NAME (05 < 10) + 05 WK-CODE -> pop WK-LAST, WK-NAME -> parent: WK-RECORD (01 < 05) + 88 WK-ACTIVE -> (88 handled separately) -> parent: WK-CODE +``` + +The stack maintains items where each entry's level is strictly less than the next. When a new item arrives with a level <= the top of stack, items are popped until the stack top has a smaller level. A `CONTAINS` edge is created from the stack top to the new item. + +For 88-level condition names, the parent is the immediately preceding non-88 data item (found by scanning backwards). + +### Annotated Example + +```cobol + 01 WK-EMPLOYEE. + 05 WK-EMP-ID PIC 9(6). + 05 WK-EMP-NAME PIC X(30). + 05 WK-EMP-STATUS PIC X(01). + 88 WK-ACTIVE VALUE "A". + 88 WK-INACTIVE VALUE "I". + 05 WK-SALARY PIC 9(7)V99 COMP-3. + 05 WK-DEPT PIC X(04) OCCURS 3 TIMES. +``` + +Produces: +- `Record` node: `WK-EMPLOYEE` (level 01, section: working-storage) +- `Property` nodes: `WK-EMP-ID`, `WK-EMP-NAME`, `WK-EMP-STATUS`, `WK-SALARY`, `WK-DEPT` +- `Const` nodes: `WK-ACTIVE` (values: `A`), `WK-INACTIVE` (values: `I`) +- `CONTAINS` edges: `WK-EMPLOYEE -> WK-EMP-ID`, `WK-EMPLOYEE -> WK-EMP-NAME`, etc. +- `CONTAINS` edges: `WK-EMP-STATUS -> WK-ACTIVE`, `WK-EMP-STATUS -> WK-INACTIVE` + +### Data Item Cap + +A maximum of **500 data items per file** (`MAX_DATA_ITEMS_PER_FILE`) are processed. Some COBOL programs (especially after COPY expansion) can have 10,000+ data items, which would cause graph bloat and push the V8 relationship Map past its 16.7M entry limit across thousands of files. + +The cap applies after extraction: the first 500 items in source order are kept. Since 01-level records appear first, critical top-level structure is preserved. + +## EXEC SQL + +EXEC SQL blocks are accumulated across lines between `EXEC SQL` and `END-EXEC`, then parsed as a unit. + +### Operation Classification + +The first SQL keyword determines the operation: + +| First Keyword | Operation | +|---------------|-----------| +| `SELECT` | SELECT | +| `INSERT` | INSERT | +| `UPDATE` | UPDATE | +| `DELETE` | DELETE | +| `DECLARE` | DECLARE | +| `OPEN` | OPEN | +| `CLOSE` | CLOSE | +| `FETCH` | FETCH | +| *(anything else)* | OTHER | + +### Table Extraction + +Tables are extracted from SQL clauses: + +| Clause Pattern | Example | +|----------------|---------| +| `FROM ` | `SELECT * FROM EMPLOYEES` | +| `INSERT INTO
` | `INSERT INTO EMPLOYEES` | +| `UPDATE
` | `UPDATE EMPLOYEES SET ...` | +| `JOIN
` | `LEFT JOIN DEPARTMENTS ON ...` | + +Note: The `INTO` pattern is restricted to `INSERT INTO` to avoid false positives from `FETCH ... INTO :host-var` and `SELECT ... INTO :host-var` statements, where `INTO` introduces host variables rather than table names. + +### Cursor Detection + +```cobol + EXEC SQL + DECLARE C-EMPLOYEES CURSOR FOR + SELECT EMP-ID, EMP-NAME FROM EMPLOYEES + WHERE DEPT = :WK-DEPT + END-EXEC +``` + +Extracts: cursor `C-EMPLOYEES`, table `EMPLOYEES`, host variable `WK-DEPT`. + +### Host Variables + +Host variables are COBOL variables referenced in SQL with a `:` prefix. The colon is stripped: + +```sql +WHERE EMP-ID = :WK-EMP-ID AND DEPT = :WK-DEPT +``` + +Extracts: `WK-EMP-ID`, `WK-DEPT`. + +### Graph Output + +- `CodeElement` node per table, with description `sql-table op:{OP}` +- `CodeElement` node per cursor, with description `sql-cursor` +- `ACCESSES` edge from Module to each CodeElement +- Deduplication: if the same table appears in multiple SQL blocks, only one node is created + +## EXEC CICS + +EXEC CICS blocks are accumulated and parsed similarly to SQL blocks. + +### Command Detection + +Two-word commands are detected first (matched against the block start): + +``` +SEND MAP, RECEIVE MAP, SEND TEXT, SEND CONTROL, READ NEXT, READ PREV +``` + +If no two-word command matches, the first word is used (e.g., `LINK`, `XCTL`, `RETURN`, `READ`, `WRITE`). + +### Extraction + +| Element | Pattern | Example | +|---------|---------|---------| +| MAP name | `MAP('name')` or `MAP("name")` | `EXEC CICS SEND MAP('EMPMENU')` | +| PROGRAM name | `PROGRAM('name')` or `PROGRAM("name")` | `EXEC CICS LINK PROGRAM('BGTABUP')` | +| TRANSID | `TRANSID('name')` or `TRANSID("name")` | `EXEC CICS START TRANSID('EMP1')` | + +### Graph Output + +- MAP: `CodeElement` node with description `cics-map cmd:{CMD}` + `ACCESSES` edge from Module +- PROGRAM: `CALLS` edge (cross-program call via CICS LINK/XCTL) +- TRANSID: `CodeElement` node with description `cics-transid cmd:{CMD}` + `ACCESSES` edge from Module + +### Annotated Example + +```cobol + EXEC CICS + SEND MAP('EMPMENU') + MAPSET('EMPSET') + FROM(WK-MAP-DATA) + ERASE + END-EXEC +``` + +Produces: +- `CodeElement` node: `EMPMENU` (description: `cics-map cmd:SEND MAP`) +- `ACCESSES` edge: Module -> `EMPMENU` + +## File Declarations + +SELECT statements in the INPUT-OUTPUT SECTION are accumulated across multiple lines (until a period terminator) and parsed for: + +| Clause | Pattern | Example | +|--------|---------|---------| +| SELECT | `SELECT ` | `SELECT MASTER-FILE` | +| ASSIGN | `ASSIGN TO ` | `ASSIGN TO "MASTER.DAT"` | +| ORGANIZATION | `ORGANIZATION IS ` | `ORGANIZATION IS INDEXED` | +| ACCESS | `ACCESS MODE IS ` | `ACCESS MODE IS DYNAMIC` | +| RECORD KEY | `RECORD KEY IS ` | `RECORD KEY IS WK-EMP-ID` | +| FILE STATUS | `FILE STATUS IS ` | `FILE STATUS IS WK-FILE-STATUS` | + +### Graph Output + +- `CodeElement` node with description containing all parsed clauses (e.g., `select org:INDEXED access:DYNAMIC key:WK-EMP-ID status:WK-FILE-STATUS assign:MASTER.DAT`) +- `RECORD_KEY_OF` edge: from Property node to CodeElement (confidence 0.8) +- `FILE_STATUS_OF` edge: from Property node to CodeElement (confidence 0.8) + +## FD Entries + +FD (File Description) entries associate a file name with its record layout: + +```cobol + FD MASTER-FILE. + 01 MASTER-RECORD. + 05 MR-EMP-ID PIC 9(6). + 05 MR-EMP-NAME PIC X(30). +``` + +The extractor tracks `pendingFdName` state: when an `FD` line is seen, the next 01-level data item becomes its record. + +### Graph Output + +- `CodeElement` node with description `fd record:{recordName}` +- `CONTAINS` edge: FD CodeElement -> Record node +- `CONTAINS` edge: SELECT CodeElement -> FD CodeElement (linking file declaration to file description) + +## ENTRY Points + +The `ENTRY` statement defines additional entry points into a COBOL program (in addition to the main program entry): + +```cobol + ENTRY "SUBPROG" USING WK-PARAM-1 WK-PARAM-2. +``` + +### Graph Output + +- `Constructor` node with description `entry params:{param1},{param2}` (or just `entry` if no parameters) +- `CONTAINS` edge: Module -> Constructor +- Symbol table entry (so the entry point is discoverable by name) + +## PROCEDURE DIVISION USING + +```cobol + PROCEDURE DIVISION USING WK-INPUT-REC WK-OUTPUT-REC. +``` + +The USING clause identifies parameters received by the program from its caller. + +### Graph Output + +- `RECEIVES` edge: Module -> Property (for each parameter name, confidence 0.8) + +## MOVE Statements + +MOVE statements produce `ACCESSES` edges in the graph: + +```cobol + MOVE WK-NAME TO OUT-NAME. + MOVE CORRESPONDING WK-INPUT TO WK-OUTPUT. + MOVE CORR WK-IN TO WK-OUT. +``` + +### Extraction Details + +- Source and target identifiers are captured +- `CORRESPONDING` and its abbreviation `CORR` are both recognized (bulk field-by-field move) +- Figurative constants (SPACES, ZEROS, LOW-VALUES, HIGH-VALUES, QUOTES, ALL) are skipped +- The enclosing paragraph (`caller`) is tracked for context + +### MOVE CORRESPONDING / CORR Edge Reasons + +MOVE CORRESPONDING (and CORR) produces distinct edge reasons to differentiate from simple MOVE: + +| Edge | Reason (simple MOVE) | Reason (CORRESPONDING/CORR) | +|------|---------------------|-----------------------------| +| Read (source) | `cobol-move-read` | `cobol-move-corresponding-read` | +| Write (target) | `cobol-move-write` | `cobol-move-corresponding-write` | + +This distinction allows queries to find bulk field-by-field moves separately from simple variable assignments. + +## GO TO DEPENDING ON + +The `GO TO` statement with multiple targets and a `DEPENDING ON` clause is a computed branch: + +```cobol + GO TO PARA-1 PARA-2 PARA-3 + DEPENDING ON WK-SELECTOR. +``` + +All target paragraph names are extracted and emitted as separate `gotos` entries. Each target produces a `CALLS` edge in the graph (same semantics as PERFORM). The `DEPENDING ON` variable is not currently tracked as a data-flow dependency. + +## SORT INPUT/OUTPUT PROCEDURE + +SORT and MERGE statements can specify procedural entry points instead of file-based I/O: + +```cobol + SORT SORT-FILE ON ASCENDING KEY SORT-KEY + INPUT PROCEDURE IS PREPARE-INPUT + OUTPUT PROCEDURE IS FORMAT-OUTPUT. +``` + +`INPUT PROCEDURE IS` and `OUTPUT PROCEDURE IS` targets are extracted as control-flow targets (same as PERFORM). They produce `performs` entries and corresponding `CALLS` edges in the graph. + +## Fixed-Format Literal Continuation + +In fixed-format COBOL, string literals can span multiple lines using the continuation indicator (`-` in column 7). When a continuation line starts with a quote character, the extractor joins it with the predecessor by removing the trailing quote from the previous line and the opening quote from the continuation: + +``` +Line N: MOVE "THIS IS A LONG STRI +Line N+1 (cont): - "NG VALUE" TO WK-FIELD. +Merged: MOVE "THIS IS A LONG STRING VALUE" TO WK-FIELD. +``` + +The trailing `"` on line N and the opening `"` on line N+1 are both removed, producing a seamless literal. If no matching quote is found on the predecessor line, the continuation is appended as-is. + +## Source Files + +- `gitnexus/src/core/ingestion/cobol-preprocessor.ts` -- All extraction logic, clause parsers, EXEC block parsers +- `gitnexus/src/core/ingestion/workers/parse-worker.ts` -- `processCobolRegexOnly()`, graph node/edge emission +- `gitnexus/src/core/ingestion/parsing-processor.ts` -- Sequential fallback with same `MAX_DATA_ITEMS_PER_FILE` cap diff --git a/docs/code-indexing/cobol/file-detection.md b/docs/code-indexing/cobol/file-detection.md new file mode 100644 index 0000000000..60b60918a7 --- /dev/null +++ b/docs/code-indexing/cobol/file-detection.md @@ -0,0 +1,126 @@ +# COBOL File Detection + +GitNexus detects COBOL files through two mechanisms: extension-based mapping and directory-based override for extensionless files. This document covers both, plus the copybook/program classification logic. + +## Extension Mapping + +### Program Extensions + +| Extension | Type | +|-----------|------| +| `.cbl` | COBOL program | +| `.cob` | COBOL program | +| `.cobol` | COBOL program | + +### Copybook Extensions + +| Extension | Type | Notes | +|-----------|------|-------| +| `.cpy` | Copybook | Standard | +| `.copy` | Copybook | Standard | +| `.gnm` / `.GNM` | Copybook | Enterprise (GnuCOBOL naming) | +| `.fd` / `.FD` | Copybook | File Description fragment | +| `.wrk` / `.WRK` | Copybook | Working-Storage fragment | +| `.sel` / `.SEL` | Copybook | SELECT clause fragment | +| `.open` / `.OPEN` | Copybook | File OPEN fragment | +| `.close` / `.CLOSE` | Copybook | File CLOSE fragment | +| `.ini` / `.INI` | Copybook | Initialization fragment | +| `.def` / `.DEF` | Copybook | Definition fragment | + +All extension matching is case-sensitive in `getLanguageFromFilename` (the extensions above are matched as written, including uppercase variants like `.GNM`). + +## Extensionless File Detection: `GITNEXUS_COBOL_DIRS` + +Many enterprise COBOL repositories use extensionless files -- the filename alone identifies the program (e.g., `s/BGTABFL` is the source for program `BGTABFL`). GitNexus handles this via the `GITNEXUS_COBOL_DIRS` environment variable. + +### Configuration + +Set `GITNEXUS_COBOL_DIRS` to a comma-separated list of directory names: + +```bash +# Files in s/, c/, and wfproc/ directories (at any depth) are treated as COBOL +export GITNEXUS_COBOL_DIRS=s,c,wfproc +``` + +The matching is **case-insensitive** and checks all path segments: + +- `/repo/s/BGTABFL` -- matches segment `s` -- COBOL +- `/repo/src/c/CPSESP` -- matches segment `c` -- COBOL +- `/repo/wfproc/WF001` -- matches segment `wfproc` -- COBOL +- `/repo/docs/README` -- no matching segment -- skipped + +### Decision Tree + +```mermaid +flowchart TD + A[getLanguageFromPath] --> B[getLanguageFromFilename] + B --> C{Known extension?} + C -->|Yes .cbl/.cob/.cobol/.cpy/...| D[Return COBOL] + C -->|Yes .ts/.py/.java/...| E[Return other language] + C -->|No match| F{Has extension?} + + F -->|"Has dot in basename"| G[Return null] + F -->|"No dot = extensionless"| H{GITNEXUS_COBOL_DIRS set?} + + H -->|No| G + H -->|Yes| I{Any path segment
matches a configured dir?} + + I -->|Yes| D + I -->|No| G + + style D fill:#e8f5e9,stroke:#2e7d32 + style G fill:#ffebee,stroke:#c62828 +``` + +### Implementation Detail + +The `GITNEXUS_COBOL_DIRS` value is parsed once (on first call) and cached in a `Set`: + +```typescript +// From gitnexus/src/core/ingestion/utils.ts +const getCobolDirs = (): Set => { + if (_cobolDirs) return _cobolDirs; + const raw = process.env.GITNEXUS_COBOL_DIRS; + _cobolDirs = raw + ? new Set(raw.split(',').map(d => d.trim().toLowerCase())) + : new Set(); + return _cobolDirs; +}; +``` + +The path segment check splits the full path on `/` and tests each segment against the cached set. + +## Copybook vs Program Classification + +After a file is identified as COBOL, it must be classified as either a **program** (to be parsed for symbols) or a **copybook** (to be loaded into the copybook map for COPY expansion). + +### Classification Rules + +A COBOL file is classified as a **copybook** if ANY of these conditions is true: + +1. It has a recognized copybook extension (`.cpy`, `.copy`, `.gnm`, `.fd`, `.wrk`, `.sel`, `.open`, `.close`, `.ini`, `.def`) +2. It is an extensionless file whose path contains a directory segment matching one of: `c`, `copy`, `copybooks`, `copylib`, `cpy` + +A file is classified as a **program** if: + +1. It has a program extension (`.cbl`, `.cob`, `.cobol`), OR +2. It is extensionless and does NOT match any copybook directory pattern + +### Copybook Name Resolution + +Copybook names are derived from the filename: + +- Strip the extension (if any) +- Convert to uppercase + +Examples: +- `c/CPSESP` -- name: `CPSESP` +- `copy/workgrid.cpy` -- name: `WORKGRID` +- `c/ANAZI.GNM` -- name: `ANAZI` + +This name is used to resolve `COPY CPSESP.` statements during expansion. + +## Source Files + +- `gitnexus/src/core/ingestion/utils.ts` -- `getLanguageFromPath()`, `getLanguageFromFilename()`, `getCobolDirs()` +- `gitnexus/src/core/ingestion/pipeline.ts` -- `isCobolCopybook()`, `getCopybookName()`, `COPYBOOK_EXTENSIONS`, `COBOL_PROGRAM_EXTENSIONS` diff --git a/docs/code-indexing/cobol/graph-model.md b/docs/code-indexing/cobol/graph-model.md new file mode 100644 index 0000000000..de82c0723d --- /dev/null +++ b/docs/code-indexing/cobol/graph-model.md @@ -0,0 +1,193 @@ +# COBOL Graph Model + +This document describes the graph nodes and edges that GitNexus creates for COBOL codebases. The COBOL graph model is richer than most tree-sitter languages because it captures domain-specific constructs: file declarations, FD entries, data hierarchies, SQL tables, CICS maps, and cross-program contracts. + +## Entity-Relationship Diagram + +```mermaid +erDiagram + File ||--o{ Module : DEFINES + File ||--o{ Function : DEFINES + File ||--o{ Namespace : DEFINES + File ||--o{ Record : DEFINES + File ||--o{ Property : DEFINES + File ||--o{ Const : DEFINES + File ||--o{ CodeElement : DEFINES + File ||--o{ Constructor : DEFINES + File }o--o{ File : IMPORTS + + Module ||--o{ Record : CONTAINS + Module ||--o{ Constructor : CONTAINS + Module }o--o{ CodeElement : ACCESSES + Module }o--o{ Module : CALLS + Module }o--o{ Module : CONTRACTS + Module }o--o{ Property : RECEIVES + + Record ||--o{ Property : CONTAINS + Record ||--o{ Const : CONTAINS + Record }o--o{ Record : REDEFINES + + Property ||--o{ Property : CONTAINS + Property ||--o{ Const : CONTAINS + Property }o--o{ Property : REDEFINES + Property }o--o{ CodeElement : RECORD_KEY_OF + Property }o--o{ CodeElement : FILE_STATUS_OF + + CodeElement ||--o{ CodeElement : CONTAINS + CodeElement ||--o{ Record : CONTAINS + + Function }o--o{ Function : CALLS +``` + +## Node Types + +| Node Type | COBOL Concept | Created From | Example | +|-----------|--------------|--------------|---------| +| `Module` | PROGRAM-ID | `PROGRAM-ID. BGTABFL` | Name: `BGTABFL`, description may include author and date | +| `Function` | Paragraph | `PROCESS-RECORD.` at column 8 | Name: `PROCESS-RECORD` | +| `Namespace` | Procedure section | `MAIN-LOGIC SECTION.` at column 8 | Name: `MAIN-LOGIC` | +| `Record` | 01-level data item | `01 WK-EMPLOYEE.` | Description: `level:01 section:working-storage` | +| `Property` | 02-49/66/77 data item | `05 WK-NAME PIC X(30).` | Description: `level:05 pic:X(30) section:working-storage` | +| `Const` | 88-level condition | `88 WK-ACTIVE VALUE "A".` | Description: `level:88 values:A` | +| `CodeElement` | SELECT, FD, SQL table, CICS map, cursor, transid | Various | Description varies by subtype | +| `Constructor` | ENTRY point | `ENTRY "SUBPROG" USING WK-DATA` | Description: `entry params:WK-DATA` | + +### CodeElement Subtypes + +CodeElement is used for multiple COBOL constructs, distinguished by their description prefix: + +| Subtype | ID Pattern | Description Format | Example | +|---------|-----------|-------------------|---------| +| File SELECT | `CodeElement:{path}:SELECT:{name}` | `select org:INDEXED access:DYNAMIC ...` | `SELECT MASTER-FILE` | +| FD entry | `CodeElement:{path}:FD:{name}` | `fd record:{recordName}` | `FD MASTER-FILE` | +| SQL table | `CodeElement:{path}:sql-table:{name}` | `sql-table op:SELECT` | Table `EMPLOYEES` | +| SQL cursor | `CodeElement:{path}:sql-cursor:{name}` | `sql-cursor` | Cursor `C-EMPLOYEES` | +| CICS map | `CodeElement:{path}:cics-map:{name}` | `cics-map cmd:SEND MAP` | Map `EMPMENU` | +| CICS transid | `CodeElement:{path}:cics-transid:{name}` | `cics-transid cmd:START` | Transid `EMP1` | + +## Edge Types + +| Edge Type | Source | Target | Created By | Confidence | Example | +|-----------|--------|--------|-----------|------------|---------| +| `DEFINES` | File | any node | File defines its symbols | 1.0 | File -> Module `BGTABFL` | +| `CALLS` | Function | Function | `PERFORM X [THRU Y]` | (via call-processor) | `PROCESS-RECORD` -> `CALC-TAX` | +| `CALLS` | Module | Module | `CALL "BGTABUP"` | (via call-processor) | `BGTABFL` -> `BGTABUP` | +| `CALLS` | Module | Module | `EXEC CICS LINK PROGRAM('X')` | (via call-processor) | `BGTABFL` -> `BGTABUP` | +| `IMPORTS` | File | File | `COPY copybook` | (via import-processor) | Source file -> Copybook file | +| `CONTAINS` | Module | Record | Data hierarchy root | 1.0 | `BGTABFL` -> `WK-EMPLOYEE` | +| `CONTAINS` | Record | Property | Data hierarchy | 1.0 | `WK-EMPLOYEE` -> `WK-NAME` | +| `CONTAINS` | Property | Property | Nested data items | 1.0 | `WK-ADDRESS` -> `WK-CITY` | +| `CONTAINS` | Record/Property | Const | 88-level parent | 1.0 | `WK-STATUS` -> `WK-ACTIVE` | +| `CONTAINS` | CodeElement (FD) | Record | FD record link | 1.0 | `FD:MASTER-FILE` -> `MASTER-RECORD` | +| `CONTAINS` | CodeElement (SELECT) | CodeElement (FD) | SELECT-FD link | 0.9 | `SELECT:MASTER-FILE` -> `FD:MASTER-FILE` | +| `CONTAINS` | Module | Constructor | ENTRY in module | 1.0 | `BGTABFL` -> `SUBPROG` | +| `REDEFINES` | Record | Record | `01 X REDEFINES Y` | 1.0 | `WK-DATE-NUM` -> `WK-DATE-ALPHA` | +| `REDEFINES` | Property | Property | `05 X REDEFINES Y` | 1.0 | `WK-CODE-NUM` -> `WK-CODE-ALPHA` | +| `RECORD_KEY_OF` | Property | CodeElement (SELECT) | `RECORD KEY IS field` | 0.8 | `WK-EMP-ID` -> `SELECT:MASTER-FILE` | +| `FILE_STATUS_OF` | Property | CodeElement (SELECT) | `FILE STATUS IS field` | 0.8 | `WK-FS` -> `SELECT:MASTER-FILE` | +| `ACCESSES` | Module | CodeElement | EXEC SQL/CICS | 0.9 | `BGTABFL` -> `sql-table:EMPLOYEES` | +| `RECEIVES` | Module | Property | `PROCEDURE USING` | 0.8 | `BGTABFL` -> `WK-INPUT-REC` | +| `CONTRACTS` | Module | Module | Shared copybook detection | 0.9 | `BGTABFL` -> `BGTABUP` (via `CPSESP`) | + +## Full Annotated Example + +Given this COBOL program: + +```cobol + IDENTIFICATION DIVISION. + PROGRAM-ID. EMPMAINT. + AUTHOR. Development Team. + + ENVIRONMENT DIVISION. + INPUT-OUTPUT SECTION. + FILE-CONTROL. + SELECT EMP-FILE + ASSIGN TO "EMPLOYEE.DAT" + ORGANIZATION IS INDEXED + ACCESS MODE IS DYNAMIC + RECORD KEY IS EMP-ID + FILE STATUS IS WS-FILE-STATUS. + + DATA DIVISION. + FILE SECTION. + FD EMP-FILE. + 01 EMP-RECORD. + 05 EMP-ID PIC 9(6). + 05 EMP-NAME PIC X(30). + + WORKING-STORAGE SECTION. + 01 WS-FLAGS. + 05 WS-FILE-STATUS PIC X(02). + 05 WS-EOF-FLAG PIC X(01). + 88 WS-EOF VALUE "Y". + + LINKAGE SECTION. + 01 LK-SEARCH-KEY PIC 9(6). + + PROCEDURE DIVISION USING LK-SEARCH-KEY. + MAIN-LOGIC SECTION. + MAIN-START. + PERFORM OPEN-FILE + PERFORM PROCESS-RECORDS + PERFORM CLOSE-FILE + STOP RUN. + + OPEN-FILE. + OPEN I-O EMP-FILE. + + PROCESS-RECORDS. + MOVE LK-SEARCH-KEY TO EMP-ID + EXEC SQL + SELECT EMP_SALARY INTO :WS-SALARY + FROM EMPLOYEES + WHERE EMP_ID = :EMP-ID + END-EXEC + CALL "EMPREPORT". + + CLOSE-FILE. + CLOSE EMP-FILE. +``` + +The graph produced contains: + +**Nodes:** +- `Module`: EMPMAINT (description: `author:Development Team`) +- `Namespace`: MAIN-LOGIC +- `Function`: MAIN-START, OPEN-FILE, PROCESS-RECORDS, CLOSE-FILE +- `Record`: EMP-RECORD, WS-FLAGS, LK-SEARCH-KEY +- `Property`: EMP-ID, EMP-NAME, WS-FILE-STATUS, WS-EOF-FLAG +- `Const`: WS-EOF (values: Y) +- `CodeElement`: SELECT:EMP-FILE, FD:EMP-FILE, sql-table:EMPLOYEES +- (COPY imports, if any, would produce File IMPORTS edges) + +**Edges:** +- `DEFINES`: File -> all nodes +- `CONTAINS`: EMPMAINT -> EMP-RECORD, EMPMAINT -> WS-FLAGS, EMPMAINT -> LK-SEARCH-KEY +- `CONTAINS`: EMP-RECORD -> EMP-ID, EMP-RECORD -> EMP-NAME +- `CONTAINS`: WS-FLAGS -> WS-FILE-STATUS, WS-FLAGS -> WS-EOF-FLAG +- `CONTAINS`: WS-EOF-FLAG -> WS-EOF +- `CONTAINS`: FD:EMP-FILE -> EMP-RECORD +- `CONTAINS`: SELECT:EMP-FILE -> FD:EMP-FILE +- `CALLS`: MAIN-START -> OPEN-FILE, MAIN-START -> PROCESS-RECORDS, MAIN-START -> CLOSE-FILE +- `CALLS`: EMPMAINT -> EMPREPORT (external CALL) +- `ACCESSES`: EMPMAINT -> sql-table:EMPLOYEES +- `RECEIVES`: EMPMAINT -> LK-SEARCH-KEY (PROCEDURE USING) +- `RECORD_KEY_OF`: EMP-ID -> SELECT:EMP-FILE +- `FILE_STATUS_OF`: WS-FILE-STATUS -> SELECT:EMP-FILE + +## How COBOL Differs from Tree-Sitter Languages + +| Aspect | COBOL | Tree-Sitter Languages | +|--------|-------|----------------------| +| Node variety | 8 types (Module, Function, Namespace, Record, Property, Const, CodeElement, Constructor) | Typically 4-6 (Function, Class, Method, Interface, Module, Const) | +| Domain edges | RECORD_KEY_OF, FILE_STATUS_OF, ACCESSES, RECEIVES, CONTRACTS, REDEFINES | Primarily CALLS, IMPORTS, EXTENDS, IMPLEMENTS | +| Data hierarchy | Deep CONTAINS chains (01 -> 05 -> 10 -> 88) | Flat class members | +| Cross-program calls | CALL "name" + CICS LINK PROGRAM | Import-based resolution | +| Contract detection | Shared COPY copybook between caller/callee | Not applicable | +| Metadata | AUTHOR, DATE-WRITTEN on Module | JSDoc/docstring (not indexed) | + +## Source Files + +- `gitnexus/src/core/ingestion/workers/parse-worker.ts` -- `processCobolRegexOnly()`, node/edge emission logic +- `gitnexus/src/core/ingestion/pipeline.ts` -- `detectCrossProgamContracts()` for CONTRACTS edges +- `gitnexus/src/core/ingestion/cobol-preprocessor.ts` -- `CobolRegexResults` interface (all extracted data) diff --git a/docs/code-indexing/cobol/performance.md b/docs/code-indexing/cobol/performance.md new file mode 100644 index 0000000000..b0f69e7019 --- /dev/null +++ b/docs/code-indexing/cobol/performance.md @@ -0,0 +1,261 @@ +# COBOL Performance and Tuning + +This document covers real-world benchmarks, worker pool configuration, memory management, known limitations, and troubleshooting for COBOL indexing. + +## PROJECT-NAME Benchmark + +The PROJECT-NAME project is a large Italian payroll system written in COBOL. It serves as the primary benchmark for COBOL indexing performance. + +### Input + +| Metric | Value | +| --------------------------- | ---------------------------------------------------------------------------- | +| Paths scanned | 14,217 | +| Parseable files | 13,129 | +| Total source size | 224 MB | +| Chunks | 12 (at 20 MB budget) | +| Copybooks loaded | 2,976 | +| Copybooks used in expansion | 2,955 | +| Key directories | `s/` (7773 programs), `c/` (3036 copybooks), `wfproc/` (1973 workflow files) | + +### Output + +| Metric | Value | +| ---------------------- | ------ | +| Graph nodes | 2.79M | +| Graph edges | 5.67M | +| Clusters (communities) | 16,679 | +| Execution flows | 300 | + +### Timing + +| Phase | Duration | +| ------------------------------- | ----------------- | +| Total | ~251s | +| KuzuDB write | 132s | +| Full-text search indexing | 6.7s | +| Regex extraction (avg per file) | ~1ms | +| COPY expansion + deep indexing | Remainder (~112s) | + +### Indexing Command + +```bash +cd /path/to/PROJECT-NAME +GITNEXUS_COBOL_DIRS=s,c,wfproc GITNEXUS_VERBOSE=1 node --max-old-space-size=8192 \ + /path/to/gitnexus/dist/cli/index.js analyze --force +``` + +## Open-Source Benchmarks + +### CardDemo (AWS) + +| Metric | Value | +| ------ | ----- | +| Graph nodes | 12,323 | +| Graph edges | 8,893 | +| Total time | 7.4s | + +### ACAS + +| Metric | Value | +| ------ | ----- | +| Graph nodes | 14,016 | +| Graph edges | 15,452 | +| Total time | 9.3s | + +### Micro-Benchmark (Single-File Extraction) + +| Metric | Value | +| ------ | ----- | +| Per-iteration | 0.65ms | +| Throughput | ~382K lines/sec | + +## Worker Pool Tuning + +### Sub-Batch Size + +The worker pool splits each worker's chunk into sub-batches to bound peak memory per `postMessage` serialization. COBOL repos use a smaller sub-batch size than the default: + +| Parameter | Default | COBOL Mode | +| --------------------- | ----------- | ------------------- | +| Sub-batch size | 1,500 files | 200 files | +| Per sub-batch timeout | 120s | 120s (configurable) | + +**Why 200?** COBOL regex extraction + preprocessing takes ~1ms per file on average, but with COPY expansion and deep indexing the effective time is ~150ms per file. At sub-batch size 1500, that would be ~225s per sub-batch, exceeding the 120s timeout. + +COBOL mode is activated automatically when `GITNEXUS_COBOL_DIRS` is set: + +```typescript +// From pipeline.ts +const cobolSubBatch = process.env.GITNEXUS_COBOL_DIRS ? 200 : undefined; +workerPool = createWorkerPool(workerUrl, undefined, cobolSubBatch); +``` + +### Worker Count + +Workers default to `min(8, cpus - 1)`. For COBOL repos, this is usually sufficient since regex extraction is CPU-bound but fast. The bottleneck is typically KuzuDB write, not extraction. + +### Timeout Configuration + +| Environment Variable | Default | Purpose | +| ------------------------------------ | --------------- | --------------------------------------------------- | +| `GITNEXUS_WORKER_TIMEOUT_MS` | 120,000 (2 min) | Per sub-batch processing timeout | +| `GITNEXUS_WORKER_STARTUP_TIMEOUT_MS` | 60,000 (1 min) | Worker initialization timeout (tree-sitter loading) | + +For COBOL-only repos, worker startup is faster because tree-sitter native modules are loaded lazily (skipped entirely if only COBOL files are present). + +## Data Item Cap + +### Configuration + +```typescript +const MAX_DATA_ITEMS_PER_FILE = 500; +``` + +This constant appears in both `parse-worker.ts` (worker path) and `parsing-processor.ts` (sequential fallback). + +### Rationale + +Some COBOL programs, especially after COPY expansion, can have 10,000+ data items. At that scale: + +- The in-memory relationship Map (for CONTAINS, REDEFINES, etc.) approaches the V8 16.7M entry limit across thousands of files +- KuzuDB write time increases linearly with edge count +- Most deep-nested items (level 20+) are rarely queried individually + +### Impact + +The cap truncates data items beyond the 500th in source order. Since 01-level Records appear first in COBOL source, the cap preserves: + +- All 01-level record definitions +- The most important 02-49 level items (those closest to the record root) +- 88-level conditions associated with early items + +To increase the cap for specific needs, modify the `MAX_DATA_ITEMS_PER_FILE` constant in both files. + +## Memory Management + +### COPY Expansion Breadth Guard + +A per-file `MAX_TOTAL_EXPANSIONS = 500` limit prevents exponential blowup from diamond-shaped COPY graphs (e.g., N copybooks each containing N COPY statements). Once the limit is reached, further COPY statements in that file are left unexpanded. See [copy-expansion.md](copy-expansion.md) for details. + +### COPY Expansion Memory + +All copybook content is loaded upfront into a Map before chunk processing begins. For PROJECT-NAME: + +- 2,976 copybooks, typically under 100MB total +- The Map is shared (read-only) across chunk iterations +- Per-chunk, the copybook map is merged with chunk file content (in case a chunk contains copybooks not in the pre-loaded set) +- After all chunks are processed, the copybook map is freed (`cobolCopybookContents = undefined`) + +### Chunk Budget + +Source files are grouped into chunks of max 20MB (`CHUNK_BYTE_BUDGET`). Each chunk's lifecycle: + +1. Read file content into memory +2. Expand COPY statements (mutates content in-place) +3. Dispatch to workers for extraction +4. Workers return serialized results +5. Merge results into graph +6. Chunk content goes out of scope (GC reclaims) + +This ensures only ~20MB of source + ~200-400MB of working memory (ASTs, extracted records, serialization) is active at any time. + +### Shared Warning Deduplication + +The `warnedCircular` set (used by the COPY expansion engine) is shared across all files in a chunk. This prevents the same circular copybook warning (e.g., `ANAZI includes itself`) from being logged thousands of times. + +## Known Limitations + +| Limitation | Impact | Workaround | +| ---------------------------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- | +| tree-sitter-cobol hangs on ~5% of files | Cannot use tree-sitter for COBOL | Regex-only extraction (current approach) | +| Data item cap (500/file) | May miss deeply nested items in large programs | Increase `MAX_DATA_ITEMS_PER_FILE` in source | +| Circular copybooks (ANAZI, ANDIP, QDIPE) | Self-referential includes cannot be expanded | Detected and skipped with warning | +| wfproc/ files may not be pure COBOL | Workflow files may produce extraction noise | Exclude `wfproc` from `GITNEXUS_COBOL_DIRS` if problematic | +| No MOVE DATA_FLOW edges yet | Data flow between variables not in graph | Reserved for future release | +| Continuation line handling | Some complex multi-line continuations (especially in string literals spanning 3+ lines) may not merge correctly | Known edge case; affects <0.1% of lines | +| Single-line EXEC blocks | `EXEC SQL SELECT ... END-EXEC` on one line is handled, but pathological nesting is not | Extremely rare in practice | +| Extension case sensitivity | `.GNM` and `.gnm` are matched differently | Use the exact case from the codebase | + +## Troubleshooting + +### "COPY expansion failed" + +``` +[pipeline] COPY expansion failed for s/BGTABFL: Cannot read properties of null +``` + +**Cause:** A copybook referenced by a COPY statement cannot be found. + +**Fix:** + +1. Verify `GITNEXUS_COBOL_DIRS` includes the directory containing copybooks (typically `c`) +2. Check that copybook filenames match the COPY target (case-insensitive, after stripping extensions) +3. Ensure copybook files are not in `.gitignore` + +### Worker sub-batch timeout + +``` +Worker 3 sub-batch timed out after 120s (chunk: 200 items) +``` + +**Cause:** A sub-batch took longer than the timeout. Typically happens when one file is extremely large (50,000+ lines after COPY expansion). + +**Fix:** Increase the timeout: + +```bash +GITNEXUS_WORKER_TIMEOUT_MS=300000 gitnexus analyze +``` + +### Memory errors (heap out of memory) + +``` +FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory +``` + +**Fix:** Increase Node.js heap size: + +```bash +node --max-old-space-size=16384 /path/to/gitnexus/dist/cli/index.js analyze +``` + +For very large repos (>500MB source), consider `--max-old-space-size=32768`. + +### Concurrent analyze corruption + +**Rule:** Only ONE `gitnexus analyze` process should run at a time per repository. Concurrent writes to KuzuDB corrupt the database. + +If corruption occurs: + +```bash +# Remove the KuzuDB directory and re-index +rm -rf .gitnexus/kuzu +gitnexus analyze --force +``` + +### Slow KuzuDB write phase + +The KuzuDB write phase (132s for PROJECT-NAME) is the bottleneck for large COBOL repos. This is proportional to the number of nodes and edges being written. Reducing `MAX_DATA_ITEMS_PER_FILE` or excluding non-essential directories from `GITNEXUS_COBOL_DIRS` can help. + +### Verbose output + +Enable verbose logging to see per-phase timing and statistics: + +```bash +GITNEXUS_VERBOSE=1 gitnexus analyze +``` + +This outputs: + +- Scan statistics (paths, parseable files, chunk count) +- Worker pool configuration (worker count, sub-batch size) +- COPY expansion statistics (copybooks loaded, files expanded) +- Community and process detection results +- Contract detection results + +## Source Files + +- `gitnexus/src/core/ingestion/workers/worker-pool.ts` -- `DEFAULT_SUB_BATCH_SIZE`, `SUB_BATCH_TIMEOUT_MS`, `WORKER_STARTUP_TIMEOUT_MS` +- `gitnexus/src/core/ingestion/pipeline.ts` -- `CHUNK_BYTE_BUDGET`, COBOL sub-batch configuration, chunk lifecycle +- `gitnexus/src/core/ingestion/workers/parse-worker.ts` -- `MAX_DATA_ITEMS_PER_FILE`, `processCobolRegexOnly()` +- `gitnexus/src/core/ingestion/parsing-processor.ts` -- Sequential fallback `MAX_DATA_ITEMS_PER_FILE` diff --git a/docs/code-indexing/cobol/regex-extraction.md b/docs/code-indexing/cobol/regex-extraction.md new file mode 100644 index 0000000000..9f37c10c93 --- /dev/null +++ b/docs/code-indexing/cobol/regex-extraction.md @@ -0,0 +1,206 @@ +# COBOL Regex Extraction + +The `extractCobolSymbolsWithRegex()` function in `cobol-preprocessor.ts` performs single-pass, state-machine-driven extraction of all COBOL symbols. This document describes the state machine, line processing flow, and every regex pattern used. + +## State Machine: Division Tracking + +The extractor tracks which COBOL division is currently being processed. Division transitions are detected by the `RE_DIVISION` pattern. + +```mermaid +stateDiagram-v2 + [*] --> null : Start of file + null --> identification : IDENTIFICATION DIVISION + identification --> environment : ENVIRONMENT DIVISION + environment --> data : DATA DIVISION + data --> procedure : PROCEDURE DIVISION + + note right of identification + Extracts: PROGRAM-ID, AUTHOR, DATE-WRITTEN + end note + note right of environment + Extracts: SELECT ... ASSIGN ... (file declarations) + end note + note right of data + Extracts: FD entries, data items (01-77, 88), COPY + end note + note right of procedure + Extracts: paragraphs, sections, PERFORM, CALL, + ENTRY, MOVE, EXEC SQL/CICS + end note +``` + +## State Machine: Data Section Tracking + +Within the DATA DIVISION, a secondary state machine tracks the current section to tag data items with their origin. + +```mermaid +stateDiagram-v2 + [*] --> unknown : DATA DIVISION entered + unknown --> working_storage : WORKING-STORAGE SECTION + unknown --> linkage : LINKAGE SECTION + unknown --> file : FILE SECTION + unknown --> local_storage : LOCAL-STORAGE SECTION + working_storage --> linkage : LINKAGE SECTION + working_storage --> file : FILE SECTION + linkage --> working_storage : WORKING-STORAGE SECTION + file --> working_storage : WORKING-STORAGE SECTION + file --> linkage : LINKAGE SECTION + local_storage --> working_storage : WORKING-STORAGE SECTION +``` + +Within the ENVIRONMENT DIVISION, the `currentEnvSection` tracks whether we are in `INPUT-OUTPUT` or `CONFIGURATION` section. SELECT statement accumulation only occurs in `INPUT-OUTPUT`. + +## Line Processing Flow + +Each raw source line goes through this pipeline: + +``` +Raw line + | + v +Length < 7? ---------> Skip (flush pending if any) + | + v +Indicator col 7 + | + +-- '*' or '/' -----> Comment: skip entirely + | + +-- '-' ------------> Continuation: append to pending line + | + +-- other ----------> Normal: flush pending, strip inline comments (|), + buffer as new pending logical line +``` + +After all lines are processed, the final pending line is flushed, along with any accumulated SELECT statement, SORT/MERGE accumulator, and any open EXEC block (truncated file without `END-EXEC`). + +### Inline Comment Stripping + +Enterprise COBOL (particularly Italian dialect) uses the pipe character `|` as an inline comment marker. The `stripInlineComment()` helper is **quote-aware**: it tracks whether the scan position is inside a single- or double-quoted string and only treats `|` as a comment marker when outside quotes. Pipe characters inside string literals are preserved. + +Free-format `*>` inline comment stripping uses the same quote-aware approach: the scanner walks character by character, toggling quote state, and only recognizes `*>` as a comment marker when not inside a quoted string. + +### Patch Marker Handling + +The `preprocessCobolSource()` function (run before extraction in the worker) replaces non-standard content in columns 1-6. Standard COBOL expects spaces or digit sequence numbers in this area. If any letter or `#` character is found, the entire sequence area is replaced with 6 spaces: + +``` +Before: mzADD MOVE WK-AMT TO WK-TOTAL +After: MOVE WK-AMT TO WK-TOTAL +``` + +This preserves exact line count for position mapping. + +## Regex Pattern Reference + +All patterns are compiled once as module-level constants and reused across calls. + +### Division and Section Detection + +| Constant | Pattern | Purpose | Example Match | +|----------|---------|---------|---------------| +| `RE_DIVISION` | `\b(IDENTIFICATION\|ENVIRONMENT\|DATA\|PROCEDURE)\s+DIVISION\b` | Division boundary | `PROCEDURE DIVISION` | +| `RE_SECTION` | `\b(WORKING-STORAGE\|LINKAGE\|FILE\|LOCAL-STORAGE\|INPUT-OUTPUT\|CONFIGURATION)\s+SECTION\b` | Section boundary | `WORKING-STORAGE SECTION` | + +### IDENTIFICATION DIVISION + +| Constant | Pattern | Purpose | Example Match | +|----------|---------|---------|---------------| +| `RE_PROGRAM_ID` | `\bPROGRAM-ID\.\s*([A-Z][A-Z0-9-]*)` | Program name | `PROGRAM-ID. BGTABFL` | +| `RE_AUTHOR` | `^\s+AUTHOR\.\s*(.+)` | Author metadata | `AUTHOR. D. Smith` | +| `RE_DATE_WRITTEN` | `^\s+DATE-WRITTEN\.\s*(.+)` | Date metadata | `DATE-WRITTEN. 2024-01-15` | + +### ENVIRONMENT DIVISION + +| Constant | Pattern | Purpose | Example Match | +|----------|---------|---------|---------------| +| `RE_SELECT_START` | `\bSELECT\s+(?:OPTIONAL\s+)?([A-Z][A-Z0-9-]+)` | File SELECT start (with optional `SELECT OPTIONAL` support) | `SELECT MASTER-FILE`, `SELECT OPTIONAL TRANS-FILE` | + +SELECT statements are accumulated across multiple lines until a period terminator is found, then parsed for ASSIGN, ORGANIZATION, ACCESS, RECORD KEY, and FILE STATUS clauses. + +### DATA DIVISION + +| Constant | Pattern | Purpose | Example Match | +|----------|---------|---------|---------------| +| `RE_FD` | `^\s+FD\s+([A-Z][A-Z0-9-]+)` | File description | `FD MASTER-FILE` | +| `RE_DATA_ITEM` | `^\s+(\d{1,2})\s+([A-Z][A-Z0-9-]+)\s*(.*)` | Data item (01-77) | `05 WK-NAME PIC X(30)` | +| `RE_ANONYMOUS_REDEFINES` | `^\s+(\d{1,2})\s+REDEFINES\s+([A-Z][A-Z0-9-]+)` | Anonymous REDEFINES | `01 REDEFINES WK-REC` | +| `RE_88_LEVEL` | `^\s+88\s+([A-Z][A-Z0-9-]+)\s+VALUES?\s+(?:ARE\s+)?(.+)` | Condition name | `88 WK-ACTIVE VALUE "Y"` | + +The trailing clauses of `RE_DATA_ITEM` are parsed by `parseDataItemClauses()` for PIC, USAGE, OCCURS, and REDEFINES. + +### PROCEDURE DIVISION + +| Constant | Pattern | Purpose | Example Match | +|----------|---------|---------|---------------| +| `RE_PROC_SECTION` | `^ ([A-Z][A-Z0-9-]+)\s+SECTION\.\s*$` | Procedure section header | ` MAIN-LOGIC SECTION.` | +| `RE_PROC_PARAGRAPH` | `^ ([A-Z][A-Z0-9-]+)\.\s*$` | Paragraph header | ` PROCESS-RECORD.` | +| `RE_PERFORM` | `\bPERFORM\s+([A-Z][A-Z0-9-]+)(?:\s+THRU\s+([A-Z][A-Z0-9-]+))?` | PERFORM call | `PERFORM CALC-TAX THRU CALC-TAX-EXIT` | +| `RE_PROC_USING` | `\bPROCEDURE\s+DIVISION\s+USING\s+([\s\S]*?)(?:\.\|$)` | USING parameters | `PROCEDURE DIVISION USING WK-PARAM` | +| `RE_ENTRY` | `\bENTRY\s+"([^"]+)"(?:\s+USING\s+([\s\S]*?))?(?:\.\|$)` | ENTRY point | `ENTRY "SUBPROG" USING WK-DATA` | +| `RE_MOVE` | `\bMOVE\s+((?:CORRESPONDING\|CORR)\s+)?([A-Z][A-Z0-9-]+)\s+TO\s+(.+)` | MOVE statement (supports CORR abbreviation and multi-target) | `MOVE WK-NAME TO OUT-NAME`, `MOVE CORR WK-IN TO WK-OUT` | + +The USING parameter list (`RE_PROC_USING`) is split on `\bRETURNING\b` before tokenization -- any RETURNING clause and everything after it is excluded from the parameter list (`.split(/\bRETURNING\b/i)[0]`). + +Note: `RE_PROC_SECTION` and `RE_PROC_PARAGRAPH` require exactly 7 spaces of leading indentation (COBOL area A starting at column 8). This is the standard COBOL paragraph indentation. + +### All-Division Patterns + +These patterns are checked regardless of current division: + +| Constant | Pattern | Purpose | Example Match | +|----------|---------|---------|---------------| +| `RE_CALL` | `\bCALL\s+"([^"]+)"` | External program call | `CALL "BGTABUP"` | +| `RE_COPY_UNQUOTED` | `\bCOPY\s+([A-Z][A-Z0-9-]+)(?:\s\|\.)` | COPY (unquoted) | `COPY CPSESP.` | +| `RE_COPY_QUOTED` | `\bCOPY\s+"([^"]+)"(?:\s\|\.)` | COPY (quoted) | `COPY "WORKGRID.CPY".` | + +### SORT/MERGE Support + +| Constant | Purpose | +|----------|---------| +| `SORT_CLAUSE_NOISE` | Set of SORT/MERGE clause keywords filtered from USING/GIVING file lists: `ON`, `ASCENDING`, `DESCENDING`, `KEY`, `WITH`, `DUPLICATES`, `IN`, `ORDER`, `COLLATING`, `SEQUENCE`, `IS`, `THROUGH`, `THRU`, `INPUT`, `OUTPUT`, `PROCEDURE` | + +SORT and MERGE statements are accumulated across multiple lines (like SELECT) until a period terminator is found, then parsed for USING/GIVING file lists and INPUT/OUTPUT PROCEDURE targets. The `flushSort()` helper encapsulates the flush-and-parse logic, mirroring the existing `flushSelect()` pattern. Both helpers are called at EOF to handle truncated files. + +### GO TO Multi-Target + +`RE_GOTO` captures all paragraph names in a `GO TO` statement, including the multi-target form `GO TO p1 p2 p3 DEPENDING ON x`. The captured group contains all target names (space-separated), which are split into individual targets. Each target produces a separate `gotos` entry. + +### PROGRAM-ID Detection + +PROGRAM-ID is detected regardless of the current division state. This handles sibling programs that appear after `END PROGRAM` and omit the `IDENTIFICATION DIVISION` header -- the extractor will still capture the PROGRAM-ID and push a new program boundary. + +### EXEC Block Patterns + +| Constant | Pattern | Purpose | Example Match | +|----------|---------|---------|---------------| +| `RE_EXEC_SQL_START` | `\bEXEC\s+SQL\b` | Start of EXEC SQL block | `EXEC SQL` | +| `RE_EXEC_CICS_START` | `\bEXEC\s+CICS\b` | Start of EXEC CICS block | `EXEC CICS` | +| `RE_END_EXEC` | `\bEND-EXEC\b` | End of EXEC block | `END-EXEC` | + +EXEC blocks accumulate all lines between `EXEC SQL/CICS` and `END-EXEC`, then delegate to `parseExecSqlBlock()` or `parseExecCicsBlock()` for detailed extraction. + +## Excluded Paragraph Names + +The following names are excluded from paragraph detection to avoid false positives from division/section headers: + +``` +DECLARATIVES, END, PROCEDURE, IDENTIFICATION, +ENVIRONMENT, DATA, WORKING-STORAGE, LINKAGE, +FILE, LOCAL-STORAGE, COMMUNICATION, REPORT, +SCREEN, INPUT-OUTPUT, CONFIGURATION +``` + +Additionally, paragraph candidates containing `DIVISION` or `SECTION` as substrings are excluded. + +## MOVE Skip List (Figurative Constants) + +MOVE statements where the source is a figurative constant are skipped: + +``` +SPACES, ZEROS, ZEROES, LOW-VALUES, LOW-VALUE, +HIGH-VALUES, HIGH-VALUE, QUOTES, QUOTE, ALL +``` + +## Source Files + +- `gitnexus/src/core/ingestion/cobol-preprocessor.ts` -- `preprocessCobolSource()`, `extractCobolSymbolsWithRegex()`, all regex constants diff --git a/docs/plans/2026-03-26-feat-cobol-full-language-coverage-plan.md b/docs/plans/2026-03-26-feat-cobol-full-language-coverage-plan.md new file mode 100644 index 0000000000..b1a2e880ca --- /dev/null +++ b/docs/plans/2026-03-26-feat-cobol-full-language-coverage-plan.md @@ -0,0 +1,326 @@ +--- +title: "feat: Complete COBOL language feature coverage for maximum knowledge graph value" +type: feat +status: active +date: 2026-03-26 +origin: Feature audit from v3-integration-architect agent (session 8642401e) +--- + +## Enhancement Summary + +**Deepened on:** 2026-03-26 +**Research agents used:** COBOL expert (Phase 1+2), graph value analyst, codebase explorer +**Sections enhanced:** Phase 1 (5 features), Phase 2 (4 features), graph value ranking + +### Key Improvements from Research +1. **CALL USING** is the #1 highest-value edge type (9.2/10) — fixes ~40% of missing caller references +2. **EXEC DLI** requires dual-interface support (EXEC DLI + CBLTDLI CALL) for full IMS coverage +3. **DECLARATIVES** is lowest-risk Phase 2 item — existing section/paragraph detection already captures structure +4. **SET TO TRUE** accounts for 80-90% of all SET statements — prioritize this form +5. **INSPECT** needs multi-line accumulator (like SORT) — can span 5+ continuation lines +6. **Graph value ranking**: cobol-call-using (9.2) > cobol-error-handler (9.0) > dli-gu (8.2) > cobol-string (6.2) + +### New Edge Cases Discovered +- CALL USING supports mixed modes: `USING BY REFERENCE WS-A BY CONTENT WS-B BY VALUE WS-C` +- CALL USING `ADDRESS OF` and `OMITTED` must be filtered from parameter lists +- EXEC DLI can have multiple SEGMENT levels in hierarchical retrieval (use matchAll) +- DECLARATIVES can have multiple USE sections (one per file + catch-all for INPUT/OUTPUT/I-O/EXTEND) +- INSPECT TALLYING can have multiple counters in a single statement +- STRING/UNSTRING can span multiple lines (need accumulator pattern) + +--- + +# Complete COBOL Language Feature Coverage + +## Overview + +Implement the remaining 25 unhandled COBOL language features and fix 10 partial features to achieve ~95% coverage (up from 71.9%). The goal is to build the richest possible knowledge graph from COBOL codebases, enabling a future `modernize` MCP command (out of scope for this plan) that would use the graph to assist with COBOL-to-modern-language migration. + +## Problem Statement + +The COBOL processor currently handles 54 of 89 applicable language features (71.9%). The 25 unhandled features represent real data loss in the knowledge graph: +- **Cross-program data flow** is invisible (CALL ... USING parameters not extracted) +- **IMS/DB programs** produce empty graphs (EXEC DLI not recognized) +- **String transformation logic** is invisible (STRING/UNSTRING/INSPECT not tracked) +- **SQL copybook dependencies** are missing (EXEC SQL INCLUDE not mapped) +- **Error handling flows** are lost (DECLARATIVES/USE AFTER not captured) + +## Proposed Solution + +Implement features in 4 phases, ordered by graph value density (edges created per LOC of implementation). Each phase is independently shippable and testable. + +## Technical Approach + +### Phase 1: High-Value Data Flow Edges (~150 LOC, ~8 new edge types) + +The highest-ROI features: they create new ACCESSES and IMPORTS edges that directly improve impact analysis. + +**Critical research finding**: Multi-line statement accumulation is the dominant challenge. CALL USING, STRING/UNSTRING, and multi-line data item clauses all span multiple lines in production COBOL. The free-format path processes each line independently — these features need statement accumulators (like SORT/SELECT) or the free-format path needs multi-line awareness. Estimated LOC increased from 110 to 150 to account for accumulator infrastructure. + +#### 1.1 EXEC SQL INCLUDE -> IMPORTS edges +- **File:** `cobol-preprocessor.ts` (parseExecSqlBlock) +- **What:** Detect `INCLUDE` as the operation, extract member name, emit as a `copies[]` entry +- **Graph:** IMPORTS edge from File to included copybook/SQLCA with reason `sql-include` +- **Tests:** Unit test for `EXEC SQL INCLUDE SQLCA END-EXEC` and `EXEC SQL INCLUDE CUSTCOPY END-EXEC` + +**Research insights (EXEC SQL INCLUDE):** +- DB2 member names can contain underscores: `EXEC SQL INCLUDE CUST_TBL_DCL END-EXEC` — regex must use `[A-Z][A-Z0-9_-]+` +- Quoted literal form: `EXEC SQL INCLUDE 'DBRMLIB.MEMBER' END-EXEC` (z/OS PDS qualified name) +- SQLCA/SQLDA are DB2 builtins — won't resolve to repo files. Emit unresolved IMPORTS edge (still valuable) +- No REPLACING support on EXEC SQL INCLUDE (unlike COPY) +- Add `INCLUDE` to `OP_MAP` in `parseExecSqlBlock`; extract member via `RE_SQL_INCLUDE = /^INCLUDE\s+(?:'([^']+)'|"([^"]+)"|([A-Z][A-Z0-9_-]+))/i` + +#### 1.2 CALL ... USING parameter extraction -> ACCESSES edges (Graph value: 9.2/10) +- **File:** `cobol-preprocessor.ts` (processLogicalLine CALL section) +- **What:** After capturing CALL target, scan for USING clause. Extract parameter names (reuse USING_KEYWORDS filter). Store as `calls[].parameters: string[]` +- **Interface:** Add `parameters?: string[]` to calls array type in CobolRegexResults +- **File:** `cobol-processor.ts` (CALL edge block) +- **Graph:** For each USING parameter, create ACCESSES edge from caller to data item Property node with reason `cobol-call-using` +- **Tests:** `CALL 'AUDITLOG' USING CUST-ID WS-AMOUNT` -> 2 ACCESSES edges + +**Research insights (CALL USING forms):** +- Mixed modes: `CALL 'PGM' USING BY REFERENCE WS-A BY CONTENT WS-B BY VALUE WS-C` +- Pointer passing: `CALL 'PGM' USING ADDRESS OF WS-A` +- Placeholder: `CALL 'PGM' USING OMITTED WS-B` +- Filter keywords: add `ADDRESS`, `OMITTED`, `LENGTH` to USING_KEYWORDS (already has BY/VALUE/REFERENCE/CONTENT) +- **Impact tool enhancement:** CALL-USING edges enable BFS traversal through parameter data flow — single most impactful edge type for COBOL impact analysis + +#### 1.3 STRING/UNSTRING data flow -> ACCESSES edges +- **File:** `cobol-preprocessor.ts` (new section in extractProcedure) +- **What:** Accumulate multi-line STRING/UNSTRING until period or END-STRING/END-UNSTRING. Extract sources and INTO targets. +- **Interface:** Add `strings: Array<{ sources: string[]; target: string; type: 'string' | 'unstring'; line: number; caller: string | null }>` to CobolRegexResults +- **Graph:** read-ACCESSES on sources, write-ACCESSES on INTO target with reason `cobol-string-read` / `cobol-string-write` +- **Tests:** 2 unit tests + integration test assertions + +**Research insights (STRING/UNSTRING):** +- **Needs statement accumulator** — STRING/UNSTRING always span multiple lines in production +- Terminate accumulation at: period, END-STRING/END-UNSTRING, or start of next COBOL verb +- STRING sources: identifiers before each `DELIMITED BY`. Filter: STRING, DELIMITED, BY, SIZE, ALL, INTO, WITH, POINTER, ON, OVERFLOW, NOT, END-STRING +- UNSTRING: source is first identifier after UNSTRING; INTO targets are identifiers after INTO. Filter: DELIMITER, IN, COUNT, TALLYING, OR +- WITH POINTER field is both read AND written (starting position updated) +- TALLYING IN / COUNT IN fields are write targets +- Literal sources (`'text'`) must be filtered — quote-aware tokenization needed +- **Edge case**: STRING terminated by next verb, not period — existing fixture has `STRING ... DISPLAY` without period between them + +#### 1.4 OCCURS DEPENDING ON -> ACCESSES edge +- **File:** `cobol-preprocessor.ts` (parseDataItemClauses) +- **What:** Extend OCCURS regex to capture DEPENDING ON field, KEY fields, and INDEXED BY names +- **Interface:** Add `dependingOn?: string`, `occursMax?: number`, `occursKeys?: Array<{direction: string; fields: string[]}>`, `indexedBy?: string[]` to data items +- **Graph:** ACCESSES edge from table item to controlling field with reason `cobol-depends-on` +- **Tests:** `05 WS-TABLE OCCURS 100 DEPENDING ON WS-COUNT` -> edge + +**Research insights (OCCURS):** +- IBM allows `OCCURS 0 TO n DEPENDING ON` (zero minimum) and `OCCURS UNBOUNDED DEPENDING ON` (V6.4) +- Subscripted controlling fields: `DEPENDING ON WS-COUNT(WS-IDX)` — strip subscripts before storing +- **Pre-existing gap**: Multi-line data item clauses without continuation indicator are NOT captured. `05 WS-TABLE\n OCCURS 100\n DEPENDING ON WS-COUNT.` — the current RE_DATA_ITEM only gets the first line, `rest` is empty. Fixing properly requires a data item accumulator (like SELECT). **Defer full fix to Phase 3; implement same-line capture now.** +- KEY IS fields: `ASCENDING KEY IS WS-KEY-1 WS-KEY-2` — capture for SEARCH ALL resolution +- INDEXED BY: `INDEXED BY IDX-1 IDX-2` — capture for SET/SEARCH context + +#### 1.5 VALUE clause for standard data items +- **File:** `cobol-preprocessor.ts` (parseDataItemClauses) +- **What:** Extract VALUE using a pragmatic function that handles quoted strings, numerics, figurative constants, hex/national literals +- **Interface:** Already exists as `values?: string[]` on data items (currently only populated for 88-level) +- **Graph:** Stored in Property node description (no new edges) +- **Tests:** `01 WS-STATUS PIC X VALUE 'A'` -> values: ['A'] + +**Research insights (VALUE forms):** +- Hex literals: `VALUE X'F1F2F3F4'`, National: `VALUE N'text'`, DBCS: `VALUE G'text'` +- Figurative constants: SPACES, ZEROS, ZEROES, LOW-VALUES, HIGH-VALUES, QUOTES, NULL, NULLS +- ALL literal: `VALUE ALL '*'` +- Numeric with sign/decimal: `VALUE -123.45`, `VALUE +1` +- `VALUE IS` optional — both `VALUE 'A'` and `VALUE IS 'A'` valid +- **Decimal vs period ambiguity**: `VALUE 100.` — is `.` decimal or terminator? `parseDataItemClauses` already strips trailing period, so this is handled +- IBM V6.4: floating-point `VALUE 1.0E5` — extend numeric regex if needed +- Implementation: use a pragmatic `extractValue(rest)` function, not a single complex regex + +### Phase 2: EXEC DLI + DECLARATIVES (~90 LOC, ~4 new edge types) + +IMS/DB support and error handling flows. + +#### 2.1 EXEC DLI (IMS/DB) -> ACCESSES edges (Graph value: 8.2/10) +- **File:** `cobol-preprocessor.ts` (processLogicalLine — add RE_EXEC_DLI_START check alongside SQL/CICS) +- **What:** Accumulate EXEC DLI blocks like EXEC SQL. Parse DLI verbs (GU, GN, GNP, GHU, GHN, GHNP, ISRT, DLET, REPL, CHKP, SCHD, TERM). Extract segment name, PCB number, INTO/FROM areas, WHERE fields, PSB name. +- **Interface:** Add `execDliBlocks: Array<{ line: number; verb: string; pcbNumber?: number; segmentName?: string; intoField?: string; fromField?: string; whereField?: string; psbName?: string }>` to CobolRegexResults +- **Graph:** CodeElement node + ACCESSES edge to `:` Record node with reason `dli-{verb}`; ACCESSES edges to INTO/FROM data areas; PSB ACCESSES for SCHD +- **Tests:** `EXEC DLI GU USING PCB(1) SEGMENT(CUSTOMER) INTO(WS-CUST) END-EXEC` + +**Research insights (dual IMS interface):** +- **EXEC DLI**: Embedded command interface for CICS-DL/I programs only +- **CBLTDLI CALL**: Batch interface via `CALL 'CBLTDLI' USING function-code PCB io-area SSA1..SSA15` +- CBLTDLI is already captured as a CALL to 'CBLTDLI' — enrich with USING parameter semantics later +- Multiple SEGMENT levels in hierarchical retrieval — use `matchAll` on segment regex +- DLI verbs: GU (most common), GN, GNP, GHU, GHN, GHNP, ISRT, REPL, DLET, CHKP, SCHD, TERM, ROLL, ROLB +- **Edge case**: DLET/REPL have no SEGMENT clause (operate on current position) +- **Recommended order**: Implement AFTER DECLARATIVES and SET (lower risk, higher frequency) + +#### 2.2 DECLARATIVES / USE AFTER STANDARD EXCEPTION (Graph value: 9.0/10) +- **File:** `cobol-preprocessor.ts` (processLogicalLine — detect DECLARATIVES keyword, track USE AFTER blocks) +- **What:** When `DECLARATIVES.` is encountered, switch to declaratives mode. Extract USE statements binding sections to files/modes. +- **Interface:** Add `declaratives: Array<{ sectionName: string; useType: 'error' | 'debug' | 'label' | 'reporting'; target: string; line: number }>` to CobolRegexResults +- **Graph:** ACCESSES edge from declarative Namespace to file Record with reason `cobol-declarative-error-handler` +- **Tests:** Unit test with DECLARATIVES section, integration test for error flow + +**Research insights (DECLARATIVES syntax):** +- `USE AFTER STANDARD {EXCEPTION|ERROR} ON {file-name|INPUT|OUTPUT|I-O|EXTEND}` +- EXCEPTION and ERROR are synonymous; STANDARD is optional in IBM dialects +- Multiple USE sections allowed (one per file + catch-all for I/O modes) +- `END DECLARATIVES.` must NOT reset PROCEDURE DIVISION state +- `DECLARATIVES` is already in EXCLUDED_PARA_NAMES — no false paragraph risk +- Existing section/paragraph detection already captures structural elements — just need USE binding +- **Lowest risk Phase 2 item** — implement first + +#### 2.3 SET statement -> ACCESSES edges +- **File:** `cobol-preprocessor.ts` (extractProcedure — new RE_SET regex) +- **Interface:** Add `sets: Array<{ targets: string[]; form: 'to-true'|'to-value'|'up-by'|'down-by'|'address-of'|'to-null'|'to-entry'; value?: string; entryTarget?: string; entryIsLiteral?: boolean; line: number; caller: string | null }>` to CobolRegexResults +- **Graph:** ACCESSES write edge with reason `cobol-set-condition` (TO TRUE), `cobol-set-index` (TO/UP/DOWN), `cobol-set-address` (ADDRESS OF). SET ENTRY with literal -> CALLS edge. +- **Tests:** `SET WS-EOF TO TRUE`, `SET IDX-1 TO 5`, `SET IDX-1 UP BY 1` + +**Research insights (SET forms by frequency):** +- `SET condition TO TRUE` — 80-90% of all SET usage. Multiple targets: `SET COND-A COND-B TO TRUE` +- `SET index TO/UP BY/DOWN BY` — ~8%. Multiple indices: `SET IDX-1 IDX-2 UP BY 1` +- `SET pointer TO ADDRESS OF data-item` / `SET ADDRESS OF data-item TO pointer` — ~2% +- `SET proc-ptr TO ENTRY "PROGNAME"` — rare but creates CALLS edge (like dynamic CALL) +- Filter OF/IN qualifiers: `SET COND-A OF WS-RECORD TO TRUE` (strip OF WS-RECORD) +- **Prioritize**: SET TO TRUE alone covers 80-90% — implement this form first + +#### 2.4 INSPECT -> ACCESSES edges +- **File:** `cobol-preprocessor.ts` (extractProcedure — new `inspectAccum` accumulator like SORT) +- **What:** Accumulate multi-line INSPECT until period. Extract inspected field + tally counters. +- **Interface:** Add `inspects: Array<{ inspectedField: string; counters: string[]; form: 'tallying'|'replacing'|'converting'|'tallying-replacing'; line: number; caller: string | null }>` to CobolRegexResults +- **Graph:** ACCESSES read on inspected field always; write if REPLACING/CONVERTING. Write edges for tally counters. Reason: `cobol-inspect-read`/`cobol-inspect-write`/`cobol-inspect-tally` +- **Tests:** `INSPECT WS-FIELD TALLYING WS-COUNT FOR ALL 'A'` -> read on WS-FIELD, write on WS-COUNT + +**Research insights (INSPECT forms by frequency):** +- REPLACING (~60%): `INSPECT WS-STR REPLACING ALL 'A' BY 'B'` +- TALLYING (~25%): `INSPECT WS-STR TALLYING WS-CNT FOR ALL 'A'` — multiple counters possible +- CONVERTING (~10%): `INSPECT WS-STR CONVERTING 'abc' TO 'ABC'` +- Combined (~5%): TALLYING + REPLACING in single statement +- **Needs multi-line accumulator** — INSPECT frequently spans 3-5 lines in production +- Extract tally counters with `([A-Z][A-Z0-9-]+)\s+FOR\b` matchAll pattern +- Filter figurative constants (SPACES, ZEROS) using existing MOVE_SKIP set + +### Phase 3: Completeness Fixes (~60 LOC) + +Fix the 10 partial features and small gaps. + +#### 3.1 CALL ... RETURNING extraction +- Extend RE_CALL processing to capture RETURNING target after the USING clause +- Store as `calls[].returning?: string` +- Graph: ACCESSES write edge with reason `cobol-call-returning` + +#### 3.2 SELECT OPTIONAL flag preservation +- Store `isOptional: boolean` in FileDeclaration interface +- Include in Record node description + +#### 3.3 ALTERNATE RECORD KEY extraction +- Add regex in parseSelectStatement: `/\bALTERNATE\s+RECORD\s+KEY\s+(?:IS\s+)?([A-Z][A-Z0-9-]+)/i` +- Store as `alternateKeys?: string[]` + +#### 3.4 COMMON attribute on nested programs +- Extend RE_PROGRAM_ID: `/\bPROGRAM-ID\.\s*([A-Z][A-Z0-9-]+)(?:\s+IS\s+COMMON)?/i` +- Store `isCommon: boolean` on Module node +- Affects cross-program CALL resolution scope + +#### 3.5 IS EXTERNAL / IS GLOBAL as first-class properties +- Change from usage string hack to proper boolean fields on data items +- Add `isExternal?: boolean`, `isGlobal?: boolean` to data item interface + +#### 3.6 AUTHOR / DATE-WRITTEN mapped to Module node +- Already extracted as programMetadata — map to Module node properties +- `graph.addNode({ ..., properties: { ..., author, dateWritten } })` + +#### 3.7 REPLACE statement +- Track REPLACE / REPLACE OFF state in preprocessor +- Apply text substitutions during preprocessing (before regex extraction) +- Complex: requires careful scoping rules + +### Phase 4: Niche Features (~30 LOC) + +Low-priority but nice for completeness. + +#### 4.1 INITIALIZE statement -> write ACCESSES +- `/\bINITIALIZE\s+([A-Z][A-Z0-9-]+)/i` +- ACCESSES write edge with reason `cobol-initialize` + +#### 4.2 Remaining IDENTIFICATION DIVISION paragraphs +- DATE-COMPILED, INSTALLATION, SECURITY, REMARKS +- Map to Module node description properties + +#### 4.3 EXEC SQL INCLUDE -> IMPORTS edge (expansion) +- For EXEC SQL INCLUDE inside EXEC blocks that reference copybooks containing SQL +- Create IMPORTS edge similar to COPY + +## Acceptance Criteria + +### Functional Requirements + +- [ ] Phase 1: All 5 features implemented with unit + integration tests +- [ ] Phase 2: All 4 features implemented with unit + integration tests +- [ ] Phase 3: All 7 partial features fixed +- [ ] Phase 4: At least 2 of 3 niche features implemented +- [ ] All existing 145 tests continue to pass +- [ ] TypeScript compiles cleanly + +### Non-Functional Requirements + +- [ ] No performance regression: CardDemo benchmark stays under 8s +- [ ] No file exceeds 1500 LOC (preprocessor currently 1326) +- [ ] ACAS benchmark shows increased node/edge counts (more data extracted) +- [ ] CardDemo benchmark shows increased edge counts (CALL USING, STRING, etc.) + +### Quality Gates + +- [ ] Each phase has its own commit +- [ ] Integration test assertions updated with exact counts per phase +- [ ] Benchmark run after each phase to track graph growth + +## Dependencies & Risks + +### Dependencies +- None. All changes are additive to existing COBOL processor code. +- No LanguageProvider changes needed. +- No graph schema changes needed (all new constructs map to existing node labels + edge types). + +### Risks +- **preprocessor.ts size**: Currently 1326 LOC. Phase 1+2 adds ~200 LOC -> 1526 LOC. May need to extract helpers into a separate `cobol-data-flow.ts` module if it exceeds 1500. +- **REPLACE statement** (Phase 3.7) is the most complex feature — requires tracking text substitution state across logical lines. Consider deferring to a separate PR if it takes >100 LOC. +- **EXEC DLI** (Phase 2.1) is only testable against IMS codebases. Need fixture data or synthetic test cases. + +## Graph Value Ranking by MCP Tool Impact + +Research agent analyzed all 5 MCP tools (query, context, impact, detect_changes, rename) against planned edge types: + +| Edge Type | QUERY | CONTEXT | IMPACT | DETECT | RENAME | **Overall** | +|-----------|-------|---------|--------|--------|--------|-------------| +| `cobol-call-using` | 4/5 | 5/5 | 5/5 | 4/5 | 4/5 | **9.2/10** | +| `cobol-error-handler` | 5/5 | 4/5 | 5/5 | 5/5 | 2/5 | **9.0/10** | +| `dli-*` (IMS verbs) | 4/5 | 4/5 | 5/5 | 4/5 | 2/5 | **8.2/10** | +| `cobol-string-*` | 4/5 | 3/5 | 3/5 | 3/5 | 2/5 | **6.2/10** | + +**Key finding**: `cobol-call-using` alone would fix ~40% of missing caller references in COBOL graphs. + +## Future Considerations + +This plan provides the graph data foundation for a future `modernize` MCP command (out of scope) that would: +- Use CALL USING edges to map data contracts between programs +- Use STRING/UNSTRING edges to identify data transformation logic +- Use EXEC SQL/DLI edges to map database access patterns +- Use DECLARATIVES to understand error handling architecture +- Use the complete knowledge graph to generate migration plans + +**MCP tool enhancements needed** (after this plan ships): +- Add `cobol-call-using`, `cobol-error-handler`, `dli-*` to IMPACT tool's default `relationTypes` for COBOL repos +- Add confidence floors for new edge types in `IMPACT_RELATION_CONFIDENCE` +- Register new edge types in `VALID_RELATION_TYPES` set (`local-backend.ts:52`) + +## Sources & References + +### Internal References +- Feature audit: session 8642401e (COBOL expert agent, 123 features audited) +- Prior plans: `docs/plans/2026-03-25-feat-cobol-100-percent-feature-coverage-plan.md` +- Architecture: `docs/code-indexing/cobol/` (7 documentation files) + +### External References +- COBOL features reference: mainframestechhelp.com/tutorials/cobol/features.htm +- COBOL-85 standard: ISO/IEC 1989:1985 +- IBM Enterprise COBOL reference diff --git a/gitnexus/src/config/supported-languages.ts b/gitnexus/src/config/supported-languages.ts index a35c3d2b11..91a654e805 100644 --- a/gitnexus/src/config/supported-languages.ts +++ b/gitnexus/src/config/supported-languages.ts @@ -42,4 +42,6 @@ export enum SupportedLanguages { Kotlin = 'kotlin', Swift = 'swift', Dart = 'dart', + /** Standalone regex processor — no tree-sitter, no LanguageProvider. */ + Cobol = 'cobol', } diff --git a/gitnexus/src/core/graph/graph.ts b/gitnexus/src/core/graph/graph.ts index 4658131ccb..b0a641ec69 100644 --- a/gitnexus/src/core/graph/graph.ts +++ b/gitnexus/src/core/graph/graph.ts @@ -34,7 +34,15 @@ export const createKnowledgeGraph = (): KnowledgeGraph => { }; /** - * Remove all nodes (and their relationships) belonging to a file + * Remove a single relationship by id. + * Returns true if the relationship existed and was removed, false otherwise. + */ + const removeRelationship = (relationshipId: string): boolean => { + return relationshipMap.delete(relationshipId); + }; + + /** + * Remove all nodes (and their relationships) belonging to a file. */ const removeNodesByFile = (filePath: string): number => { let removed = 0; @@ -75,6 +83,7 @@ export const createKnowledgeGraph = (): KnowledgeGraph => { addRelationship, removeNode, removeNodesByFile, + removeRelationship, }; }; diff --git a/gitnexus/src/core/graph/types.ts b/gitnexus/src/core/graph/types.ts index 594a94c7e4..69cfc81fed 100644 --- a/gitnexus/src/core/graph/types.ts +++ b/gitnexus/src/core/graph/types.ts @@ -141,4 +141,5 @@ export interface KnowledgeGraph { addRelationship: (relationship: GraphRelationship) => void, removeNode: (nodeId: string) => boolean, removeNodesByFile: (filePath: string) => number, + removeRelationship: (relationshipId: string) => boolean, } diff --git a/gitnexus/src/core/ingestion/cobol-processor.ts b/gitnexus/src/core/ingestion/cobol-processor.ts new file mode 100644 index 0000000000..8e7983d813 --- /dev/null +++ b/gitnexus/src/core/ingestion/cobol-processor.ts @@ -0,0 +1,1308 @@ +/** + * COBOL Processor + * + * Standalone regex-based processor for COBOL and JCL files. + * Follows the markdown-processor.ts pattern: takes (graph, files, allPathSet), + * does its own extraction, and writes directly to the graph. + * + * Pipeline: + * 1. Separate programs from copybooks + * 2. Build copybook map (name -> content) + * 3. For each program: expand COPY statements, then run regex extraction + * 4. Map CobolRegexResults to graph nodes and relationships + * 5. Optionally process JCL files for job-step cross-references + */ + +import path from 'node:path'; +import { generateId } from '../../lib/utils.js'; +import { SupportedLanguages } from '../../config/supported-languages.js'; +import type { KnowledgeGraph } from '../graph/types.js'; +import { + preprocessCobolSource, + extractCobolSymbolsWithRegex, + type CobolRegexResults, +} from './cobol/cobol-preprocessor.js'; +import { expandCopies } from './cobol/cobol-copy-expander.js'; +import { processJclFiles } from './cobol/jcl-processor.js'; + +// --------------------------------------------------------------------------- +// File detection +// --------------------------------------------------------------------------- + +const COBOL_EXTENSIONS = new Set([ + '.cob', '.cbl', '.cobol', '.cpy', '.copybook', +]); + +const JCL_EXTENSIONS = new Set(['.jcl', '.job', '.proc']); + +const COPYBOOK_EXTENSIONS = new Set(['.cpy', '.copybook']); + +interface CobolFile { + path: string; + content: string; +} + +export interface CobolProcessResult { + programs: number; + paragraphs: number; + sections: number; + dataItems: number; + calls: number; + copies: number; + execSqlBlocks: number; + execCicsBlocks: number; + entryPoints: number; + moves: number; + fileDeclarations: number; + jclJobs: number; + jclSteps: number; + sqlIncludes: number; + execDliBlocks: number; + declaratives: number; + sets: number; + inspects: number; + initializes: number; +} + +/** Returns true if the file is a COBOL or copybook file. */ +export function isCobolFile(filePath: string): boolean { + return COBOL_EXTENSIONS.has(path.extname(filePath).toLowerCase()); +} + +/** Returns true if the file is a JCL file. */ +export function isJclFile(filePath: string): boolean { + return JCL_EXTENSIONS.has(path.extname(filePath).toLowerCase()); +} + +/** Returns true if the file is a COBOL copybook. */ +function isCopybook(filePath: string): boolean { + return COPYBOOK_EXTENSIONS.has(path.extname(filePath).toLowerCase()); +} + +// --------------------------------------------------------------------------- +// Main processor +// --------------------------------------------------------------------------- + +/** + * Process COBOL and JCL files into the knowledge graph. + * + * @param graph - The in-memory knowledge graph + * @param files - Array of { path, content } for COBOL/JCL files + * @param allPathSet - Set of all file paths in the repository + * @returns Summary of what was extracted + */ +export const processCobol = ( + graph: KnowledgeGraph, + files: CobolFile[], + allPathSet: Set, +): CobolProcessResult => { + const result: CobolProcessResult = { + programs: 0, + paragraphs: 0, + sections: 0, + dataItems: 0, + calls: 0, + copies: 0, + execSqlBlocks: 0, + execCicsBlocks: 0, + entryPoints: 0, + moves: 0, + fileDeclarations: 0, + jclJobs: 0, + jclSteps: 0, + sqlIncludes: 0, + execDliBlocks: 0, + declaratives: 0, + sets: 0, + inspects: 0, + initializes: 0, + }; + + // ── 1. Separate programs, copybooks, and JCL ─────────────────────── + const programs: CobolFile[] = []; + const copybooks: CobolFile[] = []; + const jclFiles: CobolFile[] = []; + + for (const file of files) { + const ext = path.extname(file.path).toLowerCase(); + if (JCL_EXTENSIONS.has(ext)) { + jclFiles.push(file); + } else if (isCopybook(file.path)) { + copybooks.push(file); + } else if (COBOL_EXTENSIONS.has(ext)) { + programs.push(file); + } + } + + // ── 2. Build copybook map (uppercase name -> content) ────────────── + const copybookMap = new Map(); + for (const cb of copybooks) { + const name = path.basename(cb.path, path.extname(cb.path)).toUpperCase(); + copybookMap.set(name, { content: cb.content, path: cb.path }); + } + + // Build reverse lookup: path -> content for O(1) readCopy + const copybookByPath = new Map(); + for (const [, entry] of copybookMap) { + copybookByPath.set(entry.path, entry.content); + } + + // Resolve and read callbacks for expandCopies + const resolveCopy = (name: string): string | null => { + const entry = copybookMap.get(name.toUpperCase()); + return entry ? entry.path : null; + }; + const readCopy = (copyPath: string): string | null => { + const content = copybookByPath.get(copyPath); + return content ? preprocessCobolSource(content) : null; + }; + + // Track module names for cross-program CALL resolution + const moduleNodeIds = new Map(); // uppercase program name -> node id + + // ── 3. Process each COBOL program ────────────────────────────────── + for (const file of programs) { + const fileNodeId = generateId('File', file.path); + // Skip if file node doesn't exist (structure-processor creates it) + if (!graph.getNode(fileNodeId)) continue; + + // Preprocess: clean patch markers + const cleaned = preprocessCobolSource(file.content); + + // Expand COPY statements + const { expandedContent, copyResolutions } = expandCopies( + cleaned, file.path, resolveCopy, readCopy, + ); + + // Extract symbols from expanded source + const extracted = extractCobolSymbolsWithRegex(expandedContent, file.path); + + // Map to graph + mapToGraph(graph, extracted, file, copyResolutions, moduleNodeIds); + + // Accumulate stats + result.programs += extracted.programs.length || (extracted.programName ? 1 : 0); + result.paragraphs += extracted.paragraphs.length; + result.sections += extracted.sections.length; + result.dataItems += extracted.dataItems.length; + result.calls += extracted.calls.length; + result.copies += extracted.copies.length; + result.execSqlBlocks += extracted.execSqlBlocks.length; + result.sqlIncludes += extracted.execSqlBlocks.filter(s => s.includeMember).length; + result.execCicsBlocks += extracted.execCicsBlocks.length; + result.entryPoints += extracted.entryPoints.length; + result.moves += extracted.moves.length; + result.fileDeclarations += extracted.fileDeclarations.length; + result.execDliBlocks += extracted.execDliBlocks.length; + result.declaratives += extracted.declaratives.length; + result.sets += extracted.sets.length; + result.inspects += extracted.inspects.length; + result.initializes += extracted.initializes.length; + } + + // ── 4. Second pass: resolve cross-program CALL targets ───────────── + // During mapToGraph, early programs create unresolved CALL edges + // (target = :PROGNAME) because later programs haven't + // been registered in moduleNodeIds yet. Now that ALL programs are + // processed, re-scan unresolved CALLS edges and patch them. + // This covers both `cobol-call-unresolved` and CICS LINK/XCTL edges + // whose targets contain `:`. + const unresolvedToRemove: string[] = []; + + graph.forEachRelationship(rel => { + if (rel.type !== 'CALLS') return; + const match = rel.targetId.match(/:(.+)/); + if (!match) return; + const resolvedId = moduleNodeIds.get(match[1]); + if (!resolvedId) return; + + if (rel.reason?.startsWith('cobol-call-unresolved') || rel.reason === 'cobol-cancel-unresolved') { + // Replace unresolved CALL/CANCEL with resolved edge + const resolvedReason = rel.reason === 'cobol-cancel-unresolved' ? 'cobol-cancel' : 'cobol-call'; + graph.addRelationship({ + id: rel.id + ':resolved', + type: 'CALLS', + sourceId: rel.sourceId, + targetId: resolvedId, + confidence: rel.reason === 'cobol-cancel-unresolved' ? 0.9 : 0.95, + reason: resolvedReason, + }); + } else if (rel.reason?.startsWith('cics-') && rel.reason.endsWith('-unresolved')) { + // Replace unresolved CICS LINK/XCTL/LOAD with resolved edge + graph.addRelationship({ + id: rel.id + ':resolved', + type: 'CALLS', + sourceId: rel.sourceId, + targetId: resolvedId, + confidence: 0.95, + reason: rel.reason.replace('-unresolved', ''), + }); + } + + // Mark original unresolved edge for removal after iteration + unresolvedToRemove.push(rel.id); + }); + + // Remove orphan unresolved edges (cannot delete during Map.forEach iteration) + for (const id of unresolvedToRemove) { + graph.removeRelationship(id); + } + + // ── 5. Process JCL files ─────────────────────────────────────────── + if (jclFiles.length > 0) { + const jclPaths = jclFiles.map(f => f.path); + const jclContents = new Map(); + for (const f of jclFiles) { + jclContents.set(f.path, f.content); + } + const jclResult = processJclFiles(graph, jclPaths, jclContents); + result.jclJobs += jclResult.jobCount; + result.jclSteps += jclResult.stepCount; + } + + return result; +}; + +// --------------------------------------------------------------------------- +// Graph mapping +// --------------------------------------------------------------------------- + +/** Generate a deterministic Property node ID using composite key (section:level:name). */ +function generatePropertyId( + filePath: string, + item: { section: string; level: number; name: string }, +): string { + return generateId('Property', `${filePath}:${item.section}:${item.level}:${item.name}`); +} + +/** + * Build a lookup Map from data item name (uppercase) to its Property node ID. + * First-wins semantics: if the same name appears in multiple sections, + * the first occurrence in extraction order is used for MOVE edge resolution. + */ +function buildDataItemMap( + dataItems: CobolRegexResults['dataItems'], + filePath: string, +): Map { + const map = new Map(); + for (const item of dataItems) { + if (item.name === 'FILLER') continue; + const key = item.name.toUpperCase(); + if (!map.has(key)) { + map.set(key, generatePropertyId(filePath, item)); + } + } + return map; +} + +function mapToGraph( + graph: KnowledgeGraph, + extracted: CobolRegexResults, + file: CobolFile, + copyResolutions: Array<{ copyTarget: string; resolvedPath: string | null; line: number }>, + moduleNodeIds: Map, +): void { + const { path: filePath, content } = file; + const lines = content.split(/\r?\n/); + const fileNodeId = generateId('File', filePath); + + // ── PROGRAM-ID -> Module node ──────────────────────────────────── + let moduleId: string | undefined; + if (extracted.programName) { + moduleId = generateId('Module', `${filePath}:${extracted.programName}`); + const metaDesc = [ + extracted.programMetadata.author && `author:${extracted.programMetadata.author}`, + extracted.programMetadata.dateWritten && `date:${extracted.programMetadata.dateWritten}`, + extracted.programMetadata.dateCompiled && `compiled:${extracted.programMetadata.dateCompiled}`, + extracted.programMetadata.installation && `install:${extracted.programMetadata.installation}`, + ].filter(Boolean).join(' '); + graph.addNode({ + id: moduleId, + label: 'Module', + properties: { + name: extracted.programName, + filePath, + startLine: 1, + endLine: lines.length, + language: SupportedLanguages.Cobol, + isExported: true, + description: metaDesc || undefined, + }, + }); + graph.addRelationship({ + id: generateId('CONTAINS', `${fileNodeId}->${moduleId}`), + type: 'CONTAINS', + sourceId: fileNodeId, + targetId: moduleId, + confidence: 1.0, + reason: 'cobol-program-id', + }); + moduleNodeIds.set(extracted.programName.toUpperCase(), moduleId); + } + + // ── Nested programs -> additional Module nodes ─────────────────── + // programs[] contains all PROGRAM-IDs with line ranges. The first entry + // is the primary (outer) program (already created above). Additional + // entries are nested programs that get their own Module nodes. + const programModuleIds = new Map(); + if (moduleId) { + programModuleIds.set(extracted.programName!.toUpperCase(), moduleId); + } + for (const prog of extracted.programs) { + if (prog.name.toUpperCase() === extracted.programName?.toUpperCase()) continue; // skip primary + const nestedModuleId = generateId('Module', `${filePath}:${prog.name}`); + graph.addNode({ + id: nestedModuleId, + label: 'Module', + properties: { + name: prog.name, + filePath, + startLine: prog.startLine, + endLine: prog.endLine, + language: SupportedLanguages.Cobol, + isExported: true, + description: `nested-program${prog.isCommon ? ' common' : ''}`, + }, + }); + // Find enclosing program by line-range containment + const enclosing = extracted.programs.find(p => + p.startLine < prog.startLine && p.endLine > prog.endLine && p.nestingDepth < prog.nestingDepth, + ); + const nestedParent = enclosing + ? (programModuleIds.get(enclosing.name.toUpperCase()) ?? moduleId ?? fileNodeId) + : (moduleId ?? fileNodeId); + graph.addRelationship({ + id: generateId('CONTAINS', `${nestedParent}->${nestedModuleId}`), + type: 'CONTAINS', + sourceId: nestedParent, + targetId: nestedModuleId, + confidence: 1.0, + reason: 'cobol-nested-program', + }); + moduleNodeIds.set(prog.name.toUpperCase(), nestedModuleId); + programModuleIds.set(prog.name.toUpperCase(), nestedModuleId); + } + + const parentId = moduleId ?? fileNodeId; + + // ── SECTIONs -> Namespace nodes ────────────────────────────────── + const sectionNodeIds = new Map(); + for (let i = 0; i < extracted.sections.length; i++) { + const sec = extracted.sections[i]; + const nextLine = i + 1 < extracted.sections.length + ? extracted.sections[i + 1].line - 1 + : lines.length; + const owningPgm = findOwningProgramName(sec.line, extracted.programs); + const secId = generateId('Namespace', `${filePath}:${owningPgm ? owningPgm + ':' : ''}${sec.name}`); + graph.addNode({ + id: secId, + label: 'Namespace', + properties: { + name: sec.name, + filePath, + startLine: sec.line, + endLine: nextLine, + language: SupportedLanguages.Cobol, + isExported: true, + }, + }); + const secParent = programModuleIds.get(owningPgm ?? '') ?? parentId; + graph.addRelationship({ + id: generateId('CONTAINS', `${secParent}->${secId}`), + type: 'CONTAINS', + sourceId: secParent, + targetId: secId, + confidence: 1.0, + reason: 'cobol-section', + }); + sectionNodeIds.set(`${owningPgm ?? ''}:${sec.name.toUpperCase()}`, secId); + } + + // ── PARAGRAPHs -> Function nodes ───────────────────────────────── + const paraNodeIds = new Map(); + for (let i = 0; i < extracted.paragraphs.length; i++) { + const para = extracted.paragraphs[i]; + const nextLine = i + 1 < extracted.paragraphs.length + ? extracted.paragraphs[i + 1].line - 1 + : lines.length; + const owningPgmPara = findOwningProgramName(para.line, extracted.programs); + const paraId = generateId('Function', `${filePath}:${owningPgmPara ? owningPgmPara + ':' : ''}${para.name}`); + graph.addNode({ + id: paraId, + label: 'Function', + properties: { + name: para.name, + filePath, + startLine: para.line, + endLine: nextLine, + language: SupportedLanguages.Cobol, + isExported: true, + }, + }); + // Parent: find the containing section, or fall back to module/file + const containerId = findContainingSection(para.line, extracted.sections, sectionNodeIds, extracted.programs) + ?? (programModuleIds.get(owningPgmPara ?? '') ?? parentId); + graph.addRelationship({ + id: generateId('CONTAINS', `${containerId}->${paraId}`), + type: 'CONTAINS', + sourceId: containerId, + targetId: paraId, + confidence: 1.0, + reason: 'cobol-paragraph', + }); + paraNodeIds.set(`${owningPgmPara ?? ''}:${para.name.toUpperCase()}`, paraId); + } + + // ── Data items -> Property nodes ───────────────────────────────── + for (const item of extracted.dataItems) { + if (item.name === 'FILLER') continue; // Skip anonymous fillers + const propId = generatePropertyId(filePath, item); + const itemOwner = findOwningProgramName(item.line, extracted.programs); + const itemParent = programModuleIds.get(itemOwner ?? '') ?? parentId; + graph.addNode({ + id: propId, + label: 'Property', + properties: { + name: item.name, + filePath, + startLine: item.line, + endLine: item.line, + language: SupportedLanguages.Cobol, + description: `level:${item.level} section:${item.section}${item.pic ? ` pic:${item.pic}` : ''}`, + }, + }); + graph.addRelationship({ + id: generateId('CONTAINS', `${itemParent}->${propId}`), + type: 'CONTAINS', + sourceId: itemParent, + targetId: propId, + confidence: 1.0, + reason: 'cobol-data-item', + }); + } + + // ── Build data item Map early (needed by CALL USING, CICS INTO/FROM, MOVE, and USING) ── + const dataItemMap = buildDataItemMap(extracted.dataItems, filePath); + + // ── OCCURS DEPENDING ON -> ACCESSES edges (variable-length table deps) ── + for (const item of extracted.dataItems) { + if (item.name === 'FILLER' || !item.dependingOn) continue; + const propId = generatePropertyId(filePath, item); + const depFieldId = dataItemMap.get(item.dependingOn.toUpperCase()); + if (depFieldId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${propId}->depends-on->${item.dependingOn}`), + type: 'ACCESSES', + sourceId: propId, + targetId: depFieldId, + confidence: 1.0, + reason: 'cobol-depends-on', + }); + } + } + + // Helper: look up paragraph/section by name scoped to the owning program + const scopedParaLookup = (name: string, lineNum: number): string | undefined => { + const pgm = findOwningProgramName(lineNum, extracted.programs); + return paraNodeIds.get(`${pgm ?? ''}:${name.toUpperCase()}`) + ?? sectionNodeIds.get(`${pgm ?? ''}:${name.toUpperCase()}`); + }; + const scopedCallerLookup = (name: string | null, lineNum: number): string => { + if (!name) return owningModuleId(lineNum); + const pgm = findOwningProgramName(lineNum, extracted.programs); + return paraNodeIds.get(`${pgm ?? ''}:${name.toUpperCase()}`) + ?? (programModuleIds.get(pgm ?? '') ?? parentId); + }; + /** Resolve the owning program's module ID for a given line (for nested program edge attribution). */ + const owningModuleId = (lineNum: number): string => { + const pgm = findOwningProgramName(lineNum, extracted.programs); + return programModuleIds.get(pgm ?? '') ?? parentId; + }; + + // ── PERFORM -> CALLS relationship (intra-file) ────────────────── + for (const perf of extracted.performs) { + const targetId = scopedParaLookup(perf.target, perf.line); + if (!targetId) continue; + + // Source: the paragraph containing the PERFORM, or the module + const sourceId = scopedCallerLookup(perf.caller, perf.line); + + graph.addRelationship({ + id: generateId('CALLS', `${sourceId}->perform->${targetId}:L${perf.line}`), + type: 'CALLS', + sourceId, + targetId, + confidence: 1.0, + reason: 'cobol-perform', + }); + + // PERFORM THRU -> expanded CALLS edge to thru target + if (perf.thruTarget) { + const thruTargetId = scopedParaLookup(perf.thruTarget, perf.line); + if (thruTargetId && thruTargetId !== targetId) { + graph.addRelationship({ + id: generateId('CALLS', `${sourceId}->perform-thru->${thruTargetId}:L${perf.line}`), + type: 'CALLS', + sourceId, + targetId: thruTargetId, + confidence: 1.0, + reason: 'cobol-perform-thru', + }); + } + } + } + + // ── CALL -> CALLS relationship (cross-program) ────────────────── + for (const call of extracted.calls) { + if (!call.isQuoted) { + // Dynamic CALL via data item — not statically resolvable. + // Emit a CodeElement annotation for visibility in impact analysis. + graph.addNode({ + id: generateId('CodeElement', `${filePath}:dynamic-call:${call.target}:L${call.line}`), + label: 'CodeElement', + properties: { + name: `CALL ${call.target}`, + filePath, + startLine: call.line, + endLine: call.line, + language: SupportedLanguages.Cobol, + description: 'dynamic-call (target is a data item, not resolvable statically)', + }, + }); + const dynCallOwner = owningModuleId(call.line); + graph.addRelationship({ + id: generateId('CONTAINS', `${dynCallOwner}->dynamic-call:${call.target}:L${call.line}`), + type: 'CONTAINS', + sourceId: dynCallOwner, + targetId: generateId('CodeElement', `${filePath}:dynamic-call:${call.target}:L${call.line}`), + confidence: 1.0, + reason: 'cobol-dynamic-call', + }); + + // CALL USING parameters for dynamic call too + if (call.parameters && call.parameters.length > 0) { + for (const param of call.parameters) { + const paramPropId = dataItemMap.get(param.toUpperCase()); + if (paramPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${dynCallOwner}->call-using->${param}:L${call.line}`), + type: 'ACCESSES', + sourceId: dynCallOwner, + targetId: paramPropId, + confidence: 0.9, + reason: 'cobol-call-using', + }); + } + } + } + // CALL RETURNING target for dynamic call too + if (call.returning) { + const retPropId = dataItemMap.get(call.returning.toUpperCase()); + if (retPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${dynCallOwner}->call-returning->${call.returning}:L${call.line}`), + type: 'ACCESSES', + sourceId: dynCallOwner, + targetId: retPropId, + confidence: 0.9, + reason: 'cobol-call-returning', + }); + } + } + continue; + } + + const targetModuleId = moduleNodeIds.get(call.target.toUpperCase()); + // Create edge even if target not yet known — use a synthetic target id + const targetId = targetModuleId + ?? generateId('Module', `:${call.target.toUpperCase()}`); + + const callOwner = owningModuleId(call.line); + graph.addRelationship({ + id: generateId('CALLS', `${callOwner}->call->${call.target}:L${call.line}`), + type: 'CALLS', + sourceId: callOwner, + targetId, + confidence: targetModuleId ? 0.95 : 0.5, + reason: targetModuleId ? 'cobol-call' : 'cobol-call-unresolved', + }); + + // CALL USING parameters -> ACCESSES edges (data flow across programs) + if (call.parameters && call.parameters.length > 0) { + for (const param of call.parameters) { + const paramPropId = dataItemMap.get(param.toUpperCase()); + if (paramPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callOwner}->call-using->${param}:L${call.line}`), + type: 'ACCESSES', + sourceId: callOwner, + targetId: paramPropId, + confidence: 0.9, + reason: 'cobol-call-using', + }); + } + } + } + // CALL RETURNING target -> ACCESSES edge (return value data flow) + if (call.returning) { + const retPropId = dataItemMap.get(call.returning.toUpperCase()); + if (retPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callOwner}->call-returning->${call.returning}:L${call.line}`), + type: 'ACCESSES', + sourceId: callOwner, + targetId: retPropId, + confidence: 0.9, + reason: 'cobol-call-returning', + }); + } + } + } + + // ── COPY -> IMPORTS relationship ───────────────────────────────── + for (const res of copyResolutions) { + if (!res.resolvedPath) continue; + const targetFileId = generateId('File', res.resolvedPath); + graph.addRelationship({ + id: generateId('IMPORTS', `${fileNodeId}->${targetFileId}:${res.copyTarget}`), + type: 'IMPORTS', + sourceId: fileNodeId, + targetId: targetFileId, + confidence: 1.0, + reason: 'cobol-copy', + }); + } + + // ── EXEC SQL blocks -> CodeElement nodes + ACCESSES edges ────── + for (const sql of extracted.execSqlBlocks) { + const sqlId = generateId('CodeElement', `${filePath}:exec-sql:L${sql.line}`); + graph.addNode({ + id: sqlId, + label: 'CodeElement', + properties: { + name: `EXEC SQL ${sql.operation}`, + filePath, + startLine: sql.line, + endLine: sql.line, + language: SupportedLanguages.Cobol, + description: `tables:[${sql.tables.join(',')}] cursors:[${sql.cursors.join(',')}]`, + }, + }); + const sqlOwner = owningModuleId(sql.line); + graph.addRelationship({ + id: generateId('CONTAINS', `${sqlOwner}->${sqlId}`), + type: 'CONTAINS', + sourceId: sqlOwner, + targetId: sqlId, + confidence: 1.0, + reason: 'cobol-exec-sql', + }); + // ACCESSES edges to tables + for (const table of sql.tables) { + const tableId = generateId('Record', `:${table}`); + graph.addRelationship({ + id: generateId('ACCESSES', `${sqlId}->${tableId}:${sql.operation}`), + type: 'ACCESSES', + sourceId: sqlId, + targetId: tableId, + confidence: 0.9, + reason: `sql-${sql.operation.toLowerCase()}`, + }); + } + + // EXEC SQL INCLUDE -> IMPORTS edge + if (sql.includeMember) { + // Try to resolve as a copybook + const includeTarget = sql.includeMember.toUpperCase(); + // We don't have copybookMap here, so emit directly as IMPORTS + // The edge uses reason 'sql-include' to distinguish from COPY + graph.addRelationship({ + id: generateId('IMPORTS', `${fileNodeId}->sql-include->${includeTarget}:L${sql.line}`), + type: 'IMPORTS', + sourceId: fileNodeId, + targetId: generateId('File', `:${includeTarget}`), + confidence: 0.8, + reason: 'sql-include', + }); + } + } + + // ── PROCEDURE DIVISION USING -> ACCESSES edges (parameter contract) ── + // Iterate per-program to handle nested programs with their own USING clauses + for (const prog of extracted.programs) { + const progModId = programModuleIds.get(prog.name.toUpperCase()) ?? moduleId; + if (progModId && prog.procedureUsing && prog.procedureUsing.length > 0) { + for (const param of prog.procedureUsing) { + const paramPropId = dataItemMap.get(param.toUpperCase()); + if (paramPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${progModId}->using->${param}`), + type: 'ACCESSES', + sourceId: progModId, + targetId: paramPropId, + confidence: 1.0, + reason: 'cobol-procedure-using', + }); + } + } + } + } + + // ── EXEC CICS blocks -> CodeElement nodes + CALLS edges ──────── + for (const cics of extracted.execCicsBlocks) { + const cicsId = generateId('CodeElement', `${filePath}:exec-cics:L${cics.line}`); + graph.addNode({ + id: cicsId, + label: 'CodeElement', + properties: { + name: `EXEC CICS ${cics.command}`, + filePath, + startLine: cics.line, + endLine: cics.line, + language: SupportedLanguages.Cobol, + description: [ + cics.mapName && `map:${cics.mapName}`, + cics.programName && `program:${cics.programName}${cics.programIsLiteral === false ? ' (dynamic)' : ''}`, + cics.transId && `transid:${cics.transId}`, + cics.fileName && `file:${cics.fileName}`, + cics.queueName && `queue:${cics.queueName}`, + cics.labelName && `label:${cics.labelName}`, + ].filter(Boolean).join(' ') || undefined, + }, + }); + const cicsOwner = owningModuleId(cics.line); + graph.addRelationship({ + id: generateId('CONTAINS', `${cicsOwner}->${cicsId}`), + type: 'CONTAINS', + sourceId: cicsOwner, + targetId: cicsId, + confidence: 1.0, + reason: 'cobol-exec-cics', + }); + // LINK/XCTL -> cross-program CALLS (handles both literal and variable PROGRAM) + if (cics.programName && ['LINK', 'XCTL', 'LOAD'].includes(cics.command)) { + if (cics.programIsLiteral === false) { + // Dynamic PROGRAM reference via variable — annotate, don't resolve + graph.addNode({ + id: generateId('CodeElement', `${filePath}:cics-dynamic-pgm:${cics.programName}:L${cics.line}`), + label: 'CodeElement', + properties: { + name: `CICS ${cics.command} ${cics.programName}`, + filePath, startLine: cics.line, endLine: cics.line, + language: SupportedLanguages.Cobol, + description: `cics-dynamic-program (target is data item ${cics.programName})`, + }, + }); + graph.addRelationship({ + id: generateId('CONTAINS', `${cicsOwner}->cics-dynamic-pgm:${cics.programName}:L${cics.line}`), + type: 'CONTAINS', sourceId: cicsOwner, + targetId: generateId('CodeElement', `${filePath}:cics-dynamic-pgm:${cics.programName}:L${cics.line}`), + confidence: 1.0, reason: 'cics-dynamic-program', + }); + } else { + const cicsTargetModuleId = moduleNodeIds.get(cics.programName.toUpperCase()); + const targetId = cicsTargetModuleId + ?? generateId('Module', `:${cics.programName.toUpperCase()}`); + const cicsReason = `cics-${cics.command.toLowerCase()}`; + graph.addRelationship({ + id: generateId('CALLS', `${cicsOwner}->cics-${cics.command.toLowerCase()}->${cics.programName}:L${cics.line}`), + type: 'CALLS', sourceId: cicsOwner, targetId, + confidence: cicsTargetModuleId ? 0.95 : 0.5, + reason: cicsTargetModuleId ? cicsReason : `${cicsReason}-unresolved`, + }); + } + } + + // CICS FILE I/O -> ACCESSES edges (READ/WRITE/REWRITE/DELETE/STARTBR/ENDBR FILE) + if (cics.fileName) { + const fileRecordId = generateId('Record', `:${cics.fileName.toUpperCase()}`); + const ioCommand = cics.command.toUpperCase(); + const isRead = ['READ', 'STARTBR', 'READNEXT', 'READPREV', 'READ NEXT', 'READ PREV', 'ENDBR'].includes(ioCommand); + const isWrite = ['WRITE', 'REWRITE', 'DELETE'].includes(ioCommand); + const reason = isRead ? 'cics-file-read' : isWrite ? 'cics-file-write' : 'cics-file-access'; + graph.addRelationship({ + id: generateId('ACCESSES', `${cicsId}->file->${cics.fileName}:L${cics.line}`), + type: 'ACCESSES', sourceId: cicsId, targetId: fileRecordId, + confidence: 0.9, reason, + }); + } + + // CICS QUEUE -> ACCESSES edge with differentiated reason (WRITEQ/READQ/DELETEQ TS/TD) + if (cics.queueName) { + const queueId = generateId('Record', `:${cics.queueName}`); + const qCmd = cics.command.toUpperCase(); + const qReason = qCmd.startsWith('READQ') ? 'cics-queue-read' + : qCmd.startsWith('WRITEQ') ? 'cics-queue-write' + : qCmd.startsWith('DELETEQ') ? 'cics-queue-delete' + : 'cics-queue'; + graph.addRelationship({ + id: generateId('ACCESSES', `${cicsId}->queue->${cics.queueName}:L${cics.line}`), + type: 'ACCESSES', sourceId: cicsId, targetId: queueId, + confidence: 0.85, reason: qReason, + }); + } + + // CICS RETURN/START TRANSID -> CALLS edge (transaction flow) + if (cics.transId) { + const cmd = cics.command.toUpperCase(); + if (cmd === 'RETURN' || cmd.startsWith('START')) { + const transNodeId = generateId('CodeElement', `:${cics.transId}`); + graph.addRelationship({ + id: generateId('CALLS', `${cicsOwner}->${cmd === 'RETURN' ? 'return' : 'start'}-transid->${cics.transId}:L${cics.line}`), + type: 'CALLS', sourceId: cicsOwner, targetId: transNodeId, + confidence: 0.8, + reason: cmd === 'RETURN' ? 'cics-return-transid' : 'cics-start-transid', + }); + } + } + + // CICS MAP -> ACCESSES edge (screen/mapset traceability) + if (cics.mapName) { + const mapId = generateId('Record', `:${cics.mapName}`); + graph.addRelationship({ + id: generateId('ACCESSES', `${cicsId}->map->${cics.mapName}:L${cics.line}`), + type: 'ACCESSES', sourceId: cicsId, targetId: mapId, + confidence: 0.85, reason: 'cics-map', + }); + } + + // CICS INTO(data-area) -> ACCESSES edge (data write target) + if (cics.intoField) { + const intoPropId = dataItemMap.get(cics.intoField.toUpperCase()); + if (intoPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${cicsId}->into->${cics.intoField}:L${cics.line}`), + type: 'ACCESSES', sourceId: cicsId, targetId: intoPropId, + confidence: 0.9, reason: 'cics-receive-into', + }); + } + } + + // CICS FROM(data-area) -> ACCESSES edge (data read source) + if (cics.fromField) { + const fromPropId = dataItemMap.get(cics.fromField.toUpperCase()); + if (fromPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${cicsId}->from->${cics.fromField}:L${cics.line}`), + type: 'ACCESSES', sourceId: cicsId, targetId: fromPropId, + confidence: 0.9, reason: 'cics-send-from', + }); + } + } + + // CICS HANDLE ABEND LABEL -> CALLS edge to error handler paragraph + if (cics.labelName) { + const labelTargetId = scopedParaLookup(cics.labelName, cics.line); + if (labelTargetId) { + graph.addRelationship({ + id: generateId('CALLS', `${cicsOwner}->abend-label->${cics.labelName}:L${cics.line}`), + type: 'CALLS', sourceId: cicsOwner, targetId: labelTargetId, + confidence: 0.9, reason: 'cics-handle-abend', + }); + } + } + } + + // ── ENTRY points -> Constructor nodes ────────────────────────── + for (const entry of extracted.entryPoints) { + const entryId = generateId('Constructor', `${filePath}:${entry.name}`); + graph.addNode({ + id: entryId, + label: 'Constructor', + properties: { + name: entry.name, + filePath, + startLine: entry.line, + endLine: entry.line, + language: SupportedLanguages.Cobol, + isExported: true, + description: entry.parameters.length > 0 ? `using:${entry.parameters.join(',')}` : undefined, + }, + }); + const entryOwner = owningModuleId(entry.line); + graph.addRelationship({ + id: generateId('CONTAINS', `${entryOwner}->${entryId}`), + type: 'CONTAINS', + sourceId: entryOwner, + targetId: entryId, + confidence: 1.0, + reason: 'cobol-entry-point', + }); + // Register in moduleNodeIds for cross-program resolution + moduleNodeIds.set(entry.name.toUpperCase(), entryId); + } + + // ── DECLARATIVES error handlers -> ACCESSES edges ────────────────── + for (const decl of extracted.declaratives) { + // Find the section's Namespace node + const pgm = findOwningProgramName(decl.line, extracted.programs); + const sectionId = sectionNodeIds.get(`${pgm ?? ''}:${decl.sectionName.toUpperCase()}`); + if (!sectionId) continue; + + // Create ACCESSES edge from handler section to file/mode + const targetId = generateId('Record', `${filePath}:${decl.target}`); + graph.addRelationship({ + id: generateId('ACCESSES', `${sectionId}->error-handler->${decl.target}:L${decl.line}`), + type: 'ACCESSES', + sourceId: sectionId, + targetId, + confidence: 0.9, + reason: 'cobol-error-handler', + }); + } + + // ── SET statement -> ACCESSES edges ────────────────── + for (const set of extracted.sets) { + const callerId = scopedCallerLookup(set.caller, set.line); + const reason = set.form === 'to-true' ? 'cobol-set-condition' : 'cobol-set-index'; + for (const target of set.targets) { + const targetPropId = dataItemMap.get(target.toUpperCase()); + if (targetPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->set->${target}:L${set.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: targetPropId, + confidence: 0.9, + reason, + }); + } + } + // If SET index has a value that is an identifier (not a number), add read edge + if (set.value && /^[A-Z][A-Z0-9-]+$/i.test(set.value)) { + const valuePropId = dataItemMap.get(set.value.toUpperCase()); + if (valuePropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->set-read->${set.value}:L${set.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: valuePropId, + confidence: 0.9, + reason: 'cobol-set-read', + }); + } + } + } + + // ── INSPECT -> ACCESSES edges ────────────────── + for (const insp of extracted.inspects) { + const callerId = scopedCallerLookup(insp.caller, insp.line); + const inspFieldId = dataItemMap.get(insp.inspectedField.toUpperCase()); + if (inspFieldId) { + // Read edge (always — INSPECT reads the field) + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->inspect-read->${insp.inspectedField}:L${insp.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: inspFieldId, + confidence: 0.9, + reason: 'cobol-inspect-read', + }); + // Write edge (if REPLACING or CONVERTING — modifies the field in-place) + if (insp.form !== 'tallying') { + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->inspect-write->${insp.inspectedField}:L${insp.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: inspFieldId, + confidence: 0.9, + reason: 'cobol-inspect-write', + }); + } + } + // Tally counter write edges + for (const counter of insp.counters) { + const counterPropId = dataItemMap.get(counter.toUpperCase()); + if (counterPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->inspect-tally->${counter}:L${insp.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: counterPropId, + confidence: 0.9, + reason: 'cobol-inspect-tally', + }); + } + } + } + + // ── INITIALIZE -> ACCESSES write edges ────────────────── + for (const init of extracted.initializes) { + const callerId = scopedCallerLookup(init.caller, init.line); + const targetPropId = dataItemMap.get(init.target.toUpperCase()); + if (targetPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->initialize->${init.target}:L${init.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: targetPropId, + confidence: 0.9, + reason: 'cobol-initialize', + }); + } + } + + // ── EXEC DLI (IMS/DB) -> CodeElement + ACCESSES edges ────────────── + for (const dli of extracted.execDliBlocks) { + const dliId = generateId('CodeElement', `${filePath}:exec-dli:L${dli.line}`); + const dliOwner = owningModuleId(dli.line); + graph.addNode({ + id: dliId, + label: 'CodeElement', + properties: { + name: `EXEC DLI ${dli.verb}`, + filePath, + startLine: dli.line, + endLine: dli.line, + language: SupportedLanguages.Cobol, + description: [ + dli.segmentName && `segment:${dli.segmentName}`, + dli.pcbNumber !== undefined && `pcb:${dli.pcbNumber}`, + dli.psbName && `psb:${dli.psbName}`, + ].filter(Boolean).join(' ') || undefined, + }, + }); + graph.addRelationship({ + id: generateId('CONTAINS', `${dliOwner}->${dliId}`), + type: 'CONTAINS', + sourceId: dliOwner, + targetId: dliId, + confidence: 1.0, + reason: 'cobol-exec-dli', + }); + // ACCESSES edge to IMS segment (like SQL table) + if (dli.segmentName) { + const segId = generateId('Record', `:${dli.segmentName}`); + graph.addRelationship({ + id: generateId('ACCESSES', `${dliId}->${dli.segmentName}:${dli.verb}`), + type: 'ACCESSES', + sourceId: dliId, + targetId: segId, + confidence: 0.9, + reason: `dli-${dli.verb.toLowerCase()}`, + }); + } + // ACCESSES to INTO/FROM data areas + if (dli.intoField) { + const intoPropId = dataItemMap.get(dli.intoField.toUpperCase()); + if (intoPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${dliId}->into->${dli.intoField}:L${dli.line}`), + type: 'ACCESSES', + sourceId: dliId, + targetId: intoPropId, + confidence: 0.9, + reason: 'dli-into', + }); + } + } + if (dli.fromField) { + const fromPropId = dataItemMap.get(dli.fromField.toUpperCase()); + if (fromPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${dliId}->from->${dli.fromField}:L${dli.line}`), + type: 'ACCESSES', + sourceId: dliId, + targetId: fromPropId, + confidence: 0.9, + reason: 'dli-from', + }); + } + } + } + + // ── MOVE data flow -> ACCESSES edges (read/write) ────────────── + for (const move of extracted.moves) { + const fromPropId = dataItemMap.get(move.from.toUpperCase()); + const callerId = scopedCallerLookup(move.caller, move.line); + + // One read edge per MOVE (regardless of number of targets) + if (fromPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->read->${move.from}:L${move.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: fromPropId, + confidence: 0.9, + reason: move.corresponding ? 'cobol-move-corresponding-read' : 'cobol-move-read', + }); + } + + // One write edge per target + for (const target of move.targets) { + const toPropId = dataItemMap.get(target.toUpperCase()); + if (toPropId) { + graph.addRelationship({ + id: generateId('ACCESSES', `${callerId}->write->${target}:L${move.line}`), + type: 'ACCESSES', + sourceId: callerId, + targetId: toPropId, + confidence: 0.9, + reason: move.corresponding ? 'cobol-move-corresponding-write' : 'cobol-move-write', + }); + } + } + } + + // ── File declarations -> Record nodes ────────────────────────── + for (const fd of extracted.fileDeclarations) { + const fdId = generateId('Record', `${filePath}:${fd.selectName}`); + graph.addNode({ + id: fdId, + label: 'Record', + properties: { + name: fd.selectName, + filePath, + startLine: fd.line, + endLine: fd.line, + language: SupportedLanguages.Cobol, + description: `assign:${fd.assignTo}${fd.isOptional ? ' optional' : ''}${fd.organization ? ` org:${fd.organization}` : ''}${fd.access ? ` access:${fd.access}` : ''}`, + }, + }); + const fdOwner = owningModuleId(fd.line); + graph.addRelationship({ + id: generateId('CONTAINS', `${fdOwner}->${fdId}`), + type: 'CONTAINS', + sourceId: fdOwner, + targetId: fdId, + confidence: 1.0, + reason: 'cobol-file-declaration', + }); + } + + // ── GO TO -> CALLS edges ────────────────────────────────────── + for (const gt of extracted.gotos) { + const callerId = scopedCallerLookup(gt.caller, gt.line); + const targetId = scopedParaLookup(gt.target, gt.line); + if (targetId) { + graph.addRelationship({ + id: generateId('CALLS', `${callerId}->goto->${gt.target}:L${gt.line}`), + type: 'CALLS', + sourceId: callerId, + targetId, + confidence: 1.0, + reason: 'cobol-goto', + }); + } + } + + // ── SORT/MERGE -> ACCESSES edges ────────────────────────────── + for (const sort of extracted.sorts) { + const sortFileId = generateId('Record', `${filePath}:${sort.sortFile}`); + const sortOwner = owningModuleId(sort.line); + for (const usingFile of sort.usingFiles) { + const usingId = generateId('Record', `${filePath}:${usingFile}`); + graph.addRelationship({ + id: generateId('ACCESSES', `${sortOwner}->sort-using->${usingFile}:L${sort.line}`), + type: 'ACCESSES', + sourceId: sortFileId, + targetId: usingId, + confidence: 0.85, + reason: 'sort-using', + }); + } + for (const givingFile of sort.givingFiles) { + const givingId = generateId('Record', `${filePath}:${givingFile}`); + graph.addRelationship({ + id: generateId('ACCESSES', `${sortOwner}->sort-giving->${givingFile}:L${sort.line}`), + type: 'ACCESSES', + sourceId: sortFileId, + targetId: givingId, + confidence: 0.85, + reason: 'sort-giving', + }); + } + } + + // ── SEARCH -> ACCESSES edges ────────────────────────────────── + for (const search of extracted.searches) { + const targetPropId = dataItemMap.get(search.target.toUpperCase()); + if (targetPropId) { + const searchOwner = owningModuleId(search.line); + graph.addRelationship({ + id: generateId('ACCESSES', `${searchOwner}->search->${search.target}:L${search.line}`), + type: 'ACCESSES', + sourceId: searchOwner, + targetId: targetPropId, + confidence: 0.9, + reason: 'cobol-search', + }); + } + } + + // ── CANCEL -> CALLS edges (with two-pass resolution like CALL) ── + for (const cancel of extracted.cancels) { + if (!cancel.isQuoted) { + // Dynamic CANCEL via data item — annotate, don't resolve + graph.addNode({ + id: generateId('CodeElement', `${filePath}:dynamic-cancel:${cancel.target}:L${cancel.line}`), + label: 'CodeElement', + properties: { + name: `CANCEL ${cancel.target}`, + filePath, startLine: cancel.line, endLine: cancel.line, + language: SupportedLanguages.Cobol, + description: 'dynamic-cancel (target is a data item, not resolvable statically)', + }, + }); + const cancelOwner = owningModuleId(cancel.line); + graph.addRelationship({ + id: generateId('CONTAINS', `${cancelOwner}->dynamic-cancel:${cancel.target}:L${cancel.line}`), + type: 'CONTAINS', sourceId: cancelOwner, + targetId: generateId('CodeElement', `${filePath}:dynamic-cancel:${cancel.target}:L${cancel.line}`), + confidence: 1.0, reason: 'cobol-dynamic-cancel', + }); + continue; + } + const targetModuleId = moduleNodeIds.get(cancel.target.toUpperCase()); + const targetId = targetModuleId + ?? generateId('Module', `:${cancel.target.toUpperCase()}`); + const cancelCallOwner = owningModuleId(cancel.line); + graph.addRelationship({ + id: generateId('CALLS', `${cancelCallOwner}->cancel->${cancel.target}:L${cancel.line}`), + type: 'CALLS', + sourceId: cancelCallOwner, + targetId, + confidence: targetModuleId ? 0.9 : 0.5, + reason: targetModuleId ? 'cobol-cancel' : 'cobol-cancel-unresolved', + }); + } +} + +// --------------------------------------------------------------------------- +// Helpers +// --------------------------------------------------------------------------- + +/** Find the enclosing program name for a given line number (innermost wins). */ +function findOwningProgramName( + lineNum: number, + programs: Array<{ name: string; startLine: number; endLine: number; nestingDepth: number }>, +): string | undefined { + let best: typeof programs[0] | undefined; + for (const p of programs) { + if (p.startLine <= lineNum && p.endLine >= lineNum) { + if (!best || p.nestingDepth > best.nestingDepth) best = p; + } + } + return best?.name; +} + +/** Find the section that contains a given line number. */ +function findContainingSection( + line: number, + sections: Array<{ name: string; line: number }>, + sectionNodeIds: Map, + programs: Array<{ name: string; startLine: number; endLine: number; nestingDepth: number }>, +): string | undefined { + const pgm = findOwningProgramName(line, programs); + // Sections are in order; find the last section whose start line <= the target line + let best: string | undefined; + for (const sec of sections) { + if (sec.line <= line) { + const resolved = sectionNodeIds.get(`${pgm ?? ''}:${sec.name.toUpperCase()}`); + if (resolved) best = resolved; // only update if lookup succeeds + } else { + break; + } + } + return best; +} diff --git a/gitnexus/src/core/ingestion/cobol/cobol-copy-expander.ts b/gitnexus/src/core/ingestion/cobol/cobol-copy-expander.ts new file mode 100644 index 0000000000..4fd9f1155f --- /dev/null +++ b/gitnexus/src/core/ingestion/cobol/cobol-copy-expander.ts @@ -0,0 +1,501 @@ +/** + * COBOL COPY statement expansion engine. + * + * Expands COPY statements by inlining copybook content, applying REPLACING + * transformations (LEADING, TRAILING, EXACT), and handling nested copies + * with cycle detection. + * + * This is a preprocessing step that runs BEFORE extractCobolSymbolsWithRegex. + * The caller should run preprocessCobolSource first to clean patch markers. + * + * Supported syntax: + * COPY CPSESP. + * COPY "WORKGRID.CPY". + * COPY CPSESP REPLACING LEADING "ESP-" BY "LK-ESP-" + * LEADING "KPSESPL" BY "LK-KPSESPL". + * COPY ANAZI REPLACING "ANAZI-KEY" BY "LK-KEY". + */ + +// --------------------------------------------------------------------------- +// Public interfaces +// --------------------------------------------------------------------------- + +export interface CopyReplacing { + type: 'LEADING' | 'TRAILING' | 'EXACT'; + from: string; + to: string; + isPseudotext?: boolean; +} + +export interface CopyResolution { + copyTarget: string; + resolvedPath: string | null; + line: number; + replacing: CopyReplacing[]; + library?: string; +} + +export interface CopyExpansionResult { + expandedContent: string; + copyResolutions: CopyResolution[]; +} + +// --------------------------------------------------------------------------- +// Constants +// --------------------------------------------------------------------------- + +export const DEFAULT_MAX_DEPTH = 10; + +/** COBOL identifier pattern: starts with letter, contains letters, digits, hyphens. */ +const RE_COBOL_IDENTIFIER = /\b([A-Z][A-Z0-9-]*)\b/gi; + +// --------------------------------------------------------------------------- +// Private helpers +// --------------------------------------------------------------------------- + +/** + * Strip inline comments (Italian-style `|` comments). + * Only strips if `|` appears in the code area (col 7+). + */ +function stripInlineComment(line: string): string { + let inQuote: string | null = null; + for (let i = 0; i < line.length; i++) { + const ch = line[i]; + if (inQuote) { + if (ch === inQuote) inQuote = null; + } else if (ch === '"' || ch === "'") { + inQuote = ch; + } else if (ch === '|') { + return line.substring(0, i); + } + } + return line; +} + +/** + * Check if a line is a COBOL comment (indicator in col 7 is `*` or `/`). + */ +function isCommentLine(line: string): boolean { + return line.length >= 7 && (line[6] === '*' || line[6] === '/'); +} + +/** + * Check if a line is a continuation line (indicator in col 7 is `-`). + */ +function isContinuationLine(line: string): boolean { + return line.length >= 7 && line[6] === '-'; +} + +/** + * Merge continuation lines into their predecessors. + * Returns an array of logical lines with their original starting line numbers. + */ +function mergeLogicalLines( + rawLines: string[], +): Array<{ text: string; lineNum: number }> { + const logical: Array<{ text: string; lineNum: number }> = []; + + for (let i = 0; i < rawLines.length; i++) { + const raw = rawLines[i]; + + // Skip comment lines + if (isCommentLine(raw)) { + logical.push({ text: '', lineNum: i + 1 }); + continue; + } + + // Continuation: merge into previous logical line + if (isContinuationLine(raw)) { + if (logical.length > 0) { + const prev = logical[logical.length - 1]; + const continuation = raw.length > 7 ? raw.substring(7).trimStart() : ''; + prev.text += continuation; + } + // Push empty placeholder to preserve line count + logical.push({ text: '', lineNum: i + 1 }); + continue; + } + + // Normal line: strip inline comments + const cleaned = stripInlineComment(raw); + logical.push({ text: cleaned, lineNum: i + 1 }); + } + + return logical; +} + +// --------------------------------------------------------------------------- +// COPY statement parsing +// --------------------------------------------------------------------------- + +interface ParsedCopyStatement { + startLine: number; + endLine: number; + target: string; + replacing: CopyReplacing[]; + library?: string; +} + +/** + * Parse REPLACING clause text into structured replacements. + * + * Input examples: + * LEADING "ESP-" BY "LK-ESP-" LEADING "KPSESPL" BY "LK-KPSESPL" + * "ANAZI-KEY" BY "LK-KEY" + * TRAILING "-IN" BY "-OUT" + * ==CUST-== BY ==WS-CUST-== + * ==OLD-TEXT== BY ==== + */ +export function parseReplacingClause(text: string): CopyReplacing[] { + const replacings: CopyReplacing[] = []; + if (!text || text.trim().length === 0) return replacings; + + // Tokenize: ==pseudotext==, "quoted strings", or bare words. + // Pseudotext can contain spaces and single = chars but not ==. + interface TokenInfo { value: string; isPseudotext: boolean; } + const tokens: TokenInfo[] = []; + const tokenRe = /==((?:[^=]|=[^=])*)==|"([^"]*)"|(\S+)/g; + let tm: RegExpExecArray | null; + while ((tm = tokenRe.exec(text)) !== null) { + if (tm[1] !== undefined) { + // Pseudotext: trim leading/trailing whitespace + tokens.push({ value: tm[1].trim(), isPseudotext: true }); + } else if (tm[2] !== undefined) { + tokens.push({ value: tm[2], isPseudotext: false }); + } else { + tokens.push({ value: tm[3], isPseudotext: false }); + } + } + + // Parse token stream: [LEADING|TRAILING]? BY + let i = 0; + while (i < tokens.length) { + let type: CopyReplacing['type'] = 'EXACT'; + + // Check for type modifier (only on non-pseudotext tokens) + if (!tokens[i].isPseudotext) { + const upper = tokens[i].value.toUpperCase(); + if (upper === 'LEADING') { + type = 'LEADING'; + i++; + } else if (upper === 'TRAILING') { + type = 'TRAILING'; + i++; + } + } + + if (i >= tokens.length) break; + const fromToken = tokens[i]; + i++; + + // Pseudotext always forces EXACT type + if (fromToken.isPseudotext) type = 'EXACT'; + + // Expect BY keyword + if (i >= tokens.length) break; + if (tokens[i].value.toUpperCase() !== 'BY') { + // Malformed — skip this token and try to resync + continue; + } + i++; // skip BY + + if (i >= tokens.length) break; + const toToken = tokens[i]; + i++; + + replacings.push({ type, from: fromToken.value, to: toToken.value, isPseudotext: fromToken.isPseudotext || undefined }); + } + + return replacings; +} + +/** + * Scan logical lines for COPY statements. + * COPY statements can span multiple lines and terminate with a period. + */ +function parseCopyStatements( + logicalLines: Array<{ text: string; lineNum: number }>, +): ParsedCopyStatement[] { + const results: ParsedCopyStatement[] = []; + + let accumulator: string | null = null; + let startLine = 0; + let endLine = 0; + + for (let i = 0; i < logicalLines.length; i++) { + const { text, lineNum } = logicalLines[i]; + if (text.length === 0) continue; + + // Check for COPY keyword start (not inside a string context) + const copyStart = text.match(/\bCOPY\b/i); + + if (accumulator === null) { + if (!copyStart) continue; + + // Start accumulating from the COPY keyword onwards + const copyIdx = copyStart.index!; + accumulator = text.substring(copyIdx); + startLine = lineNum; + endLine = lineNum; + } else { + // Continue accumulating + accumulator += ' ' + text.trim(); + endLine = lineNum; + } + + // Check if statement terminates (period at end of accumulated text) + if (accumulator !== null && /\.\s*$/.test(accumulator)) { + const parsed = parseSingleCopyStatement(accumulator, startLine, endLine); + if (parsed) { + results.push(parsed); + } + accumulator = null; + } + } + + // If there's an unterminated COPY (missing period), try to parse what we have + if (accumulator !== null) { + const parsed = parseSingleCopyStatement(accumulator, startLine, endLine); + if (parsed) { + results.push(parsed); + } + } + + return results; +} + +/** + * Parse a single complete COPY statement string. + * + * Formats: + * COPY target. + * COPY "target". + * COPY target REPLACING ... . + */ +function parseSingleCopyStatement( + stmt: string, + startLine: number, + endLine: number, +): ParsedCopyStatement | null { + // Strip terminating period + const text = stmt.replace(/\.\s*$/, '').trim(); + + // Extract target: COPY or COPY "" or COPY '' + // Optionally followed by IN/OF (COBOL-85 standard: IN and OF are synonyms) + const targetMatch = text.match( + /^COPY\s+(?:"([^"]+)"|'([^']+)'|([A-Z][A-Z0-9-]*))(?:\s+(?:IN|OF)\s+([A-Z][A-Z0-9-]*))?/i, + ); + if (!targetMatch) return null; + + const target = targetMatch[1] ?? targetMatch[2] ?? targetMatch[3]; + const library = targetMatch[4] || undefined; + + // Extract REPLACING clause if present + let replacing: CopyReplacing[] = []; + const replacingIdx = text.search(/\bREPLACING\b/i); + if (replacingIdx >= 0) { + const replacingText = text.substring(replacingIdx + 'REPLACING'.length); + replacing = parseReplacingClause(replacingText); + } + + return { startLine, endLine, target, replacing, library }; +} + +// --------------------------------------------------------------------------- +// REPLACING application +// --------------------------------------------------------------------------- + +/** + * Apply REPLACING transformations to copybook content. + * + * LEADING: replace prefix in COBOL identifiers. + * TRAILING: replace suffix in COBOL identifiers. + * EXACT: replace exact token matches. + */ +function applyReplacing(content: string, replacings: CopyReplacing[]): string { + if (replacings.length === 0) return content; + + // First pass: handle EXACT replacements that contain spaces or non-identifier + // characters (pseudotext). These cannot be handled by identifier-level matching. + let result = content; + for (const r of replacings) { + if (r.type === 'EXACT' && (r.isPseudotext || r.from.includes(' ') || !/^[A-Z][A-Z0-9-]*$/i.test(r.from))) { + const escaped = r.from.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); + const re = new RegExp(escaped, 'gi'); + result = result.replace(re, r.to); + } + } + + // Second pass: identifier-level replacements (LEADING, TRAILING, single-word EXACT) + const identifierReplacings = replacings.filter( + r => !(r.type === 'EXACT' && (r.isPseudotext || r.from.includes(' ') || !/^[A-Z][A-Z0-9-]*$/i.test(r.from))), + ); + if (identifierReplacings.length === 0) return result; + + return result.replace(RE_COBOL_IDENTIFIER, (match) => { + for (const r of identifierReplacings) { + const upper = match.toUpperCase(); + const from = r.from.toUpperCase(); + const to = r.to.toUpperCase(); + switch (r.type) { + case 'LEADING': + if (upper.startsWith(from)) { + return to + match.substring(from.length); + } + break; + case 'TRAILING': + if (upper.endsWith(from)) { + return match.substring(0, match.length - from.length) + to; + } + break; + case 'EXACT': + if (upper === from) { + return to; + } + break; + } + } + return match; + }); +} + +// --------------------------------------------------------------------------- +// Main expansion engine +// --------------------------------------------------------------------------- + +/** + * Expand COBOL COPY statements by inlining copybook content. + * + * @param content - Source COBOL content (after preprocessCobolSource) + * @param filePath - Path of the source file (for diagnostics) + * @param resolveFile - Maps a COPY target name to a filesystem path, or null if not found + * @param readFile - Reads file content by path, or null if unreadable + * @param maxDepth - Maximum nesting depth for recursive expansion (default: 10) + * @returns Expanded content and resolution metadata + */ +export function expandCopies( + content: string, + filePath: string, + resolveFile: (name: string) => string | null, + readFile: (path: string) => string | null, + maxDepth: number = DEFAULT_MAX_DEPTH, +): CopyExpansionResult { + const allResolutions: CopyResolution[] = []; + const warnedCircular = new Set(); + let totalExpansions = 0; + const MAX_TOTAL_EXPANSIONS = 500; + + const expanded = expandRecursive(content, filePath, 0, new Set()); + + return { + expandedContent: expanded, + copyResolutions: allResolutions, + }; + + /** + * Recursively expand COPY statements in content. + * + * @param src - Source content to expand + * @param srcPath - Path of the file being expanded (for cycle detection logging) + * @param depth - Current recursion depth + * @param visited - Set of already-visited copybook paths (cycle detection) + */ + function expandRecursive( + src: string, + srcPath: string, + depth: number, + visited: Set, + ): string { + const rawLines = src.split(/\r?\n/); + const logicalLines = mergeLogicalLines(rawLines); + const copyStatements = parseCopyStatements(logicalLines); + + // No COPY statements — return as-is + if (copyStatements.length === 0) return src; + + // Process COPY statements in reverse order so line numbers stay valid + // as we splice content + const outputLines = [...rawLines]; + + for (let ci = copyStatements.length - 1; ci >= 0; ci--) { + const cs = copyStatements[ci]; + + // Resolve the copybook path + const resolvedPath = resolveFile(cs.target); + + // Record resolution metadata + allResolutions.push({ + copyTarget: cs.target, + resolvedPath, + line: cs.startLine, + replacing: cs.replacing, + library: cs.library, + }); + + // Cannot resolve — keep original lines + if (resolvedPath === null) { + continue; + } + + // Cycle detection + if (visited.has(resolvedPath)) { + if (!warnedCircular.has(resolvedPath)) { + warnedCircular.add(resolvedPath); + console.warn( + `[cobol-copy-expander] Circular COPY detected: ${cs.target} (${resolvedPath}) ` + + `includes itself. Skipping expansion.`, + ); + } + continue; + } + + // Max depth exceeded — keep unexpanded + if (depth >= maxDepth) { + console.warn( + `[cobol-copy-expander] Max expansion depth (${maxDepth}) reached for ` + + `COPY ${cs.target} in ${srcPath}. Skipping expansion.`, + ); + continue; + } + + // Guard against exponential breadth amplification (N copybooks each with N COPYs) + if (++totalExpansions > MAX_TOTAL_EXPANSIONS) { + if (!warnedCircular.has('__max_total__')) { + warnedCircular.add('__max_total__'); + console.warn( + `[cobol-copy-expander] Max total expansions (${MAX_TOTAL_EXPANSIONS}) reached ` + + `in ${srcPath}. Skipping further expansions.`, + ); + } + continue; + } + + // Read the copybook content + const copybookContent = readFile(resolvedPath); + if (copybookContent === null) { + continue; + } + + // Apply REPLACING transformations + const replaced = applyReplacing(copybookContent, cs.replacing); + + // Recurse into the copybook for nested COPYs + const nestedVisited = new Set(visited); + nestedVisited.add(resolvedPath); + const expandedCopybook = expandRecursive( + replaced, + resolvedPath, + depth + 1, + nestedVisited, + ); + + // Splice: replace the COPY statement lines with expanded content + // startLine/endLine are 1-based; convert to 0-based array index + const expansionLines = expandedCopybook.split('\n'); + const removeCount = cs.endLine - cs.startLine + 1; + outputLines.splice(cs.startLine - 1, removeCount, ...expansionLines); + } + + return outputLines.join('\n'); + } +} diff --git a/gitnexus/src/core/ingestion/cobol/cobol-preprocessor.ts b/gitnexus/src/core/ingestion/cobol/cobol-preprocessor.ts new file mode 100644 index 0000000000..600b84dda0 --- /dev/null +++ b/gitnexus/src/core/ingestion/cobol/cobol-preprocessor.ts @@ -0,0 +1,1771 @@ +/** + * COBOL source pre-processing and regex-based symbol extraction. + * + * DESIGN DECISION — Why regex instead of a full parser (ANTLR4, tree-sitter): + * + * 1. Performance: Regex processes ~1ms/file vs 50-200ms/file for ANTLR4/tree-sitter. + * On EPAGHE (14k COBOL files), this is ~14 seconds vs 12-47 minutes. + * + * 2. Reliability: tree-sitter-cobol@0.0.1's external scanner hangs indefinitely + * on ~5% of production files (no timeout possible). ANTLR4's proleap-cobol-parser + * is a Java project — using it from Node.js requires Java subprocesses or + * extracting .g4 grammars and generating JS/TS targets (significant effort). + * + * 3. Dialect compatibility: GnuCOBOL with Italian comments, patch markers in + * cols 1-6 (mzADD, estero, etc.), and vendor extensions. Formal grammars + * target COBOL-85 and would need dialect modifications. + * + * 4. Industry precedent: ctags, GitHub code navigation, and Sourcegraph all use + * regex-based extraction for code indexing. Full parsing is only needed for + * compilation or semantic analysis, not symbol extraction. + * + * 5. Determinism: Every regex pattern is tested with canonical COBOL input + * (see test/unit/cobol-preprocessor.test.ts). Same input always produces + * same output — no grammar ambiguity or parser state issues. + * + * This module provides: + * 1. preprocessCobolSource() — cleans patch markers (kept for potential future use) + * 2. extractCobolSymbolsWithRegex() — single-pass state machine COBOL extraction + */ + +// --------------------------------------------------------------------------- +// Public interfaces +// --------------------------------------------------------------------------- + +export interface CobolRegexResults { + programName: string | null; + /** All programs in this file with line-range boundaries for per-program scoping. */ + programs: Array<{ name: string; startLine: number; endLine: number; nestingDepth: number; procedureUsing?: string[]; isCommon?: boolean }>; + paragraphs: Array<{ name: string; line: number }>; + sections: Array<{ name: string; line: number }>; + performs: Array<{ caller: string | null; target: string; thruTarget?: string; line: number }>; + calls: Array<{ target: string; line: number; isQuoted: boolean; parameters?: string[]; returning?: string }>; + copies: Array<{ target: string; line: number }>; + dataItems: Array<{ + name: string; + level: number; + line: number; + pic?: string; + usage?: string; + occurs?: number; + dependingOn?: string; + redefines?: string; + values?: string[]; + isExternal?: boolean; + isGlobal?: boolean; + section: 'working-storage' | 'linkage' | 'file' | 'local-storage' | 'screen' | 'unknown'; + }>; + fileDeclarations: Array<{ + selectName: string; + assignTo: string; + organization?: string; + access?: string; + recordKey?: string; + alternateKeys?: string[]; + fileStatus?: string; + isOptional?: boolean; + line: number; + }>; + fdEntries: Array<{ + fdName: string; + recordName?: string; + line: number; + }>; + programMetadata: { + author?: string; + dateWritten?: string; + dateCompiled?: string; + installation?: string; + }; + + // Phase 2: EXEC blocks + execSqlBlocks: Array<{ + line: number; + tables: string[]; + cursors: string[]; + hostVariables: string[]; + operation: 'SELECT' | 'INSERT' | 'UPDATE' | 'DELETE' | 'DECLARE' | 'OPEN' | 'CLOSE' | 'FETCH' | 'OTHER'; + includeMember?: string; + }>; + execCicsBlocks: Array<{ + line: number; + command: string; + mapName?: string; + programName?: string; + programIsLiteral?: boolean; + transId?: string; + fileName?: string; + fileIsLiteral?: boolean; + queueName?: string; + labelName?: string; + intoField?: string; + fromField?: string; + }>; + + // Phase 3: Linkage + Data Flow + procedureUsing: string[]; + entryPoints: Array<{ + name: string; + parameters: string[]; + line: number; + }>; + moves: Array<{ + from: string; + targets: string[]; + line: number; + caller: string | null; + corresponding: boolean; + }>; + + // Phase 4: Additional structural features + gotos: Array<{ caller: string | null; target: string; line: number }>; + sorts: Array<{ sortFile: string; usingFiles: string[]; givingFiles: string[]; line: number }>; + searches: Array<{ target: string; line: number }>; + cancels: Array<{ target: string; line: number; isQuoted: boolean }>; + + // Phase 2.1: EXEC DLI (IMS/DB) + execDliBlocks: Array<{ + line: number; + verb: string; + pcbNumber?: number; + segmentName?: string; + intoField?: string; + fromField?: string; + psbName?: string; + }>; + + // Phase 2.2: DECLARATIVES + declaratives: Array<{ + sectionName: string; + target: string; // file-name or INPUT/OUTPUT/I-O/EXTEND + line: number; + }>; + + // Phase 2.3: SET statement + sets: Array<{ + targets: string[]; + form: 'to-true' | 'to-value' | 'up-by' | 'down-by'; + value?: string; + line: number; + caller: string | null; + }>; + + // Phase 2.4: INSPECT + inspects: Array<{ + inspectedField: string; + counters: string[]; + form: 'tallying' | 'replacing' | 'converting' | 'tallying-replacing'; + line: number; + caller: string | null; + }>; + + // Phase 4.1: INITIALIZE + initializes: Array<{ target: string; line: number; caller: string | null }>; +} + +// --------------------------------------------------------------------------- +// Preserved exactly: preprocessCobolSource +// --------------------------------------------------------------------------- + +/** + * Normalize COBOL source for regex-based extraction. + * + * The COBOL fixed-format sequence number area (columns 1-6) is semantically + * irrelevant to parsing — compilers and tools always ignore it. This + * function replaces ANY non-space content in columns 1-6 with spaces + * so that position-sensitive regexes (paragraph/section detection, data-item + * anchors, etc.) work identically whether the file carries numeric sequence + * numbers (000100), alphabetic patch markers (mzADD, estero, #patch), or + * the COBOL default of all spaces. + * + * Preserves exact line count for position mapping. + */ +export function preprocessCobolSource(content: string): string { + // Skip preprocessing for free-format COBOL — cols 1-6 are program text, not sequence area + // Check first 10 lines (consistent with extractCobolSymbolsWithRegex detection threshold) + const firstLines = content.split('\n', 10).join('\n'); + if (/>>SOURCE\s+(?:FORMAT\s+(?:IS\s+)?)?FREE/i.test(firstLines)) { + return content; + } + + const lines = content.split(/\r?\n/); + for (let i = 0; i < lines.length; i++) { + const line = lines[i]; + if (line.length < 7) continue; + const seq = line.substring(0, 6); + // Replace any non-space content in the sequence area with spaces. + // This covers numeric sequence numbers (000100), alphabetic patch markers + // (mzADD, estero), '#'-prefixed markers, and all other col 1-6 content. + if (/\S/.test(seq)) { + lines[i] = ' ' + line.substring(6); + } + } + return lines.join('\n'); +} + +// --------------------------------------------------------------------------- +// Preserved exactly: EXCLUDED_PARA_NAMES +// --------------------------------------------------------------------------- + +// COBOL calling-convention keywords to filter from USING parameter lists +const USING_KEYWORDS = new Set(['BY', 'VALUE', 'REFERENCE', 'CONTENT', 'ADDRESS', 'OF', 'RETURNING']); + +// CALL ... USING keyword filter (extends USING_KEYWORDS for CALL-specific forms) +const CALL_USING_FILTER = new Set([ + 'BY', 'REFERENCE', 'CONTENT', 'VALUE', + 'ADDRESS', 'OF', 'LENGTH', 'OMITTED', +]); + +const EXCLUDED_PARA_NAMES = new Set([ + 'DECLARATIVES', 'END', 'PROCEDURE', 'IDENTIFICATION', + 'ENVIRONMENT', 'DATA', 'WORKING-STORAGE', 'LINKAGE', + 'FILE', 'LOCAL-STORAGE', 'COMMUNICATION', 'REPORT', + 'SCREEN', 'INPUT-OUTPUT', 'CONFIGURATION', + // COBOL verbs that appear alone on a line with period (false-positive in free-format) + 'GOBACK', 'STOP', 'EXIT', 'CONTINUE', + 'DISPLAY', 'ACCEPT', 'WRITE', 'READ', 'REWRITE', 'DELETE', + 'OPEN', 'CLOSE', 'RETURN', 'RELEASE', 'SORT', 'MERGE', +]); + +// --------------------------------------------------------------------------- +// State machine types +// --------------------------------------------------------------------------- + +type Division = 'identification' | 'environment' | 'data' | 'procedure' | null; + +type DataSection = 'working-storage' | 'linkage' | 'file' | 'local-storage' | 'screen' | 'unknown'; + +type EnvironmentSection = 'input-output' | 'configuration' | null; + +// --------------------------------------------------------------------------- +// Regex constants (compiled once, reused across calls) +// --------------------------------------------------------------------------- + +const RE_DIVISION = /\b(IDENTIFICATION|ENVIRONMENT|DATA|PROCEDURE)\s+DIVISION\b/i; +const RE_SECTION = /\b(WORKING-STORAGE|LINKAGE|FILE|LOCAL-STORAGE|SCREEN|INPUT-OUTPUT|CONFIGURATION)\s+SECTION\b/i; + +// IDENTIFICATION DIVISION +const RE_PROGRAM_ID = /\bPROGRAM-ID\.\s*([A-Z][A-Z0-9-]*)(?:\s+IS\s+COMMON)?/i; +const RE_END_PROGRAM = /\bEND\s+PROGRAM\s+([A-Z][A-Z0-9-]*)\s*\./i; +const RE_AUTHOR = /^\s+AUTHOR\.\s*(.+)/i; +const RE_DATE_WRITTEN = /^\s+DATE-WRITTEN\.\s*(.+)/i; +const RE_DATE_COMPILED = /^\s+DATE-COMPILED\.\s*(.+)/i; +const RE_INSTALLATION = /^\s+INSTALLATION\.\s*(.+)/i; + +// ENVIRONMENT DIVISION — SELECT +const RE_SELECT_START = /\bSELECT\s+(?:OPTIONAL\s+)?([A-Z][A-Z0-9-]+)/i; + +// DATA DIVISION +// ^\s* (not ^\s+) to support both fixed-format (indented) and free-format (trimmed) +const RE_FD = /^\s*(?:FD|SD|RD)\s+([A-Z][A-Z0-9-]+)/i; +const RE_DATA_ITEM = /^\s*(\d{1,2})\s+([A-Z][A-Z0-9-]+)\s*(.*)/i; +const RE_ANONYMOUS_REDEFINES = /^\s*(\d{1,2})\s+REDEFINES\s+([A-Z][A-Z0-9-]+)/i; +const RE_88_LEVEL = /^\s*88\s+([A-Z][A-Z0-9-]+)\s+VALUES?\s+(?:ARE\s+)?(.+)/i; + +// PROCEDURE DIVISION +// These patterns support both fixed-format (7 leading spaces) and free-format (any indentation) +const RE_PROC_SECTION = /^\s*([A-Z][A-Z0-9-]+)\s+SECTION(?:\s+\d+)?\.\s*$/i; +const RE_PROC_PARAGRAPH = /^\s*([A-Z][A-Z0-9-]+)\.\s*$/i; +const RE_PERFORM = /\bPERFORM\s+([A-Z][A-Z0-9-]+)(?:\s+(?:THRU|THROUGH)\s+([A-Z][A-Z0-9-]+))?/gi; + +// ALL DIVISIONS +// Both double-quoted ("PROG") and single-quoted ('PROG') targets are valid COBOL. +// Use separate alternation groups so quotes must match (prevents "PROG' false-matches). +const RE_CALL = /\bCALL\s+(?:"([^"]+)"|'([^']+)')/gi; +// Dynamic CALL via data item (no quotes): CALL WS-PROGRAM-NAME +const RE_CALL_DYNAMIC = /(? t.length > 0); + + const targets: string[] = []; + const QUAL_KEYWORDS = new Set(['OF', 'IN']); + let skipNext = false; + for (const token of tokens) { + if (skipNext) { skipNext = false; continue; } + if (QUAL_KEYWORDS.has(token.toUpperCase())) { skipNext = true; continue; } + if (/^[A-Z][A-Z0-9-]+$/i.test(token) && !MOVE_SKIP.has(token.toUpperCase())) { + targets.push(token); + } + } + return targets; +} + +// PERFORM: keywords that may follow PERFORM but are NOT paragraph/section names. +// Inline PERFORM loops (UNTIL, VARYING) and inline test clauses (WITH TEST, +// FOREVER) must not be stored as perform-target false positives. +const PERFORM_KEYWORD_SKIP = new Set([ + 'UNTIL', 'VARYING', 'WITH', 'TEST', 'FOREVER', +]); + +// SORT/MERGE clause keywords that should not be captured as file names +const SORT_CLAUSE_NOISE = new Set([ + 'ON', 'ASCENDING', 'DESCENDING', 'KEY', 'WITH', 'DUPLICATES', + 'IN', 'ORDER', 'COLLATING', 'SEQUENCE', 'IS', 'THROUGH', 'THRU', + 'INPUT', 'OUTPUT', 'PROCEDURE', 'USING', 'GIVING', +]); + +// COBOL statement verbs used as boundary detectors across accumulators. +// Shared by: callAccum flush trigger, inspectAccum flush trigger, and USING lookahead. +// Note: CALL is intentionally excluded — it's handled by the callAccum state machine. +// Including CALL here would cause the flush trigger to consume the new CALL line +// without re-detecting it as a CALL start. +const COBOL_STATEMENT_VERBS = [ + 'GO\\s+TO', 'PERFORM', 'MOVE', 'DISPLAY', 'ACCEPT', + 'INSPECT', 'SEARCH', 'SORT', 'MERGE', 'IF', 'EVALUATE', + 'SET', 'INITIALIZE', 'STOP', 'EXIT', 'GOBACK', 'CONTINUE', + 'READ', 'WRITE', 'REWRITE', 'DELETE', 'OPEN', 'CLOSE', 'START', + 'CANCEL', 'COMPUTE', 'ADD', 'SUBTRACT', 'MULTIPLY', 'DIVIDE', + 'STRING', 'UNSTRING', +]; + +/** Regex matching start of any COBOL statement verb (for accumulator flush triggers). */ +const RE_STATEMENT_VERB_START = new RegExp( + `^(?:${COBOL_STATEMENT_VERBS.join('|')})(?:\\s|$)`, 'i', +); + +/** Lookahead alternation for USING parameter extraction (stops before statement verbs). + * Includes CALL (excluded from COBOL_STATEMENT_VERBS to avoid callAccum conflicts). */ +const USING_VERB_LOOKAHEAD = [...COBOL_STATEMENT_VERBS, 'CALL'] + .filter(v => v !== 'GO\\s+TO') // GO TO handled separately with \bGO\s+TO\b + .map(v => `\\b${v}(?=\\s|$)`) + .join('|'); +const RE_USING_PARAMS = new RegExp( + `\\bUSING\\s+([\\s\\S]*?)(?=\\bRETURNING\\b|\\bON\\s+(?:EXCEPTION|OVERFLOW)\\b|\\bNOT\\s+ON\\b|\\bEND-CALL\\b|\\bGO\\s+TO\\b|${USING_VERB_LOOKAHEAD}|\\.\\s*$|$)`, 'i', +); + +// --------------------------------------------------------------------------- +// Private helper: strip Italian inline comments (| and everything after) +// --------------------------------------------------------------------------- + +function stripInlineComment(line: string): string { + let inQuote: string | null = null; + for (let i = 0; i < line.length; i++) { + const ch = line[i]; + if (inQuote) { + if (ch === inQuote) inQuote = null; + } else if (ch === '"' || ch === "'") { + inQuote = ch; + } else if (ch === '|') { + return line.substring(0, i); + } + } + return line; +} + +// --------------------------------------------------------------------------- +// Private helper: parse data item trailing clauses (PIC, USAGE, etc.) +// --------------------------------------------------------------------------- + +function parseDataItemClauses(rest: string): { + pic?: string; + usage?: string; + redefines?: string; + occurs?: number; + dependingOn?: string; + value?: string; + isExternal?: boolean; + isGlobal?: boolean; +} { + const result: { pic?: string; usage?: string; redefines?: string; occurs?: number; dependingOn?: string; value?: string; isExternal?: boolean; isGlobal?: boolean } = {}; + + // Strip trailing period for easier parsing + const text = rest.replace(/\.\s*$/, ''); + + // PIC / PICTURE [IS] + const picMatch = text.match(/\bPIC(?:TURE)?\s+(?:IS\s+)?(\S+)/i); + if (picMatch) { + result.pic = picMatch[1]; + } + + // USAGE [IS] — including non-standard COMP-6, COMP-X etc. + const usageMatch = text.match(/\bUSAGE\s+(?:IS\s+)?(COMP(?:UTATIONAL)?(?:-[0-9X])?|BINARY|PACKED-DECIMAL|DISPLAY|INDEX|POINTER|NATIONAL)\b/i); + if (usageMatch) { + result.usage = usageMatch[1].toUpperCase(); + } else { + // Standalone COMP variants without USAGE keyword + const compMatch = text.match(/\b(COMP(?:UTATIONAL)?(?:-[0-9X])?|BINARY|PACKED-DECIMAL)\b/i); + if (compMatch) { + result.usage = compMatch[1].toUpperCase(); + } + } + + // REDEFINES + const redefMatch = text.match(/\bREDEFINES\s+([A-Z][A-Z0-9-]+)/i); + if (redefMatch) { + result.redefines = redefMatch[1]; + } + + // OCCURS [TO ] [TIMES] [DEPENDING ON ] + const occursMatch = text.match(/\bOCCURS\s+(\d+)(?:\s+TO\s+(\d+))?\s*(?:TIMES\s*)?(?:DEPENDING\s+ON\s+([A-Z][A-Z0-9-]+(?:\s*\([^)]*\))?))?/i); + if (occursMatch) { + result.occurs = parseInt(occursMatch[1], 10); + if (occursMatch[3]) { + // Strip any subscript from DEPENDING ON field + result.dependingOn = occursMatch[3].replace(/\s*\([^)]*\)/, '').trim(); + } + } + + // IS EXTERNAL / IS GLOBAL + result.isExternal = /\bIS\s+EXTERNAL\b/i.test(text) || undefined; + result.isGlobal = /\bIS\s+GLOBAL\b/i.test(text) || undefined; + + // VALUE [IS] literal/constant + if (!result.value) { + const valueIdx = text.search(/\bVALUE\b/i); + if (valueIdx >= 0) { + const afterValue = text.substring(valueIdx + 5).replace(/^\s+IS\s+/i, '').trimStart(); + // Try quoted: "..." or '...' (with optional type prefix X, N, G, B) + const quotedMatch = afterValue.match(/^([XNGB])?(?:"([^"]*)"|'([^']*)')/i); + if (quotedMatch) { + const prefix = quotedMatch[1] ? quotedMatch[1].toUpperCase() : ''; + result.value = prefix ? `${prefix}'${quotedMatch[2] ?? quotedMatch[3]}'` : (quotedMatch[2] ?? quotedMatch[3]); + } else { + // Try ALL "..." or ALL '...' + const allMatch = afterValue.match(/^ALL\s+(?:"([^"]*)"|'([^']*)')/i); + if (allMatch) { + result.value = `ALL '${allMatch[1] ?? allMatch[2]}'`; + } else { + // Try numeric (including negative, decimal) + const numMatch = afterValue.match(/^(-?\d+\.?\d*)/); + if (numMatch) { + result.value = numMatch[1]; + } else { + // Try figurative constant or identifier + const identMatch = afterValue.match(/^([A-Z][A-Z0-9-]*)/i); + if (identMatch) result.value = identMatch[1].toUpperCase(); + } + } + } + } + } + + return result; +} + +// --------------------------------------------------------------------------- +// Private helper: parse 88-level condition values +// --------------------------------------------------------------------------- + +function parseConditionValues(valuesStr: string): string[] { + // Strip trailing period + const text = valuesStr.replace(/\.\s*$/, '').trim(); + const values: string[] = []; + + // Match quoted strings: "O" "Y" "I" + const quotedRe = /(?:"([^"]*)"|'([^']*)')/g; + let qm: RegExpExecArray | null; + let hasQuoted = false; + while ((qm = quotedRe.exec(text)) !== null) { + values.push(qm[1] ?? qm[2]); + hasQuoted = true; + } + if (hasQuoted) return values; + + // No quotes — split on whitespace, filtering out THRU/THROUGH keywords + // Handle: 11 12 16 17 21 or 1 THRU 5 + const tokens = text.split(/\s+/); + for (const token of tokens) { + const upper = token.toUpperCase(); + if (upper === 'THRU' || upper === 'THROUGH') { + // Keep THRU ranges as combined value: prev THRU next is already captured + // by having both sides in the array + continue; + } + if (token.length > 0) { + values.push(token); + } + } + + return values; +} + +// --------------------------------------------------------------------------- +// Private helper: parse accumulated multi-line SELECT statement +// --------------------------------------------------------------------------- + +interface FileDeclaration { + selectName: string; + assignTo: string; + organization?: string; + access?: string; + recordKey?: string; + alternateKeys?: string[]; + fileStatus?: string; + isOptional?: boolean; + line: number; +} + +function parseSelectStatement(stmt: string, startLine: number): FileDeclaration | null { + // Normalize whitespace + const text = stmt.replace(/\s+/g, ' ').trim(); + + const nameMatch = text.match(/^SELECT\s+(?:OPTIONAL\s+)?([A-Z][A-Z0-9-]+)/i); + if (!nameMatch) return null; + + const result: FileDeclaration = { + selectName: nameMatch[1], + assignTo: '', + line: startLine, + }; + + const assignMatch = text.match(/\bASSIGN\s+(?:TO\s+)?("([^"]+)"|([A-Z][A-Z0-9-]*))/i); + if (assignMatch) { + result.assignTo = assignMatch[2] || assignMatch[3] || ''; + } + + const orgMatch = text.match(/\bORGANIZATION\s+(?:IS\s+)?(SEQUENTIAL|INDEXED|RELATIVE|LINE\s+SEQUENTIAL)/i); + if (orgMatch) { + result.organization = orgMatch[1].toUpperCase(); + } + + const accessMatch = text.match(/\bACCESS\s+(?:MODE\s+)?(?:IS\s+)?(SEQUENTIAL|RANDOM|DYNAMIC)/i); + if (accessMatch) { + result.access = accessMatch[1].toUpperCase(); + } + + const keyMatch = text.match(/\bRECORD\s+KEY\s+(?:IS\s+)?([A-Z][A-Z0-9-]+)/i); + if (keyMatch) { + result.recordKey = keyMatch[1]; + } + + // ALTERNATE RECORD KEY + const altKeyMatches = text.matchAll(/\bALTERNATE\s+RECORD\s+KEY\s+(?:IS\s+)?([A-Z][A-Z0-9-]+)/gi); + const alternateKeys: string[] = []; + for (const m of altKeyMatches) alternateKeys.push(m[1]); + if (alternateKeys.length > 0) result.alternateKeys = alternateKeys; + + // FILE STATUS IS / STATUS IS + const statusMatch = text.match(/\b(?:FILE\s+)?STATUS\s+(?:IS\s+)?([A-Z][A-Z0-9-]+)/i); + if (statusMatch) { + result.fileStatus = statusMatch[1]; + } + + // SELECT OPTIONAL flag + result.isOptional = /^SELECT\s+OPTIONAL\b/i.test(text) || undefined; + + return result; +} + +// --------------------------------------------------------------------------- +// Private helper: parse EXEC SQL block +// --------------------------------------------------------------------------- + +type SqlOperation = 'SELECT' | 'INSERT' | 'UPDATE' | 'DELETE' | 'DECLARE' | 'OPEN' | 'CLOSE' | 'FETCH' | 'OTHER'; + +function parseExecSqlBlock(block: string, line: number): CobolRegexResults['execSqlBlocks'][number] { + // Strip EXEC SQL ... END-EXEC wrapper + const body = block + .replace(/\bEXEC\s+SQL\b/i, '') + .replace(/\bEND-EXEC\b/i, '') + .replace(/\s+/g, ' ') + .trim(); + + // Determine operation from first SQL keyword + const firstWord = body.split(/\s+/)[0]?.toUpperCase() || ''; + const OP_MAP: Record = { + SELECT: 'SELECT', INSERT: 'INSERT', UPDATE: 'UPDATE', DELETE: 'DELETE', + DECLARE: 'DECLARE', OPEN: 'OPEN', CLOSE: 'CLOSE', FETCH: 'FETCH', + INCLUDE: 'OTHER', // we handle INCLUDE specially below + }; + const operation: SqlOperation = OP_MAP[firstWord] || 'OTHER'; + + // EXEC SQL INCLUDE — extract member name for IMPORTS edge + let includeMember: string | undefined; + if (firstWord === 'INCLUDE') { + const includeMatch = body.match(/^INCLUDE\s+(?:'([^']+)'|"([^"]+)"|([A-Z][A-Z0-9_-]+))/i); + if (includeMatch) { + includeMember = includeMatch[1] ?? includeMatch[2] ?? includeMatch[3]; + } + } + + // Extract table names from FROM, INTO (INSERT), UPDATE, DELETE FROM, JOIN + const tables: string[] = []; + const tablePatterns = [ + /\bFROM\s+([A-Z][A-Z0-9_]+)/gi, + /\bINSERT\s+INTO\s+([A-Z][A-Z0-9_]+)/gi, + /\bUPDATE\s+([A-Z][A-Z0-9_]+)/gi, + /\bJOIN\s+([A-Z][A-Z0-9_]+)/gi, + ]; + for (const re of tablePatterns) { + let m: RegExpExecArray | null; + while ((m = re.exec(body)) !== null) { + const name = m[1].toUpperCase(); + // Skip host variables and SQL keywords + if (!name.startsWith(':') && !tables.includes(name)) { + tables.push(name); + } + } + } + + // Extract cursor names from DECLARE ... CURSOR + const cursors: string[] = []; + const cursorRe = /\bDECLARE\s+([A-Z][A-Z0-9_-]+)\s+CURSOR\b/gi; + let cm: RegExpExecArray | null; + while ((cm = cursorRe.exec(body)) !== null) { + cursors.push(cm[1]); + } + + // Extract host variables: :VARIABLE-NAME (strip the colon) + const hostVariables: string[] = []; + const hostRe = /:([A-Z][A-Z0-9-]+)/gi; + let hm: RegExpExecArray | null; + while ((hm = hostRe.exec(body)) !== null) { + const name = hm[1]; + if (!hostVariables.includes(name)) { + hostVariables.push(name); + } + } + + return { line, tables, cursors, hostVariables, operation, includeMember }; +} + +// --------------------------------------------------------------------------- +// Private helper: parse EXEC CICS block +// --------------------------------------------------------------------------- + +function parseExecCicsBlock(block: string, line: number): CobolRegexResults['execCicsBlocks'][number] { + // Strip EXEC CICS ... END-EXEC wrapper + const body = block + .replace(/\bEXEC\s+CICS\b/i, '') + .replace(/\bEND-EXEC\b/i, '') + .replace(/\s+/g, ' ') + .trim(); + + // Command: first keyword(s) — handle two-word commands like SEND MAP, RECEIVE MAP + const twoWordCommands = [ + 'SEND MAP', 'RECEIVE MAP', 'SEND TEXT', 'SEND CONTROL', + 'READ NEXT', 'READ PREV', + 'WRITEQ TS', 'WRITEQ TD', 'READQ TS', 'READQ TD', + 'DELETEQ TS', 'DELETEQ TD', + 'HANDLE ABEND', 'HANDLE AID', 'HANDLE CONDITION', + 'START TRANSID', + ]; + let command = ''; + const upperBody = body.toUpperCase(); + for (const twoWord of twoWordCommands) { + if (upperBody.startsWith(twoWord)) { + command = twoWord; + break; + } + } + if (!command) { + command = body.split(/\s+/)[0]?.toUpperCase() || ''; + } + + const result: CobolRegexResults['execCicsBlocks'][number] = { line, command }; + + // MAP name: MAP('name') or MAP("name") or MAP(IDENTIFIER) + const mapMatch = body.match(/\bMAP\s*\(\s*(?:['"]([^'"]+)['"]|([A-Z][A-Z0-9-]+))\s*\)/i); + if (mapMatch) result.mapName = mapMatch[1] ?? mapMatch[2]; + + // PROGRAM name: PROGRAM('name') or PROGRAM("name") or PROGRAM(VARIABLE) + const progMatch = body.match(/\bPROGRAM\s*\(\s*(?:['"]([^'"]+)['"]|([A-Z][A-Z0-9-]+))\s*\)/i); + if (progMatch) { + result.programName = progMatch[1] ?? progMatch[2]; + result.programIsLiteral = !!progMatch[1]; + } + + // TRANSID: TRANSID('name') or TRANSID("name") or TRANSID(VARIABLE) + const transMatch = body.match(/\bTRANSID\s*\(\s*(?:['"]([^'"]+)['"]|([A-Z][A-Z0-9-]+))\s*\)/i); + if (transMatch) result.transId = transMatch[1] ?? transMatch[2]; + + // FILE/DATASET: FILE('name') or DATASET('name') or FILE(VARIABLE) + // Used in CICS READ, WRITE, REWRITE, DELETE, STARTBR, READNEXT, READPREV, ENDBR + const fileMatch = body.match(/\b(?:FILE|DATASET)\s*\(\s*(?:['"]([^'"]+)['"]|([A-Z][A-Z0-9-]+))\s*\)/i); + if (fileMatch) { + result.fileName = fileMatch[1] ?? fileMatch[2]; + result.fileIsLiteral = !!fileMatch[1]; + } + + // QUEUE: QUEUE('name') — used in WRITEQ/READQ TS/TD + const queueMatch = body.match(/\bQUEUE\s*\(\s*(?:['"]([^'"]+)['"]|([A-Z][A-Z0-9-]+))\s*\)/i); + if (queueMatch) result.queueName = queueMatch[1] ?? queueMatch[2]; + + // HANDLE ABEND LABEL(paragraph-name) — error handler target + const labelMatch = body.match(/\bLABEL\s*\(\s*([A-Z][A-Z0-9-]+)\s*\)/i); + if (labelMatch) result.labelName = labelMatch[1]; + + // INTO(data-area) — data target (READ INTO, RECEIVE INTO, RETRIEVE INTO, READQ INTO) + const intoMatch = body.match(/\bINTO\s*\(\s*([A-Z][A-Z0-9-]+)\s*\)/i); + if (intoMatch) result.intoField = intoMatch[1]; + + // FROM(data-area) — data source (WRITE FROM, SEND FROM, WRITEQ FROM, START FROM) + const fromMatch = body.match(/\bFROM\s*\(\s*([A-Z][A-Z0-9-]+)\s*\)/i); + if (fromMatch) result.fromField = fromMatch[1]; + + return result; +} + +// --------------------------------------------------------------------------- +// Private helper: parse EXEC DLI block (IMS/DB) +// --------------------------------------------------------------------------- + +function parseExecDliBlock(block: string, line: number): CobolRegexResults['execDliBlocks'][number] { + const body = block.replace(/\bEXEC\s+DLI\b/i, '').replace(/\bEND-EXEC\b/i, '').replace(/\s+/g, ' ').trim(); + const verb = body.split(/\s+/)[0]?.toUpperCase() || ''; + const result: CobolRegexResults['execDliBlocks'][number] = { line, verb }; + + const pcbMatch = body.match(/\bUSING\s+PCB\s*\(\s*(\d+)\s*\)/i); + if (pcbMatch) result.pcbNumber = parseInt(pcbMatch[1], 10); + + const segMatch = body.match(/\bSEGMENT\s*\(\s*([A-Z][A-Z0-9-]*)\s*\)/i); + if (segMatch) result.segmentName = segMatch[1]; + + const intoMatch = body.match(/\bINTO\s*\(\s*([A-Z][A-Z0-9-]+)\s*\)/i); + if (intoMatch) result.intoField = intoMatch[1]; + + const fromMatch = body.match(/\bFROM\s*\(\s*([A-Z][A-Z0-9-]+)\s*\)/i); + if (fromMatch) result.fromField = fromMatch[1]; + + const psbMatch = body.match(/\bPSB\s*\(\s*([A-Z][A-Z0-9-]+)\s*\)/i); + if (psbMatch) result.psbName = psbMatch[1]; + + return result; +} + +// --------------------------------------------------------------------------- +// Main extraction: single-pass state machine +// --------------------------------------------------------------------------- + +/** + * Extract COBOL symbols using a single-pass state machine. + * Extracts program name, paragraphs, sections, CALL, PERFORM, COPY, + * data items, file declarations, FD entries, and program metadata. + */ +export function extractCobolSymbolsWithRegex( + content: string, + _filePath: string, +): CobolRegexResults { + const rawLines = content.split(/\r?\n/); + + const result: CobolRegexResults = { + programName: null, + programs: [], + paragraphs: [], + sections: [], + performs: [], + calls: [], + copies: [], + dataItems: [], + fileDeclarations: [], + fdEntries: [], + programMetadata: {}, + execSqlBlocks: [], + execCicsBlocks: [], + procedureUsing: [], + entryPoints: [], + moves: [], + gotos: [], + sorts: [], + searches: [], + cancels: [], + execDliBlocks: [], + declaratives: [], + sets: [], + inspects: [], + initializes: [], + }; + + // --- State --- + let currentDivision: Division = null; + let currentDataSection: DataSection = 'unknown'; + let currentEnvSection: EnvironmentSection = null; + let currentParagraph: string | null = null; + + // Program boundary stack for nested PROGRAM-ID / END PROGRAM tracking + const programBoundaryStack: Array<{ name: string; startLine: number; procedureUsing?: string[]; isCommon?: boolean }> = []; + + // SELECT accumulator (multi-line) + let selectAccum: string | null = null; + let selectStartLine = 0; + + // PROCEDURE DIVISION USING on next line + let pendingProcUsing = false; + + // SORT/MERGE accumulator (multi-line SORT ... USING ... GIVING ...) + let sortAccum: string | null = null; + let sortStartLine = 0; + + // EXEC block accumulator (multi-line EXEC SQL / EXEC CICS / EXEC DLI) + let execAccum: { type: 'sql' | 'cics' | 'dli'; lines: string; startLine: number } | null = null; + + // DECLARATIVES state + let inDeclaratives = false; + + // INSPECT accumulator (multi-line) + let inspectAccum: string | null = null; + let inspectStartLine = 0; + + // CALL accumulator (multi-line CALL ... USING on separate lines) + let callAccum: string | null = null; + let callAccumLine = 0; + + // FD tracking: after seeing FD, the next 01-level data item is its record + let pendingFdName: string | null = null; + let pendingFdLine = 0; + + // Continuation line buffer + let pendingLine: string | null = null; + let pendingLineNumber = 0; + + // --- Detect source format: free vs fixed --- + // GnuCOBOL uses >>SOURCE FREE directive, typically in first 5 lines + let isFreeFormat = false; + for (let i = 0; i < Math.min(rawLines.length, 10); i++) { + if (/>>SOURCE\s+(?:FORMAT\s+(?:IS\s+)?)?FREE/i.test(rawLines[i])) { + isFreeFormat = true; + break; + } + } + + // --- Process each raw line --- + for (let i = 0; i < rawLines.length; i++) { + const raw = rawLines[i]; + + if (isFreeFormat) { + // FREE FORMAT: no column-position rules + // Skip >>SOURCE directive lines + if (/^[ \t]*>>/.test(raw)) continue; + // Skip free-format comment lines (*> at start of content) + const trimmed = raw.trimStart(); + if (trimmed.startsWith('*>') || trimmed.length === 0) continue; + // Strip inline *> comments (quote-aware) + let commentIdx = -1; + let ffInQuote: string | null = null; + for (let ci = 0; ci < raw.length - 1; ci++) { + const c = raw[ci]; + if (ffInQuote) { + if (c === ffInQuote) ffInQuote = null; + } else if (c === '"' || c === "'") { + ffInQuote = c; + } else if (c === '*' && raw[ci + 1] === '>') { + commentIdx = ci; + break; + } + } + const line = commentIdx >= 0 ? raw.substring(0, commentIdx) : raw; + // Free-format lines are logical lines (no continuation indicator) + const lineNum = i + 1; + processLogicalLine(line.trim(), lineNum); + continue; + } + + // FIXED FORMAT: column-position-based processing + + // Skip lines too short to have indicator area + if (raw.length < 7) { + // If there's a pending continuation, flush it + if (pendingLine !== null) { + processLogicalLine(pendingLine, pendingLineNumber); + pendingLine = null; + } + continue; + } + + const indicator = raw[6]; + + // Comment line: indicator is '*' or '/' + if (indicator === '*' || indicator === '/') { + continue; + } + + // Continuation line: indicator is '-' + if (indicator === '-') { + if (pendingLine !== null) { + const continuation = raw.substring(7).trimStart(); + // Handle literal continuation: if continuation starts with a quote, + // remove the trailing quote from the predecessor and skip the opening quote + if (continuation.length > 0 && (continuation[0] === '"' || continuation[0] === "'")) { + const quoteChar = continuation[0]; + const lastQuoteIdx = pendingLine.lastIndexOf(quoteChar); + if (lastQuoteIdx >= 0) { + pendingLine = pendingLine.substring(0, lastQuoteIdx) + continuation.substring(1); + } else { + pendingLine += continuation; + } + } else { + pendingLine += continuation; + } + } + continue; + } + + // Normal line — flush any pending continuation first + if (pendingLine !== null) { + processLogicalLine(pendingLine, pendingLineNumber); + pendingLine = null; + } + + // Strip inline Italian comments, then use area A+B (from col 7 onwards, + // but keep full line for indentation-sensitive paragraph/section detection) + const cleaned = stripInlineComment(raw); + + // Buffer as new pending logical line + pendingLine = cleaned; + pendingLineNumber = i + 1; // 1-indexed (consistent with free-format) + } + + // Flush final pending line + if (pendingLine !== null) { + processLogicalLine(pendingLine, pendingLineNumber); + } + + // Flush any pending SELECT + flushSelect(); + + // Flush any pending SORT/MERGE accumulator (truncated file without trailing period) + flushSort(); + + // Flush any pending INSPECT accumulator (truncated file without trailing period) + flushInspect(); + + // Flush any pending CALL accumulator (truncated file without trailing period) + flushCallAccum(); + + // Flush any pending EXEC block (truncated file without END-EXEC) + if (execAccum !== null) { + if (execAccum.type === 'sql') { + result.execSqlBlocks.push(parseExecSqlBlock(execAccum.lines, execAccum.startLine)); + } else if (execAccum.type === 'cics') { + result.execCicsBlocks.push(parseExecCicsBlock(execAccum.lines, execAccum.startLine)); + } else if (execAccum.type === 'dli') { + result.execDliBlocks.push(parseExecDliBlock(execAccum.lines, execAccum.startLine)); + } + execAccum = null; + } + + // If we saw an FD but never found its record, emit it without a record name + if (pendingFdName !== null) { + result.fdEntries.push({ fdName: pendingFdName, line: pendingFdLine }); + pendingFdName = null; + } + + // Finalize any remaining programs on the boundary stack (e.g., single-program + // files without END PROGRAM, or outermost programs in nested files) + while (programBoundaryStack.length > 0) { + const topProgram = programBoundaryStack.pop()!; + result.programs.push({ + name: topProgram.name, + startLine: topProgram.startLine, + endLine: rawLines.length, + nestingDepth: programBoundaryStack.length, + procedureUsing: topProgram.procedureUsing, + isCommon: topProgram.isCommon, + }); + } + // Sort by startLine so outer programs come first + if (result.programs.length > 1) { + result.programs.sort((a, b) => a.startLine - b.startLine); + } + + return result; + + // ========================================================================= + // Inner function: process one logical line (after continuation merging) + // ========================================================================= + function processLogicalLine(line: string, lineNum: number): void { + // --- EXEC block accumulation (spans any division) --- + if (execAccum !== null) { + execAccum.lines += ' ' + line; + if (RE_END_EXEC.test(line)) { + if (execAccum.type === 'sql') { + result.execSqlBlocks.push(parseExecSqlBlock(execAccum.lines, execAccum.startLine)); + } else if (execAccum.type === 'cics') { + result.execCicsBlocks.push(parseExecCicsBlock(execAccum.lines, execAccum.startLine)); + } else if (execAccum.type === 'dli') { + result.execDliBlocks.push(parseExecDliBlock(execAccum.lines, execAccum.startLine)); + } + execAccum = null; + } + return; // While accumulating, skip normal processing + } + + // Check for EXEC SQL / EXEC CICS start + // Flush any pending CALL accumulator before entering EXEC block + if (RE_EXEC_SQL_START.test(line)) { + flushCallAccum(); + execAccum = { type: 'sql', lines: line, startLine: lineNum }; + // If END-EXEC is on the same line, finalize immediately + if (RE_END_EXEC.test(line)) { + result.execSqlBlocks.push(parseExecSqlBlock(execAccum.lines, execAccum.startLine)); + execAccum = null; + } + return; + } + if (RE_EXEC_CICS_START.test(line)) { + flushCallAccum(); + execAccum = { type: 'cics', lines: line, startLine: lineNum }; + if (RE_END_EXEC.test(line)) { + result.execCicsBlocks.push(parseExecCicsBlock(execAccum.lines, execAccum.startLine)); + execAccum = null; + } + return; + } + if (RE_EXEC_DLI_START.test(line)) { + flushCallAccum(); + execAccum = { type: 'dli', lines: line, startLine: lineNum }; + if (RE_END_EXEC.test(line)) { + result.execDliBlocks.push(parseExecDliBlock(execAccum.lines, execAccum.startLine)); + execAccum = null; + } + return; + } + + // --- END PROGRAM boundary detection --- + const endProgramMatch = line.match(RE_END_PROGRAM); + if (endProgramMatch) { + // Flush any pending accumulators at program boundary + flushCallAccum(); + flushSort(); + flushInspect(); + const topProgram = programBoundaryStack.pop(); + if (topProgram) { + result.programs.push({ + name: topProgram.name, + startLine: topProgram.startLine, + endLine: lineNum, + nestingDepth: programBoundaryStack.length, + procedureUsing: topProgram.procedureUsing, + isCommon: topProgram.isCommon, + }); + } + return; + } + + // DECLARATIVES boundary detection + if (RE_DECLARATIVES_START.test(line)) { + inDeclaratives = true; + return; + } + if (RE_DECLARATIVES_END.test(line)) { + inDeclaratives = false; + return; + } + + // Detect PROGRAM-ID regardless of current division state (handles sibling + // programs after END PROGRAM where IDENTIFICATION DIVISION header is omitted) + if (currentDivision !== 'identification') { + const pgmIdMatch = line.match(RE_PROGRAM_ID); + if (pgmIdMatch) { + flushCallAccum(); + flushSort(); + flushInspect(); + extractIdentification(line, lineNum); + return; + } + } + + // --- Division transitions --- + const divMatch = line.match(RE_DIVISION); + if (divMatch) { + // Flush any pending accumulators on division boundary + flushSelect(); + flushCallAccum(); + flushSort(); + flushInspect(); + + const divName = divMatch[1].toUpperCase(); + switch (divName) { + case 'IDENTIFICATION': currentDivision = 'identification'; break; + case 'ENVIRONMENT': currentDivision = 'environment'; currentEnvSection = null; break; + case 'DATA': currentDivision = 'data'; currentDataSection = 'unknown'; break; + case 'PROCEDURE': { + currentDivision = 'procedure'; + currentParagraph = null; + const procUsingMatch = line.match(RE_PROC_USING); + if (procUsingMatch) { + const params = procUsingMatch[1].split(/\bRETURNING\b/i)[0].trim().split(/\s+/) + .filter(s => s.length > 0 && !USING_KEYWORDS.has(s.toUpperCase())); + result.procedureUsing = params; + // Store per-program on the boundary stack + const topProg = programBoundaryStack[programBoundaryStack.length - 1]; + if (topProg) topProg.procedureUsing = params; + pendingProcUsing = false; + } else { + // USING may be on the next line — flag for extractProcedure to pick up + // Only set if the line is NOT period-terminated (period = no USING clause) + pendingProcUsing = !/\.\s*$/.test(line); + } + break; + } + } + return; + } + + // --- Section transitions --- + const secMatch = line.match(RE_SECTION); + if (secMatch) { + flushSelect(); + + const secName = secMatch[1].toUpperCase(); + switch (secName) { + case 'WORKING-STORAGE': currentDivision = 'data'; currentDataSection = 'working-storage'; break; + case 'LINKAGE': currentDivision = 'data'; currentDataSection = 'linkage'; break; + case 'FILE': currentDivision = 'data'; currentDataSection = 'file'; break; + case 'LOCAL-STORAGE': currentDivision = 'data'; currentDataSection = 'local-storage'; break; + case 'SCREEN': currentDivision = 'data'; currentDataSection = 'screen'; break; + case 'INPUT-OUTPUT': currentDivision = 'environment'; currentEnvSection = 'input-output'; break; + case 'CONFIGURATION': currentDivision = 'environment'; currentEnvSection = 'configuration'; break; + } + return; + } + + // --- COPY (all divisions) --- + const copyQMatch = line.match(RE_COPY_QUOTED); + if (copyQMatch) { + result.copies.push({ target: copyQMatch[1] ?? copyQMatch[2], line: lineNum }); + } else { + const copyUMatch = line.match(RE_COPY_UNQUOTED); + if (copyUMatch) { + result.copies.push({ target: copyUMatch[1], line: lineNum }); + } + } + + // --- CALL (all divisions, typically procedure) --- + // Multi-line CALL accumulator: accumulate CALL statement until period or END-CALL. + // Continuation lines (not the start line) are consumed entirely — return after flush + // to prevent false paragraph detection on lines like "WS-ADDR." or "WS-CUST-CODE." + if (callAccum !== null) { + // Check if this continuation line starts a new COBOL statement (not a USING parameter). + // Use (?:\s|$) instead of \b to prevent matching hyphenated identifiers like MOVE-COUNT. + // Only use RE_PROC_PARAGRAPH as flush trigger when in Area A (≤7 leading spaces, fixed-format). + // In free-format, never use RE_PROC_PARAGRAPH (can't distinguish parameters from paragraphs). + const trimmedLine = line.trimStart(); + const leadingSpaces = (line.match(/^(\s*)/)?.[1].length ?? 0); + const isAreaAParagraph = RE_PROC_PARAGRAPH.test(line) && (!isFreeFormat ? leadingSpaces <= 7 : false); + if (RE_STATEMENT_VERB_START.test(trimmedLine) + || RE_PROC_SECTION.test(line) || isAreaAParagraph) { + flushCallAccum(); // Flush CALL without this line's content + // Fall through to process this line normally + } else { + callAccum += ' ' + line; + if (/\.\s*$/.test(callAccum) || /\bEND-CALL\b/i.test(callAccum)) { + flushCallAccum(); + } + return; // continuation line consumed by CALL accumulator + } + } else if (currentDivision === 'procedure' && /(?= 0) { + const afterUsing = fullSort.substring(usingIdx + 6); + const gIdx = afterUsing.toUpperCase().search(/\bGIVING\b/); + const usingText = gIdx >= 0 ? afterUsing.substring(0, gIdx) : afterUsing; + usingFiles.push(...usingText.trim().split(/\s+/).map(f => f.replace(/\.$/, '')).filter(f => /^[A-Z][A-Z0-9-]+$/i.test(f) && !SORT_CLAUSE_NOISE.has(f.toUpperCase()))); + } + if (givingIdx >= 0) { + const givingText = fullSort.substring(givingIdx + 7); + givingFiles.push(...givingText.trim().split(/\s+/).map(f => f.replace(/\.$/, '')).filter(f => /^[A-Z][A-Z0-9-]+$/i.test(f) && !SORT_CLAUSE_NOISE.has(f.toUpperCase()))); + } + // INPUT PROCEDURE IS / OUTPUT PROCEDURE IS → control-flow targets (like PERFORM) + // Supports optional THRU/THROUGH range: INPUT PROCEDURE IS proc-start THRU proc-end + const inputProcMatch = fullSort.match(/\bINPUT\s+PROCEDURE\s+(?:IS\s+)?([A-Z][A-Z0-9-]+)(?:\s+(?:THRU|THROUGH)\s+([A-Z][A-Z0-9-]+))?/i); + const outputProcMatch = fullSort.match(/\bOUTPUT\s+PROCEDURE\s+(?:IS\s+)?([A-Z][A-Z0-9-]+)(?:\s+(?:THRU|THROUGH)\s+([A-Z][A-Z0-9-]+))?/i); + if (inputProcMatch) { + result.performs.push({ caller: currentParagraph, target: inputProcMatch[1], thruTarget: inputProcMatch[2] || undefined, line: sortStartLine }); + } + if (outputProcMatch) { + result.performs.push({ caller: currentParagraph, target: outputProcMatch[1], thruTarget: outputProcMatch[2] || undefined, line: sortStartLine }); + } + result.sorts.push({ sortFile: smatch[1], usingFiles, givingFiles, line: sortStartLine }); + } + sortAccum = null; + } + + function flushInspect(): void { + if (inspectAccum === null) return; + const text = inspectAccum; + const fieldMatch = text.match(/\bINSPECT\s+([A-Z][A-Z0-9-]+)/i); + if (!fieldMatch) { inspectAccum = null; return; } + + const counters: string[] = []; + const tallySection = text.match(/\bTALLYING\b([\s\S]+?)(?:\bREPLACING\b|\bCONVERTING\b|\.\s*$)/i); + if (tallySection) { + const counterRe = /([A-Z][A-Z0-9-]+)\s+FOR\b/gi; + let cm: RegExpExecArray | null; + while ((cm = counterRe.exec(tallySection[1])) !== null) { + counters.push(cm[1]); + } + } + + const hasTallying = /\bTALLYING\b/i.test(text); + const hasReplacing = /\bREPLACING\b/i.test(text); + const hasConverting = /\bCONVERTING\b/i.test(text); + const form = hasConverting ? 'converting' as const + : hasTallying && hasReplacing ? 'tallying-replacing' as const + : hasTallying ? 'tallying' as const + : 'replacing' as const; + + result.inspects.push({ + inspectedField: fieldMatch[1], + counters, + form, + line: inspectStartLine, + caller: currentParagraph, + }); + inspectAccum = null; + } + + /** + * Flush accumulated multi-line CALL statement. Re-extracts CALL target + * and USING parameters from the full accumulated text. + */ + function flushCallAccum(): void { + if (callAccum === null) return; + const text = callAccum; + + // Extract quoted CALLs from the full statement + for (const callMatch of text.matchAll(RE_CALL)) { + const callTarget = callMatch[1] ?? callMatch[2]; + const afterCall = text.substring(callMatch.index! + callMatch[0].length); + const usingMatch = afterCall.match(RE_USING_PARAMS); + const parameters = usingMatch + ? usingMatch[1].split(/\bRETURNING\b/i)[0].trim().split(/\s+/) + .filter(s => s.length > 0 && !CALL_USING_FILTER.has(s.toUpperCase()) && /^[A-Z][A-Z0-9-]+$/i.test(s)) + : undefined; + const retMatch = afterCall.match(/\bRETURNING\s+([A-Z][A-Z0-9-]+)/i); + const returning = retMatch ? retMatch[1] : undefined; + result.calls.push({ target: callTarget, line: callAccumLine, isQuoted: true, parameters, returning }); + } + + // Extract dynamic CALLs from the full statement + for (const dynCallMatch of text.matchAll(RE_CALL_DYNAMIC)) { + const afterDynCall = text.substring(dynCallMatch.index! + dynCallMatch[0].length); + const dynUsingMatch = afterDynCall.match(RE_USING_PARAMS); + const dynParameters = dynUsingMatch + ? dynUsingMatch[1].split(/\bRETURNING\b/i)[0].trim().split(/\s+/) + .filter(s => s.length > 0 && !CALL_USING_FILTER.has(s.toUpperCase()) && /^[A-Z][A-Z0-9-]+$/i.test(s)) + : undefined; + const dynRetMatch = afterDynCall.match(/\bRETURNING\s+([A-Z][A-Z0-9-]+)/i); + const dynReturning = dynRetMatch ? dynRetMatch[1] : undefined; + result.calls.push({ target: dynCallMatch[1], line: callAccumLine, isQuoted: false, parameters: dynParameters, returning: dynReturning }); + } + + // Extract CANCELs from within the CALL block (common in ON EXCEPTION handlers) + for (const cancelMatch of text.matchAll(RE_CANCEL)) { + result.cancels.push({ target: cancelMatch[1] ?? cancelMatch[2], line: callAccumLine, isQuoted: true }); + } + for (const dynCancelMatch of text.matchAll(RE_CANCEL_DYNAMIC)) { + result.cancels.push({ target: dynCancelMatch[1], line: callAccumLine, isQuoted: false }); + } + + callAccum = null; + } + + // ========================================================================= + // DATA DIVISION extraction + // ========================================================================= + function extractData(line: string, lineNum: number): void { + // FD entry + const fdMatch = line.match(RE_FD); + if (fdMatch) { + // Flush any previous FD without a record + if (pendingFdName !== null) { + result.fdEntries.push({ fdName: pendingFdName, line: pendingFdLine }); + } + pendingFdName = fdMatch[1]; + pendingFdLine = lineNum; + return; + } + + // 88-level condition names + const lv88Match = line.match(RE_88_LEVEL); + if (lv88Match) { + const name = lv88Match[1]; + const values = parseConditionValues(lv88Match[2]); + result.dataItems.push({ + name, + level: 88, + line: lineNum, + values, + section: currentDataSection, + }); + return; + } + + // Level 66 RENAMES + const lv66Match = line.match(RE_66_LEVEL); + if (lv66Match) { + result.dataItems.push({ + name: lv66Match[1], + level: 66, + line: lineNum, + redefines: lv66Match[2], // RENAMES target stored as redefines + section: currentDataSection, + }); + return; + } + + // Anonymous REDEFINES (no name, e.g. "01 REDEFINES WK-PERIVAL.") + const anonRedefMatch = line.match(RE_ANONYMOUS_REDEFINES); + if (anonRedefMatch) { + // Check it's truly anonymous: the second capture is not a valid data name + // followed by more clauses — it's the REDEFINES target directly after level + const level = parseInt(anonRedefMatch[1], 10); + // Only skip if this is genuinely "NN REDEFINES target" with no name between + // We detect this by checking the full data item regex does NOT match + // (because RE_DATA_ITEM expects a name before any clauses) + const dataMatch = line.match(RE_DATA_ITEM); + if (!dataMatch || dataMatch[2].toUpperCase() === 'REDEFINES') { + // Truly anonymous — skip, no node + return; + } + } + + // Standard data items: level 01-49, 66, 77 + const dataMatch = line.match(RE_DATA_ITEM); + if (dataMatch) { + const level = parseInt(dataMatch[1], 10); + const name = dataMatch[2]; + const rest = dataMatch[3] || ''; + + // Skip FILLER + if (name.toUpperCase() === 'FILLER') return; + + // Valid levels: 01-49, 66, 77 + if ((level >= 1 && level <= 49) || level === 66 || level === 77) { + const clauses = parseDataItemClauses(rest); + + const item: CobolRegexResults['dataItems'][number] = { + name, + level, + line: lineNum, + section: currentDataSection, + }; + if (clauses.pic) item.pic = clauses.pic; + if (clauses.usage) item.usage = clauses.usage; + if (clauses.occurs !== undefined) item.occurs = clauses.occurs; + if (clauses.dependingOn) item.dependingOn = clauses.dependingOn; + if (clauses.redefines) item.redefines = clauses.redefines; + if (clauses.value) item.values = [clauses.value]; + if (clauses.isExternal) item.isExternal = true; + if (clauses.isGlobal) item.isGlobal = true; + + result.dataItems.push(item); + + // If there's a pending FD and this is a 01-level, it's the FD's record + if (pendingFdName !== null && level === 1) { + result.fdEntries.push({ + fdName: pendingFdName, + recordName: name, + line: pendingFdLine, + }); + pendingFdName = null; + } + } + } + } + + // ========================================================================= + // PROCEDURE DIVISION extraction + // ========================================================================= + function extractProcedure(line: string, lineNum: number): void { + // USE AFTER EXCEPTION in DECLARATIVES + if (inDeclaratives) { + const useMatch = line.match(RE_USE_AFTER); + if (useMatch) { + // Find the most recent section name + const lastSection = result.sections[result.sections.length - 1]; + if (lastSection) { + result.declaratives.push({ + sectionName: lastSection.name, + target: useMatch[1], + line: lineNum, + }); + } + return; + } + } + + // Handle PROCEDURE DIVISION USING on a continuation line + if (pendingProcUsing) { + const usingMatch = line.match(/\bUSING\s+([\s\S]*?)(?:\.|$)/i); + if (usingMatch) { + const params = usingMatch[1].split(/\bRETURNING\b/i)[0].trim().split(/\s+/) + .filter(s => s.length > 0 && !USING_KEYWORDS.has(s.toUpperCase())); + result.procedureUsing = params; + const topProg = programBoundaryStack[programBoundaryStack.length - 1]; + if (topProg) topProg.procedureUsing = params; + } + pendingProcUsing = false; + if (usingMatch) return; // consumed the USING line + } + + // Section header + const secMatch = line.match(RE_PROC_SECTION); + if (secMatch) { + const name = secMatch[1]; + if (!EXCLUDED_PARA_NAMES.has(name.toUpperCase()) && !name.toUpperCase().includes('DIVISION')) { + result.sections.push({ name, line: lineNum }); + // Don't set currentParagraph to section name — sections are Namespaces, + // not Functions. Setting it here would cause PERFORMs to be attributed + // to the section instead of the containing paragraph. + } + return; + } + + // Paragraph header + const paraMatch = line.match(RE_PROC_PARAGRAPH); + if (paraMatch) { + const name = paraMatch[1]; + // In fixed-format, paragraphs must start in Area A (col 8-11, max 7 leading spaces). + // Reject deeply-indented lines (Area B, 8+ spaces) to prevent false paragraphs from + // data items or CALL USING parameters on continuation lines. + const leadingSpaces = line.match(/^(\s*)/)?.[1].length ?? 0; + if (!isFreeFormat && leadingSpaces > 7) return; // Area B — not a paragraph + if (!EXCLUDED_PARA_NAMES.has(name.toUpperCase()) && !name.toUpperCase().startsWith('END-') && name.toUpperCase() !== 'DIVISION' && name.toUpperCase() !== 'SECTION') { + result.paragraphs.push({ name, line: lineNum }); + currentParagraph = name; + } + return; + } + + // PERFORM (global — captures multiple PERFORMs on the same logical line) + for (const perfMatch of line.matchAll(RE_PERFORM)) { + const target = perfMatch[1]; + // Skip COBOL inline-perform keywords that are not paragraph names + if (!PERFORM_KEYWORD_SKIP.has(target.toUpperCase())) { + // Also check for "PERFORM identifier TIMES" — the identifier is a + // data item count, not a paragraph name (fundamental regex ambiguity). + const matchEnd = perfMatch.index! + perfMatch[0].length; + const afterTarget = line.substring(matchEnd).trim(); + if (!/^TIMES\b/i.test(afterTarget)) { + result.performs.push({ + caller: currentParagraph, + target, + thruTarget: perfMatch[2] || undefined, + line: lineNum, + }); + } + } + } + + // ENTRY point + const entryMatch = line.match(RE_ENTRY); + if (entryMatch) { + const entryName = entryMatch[1] ?? entryMatch[2]; + const usingClause = entryMatch[3]; + if (entryName) { + result.entryPoints.push({ + name: entryName, + parameters: usingClause + ? usingClause.trim().split(/\s+/).filter(s => s.length > 0 && !USING_KEYWORDS.has(s.toUpperCase())) + : [], + line: lineNum, + }); + } + } + + // MOVE statement (skip literals and figurative constants) + const moveMatch = line.match(RE_MOVE); + if (moveMatch) { + const from = moveMatch[2].toUpperCase(); + if (!MOVE_SKIP.has(from)) { + const isCorresponding = !!moveMatch[1]; + // MOVE CORRESPONDING is always single-target per COBOL standard + const targets = isCorresponding + ? [moveMatch[3].replace(/\..*$/, '').trim().split(/\s+/)[0]].filter(t => /^[A-Z][A-Z0-9-]+$/i.test(t)) + : extractMoveTargets(moveMatch[3]); + + if (targets.length > 0) { + result.moves.push({ + from: moveMatch[2], + targets, + line: lineNum, + caller: currentParagraph, + corresponding: isCorresponding, + }); + } + } + } + + // GO TO — control flow transfer (handles GO TO p1 p2 p3 DEPENDING ON x) + const gotoMatch = line.match(RE_GOTO); + if (gotoMatch) { + const targets = gotoMatch[1].trim().split(/\s+/).filter(t => /^[A-Z][A-Z0-9-]+$/i.test(t)); + for (const target of targets) { + result.gotos.push({ caller: currentParagraph, target, line: lineNum }); + } + } + + // SORT / MERGE file references (multi-line: accumulate until period) + if (sortAccum !== null) { + // Continue accumulating SORT/MERGE statement + sortAccum += ' ' + line; + if (!/\.\s*$/.test(sortAccum)) return; // still accumulating — skip other extractors + // Period found — flush, then re-check line for a new SORT/MERGE after the period + flushSort(); + // After flushing, fall through to check if this line also starts a new SORT/MERGE + } + const sortMatch = line.match(RE_SORT) || line.match(RE_MERGE); + if (sortMatch && sortAccum === null) { + sortAccum = line; + sortStartLine = lineNum; + if (!/\.\s*$/.test(sortAccum)) return; // multi-line — wait for period + flushSort(); + } + + // INSPECT — multi-line accumulator (like SORT) + // If a real paragraph/section header or statement verb arrives during accumulation, + // flush the INSPECT as-is and process the line normally. + if (inspectAccum !== null) { + const inspTrimmed = line.trimStart(); + const inspLeading = (line.match(/^(\s*)/)?.[1].length ?? 0); + const inspIsAreaAPara = RE_PROC_PARAGRAPH.test(line) && (!isFreeFormat ? inspLeading <= 7 : false); + if (RE_PROC_SECTION.test(line) || inspIsAreaAPara + || RE_STATEMENT_VERB_START.test(inspTrimmed) + || /^CALL(?:\s|$)/i.test(inspTrimmed)) { + flushInspect(); + // Fall through to process this line normally + } else { + inspectAccum += ' ' + line; + if (/\.\s*$/.test(inspectAccum)) { + flushInspect(); + } else { + return; + } + } + } + const inspectMatch = line.match(/\bINSPECT\s+([A-Z][A-Z0-9-]+)/i); + if (inspectMatch && inspectAccum === null) { + inspectAccum = line; + inspectStartLine = lineNum; + if (!/\.\s*$/.test(inspectAccum)) return; + flushInspect(); + } + + // SEARCH — table access + const searchMatch = line.match(RE_SEARCH); + if (searchMatch) { + result.searches.push({ target: searchMatch[1], line: lineNum }); + } + + // CANCEL — program lifecycle (global matchAll captures multiple CANCELs on same line) + for (const cancelMatch of line.matchAll(RE_CANCEL)) { + result.cancels.push({ target: cancelMatch[1] ?? cancelMatch[2], line: lineNum, isQuoted: true }); + } + // Dynamic CANCEL — RE_CANCEL_DYNAMIC cannot match quoted targets, no dedup guard needed + for (const dynCancelMatch of line.matchAll(RE_CANCEL_DYNAMIC)) { + result.cancels.push({ target: dynCancelMatch[1], line: lineNum, isQuoted: false }); + } + + // SET statement (condition, index) + const setTrueMatch = line.match(RE_SET_TO_TRUE); + if (setTrueMatch) { + const targets = setTrueMatch[1].trim().split(/\s+/) + .filter(t => /^[A-Z][A-Z0-9-]+$/i.test(t) && t.toUpperCase() !== 'OF'); + if (targets.length > 0) { + result.sets.push({ targets, form: 'to-true', line: lineNum, caller: currentParagraph }); + } + } else { + const setIdxMatch = line.match(RE_SET_INDEX); + if (setIdxMatch) { + const targets = setIdxMatch[1].trim().split(/\s+/) + .filter(t => /^[A-Z][A-Z0-9-]+$/i.test(t)); + const mode = setIdxMatch[2].toUpperCase(); + const form = mode === 'TO' ? 'to-value' as const + : mode.startsWith('UP') ? 'up-by' as const + : 'down-by' as const; + result.sets.push({ targets, form, value: setIdxMatch[3], line: lineNum, caller: currentParagraph }); + } + } + + // INITIALIZE — data reset (multi-target: INITIALIZE WS-A WS-B WS-C.) + const initMatch = line.match(RE_INITIALIZE); + if (initMatch) { + const targets = initMatch[1].trim().split(/\s+/) + .filter(t => /^[A-Z][A-Z0-9-]+$/i.test(t) && !INITIALIZE_CLAUSE_KEYWORDS.has(t.toUpperCase())); + for (const target of targets) { + result.initializes.push({ target, line: lineNum, caller: currentParagraph }); + } + } + } +} diff --git a/gitnexus/src/core/ingestion/cobol/jcl-parser.ts b/gitnexus/src/core/ingestion/cobol/jcl-parser.ts new file mode 100644 index 0000000000..af1136410e --- /dev/null +++ b/gitnexus/src/core/ingestion/cobol/jcl-parser.ts @@ -0,0 +1,263 @@ +/** + * JCL Parser — Regex single-pass extraction. + * + * Extracts JCL constructs from mainframe job streams: + * - JOB statements (job name, CLASS, MSGCLASS) + * - EXEC statements (step -> program or proc) + * - DD statements (dataset references, DISP) + * - PROC definitions (in-stream and catalogued) + * - INCLUDE MEMBER= directives + * - SET symbolic parameters + * - IF/ELSE/ENDIF conditional execution + * - JCLLIB ORDER= search paths + * + * Pattern follows cobol-preprocessor.ts — regex-only, no tree-sitter. + */ + +export interface JclParseResults { + jobs: Array<{ name: string; line: number; class?: string; msgclass?: string }>; + steps: Array<{ name: string; jobName: string; program?: string; proc?: string; line: number }>; + ddStatements: Array<{ ddName: string; stepName: string; dataset?: string; disp?: string; line: number }>; + procs: Array<{ name: string; line: number; isInStream: boolean }>; + includes: Array<{ member: string; line: number }>; + sets: Array<{ variable: string; value: string; line: number }>; + jcllib: Array<{ order: string[]; line: number }>; + conditionals: Array<{ type: 'IF' | 'ELSE' | 'ENDIF'; condition?: string; line: number }>; +} + +// ── JCL statement patterns ───────────────────────────────────────────── + +// JCL continuation: line ends with a non-blank in col 72, next line starts with // +// We handle continuations by joining lines before matching. + +/** Match //jobname JOB ... */ +const JOB_RE = /^\/\/(\w{1,8})\s+JOB\s+(.*)/i; + +/** Match //stepname EXEC PGM=program or //stepname EXEC procname */ +const EXEC_RE = /^\/\/(\w{1,8})\s+EXEC\s+(.*)/i; + +/** Match //ddname DD ... */ +const DD_RE = /^\/\/(\w{1,8})\s+DD\s+(.*)/i; + +/** Match // JCLLIB ORDER=(lib1,lib2,...) */ +const JCLLIB_RE = /^\/\/\s+JCLLIB\s+ORDER=\(([^)]+)\)/i; + +/** Match // IF condition THEN */ +const IF_RE = /^\/\/\s+IF\s+(.+)\s+THEN/i; + +/** Match // ELSE */ +const ELSE_RE = /^\/\/\s+ELSE\b/i; + +/** Match // ENDIF */ +const ENDIF_RE = /^\/\/\s+ENDIF\b/i; + +/** Match // INCLUDE MEMBER=name */ +const INCLUDE_RE = /^\/\/\s+INCLUDE\s+MEMBER=(\w+)/i; + +/** Match // SET var=value */ +const SET_RE = /^\/\/\s+SET\s+(\w+)=(.+)/i; + +/** Match // PROC or //name PROC */ +const PROC_RE = /^\/\/(\w*)\s+PROC\b/i; + +/** Match // PEND */ +const PEND_RE = /^\/\/\s+PEND\b/i; + +// ── Parameter extractors ─────────────────────────────────────────────── + +function extractParam(params: string, key: string): string | undefined { + // Match KEY=VALUE or KEY='VALUE' in JCL parameter string + const re = new RegExp(`${key}=(?:'([^']*)'|(\\S+?))(?:[,\\s]|$)`, 'i'); + const m = params.match(re); + return m ? (m[1] ?? m[2]) : undefined; +} + +function extractPgm(params: string): string | undefined { + return extractParam(params, 'PGM'); +} + +function extractProc(params: string): string | undefined { + // If no PGM= keyword, the first positional parameter is the proc name + if (/PGM=/i.test(params)) return undefined; + const cleaned = params.replace(/,.*/, '').trim(); + // Proc name is the first token (no = sign) + if (cleaned && !cleaned.includes('=')) { + return cleaned.replace(/[,\s].*/s, '').toUpperCase(); + } + return undefined; +} + +function extractDsn(params: string): string | undefined { + return extractParam(params, 'DSN') ?? extractParam(params, 'DSNAME'); +} + +function extractDisp(params: string): string | undefined { + const m = params.match(/DISP=\(?\s*([^),\s]+)/i); + return m ? m[1] : undefined; +} + +/** + * Parse a JCL file and extract all constructs. + * + * @param content - Raw JCL file content + * @param filePath - Path for diagnostics (not used in extraction) + * @returns Parsed JCL results + */ +export function parseJcl(content: string, filePath: string): JclParseResults { + const results: JclParseResults = { + jobs: [], + steps: [], + ddStatements: [], + procs: [], + includes: [], + sets: [], + jcllib: [], + conditionals: [], + }; + + const rawLines = content.split(/\r?\n/); + // Join continuation lines: a line ending with non-blank in col 71 (0-indexed) + // followed by a line starting with // is a continuation. + const lines: Array<{ text: string; lineNum: number }> = []; + let i = 0; + while (i < rawLines.length) { + let line = rawLines[i]; + const lineNum = i + 1; + + // JCL continuation: if line is exactly 72+ chars and col 72 is non-blank + // and the next line starts with //, join them. + while ( + i + 1 < rawLines.length && + line.length >= 72 && + line[71] !== ' ' && + rawLines[i + 1].startsWith('//') + ) { + i++; + // Continuation text starts after // and leading spaces + const contText = rawLines[i].substring(2).replace(/^\s+/, ' '); + // Remove the continuation marker (col 72+) from current line + line = line.substring(0, 71).trimEnd() + contText; + } + + lines.push({ text: line, lineNum }); + i++; + } + + let currentJobName = ''; + let currentStepName = ''; + let inStreamProcName = ''; + + for (const { text, lineNum } of lines) { + // Skip JCL comments (starting with //* ) + if (text.startsWith('//*')) continue; + // Skip non-JCL lines (don't start with //) + if (!text.startsWith('//')) continue; + + // PROC definition (in-stream) + const procMatch = text.match(PROC_RE); + if (procMatch) { + const procName = procMatch[1] || inStreamProcName; + if (procName) { + results.procs.push({ name: procName.toUpperCase(), line: lineNum, isInStream: true }); + } + inStreamProcName = procName?.toUpperCase() || ''; + continue; + } + + // PEND (end of in-stream proc) + if (PEND_RE.test(text)) { + inStreamProcName = ''; + continue; + } + + // JCLLIB ORDER= + const jcllibMatch = text.match(JCLLIB_RE); + if (jcllibMatch) { + const libs = jcllibMatch[1].split(',').map(s => s.trim().replace(/'/g, '')); + results.jcllib.push({ order: libs, line: lineNum }); + continue; + } + + // IF/ELSE/ENDIF + const ifMatch = text.match(IF_RE); + if (ifMatch) { + results.conditionals.push({ type: 'IF', condition: ifMatch[1].trim(), line: lineNum }); + continue; + } + if (ELSE_RE.test(text)) { + results.conditionals.push({ type: 'ELSE', line: lineNum }); + continue; + } + if (ENDIF_RE.test(text)) { + results.conditionals.push({ type: 'ENDIF', line: lineNum }); + continue; + } + + // INCLUDE MEMBER= + const includeMatch = text.match(INCLUDE_RE); + if (includeMatch) { + results.includes.push({ member: includeMatch[1].toUpperCase(), line: lineNum }); + continue; + } + + // SET var=value + const setMatch = text.match(SET_RE); + if (setMatch) { + results.sets.push({ + variable: setMatch[1].toUpperCase(), + value: setMatch[2].trim().replace(/,\s*$/, ''), + line: lineNum, + }); + continue; + } + + // JOB statement + const jobMatch = text.match(JOB_RE); + if (jobMatch) { + currentJobName = jobMatch[1].toUpperCase(); + const params = jobMatch[2]; + results.jobs.push({ + name: currentJobName, + line: lineNum, + class: extractParam(params, 'CLASS'), + msgclass: extractParam(params, 'MSGCLASS'), + }); + continue; + } + + // EXEC statement + const execMatch = text.match(EXEC_RE); + if (execMatch) { + currentStepName = execMatch[1].toUpperCase(); + const params = execMatch[2]; + const pgm = extractPgm(params); + const proc = pgm ? undefined : extractProc(params); + + results.steps.push({ + name: currentStepName, + jobName: currentJobName, + program: pgm?.toUpperCase(), + proc: proc?.toUpperCase(), + line: lineNum, + }); + continue; + } + + // DD statement + const ddMatch = text.match(DD_RE); + if (ddMatch) { + const ddName = ddMatch[1].toUpperCase(); + const params = ddMatch[2]; + results.ddStatements.push({ + ddName, + stepName: currentStepName, + dataset: extractDsn(params)?.toUpperCase(), + disp: extractDisp(params)?.toUpperCase(), + line: lineNum, + }); + continue; + } + } + + return results; +} diff --git a/gitnexus/src/core/ingestion/cobol/jcl-processor.ts b/gitnexus/src/core/ingestion/cobol/jcl-processor.ts new file mode 100644 index 0000000000..3709b4b531 --- /dev/null +++ b/gitnexus/src/core/ingestion/cobol/jcl-processor.ts @@ -0,0 +1,274 @@ +/** + * JCL Processor — Converts JCL parse results into graph nodes and edges. + * + * Maps JCL entities to existing graph types (no new tables): + * - Job -> CodeElement (description: "jcl-job class:A msgclass:X") + * - Step -> CodeElement (description: "jcl-step pgm:PROGRAMNAME") + * - Dataset -> CodeElement (description: "jcl-dataset disp:SHR") + * - PROC -> Module + * + * Edges: + * - Job CONTAINS Step + * - Step CALLS Module (when PGM= matches an indexed program) + * - Step references Dataset (CALLS edge with reason "jcl-dd") + * - Job/Step IMPORTS PROC + * + * Pattern follows detectCrossProgamContracts() in pipeline.ts. + */ + +import { parseJcl, type JclParseResults } from './jcl-parser.js'; +import type { KnowledgeGraph } from '../../graph/types.js'; +import { generateId } from '../../../lib/utils.js'; + +export interface JclProcessResult { + jobCount: number; + stepCount: number; + datasetCount: number; + programLinks: number; +} + +/** + * Process JCL files and integrate into the knowledge graph. + * + * @param graph - The in-memory knowledge graph + * @param jclPaths - File paths of JCL files + * @param jclContents - Map of path -> file content + * @returns Summary of what was added + */ +export function processJclFiles( + graph: KnowledgeGraph, + jclPaths: string[], + jclContents: Map, +): JclProcessResult { + let jobCount = 0; + let stepCount = 0; + let datasetCount = 0; + let programLinks = 0; + + // Collect all Module names for step -> program linking + const moduleNames = new Map(); // uppercase name -> node id + graph.forEachNode(node => { + if (node.label === 'Module') { + const nodeName = node.properties.name; + if (typeof nodeName === 'string') { + moduleNames.set(nodeName.toUpperCase(), node.id); + } + } + }); + + for (const filePath of jclPaths) { + const content = jclContents.get(filePath); + if (!content) continue; + + const parsed = parseJcl(content, filePath); + const result = integrateJclResults(graph, parsed, filePath, moduleNames); + + jobCount += result.jobCount; + stepCount += result.stepCount; + datasetCount += result.datasetCount; + programLinks += result.programLinks; + } + + return { jobCount, stepCount, datasetCount, programLinks }; +} + +function integrateJclResults( + graph: KnowledgeGraph, + parsed: JclParseResults, + filePath: string, + moduleNames: Map, +): JclProcessResult { + let jobCount = 0; + let stepCount = 0; + let datasetCount = 0; + let programLinks = 0; + + // Track step node IDs for DD -> step linking + const stepNodeIds = new Map(); // stepName -> nodeId + + // 1. Create Job nodes + for (const job of parsed.jobs) { + const jobId = generateId('CodeElement', `${filePath}:job:${job.name}`); + const classPart = job.class ? ` class:${job.class}` : ''; + const msgPart = job.msgclass ? ` msgclass:${job.msgclass}` : ''; + + graph.addNode({ + id: jobId, + label: 'CodeElement', + properties: { + name: job.name, + filePath, + startLine: job.line, + endLine: job.line, + description: `jcl-job${classPart}${msgPart}`, + }, + }); + + // Link File -> Job (CONTAINS) + const fileId = generateId('File', filePath); + graph.addRelationship({ + id: `${fileId}_contains_${jobId}`, + type: 'CONTAINS', + sourceId: fileId, + targetId: jobId, + confidence: 1.0, + reason: 'jcl-job', + }); + + jobCount++; + } + + // 1.5 Pre-register in-stream PROCs so steps can reference them + // (fixes ordering bug: steps processed before PROCs were registered) + for (const proc of parsed.procs) { + const procId = generateId('Module', `${filePath}:proc:${proc.name}`); + moduleNames.set(proc.name.toUpperCase(), procId); + } + + // 2. Create Step nodes and link to programs + for (const step of parsed.steps) { + const stepId = generateId('CodeElement', `${filePath}:step:${step.jobName}:${step.name}`); + const pgmPart = step.program ? ` pgm:${step.program}` : ''; + const procPart = step.proc ? ` proc:${step.proc}` : ''; + + graph.addNode({ + id: stepId, + label: 'CodeElement', + properties: { + name: step.name, + filePath, + startLine: step.line, + endLine: step.line, + description: `jcl-step${pgmPart}${procPart}`, + }, + }); + + stepNodeIds.set(step.name, stepId); + + // Link Job -> Step (CONTAINS) + if (step.jobName) { + const jobId = generateId('CodeElement', `${filePath}:job:${step.jobName}`); + graph.addRelationship({ + id: `${jobId}_contains_${stepId}`, + type: 'CONTAINS', + sourceId: jobId, + targetId: stepId, + confidence: 1.0, + reason: 'jcl-step', + }); + } + + // Link Step -> Module (CALLS) when PGM= matches an indexed program + if (step.program) { + const moduleId = moduleNames.get(step.program.toUpperCase()); + if (moduleId) { + graph.addRelationship({ + id: `${stepId}_calls_${moduleId}`, + type: 'CALLS', + sourceId: stepId, + targetId: moduleId, + confidence: 0.95, + reason: 'jcl-exec-pgm', + }); + programLinks++; + } + } + + // Link Step -> PROC (CALLS) — PROC as Module + if (step.proc) { + const procModuleId = moduleNames.get(step.proc.toUpperCase()); + if (procModuleId) { + graph.addRelationship({ + id: `${stepId}_calls_proc_${procModuleId}`, + type: 'CALLS', + sourceId: stepId, + targetId: procModuleId, + confidence: 0.9, + reason: 'jcl-exec-proc', + }); + } + } + + stepCount++; + } + + // 3. Create Dataset nodes from DD statements + const seenDatasets = new Set(); + for (const dd of parsed.ddStatements) { + if (!dd.dataset) continue; + + // Create dataset node (deduplicated per file) + const datasetKey = `${filePath}:dataset:${dd.dataset}`; + const datasetId = generateId('CodeElement', datasetKey); + + if (!seenDatasets.has(dd.dataset)) { + const dispPart = dd.disp ? ` disp:${dd.disp}` : ''; + graph.addNode({ + id: datasetId, + label: 'CodeElement', + properties: { + name: dd.dataset, + filePath, + startLine: dd.line, + endLine: dd.line, + + description: `jcl-dataset${dispPart}`, + }, + }); + seenDatasets.add(dd.dataset); + datasetCount++; + } + + // Link Step -> Dataset (CALLS with reason jcl-dd) + const stepId = stepNodeIds.get(dd.stepName); + if (stepId) { + graph.addRelationship({ + id: `${stepId}_dd_${dd.ddName}_${datasetId}`, + type: 'CALLS', + sourceId: stepId, + targetId: datasetId, + confidence: 0.85, + reason: `jcl-dd:${dd.ddName}`, + }); + } + } + + // 4. Create PROC nodes (in-stream procs as Module) + for (const proc of parsed.procs) { + if (!proc.isInStream) continue; + + const procId = generateId('Module', `${filePath}:proc:${proc.name}`); + graph.addNode({ + id: procId, + label: 'Module', + properties: { + name: proc.name, + filePath, + startLine: proc.line, + endLine: proc.line, + description: 'jcl-proc-instream', + }, + }); + + // Register for step linking + moduleNames.set(proc.name.toUpperCase(), procId); + } + + // 5. INCLUDE directives -> IMPORTS edges + for (const inc of parsed.includes) { + const moduleId = moduleNames.get(inc.member.toUpperCase()); + if (moduleId) { + const fileId = generateId('File', filePath); + graph.addRelationship({ + id: `${fileId}_includes_${moduleId}`, + type: 'IMPORTS', + sourceId: fileId, + targetId: moduleId, + confidence: 0.9, + reason: 'jcl-include', + }); + } + } + + return { jobCount, stepCount, datasetCount, programLinks }; +} diff --git a/gitnexus/src/core/ingestion/entry-point-scoring.ts b/gitnexus/src/core/ingestion/entry-point-scoring.ts index f6128f9fd8..555051ba20 100644 --- a/gitnexus/src/core/ingestion/entry-point-scoring.ts +++ b/gitnexus/src/core/ingestion/entry-point-scoring.ts @@ -226,6 +226,7 @@ export const ENTRY_POINT_PATTERNS = { /^onEvent$/, // BLoC event handler /^mapEventToState$/, // Legacy BLoC pattern ], + [SupportedLanguages.Cobol]: [], // Standalone regex processor — no tree-sitter entry points } satisfies Record; /** Pre-computed merged patterns (universal + language-specific) to avoid per-call array allocation. */ @@ -325,7 +326,7 @@ export function calculateEntryPointScore( // Check positive patterns const allPatterns = MERGED_ENTRY_POINT_PATTERNS[language]; - if (allPatterns.some(p => p.test(name))) { + if (allPatterns?.some(p => p.test(name))) { nameMultiplier = 1.5; // Bonus for matching entry point pattern reasons.push('entry-pattern'); } diff --git a/gitnexus/src/core/ingestion/framework-detection.ts b/gitnexus/src/core/ingestion/framework-detection.ts index 348a68d475..5d7e5cb756 100644 --- a/gitnexus/src/core/ingestion/framework-detection.ts +++ b/gitnexus/src/core/ingestion/framework-detection.ts @@ -601,6 +601,7 @@ export const AST_FRAMEWORK_PATTERNS_BY_LANGUAGE = { { framework: 'flutter', entryPointMultiplier: 2.5, reason: 'flutter-widget', patterns: FRAMEWORK_AST_PATTERNS.flutter }, { framework: 'riverpod', entryPointMultiplier: 2.8, reason: 'riverpod-pattern', patterns: FRAMEWORK_AST_PATTERNS.riverpod }, ], + [SupportedLanguages.Cobol]: [], // Standalone regex processor — no AST framework patterns } satisfies Record; /** Pre-lowercased patterns for O(1) pattern matching at runtime */ diff --git a/gitnexus/src/core/ingestion/language-provider.ts b/gitnexus/src/core/ingestion/language-provider.ts index b7449d33fd..1ac6ffbcf1 100644 --- a/gitnexus/src/core/ingestion/language-provider.ts +++ b/gitnexus/src/core/ingestion/language-provider.ts @@ -40,7 +40,12 @@ interface LanguageProviderConfig { readonly extensions: readonly string[]; // ── Parser ──────────────────────────────────────────────────────── - /** Tree-sitter query strings for definitions, imports, calls, heritage */ + /** Parse strategy: 'tree-sitter' (default) uses AST parsing via tree-sitter. + * 'standalone' means the language has its own regex-based processor and + * should be skipped by the tree-sitter pipeline (e.g., COBOL, Markdown). */ + readonly parseStrategy?: 'tree-sitter' | 'standalone'; + /** Tree-sitter query strings for definitions, imports, calls, heritage. + * Required for tree-sitter languages; empty string for standalone processors. */ readonly treeSitterQueries: string; // ── Core (required) ─────────────────────────────────────────────── diff --git a/gitnexus/src/core/ingestion/languages/cobol.ts b/gitnexus/src/core/ingestion/languages/cobol.ts new file mode 100644 index 0000000000..ff17653df0 --- /dev/null +++ b/gitnexus/src/core/ingestion/languages/cobol.ts @@ -0,0 +1,27 @@ +/** + * COBOL Language Provider + * + * Standalone regex-based processor — no tree-sitter grammar. + * COBOL files (.cbl, .cob, .cobol, .cpy, .copybook) are detected and + * processed by cobol-processor.ts in pipeline Phase 2.6, not by the + * tree-sitter pipeline. + * + * This provider exists to satisfy the SupportedLanguages exhaustiveness + * checks and to declare parseStrategy: 'standalone'. + */ +import { SupportedLanguages } from '../../../config/supported-languages.js'; +import { defineLanguage } from '../language-provider.js'; + +export const cobolProvider = defineLanguage({ + id: SupportedLanguages.Cobol, + parseStrategy: 'standalone', + extensions: [], // COBOL files detected by cobol-processor's isCobolFile/isJclFile + treeSitterQueries: '', + typeConfig: { + declarationNodeTypes: new Set(), + extractDeclaration: () => null, + extractParameter: () => null, + }, + exportChecker: () => false, + importResolver: () => null, +}); diff --git a/gitnexus/src/core/ingestion/languages/index.ts b/gitnexus/src/core/ingestion/languages/index.ts index d9984f990d..56e6154f76 100644 --- a/gitnexus/src/core/ingestion/languages/index.ts +++ b/gitnexus/src/core/ingestion/languages/index.ts @@ -23,6 +23,7 @@ import { phpProvider } from './php.js'; import { rubyProvider } from './ruby.js'; import { swiftProvider } from './swift.js'; import { dartProvider } from './dart.js'; +import { cobolProvider } from './cobol.js'; export const providers = { [SupportedLanguages.JavaScript]: javascriptProvider, @@ -39,6 +40,7 @@ export const providers = { [SupportedLanguages.Ruby]: rubyProvider, [SupportedLanguages.Swift]: swiftProvider, [SupportedLanguages.Dart]: dartProvider, + [SupportedLanguages.Cobol]: cobolProvider, } satisfies Record; /** Get provider by language enum (always succeeds for SupportedLanguages). */ diff --git a/gitnexus/src/core/ingestion/pipeline.ts b/gitnexus/src/core/ingestion/pipeline.ts index 983f9f774e..39294214ed 100644 --- a/gitnexus/src/core/ingestion/pipeline.ts +++ b/gitnexus/src/core/ingestion/pipeline.ts @@ -1,6 +1,7 @@ import { createKnowledgeGraph } from '../graph/graph.js'; import { processStructure } from './structure-processor.js'; import { processMarkdown } from './markdown-processor.js'; +import { processCobol, isCobolFile, isJclFile } from './cobol-processor.js'; import { processParsing } from './parsing-processor.js'; import { processImports, @@ -464,6 +465,14 @@ async function runScanAndStructure( stats: { filesProcessed: totalFiles, totalFiles, nodesCreated: graph.nodeCount }, }); + // ── Custom (non-tree-sitter) processors ───────────────────────────── + // Each custom processor follows the pattern in markdown-processor.ts: + // 1. Export a process function: (graph, files, allPathSet) => result + // 2. Export a file detection function: (path) => boolean + // 3. Filter files by extension, write nodes/edges directly to graph + // To add a new language: create a new processor file, import it here, + // and add a filter-read-call-log block following the pattern below. + // ── Phase 2.5: Markdown processing (headings + cross-links) ──────── const mdScanned = scannedFiles.filter(f => f.path.endsWith('.md') || f.path.endsWith('.mdx')); if (mdScanned.length > 0) { @@ -478,6 +487,26 @@ async function runScanAndStructure( } } + // ── Phase 2.6: COBOL processing (regex extraction, no tree-sitter) ── + const cobolScanned = scannedFiles.filter(f => isCobolFile(f.path) || isJclFile(f.path)); + if (cobolScanned.length > 0) { + const cobolContents = await readFileContents(repoPath, cobolScanned.map(f => f.path)); + const cobolFiles = cobolScanned + .filter(f => cobolContents.has(f.path)) + .map(f => ({ path: f.path, content: cobolContents.get(f.path)! })); + const allPathSet = new Set(allPaths); + const cobolResult = processCobol(graph, cobolFiles, allPathSet); + if (isDev) { + console.log(` COBOL: ${cobolResult.programs} programs, ${cobolResult.paragraphs} paragraphs, ${cobolResult.sections} sections from ${cobolFiles.length} files`); + if (cobolResult.execSqlBlocks > 0 || cobolResult.execCicsBlocks > 0 || cobolResult.entryPoints > 0) { + console.log(` COBOL enriched: ${cobolResult.execSqlBlocks} SQL blocks, ${cobolResult.execCicsBlocks} CICS blocks, ${cobolResult.entryPoints} entry points, ${cobolResult.moves} moves, ${cobolResult.fileDeclarations} file declarations`); + } + if (cobolResult.jclJobs > 0) { + console.log(` JCL: ${cobolResult.jclJobs} jobs, ${cobolResult.jclSteps} steps`); + } + } + } + return { scannedFiles, allPaths, totalFiles }; } diff --git a/gitnexus/src/core/ingestion/tree-sitter-queries.ts b/gitnexus/src/core/ingestion/tree-sitter-queries.ts index 9b7b456935..6cdc1c98dd 100644 --- a/gitnexus/src/core/ingestion/tree-sitter-queries.ts +++ b/gitnexus/src/core/ingestion/tree-sitter-queries.ts @@ -1134,4 +1134,5 @@ export const LANGUAGE_QUERIES: Record = { [SupportedLanguages.Ruby]: RUBY_QUERIES, [SupportedLanguages.Swift]: SWIFT_QUERIES, [SupportedLanguages.Dart]: DART_QUERIES, + [SupportedLanguages.Cobol]: '', // Standalone regex processor — no tree-sitter queries }; diff --git a/gitnexus/test/fixtures/lang-resolution/cobol-app/AUDITLOG.cbl b/gitnexus/test/fixtures/lang-resolution/cobol-app/AUDITLOG.cbl new file mode 100644 index 0000000000..53ecebadb2 --- /dev/null +++ b/gitnexus/test/fixtures/lang-resolution/cobol-app/AUDITLOG.cbl @@ -0,0 +1,25 @@ + IDENTIFICATION DIVISION. + PROGRAM-ID. AUDITLOG. + + DATA DIVISION. + WORKING-STORAGE SECTION. + 01 WS-LOG-MESSAGE PIC X(80). + 01 WS-TIMESTAMP PIC X(26). + + LINKAGE SECTION. + 01 LS-CUST-ID PIC 9(8). + 01 LS-AMOUNT PIC 9(7)V99. + + PROCEDURE DIVISION USING LS-CUST-ID LS-AMOUNT. + MAIN-PARAGRAPH. + PERFORM WRITE-LOG + GOBACK. + + WRITE-LOG. + STRING 'Customer ' LS-CUST-ID ' amount ' LS-AMOUNT + DELIMITED BY SIZE INTO WS-LOG-MESSAGE + DISPLAY WS-LOG-MESSAGE. + + ENTRY "AUDITLOG-BATCH" USING LS-CUST-ID. + DISPLAY 'Batch audit for ' LS-CUST-ID + GOBACK. diff --git a/gitnexus/test/fixtures/lang-resolution/cobol-app/COPYLIB.cpy b/gitnexus/test/fixtures/lang-resolution/cobol-app/COPYLIB.cpy new file mode 100644 index 0000000000..e78840d896 --- /dev/null +++ b/gitnexus/test/fixtures/lang-resolution/cobol-app/COPYLIB.cpy @@ -0,0 +1,3 @@ + 01 PREFIX-RECORD. + 05 PREFIX-CODE PIC X(10). + 05 PREFIX-NAME PIC X(30). diff --git a/gitnexus/test/fixtures/lang-resolution/cobol-app/CUSTDAT.cpy b/gitnexus/test/fixtures/lang-resolution/cobol-app/CUSTDAT.cpy new file mode 100644 index 0000000000..52428837ae --- /dev/null +++ b/gitnexus/test/fixtures/lang-resolution/cobol-app/CUSTDAT.cpy @@ -0,0 +1,6 @@ + 01 WS-CUSTOMER-DATA. + 05 WS-CUST-CODE PIC X(10). + 05 WS-CUST-TYPE PIC X(3). + 88 PREMIUM-CUSTOMER VALUE 'PRM'. + 88 REGULAR-CUSTOMER VALUE 'REG'. + 05 WS-CUST-ADDR PIC X(50). diff --git a/gitnexus/test/fixtures/lang-resolution/cobol-app/CUSTUPDT.cbl b/gitnexus/test/fixtures/lang-resolution/cobol-app/CUSTUPDT.cbl new file mode 100644 index 0000000000..978e289d13 --- /dev/null +++ b/gitnexus/test/fixtures/lang-resolution/cobol-app/CUSTUPDT.cbl @@ -0,0 +1,74 @@ + IDENTIFICATION DIVISION. + PROGRAM-ID. CUSTUPDT. + AUTHOR. TEST. + + ENVIRONMENT DIVISION. + INPUT-OUTPUT SECTION. + FILE-CONTROL. + SELECT CUSTOMER-FILE ASSIGN TO 'CUSTFILE' + ORGANIZATION IS INDEXED + ACCESS IS DYNAMIC + RECORD KEY IS CUST-ID + FILE STATUS IS WS-FILE-STATUS. + + DATA DIVISION. + FILE SECTION. + FD CUSTOMER-FILE. + 01 CUSTOMER-RECORD. + 05 CUST-ID PIC 9(8). + 05 CUST-NAME PIC X(30). + 05 CUST-BALANCE PIC 9(7)V99. + + WORKING-STORAGE SECTION. + 01 WS-FILE-STATUS PIC XX. + 01 WS-CUSTOMER-NAME PIC X(30). + 01 WS-AMOUNT PIC 9(7)V99. + 01 WS-EOF PIC 9 VALUE 0. + 88 END-OF-FILE VALUE 1. + 01 WS-AMT PIC 9(5)V99. + 01 WS-PROG-NAME PIC X(8). + 01 FIELD-A PIC 9(5)V99. + 01 FIELD-B PIC 9(5)V99. + COPY COPYLIB REPLACING ==PREFIX-== BY ==WS-==. + + LINKAGE SECTION. + 01 LS-PARAM PIC X(20). + + PROCEDURE DIVISION. + INIT-SECTION SECTION. + MAIN-PARAGRAPH. + PERFORM INIT-PARAGRAPH + PERFORM PROCESS-PARAGRAPH + PERFORM CLEANUP-PARAGRAPH + STOP RUN. + + INIT-PARAGRAPH. + OPEN I-O CUSTOMER-FILE + MOVE SPACES TO WS-CUSTOMER-NAME. + + PROCESSING-SECTION SECTION. + PROCESS-PARAGRAPH. + PERFORM READ-CUSTOMER THRU WRITE-CUSTOMER + CALL "AUDITLOG" USING CUST-ID WS-AMOUNT + CALL WS-PROG-NAME. + + READ-CUSTOMER. + READ CUSTOMER-FILE + NOT AT END + MOVE CUST-NAME TO WS-CUSTOMER-NAME + END-READ. + + UPDATE-BALANCE. + ADD WS-AMOUNT TO CUST-BALANCE + MOVE WS-AMOUNT TO CUST-BALANCE + MOVE WS-AMT TO FIELD-A FIELD-B. + + WRITE-CUSTOMER. + REWRITE CUSTOMER-RECORD. + + CLEANUP-PARAGRAPH. + CLOSE CUSTOMER-FILE. + + ENTRY 'ALTENTRY' USING LS-PARAM. + DISPLAY 'ALTERNATE ENTRY POINT' + GOBACK. diff --git a/gitnexus/test/fixtures/lang-resolution/cobol-app/NESTED.cbl b/gitnexus/test/fixtures/lang-resolution/cobol-app/NESTED.cbl new file mode 100644 index 0000000000..60af79ef7c --- /dev/null +++ b/gitnexus/test/fixtures/lang-resolution/cobol-app/NESTED.cbl @@ -0,0 +1,33 @@ + IDENTIFICATION DIVISION. + PROGRAM-ID. OUTER-PROG. + + DATA DIVISION. + WORKING-STORAGE SECTION. + 01 WS-OUTER-FLAG PIC 9 VALUE 0. + + PROCEDURE DIVISION. + OUTER-MAIN. + PERFORM OUTER-PROCESS + CALL "INNER-PROG" + STOP RUN. + + OUTER-PROCESS. + DISPLAY 'OUTER PROCESSING'. + + IDENTIFICATION DIVISION. + PROGRAM-ID. INNER-PROG. + + DATA DIVISION. + WORKING-STORAGE SECTION. + 01 WS-INNER-CODE PIC X(5). + + PROCEDURE DIVISION. + INNER-MAIN. + PERFORM INNER-PROCESS + GOBACK. + + INNER-PROCESS. + DISPLAY 'INNER PROCESSING'. + + END PROGRAM INNER-PROG. + END PROGRAM OUTER-PROG. diff --git a/gitnexus/test/fixtures/lang-resolution/cobol-app/RPTGEN.cbl b/gitnexus/test/fixtures/lang-resolution/cobol-app/RPTGEN.cbl new file mode 100644 index 0000000000..3ef2c137f2 --- /dev/null +++ b/gitnexus/test/fixtures/lang-resolution/cobol-app/RPTGEN.cbl @@ -0,0 +1,94 @@ + IDENTIFICATION DIVISION. + PROGRAM-ID. RPTGEN. + + DATA DIVISION. + WORKING-STORAGE SECTION. + COPY CUSTDAT. + 01 WS-REPORT-LINE PIC X(132). + 01 WS-SQL-CODE PIC S9(9) COMP. + 01 WS-COUNT PIC 9(4). + 01 WS-MAP-NAME PIC X(8). + 01 WS-SORT-FILE PIC X(8). + 01 WS-QUEUE-NAME PIC X(16). + 01 WS-NEXT-PGM PIC X(8). + + PROCEDURE DIVISION. + MAIN-PARAGRAPH. + PERFORM FETCH-DATA + PERFORM FORMAT-REPORT + PERFORM SEND-SCREEN + CALL "CUSTUPDT" + GO TO EXIT-PARAGRAPH. + + FETCH-DATA. + EXEC SQL + SELECT CUST_NAME, CUST_BALANCE + FROM CUSTOMER + WHERE CUST_ID = :WS-CUST-CODE + END-EXEC. + + FORMAT-REPORT. + PERFORM WS-COUNT TIMES + MOVE WS-CUST-CODE TO WS-REPORT-LINE + END-PERFORM + PERFORM MAIN-PARAGRAPH THRU FORMAT-REPORT + IF WS-COUNT > 0 PERFORM FETCH-DATA + ELSE PERFORM SEND-SCREEN + END-IF + SORT WS-SORT-FILE USING CUSTOMER-DATA + GIVING WS-REPORT-LINE. + SORT WS-SORT-FILE ON ASCENDING KEY WS-COUNT + INPUT PROCEDURE IS BUILD-SORT-INPUT + OUTPUT PROCEDURE IS WRITE-SORTED. + MOVE CORR WS-CUSTOMER-DATA TO WS-REPORT-LINE + SEARCH WS-CUSTOMER-DATA + GO TO FETCH-DATA FORMAT-REPORT SEND-SCREEN + DEPENDING ON WS-COUNT. + + SEND-SCREEN. + EXEC CICS + SEND MAP(WS-MAP-NAME) MAPSET('CUSTSET') + FROM(WS-REPORT-LINE) + END-EXEC. + + EXEC CICS + LINK PROGRAM('AUDITLOG') + END-EXEC. + + EXEC CICS + XCTL PROGRAM('CUSTUPDT') + END-EXEC. + + EXEC CICS + READ FILE('CUSTFILE') + INTO(WS-CUSTOMER-DATA) + END-EXEC. + + EXEC CICS + WRITEQ TS QUEUE('RPTQUEUE') + FROM(WS-REPORT-LINE) + END-EXEC. + + EXEC CICS + HANDLE ABEND LABEL(ABEND-HANDLER) + END-EXEC. + + EXEC CICS + RETURN TRANSID('RPTG') + END-EXEC. + + EXEC CICS + XCTL PROGRAM(WS-NEXT-PGM) + END-EXEC. + + BUILD-SORT-INPUT. + DISPLAY 'BUILDING SORT INPUT'. + + WRITE-SORTED. + DISPLAY 'WRITING SORTED OUTPUT'. + + ABEND-HANDLER. + DISPLAY 'ABEND OCCURRED'. + + EXIT-PARAGRAPH. + STOP RUN. diff --git a/gitnexus/test/fixtures/lang-resolution/cobol-app/RUNJOBS.jcl b/gitnexus/test/fixtures/lang-resolution/cobol-app/RUNJOBS.jcl new file mode 100644 index 0000000000..b737f337a3 --- /dev/null +++ b/gitnexus/test/fixtures/lang-resolution/cobol-app/RUNJOBS.jcl @@ -0,0 +1,5 @@ +//CUSTJOB JOB (ACCT),'CUSTOMER UPDATE',CLASS=A,MSGCLASS=X +//STEP1 EXEC PGM=CUSTUPDT +//CUSTFILE DD DSN=PROD.CUSTOMER.MASTER,DISP=SHR +//STEP2 EXEC PGM=RPTGEN +//SYSOUT DD SYSOUT=* diff --git a/gitnexus/test/integration/resolvers/cobol.test.ts b/gitnexus/test/integration/resolvers/cobol.test.ts new file mode 100644 index 0000000000..27fd83d3aa --- /dev/null +++ b/gitnexus/test/integration/resolvers/cobol.test.ts @@ -0,0 +1,608 @@ +/** + * COBOL: Exhaustive strict integration test. + * + * Every single node and edge produced by the COBOL/JCL pipeline is asserted + * with exact counts AND exact sorted edge-pair lists. No fuzzy assertions. + * + * Ground truth captured from the cobol-app fixture: + * CUSTUPDT.cbl, AUDITLOG.cbl, RPTGEN.cbl, NESTED.cbl, + * CUSTDAT.cpy, COPYLIB.cpy, RUNJOBS.jcl + */ +import { describe, it, expect, beforeAll } from 'vitest'; +import path from 'path'; +import { + FIXTURES, getRelationships, getNodesByLabel, edgeSet, + runPipelineFromRepo, type PipelineResult, +} from './helpers.js'; + +describe('COBOL full system extraction', () => { + let result: PipelineResult; + + beforeAll(async () => { + result = await runPipelineFromRepo( + path.join(FIXTURES, 'cobol-app'), + () => {}, + { skipGraphPhases: true }, + ); + }, 60000); + + // ===================================================================== + // NODE COMPLETENESS — exact count + exact sorted name list per label + // ===================================================================== + + describe('node completeness', () => { + + it('produces exactly 5 Module nodes', () => { + const nodes = getNodesByLabel(result, 'Module'); + expect(nodes.length).toBe(5); + expect(nodes).toEqual(['AUDITLOG', 'CUSTUPDT', 'INNER-PROG', 'OUTER-PROG', 'RPTGEN']); + }); + + it('produces exactly 21 Function nodes', () => { + const nodes = getNodesByLabel(result, 'Function'); + expect(nodes.length).toBe(21); + expect(nodes).toEqual([ + 'ABEND-HANDLER', 'BUILD-SORT-INPUT', 'CLEANUP-PARAGRAPH', + 'EXIT-PARAGRAPH', 'FETCH-DATA', 'FORMAT-REPORT', 'INIT-PARAGRAPH', + 'INNER-MAIN', 'INNER-PROCESS', + 'MAIN-PARAGRAPH', 'MAIN-PARAGRAPH', 'MAIN-PARAGRAPH', + 'OUTER-MAIN', 'OUTER-PROCESS', + 'PROCESS-PARAGRAPH', 'READ-CUSTOMER', 'SEND-SCREEN', + 'UPDATE-BALANCE', 'WRITE-CUSTOMER', 'WRITE-LOG', 'WRITE-SORTED', + ]); + }); + + it('produces exactly 2 Namespace nodes', () => { + expect(getNodesByLabel(result, 'Namespace')).toEqual(['INIT-SECTION', 'PROCESSING-SECTION']); + }); + + it('produces exactly 36 Property nodes', () => { + const nodes = getNodesByLabel(result, 'Property'); + expect(nodes.length).toBe(36); + expect(nodes).toEqual([ + 'CUST-BALANCE', 'CUST-ID', 'CUST-NAME', 'CUSTOMER-RECORD', + 'END-OF-FILE', 'FIELD-A', 'FIELD-B', + 'LS-AMOUNT', 'LS-CUST-ID', 'LS-PARAM', + 'PREMIUM-CUSTOMER', 'REGULAR-CUSTOMER', + 'WS-AMOUNT', 'WS-AMT', 'WS-CODE', 'WS-COUNT', + 'WS-CUST-ADDR', 'WS-CUST-CODE', 'WS-CUST-TYPE', + 'WS-CUSTOMER-DATA', 'WS-CUSTOMER-NAME', 'WS-EOF', + 'WS-FILE-STATUS', 'WS-INNER-CODE', 'WS-LOG-MESSAGE', + 'WS-MAP-NAME', 'WS-NAME', 'WS-NEXT-PGM', 'WS-OUTER-FLAG', + 'WS-PROG-NAME', 'WS-QUEUE-NAME', 'WS-RECORD', + 'WS-REPORT-LINE', 'WS-SORT-FILE', 'WS-SQL-CODE', 'WS-TIMESTAMP', + ]); + }); + + it('produces exactly 1 Record node', () => { + expect(getNodesByLabel(result, 'Record')).toEqual(['CUSTOMER-FILE']); + }); + + it('produces exactly 15 CodeElement nodes', () => { + const nodes = getNodesByLabel(result, 'CodeElement'); + expect(nodes.length).toBe(15); + expect(nodes).toEqual([ + 'CALL WS-PROG-NAME', 'CICS XCTL WS-NEXT-PGM', 'CUSTJOB', + 'EXEC CICS HANDLE ABEND', 'EXEC CICS LINK', 'EXEC CICS READ', + 'EXEC CICS RETURN', 'EXEC CICS SEND MAP', 'EXEC CICS WRITEQ TS', + 'EXEC CICS XCTL', 'EXEC CICS XCTL', 'EXEC SQL SELECT', + 'PROD.CUSTOMER.MASTER', 'STEP1', 'STEP2', + ]); + }); + + it('produces exactly 2 Constructor nodes', () => { + expect(getNodesByLabel(result, 'Constructor')).toEqual(['ALTENTRY', 'AUDITLOG-BATCH']); + }); + }); + + // ===================================================================== + // CALLS EDGES — exact count + exact sorted pairs per reason + // ===================================================================== + + describe('CALLS edge completeness', () => { + + it('produces exactly 15 CALLS edges with reason cobol-perform', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cobol-perform'); + expect(edges.length).toBe(15); + expect(edgeSet(edges)).toEqual([ + 'FORMAT-REPORT \u2192 BUILD-SORT-INPUT', + 'FORMAT-REPORT \u2192 FETCH-DATA', + 'FORMAT-REPORT \u2192 MAIN-PARAGRAPH', + 'FORMAT-REPORT \u2192 SEND-SCREEN', + 'FORMAT-REPORT \u2192 WRITE-SORTED', + 'INNER-MAIN \u2192 INNER-PROCESS', + 'MAIN-PARAGRAPH \u2192 CLEANUP-PARAGRAPH', + 'MAIN-PARAGRAPH \u2192 FETCH-DATA', + 'MAIN-PARAGRAPH \u2192 FORMAT-REPORT', + 'MAIN-PARAGRAPH \u2192 INIT-PARAGRAPH', + 'MAIN-PARAGRAPH \u2192 PROCESS-PARAGRAPH', + 'MAIN-PARAGRAPH \u2192 SEND-SCREEN', + 'MAIN-PARAGRAPH \u2192 WRITE-LOG', + 'OUTER-MAIN \u2192 OUTER-PROCESS', + 'PROCESS-PARAGRAPH \u2192 READ-CUSTOMER', + ]); + }); + + it('produces exactly 2 CALLS edges with reason cobol-perform-thru', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cobol-perform-thru'); + expect(edges.length).toBe(2); + expect(edgeSet(edges)).toEqual([ + 'FORMAT-REPORT \u2192 FORMAT-REPORT', + 'PROCESS-PARAGRAPH \u2192 WRITE-CUSTOMER', + ]); + }); + + it('produces exactly 3 CALLS edges with reason cobol-call', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cobol-call'); + expect(edges.length).toBe(3); + expect(edgeSet(edges)).toEqual([ + 'CUSTUPDT \u2192 AUDITLOG', + 'OUTER-PROG \u2192 INNER-PROG', + 'RPTGEN \u2192 CUSTUPDT', + ]); + }); + + it('produces exactly 4 CALLS edges with reason cobol-goto', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cobol-goto'); + expect(edges.length).toBe(4); + expect(edgeSet(edges)).toEqual([ + 'FORMAT-REPORT \u2192 FETCH-DATA', + 'FORMAT-REPORT \u2192 FORMAT-REPORT', + 'FORMAT-REPORT \u2192 SEND-SCREEN', + 'MAIN-PARAGRAPH \u2192 EXIT-PARAGRAPH', + ]); + }); + + it('produces exactly 1 CALLS edge with reason cics-link', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cics-link'); + expect(edges.length).toBe(1); + expect(edgeSet(edges)).toEqual(['RPTGEN \u2192 AUDITLOG']); + }); + + it('produces exactly 1 CALLS edge with reason cics-xctl', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cics-xctl'); + expect(edges.length).toBe(1); + expect(edgeSet(edges)).toEqual(['RPTGEN \u2192 CUSTUPDT']); + }); + + it('produces exactly 1 CALLS edge with reason cics-handle-abend', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cics-handle-abend'); + expect(edges.length).toBe(1); + expect(edgeSet(edges)).toEqual(['RPTGEN \u2192 ABEND-HANDLER']); + }); + + it('produces exactly 1 CALLS edge with reason cics-return-transid', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'cics-return-transid'); + expect(edges.length).toBe(1); + }); + + it('produces exactly 2 CALLS edges with reason jcl-exec-pgm', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'jcl-exec-pgm'); + expect(edges.length).toBe(2); + expect(edgeSet(edges)).toEqual(['STEP1 \u2192 CUSTUPDT', 'STEP2 \u2192 RPTGEN']); + }); + + it('produces exactly 1 CALLS edge with reason jcl-dd:CUSTFILE', () => { + const edges = getRelationships(result, 'CALLS').filter(e => e.rel.reason === 'jcl-dd:CUSTFILE'); + expect(edges.length).toBe(1); + expect(edgeSet(edges)).toEqual(['STEP1 \u2192 PROD.CUSTOMER.MASTER']); + }); + + it('produces zero unresolved CALLS edges', () => { + expect(getRelationships(result, 'CALLS').filter(e => e.rel.reason.endsWith('-unresolved')).length).toBe(0); + }); + }); + + // ===================================================================== + // CONTAINS EDGES — exact count + exact sorted pairs per reason + // ===================================================================== + + describe('CONTAINS edge completeness', () => { + + it('produces exactly 4 CONTAINS edges with reason cobol-program-id', () => { + const edges = getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-program-id'); + expect(edges.length).toBe(4); + expect(edgeSet(edges)).toEqual([ + 'AUDITLOG.cbl \u2192 AUDITLOG', + 'CUSTUPDT.cbl \u2192 CUSTUPDT', + 'NESTED.cbl \u2192 OUTER-PROG', + 'RPTGEN.cbl \u2192 RPTGEN', + ]); + }); + + it('produces exactly 1 CONTAINS edge with reason cobol-nested-program', () => { + const edges = getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-nested-program'); + expect(edges.length).toBe(1); + expect(edgeSet(edges)).toEqual(['OUTER-PROG \u2192 INNER-PROG']); + }); + + it('produces exactly 2 CONTAINS edges with reason cobol-section', () => { + const edges = getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-section'); + expect(edges.length).toBe(2); + expect(edgeSet(edges)).toEqual([ + 'CUSTUPDT \u2192 INIT-SECTION', + 'CUSTUPDT \u2192 PROCESSING-SECTION', + ]); + }); + + it('produces exactly 21 CONTAINS edges with reason cobol-paragraph', () => { + const edges = getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-paragraph'); + expect(edges.length).toBe(21); + expect(edgeSet(edges)).toEqual([ + 'AUDITLOG \u2192 MAIN-PARAGRAPH', + 'AUDITLOG \u2192 WRITE-LOG', + 'INIT-SECTION \u2192 INIT-PARAGRAPH', + 'INIT-SECTION \u2192 MAIN-PARAGRAPH', + 'INNER-PROG \u2192 INNER-MAIN', + 'INNER-PROG \u2192 INNER-PROCESS', + 'OUTER-PROG \u2192 OUTER-MAIN', + 'OUTER-PROG \u2192 OUTER-PROCESS', + 'PROCESSING-SECTION \u2192 CLEANUP-PARAGRAPH', + 'PROCESSING-SECTION \u2192 PROCESS-PARAGRAPH', + 'PROCESSING-SECTION \u2192 READ-CUSTOMER', + 'PROCESSING-SECTION \u2192 UPDATE-BALANCE', + 'PROCESSING-SECTION \u2192 WRITE-CUSTOMER', + 'RPTGEN \u2192 ABEND-HANDLER', + 'RPTGEN \u2192 BUILD-SORT-INPUT', + 'RPTGEN \u2192 EXIT-PARAGRAPH', + 'RPTGEN \u2192 FETCH-DATA', + 'RPTGEN \u2192 FORMAT-REPORT', + 'RPTGEN \u2192 MAIN-PARAGRAPH', + 'RPTGEN \u2192 SEND-SCREEN', + 'RPTGEN \u2192 WRITE-SORTED', + ]); + }); + + it('produces exactly 36 CONTAINS edges with reason cobol-data-item', () => { + const edges = getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-data-item'); + expect(edges.length).toBe(36); + expect(edgeSet(edges)).toEqual([ + 'AUDITLOG \u2192 LS-AMOUNT', + 'AUDITLOG \u2192 LS-CUST-ID', + 'AUDITLOG \u2192 WS-LOG-MESSAGE', + 'AUDITLOG \u2192 WS-TIMESTAMP', + 'CUSTUPDT \u2192 CUST-BALANCE', + 'CUSTUPDT \u2192 CUST-ID', + 'CUSTUPDT \u2192 CUST-NAME', + 'CUSTUPDT \u2192 CUSTOMER-RECORD', + 'CUSTUPDT \u2192 END-OF-FILE', + 'CUSTUPDT \u2192 FIELD-A', + 'CUSTUPDT \u2192 FIELD-B', + 'CUSTUPDT \u2192 LS-PARAM', + 'CUSTUPDT \u2192 WS-AMOUNT', + 'CUSTUPDT \u2192 WS-AMT', + 'CUSTUPDT \u2192 WS-CODE', + 'CUSTUPDT \u2192 WS-CUSTOMER-NAME', + 'CUSTUPDT \u2192 WS-EOF', + 'CUSTUPDT \u2192 WS-FILE-STATUS', + 'CUSTUPDT \u2192 WS-NAME', + 'CUSTUPDT \u2192 WS-PROG-NAME', + 'CUSTUPDT \u2192 WS-RECORD', + 'INNER-PROG \u2192 WS-INNER-CODE', + 'OUTER-PROG \u2192 WS-OUTER-FLAG', + 'RPTGEN \u2192 PREMIUM-CUSTOMER', + 'RPTGEN \u2192 REGULAR-CUSTOMER', + 'RPTGEN \u2192 WS-COUNT', + 'RPTGEN \u2192 WS-CUST-ADDR', + 'RPTGEN \u2192 WS-CUST-CODE', + 'RPTGEN \u2192 WS-CUST-TYPE', + 'RPTGEN \u2192 WS-CUSTOMER-DATA', + 'RPTGEN \u2192 WS-MAP-NAME', + 'RPTGEN \u2192 WS-NEXT-PGM', + 'RPTGEN \u2192 WS-QUEUE-NAME', + 'RPTGEN \u2192 WS-REPORT-LINE', + 'RPTGEN \u2192 WS-SORT-FILE', + 'RPTGEN \u2192 WS-SQL-CODE', + ]); + }); + + it('produces exactly 8 CONTAINS edges with reason cobol-exec-cics', () => { + const edges = getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-exec-cics'); + expect(edges.length).toBe(8); + expect(edgeSet(edges)).toEqual([ + 'RPTGEN \u2192 EXEC CICS HANDLE ABEND', + 'RPTGEN \u2192 EXEC CICS LINK', + 'RPTGEN \u2192 EXEC CICS READ', + 'RPTGEN \u2192 EXEC CICS RETURN', + 'RPTGEN \u2192 EXEC CICS SEND MAP', + 'RPTGEN \u2192 EXEC CICS WRITEQ TS', + 'RPTGEN \u2192 EXEC CICS XCTL', + 'RPTGEN \u2192 EXEC CICS XCTL', + ]); + }); + + it('produces exactly 1 CONTAINS edge with reason cobol-exec-sql', () => { + expect(edgeSet(getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-exec-sql'))) + .toEqual(['RPTGEN \u2192 EXEC SQL SELECT']); + }); + + it('produces exactly 1 CONTAINS edge with reason cics-dynamic-program', () => { + expect(edgeSet(getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cics-dynamic-program'))) + .toEqual(['RPTGEN \u2192 CICS XCTL WS-NEXT-PGM']); + }); + + it('produces exactly 1 CONTAINS edge with reason cobol-dynamic-call', () => { + expect(edgeSet(getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-dynamic-call'))) + .toEqual(['CUSTUPDT \u2192 CALL WS-PROG-NAME']); + }); + + it('produces exactly 2 CONTAINS edges with reason cobol-entry-point', () => { + expect(edgeSet(getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-entry-point'))) + .toEqual(['AUDITLOG \u2192 AUDITLOG-BATCH', 'CUSTUPDT \u2192 ALTENTRY']); + }); + + it('produces exactly 1 CONTAINS edge with reason cobol-file-declaration', () => { + expect(edgeSet(getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'cobol-file-declaration'))) + .toEqual(['CUSTUPDT \u2192 CUSTOMER-FILE']); + }); + + it('produces exactly 1 CONTAINS edge with reason jcl-job', () => { + expect(edgeSet(getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'jcl-job'))) + .toEqual(['RUNJOBS.jcl \u2192 CUSTJOB']); + }); + + it('produces exactly 2 CONTAINS edges with reason jcl-step', () => { + expect(edgeSet(getRelationships(result, 'CONTAINS').filter(e => e.rel.reason === 'jcl-step'))) + .toEqual(['CUSTJOB \u2192 STEP1', 'CUSTJOB \u2192 STEP2']); + }); + }); + + // ===================================================================== + // ACCESSES EDGES — exact count + exact sorted pairs per reason + // ===================================================================== + + describe('ACCESSES edge completeness', () => { + + it('produces exactly 4 ACCESSES edges with reason cobol-move-read', () => { + const edges = getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cobol-move-read'); + expect(edges.length).toBe(4); + expect(edgeSet(edges)).toEqual([ + 'FORMAT-REPORT \u2192 WS-CUST-CODE', + 'READ-CUSTOMER \u2192 CUST-NAME', + 'UPDATE-BALANCE \u2192 WS-AMOUNT', + 'UPDATE-BALANCE \u2192 WS-AMT', + ]); + }); + + it('produces exactly 5 ACCESSES edges with reason cobol-move-write', () => { + const edges = getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cobol-move-write'); + expect(edges.length).toBe(5); + expect(edgeSet(edges)).toEqual([ + 'FORMAT-REPORT \u2192 WS-REPORT-LINE', + 'READ-CUSTOMER \u2192 WS-CUSTOMER-NAME', + 'UPDATE-BALANCE \u2192 CUST-BALANCE', + 'UPDATE-BALANCE \u2192 FIELD-A', + 'UPDATE-BALANCE \u2192 FIELD-B', + ]); + }); + + it('produces exactly 1 ACCESSES edge with reason cics-file-read', () => { + expect(getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cics-file-read').length).toBe(1); + }); + + it('produces exactly 1 ACCESSES edge with reason cics-map', () => { + expect(getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cics-map').length).toBe(1); + }); + + it('produces exactly 1 ACCESSES edge with reason cics-queue-write', () => { + expect(getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cics-queue-write').length).toBe(1); + }); + + it('produces exactly 1 ACCESSES edge with reason cics-receive-into', () => { + const edges = getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cics-receive-into'); + expect(edges.length).toBe(1); + expect(edges[0].target).toBe('WS-CUSTOMER-DATA'); + }); + + it('produces exactly 2 ACCESSES edges with reason cics-send-from', () => { + const edges = getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cics-send-from'); + expect(edges.length).toBe(2); + expect(edgeSet(edges)).toEqual([ + 'EXEC CICS SEND MAP \u2192 WS-REPORT-LINE', + 'EXEC CICS WRITEQ TS \u2192 WS-REPORT-LINE', + ]); + }); + + it('produces exactly 1 ACCESSES edge with reason cobol-search', () => { + const edges = getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cobol-search'); + expect(edges.length).toBe(1); + expect(edgeSet(edges)).toEqual(['RPTGEN \u2192 WS-CUSTOMER-DATA']); + }); + + it('produces exactly 1 ACCESSES edge with reason sort-using', () => { + expect(getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'sort-using').length).toBe(1); + }); + + it('produces exactly 1 ACCESSES edge with reason sort-giving (multi-line SORT)', () => { + expect(getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'sort-giving').length).toBe(1); + }); + + it('produces exactly 2 ACCESSES edges with reason cobol-procedure-using', () => { + const edges = getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'cobol-procedure-using'); + expect(edges.length).toBe(2); + expect(edgeSet(edges)).toEqual([ + 'AUDITLOG \u2192 LS-AMOUNT', + 'AUDITLOG \u2192 LS-CUST-ID', + ]); + }); + + it('produces exactly 1 ACCESSES edge with reason sql-select', () => { + expect(getRelationships(result, 'ACCESSES').filter(e => e.rel.reason === 'sql-select').length).toBe(1); + }); + }); + + // ===================================================================== + // IMPORTS EDGES — exact pairs + // ===================================================================== + + describe('IMPORTS edge completeness', () => { + + it('produces exactly 2 IMPORTS edges with reason cobol-copy', () => { + const edges = getRelationships(result, 'IMPORTS').filter(e => e.rel.reason === 'cobol-copy'); + expect(edges.length).toBe(2); + }); + }); + + // ===================================================================== + // FEATURE-SPECIFIC ASSERTIONS — validates all review findings resolved + // ===================================================================== + + describe('multi-PERFORM on same line (Finding #III)', () => { + + it('captures both PERFORMs in IF/ELSE on a single logical line', () => { + // IF WS-COUNT > 0 PERFORM FETCH-DATA ELSE PERFORM SEND-SCREEN + const edges = getRelationships(result, 'CALLS').filter( + e => e.rel.reason === 'cobol-perform' && e.source === 'FORMAT-REPORT', + ); + const targets = edges.map(e => e.target).sort(); + expect(targets).toContain('FETCH-DATA'); + expect(targets).toContain('SEND-SCREEN'); + }); + }); + + describe('INPUT/OUTPUT PROCEDURE IS in SORT (Finding #iii)', () => { + + it('creates CALLS edges for INPUT PROCEDURE and OUTPUT PROCEDURE targets', () => { + const edges = getRelationships(result, 'CALLS').filter( + e => e.rel.reason === 'cobol-perform' && e.source === 'FORMAT-REPORT', + ); + const targets = edges.map(e => e.target).sort(); + expect(targets).toContain('BUILD-SORT-INPUT'); + expect(targets).toContain('WRITE-SORTED'); + }); + + it('creates paragraph nodes for INPUT/OUTPUT PROCEDURE targets', () => { + const nodes = getNodesByLabel(result, 'Function'); + expect(nodes).toContain('BUILD-SORT-INPUT'); + expect(nodes).toContain('WRITE-SORTED'); + }); + }); + + describe('GO TO DEPENDING ON multi-target (Finding #iv)', () => { + + it('captures all three targets from GO TO ... DEPENDING ON', () => { + // GO TO FETCH-DATA FORMAT-REPORT SEND-SCREEN DEPENDING ON WS-COUNT + const edges = getRelationships(result, 'CALLS').filter( + e => e.rel.reason === 'cobol-goto' && e.source === 'FORMAT-REPORT', + ); + expect(edges.length).toBe(3); + expect(edgeSet(edges)).toEqual([ + 'FORMAT-REPORT \u2192 FETCH-DATA', + 'FORMAT-REPORT \u2192 FORMAT-REPORT', + 'FORMAT-REPORT \u2192 SEND-SCREEN', + ]); + }); + }); + + describe('MOVE CORR abbreviation (Finding #IV)', () => { + + it('produces ACCESSES edges for MOVE CORR with corresponding reason', () => { + const readEdges = getRelationships(result, 'ACCESSES').filter( + e => e.rel.reason === 'cobol-move-corresponding-read', + ); + expect(readEdges.length).toBe(1); + expect(edgeSet(readEdges)).toEqual(['FORMAT-REPORT \u2192 WS-CUSTOMER-DATA']); + + const writeEdges = getRelationships(result, 'ACCESSES').filter( + e => e.rel.reason === 'cobol-move-corresponding-write', + ); + expect(writeEdges.length).toBe(1); + expect(edgeSet(writeEdges)).toEqual(['FORMAT-REPORT \u2192 WS-REPORT-LINE']); + }); + }); + + describe('nested program CONTAINS attribution (Finding #I, #II)', () => { + + it('attributes INNER-PROG paragraphs to INNER-PROG, not OUTER-PROG', () => { + const edges = getRelationships(result, 'CONTAINS').filter( + e => e.rel.reason === 'cobol-paragraph' && e.target === 'INNER-MAIN', + ); + expect(edges.length).toBe(1); + expect(edges[0].source).toBe('INNER-PROG'); + }); + + it('attributes INNER-PROG data items to INNER-PROG, not OUTER-PROG', () => { + const edges = getRelationships(result, 'CONTAINS').filter( + e => e.rel.reason === 'cobol-data-item' && e.target === 'WS-INNER-CODE', + ); + expect(edges.length).toBe(1); + expect(edges[0].source).toBe('INNER-PROG'); + }); + + it('attributes OUTER-PROG data items to OUTER-PROG', () => { + const edges = getRelationships(result, 'CONTAINS').filter( + e => e.rel.reason === 'cobol-data-item' && e.target === 'WS-OUTER-FLAG', + ); + expect(edges.length).toBe(1); + expect(edges[0].source).toBe('OUTER-PROG'); + }); + }); + + describe('per-program PROCEDURE DIVISION USING (Finding #III partial)', () => { + + it('creates ACCESSES edges from AUDITLOG, not from wrong program', () => { + const edges = getRelationships(result, 'ACCESSES').filter( + e => e.rel.reason === 'cobol-procedure-using', + ); + expect(edges.length).toBe(2); + // Both edges should source from AUDITLOG (the program that declares USING) + for (const e of edges) { + expect(e.source).toBe('AUDITLOG'); + } + }); + }); + + describe('PERFORM THRU edge correctness', () => { + + it('captures FORMAT-REPORT PERFORM THRU from MAIN-PARAGRAPH', () => { + const edges = getRelationships(result, 'CALLS').filter( + e => e.rel.reason === 'cobol-perform-thru', + ); + expect(edgeSet(edges)).toContain('FORMAT-REPORT \u2192 FORMAT-REPORT'); + }); + }); + + describe('nested program CALLS attribution', () => { + + it('attributes INNER-PROG PERFORM edges to INNER-PROG paragraphs', () => { + const edges = getRelationships(result, 'CALLS').filter( + e => e.rel.reason === 'cobol-perform' && e.source === 'INNER-MAIN', + ); + expect(edges.length).toBe(1); + expect(edges[0].target).toBe('INNER-PROCESS'); + }); + }); + + // ===================================================================== + // GRAND TOTALS — catch any unexpected edge leakage + // ===================================================================== + + describe('grand totals', () => { + + it('produces exactly 31 total CALLS edges', () => { + // 15 perform + 2 perform-thru + 3 call + 4 goto + 1 link + 1 xctl + // + 1 handle-abend + 1 return-transid + 2 jcl-exec-pgm + 1 jcl-dd + expect(getRelationships(result, 'CALLS').length).toBe(31); + }); + + it('produces exactly 81 total CONTAINS edges', () => { + // 4 program-id + 1 nested-program + 2 section + 21 paragraph + // + 36 data-item + 8 exec-cics + 1 exec-sql + 1 dynamic-call + // + 1 cics-dynamic-program + 2 entry-point + 1 file-declaration + // + 1 jcl-job + 2 jcl-step + expect(getRelationships(result, 'CONTAINS').length).toBe(81); + }); + + it('produces exactly 2 total IMPORTS edges', () => { + expect(getRelationships(result, 'IMPORTS').length).toBe(2); + }); + + it('produces exactly 25 total ACCESSES edges', () => { + // 4 move-read + 5 move-write + 1 move-corresponding-read + 1 move-corresponding-write + // + 1 file-read + 1 map + 1 queue-write + // + 1 receive-into + 2 send-from + 1 search + 1 sort-using + 1 sort-giving + // + 2 procedure-using + 1 sql-select + 2 call-using + expect(getRelationships(result, 'ACCESSES').length).toBe(25); + }); + }); +}); diff --git a/gitnexus/test/unit/cobol-copy-expander.test.ts b/gitnexus/test/unit/cobol-copy-expander.test.ts new file mode 100644 index 0000000000..8c1ca98f81 --- /dev/null +++ b/gitnexus/test/unit/cobol-copy-expander.test.ts @@ -0,0 +1,69 @@ +/** + * Unit Tests: COBOL Copy Expander — pseudotext REPLACING support + */ +import { describe, it, expect } from 'vitest'; +import { parseReplacingClause } from '../../src/core/ingestion/cobol/cobol-copy-expander.js'; + +describe('parseReplacingClause', () => { + // Existing quoted-string behavior preserved + it('parses quoted EXACT replacement', () => { + const result = parseReplacingClause(' "OLD-NAME" BY "NEW-NAME" '); + expect(result).toEqual([{ type: 'EXACT', from: 'OLD-NAME', to: 'NEW-NAME' }]); + }); + + it('parses LEADING replacement', () => { + const result = parseReplacingClause(' LEADING "ESP-" BY "LK-ESP-" '); + expect(result).toEqual([{ type: 'LEADING', from: 'ESP-', to: 'LK-ESP-' }]); + }); + + it('parses TRAILING replacement', () => { + const result = parseReplacingClause(' TRAILING "-IN" BY "-OUT" '); + expect(result).toEqual([{ type: 'TRAILING', from: '-IN', to: '-OUT' }]); + }); + + // Pseudotext ==...== support (isPseudotext flag propagated) + it('parses basic pseudotext: ==OLD== BY ==NEW==', () => { + const result = parseReplacingClause(' ==WS-OLD== BY ==WS-NEW== '); + expect(result).toEqual([{ type: 'EXACT', from: 'WS-OLD', to: 'WS-NEW', isPseudotext: true }]); + }); + + it('parses empty pseudotext (deletion): ==TEXT== BY ====', () => { + const result = parseReplacingClause(' ==REMOVE-ME== BY ==== '); + expect(result).toEqual([{ type: 'EXACT', from: 'REMOVE-ME', to: '', isPseudotext: true }]); + }); + + it('parses pseudotext with spaces: ==SOME TEXT== BY ==OTHER TEXT==', () => { + const result = parseReplacingClause(' ==WORKING STORAGE== BY ==LOCAL STORAGE== '); + expect(result).toEqual([{ type: 'EXACT', from: 'WORKING STORAGE', to: 'LOCAL STORAGE', isPseudotext: true }]); + }); + + it('parses pseudotext with single = inside: ==A=B== BY ==C=D==', () => { + const result = parseReplacingClause(' ==A=B== BY ==C=D== '); + expect(result).toEqual([{ type: 'EXACT', from: 'A=B', to: 'C=D', isPseudotext: true }]); + }); + + it('parses mixed quoted + pseudotext in one clause', () => { + const result = parseReplacingClause( + ' "OLD-NAME" BY "NEW-NAME" ==DEL-PREFIX== BY ==== ', + ); + expect(result).toEqual([ + { type: 'EXACT', from: 'OLD-NAME', to: 'NEW-NAME' }, + { type: 'EXACT', from: 'DEL-PREFIX', to: '', isPseudotext: true }, + ]); + }); + + it('LEADING modifier works alongside pseudotext', () => { + const result = parseReplacingClause( + ' LEADING "ESP-" BY "LK-ESP-" ==OLD-EXACT== BY ==NEW-EXACT== ', + ); + expect(result).toEqual([ + { type: 'LEADING', from: 'ESP-', to: 'LK-ESP-' }, + { type: 'EXACT', from: 'OLD-EXACT', to: 'NEW-EXACT', isPseudotext: true }, + ]); + }); + + it('returns empty array for empty input', () => { + expect(parseReplacingClause('')).toEqual([]); + expect(parseReplacingClause(' ')).toEqual([]); + }); +}); diff --git a/gitnexus/test/unit/cobol-preprocessor.test.ts b/gitnexus/test/unit/cobol-preprocessor.test.ts new file mode 100644 index 0000000000..22169d6013 --- /dev/null +++ b/gitnexus/test/unit/cobol-preprocessor.test.ts @@ -0,0 +1,2746 @@ +import { describe, it, expect } from 'vitest'; +import { + preprocessCobolSource, + extractCobolSymbolsWithRegex, +} from '../../src/core/ingestion/cobol/cobol-preprocessor.js'; +import type { CobolRegexResults } from '../../src/core/ingestion/cobol/cobol-preprocessor.js'; +import { parseReplacingClause } from '../../src/core/ingestion/cobol/cobol-copy-expander.js'; + +// --------------------------------------------------------------------------- +// Helper: build COBOL source from an array of lines. +// +// The parser processes full raw lines including columns 1-6 (sequence area). +// Regexes anchored with ^\s+ (data items, FD, AUTHOR, etc.) require the line +// to start with whitespace, so test lines use spaces in cols 1-6 instead of +// numeric sequence numbers unless specifically testing sequence-number behavior. +// +// Column layout: +// 1-6: sequence/patch area (spaces or digits) +// 7: indicator (* comment, - continuation, / page break, space normal) +// 8-11: Area A (divisions, sections, paragraphs start here = 7 leading spaces) +// 12+: Area B (statements = 11+ leading spaces) +// --------------------------------------------------------------------------- +function cobol(...lines: string[]): string { + return lines.join('\n'); +} + +// --------------------------------------------------------------------------- +// preprocessCobolSource +// --------------------------------------------------------------------------- + +describe('preprocessCobolSource', () => { + it('replaces alphabetic patch markers in cols 1-6 with spaces', () => { + const input = cobol( + 'mzADD IDENTIFICATION DIVISION.', + 'estero PROGRAM-ID. TEST1.', + ); + const output = preprocessCobolSource(input); + const lines = output.split('\n'); + expect(lines[0].substring(0, 6)).toBe(' '); + expect(lines[0].substring(6)).toBe(' IDENTIFICATION DIVISION.'); + expect(lines[1].substring(0, 6)).toBe(' '); + }); + + it('strips numeric sequence numbers from cols 1-6', () => { + const input = cobol( + '000100 IDENTIFICATION DIVISION.', + '000200 PROGRAM-ID. TEST1.', + ); + const output = preprocessCobolSource(input); + const lines = output.split('\n'); + expect(lines[0]).toBe(' IDENTIFICATION DIVISION.'); + expect(lines[1]).toBe(' PROGRAM-ID. TEST1.'); + }); + + it('preserves lines shorter than 7 characters', () => { + const input = cobol('SHORT', ' ', '000100 IDENTIFICATION DIVISION.'); + const output = preprocessCobolSource(input); + const lines = output.split('\n'); + expect(lines[0]).toBe('SHORT'); + expect(lines[1]).toBe(' '); + }); + + it('preserves exact line count (no lines added/removed)', () => { + const input = cobol( + 'mzADD IDENTIFICATION DIVISION.', + '000200 PROGRAM-ID. TEST1.', + 'patch# DATA DIVISION.', + '', + '000500 PROCEDURE DIVISION.', + ); + const output = preprocessCobolSource(input); + expect(output.split('\n').length).toBe(input.split('\n').length); + }); +}); + +// --------------------------------------------------------------------------- +// extractCobolSymbolsWithRegex +// --------------------------------------------------------------------------- + +describe('extractCobolSymbolsWithRegex', () => { + + // ------------------------------------------------------------------------- + // PROGRAM-ID + // ------------------------------------------------------------------------- + describe('PROGRAM-ID', () => { + it('extracts PROGRAM-ID from IDENTIFICATION DIVISION', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programName).toBe('TESTPROG'); + }); + + it('captures all PROGRAM-IDs in programs array with line ranges', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER-PROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' DISPLAY "OUTER".', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER-PROG.', + ' PROCEDURE DIVISION.', + ' INNER-PARA.', + ' DISPLAY "INNER".', + ' END PROGRAM INNER-PROG.', + ' END PROGRAM OUTER-PROG.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programName).toBe('OUTER-PROG'); + expect(r.programs).toHaveLength(2); + expect(r.programs[0].name).toBe('OUTER-PROG'); + expect(r.programs[0].nestingDepth).toBe(0); + expect(r.programs[1].name).toBe('INNER-PROG'); + expect(r.programs[1].nestingDepth).toBe(1); + // INNER-PROG's startLine < endLine, contained within OUTER-PROG + expect(r.programs[0].startLine).toBe(2); // OUTER-PROG + expect(r.programs[1].startLine).toBe(7); // INNER-PROG + expect(r.programs[1].endLine).toBe(11); // END PROGRAM INNER-PROG + expect(r.programs[0].endLine).toBe(12); // END PROGRAM OUTER-PROG + }); + + it('returns null programName for content without PROGRAM-ID', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' AUTHOR. SOMEONE.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programName).toBeNull(); + }); + }); + + // ------------------------------------------------------------------------- + // Paragraphs & Sections + // ------------------------------------------------------------------------- + describe('Paragraphs & Sections', () => { + it('extracts paragraphs in PROCEDURE DIVISION (7 leading spaces)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' DISPLAY "HELLO".', + ' SUB-PARA.', + ' DISPLAY "WORLD".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.paragraphs).toHaveLength(2); + expect(r.paragraphs[0].name).toBe('MAIN-PARA'); + expect(r.paragraphs[1].name).toBe('SUB-PARA'); + }); + + it('extracts sections in PROCEDURE DIVISION', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' INIT-SECTION SECTION.', + ' INIT-PARA.', + ' DISPLAY "INIT".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sections).toHaveLength(1); + expect(r.sections[0].name).toBe('INIT-SECTION'); + expect(r.paragraphs).toHaveLength(1); + expect(r.paragraphs[0].name).toBe('INIT-PARA'); + }); + + it('excludes reserved names (DECLARATIVES, END, PROCEDURE, etc.)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' DECLARATIVES.', + ' END.', + ' REAL-PARA.', + ' DISPLAY "OK".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.paragraphs.map(p => p.name)).toEqual(['REAL-PARA']); + }); + + it('does NOT treat IDENTIFICATION/ENVIRONMENT/DATA/WORKING-STORAGE as paragraphs', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' ENVIRONMENT DIVISION.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' PROCEDURE DIVISION.', + ' REAL-PARA.', + ' DISPLAY "OK".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const names = r.paragraphs.map(p => p.name); + expect(names).not.toContain('IDENTIFICATION'); + expect(names).not.toContain('ENVIRONMENT'); + expect(names).not.toContain('DATA'); + expect(names).not.toContain('WORKING-STORAGE'); + expect(names).toContain('REAL-PARA'); + }); + }); + + // ------------------------------------------------------------------------- + // CALL / PERFORM / COPY + // ------------------------------------------------------------------------- + describe('CALL / PERFORM / COPY', () => { + it('extracts CALL "PROGRAM" statements', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' CALL "SUBPROG".', + ' CALL "ANOTHER".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(2); + expect(r.calls[0].target).toBe('SUBPROG'); + expect(r.calls[1].target).toBe('ANOTHER'); + }); + + it('extracts PERFORM paragraph-name with caller context', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' PERFORM SUB-PARA.', + ' SUB-PARA.', + ' DISPLAY "HELLO".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.performs).toHaveLength(1); + expect(r.performs[0].target).toBe('SUB-PARA'); + expect(r.performs[0].caller).toBe('MAIN-PARA'); + }); + + it('extracts PERFORM ... THRU ... statements', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' PERFORM STEP-A THRU STEP-Z.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.performs).toHaveLength(1); + expect(r.performs[0].target).toBe('STEP-A'); + expect(r.performs[0].thruTarget).toBe('STEP-Z'); + }); + + it('does NOT store PERFORM WS-COUNT TIMES as a perform target', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' PERFORM WS-COUNT TIMES.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.performs.map(p => p.target)).not.toContain('WS-COUNT'); + }); + + it('extracts dynamic CALL (unquoted) with isQuoted=false', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' CALL WS-PROG-NAME.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('WS-PROG-NAME'); + expect(r.calls[0].isQuoted).toBe(false); + }); + + it('quoted CALL has isQuoted=true', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' CALL "SUBPROG".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].isQuoted).toBe(true); + }); + + it('extracts COPY copybook (unquoted)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' COPY WSCOPY.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.copies).toHaveLength(1); + expect(r.copies[0].target).toBe('WSCOPY'); + }); + + it('extracts COPY "copybook" (quoted)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' COPY "MY-COPY".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.copies).toHaveLength(1); + expect(r.copies[0].target).toBe('MY-COPY'); + }); + }); + + // ------------------------------------------------------------------------- + // Data Division + // ------------------------------------------------------------------------- + describe('Data Division', () => { + it('extracts data items with level, name, PIC, USAGE', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-RECORD.', + ' 05 WS-NAME PIC X(30).', + ' 05 WS-AMOUNT PIC 9(7)V99 USAGE COMP-3.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.dataItems.length).toBe(3); // WS-NAME + WS-BALANCE + WS-AMOUNT (01-level group with only period has no clauses) + + const wsName = r.dataItems.find(d => d.name === 'WS-NAME'); + expect(wsName).toBeDefined(); + expect(wsName!.level).toBe(5); + expect(wsName!.pic).toMatch(/^X\(30\)/); + + const wsAmount = r.dataItems.find(d => d.name === 'WS-AMOUNT'); + expect(wsAmount).toBeDefined(); + expect(wsAmount!.usage).toBe('COMP-3'); + }); + + it('extracts 88-level condition names with values', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-STATUS PIC X.', + ' 88 WS-ACTIVE VALUE "A".', + ' 88 WS-INACTIVE VALUE "I".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const active = r.dataItems.find(d => d.name === 'WS-ACTIVE'); + expect(active).toBeDefined(); + expect(active!.level).toBe(88); + expect(active!.values).toEqual(['A']); + + const inactive = r.dataItems.find(d => d.name === 'WS-INACTIVE'); + expect(inactive).toBeDefined(); + expect(inactive!.values).toEqual(['I']); + }); + + it('extracts FD entries with record name linkage', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' FILE SECTION.', + ' FD EMPLOYEE-FILE.', + ' 01 EMPLOYEE-RECORD.', + ' 05 EMP-ID PIC 9(5).', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.fdEntries).toHaveLength(1); + expect(r.fdEntries[0].fdName).toBe('EMPLOYEE-FILE'); + expect(r.fdEntries[0].recordName).toBe('EMPLOYEE-RECORD'); + }); + + it('skips FILLER items', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-REC.', + ' 05 FILLER PIC X(10).', + ' 05 WS-DATA PIC X(20).', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const fillerItems = r.dataItems.filter(d => d.name === 'FILLER'); + expect(fillerItems).toHaveLength(0); + expect(r.dataItems.find(d => d.name === 'WS-DATA')).toBeDefined(); + }); + + it('correctly assigns data section (working-storage, linkage, file)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' FILE SECTION.', + ' FD MY-FILE.', + ' 01 FILE-REC PIC X(80).', + ' WORKING-STORAGE SECTION.', + ' 01 WS-VAR PIC X(10).', + ' LINKAGE SECTION.', + ' 01 LK-VAR PIC X(10).', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + + const fileRec = r.dataItems.find(d => d.name === 'FILE-REC'); + expect(fileRec).toBeDefined(); + expect(fileRec!.section).toBe('file'); + + const wsVar = r.dataItems.find(d => d.name === 'WS-VAR'); + expect(wsVar).toBeDefined(); + expect(wsVar!.section).toBe('working-storage'); + + const lkVar = r.dataItems.find(d => d.name === 'LK-VAR'); + expect(lkVar).toBeDefined(); + expect(lkVar!.section).toBe('linkage'); + }); + }); + + // ------------------------------------------------------------------------- + // Environment Division + // ------------------------------------------------------------------------- + describe('Environment Division', () => { + it('extracts SELECT ... ASSIGN TO with organization, access, record key', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' ENVIRONMENT DIVISION.', + ' INPUT-OUTPUT SECTION.', + ' FILE-CONTROL.', + ' SELECT EMPLOYEE-FILE', + ' ASSIGN TO "EMPFILE"', + ' ORGANIZATION IS INDEXED', + ' ACCESS MODE IS DYNAMIC', + ' RECORD KEY IS EMP-ID.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.fileDeclarations).toHaveLength(1); + const fd = r.fileDeclarations[0]; + expect(fd.selectName).toBe('EMPLOYEE-FILE'); + expect(fd.assignTo).toBe('EMPFILE'); + expect(fd.organization).toBe('INDEXED'); + expect(fd.access).toBe('DYNAMIC'); + expect(fd.recordKey).toBe('EMP-ID'); + }); + }); + + // ------------------------------------------------------------------------- + // State Machine + // ------------------------------------------------------------------------- + describe('State Machine', () => { + it('correctly transitions between divisions', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' ENVIRONMENT DIVISION.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-VAR PIC X(10).', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' DISPLAY WS-VAR.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programName).toBe('TESTPROG'); + expect(r.dataItems.find(d => d.name === 'WS-VAR')).toBeDefined(); + expect(r.paragraphs).toHaveLength(1); + expect(r.paragraphs[0].name).toBe('MAIN-PARA'); + }); + + it('handles continuation lines (indicator "-" in column 7)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' CALL "VERY-LONG-PR', + ' - "OGRAM-NAME".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // Continuation merges lines; at minimum verify no crash and paragraph found + expect(r.paragraphs).toHaveLength(1); + expect(r.paragraphs[0].name).toBe('MAIN-PARA'); + }); + + it('skips comment lines (indicator "*" or "/" in column 7)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' * THIS IS A COMMENT', + ' / THIS IS A PAGE BREAK COMMENT', + ' CALL "REALPROG".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('REALPROG'); + }); + }); + + // ------------------------------------------------------------------------- + // EXEC Blocks + // ------------------------------------------------------------------------- + describe('EXEC Blocks', () => { + it('extracts EXEC SQL blocks with tables and host variables', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC SQL', + ' SELECT EMP-NAME, EMP-SALARY', + ' FROM EMPLOYEE', + ' WHERE EMP-ID = :WS-EMP-ID', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execSqlBlocks).toHaveLength(1); + const sql = r.execSqlBlocks[0]; + expect(sql.operation).toBe('SELECT'); + expect(sql.tables).toContain('EMPLOYEE'); + expect(sql.hostVariables).toContain('WS-EMP-ID'); + }); + + it('extracts EXEC CICS blocks with command and MAP/PROGRAM/TRANSID', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " EXEC CICS SEND MAP('EMPMAP')", + " PROGRAM('EMPPROG')", + " TRANSID('EMPT')", + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execCicsBlocks).toHaveLength(1); + const cics = r.execCicsBlocks[0]; + expect(cics.command).toBe('SEND MAP'); + expect(cics.mapName).toBe('EMPMAP'); + expect(cics.programName).toBe('EMPPROG'); + expect(cics.transId).toBe('EMPT'); + }); + + it('extracts EXEC CICS MAP with unquoted identifier', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC CICS SEND MAP(WS-MAP-NAME)', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execCicsBlocks).toHaveLength(1); + expect(r.execCicsBlocks[0].mapName).toBe('WS-MAP-NAME'); + }); + + it('handles single-line EXEC SQL ... END-EXEC', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC SQL DELETE FROM ORDERS WHERE ORD-ID = :WS-ORD END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execSqlBlocks).toHaveLength(1); + expect(r.execSqlBlocks[0].operation).toBe('DELETE'); + expect(r.execSqlBlocks[0].tables).toContain('ORDERS'); + }); + + it('handles multi-line EXEC SQL ... END-EXEC', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC SQL', + ' INSERT INTO AUDIT_LOG', + ' VALUES (:WS-TIMESTAMP, :WS-USER)', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execSqlBlocks).toHaveLength(1); + const sql = r.execSqlBlocks[0]; + expect(sql.operation).toBe('INSERT'); + expect(sql.tables).toContain('AUDIT_LOG'); + expect(sql.hostVariables).toContain('WS-TIMESTAMP'); + expect(sql.hostVariables).toContain('WS-USER'); + }); + }); + + // ------------------------------------------------------------------------- + // Linkage & Data Flow + // ------------------------------------------------------------------------- + describe('Linkage & Data Flow', () => { + it('extracts PROCEDURE DIVISION USING parameters', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' LINKAGE SECTION.', + ' 01 LK-PARAM1 PIC X(10).', + ' 01 LK-PARAM2 PIC 9(5).', + ' PROCEDURE DIVISION USING LK-PARAM1 LK-PARAM2.', + ' MAIN-PARA.', + ' DISPLAY LK-PARAM1.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.procedureUsing).toEqual(['LK-PARAM1', 'LK-PARAM2']); + }); + + it('extracts ENTRY points with USING', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' ENTRY "ALTENTRY" USING WS-PARAM1 WS-PARAM2.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.entryPoints).toHaveLength(1); + expect(r.entryPoints[0].name).toBe('ALTENTRY'); + expect(r.entryPoints[0].parameters).toEqual(['WS-PARAM1', 'WS-PARAM2']); + }); + + it("extracts ENTRY 'ALTENTRY' with single-quoted target", () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " ENTRY 'ALTENTRY' USING WS-PARAM1.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.entryPoints).toHaveLength(1); + expect(r.entryPoints[0].name).toBe('ALTENTRY'); + expect(r.entryPoints[0].parameters).toEqual(['WS-PARAM1']); + }); + + it('ENTRY USING filters calling-convention keywords (BY VALUE REFERENCE)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " ENTRY 'ALTENTRY' USING BY VALUE WS-AMT BY REFERENCE LS-REC.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.entryPoints).toHaveLength(1); + // BY, VALUE, REFERENCE should be filtered out — only actual parameter names remain + expect(r.entryPoints[0].parameters).toEqual(['WS-AMT', 'LS-REC']); + }); + + it('paragraphs with SECTION in name are NOT excluded (e.g., CROSS-SECTION-PROC)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' CROSS-SECTION-ANALYSIS.', + ' DISPLAY "HELLO".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.paragraphs.map(p => p.name)).toContain('CROSS-SECTION-ANALYSIS'); + }); + + it('PERFORM THROUGH (full spelling) captures thruTarget', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' PERFORM FIRST-PARA THROUGH LAST-PARA.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.performs).toHaveLength(1); + expect(r.performs[0].target).toBe('FIRST-PARA'); + expect(r.performs[0].thruTarget).toBe('LAST-PARA'); + }); + + it('PROCEDURE DIVISION USING RETURNING excludes return value from USING list', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION USING WS-INPUT RETURNING WS-RESULT.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // RETURNING and everything after it should be excluded — only USING parameters remain + expect(r.procedureUsing).toEqual(['WS-INPUT']); + }); + + it('RE_CALL_DYNAMIC does NOT false-match on WS-CALL compound identifier', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 05 WS-CALL OCCURS 10 PIC X(10).', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' DISPLAY WS-CALL.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // WS-CALL should NOT produce a dynamic CALL — it's a data item name + expect(r.calls.filter(c => !c.isQuoted)).toHaveLength(0); + }); + + it('multi-line SORT captures USING and GIVING from continuation lines', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SORT SORT-FILE', + ' ON ASCENDING KEY WS-KEY', + ' USING INPUT-FILE', + ' GIVING OUTPUT-FILE.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sorts).toHaveLength(1); + expect(r.sorts[0].sortFile).toBe('SORT-FILE'); + expect(r.sorts[0].usingFiles).toContain('INPUT-FILE'); + expect(r.sorts[0].givingFiles).toContain('OUTPUT-FILE'); + }); + + it('PROCEDURE DIVISION USING on split line is captured via pendingProcUsing', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION', + ' USING WS-PARAM1 WS-PARAM2.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.procedureUsing).toEqual(['WS-PARAM1', 'WS-PARAM2']); + }); + + it('nested programs carry per-program procedureUsing', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER.', + ' PROCEDURE DIVISION USING WS-OUTER-PARAM.', + ' MAIN-PARA.', + ' DISPLAY WS-OUTER-PARAM.', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER.', + ' PROCEDURE DIVISION USING WS-INNER-PARAM.', + ' INNER-PARA.', + ' DISPLAY WS-INNER-PARAM.', + ' END PROGRAM INNER.', + ' END PROGRAM OUTER.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programs).toHaveLength(2); + const outer = r.programs.find(p => p.name === 'OUTER'); + const inner = r.programs.find(p => p.name === 'INNER'); + expect(outer?.procedureUsing).toEqual(['WS-OUTER-PARAM']); + expect(inner?.procedureUsing).toEqual(['WS-INNER-PARAM']); + }); + + it('SECTION with segment number is detected', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-SECTION SECTION 30.', + ' MAIN-PARA.', + ' DISPLAY "HI".', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sections.map(s => s.name)).toContain('MAIN-SECTION'); + }); + + it('dynamic CANCEL via data item is captured with isQuoted=false', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' CANCEL WS-PGM-NAME.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.cancels).toHaveLength(1); + expect(r.cancels[0].target).toBe('WS-PGM-NAME'); + expect(r.cancels[0].isQuoted).toBe(false); + }); + + it('copybook preprocessing strips sequence numbers before expansion', () => { + // This is tested indirectly — preprocessCobolSource is called in readCopy + const input = cobol('000100 IDENTIFICATION DIVISION.', '000200 PROGRAM-ID. TEST1.'); + const output = preprocessCobolSource(input); + // Verify cols 1-6 are blanked for numeric sequences + expect(output.split('\n')[0]).toBe(' IDENTIFICATION DIVISION.'); + }); + + it('numeric sequence numbers are stripped so paragraphs are detected', () => { + const src = preprocessCobolSource(cobol( + '000100 IDENTIFICATION DIVISION.', + '000200 PROGRAM-ID. SEQTEST.', + '000300 PROCEDURE DIVISION.', + '000400 MAIN-PARA.', + '000500 DISPLAY "HI".', + )); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programName).toBe('SEQTEST'); + expect(r.paragraphs.map(p => p.name)).toEqual(['MAIN-PARA']); + }); + + it('extracts MOVE statements (skipping figurative constants)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' MOVE WS-SOURCE TO WS-TARGET.', + ' MOVE SPACES TO WS-BLANK.', + ' MOVE ZEROS TO WS-ZERO.', + ' MOVE CORRESPONDING WS-REC1 TO WS-REC2.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const moveData = r.moves.map(m => ({ from: m.from, targets: m.targets, corr: m.corresponding })); + expect(moveData).toContainEqual({ from: 'WS-SOURCE', targets: ['WS-TARGET'], corr: false }); + expect(moveData).toContainEqual({ from: 'WS-REC1', targets: ['WS-REC2'], corr: true }); + expect(r.moves.find(m => m.from === 'SPACES')).toBeUndefined(); + expect(r.moves.find(m => m.from === 'ZEROS')).toBeUndefined(); + }); + + it('captures multiple MOVE targets: MOVE X TO A B C', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' MOVE WS-SOURCE TO WS-A WS-B WS-C.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.moves).toHaveLength(1); + expect(r.moves[0].targets).toEqual(['WS-A', 'WS-B', 'WS-C']); + }); + + it('MOVE CORRESPONDING is always single target', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' MOVE CORRESPONDING WS-REC1 TO WS-REC2.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.moves).toHaveLength(1); + expect(r.moves[0].targets).toEqual(['WS-REC2']); + expect(r.moves[0].corresponding).toBe(true); + }); + + it('MOVE handles OF-qualified names', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' MOVE WS-SRC TO WS-NAME OF WS-RECORD WS-CODE.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.moves).toHaveLength(1); + // WS-NAME OF WS-RECORD -> WS-NAME is the target; WS-CODE is a second target + expect(r.moves[0].targets).toEqual(['WS-NAME', 'WS-CODE']); + }); + + it('MOVE skips figurative constants in targets', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' MOVE WS-SRC TO SPACES.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // SPACES is in MOVE_SKIP, so no targets -> no move entry + expect(r.moves).toHaveLength(0); + }); + }); + + // ------------------------------------------------------------------------- + // Edge Cases + // ------------------------------------------------------------------------- + describe('Edge Cases', () => { + it('empty program returns empty results', () => { + const r = extractCobolSymbolsWithRegex('', 'empty.cbl'); + expect(r.programName).toBeNull(); + expect(r.paragraphs).toHaveLength(0); + expect(r.sections).toHaveLength(0); + expect(r.performs).toHaveLength(0); + expect(r.calls).toHaveLength(0); + expect(r.copies).toHaveLength(0); + expect(r.dataItems).toHaveLength(0); + expect(r.fileDeclarations).toHaveLength(0); + expect(r.fdEntries).toHaveLength(0); + expect(r.execSqlBlocks).toHaveLength(0); + expect(r.execCicsBlocks).toHaveLength(0); + expect(r.procedureUsing).toHaveLength(0); + expect(r.entryPoints).toHaveLength(0); + expect(r.moves).toHaveLength(0); + }); + + it('extracts AUTHOR and DATE-WRITTEN from program metadata', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' AUTHOR. JOHN DOE.', + ' DATE-WRITTEN. 2025-01-15.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programMetadata.author).toBe('JOHN DOE'); + expect(r.programMetadata.dateWritten).toBe('2025-01-15'); + }); + }); + + // ------------------------------------------------------------------------- + // Phase 1: Data Flow Features + // ------------------------------------------------------------------------- + describe('Phase 1: Data Flow Features', () => { + + it('EXEC SQL INCLUDE extracts member name (unquoted)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' EXEC SQL INCLUDE SQLCA END-EXEC.', + ' EXEC SQL INCLUDE CUSTDCL END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const includes = r.execSqlBlocks.filter(b => b.includeMember); + expect(includes).toHaveLength(2); + expect(includes[0].includeMember).toBe('SQLCA'); + expect(includes[1].includeMember).toBe('CUSTDCL'); + }); + + it('EXEC SQL INCLUDE handles quoted and underscored member names', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + " EXEC SQL INCLUDE 'DBRMLIB.MEMBER' END-EXEC.", + ' EXEC SQL INCLUDE CUST_TBL_DCL END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const includes = r.execSqlBlocks.filter(b => b.includeMember); + expect(includes).toHaveLength(2); + expect(includes[0].includeMember).toBe('DBRMLIB.MEMBER'); + expect(includes[1].includeMember).toBe('CUST_TBL_DCL'); + }); + + it('CALL USING extracts parameters', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'AUDITLOG' USING WS-CUST-ID WS-AMOUNT.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-CUST-ID', 'WS-AMOUNT']); + }); + + it('CALL USING filters BY REFERENCE/CONTENT/VALUE keywords', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM' USING BY REFERENCE WS-A BY CONTENT WS-B BY VALUE WS-C.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls[0].parameters).toEqual(['WS-A', 'WS-B', 'WS-C']); + }); + + it('CALL USING filters ADDRESS OF and OMITTED', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM' USING ADDRESS OF WS-REC OMITTED WS-FLAG.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls[0].parameters).toEqual(['WS-REC', 'WS-FLAG']); + }); + + it('CALL RETURNING extracts return target', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'FUNC' USING WS-INPUT RETURNING WS-RESULT.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls[0].parameters).toEqual(['WS-INPUT']); + expect(r.calls[0].returning).toBe('WS-RESULT'); + }); + + it('OCCURS DEPENDING ON captures controlling field', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-COUNT PIC 9(4).', + ' 01 WS-TABLE OCCURS 1 TO 100 DEPENDING ON WS-COUNT.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const table = r.dataItems.find(d => d.name === 'WS-TABLE'); + expect(table).toBeDefined(); + expect(table!.dependingOn).toBe('WS-COUNT'); + expect(table!.occurs).toBe(1); + }); + + it('VALUE clause extracts quoted string', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + " 01 WS-STATUS PIC X VALUE 'A'.", + ' 01 WS-COUNT PIC 9(4) VALUE 0.', + ' 01 WS-NAME PIC X(10) VALUE SPACES.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.dataItems.find(d => d.name === 'WS-STATUS')?.values).toEqual(['A']); + expect(r.dataItems.find(d => d.name === 'WS-COUNT')?.values).toEqual(['0']); + expect(r.dataItems.find(d => d.name === 'WS-NAME')?.values).toEqual(['SPACES']); + }); + }); + + // ------------------------------------------------------------------------- + // Phase 2: IMS + Error Handling Features + // ------------------------------------------------------------------------- + describe('Phase 2: IMS + Error Handling Features', () => { + + it('EXEC DLI GU extracts verb, segment, PCB, and INTO', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC DLI GU USING PCB(2)', + ' SEGMENT(CUSTOMER)', + ' INTO(CUST-IO-AREA)', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execDliBlocks).toHaveLength(1); + expect(r.execDliBlocks[0].verb).toBe('GU'); + expect(r.execDliBlocks[0].pcbNumber).toBe(2); + expect(r.execDliBlocks[0].segmentName).toBe('CUSTOMER'); + expect(r.execDliBlocks[0].intoField).toBe('CUST-IO-AREA'); + }); + + it('EXEC DLI ISRT extracts FROM field', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC DLI ISRT USING PCB(1)', + ' SEGMENT(ORDER)', + ' FROM(ORDER-IO-AREA)', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execDliBlocks[0].verb).toBe('ISRT'); + expect(r.execDliBlocks[0].fromField).toBe('ORDER-IO-AREA'); + }); + + it('EXEC DLI SCHD extracts PSB name', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC DLI SCHD PSB(CUSTPSB) END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execDliBlocks[0].verb).toBe('SCHD'); + expect(r.execDliBlocks[0].psbName).toBe('CUSTPSB'); + }); + + it('DECLARATIVES USE AFTER EXCEPTION extracts file binding', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' DECLARATIVES.', + ' CUST-ERR SECTION.', + ' USE AFTER STANDARD ERROR ON CUSTOMER-FILE.', + ' CUST-ERR-PARA.', + ' DISPLAY "FILE ERROR".', + ' END DECLARATIVES.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.declaratives).toHaveLength(1); + expect(r.declaratives[0].sectionName).toBe('CUST-ERR'); + expect(r.declaratives[0].target).toBe('CUSTOMER-FILE'); + }); + + it('DECLARATIVES with multiple USE sections', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' DECLARATIVES.', + ' ERR-A SECTION.', + ' USE AFTER STANDARD EXCEPTION ON FILE-A.', + ' ERR-A-PARA.', + ' DISPLAY "A".', + ' ERR-B SECTION.', + ' USE AFTER STANDARD EXCEPTION ON INPUT.', + ' ERR-B-PARA.', + ' DISPLAY "B".', + ' END DECLARATIVES.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.declaratives).toHaveLength(2); + expect(r.declaratives[0].target).toBe('FILE-A'); + expect(r.declaratives[1].target).toBe('INPUT'); + }); + + it('SET condition TO TRUE extracts targets', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SET END-OF-FILE TO TRUE.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sets).toHaveLength(1); + expect(r.sets[0].form).toBe('to-true'); + expect(r.sets[0].targets).toEqual(['END-OF-FILE']); + }); + + it('SET index UP BY extracts target and value', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SET IDX-1 UP BY 1.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sets).toHaveLength(1); + expect(r.sets[0].form).toBe('up-by'); + expect(r.sets[0].targets).toEqual(['IDX-1']); + expect(r.sets[0].value).toBe('1'); + }); + + it('INSPECT TALLYING extracts field and counter', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " INSPECT WS-STRING TALLYING WS-COUNT FOR ALL 'A'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].inspectedField).toBe('WS-STRING'); + expect(r.inspects[0].counters).toEqual(['WS-COUNT']); + expect(r.inspects[0].form).toBe('tallying'); + }); + + it('INSPECT REPLACING detected', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " INSPECT WS-FIELD REPLACING ALL 'A' BY 'B'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].form).toBe('replacing'); + }); + }); + + // ------------------------------------------------------------------------- + // Phase 3-4: Completeness + Niche Features + // ------------------------------------------------------------------------- + describe('Phase 3-4: Completeness + Niche Features', () => { + + it('SELECT OPTIONAL sets isOptional flag', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' ENVIRONMENT DIVISION.', + ' INPUT-OUTPUT SECTION.', + ' FILE-CONTROL.', + " SELECT OPTIONAL CUST-FILE ASSIGN TO 'CUSTFILE'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.fileDeclarations).toHaveLength(1); + expect(r.fileDeclarations[0].selectName).toBe('CUST-FILE'); + expect(r.fileDeclarations[0].isOptional).toBe(true); + }); + + it('ALTERNATE RECORD KEY extraction', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' ENVIRONMENT DIVISION.', + ' INPUT-OUTPUT SECTION.', + ' FILE-CONTROL.', + " SELECT CUST-FILE ASSIGN TO 'CUSTFILE'", + ' RECORD KEY IS CUST-ID', + ' ALTERNATE RECORD KEY IS CUST-NAME.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.fileDeclarations[0].recordKey).toBe('CUST-ID'); + expect(r.fileDeclarations[0].alternateKeys).toEqual(['CUST-NAME']); + }); + + it('PROGRAM-ID IS COMMON sets isCommon flag', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER-PGM.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' STOP RUN.', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER-PGM IS COMMON.', + ' PROCEDURE DIVISION.', + ' INNER-PARA.', + ' STOP RUN.', + ' END PROGRAM INNER-PGM.', + ' END PROGRAM OUTER-PGM.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const inner = r.programs.find(p => p.name === 'INNER-PGM'); + expect(inner).toBeDefined(); + expect(inner!.isCommon).toBe(true); + const outer = r.programs.find(p => p.name === 'OUTER-PGM'); + expect(outer!.isCommon).toBeFalsy(); + }); + + it('IS EXTERNAL and IS GLOBAL as boolean properties', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-SHARED PIC X(10) IS EXTERNAL.', + ' 01 WS-GLOBAL PIC X(10) IS GLOBAL.', + ' 01 WS-NORMAL PIC X(10).', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.dataItems.find(d => d.name === 'WS-SHARED')?.isExternal).toBe(true); + expect(r.dataItems.find(d => d.name === 'WS-GLOBAL')?.isGlobal).toBe(true); + expect(r.dataItems.find(d => d.name === 'WS-NORMAL')?.isExternal).toBeUndefined(); + expect(r.dataItems.find(d => d.name === 'WS-NORMAL')?.isGlobal).toBeUndefined(); + }); + + it('INITIALIZE extracts target', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' INITIALIZE WS-RECORD.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.initializes).toHaveLength(1); + expect(r.initializes[0].target).toBe('WS-RECORD'); + }); + + it('AUTHOR and DATE-WRITTEN mapped to programMetadata', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' AUTHOR. JOHN DOE.', + ' DATE-WRITTEN. 2026-03-26.', + ' DATE-COMPILED. 2026-03-26.', + ' INSTALLATION. MAINFRAME-01.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programMetadata.author).toBe('JOHN DOE'); + expect(r.programMetadata.dateWritten).toBe('2026-03-26'); + expect(r.programMetadata.dateCompiled).toBe('2026-03-26'); + expect(r.programMetadata.installation).toBe('MAINFRAME-01'); + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: Multi-line CALL USING accumulation + // ------------------------------------------------------------------------- + describe('Multi-line CALL USING accumulation', () => { + + it('captures USING parameters on separate lines (IBM mainframe style)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'CUSTUPDT'", + ' USING BY REFERENCE WS-CUST-ID', + ' WS-CUST-NAME', + ' WS-CUST-ADDR.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('CUSTUPDT'); + expect(r.calls[0].parameters).toEqual(['WS-CUST-ID', 'WS-CUST-NAME', 'WS-CUST-ADDR']); + }); + + it('does NOT absorb next statement as USING parameter (no END-CALL)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'CUSTUPDT'", + ' USING WS-PARM.', + ' INSPECT WS-STATUS TALLYING WS-CNT FOR ALL SPACES.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-PARM']); + // INSPECT should be extracted separately, not absorbed + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].inspectedField).toBe('WS-STATUS'); + }); + + it('does NOT absorb GO TO on next line', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'CUSTUPDT'", + ' USING WS-PARM.', + ' GO TO EXIT-PARAGRAPH.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('CUSTUPDT'); + expect(r.gotos).toHaveLength(1); + expect(r.gotos[0].target).toBe('EXIT-PARAGRAPH'); + }); + + it('does NOT create false paragraph from last USING parameter on own line', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-A', + ' WS-B.', + ' PERFORM NEXT-PARA.', + ' NEXT-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // WS-B should NOT be a paragraph + const paraNames = r.paragraphs.map(p => p.name); + expect(paraNames).toContain('MAIN-PARA'); + expect(paraNames).toContain('NEXT-PARA'); + expect(paraNames).not.toContain('WS-B'); + // WS-B should be captured as USING parameter + expect(r.calls[0].parameters).toContain('WS-B'); + }); + + it('handles CALL with END-CALL scope terminator', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM' USING WS-A", + ' ON EXCEPTION', + ' DISPLAY "ERROR"', + ' END-CALL', + ' PERFORM NEXT-STEP.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-A']); + expect(r.performs).toHaveLength(1); + expect(r.performs[0].target).toBe('NEXT-STEP'); + }); + + it('does NOT false-flush on hyphenated identifiers like MOVE-COUNT', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING MOVE-COUNT', + ' PERFORM-LIMIT', + ' READ-STATUS.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls[0].parameters).toEqual(['MOVE-COUNT', 'PERFORM-LIMIT', 'READ-STATUS']); + }); + + it('captures both quoted and dynamic CALL on same line (ON EXCEPTION)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PRIMARY' ON EXCEPTION CALL WS-FALLBACK.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(2); + expect(r.calls[0].target).toBe('PRIMARY'); + expect(r.calls[0].isQuoted).toBe(true); + expect(r.calls[1].target).toBe('WS-FALLBACK'); + expect(r.calls[1].isQuoted).toBe(false); + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: Nested program edge attribution + // ------------------------------------------------------------------------- + describe('Nested program edge attribution', () => { + + it('CALL in inner nested program attributed to inner module (not outer)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER-PGM.', + ' PROCEDURE DIVISION.', + ' OUTER-MAIN.', + ' STOP RUN.', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER-PGM.', + ' PROCEDURE DIVISION.', + ' INNER-MAIN.', + " CALL 'SUBPROG'.", + ' END PROGRAM INNER-PGM.', + ' END PROGRAM OUTER-PGM.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // The CALL should have line number within INNER-PGM's range + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('SUBPROG'); + const innerProg = r.programs.find(p => p.name === 'INNER-PGM'); + expect(innerProg).toBeDefined(); + expect(r.calls[0].line).toBe(10); // Line 10 in the fixture: CALL 'SUBPROG'. + }); + + it('PERFORM before first paragraph in nested program has correct caller', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER-PGM.', + ' PROCEDURE DIVISION.', + ' OUTER-MAIN.', + ' STOP RUN.', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER-PGM.', + ' PROCEDURE DIVISION.', + ' PERFORM INNER-INIT.', + ' INNER-INIT.', + ' STOP RUN.', + ' END PROGRAM INNER-PGM.', + ' END PROGRAM OUTER-PGM.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // PERFORM before first paragraph — caller should be null (module-level) + const innerPerform = r.performs.find(p => p.target === 'INNER-INIT'); + expect(innerPerform).toBeDefined(); + expect(innerPerform!.caller).toBeNull(); + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: CRLF / Windows line ending compatibility + // ------------------------------------------------------------------------- + describe('CRLF / Windows line ending compatibility', () => { + + it('GO TO DEPENDING ON works with CRLF line endings', () => { + // Simulate CRLF by using \r\n + const src = [ + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' GO TO PARA-A PARA-B PARA-C', + ' DEPENDING ON WS-SWITCH.', + ].join('\r\n'); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.gotos).toHaveLength(3); + expect(r.gotos.map(g => g.target).sort()).toEqual(['PARA-A', 'PARA-B', 'PARA-C']); + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: Fixed-format Area A paragraph detection + // ------------------------------------------------------------------------- + describe('Fixed-format Area A paragraph detection', () => { + + it('rejects deeply-indented identifiers as paragraphs (Area B)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' WS-CUST-ADDR.', // Area B (>7 spaces) — NOT a paragraph + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const paraNames = r.paragraphs.map(p => p.name); + expect(paraNames).toContain('MAIN-PARA'); + expect(paraNames).not.toContain('WS-CUST-ADDR'); + }); + + it('accepts Area A indented paragraphs (7 spaces)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' REAL-PARA.', // 7 spaces — Area A, valid paragraph + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.paragraphs.map(p => p.name)).toContain('REAL-PARA'); + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: SORT/MERGE edge cases + // ------------------------------------------------------------------------- + describe('SORT/MERGE edge cases', () => { + + it('captures SORT GIVING without spurious COLLATING SEQUENCE keywords', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SORT SORT-FILE ON ASCENDING KEY SORT-KEY', + ' COLLATING SEQUENCE IS NATL', + ' USING INPUT-FILE', + ' GIVING OUTPUT-FILE.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sorts).toHaveLength(1); + expect(r.sorts[0].usingFiles).toEqual(['INPUT-FILE']); + // COLLATING, SEQUENCE, IS, NATL should NOT appear as giving files + expect(r.sorts[0].givingFiles).toEqual(['OUTPUT-FILE']); + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: PROCEDURE DIVISION USING edge cases + // ------------------------------------------------------------------------- + describe('PROCEDURE DIVISION USING edge cases', () => { + + it('excludes RETURNING value from USING parameter list', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION USING WS-INPUT RETURNING WS-RESULT.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.procedureUsing).toEqual(['WS-INPUT']); + }); + + it('pendingProcUsing not set for period-terminated PROCEDURE DIVISION', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.procedureUsing).toEqual([]); + // No spurious parameters from the first procedure line + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: Comment stripping edge cases + // ------------------------------------------------------------------------- + describe('Comment stripping edge cases', () => { + + it('pipe character inside quoted string is preserved (not treated as comment)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + " 01 WS-SEP PIC X VALUE '|'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // The data item should be extracted (not truncated by pipe) + expect(r.dataItems.find(d => d.name === 'WS-SEP')).toBeDefined(); + }); + }); + + // ------------------------------------------------------------------------- + // Reviews 9-15: SELECT OPTIONAL and ALTERNATE KEY + // ------------------------------------------------------------------------- + describe('SELECT OPTIONAL and ALTERNATE KEY', () => { + + it('SELECT OPTIONAL captures correct file name (not OPTIONAL keyword)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' ENVIRONMENT DIVISION.', + ' INPUT-OUTPUT SECTION.', + ' FILE-CONTROL.', + " SELECT OPTIONAL BACKUP-FILE ASSIGN TO 'BACKUP'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.fileDeclarations).toHaveLength(1); + expect(r.fileDeclarations[0].selectName).toBe('BACKUP-FILE'); + expect(r.fileDeclarations[0].isOptional).toBe(true); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: EXEC DLI edge cases + // ------------------------------------------------------------------------- + describe('EXEC DLI edge cases', () => { + it('EXEC DLI without SEGMENT clause (DLET/REPL)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC DLI DLET USING PCB(2) END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execDliBlocks).toHaveLength(1); + expect(r.execDliBlocks[0].verb).toBe('DLET'); + expect(r.execDliBlocks[0].segmentName).toBeUndefined(); + }); + + it('multi-line EXEC DLI accumulates correctly', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC DLI GN', + ' USING PCB(1)', + ' SEGMENT(ORDER)', + ' INTO(ORDER-IO)', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execDliBlocks).toHaveLength(1); + expect(r.execDliBlocks[0].verb).toBe('GN'); + expect(r.execDliBlocks[0].segmentName).toBe('ORDER'); + expect(r.execDliBlocks[0].intoField).toBe('ORDER-IO'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: SET statement edge cases + // ------------------------------------------------------------------------- + describe('SET statement edge cases', () => { + it('SET multiple conditions TO TRUE', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SET COND-A COND-B COND-C TO TRUE.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sets).toHaveLength(1); + expect(r.sets[0].targets).toEqual(['COND-A', 'COND-B', 'COND-C']); + expect(r.sets[0].form).toBe('to-true'); + }); + + it('SET index DOWN BY identifier', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SET IDX-1 DOWN BY WS-DECREMENT.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sets).toHaveLength(1); + expect(r.sets[0].form).toBe('down-by'); + expect(r.sets[0].value).toBe('WS-DECREMENT'); + }); + + it('SET index TO numeric value', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SET IDX-1 TO 5.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sets).toHaveLength(1); + expect(r.sets[0].form).toBe('to-value'); + expect(r.sets[0].value).toBe('5'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: INSPECT multi-line edge cases + // ------------------------------------------------------------------------- + describe('INSPECT multi-line edge cases', () => { + it('INSPECT CONVERTING on single line', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " INSPECT WS-FIELD CONVERTING 'abc' TO 'ABC'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].form).toBe('converting'); + expect(r.inspects[0].inspectedField).toBe('WS-FIELD'); + }); + + it('INSPECT TALLYING with multiple counters', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' INSPECT WS-STRING TALLYING', + " WS-CNT-A FOR ALL 'A'", + " WS-CNT-B FOR ALL 'B'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].counters).toEqual(['WS-CNT-A', 'WS-CNT-B']); + }); + + it('INSPECT combined TALLYING and REPLACING', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' INSPECT WS-DATA', + " TALLYING WS-COUNT FOR ALL 'X'", + " REPLACING ALL 'X' BY 'Y'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].form).toBe('tallying-replacing'); + }); + + it('real paragraph header during INSPECT flushes accumulator', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " INSPECT WS-FIELD REPLACING ALL 'A' BY 'B'", + ' NEXT-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // INSPECT should be flushed, NEXT-PARA should be detected + expect(r.inspects).toHaveLength(1); + expect(r.paragraphs.map(p => p.name)).toContain('NEXT-PARA'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: DECLARATIVES edge cases + // ------------------------------------------------------------------------- + describe('DECLARATIVES edge cases', () => { + it('USE AFTER without STANDARD keyword (IBM extension)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' DECLARATIVES.', + ' FILE-ERR SECTION.', + ' USE AFTER EXCEPTION ON MASTER-FILE.', + ' FILE-ERR-PARA.', + ' DISPLAY "ERROR".', + ' END DECLARATIVES.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.declaratives).toHaveLength(1); + expect(r.declaratives[0].target).toBe('MASTER-FILE'); + }); + + it('USE AFTER on I-O mode (catch-all handler)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' DECLARATIVES.', + ' IO-ERR SECTION.', + ' USE AFTER STANDARD ERROR ON I-O.', + ' IO-ERR-PARA.', + ' DISPLAY "I-O ERROR".', + ' END DECLARATIVES.', + ' MAIN-PARA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.declaratives).toHaveLength(1); + expect(r.declaratives[0].target).toBe('I-O'); + }); + + it('paragraphs after END DECLARATIVES are normal paragraphs', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' DECLARATIVES.', + ' ERR SECTION.', + ' USE AFTER STANDARD ERROR ON INPUT.', + ' ERR-PARA.', + ' DISPLAY "E".', + ' END DECLARATIVES.', + ' MAIN-PARA.', + ' PERFORM PROCESS-DATA.', + ' PROCESS-DATA.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const paraNames = r.paragraphs.map(p => p.name); + expect(paraNames).toContain('ERR-PARA'); + expect(paraNames).toContain('MAIN-PARA'); + expect(paraNames).toContain('PROCESS-DATA'); + expect(r.performs).toHaveLength(1); + expect(r.performs[0].target).toBe('PROCESS-DATA'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: COPY REPLACING edge cases + // ------------------------------------------------------------------------- + describe('COPY REPLACING edge cases', () => { + it('pseudotext replacement with empty target (deletion)', () => { + const replacings = parseReplacingClause('==OLD-TEXT== BY ===='); + expect(replacings).toHaveLength(1); + expect(replacings[0].from).toBe('OLD-TEXT'); + expect(replacings[0].to).toBe(''); + expect(replacings[0].isPseudotext).toBe(true); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: Value clause edge cases + // ------------------------------------------------------------------------- + describe('Value clause edge cases', () => { + it('VALUE with hex literal', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + " 01 WS-HEX PIC X(4) VALUE X'F1F2F3F4'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const hex = r.dataItems.find(d => d.name === 'WS-HEX'); + expect(hex).toBeDefined(); + expect(hex!.values).toBeDefined(); + expect(hex!.values![0]).toContain('F1F2F3F4'); + }); + + it('VALUE with negative numeric', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-NEG PIC S9(4) VALUE -1.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.dataItems.find(d => d.name === 'WS-NEG')?.values).toEqual(['-1']); + }); + + it('VALUE with ALL literal', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + " 01 WS-STARS PIC X(80) VALUE ALL '*'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const stars = r.dataItems.find(d => d.name === 'WS-STARS'); + expect(stars?.values).toBeDefined(); + expect(stars!.values![0]).toContain('*'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: OCCURS DEPENDING ON edge cases + // ------------------------------------------------------------------------- + describe('OCCURS DEPENDING ON edge cases', () => { + it('OCCURS with TO range', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-CNT PIC 9(4).', + ' 01 WS-TBL OCCURS 1 TO 50 DEPENDING ON WS-CNT.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const tbl = r.dataItems.find(d => d.name === 'WS-TBL'); + expect(tbl?.occurs).toBe(1); + expect(tbl?.dependingOn).toBe('WS-CNT'); + }); + + it('OCCURS without DEPENDING ON (fixed-size)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-ARR OCCURS 10.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.dataItems.find(d => d.name === 'WS-ARR')?.occurs).toBe(10); + expect(r.dataItems.find(d => d.name === 'WS-ARR')?.dependingOn).toBeUndefined(); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: Dynamic CALL edge cases + // ------------------------------------------------------------------------- + describe('Dynamic CALL edge cases', () => { + it('dynamic CALL at end of line (no trailing space or period)', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' CALL WS-PROGRAM', + ' USING WS-DATA.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('WS-PROGRAM'); + expect(r.calls[0].isQuoted).toBe(false); + }); + + it('CANCEL at end of line', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' CANCEL WS-OLD-PROG.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.cancels).toHaveLength(1); + expect(r.cancels[0].target).toBe('WS-OLD-PROG'); + expect(r.cancels[0].isQuoted).toBe(false); + }); + + it('multiple CANCELs on same line', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CANCEL 'PROG-A' CANCEL 'PROG-B'.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.cancels).toHaveLength(2); + expect(r.cancels[0].target).toBe('PROG-A'); + expect(r.cancels[1].target).toBe('PROG-B'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: EXEC SQL edge cases + // ------------------------------------------------------------------------- + describe('EXEC SQL edge cases', () => { + it('EXEC SQL INCLUDE does not extract tables', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' EXEC SQL INCLUDE SQLCA END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execSqlBlocks).toHaveLength(1); + expect(r.execSqlBlocks[0].includeMember).toBe('SQLCA'); + expect(r.execSqlBlocks[0].tables).toHaveLength(0); + }); + + it('EXEC SQL SELECT INTO host variable does not capture as table', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC SQL', + ' SELECT CUST_NAME INTO :WS-NAME', + ' FROM CUSTOMER', + ' WHERE CUST_ID = :WS-ID', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execSqlBlocks).toHaveLength(1); + // CUSTOMER should be a table, :WS-NAME should NOT + expect(r.execSqlBlocks[0].tables).toContain('CUSTOMER'); + expect(r.execSqlBlocks[0].tables).not.toContain('WS-NAME'); + }); + + it('EXEC SQL with host variables extracted', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC SQL', + ' UPDATE CUSTOMER SET BALANCE = :WS-AMT', + ' WHERE CUST_ID = :WS-ID', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.execSqlBlocks[0].hostVariables).toContain('WS-AMT'); + expect(r.execSqlBlocks[0].hostVariables).toContain('WS-ID'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: INITIALIZE extraction + // ------------------------------------------------------------------------- + describe('INITIALIZE extraction', () => { + it('INITIALIZE extracts target field', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' INITIALIZE WS-CUSTOMER-REC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.initializes).toHaveLength(1); + expect(r.initializes[0].target).toBe('WS-CUSTOMER-REC'); + expect(r.initializes[0].caller).toBe('MAIN-PARA'); + }); + + it('INITIALIZE multi-target extracts all targets', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' INITIALIZE WS-CUSTOMER WS-ORDER WS-LINE-ITEM.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.initializes).toHaveLength(3); + expect(r.initializes.map(i => i.target)).toEqual(['WS-CUSTOMER', 'WS-ORDER', 'WS-LINE-ITEM']); + }); + + it('INITIALIZE with REPLACING clause does not capture keywords as targets', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' INITIALIZE WS-RECORD REPLACING NUMERIC BY ZEROS.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.initializes).toHaveLength(1); + expect(r.initializes[0].target).toBe('WS-RECORD'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: Nested program boundary tracking + // ------------------------------------------------------------------------- + describe('Nested program boundary tracking', () => { + it('sibling programs after END PROGRAM are correctly scoped', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER.', + ' PROCEDURE DIVISION.', + ' OUTER-MAIN.', + ' STOP RUN.', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER-A.', + ' PROCEDURE DIVISION.', + ' A-MAIN.', + ' STOP RUN.', + ' END PROGRAM INNER-A.', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER-B.', + ' PROCEDURE DIVISION.', + ' B-MAIN.', + ' STOP RUN.', + ' END PROGRAM INNER-B.', + ' END PROGRAM OUTER.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programs).toHaveLength(3); + expect(r.programs.map(p => p.name).sort()).toEqual(['INNER-A', 'INNER-B', 'OUTER']); + const innerA = r.programs.find(p => p.name === 'INNER-A')!; + const innerB = r.programs.find(p => p.name === 'INNER-B')!; + expect(innerA.endLine).toBe(11); // END PROGRAM INNER-A + expect(innerB.startLine).toBe(13); // PROGRAM-ID. INNER-B + }); + + it('PROGRAM-ID without IDENTIFICATION DIVISION header detected', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER.', + ' PROCEDURE DIVISION.', + ' OUTER-MAIN.', + ' STOP RUN.', + ' PROGRAM-ID. SIBLING.', + ' PROCEDURE DIVISION.', + ' SIB-MAIN.', + ' STOP RUN.', + ' END PROGRAM SIBLING.', + ' END PROGRAM OUTER.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const names = r.programs.map(p => p.name); + expect(names).toContain('SIBLING'); + expect(names).toContain('OUTER'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: EXEC block EOF flush + // ------------------------------------------------------------------------- + describe('EXEC block EOF flush', () => { + it('unclosed EXEC SQL is flushed at EOF', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' EXEC SQL', + ' SELECT * FROM CUSTOMER', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // Should still extract even without END-EXEC + expect(r.execSqlBlocks).toHaveLength(1); + expect(r.execSqlBlocks[0].tables).toContain('CUSTOMER'); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: Multi-PERFORM on same line + // ------------------------------------------------------------------------- + describe('Multi-PERFORM on same line', () => { + it('captures both PERFORMs in IF/ELSE on single line', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' IF WS-FLAG = 1 PERFORM PARA-A ELSE PERFORM PARA-B END-IF.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const targets = r.performs.map(p => p.target).sort(); + expect(targets).toEqual(['PARA-A', 'PARA-B']); + }); + }); + + // ------------------------------------------------------------------------- + // Regression: Data item IS EXTERNAL / IS GLOBAL + // ------------------------------------------------------------------------- + describe('Data item IS EXTERNAL / IS GLOBAL', () => { + it('IS EXTERNAL does not pollute usage string', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-SHARED PIC X(10) USAGE DISPLAY IS EXTERNAL.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const item = r.dataItems.find(d => d.name === 'WS-SHARED'); + expect(item?.isExternal).toBe(true); + // usage should NOT contain 'external' as a string suffix + expect(item?.usage).toBe('DISPLAY'); + }); + }); + + // ------------------------------------------------------------------------- + // Accumulator flush on division transitions + // ------------------------------------------------------------------------- + describe('Accumulator flush on division transitions', () => { + + it('callAccum flushed when EXEC SQL interrupts multi-line CALL', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'SUBPROG'", + ' USING WS-PARM', + ' EXEC SQL', + ' SELECT * FROM CUSTOMER', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // CALL should be extracted with USING parameters (flushed before EXEC SQL) + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('SUBPROG'); + expect(r.calls[0].parameters).toEqual(['WS-PARM']); + // EXEC SQL should also be extracted + expect(r.execSqlBlocks).toHaveLength(1); + expect(r.execSqlBlocks[0].tables).toContain('CUSTOMER'); + }); + + it('callAccum flushed when EXEC CICS interrupts multi-line CALL', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'SUBPROG'", + ' USING WS-DATA', + ' EXEC CICS', + " LINK PROGRAM('AUDITLOG')", + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-DATA']); + expect(r.execCicsBlocks).toHaveLength(1); + }); + + it('callAccum flushed when EXEC DLI interrupts multi-line CALL', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'SUBPROG'", + ' USING WS-KEY', + ' EXEC DLI GU', + ' USING PCB(1)', + ' SEGMENT(CUSTOMER)', + ' END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-KEY']); + expect(r.execDliBlocks).toHaveLength(1); + expect(r.execDliBlocks[0].verb).toBe('GU'); + }); + + it('all accumulators flushed on division transition', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER-PGM.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'SUBPROG'", + ' USING WS-DATA', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER-PGM.', + ' PROCEDURE DIVISION.', + ' INNER-MAIN.', + ' STOP RUN.', + ' END PROGRAM INNER-PGM.', + ' END PROGRAM OUTER-PGM.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // CALL should be flushed before the new IDENTIFICATION DIVISION + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('SUBPROG'); + // Both programs should be detected + expect(r.programs.map(p => p.name).sort()).toEqual(['INNER-PGM', 'OUTER-PGM']); + }); + }); + + // ------------------------------------------------------------------------- + // Free-format COBOL handling + // ------------------------------------------------------------------------- + describe('Free-format COBOL handling', () => { + + it('free-format source detected via >>SOURCE FREE', () => { + const src = [ + '>>SOURCE FORMAT IS FREE', + 'IDENTIFICATION DIVISION.', + 'PROGRAM-ID. FREEPROG.', + 'PROCEDURE DIVISION.', + 'MAIN-PARA.', + ' PERFORM PROCESS-DATA.', + 'PROCESS-DATA.', + ' STOP RUN.', + ].join('\n'); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programName).toBe('FREEPROG'); + expect(r.paragraphs).toHaveLength(2); + expect(r.performs).toHaveLength(1); + }); + + it('free-format *> comments stripped but not inside quotes', () => { + const src = [ + '>>SOURCE FREE', + 'IDENTIFICATION DIVISION.', + 'PROGRAM-ID. TESTPROG.', + 'DATA DIVISION.', + 'WORKING-STORAGE SECTION.', + '01 WS-DATA PIC X(10). *> this is a comment', + ].join('\n'); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.dataItems.find(d => d.name === 'WS-DATA')).toBeDefined(); + }); + }); + + // ------------------------------------------------------------------------- + // CANCEL extraction in CALL ON EXCEPTION block + // ------------------------------------------------------------------------- + describe('CANCEL extraction in CALL ON EXCEPTION block', () => { + + it('CANCEL inside CALL END-CALL block is extracted', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'MAINPROG'", + ' USING WS-DATA', + ' ON EXCEPTION', + " CANCEL 'MAINPROG'", + " CALL 'BACKUP-PGM'", + ' END-CALL.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // Both CALLs should be captured + expect(r.calls).toHaveLength(2); + expect(r.calls.map(c => c.target).sort()).toEqual(['BACKUP-PGM', 'MAINPROG']); + // CANCEL should be captured from within the CALL block + expect(r.cancels).toHaveLength(1); + expect(r.cancels[0].target).toBe('MAINPROG'); + }); + }); + + // ------------------------------------------------------------------------- + // SORT INPUT PROCEDURE THRU range + // ------------------------------------------------------------------------- + describe('SORT INPUT PROCEDURE THRU range', () => { + + it('captures both start and thru target', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SORT SORT-FILE ON ASCENDING KEY SORT-KEY', + ' INPUT PROCEDURE IS BUILD-INPUT THRU BUILD-END', + ' OUTPUT PROCEDURE IS WRITE-OUTPUT.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + // INPUT PROCEDURE should produce a perform with thruTarget + const inputProc = r.performs.find(p => p.target === 'BUILD-INPUT'); + expect(inputProc).toBeDefined(); + expect(inputProc!.thruTarget).toBe('BUILD-END'); + // OUTPUT PROCEDURE should be captured too + expect(r.performs.find(p => p.target === 'WRITE-OUTPUT')).toBeDefined(); + }); + }); + + // ------------------------------------------------------------------------- + // Shared verb constant coverage + // ------------------------------------------------------------------------- + describe('Shared verb constant coverage', () => { + + it('COBOL_STATEMENT_VERBS flush trigger works for all major verbs', () => { + // Test that each verb in the shared constant terminates callAccum + const verbs = [ + 'PERFORM NEXT-PARA.', 'MOVE WS-A TO WS-B.', 'DISPLAY "HELLO".', + 'GO TO EXIT-PARA.', 'INSPECT WS-X REPLACING ALL SPACES BY ZEROS.', + 'SET WS-FLAG TO TRUE.', 'INITIALIZE WS-REC.', 'CANCEL WS-OLD.', + ]; + for (const verb of verbs) { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-PARM', + ` ${verb}`, + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls.length).toBe(1); + expect(r.calls[0].parameters).toEqual(['WS-PARM']); + } + }); + }); + + // ------------------------------------------------------------------------- + // EXEC SQL INCLUDE edge cases + // ------------------------------------------------------------------------- + describe('EXEC SQL INCLUDE edge cases', () => { + + it('multiple EXEC SQL INCLUDEs extracted', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' EXEC SQL INCLUDE SQLCA END-EXEC.', + ' EXEC SQL INCLUDE SQLDA END-EXEC.', + ' EXEC SQL INCLUDE CUSTDCL END-EXEC.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + const includes = r.execSqlBlocks.filter(b => b.includeMember); + expect(includes).toHaveLength(3); + expect(includes.map(i => i.includeMember).sort()).toEqual(['CUSTDCL', 'SQLCA', 'SQLDA']); + }); + }); + + // ------------------------------------------------------------------------- + // Complete COBOL program integration + // ------------------------------------------------------------------------- + describe('Complete COBOL program integration', () => { + + it('extracts all construct types from a realistic program', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. FULLTEST.', + ' AUTHOR. TEST AUTHOR.', + ' ENVIRONMENT DIVISION.', + ' INPUT-OUTPUT SECTION.', + ' FILE-CONTROL.', + " SELECT CUST-FILE ASSIGN TO 'CUSTFILE'", + ' ORGANIZATION IS INDEXED', + ' ACCESS IS DYNAMIC', + ' RECORD KEY IS CUST-ID.', + ' DATA DIVISION.', + ' WORKING-STORAGE SECTION.', + ' 01 WS-COUNT PIC 9(4) VALUE 0.', + ' 01 WS-TABLE OCCURS 10 DEPENDING ON WS-COUNT.', + ' 01 WS-FLAG PIC 9 VALUE 0.', + ' 88 END-OF-FILE VALUE 1.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' PERFORM PROCESS-DATA', + ' SET END-OF-FILE TO TRUE', + " CALL 'SUBPROG' USING WS-COUNT.", + ' PROCESS-DATA.', + " INSPECT WS-FLAG REPLACING ALL '0' BY '1'.", + ' INITIALIZE WS-TABLE.', + ' STOP RUN.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.programName).toBe('FULLTEST'); + expect(r.programMetadata.author).toBe('TEST AUTHOR'); + expect(r.fileDeclarations).toHaveLength(1); + expect(r.fileDeclarations[0].organization).toBe('INDEXED'); + expect(r.dataItems.find(d => d.name === 'WS-COUNT')?.values).toEqual(['0']); + expect(r.dataItems.find(d => d.name === 'WS-TABLE')?.dependingOn).toBe('WS-COUNT'); + expect(r.paragraphs).toHaveLength(2); + expect(r.performs).toHaveLength(1); + expect(r.sets).toHaveLength(1); + expect(r.sets[0].form).toBe('to-true'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-COUNT']); + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].form).toBe('replacing'); + expect(r.initializes).toHaveLength(1); + }); + }); + + // ------------------------------------------------------------------------- + // Accumulator flush at END PROGRAM boundary + // ------------------------------------------------------------------------- + describe('Accumulator flush at END PROGRAM boundary', () => { + + it('multi-line CALL flushed at END PROGRAM', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER.', + ' PROCEDURE DIVISION.', + ' OUTER-MAIN.', + " CALL 'SUBPROG'", + ' USING WS-DATA', + ' END PROGRAM OUTER.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('SUBPROG'); + expect(r.calls[0].parameters).toEqual(['WS-DATA']); + }); + + it('multi-line CALL flushed at END PROGRAM in nested programs', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER.', + ' PROCEDURE DIVISION.', + ' OUTER-MAIN.', + ' STOP RUN.', + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. INNER.', + ' PROCEDURE DIVISION.', + ' INNER-MAIN.', + " CALL 'INNERSUB'", + ' USING WS-INNER-DATA', + ' END PROGRAM INNER.', + ' END PROGRAM OUTER.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('INNERSUB'); + expect(r.calls[0].parameters).toEqual(['WS-INNER-DATA']); + expect(r.programs).toHaveLength(2); + }); + }); + + // ------------------------------------------------------------------------- + // Accumulator flush at PROGRAM-ID sibling boundary + // ------------------------------------------------------------------------- + describe('Accumulator flush at PROGRAM-ID sibling boundary', () => { + + it('multi-line CALL flushed when sibling PROGRAM-ID appears without ID DIVISION', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. OUTER.', + ' PROCEDURE DIVISION.', + ' OUTER-MAIN.', + " CALL 'OUTERSUB'", + ' USING WS-OUTER', + ' PROGRAM-ID. SIBLING.', + ' PROCEDURE DIVISION.', + ' SIB-MAIN.', + ' STOP RUN.', + ' END PROGRAM SIBLING.', + ' END PROGRAM OUTER.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].target).toBe('OUTERSUB'); + expect(r.calls[0].parameters).toEqual(['WS-OUTER']); + const names = r.programs.map(p => p.name); + expect(names).toContain('SIBLING'); + }); + }); + + // ------------------------------------------------------------------------- + // Accumulator flush on arithmetic verb boundaries + // ------------------------------------------------------------------------- + describe('Accumulator flush on arithmetic verb boundaries', () => { + + it('COMPUTE terminates multi-line CALL accumulation', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-INPUT', + ' COMPUTE WS-TOTAL = WS-A + WS-B.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-INPUT']); + }); + + it('ADD terminates multi-line CALL accumulation', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-AMT', + ' ADD WS-AMT TO WS-TOTAL.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-AMT']); + }); + + it('SUBTRACT terminates multi-line CALL accumulation', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-VAL', + ' SUBTRACT WS-DISCOUNT FROM WS-TOTAL.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-VAL']); + }); + + it('MULTIPLY terminates multi-line CALL accumulation', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-QTY', + ' MULTIPLY WS-PRICE BY WS-QTY.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-QTY']); + }); + + it('DIVIDE terminates multi-line CALL accumulation', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-TOTAL', + ' DIVIDE WS-TOTAL BY WS-COUNT GIVING WS-AVG.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-TOTAL']); + }); + + it('STRING terminates multi-line CALL accumulation', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-NAME', + " STRING WS-FIRST DELIMITED BY SIZE INTO WS-FULL.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-NAME']); + }); + + it('UNSTRING terminates multi-line CALL accumulation', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM'", + ' USING WS-LINE', + " UNSTRING WS-LINE DELIMITED BY ',' INTO WS-A WS-B.", + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + expect(r.calls[0].parameters).toEqual(['WS-LINE']); + }); + }); + + // ------------------------------------------------------------------------- + // Arithmetic verbs not captured as false USING parameters + // ------------------------------------------------------------------------- + describe('Arithmetic verbs not captured as false USING parameters', () => { + + it('COMPUTE after CALL USING does not pollute parameters', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + " CALL 'PGM' USING WS-INPUT.", + ' COMPUTE WS-RESULT = WS-A * WS-B.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.calls).toHaveLength(1); + // Only WS-INPUT should be a parameter, not WS-RESULT/WS-A/WS-B + expect(r.calls[0].parameters).toEqual(['WS-INPUT']); + }); + }); + + // ------------------------------------------------------------------------- + // SORT accumulator flushed at program boundaries + // ------------------------------------------------------------------------- + describe('SORT accumulator flushed at program boundaries', () => { + + it('multi-line SORT flushed at END PROGRAM', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' SORT SORT-FILE', + ' USING INPUT-FILE', + ' END PROGRAM TESTPROG.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.sorts).toHaveLength(1); + expect(r.sorts[0].sortFile).toBe('SORT-FILE'); + expect(r.sorts[0].usingFiles).toEqual(['INPUT-FILE']); + }); + }); + + // ------------------------------------------------------------------------- + // INSPECT accumulator flushed at program boundaries + // ------------------------------------------------------------------------- + describe('INSPECT accumulator flushed at program boundaries', () => { + + it('multi-line INSPECT flushed at END PROGRAM', () => { + const src = cobol( + ' IDENTIFICATION DIVISION.', + ' PROGRAM-ID. TESTPROG.', + ' PROCEDURE DIVISION.', + ' MAIN-PARA.', + ' INSPECT WS-DATA', + " REPLACING ALL 'X' BY 'Y'", + ' END PROGRAM TESTPROG.', + ); + const r = extractCobolSymbolsWithRegex(src, 'test.cbl'); + expect(r.inspects).toHaveLength(1); + expect(r.inspects[0].inspectedField).toBe('WS-DATA'); + expect(r.inspects[0].form).toBe('replacing'); + }); + }); +}); diff --git a/gitnexus/test/unit/graph.test.ts b/gitnexus/test/unit/graph.test.ts index 4e87afc773..070a69a4b9 100644 --- a/gitnexus/test/unit/graph.test.ts +++ b/gitnexus/test/unit/graph.test.ts @@ -186,4 +186,61 @@ describe('createKnowledgeGraph', () => { g.forEachRelationship(r => types.push(r.type)); expect(types).toEqual(['CALLS']); }); + + // ─── removeRelationship ───────────────────────────────────────────── + + it('removes a relationship by id', () => { + const g = createKnowledgeGraph(); + g.addNode(makeNode('fn:a', 'a')); + g.addNode(makeNode('fn:b', 'b')); + g.addRelationship(makeRel('fn:a', 'fn:b')); + expect(g.relationshipCount).toBe(1); + + const removed = g.removeRelationship('fn:a-CALLS-fn:b'); + expect(removed).toBe(true); + expect(g.relationshipCount).toBe(0); + }); + + it('removeRelationship returns false for unknown id', () => { + const g = createKnowledgeGraph(); + expect(g.removeRelationship('nonexistent')).toBe(false); + }); + + it('removeRelationship returns false on second call with same id', () => { + const g = createKnowledgeGraph(); + g.addNode(makeNode('fn:a', 'a')); + g.addNode(makeNode('fn:b', 'b')); + g.addRelationship(makeRel('fn:a', 'fn:b')); + + expect(g.removeRelationship('fn:a-CALLS-fn:b')).toBe(true); + expect(g.removeRelationship('fn:a-CALLS-fn:b')).toBe(false); + }); + + it('removeRelationship does not affect nodes', () => { + const g = createKnowledgeGraph(); + g.addNode(makeNode('fn:a', 'a')); + g.addNode(makeNode('fn:b', 'b')); + g.addRelationship(makeRel('fn:a', 'fn:b')); + + g.removeRelationship('fn:a-CALLS-fn:b'); + expect(g.nodeCount).toBe(2); + expect(g.getNode('fn:a')).toBeDefined(); + expect(g.getNode('fn:b')).toBeDefined(); + }); + + it('removeRelationship leaves other relationships intact', () => { + const g = createKnowledgeGraph(); + g.addNode(makeNode('fn:a', 'a')); + g.addNode(makeNode('fn:b', 'b')); + g.addNode(makeNode('fn:c', 'c')); + g.addRelationship(makeRel('fn:a', 'fn:b')); + g.addRelationship(makeRel('fn:b', 'fn:c')); + expect(g.relationshipCount).toBe(2); + + g.removeRelationship('fn:a-CALLS-fn:b'); + expect(g.relationshipCount).toBe(1); + const remaining = [...g.iterRelationships()]; + expect(remaining[0].sourceId).toBe('fn:b'); + expect(remaining[0].targetId).toBe('fn:c'); + }); }); diff --git a/gitnexus/test/unit/jcl-parser.test.ts b/gitnexus/test/unit/jcl-parser.test.ts new file mode 100644 index 0000000000..f9b4bc66cd --- /dev/null +++ b/gitnexus/test/unit/jcl-parser.test.ts @@ -0,0 +1,338 @@ +import { describe, it, expect } from 'vitest'; +import { parseJcl } from '../../src/core/ingestion/cobol/jcl-parser.js'; +import type { JclParseResults } from '../../src/core/ingestion/cobol/jcl-parser.js'; + +describe('parseJcl', () => { + // ── JOB statements ────────────────────────────────────────────────── + + describe('JOB statements', () => { + it('extracts job name', () => { + const jcl = `//MYJOB JOB (ACCT),'MY JOB'`; + const r = parseJcl(jcl, 'test.jcl'); + expect(r.jobs).toHaveLength(1); + expect(r.jobs[0].name).toBe('MYJOB'); + expect(r.jobs[0].line).toBe(1); + }); + + it('extracts CLASS and MSGCLASS parameters', () => { + const jcl = `//PAYJOB JOB (ACCT),'PAYROLL',CLASS=A,MSGCLASS=X`; + const r = parseJcl(jcl, 'test.jcl'); + expect(r.jobs).toHaveLength(1); + expect(r.jobs[0].name).toBe('PAYJOB'); + expect(r.jobs[0].class).toBe('A'); + expect(r.jobs[0].msgclass).toBe('X'); + }); + + it('handles job with no CLASS or MSGCLASS', () => { + const jcl = `//BAREJOB JOB (ACCT),'BARE'`; + const r = parseJcl(jcl, 'test.jcl'); + expect(r.jobs).toHaveLength(1); + expect(r.jobs[0].name).toBe('BAREJOB'); + expect(r.jobs[0].class).toBeUndefined(); + expect(r.jobs[0].msgclass).toBeUndefined(); + }); + }); + + // ── EXEC statements ───────────────────────────────────────────────── + + describe('EXEC statements', () => { + it('extracts step with PGM=program', () => { + const jcl = [ + '//MYJOB JOB (ACCT)', + '//STEP1 EXEC PGM=IEFBR14', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.steps).toHaveLength(1); + expect(r.steps[0].name).toBe('STEP1'); + expect(r.steps[0].program).toBe('IEFBR14'); + expect(r.steps[0].proc).toBeUndefined(); + }); + + it('extracts step with proc name (no PGM= keyword)', () => { + const jcl = [ + '//MYJOB JOB (ACCT)', + '//STEP1 EXEC MYPROC', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.steps).toHaveLength(1); + expect(r.steps[0].name).toBe('STEP1'); + expect(r.steps[0].program).toBeUndefined(); + expect(r.steps[0].proc).toBe('MYPROC'); + }); + + it('associates step with current job', () => { + const jcl = [ + '//JOB1 JOB (ACCT)', + '//STEPA EXEC PGM=PROG1', + '//JOB2 JOB (ACCT)', + '//STEPB EXEC PGM=PROG2', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.steps).toHaveLength(2); + expect(r.steps[0].jobName).toBe('JOB1'); + expect(r.steps[1].jobName).toBe('JOB2'); + }); + }); + + // ── DD statements ─────────────────────────────────────────────────── + + describe('DD statements', () => { + it('extracts DD name and dataset (DSN=)', () => { + const jcl = [ + '//MYJOB JOB (ACCT)', + '//STEP1 EXEC PGM=IEFBR14', + '//INPUT DD DSN=MY.DATA.SET,DISP=SHR', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.ddStatements).toHaveLength(1); + expect(r.ddStatements[0].ddName).toBe('INPUT'); + expect(r.ddStatements[0].dataset).toBe('MY.DATA.SET'); + }); + + it('extracts DISP parameter', () => { + const jcl = [ + '//MYJOB JOB (ACCT)', + '//STEP1 EXEC PGM=IEFBR14', + '//OUTPUT DD DSN=MY.OUT,DISP=(NEW,CATLG,DELETE)', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.ddStatements).toHaveLength(1); + expect(r.ddStatements[0].disp).toBe('NEW'); + }); + + it('associates DD with current step', () => { + const jcl = [ + '//MYJOB JOB (ACCT)', + '//STEP1 EXEC PGM=PROG1', + '//DD1 DD DSN=DS1,DISP=SHR', + '//STEP2 EXEC PGM=PROG2', + '//DD2 DD DSN=DS2,DISP=SHR', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.ddStatements).toHaveLength(2); + expect(r.ddStatements[0].stepName).toBe('STEP1'); + expect(r.ddStatements[1].stepName).toBe('STEP2'); + }); + }); + + // ── PROC definitions ──────────────────────────────────────────────── + + describe('PROC definitions', () => { + it('extracts in-stream PROC with name', () => { + const jcl = [ + '//MYPROC PROC', + '//STEP1 EXEC PGM=IEFBR14', + '// PEND', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.procs).toHaveLength(1); + expect(r.procs[0].name).toBe('MYPROC'); + expect(r.procs[0].isInStream).toBe(true); + }); + + it('handles PROC/PEND pairs', () => { + const jcl = [ + '//PROC1 PROC', + '//S1 EXEC PGM=PROG1', + '// PEND', + '//PROC2 PROC', + '//S2 EXEC PGM=PROG2', + '// PEND', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.procs).toHaveLength(2); + expect(r.procs[0].name).toBe('PROC1'); + expect(r.procs[1].name).toBe('PROC2'); + }); + }); + + // ── INCLUDE / SET ─────────────────────────────────────────────────── + + describe('INCLUDE and SET', () => { + it('extracts INCLUDE MEMBER=name', () => { + const jcl = `// INCLUDE MEMBER=MYINCL`; + const r = parseJcl(jcl, 'test.jcl'); + expect(r.includes).toHaveLength(1); + expect(r.includes[0].member).toBe('MYINCL'); + expect(r.includes[0].line).toBe(1); + }); + + it('extracts SET variable=value', () => { + const jcl = `// SET ENV=PROD`; + const r = parseJcl(jcl, 'test.jcl'); + expect(r.sets).toHaveLength(1); + expect(r.sets[0].variable).toBe('ENV'); + expect(r.sets[0].value).toBe('PROD'); + }); + }); + + // ── Conditionals ──────────────────────────────────────────────────── + + describe('Conditionals', () => { + it('extracts IF condition THEN', () => { + const jcl = `// IF STEP1.RC = 0 THEN`; + const r = parseJcl(jcl, 'test.jcl'); + expect(r.conditionals).toHaveLength(1); + expect(r.conditionals[0].type).toBe('IF'); + expect(r.conditionals[0].condition).toBe('STEP1.RC = 0'); + }); + + it('extracts ELSE and ENDIF', () => { + const jcl = [ + '// IF STEP1.RC = 0 THEN', + '//GOOD EXEC PGM=GOODPGM', + '// ELSE', + '//BAD EXEC PGM=BADPGM', + '// ENDIF', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.conditionals).toHaveLength(3); + expect(r.conditionals[0].type).toBe('IF'); + expect(r.conditionals[1].type).toBe('ELSE'); + expect(r.conditionals[1].condition).toBeUndefined(); + expect(r.conditionals[2].type).toBe('ENDIF'); + expect(r.conditionals[2].condition).toBeUndefined(); + }); + }); + + // ── JCLLIB ────────────────────────────────────────────────────────── + + describe('JCLLIB', () => { + it('extracts JCLLIB ORDER=(lib1,lib2)', () => { + const jcl = `// JCLLIB ORDER=(SYS1.PROCLIB,USER.PROCLIB)`; + const r = parseJcl(jcl, 'test.jcl'); + expect(r.jcllib).toHaveLength(1); + expect(r.jcllib[0].order).toEqual(['SYS1.PROCLIB', 'USER.PROCLIB']); + expect(r.jcllib[0].line).toBe(1); + }); + }); + + // ── Continuation lines ────────────────────────────────────────────── + + describe('Continuation lines', () => { + it('joins continuation lines (col 72 non-blank + next line starts with //)', () => { + // Build a DD line that is exactly 72 chars with non-blank at col 72 (index 71). + // The continuation line provides the DISP parameter. + // "//DD1 DD DSN=MY.VERY.LONG.DATASET.NAME.THAT.KEEPS.GOING," is 60 chars. + // Pad to 71 then add non-blank at col 72. + const base = '//DD1 DD DSN=MY.VERY.LONG.DATASET.NAME.THAT.KEEPS.GOING,'; + const padding = ' '.repeat(71 - base.length); + const line1 = base + padding + 'X'; // col 72 is 'X' (non-blank) -> continuation + const line2 = '// DISP=SHR'; + const jcl = [ + '//MYJOB JOB (ACCT)', + '//STEP1 EXEC PGM=IEFBR14', + line1, + line2, + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + // The continuation should join the DD line so both DSN and DISP are parsed + expect(r.ddStatements).toHaveLength(1); + expect(r.ddStatements[0].ddName).toBe('DD1'); + expect(r.ddStatements[0].dataset).toBe('MY.VERY.LONG.DATASET.NAME.THAT.KEEPS.GOING'); + expect(r.ddStatements[0].disp).toBe('SHR'); + }); + }); + + // ── Edge cases ────────────────────────────────────────────────────── + + describe('Edge cases', () => { + it('skips JCL comments (//*)', () => { + const jcl = [ + '//* This is a comment', + '//MYJOB JOB (ACCT)', + '//* Another comment', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.jobs).toHaveLength(1); + expect(r.jobs[0].name).toBe('MYJOB'); + }); + + it('skips non-JCL lines', () => { + const jcl = [ + 'This is not a JCL line', + '//MYJOB JOB (ACCT)', + ' Some data', + '//STEP1 EXEC PGM=IEFBR14', + ].join('\n'); + const r = parseJcl(jcl, 'test.jcl'); + expect(r.jobs).toHaveLength(1); + expect(r.steps).toHaveLength(1); + }); + + it('empty input returns empty results', () => { + const r = parseJcl('', 'test.jcl'); + expect(r.jobs).toEqual([]); + expect(r.steps).toEqual([]); + expect(r.ddStatements).toEqual([]); + expect(r.procs).toEqual([]); + expect(r.includes).toEqual([]); + expect(r.sets).toEqual([]); + expect(r.jcllib).toEqual([]); + expect(r.conditionals).toEqual([]); + }); + + it('complete JCL job with multiple steps and DDs', () => { + const jcl = [ + '//* Complete payroll job', + '//PAYJOB JOB (ACCT123),\'PAYROLL RUN\',CLASS=A,MSGCLASS=X', + '// JCLLIB ORDER=(PAY.PROCLIB,SYS1.PROCLIB)', + '// SET ENV=PROD', + '// INCLUDE MEMBER=STDPARMS', + '//*', + '// IF 1 = 1 THEN', + '//STEP01 EXEC PGM=PAYEXT', + '//INPUT DD DSN=PAY.MASTER,DISP=SHR', + '//OUTPUT DD DSN=PAY.EXTRACT,DISP=(NEW,CATLG,DELETE)', + '//SYSPRINT DD SYSOUT=*', + '//*', + '//STEP02 EXEC PAYCALC', + '//INFILE DD DSN=PAY.EXTRACT,DISP=SHR', + '// ELSE', + '//STEP03 EXEC PGM=IEFBR14', + '// ENDIF', + ].join('\n'); + const r = parseJcl(jcl, 'payroll.jcl'); + + // Jobs + expect(r.jobs).toHaveLength(1); + expect(r.jobs[0]).toEqual({ + name: 'PAYJOB', + line: 2, + class: 'A', + msgclass: 'X', + }); + + // JCLLIB + expect(r.jcllib).toHaveLength(1); + expect(r.jcllib[0].order).toEqual(['PAY.PROCLIB', 'SYS1.PROCLIB']); + + // SET + expect(r.sets).toHaveLength(1); + expect(r.sets[0]).toEqual({ variable: 'ENV', value: 'PROD', line: 4 }); + + // INCLUDE + expect(r.includes).toHaveLength(1); + expect(r.includes[0].member).toBe('STDPARMS'); + + // Conditionals + expect(r.conditionals).toHaveLength(3); + expect(r.conditionals[0].type).toBe('IF'); + expect(r.conditionals[1].type).toBe('ELSE'); + expect(r.conditionals[2].type).toBe('ENDIF'); + + // Steps + expect(r.steps).toHaveLength(3); + expect(r.steps[0]).toMatchObject({ name: 'STEP01', program: 'PAYEXT', jobName: 'PAYJOB' }); + expect(r.steps[1]).toMatchObject({ name: 'STEP02', proc: 'PAYCALC', jobName: 'PAYJOB' }); + expect(r.steps[2]).toMatchObject({ name: 'STEP03', program: 'IEFBR14', jobName: 'PAYJOB' }); + + // DD statements + expect(r.ddStatements).toHaveLength(4); + expect(r.ddStatements[0]).toMatchObject({ ddName: 'INPUT', stepName: 'STEP01', dataset: 'PAY.MASTER', disp: 'SHR' }); + expect(r.ddStatements[1]).toMatchObject({ ddName: 'OUTPUT', stepName: 'STEP01', disp: 'NEW' }); + expect(r.ddStatements[2]).toMatchObject({ ddName: 'SYSPRINT', stepName: 'STEP01' }); + expect(r.ddStatements[3]).toMatchObject({ ddName: 'INFILE', stepName: 'STEP02', dataset: 'PAY.EXTRACT' }); + }); + }); +});