Merged
58 commits
4af677e
feat: add COBOL language support with regex extraction pipeline
magyargergo Mar 24, 2026
88c89c4
docs: document custom processor pattern in pipeline.ts
magyargergo Mar 24, 2026
9760f96
feat(cobol): enrich graph with EXEC SQL/CICS, ENTRY points, MOVE data…
magyargergo Mar 24, 2026
41b0d8d
test(cobol): add 26 integration tests with exact assertions + fix CIC…
magyargergo Mar 24, 2026
832789f
test(cobol): exhaustive 57-test suite with strict exact assertions
magyargergo Mar 24, 2026
49fd493
Merge remote-tracking branch 'origin' into feat/cobol-language-support
magyargergo Mar 25, 2026
f8ea9db
fix(cobol): add removeRelationship API + single-quote CALL/COPY/ENTRY…
magyargergo Mar 25, 2026
77ce6f4
fix(cobol): RE_ENTRY single-quote + remove orphan unresolved CALLS edges
magyargergo Mar 25, 2026
e8b0830
fix(cobol): Property ID collisions + O(1) Map lookup for MOVE edges
magyargergo Mar 25, 2026
52af945
feat(cobol): MOVE multi-target extraction with OF/IN qualifier filtering
magyargergo Mar 25, 2026
b2f88ca
feat(cobol): COPY IN/OF library, pseudotext REPLACING, dynamic CALL, …
magyargergo Mar 25, 2026
cd3e30b
feat(cobol): nested program support — capture multiple PROGRAM-IDs pe…
magyargergo Mar 25, 2026
5edab1e
test(cobol): expand integration tests for all new language features
magyargergo Mar 25, 2026
09996b8
fix(cobol): pseudotext REPLACING now applies correctly via isPseudote…
magyargergo Mar 25, 2026
8c72ac8
refactor(cobol): per-program scoping via boundary tracking + line-ran…
magyargergo Mar 25, 2026
00bf8d0
feat(cobol): free-format COBOL support (>>source free)
magyargergo Mar 25, 2026
1ba9540
fix(cobol): relax data item regexes for free-format (^\s+ to ^\s*)
magyargergo Mar 25, 2026
d3a38e8
feat(cobol): 100% structural feature coverage — GO TO, SCREEN, SD/RD,…
magyargergo Mar 25, 2026
7702341
feat(cobol): enriched CICS extraction — file I/O, dynamic PROGRAM, qu…
magyargergo Mar 25, 2026
46b9ffc
feat(cobol): complete CICS command extraction — all 7 expert recommen…
magyargergo Mar 25, 2026
16c9ac1
test(cobol): strict exhaustive integration tests with exact edgeSet a…
magyargergo Mar 25, 2026
42f1a65
Merge remote-tracking branch 'origin' into feat/cobol-language-support
magyargergo Mar 25, 2026
ba6aa85
fix(cobol): address 5 findings from second Claude review (compiler fr…
magyargergo Mar 25, 2026
5e4cf0d
fix(cobol): address code review findings — ReDoS fix, perf, cleanup
magyargergo Mar 25, 2026
5b8ecd2
refactor: add Cobol to SupportedLanguages with parseStrategy: standalone
magyargergo Mar 25, 2026
009ee70
fix(cobol): 5 fixes from third Claude review + 3 regression tests
magyargergo Mar 25, 2026
eb10d86
fix(cobol): address 6 findings from fourth Claude review + tests
magyargergo Mar 25, 2026
985c040
fix(cobol): address 4 findings from fifth Claude review
magyargergo Mar 25, 2026
4660da8
fix(cobol): address findings from reviews 5+6 with full test coverage
magyargergo Mar 25, 2026
513dab4
fix(cobol): address findings from seventh Claude review + 3 tests
magyargergo Mar 25, 2026
bde9956
feat(cobol): link PROCEDURE DIVISION USING to LINKAGE data items + cl…
magyargergo Mar 25, 2026
646ce62
Merge remote-tracking branch 'origin' into feat/cobol-language-support
magyargergo Mar 26, 2026
6c0e9a9
fix(cobol): resolve 48 review findings across 9 review cycles
magyargergo Mar 26, 2026
dd7e36d
Merge remote-tracking branch 'origin' into feat/cobol-language-support
magyargergo Mar 26, 2026
f078be8
docs(cobol): update documentation for ninth review cycle fixes
magyargergo Mar 26, 2026
eb05db1
fix(cobol): resolve 10th review findings — nested program edge attrib…
magyargergo Mar 26, 2026
fab3bc8
fix(cobol): resolve 10th review findings — nested program edge attrib…
magyargergo Mar 26, 2026
47ded4f
fix(cobol): resolve 11th review findings — final nested program + mul…
magyargergo Mar 26, 2026
c3a23a6
docs(cobol): deepened full language coverage plan with research findings
magyargergo Mar 26, 2026
ec02211
feat(cobol): implement Phase 1 — high-value data flow edges
magyargergo Mar 26, 2026
a9927a1
feat(cobol): implement Phase 2 — DECLARATIVES, SET, INSPECT, EXEC DLI
magyargergo Mar 26, 2026
160679d
feat(cobol): implement Phase 3 — completeness fixes
magyargergo Mar 26, 2026
4a3f483
feat(cobol): implement Phase 4 — INITIALIZE + metadata completeness
magyargergo Mar 26, 2026
b8bbda6
test(cobol): add 24 unit tests for Phase 1-4 features
magyargergo Mar 26, 2026
ffd3197
Merge remote-tracking branch 'origin/main' into feat/cobol-language-s…
magyargergo Mar 26, 2026
874e408
fix(cobol): use /\r?\n/ split for Windows CRLF compatibility
magyargergo Mar 26, 2026
fb0fc10
fix(cobol): resolve 12th review — dynamic CALL/CANCEL dedup + trailin…
magyargergo Mar 26, 2026
aa70ebf
feat(cobol): add CALL accumulator + fix SORT double-statement (#4, #6)
magyargergo Mar 26, 2026
d8c6e03
fix(cobol): resolve 13th review — CICS LOAD, USING extraction, file s…
magyargergo Mar 26, 2026
2b222ef
fix(cobol): resolve 14th review — callAccum false paragraph + Area A …
magyargergo Mar 26, 2026
86a36e5
fix(cobol): resolve 15th review — callAccum Area A + verb boundary fixes
magyargergo Mar 26, 2026
3296201
test(cobol): add 17 edge-case regression tests + fix USING verb boundary
magyargergo Mar 26, 2026
5aa0e18
test(cobol): add 32 comprehensive edge-case regression tests
magyargergo Mar 26, 2026
38e37e0
fix(cobol): resolve 16th review — CANCEL in CALL block + USING boundary
magyargergo Mar 26, 2026
7e52f98
refactor(cobol): extract shared verb constants + resolve 17th review
magyargergo Mar 26, 2026
5dae8cd
test(cobol): replace all fuzzy assertions with exact toBe checks
magyargergo Mar 26, 2026
bf38528
fix(cobol): resolve 19th review + 15 accumulator flush tests
magyargergo Mar 26, 2026
119bd72
fix(cobol): resolve 20th review — INITIALIZE multi-target + 2 tests
magyargergo Mar 26, 2026
100 changes: 100 additions & 0 deletions docs/code-indexing/cobol/README.md
@@ -0,0 +1,100 @@
# COBOL Code Indexing

GitNexus indexes COBOL codebases using a **regex-only extraction** strategy, bypassing tree-sitter entirely. This document explains why, how the pipeline works, and links to detailed sub-documents.

## Why Regex-Only?

The tree-sitter-cobol grammar (v0.0.1) has three critical limitations that make it unusable for production indexing:

| Issue | Impact | Severity |
|-------|--------|----------|
| External scanner hangs on ~5% of files | No timeout mechanism exists for the C scanner; the process blocks indefinitely | **Blocking** |
| Only ~15% of paragraph headers detected | Most procedure-division paragraphs are invisible to the grammar | High |
| Patch markers in cols 1-6 cause parse errors | Enterprise COBOL uses non-standard sequence area content (e.g., `mzADD`, `estero`, `#FIX`) | High |

Because a hang inside the external scanner cannot be interrupted (there is no `setTimeoutMicros` equivalent that covers the C scanner), using tree-sitter-cobol would hang the indexing pipeline on roughly 5% of real-world files.

The regex-only approach provides:

- **Speed**: ~1ms per file average extraction time
- **Reliability**: zero hangs, zero crashes across 13,000+ files
- **Coverage**: captures all critical symbols -- program name, paragraphs, sections, CALL, PERFORM, COPY, data items (01-77, 88-level), file declarations, FD entries, EXEC SQL/CICS blocks, ENTRY points, and MOVE statements

## Architecture

```mermaid
flowchart TD
A[Repository Scan] --> B{File Detection}
B -->|Extension match| C[COBOL file]
B -->|GITNEXUS_COBOL_DIRS match| C
B -->|No match| Z[Skip]

C --> D{Copybook?}
D -->|Yes| E[Add to Copybook Map]
D -->|No| F[Source Program]

E --> G[COPY Expansion Engine]
F --> G

G -->|Inline copybook content| H[Expanded Source]
H --> I[Patch Marker Cleanup]
I --> J[Regex State Machine]

J --> K[Extracted Symbols]
K --> L[Graph Model Builder]
L --> M[Knowledge Graph]

subgraph "Per-Chunk Processing"
G
H
I
J
K
L
end

subgraph "Post-Processing"
M --> N[Community Detection]
M --> O[Process Detection]
M --> P[Contract Detection]
end

style J fill:#e8f5e9,stroke:#2e7d32
style G fill:#e3f2fd,stroke:#1565c0
```

## COBOL vs Tree-Sitter Languages

| Feature | COBOL (Regex) | Tree-Sitter Languages |
|---------|--------------|----------------------|
| Parser | Single-pass regex state machine | tree-sitter grammar + queries |
| Speed | ~1ms/file | ~5ms/file |
| AST available | No | Yes |
| COPY expansion | Yes (pre-processing step) | N/A |
| Deep indexing | Data items, SQL, CICS, FD, ENTRY | Type annotations, generics, etc. |
| Call extraction | PERFORM (intra-file) + CALL (cross-program) | AST-based call site detection |
| Import extraction | COPY statements | `import`/`require`/`use`/`#include` |
| Coverage | All critical symbols | Language-dependent query coverage |
| Failure mode | Never hangs | External scanner can hang (COBOL only) |

## Sub-Documents

| Document | Description |
|----------|-------------|
| [File Detection](./file-detection.md) | Extension mapping, `GITNEXUS_COBOL_DIRS`, copybook classification |
| [COPY Expansion](./copy-expansion.md) | Copybook inlining, REPLACING transformations, cycle detection |
| [Regex Extraction](./regex-extraction.md) | State machine, regex patterns, line processing |
| [Deep Indexing](./deep-indexing.md) | Data items, EXEC SQL/CICS, file declarations, FD, ENTRY, MOVE |
| [Graph Model](./graph-model.md) | COBOL-specific node types, edge types, full annotated example |
| [Performance](./performance.md) | Benchmarks, worker pool tuning, caps, troubleshooting |

## Key Source Files

| File | Purpose |
|------|---------|
| `gitnexus/src/core/ingestion/cobol-preprocessor.ts` | Patch marker cleanup + regex extraction engine |
| `gitnexus/src/core/ingestion/cobol-copy-expander.ts` | COPY statement expansion with REPLACING |
| `gitnexus/src/core/ingestion/utils.ts` | `getLanguageFromPath`, `getLanguageFromFilename` |
| `gitnexus/src/core/ingestion/pipeline.ts` | `isCobolCopybook`, `expandCobolCopies`, `detectCrossProgamContracts` |
| `gitnexus/src/core/ingestion/workers/parse-worker.ts` | `processCobolRegexOnly` -- graph model builder |
| `gitnexus/src/core/ingestion/workers/worker-pool.ts` | Configurable sub-batch size for COBOL |
157 changes: 157 additions & 0 deletions docs/code-indexing/cobol/copy-expansion.md
@@ -0,0 +1,157 @@
# COBOL COPY Expansion

The COPY statement is COBOL's include mechanism -- analogous to `#include` in C or `import` in modern languages. GitNexus expands COPY statements **before** regex extraction so that symbols defined inside copybooks (data items, paragraphs, etc.) are visible in the program's extracted graph.

## Supported Syntax

### Basic COPY

```cobol
COPY CPSESP.
COPY "WORKGRID.CPY".
```

Inlines the content of the named copybook, replacing the COPY line(s).

### COPY with REPLACING

```cobol
COPY CPSESP REPLACING "ANAZI-KEY" BY "LK-KEY".
COPY CPSESP REPLACING LEADING "ESP-" BY "LK-ESP-"
LEADING "KPSESPL" BY "LK-KPSESPL".
COPY LINKAGE REPLACING TRAILING "-IN" BY "-OUT".
```

Three REPLACING types are supported:

| Type | Syntax | Behavior | Example |
| ------------ | ------------------------------------ | --------------------------------------- | -------------------------------- |
| **EXACT** | `REPLACING "OLD" BY "NEW"` | Replace exact identifier matches | `ANAZI-KEY` becomes `LK-KEY` |
| **LEADING** | `REPLACING LEADING "PFX-" BY "NEW-"` | Replace prefix on all COBOL identifiers | `ESP-NAME` becomes `LK-ESP-NAME` |
| **TRAILING** | `REPLACING TRAILING "-IN" BY "-OUT"` | Replace suffix on all COBOL identifiers | `DATA-IN` becomes `DATA-OUT` |

Multiple REPLACING clauses can appear in a single COPY statement. They are applied in order to each COBOL identifier in the copybook content.

### Multi-Line COPY

COPY statements can span multiple lines (standard COBOL continuation rules apply):

```cobol
       COPY CPSESP REPLACING
      -    LEADING "ESP-" BY "LK-ESP-"
      -    LEADING "KPSESPL" BY "LK-KPSESPL".
```

Continuation lines (indicator `-` in column 7) are merged before COPY statement scanning.
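
The merge step can be sketched roughly as follows. This is a minimal illustration, not the expander's actual code: `mergeContinuationLines` is a hypothetical name, it assumes fixed-format source with the indicator in column 7 (string index 6), and it joins with a single space, which suits clause continuations like the example above but not literals split across lines.

```typescript
// Hypothetical sketch: fold fixed-format continuation lines into the line they continue.
// Assumption: column 7 (index 6) is the indicator area and the text starts at column 8.
function mergeContinuationLines(source: string): string[] {
  const merged: string[] = [];
  for (const line of source.split(/\r?\n/)) {
    const last = merged.pop();
    if (last === undefined) {
      // First line: nothing to continue.
      merged.push(line);
    } else if (line.charAt(6) === "-") {
      // Append the continued text (columns 8+) to the previous logical line.
      merged.push(last.replace(/\s+$/, "") + " " + line.slice(7).trim());
    } else {
      merged.push(last, line);
    }
  }
  return merged;
}

const sampleLines: string[] = [
  "       COPY CPSESP REPLACING",
  '      -    LEADING "ESP-" BY "LK-ESP-"',
  '      -    LEADING "KPSESPL" BY "LK-KPSESPL".',
];
const mergedLines: string[] = mergeContinuationLines(sampleLines.join("\n"));
console.log(mergedLines);
```

After merging, the whole statement sits on one logical line, so a single regex can find the COPY target and its REPLACING clauses.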

## Expansion Flow

```mermaid
sequenceDiagram
participant Pipeline
participant Expander as COPY Expander
participant Resolver
participant Reader

Pipeline->>Pipeline: Identify all COBOL files
Pipeline->>Pipeline: Classify copybooks vs programs
Pipeline->>Reader: Read all copybook content upfront
Reader-->>Pipeline: Copybook content map (name -> content)

loop For each source file in chunk
Pipeline->>Expander: expandCopies(content, filePath, resolveFile, readFile)
Expander->>Expander: Merge continuation lines
Expander->>Expander: Detect COPY statements via regex

loop For each COPY statement (reverse order)
Expander->>Resolver: resolveFile(copyTarget)
Resolver-->>Expander: Copybook key or null

alt Resolved successfully
Expander->>Reader: readFile(resolvedKey)
Reader-->>Expander: Copybook content

Expander->>Expander: Apply REPLACING transformations
Expander->>Expander: Recurse for nested COPYs (depth + 1)
Expander->>Expander: Splice expanded content into output
else Not resolved
Expander->>Expander: Keep original COPY line
end
end

Expander-->>Pipeline: Expanded content + resolution metadata
Pipeline->>Pipeline: Replace file content with expanded content
end
```

The return type `CopyExpansionResult` contains `expandedContent` and `copyResolutions`. The `expansionDepth` field has been removed from the return type (it was unused by callers).

COPY statement line numbers in `CopyResolution` are 1-based (consistent with the preprocessor's line numbering). The splice operation that replaces COPY lines with expanded content adjusts for 0-based array indexing internally.
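
A rough sketch of these shapes and of the 1-based to 0-based adjustment is shown below. Only `expandedContent`, `copyResolutions`, and `resolvedPath` are named in this document; the other fields (`copyTarget`, `line`) and the helper `spliceExpansion` are assumptions made for illustration.

```typescript
// Hypothetical shapes: copyTarget and line are assumed field names.
interface CopyResolution {
  copyTarget: string;          // name as written in the COPY statement (assumed)
  resolvedPath: string | null; // null when no copybook matched
  line: number;                // 1-based line of the COPY statement (assumed name)
}

interface CopyExpansionResult {
  expandedContent: string;
  copyResolutions: CopyResolution[];
}

// Replace `lineCount` lines starting at a 1-based line with the expanded copybook lines.
function spliceExpansion(
  lines: string[],
  line1Based: number,
  lineCount: number,
  expanded: string[],
): string[] {
  const result = lines.slice();
  result.splice(line1Based - 1, lineCount, ...expanded); // convert to a 0-based index
  return result;
}

const demoResult: CopyExpansionResult = {
  expandedContent: spliceExpansion(["A", "COPY X.", "B"], 2, 1, ["X1", "X2"]).join("\n"),
  copyResolutions: [{ copyTarget: "X", resolvedPath: "X.CPY", line: 2 }],
};
console.log(demoResult.expandedContent);
```

Processing COPY statements in reverse order, as the flow above does, keeps the recorded line numbers of earlier statements valid while later lines are being replaced.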

## Cycle Detection

Circular COPY references (e.g., copybook A includes copybook B which includes copybook A) are detected and handled:

1. Each expansion chain maintains a `visited` set of resolved copybook paths
2. If a copybook path is already in the visited set, the expansion is skipped
3. A `warnedCircular` set (internal to `expandCopies()`, not a parameter) deduplicates warning messages within a single file expansion
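
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `expandCopies()`: the graph representation, the function name `collectExpansionOrder`, and the choice to copy `visited` per branch are all hypothetical.

```typescript
// Hypothetical sketch of per-chain cycle detection.
// Assumption: the copybook graph is given as name -> names it COPYs.
type CopybookGraph = Map<string, string[]>;

function collectExpansionOrder(
  root: string,
  graph: CopybookGraph,
  visited: Set<string> = new Set<string>(),
  warnedCircular: Set<string> = new Set<string>(),
  warnings: string[] = [],
): { order: string[]; warnings: string[] } {
  const order: string[] = [];
  if (visited.has(root)) {
    // Already on this expansion chain: skip, and warn only once per copybook.
    if (!warnedCircular.has(root)) {
      warnedCircular.add(root);
      warnings.push(`circular COPY skipped: ${root}`);
    }
    return { order, warnings };
  }
  // Copy the set so sibling branches do not see each other's entries.
  const chain = new Set<string>(visited).add(root);
  order.push(root);
  for (const child of graph.get(root) ?? []) {
    order.push(...collectExpansionOrder(child, graph, chain, warnedCircular, warnings).order);
  }
  return { order, warnings };
}

const cyclicGraph: CopybookGraph = new Map<string, string[]>([
  ["A", ["B", "B"]],
  ["B", ["A"]],
]);
const cyclicResult = collectExpansionOrder("A", cyclicGraph);
console.log(cyclicResult.order, cyclicResult.warnings);
```

In this sketch `B` is expanded twice because it is copied twice, but the cycle back to `A` is skipped both times and reported once.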

Known circular copybooks in PROJECT-NAME: `ANAZI`, `ANDIP`, `QDIPE` (self-referential includes).

## Max Depth

Nested COPY expansion is limited to **10 levels** (`DEFAULT_MAX_DEPTH`). If a COPY chain exceeds this depth, a warning is logged and the remaining COPY statements are left unexpanded.

## Max Total Expansions

A breadth amplification guard caps the total number of COPY expansions across all branches within a single file to **500** (`MAX_TOTAL_EXPANSIONS`). This prevents exponential blowup from diamond-shaped COPY graphs where N copybooks each include N other copybooks. Once the limit is reached, further COPY statements in that file are left unexpanded and a single warning is logged.
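
A minimal sketch of how the depth limit and the total-expansion cap might interact is shown below. The constant names come from this document; the function `expandWithGuards`, the `ExpansionBudget` shape, and the exact comparison used for the depth check are assumptions.

```typescript
const DEFAULT_MAX_DEPTH = 10;
const MAX_TOTAL_EXPANSIONS = 500;

// Hypothetical per-file budget shared by every branch of the expansion.
interface ExpansionBudget {
  total: number;
  warned: boolean;
}

function expandWithGuards(
  copybook: string,
  childrenOf: (copybookName: string) => string[],
  depth: number,
  budget: ExpansionBudget,
  warnings: string[],
): void {
  if (depth >= DEFAULT_MAX_DEPTH) {
    // Assumption: depth 0 is the first copybook level.
    warnings.push(`max depth reached at ${copybook}`);
    return;
  }
  if (budget.total >= MAX_TOTAL_EXPANSIONS) {
    if (!budget.warned) {
      budget.warned = true; // log the breadth warning only once per file
      warnings.push("max total expansions reached");
    }
    return;
  }
  budget.total += 1;
  for (const child of childrenOf(copybook)) {
    expandWithGuards(child, childrenOf, depth + 1, budget, warnings);
  }
}

// A chain that never ends stops at the depth limit.
const chainBudget: ExpansionBudget = { total: 0, warned: false };
const chainWarnings: string[] = [];
expandWithGuards("C0", (n) => [n + "X"], 0, chainBudget, chainWarnings);

// A fan-out of three per copybook stops at the total cap.
const fanBudget: ExpansionBudget = { total: 0, warned: false };
const fanWarnings: string[] = [];
expandWithGuards("ROOT", () => ["A", "B", "C"], 0, fanBudget, fanWarnings);
console.log(chainBudget.total, fanBudget.total);
```

The depth limit bounds how long a single chain can get, while the shared budget bounds the work done across all branches of one file.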

## REPLACING Application Detail

The REPLACING engine works by scanning all COBOL identifiers (matching `\b[A-Z][A-Z0-9-]*\b`) in the copybook content and applying each replacement rule:

```
Original copybook content:
05 ESP-NAME PIC X(30).
05 ESP-CODE PIC X(10).
05 KPSESPL-FLAG PIC X(01).

After REPLACING LEADING "ESP-" BY "LK-ESP-" LEADING "KPSESPL" BY "LK-KPSESPL":
05 LK-ESP-NAME PIC X(30).
05 LK-ESP-CODE PIC X(10).
05 LK-KPSESPL-FLAG PIC X(01).
```

For LEADING replacements, the engine checks if each identifier starts with the `from` prefix (case-insensitive) and replaces only the prefix portion, preserving the rest of the identifier.

For TRAILING replacements, the same logic applies to suffixes.

For EXACT replacements, only identifiers that match the `from` value exactly (case-insensitive) are replaced.
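
The three rule types can be sketched as a single pass over identifiers. This is an illustration only: `ReplacingRule` and `applyReplacingSketch` are hypothetical names, and applying every rule in sequence to each identifier (rather than stopping at the first match) is an assumption about how "applied in order" works.

```typescript
type ReplacingKind = "EXACT" | "LEADING" | "TRAILING";

interface ReplacingRule {
  kind: ReplacingKind;
  from: string;
  to: string;
}

// Hypothetical sketch: rewrite each COBOL identifier according to the rules, in order.
function applyReplacingSketch(content: string, rules: ReplacingRule[]): string {
  return content.replace(/\b[A-Z][A-Z0-9-]*\b/g, (identifier: string): string => {
    let current = identifier;
    for (const rule of rules) {
      const upper = current.toUpperCase();
      const from = rule.from.toUpperCase(); // comparisons are case-insensitive
      if (rule.kind === "EXACT" && upper === from) {
        current = rule.to;
      } else if (rule.kind === "LEADING" && upper.startsWith(from)) {
        current = rule.to + current.slice(from.length); // keep the rest of the name
      } else if (rule.kind === "TRAILING" && upper.endsWith(from)) {
        current = current.slice(0, current.length - from.length) + rule.to;
      }
    }
    return current;
  });
}

const demoRules: ReplacingRule[] = [
  { kind: "LEADING", from: "ESP-", to: "LK-ESP-" },
  { kind: "LEADING", from: "KPSESPL", to: "LK-KPSESPL" },
];
const demoCopybook = "05 ESP-NAME PIC X(30).\n05 ESP-CODE PIC X(10).\n05 KPSESPL-FLAG PIC X(01).";
console.log(applyReplacingSketch(demoCopybook, demoRules));
```

Keywords such as `PIC` pass through the same callback but match no rule, so they are returned unchanged.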

## Copybook Resolution

The resolver tries multiple strategies to match a COPY target name to a copybook file:

1. **Exact match**: `COPY CPSESP` resolves to copybook named `CPSESP`
2. **Strip extension**: `COPY WORKGRID.CPY` strips `.CPY` and resolves to `WORKGRID`
3. **Add extension**: `COPY CPSESP` tries `CPSESP.CPY` and `CPSESP.COPY`

If no match is found, the COPY statement is left in place (unexpanded) and a resolution record with `resolvedPath: null` is created.
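
The lookup order can be sketched as below. This is a simplified illustration, not the pipeline's resolver: `resolveCopybookSketch` is a hypothetical name, and it assumes the copybook map is keyed by upper-case names, that surrounding quotes are dropped before lookup, and that `.COPY` is stripped in the same way as `.CPY`.

```typescript
// Hypothetical sketch of the three resolution strategies, tried in order.
// Assumption: `copybooks` maps an upper-case copybook key to its content.
function resolveCopybookSketch(target: string, copybooks: Map<string, string>): string | null {
  const normalized = target.replace(/^["']|["']$/g, "").toUpperCase();

  // 1. Exact match
  if (copybooks.has(normalized)) return normalized;

  // 2. Strip extension
  const stripped = normalized.replace(/\.(CPY|COPY)$/, "");
  if (stripped !== normalized && copybooks.has(stripped)) return stripped;

  // 3. Add extension
  for (const extension of [".CPY", ".COPY"]) {
    if (copybooks.has(normalized + extension)) return normalized + extension;
  }

  // Unresolved: the caller keeps the COPY line and records resolvedPath: null.
  return null;
}

const demoCopybooks = new Map<string, string>([
  ["CPSESP", "05 ESP-NAME PIC X(30)."],
  ["WORKGRID", "05 GRID-ROW PIC 9(4)."],
  ["LINKAGE.CPY", "05 DATA-IN PIC X(10)."],
]);
console.log(resolveCopybookSketch('"WORKGRID.CPY"', demoCopybooks));
```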

## Pipeline Integration

The expansion runs **per chunk**, after file content is read but before dispatch to worker threads:

1. All copybook files are read upfront (they are typically small, collectively under 100 MB)
2. Per chunk, the copybook map is merged with chunk content (in case a chunk contains copybooks)
3. Only programs (not copybooks themselves) undergo expansion
4. The expanded content replaces the original content in-place before worker dispatch

## Inline Comment Handling

The copy expander's `stripInlineComment()` helper is quote-aware: pipe characters (`|`) inside single- or double-quoted strings are preserved. The preprocessor uses the same quote-aware logic.
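
A quote-aware scan of this kind might look like the sketch below. It is not the body of `stripInlineComment()`: the name `stripInlineCommentSketch` is hypothetical, and it assumes that an unquoted `|` starts a comment running to the end of the line and that trailing whitespace before the comment is dropped.

```typescript
// Hypothetical sketch: cut the line at the first `|` that is not inside a quoted string.
function stripInlineCommentSketch(line: string): string {
  let openQuote: string | null = null;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (openQuote !== null) {
      // Inside a literal: only the matching quote character closes it.
      if (ch === openQuote) openQuote = null;
    } else if (ch === '"' || ch === "'") {
      openQuote = ch;
    } else if (ch === "|") {
      return line.slice(0, i).replace(/\s+$/, "");
    }
  }
  return line;
}

console.log(stripInlineCommentSketch('MOVE "A|B" TO X | set separator'));
```

Tracking which quote character opened the literal means an apostrophe inside a double-quoted string does not end it early.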

## Source Files

- `gitnexus/src/core/ingestion/cobol-copy-expander.ts` -- `expandCopies()`, `parseReplacingClause()`, `applyReplacing()`
- `gitnexus/src/core/ingestion/pipeline.ts` -- `expandCobolCopies()`, copybook map construction, chunk integration