Skip to content

fix: COBOL parsing-layer coverage gaps — F17-F23 (#1925)#1959

Merged
magyargergo merged 12 commits into
abhigyanpatwari:mainfrom
prajapatisparsh:fix/cobol-parsing-coverage
Jun 1, 2026
Merged

fix: COBOL parsing-layer coverage gaps — F17-F23 (#1925)#1959
magyargergo merged 12 commits into
abhigyanpatwari:mainfrom
prajapatisparsh:fix/cobol-parsing-coverage

Conversation

@prajapatisparsh

Copy link
Copy Markdown
Contributor

Closes #1925 (parent epic #1919).

Summary

Fixes 7 COBOL parsing coverage gaps identified in issue #1925:

F17 [HIGH] — Digit-leading COBOL names

Expanded all ~68 [A-Z]→[A-Z0-9] first-character anchors across cobol-preprocessor.ts regex patterns.

F18 [MEDIUM] — MOVE source subscript capture

F19 [MEDIUM] — SQL multi-table FROM

F20 [MEDIUM] — Arithmetic verb ACCESSES edges

F21 [LOW] — Free-format PROGRAM-ID column

F22 [LOW] — File-size guard

F23 [LOW] — PERFORM TIMES detection

Test coverage

  • 61 existing cobol.test.ts tests: all pass (1 value updated)
  • 24 cobol-scope.test.ts tests: all pass
  • 14 new coverage tests in cobol-parsing-coverage.test.ts
  • Benchmark: 1000 files in 314ms, linear scaling
  • tsc: zero new errors

… parent abhigyanpatwari#1919)

F17 [HIGH]: Expand all ~68 [A-Z]→[A-Z0-9] first-char anchors in regex
  patterns across cobol-preprocessor.ts — digit-leading COBOL names
  (1000-MAIN, 9000-EXIT) now work for paragraphs, sections, PERFORM,
  GO TO, CALL, COPY, SORT, SEARCH, MOVE, and all data-item patterns.

F18 [MEDIUM]: MOVE source subscript/reference-mod capture — RE_MOVE
  now captures optional (index) or (start:length) after source name.
  Subscript-stripping helper added to cobol-processor.ts for data-item
  lookup. Targets preserve subscript info in edge IDs.

F19 [MEDIUM]: SQL multi-table FROM — comma-separated table lists
  now captured correctly. SQL aliases stripped from table names.
  Fixture with 3 EXEC SQL blocks verifies 5 table references.

F20 [MEDIUM]: Arithmetic verb ACCESSES edges — extraction for
  COMPUTE/ADD/SUBTRACT/MULTIPLY/DIVIDE with cobol-arithmetic-read
  and cobol-arithmetic-write edge emission. Fixture with all 5 verbs.

F21 [LOW]: captures.ts free-format column detection — PROGRAM-ID
  name column now dynamically determined for free-format COBOL
  (>>SOURCE FORMAT FREE) instead of hardcoded column 7.

F22 [LOW]: File-size guard — GITNEXUS_MAX_COBOL_FILE_SIZE_BYTES
  env var (default 5MB). Files over limit skipped with logger.warn.
  Edge-case tests verify both below-threshold (pass) and
  above-threshold (skip with warning) behavior.

F23 [LOW]: PERFORM TIMES detection — TIMES clause now detected
  anywhere after the target (including count+TIMES patterns), not
  just immediate next word. PERFORM VARYING still correctly skipped.

Test coverage: 99 tests (61 existing + 24 scope + 14 new coverage).
@vercel

vercel Bot commented Jun 1, 2026

Copy link
Copy Markdown

@prajapatisparsh is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

✨ PR Autofix

Found fixable formatting / unused-import issues across 12 changed lines. Comment /autofix on this PR to apply them, or run npm run lint:fix && npm run format locally.

{"schema":"gitnexus.pr-autofix/v2","state":"fixes-available","pr_number":1959,"changed_lines":12,"head_sha":"e41f7fd0eb12a2f3a79015c12436e553c5bfa13e","run_id":"26741061987","apply_command":"/autofix"}

Comment thread gitnexus/src/core/ingestion/cobol/cobol-preprocessor.ts Fixed
Comment thread gitnexus/test/integration/resolvers/cobol-parsing-coverage.test.ts Fixed
prajapatisparsh and others added 2 commits June 1, 2026 12:52
- Remove unused stripSubscript function from cobol-preprocessor.ts
- Remove unused edgeSet import from cobol-parsing-coverage.test.ts
- Bump benchmark per-scale budget 300s→600s, total timeout 600s→900s
  (CI runner may be slower than local)
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
10771 10761 0 10 662s

✅ All 10761 tests passed

10 test(s) skipped — expand for details
  • COBOL pipeline benchmark > scales with file count
  • C# pipeline benchmark > scales with file count — namespaces spread across the solution
  • C# pipeline benchmark > scales with file count — all types in one (global) namespace bucket
  • C# pipeline benchmark > scales with file count — all types in one (named) namespace bucket
  • Go pipeline benchmark > scales with file count (workers enabled)
  • Go pipeline benchmark — worker pool (issue Worker idle timeout kills long Go scope extraction and surfaces as Napi::Error during analyze #1848) > does not quarantine the large generated Go file on sub-batch idle timeout
  • PHP pipeline benchmark > scales with file count (workers enabled)
  • Ruby pipeline benchmark > scales with file count (workers enabled)
  • Rust pipeline benchmark > scales with file count (workers enabled)
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 80.35% 37498/46667 79.84% 📈 +0.5 🟢 ████████████████░░░░
Branches 68.9% 23870/34642 68.5% 📈 +0.4 🟢 █████████████░░░░░░░
Functions 85.45% 3885/4546 84.94% 📈 +0.5 🟢 █████████████████░░░
Lines 83.92% 33741/40205 83.36% 📈 +0.6 🟢 ████████████████░░░░

📋 View full run · Generated by CI

- Fix F23 PERFORM TIMES regression: 'PERFORM identifier TIMES' where the
  target IS the counter variable was missed by the identifier+TIMES regex.
  Added ^\s*TIMES\b check to catch this case (restores pre-existing
  unit test: 'does NOT store PERFORM WS-COUNT TIMES as a perform target').
- Update COBOL scope-capture fingerprint in baselines.json (F17/F21/F23
  legitimately change capture output).
- Bump benchmark timeout budget (300s→600s per-scale, 600s→900s total).

@magyargergo magyargergo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #1959 tri-review — COBOL parsing coverage (F17–F23)

Methods: 3 engines — Codex (the one independent engine; live, effort=xhigh / gpt-5.5) + Claude reviewer lanes (risk, correctness, adversarial, performance, testing). Two Claude lanes (test-CI, security-boundary) truncated mid-run; their domains (CI gating, ReDoS, supply-chain) were covered by the coordinator + other lanes. Cross-engine agreement (Codex + ≥1 Claude lane) is weighted strong; Claude-only agreement is "consistent," not independent confirmation. Every inline finding below was reproduced by the coordinator.

Solid work / validated. The [A-Z][A-Z0-9] identifier widening across ~40 regexes is mechanically uniform and division-gated; correctness + adversarial confirmed it does not mint phantom symbols in normal programs (numeric literals miss the dataItemMap; hyphenated names like WS-TO-DATE aren't broken by indexOf(' TO ') since boundaries are whitespace-delimited). A suspected P1 ReDoS in the new multi-table FROM regex was refuted — the coordinator's timing test is strictly linear (1.26 ms at 160 KB, no backtracking), matching adversarial's empirical refutation and Codex finding no trigger. CI is green except Vercel (auth-only). PR is BEHIND main (rebase needed; non-blocking).

Inline findings (each reproduced by the coordinator)

P2 — Multi-table SQL FROM drops tables when AS alias is used. (Codex + reproduced) The new comma-aware FROM regex consumes AS as its single optional alias word and then can't continue past the comma. Reproduced: FROM CUSTOMER AS C, ACCOUNT AS A["CUSTOMER"] (ACCOUNT lost), while FROM CUSTOMER C, ACCOUNT A → both. The new F19 test only covers the alias-without-AS form, so it passes while the very common AS syntax silently regresses. Fix: handle optional AS, and terminate the table list at SQL clause keywords (WHERE/JOIN/GROUP/ON…).

P2 — ADD a b GIVING c (no TO) produces no edges. (Codex + correctness + reproduced) The ADD branch requires TO; the standard format-3 ADD WS-A WS-B GIVING WS-C matches nothing, target stays empty, and the op is dropped. Reproduced. Same branch also under-captures non-GIVING forms (see body).

P2 — tree-sitter-swift added to ROOT package.json dependencies in a COBOL PR. (Codex + risk + reproduced via git) main's root package.json has no dependencies block at all; this PR injects a Swift grammar (native node-gyp build at install) at the monorepo root plus 71 lockfile lines. Unrelated cross-domain churn — the Swift grammar is already managed in gitnexus/package.json. Likely a merge artifact; remove from this PR.

P2 — PERFORM <para> [THRU <para>] <count> TIMES suppresses the CALLS edge to the paragraph. (Codex + risk + correctness + reproduced) This appears intentional — the perform-times.cbl fixture (L12/L14) and the F23 test (calls.length === 8, "no TIMES false positives") deliberately treat it as a non-call. But per COBOL semantics the out-of-line PERFORM 2000-PROCESS 3 TIMES does transfer control to 2000-PROCESS (executes it 3×), so the CALLS edge is a real control-flow relationship that impact analysis depends on. Note this also reverses main's behavior (main emitted the edge; the original /^TIMES/ guard already correctly suppressed only the genuine inline form PERFORM WS-COUNT TIMES). Recommend reconsidering: emit the edge for <para> <count> TIMES / <para> THRU <para> <count> TIMES, and ideally resolve at graph-mapping time (emit only when the target resolves to a paragraph). If the suppression is genuinely intended, document the rationale.

Lower-priority (body)

  • File-size guard gaps (Codex + adversarial + risk): copybooks/JCL are never size-checked (only programs), so an oversized COPY member still OOMs; GITNEXUS_MAX_COBOL_FILE_SIZE_BYTES=-1 is truthy ⇒ MAX=-1 ⇒ every COBOL file skipped (reproduced; 0/NaN/empty correctly fall back — only negative is unsafe); skipped programs leave an empty Module node and strand cross-program CALL edges as permanently <unresolved> with no skippedFiles counter surfaced; content.length is UTF-16 code units, not bytes, so multibyte sources can slip the guard. Validate the env var as a positive int; guard copybooks; surface a skip count.
  • Arith data-flow incompleteness (Codex + correctness): non-GIVING receiving operands aren't credited as reads (ADD a TO bb is read+write but only written); a multi-target receiving list (ADD a TO b c) captures only the first; the populated givingTarget field is never consumed.
  • Arith phantom edges inside string literals (adversarial; regex match reproduced): the per-line arith regex sees DISPLAY literal content (DISPLAY "ADD ITEM TO CART" ⇒ verb=ADD), so a data item named like a word in a message gets a spurious ACCESSES edge. Bounded by name collision; mask string literals before the arith/MOVE regexes.
  • captures.ts fixed-format PROGRAM-ID range (correctness, P3): the name column is hardcoded to 7 for fixed-format, but preprocessing blanks (not shifts) cols 1–6, so the name actually starts ~col 19 — the capture range start is off (the name text is unaffected). Use findProgramIdNameColumn for both formats.
  • Perf hygiene (performance): the arith verb regex and COMPUTE keyword-skip regex are compiled per-line/per-token instead of hoisted to module constants (every other hot-loop regex is hoisted); sources.includes() dedup is O(n²) — use a Set. Micro, but inconsistent with the file's own discipline.
  • Benchmark timeout doubled (performance + risk + testing): inner 300k→600k, outer 600k→900k, COBOL baseline re-fingerprinted, no comment. Confirm the new arith extraction is the cause and document it, or it reads as a perf regression absorbed into the budget.
  • Test rigor (testing + risk): F18 MOVE-subscript and F20 arithmetic use toBeGreaterThan(0)/>= rather than exact counts (a doubled or halved emission would still pass); the GIVING path is untested; F17 SORT/INPUT-PROCEDURE is deferred via a comment claiming "confirmed by code audit" rather than tested; free-format coverage only checks Module creation; no negative test that the widening doesn't mint phantom numeric symbols.

Automated multi-tool digest (Codex xhigh/gpt-5.5 + Claude reviewer swarm). Inline findings were reproduced by the coordinator; verify before acting.

Comment thread gitnexus/src/core/ingestion/cobol/cobol-preprocessor.ts Outdated
}
case 'ADD': {
// ADD a TO b [GIVING c] — target is after TO or GIVING
const addGiving = rest.match(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — ADD a b GIVING c (no TO) produces no edges. [reproduced]

This branch requires TO, so the standard COBOL format-3 ADD WS-A WS-B GIVING WS-C matches nothing → target stays '' → the op is dropped (no arith-read/write edges). Reproduced.

Add a GIVING-first fallback for ADD when no TO is present (operands before GIVING are reads, the GIVING identifier is the write). Related: non-GIVING ADD a TO b never credits b as a read (it's read+write), and ADD a TO b c captures only the first receiving target. (Codex + correctness lane + coordinator-reproduced.)

Comment thread package.json Outdated
Comment thread gitnexus/src/core/ingestion/cobol/cobol-preprocessor.ts Outdated
@prajapatisparsh prajapatisparsh force-pushed the fix/cobol-parsing-coverage branch from eecef56 to 56e1b43 Compare June 1, 2026 10:28
@magyargergo

Copy link
Copy Markdown
Collaborator

Could you please resolve the conflict?

@prajapatisparsh

Copy link
Copy Markdown
Contributor Author

Could you please resolve the conflict?

hey just reached home , on it

@magyargergo magyargergo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #1959 delta tri-review — prior P2 fixes re-checked

Methods: 3 engines — Codex (live) + Claude lanes (risk, correctness, adversarial, testing). This is a follow-up to the prior tri-review (2026-06-01); it checks whether the four P2 inline findings were addressed and what remains.

Engine weighting: Codex + Claude agreement = strong; Claude-only = consistent across personas, not independent confirmation.

Prior P2 inline findings — status after 56e1b438 / c84bc3b / 31e195e8

Finding Status Verification
SQL FROM … AS alias drops tables Fixed Regex (?:AS\s+)? at cobol-preprocessor.ts:847; [reproduced] FROM CUSTOMER AS C, ACCOUNT AS A → both tables
ADD WS-A WS-B GIVING WS-C (no TO) Fixed GIVING-only fallback at :2088–2102; [reproduced]
Root package.json tree-sitter-swift Fixed Reverted in 56e1b438; root package.json has no dependencies block
PERFORM 2000-PROCESS 3 TIMES CALLS edge Fixed Guard narrowed to ^\s*TIMES� only (:1971); out-of-line 3 TIMES emits CALLS; test expects 10 total CALLS; [reproduced]

Solid / validated: The [A-Z][A-Z0-9] widening is division-gated; numeric literals do not mint graph nodes without dataItemMap entries. CodeQL unused stripSubscript / edgeSet warnings from the first review appear resolved (functions/imports removed).

CI / merge state (HEAD 0d7281d)

  • Merge: MERGEABLE, no file conflicts (mergeStateStatus: BLOCKED = required checks, not conflict).
  • Passing: typecheck, lint, format, ubuntu/macos/windows tests, scope-parity discover, packaged install smoke, tree-sitter ABI, CodeQL, gitleaks.
  • Failing: tests / benchmarks (GITNEXUS_BENCH) — investigate whether COBOL baseline/timeout bump or unrelated main merge (#1956 heritage) caused regression.
  • Pending: tests / windows-latest, scope-parity / scope-resolution parity.
  • Vercel: auth-only (non-blocking).

Verdict: production-ready with minor follow-ups — prior P2 code fixes land correctly; remaining P2 items are env parsing, SQL pagination syntax, and quote-blind arithmetic matching. Test coverage for the fixed paths is still thin.

Inline findings (coordinator-verified)

See threaded comments on: negative file-size env (cobol-processor.ts), SQL FETCH FIRST terminator gap (cobol-preprocessor.ts), arithmetic-in-literal false positives (cobol-preprocessor.ts).

Lower-priority (body)

  • Stale fixture comments (perform-times.cbl:11–14): still say “not a paragraph call” for PERFORM 2000-PROCESS 3 TIMES / … WS-COUNT TIMES, but post-fix these do emit CALLS edges (test documents 10 total). Misleading for future maintainers.
  • Arithmetic read gaps: SUBTRACT/MULTIPLY/DIVIDE non-GIVING forms do not mirror ADD’s accumulator-as-source fix (:2106–2164); ADD A TO B GIVING C omits read edge for B.
  • Test gaps: no fixture for explicit FROM … AS …; no ADD … GIVING … integration test; F18/F20 still use >= despite header claiming “exact-count assertions”; F17 SORT/INPUT-PROCEDURE deferred via comment only.
  • File-size guard (prior): copybooks/JCL not size-checked; content.length is UTF-16 code units; skipped files leave empty Module nodes.
  • INNER JOIN SQL: adversarial trigger refuted — [reproduced] FROM CUSTOMER INNER JOIN ACCOUNT yields both tables via FROM+JOIN patterns.

Automated multi-tool delta digest (Codex + Claude reviewer swarm). Prior P2 fixes verified; residual items need human judgment before merge.


// ── 3. Process each COBOL program ──────────────────────────────────
const MAX_COBOL_FILE_SIZE =
parseInt(process.env.GITNEXUS_MAX_COBOL_FILE_SIZE_BYTES ?? '', 10) || 5 * 1024 * 1024;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — negative GITNEXUS_MAX_COBOL_FILE_SIZE_BYTES skips all COBOL files. [reproduced]

Trigger: GITNEXUS_MAX_COBOL_FILE_SIZE_BYTES=-1parseInt('-1',10)||5MB = -1 (truthy) → every file has content.length > -1 → all programs skipped, zero Module nodes.

Evidence: cobol-processor.ts:181–182. F22 tests only '100' and '10MB'.

Fix: Require positive int: Number.isFinite(raw) && raw > 0 ? raw : default.

(Codex + adversarial + risk agreed; prior tri-review body item, still open.)

// FROM table1 [AS alias], table2 [AS alias] … — handle comma-separated
// lists with optional AS keyword, terminated by SQL clause keywords
// (WHERE, JOIN, GROUP, ON, ORDER, HAVING, UNION, SET, INTO, VALUES).
/\bFROM\s+([A-Z0-9][A-Z0-9_]+(?:\s+(?:AS\s+)?[A-Z0-9][A-Z0-9_]*)?(?:\s*,\s*[A-Z0-9][A-Z0-9_]+(?:\s+(?:AS\s+)?[A-Z0-9][A-Z0-9_]*)?)*)(?:\s+(?:WHERE|JOIN|GROUP|ON|ORDER|HAVING|UNION|SET|INTO|VALUES)\b|$)/gi,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — SQL FROM without clause terminator (e.g. FETCH FIRST) yields zero tables. [reproduced]

Trigger: SELECT * FROM CUSTOMER FETCH FIRST 100 ROWS ONLYextractTables() returns [] because terminators are WHERE|JOIN|GROUP|ON|ORDER|HAVING|UNION|SET|INTO|VALUES only — FETCH is absent and string does not end at $.

Fix: Add FETCH|FOR|LIMIT|OFFSET|WITH to terminator set, or fall back to simple \bFROM\s+(\w+) when strict match fails.

(Note: FROM CUSTOMER INNER JOIN ACCOUNT refuted — both tables captured.)


// Arithmetic statements — COMPUTE, ADD, SUBTRACT, MULTIPLY, DIVIDE
// All extract target (written) and source operands (read) for ACCESSES edges
const arithMatch = line.match(/\b(COMPUTE|ADD|SUBTRACT|MULTIPLY|DIVIDE)\s+(.+)/i);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 — arithmetic verb regex matches inside string literals. [code-read]

Trigger: DISPLAY "ADD WS-A TO WS-B". — line-level \b(ADD|SUBTRACT|…)\b at :2029 is quote-blind; if WS-A/WS-B exist in WORKING-STORAGE, spurious cobol-arithmetic-* ACCESSES edges emit.

Fix: Mask quoted regions before arithmetic/MOVE extraction (same approach as free-format *> comment stripping).

(Prior tri-review item; not addressed in 56e1b438. Codex + correctness + adversarial agreed.)

for (const fileCount of scales) {
const paragraphsPerProgram = 3;
const result = await runBenchmark(fileCount, paragraphsPerProgram, 300_000);
const result = await runBenchmark(fileCount, paragraphsPerProgram, 600_000);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to increase the timeout? Is this regex solution is that bad?

@prajapatisparsh

prajapatisparsh commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

The timeout bump was unnecessary. COBOL benchmarks take 314ms at 1000 files against a 300s per-scale budget.

@magyargergo magyargergo merged commit 4f7697c into abhigyanpatwari:main Jun 1, 2026
30 of 31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

COBOL: parsing-layer coverage gaps (7 findings)

3 participants