feat(taint): intra-procedural taint analysis (#2083)#2164
Conversation
…Facts (abhigyanpatwari#2083 U1) Worker-side site harvest in TsHarvester: call/new/member-read records with dotted callee paths, receiver slots, per-argument occurrence tagging with nested-site links, per-declarator resultDefs, spread/template/require-literal markers. hasTaintSafeSites validation seam. The pdg parse-cache chunk-key namespace is versioned (pdg:1 -> pdg:2) instead of a global SCHEMA_BUMP so flag-off users keep warm caches; bench fingerprints re-baselined for the three call-bearing scenarios (straight-line/dense-bindings byte-unchanged).
abhigyanpatwari#2083 U2) Typed spec (kind taxonomy; sanitizers carry neutralizes-kinds), the canonical Express/Node model, and matchFunctionSites: ESM alias/namespace + require- literal callee resolution, bare-name fallback restricted to true globals, sanitizers module-or-global only (never user-shadowable by name), spread/ template arg-position rules, deterministic taintModelVersion.
…patwari#2083 U3) Two-rule model (statement-local + du-fact worklist) with per-taint neutralized-kind exclusion sets: sanitizers exclude only the sink kinds they neutralize (escape(req.body) suppresses res.send but still fires db.query; exec(path.basename(t)) fires), intersection-over-paths so a bypass occurrence keeps the taint live, kill locality on resultDefs, propagate-through args+receiver with viaCall hops, one path per finding, deterministic caps, coverage-gap statuses. Test-first: 38 scenarios on real harvested CFGs.
abhigyanpatwari#2083 U5) resolvePdgConfig gains maxTaintFindingsPerFunction (200), maxTaintHops (32), and the taintModelVersion digest; RepoMeta.pdg + RunScopeResolutionInput surfaces added. The key-union comparator trips full writeback on M2->M3 upgrade and on model-version change without --force (mode-flip tested). No CLI flags or rc keys (programmatic parity with the other caps).
…bhigyanpatwari#2083 U4) run.ts pdg window: match-first fast path (solver only when a function has both a matched source and sink) -> computeReachingDefs with the shared RD fact derivation -> computeTaintFlows -> per-finding TAINTED (versioned hop-encoded reason via the shared path codec, statement-level occurrence identity) + per-kill SANITIZES, dedup-before-budget, truncate-and-warn. All emit counters surfaced (aggregate warn for gaps/drops, debug for volume); PROF gains taint=. Flag-off golden untouched.
…#2083 U6) Anchorless calls enumerate the sparse TAINTED table (bounded, deterministic, limit-clamped); anchored calls (file or symbol via resolveSymbolCandidates) return full decoded hop detail. sinkKind rides a version-1 codec header (1;<kind>|hops — no other persisted channel exists; U4/U6 ship together). RepoMeta.pdg probe yields a no-taint-layer note instead of an error. TAINTED/SANITIZES pinned OUT of VALID_RELATION_TYPES (KTD9a negative- membership tests); generators + canonical skill docs + mirrors updated.
…bhigyanpatwari#2083 U7) pdg-repo taint-cases fixtures complete the six plan shapes; committed findings/kills snapshot via a shared pure-path harness that also feeds the AE2 exact-equality assertion (stored TAINTED == pure-path findings, the no-explosion gate). New taint-dense bench scenario with four --check gates: per-function findings pinned AT the cap, absolute reason-byte + site-bytes disk ceilings (the load-bearing R10 gate), zero-match pass < 0.5x match- dense, N-linearity. Pre-existing scenario baselines untouched.
|
@magyargergo is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 11499 tests passed 16 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
magyargergo
left a comment
There was a problem hiding this comment.
Tri-method review — PR #2164 (M3 intra-procedural taint, #2083)
Methods & engines: 5 productive Claude lanes (correctness, adversarial, security-boundary, performance, maintainability) + Codex (live, independent engine). Two GitNexus lanes (risk-architect, test-ci-verifier) ended mid-investigation without findings; their domains were covered by the adversarial lane, the fully-green CI matrix, and coordinator verification (bench --check, flag-off golden, blast-sweep). Coverage caveat: the risk and test/CI persona passes are therefore thinner than the other lanes.
CI: all checks green (ubuntu coverage, windows/macos platform-sensitive, tree-sitter ABI ×3, packaged-install smoke ×2, benchmarks, CodeQL, quality format/lint/typecheck). The Vercel FAILURE status is the known deploy-authorization false-positive, not a code signal.
What's solid (independently validated)
- M2 harvest parity: a 25-construct differential battery (member chains, destructuring, may-defs, try/finally, optional chaining, nested functions, …) run through base vs PR harvesters produced zero defs/uses/mayDefs divergence — the new site layer doesn't perturb the M2 dataflow substrate. [reproduced, correctness lane]
- Hop codec: byte-exact through
escapeCSVField ∘ sanitizeUTF8round-trips for hostile-but-valid names,#<idx>fallbacks, exact byte-cap straddles; truncation degrades to path-incomplete, never corrupt CSV. [reproduced + Codex code-read, cross-engine] - Flag-off cache parity:
computeChunkHashreturns before the pdg token for non-pdg chunks; warmpdg:1shards miss and are pruned in lockstep — no stale sites-less side channels revived. [adversarial + Codex, cross-engine] - Injection surface:
explain'sLIMITinterpolation is guarded by an airtightNumber.isInteger ∧ [1,200]triple-check (NaN/Infinity/floats/strings/1e300 all rejected); every user value flows through$params;decodeTaintPathis bounded, throw-free, no backtracking regexes. [security + adversarial + Codex] - Exclusion-set worklist: re-derivation with a strictly smaller exclusion set correctly re-enqueues (sanitized-first/raw-second orderings fire); loop/mutual-recursion fixpoints terminate;
exec(escape(x), x)fires on the raw arg. [reproduced + Codex] - Emit id collisions: refuted by construction — source
propertyis bounded to the model's clean set,entryNameis a closed set,bindingKeyinjective per function. [adversarial]
Inline findings (2)
- P2 — multi-source merge under one (binding, defPoint) (
propagate.ts:561) — taint state is keyed(bindingIdx, defPoint)without source identity; a second source occurrence arriving with an equal exclusion set is dropped, soconst x = cond ? req.body : req.query; exec(x)(alsolet x = req.body + req.query) emits one TAINTED edge instead of two. The sink is still flagged (not a missed vulnerability), butexplainshows only one contributing source — remediating only that source leaves the other reaching the same sink with no edge to flag it, and the KTD6 identity contract says source occurrence is part of finding identity. Strong: Codex (HIGH) + correctness lane independently, with executed repros. Fix: include a source discriminator in the taint-state key (or per-source substate with intersection applied within a source), plus a regression test. [reproduced · cross-engine] - P3 —
fileishheuristic misroutes dotted symbol names (local-backend.ts:2783) — the classifier is/[\\/]/.test(target) || /\.[A-Za-z0-9]+$/.test(target)— a separator-less dotted name likeUserController.creatematches the extension leg and is classified as a file, so the symbol-resolution branch (not-found / ambiguous candidates) is never reached and the tool returns a silent empty result attributed to a nonexistent file anchor. Fix: require a path separator or a known source extension for the file branch, and/or fall back to symbol resolution when a separator-less file anchor yields zero rows. [reproduced]
Lower-priority (body only)
- Sequence-expression operands over-taint sink args —
exec((log(x), 'safe'))with taintedxemits a spurious finding even though only the last comma operand's value flows. Mechanism: the harvest visitor has no explicitsequence_expressioncase — the defaultwalkValuedescent recurses into every operand, so each binding occurrence fans into the enclosing sink argument position viarecordOccurrence(typescript-harvest.ts:1017). Reproduced by the correctness lane via an executed pipeline repro (1 finding, hop chain throughx); the mechanism attribution here is code-read. Safe-direction FP, rare trigger. Fix: treat only the final operand of a sequence expression as value-carrying for occurrence fan-out (earlier operands record uses only). explainon an M2-era--pdgindex (pdg stamped, zero TAINTED rows) returns the generic empty-result note instead of "no taint layer — run analyze" —pdgStamped = meta.pdg !== undefinedis true for M1/M2 stamps. Gate the note onmeta.pdg.taintModelVersioninstead. Self-corrects after the first M3 re-analyze (key-union writeback), so impact is transitional. [code-read]- Performance polish (all bounded by existing caps; bench
taint-denseabsorbs at current scales): worklist usesArray.shift(O(N) dequeue — use a head index);chainHopsruns before the finding-identity dedup check (wasted ancestry walks on repeat hits);addMemberReadlinearly rescans the statement's sites; the require-join scan could early-exit on functions with norequireArgsites. A single-function call-dense bench scenario would gate these paths. - Maintainability: the parse/collect/cfgOf unit-test harness is copied into three new test files (extract to
test/helpers/);pointKeyseparator diverges betweenemit.ts('.') andpropagate.ts(':') — harmless while local, worth unifying;sanitizerNeutralizesis exported but only tests consume it;run.tscould take the taint default constants viaemit.tsinstead of reaching intopropagate.ts. - Security suggestions: add
NaN/Infinity/'50'(string) to the limit-rejection test battery; fixtures withexec(req.body)patterns confirmed inert and excluded from the npm package (filesallowlist). - Test gaps worth closing: a multi-source-one-def regression (inline #1);
explain's ambiguous-symbol and symbol-not-found branches; a pin for the M2-era-index note.
Refuted (validation is a feature)
- Windows backslash paths breaking
explain's suffix match (Codex P3): refuted — all scanned paths are normalized to forward slashes atfilesystem-walker.ts:68-71before entering the pipeline, so backslash filePaths cannot reach BasicBlock rows. The right-aligned block-id parse was also verified against colon-bearing paths. - TAINTED id aliasing via hostile
req["a:b"]properties;registerBuiltinTaintModelsraces; deep-nesting crashes escaping the per-file catch (contained — same isolation as pre-M3); chunk-key leakage into flag-off keys.
Automated multi-tool digest (2 engines: Claude multi-persona + Codex). Verify findings before acting on them.
…bhigyanpatwari#2083 review) Extract pointKey(ProgramPoint) to cfg/reaching-defs.ts (colon-separated, matching the codebase block:stmt id convention) and import it in both propagate.ts and emit.ts, replacing the two divergent locals (':' vs '.'). Edge-id material now uses the colon form; ids are in-memory only and no test asserts the pointKey segment shape.
…twari#2083 review) Two distinct sources flowing into one variable at one def point no longer collapse to a single TAINTED edge: the taint-state key gains a root source-occurrence discriminator ({point, siteIndex} — the same fields recordFinding's identity uses, excluding kind). Def->use fact lookup keys on the source-independent (binding, def-point) portion. Same-source multi-path flows still share one state so their exclusion sets intersect (the raw arm soundly wins); termination holds (finite keys, monotone shrink, no cross-source ping-pong). Restores the KTD6 identity contract.
…bhigyanpatwari#2083 review) The fileish classifier matched any dotted name (UserController.create) as a file via its extension-like suffix, so symbol resolution never ran and the tool returned a silent empty file-anchored result. Tighten the classifier to require a path separator or a real source extension (derived from the resolver's EXTENSIONS list, multi-language), so dotted/bare names route to resolveSymbolCandidates (found / ambiguous / not-found).
…gyanpatwari#2083 review) An M1/M2-era --pdg index has meta.pdg defined (BasicBlock/REACHING_DEF recorded) but no taintModelVersion and zero TAINTED rows. The probe keyed on generic meta.pdg presence, so explain returned the generic empty note instead of the actionable 'no taint layer — run analyze' hint. Gate on meta.pdg?.taintModelVersion (the field M3 stamps) so an M2-era index gets the layer hint; a taint-stamped index with no findings still gets the generic note.
…bhigyanpatwari#2083 review) A comma expression in value position (exec((log(x), 'safe'))) default- descended, fanning every operand's occurrences into the enclosing sink argument — over-tainting exec's arg 0 with x. Add an explicit walkValue case that records earlier operands' uses with occurrence fan-out suppressed (new FactAccumulator.suppressOccurrences) and routes only the last operand through the value path. Sites-layer only; defs/uses/mayDefs byte-identical (cfg + reaching-defs snapshots unchanged).
…gyanpatwari#2083 review) Replace queue.shift() (O(N) dequeue) with a strict-FIFO head cursor plus order-preserving prefix reclamation; FIFO is load-bearing because chainHops reads the live taints map whose parent/source/viaCall are rewritten order-sensitively on monotone shrink, so hop determinism is dequeue-order contingent. Extract findingKey() and dedup-check before chainHops in the justify branch — already-recorded identities discard their hop chain (first write wins), so the ancestry walk was pure waste. The else kill branch is untouched. Findings + hops byte-identical (snapshot unchanged).
…twari#2083 review) addMemberRead rescanned the whole per-statement sites array per call to dedup by (object, property, parent) — O(n^2) on member-read-dense statements. Track a composite-key Set alongside sites for O(1) dedup. (The require-literal join is already O(sites) with a no-op body on non-require sites, so no early-exit is needed there.) Behavior identical: harvest + model-match + taint snapshots unchanged.
…abhigyanpatwari#2083 review) Remove the sanitizerNeutralizes export (its only consumers were two test assertions — inlined to entry.neutralizes membership). Re-export the DEFAULT_PDG_MAX_TAINT_* caps from emit.ts and point run.ts at emit.ts, so the pipeline's taint dependency surface is the single orchestration module rather than reaching into propagate.ts.
…twari#2083 review) The parse/collectFunctions/cfgOf/cfgsOf/importsFor harness was copied byte-for-byte across four suites (harvest, model-match, propagate, taint-emit). Promote it to test/helpers/ts-cfg-harness.ts and import it. site-safety/reaching-defs carry a structurally different inlined builder and are left as-is. Pure extraction, no assertion changes.
… review) Add NaN, Infinity, -Infinity, and a numeric string to the out-of-bounds limit cases — a regression fence over the interpolated LIMIT, confirming the Number.isInteger guard rejects every non-integer/non-finite/string input before it reaches the query.
M3 — Intra-procedural taint analysis (#2083)
First reliable-taint deliverable of Epic A (#2087), building directly on the merged M2 REACHING_DEF layer (#2160). Forward taint reachability over the in-memory def→use facts: TS/JS sources (
req.body/query/params/headers/cookies) to sinks (child_process.exec/execSync/spawn,eval/new Function,fs.*path args, SQLquery/execute,res.send/write), killed by sanitizers. All under the opt-in--pdgflag; flag-off stays byte-identical (golden gate untouched).Acceptance criteria
taint-snapshot.test.ts(AE1) +pipeline-pdg.test.ts+taint-explain.test.ts.pipeline-pdg.test.ts+pipeline-graph-golden.test.ts+ the benchtaint-densebyte ceilings.explainreturns the ordered path + per-hop variable. Pinned bytaint-explain.test.ts+taint-snapshot.test.ts(AE3).How it works (7 commits, one per implementation unit)
StatementFacts(dotted callee paths, receiver slots, nested-site links, per-declaratorresultDefs, spread/template/require-literal markers). The pdg parse-cache chunk-key namespace is versioned (pdg:1→pdg:2) rather than a globalSCHEMA_BUMP, so flag-off users keep warm caches.require-literal resolution; bare-name fallback restricted to true globals; sanitizers module/global-only, never user-shadowable; sanitizers carry neutralized-kinds).escape(req.body)suppressesres.sendbut still firesdb.query;exec(path.basename(t))fires), intersection-over-paths so a bypass occurrence keeps taint live, kill locality, propagate-through args+receiver withviaCallhops, deterministic caps, coverage-gap statuses.TAINTED/SANITIZESedges with versioned hop-encoded reason via a shared path codec, statement-level occurrence identity, dedup-before-budget, all coverage counters surfaced,PROFgainstaint=.--force).explaintool: anchorless enumeration of the sparse table + anchored (file/symbol) hop detail;RepoMeta.pdgprobe yields a no-taint-layer note;TAINTED/SANITIZESpinned out ofVALID_RELATION_TYPES.taint-densescenario (findings pinned at the per-function cap; absolute reason-byte + site-harvest disk ceilings; zero-match cost < 0.5× match-dense).Architecture — taint subsystem
flowchart TB subgraph worker["Parse worker (AST lives here)"] AST["tree-sitter AST"] --> HARV["TsHarvester walkValue<br/>defs / uses / mayDefs"] AST --> SITES["site harvest<br/>call / new / member-read records"] HARV --> SF["StatementFacts defs uses mayDefs sites"] SITES --> SF end SF -. "cfgSideChannel pdg chunk-key v2" .-> INPHASE subgraph INPHASE["In-phase scope-resolution run.ts when pdg on"] direction TB GATE["hasTaintSafeSites gate"] --> MATCH["matchFunctionSites<br/>import-aware ESM alias / require-literal"] MATCH --> FAST{"hasSource AND hasSink"} FAST -- no --> SKIP["skip, no solver"] FAST -- yes --> RD["computeReachingDefs<br/>shared fact derivation"] RD --> PROP["computeTaintFlows<br/>two-rule worklist, kind-set exclusions"] PROP --> EMIT["emitFileTaint<br/>sparse TAINTED + SANITIZES"] end EMIT --> DB[("LadybugDB<br/>BasicBlock to BasicBlock edges")] DB --> EXPLAIN["MCP explain tool<br/>anchorless + file/symbol anchored"] POINT["pointKey in cfg/reaching-defs.ts"] -.-> PROP POINT -.-> EMITSequence — analyze-time taint flow (in-phase)
sequenceDiagram participant W as Parse worker participant R as run.ts pdg window participant M as matchFunctionSites participant S as computeReachingDefs participant P as computeTaintFlows participant E as emitFileTaint participant G as KnowledgeGraph to DB W->>W: harvest defs uses mayDefs and sites Note over W: comma-expr, only final operand fans out W-->>R: ParsedFile cfgSideChannel per function loop per function, modeled language R->>R: hasTaintSafeSites gate R->>M: match sites vs built-in model M-->>R: FunctionSiteMatches hasSource hasSink alt hasSource AND hasSink R->>S: computeReachingDefs with maxFacts S-->>R: FunctionDefUse facts R->>P: computeTaintFlows cfg defUse matches Note over P: source-keyed states, FIFO worklist P-->>R: findings kills status R->>E: emit per-finding TAINTED per-kill SANITIZES E->>G: addRelationship statement-level id else no match R->>R: skip, no solver, the fast path end end Note over G: flag-off means zero taint edges, golden byte-identicalSequence — source-discriminated taint state (review fix)
sequenceDiagram participant Seed as feedDefs seeding participant D as deriveTaint participant T as taints map participant WL as worklist participant RF as recordFinding Note over Seed: x assigned from req.body OR req.query, then exec of x Seed->>D: derive x from source body D->>T: state key = x defPt body siteA, new state Seed->>D: derive x from source query D->>T: state key = x defPt query siteB, distinct state Note right of T: BEFORE, key was x defPt only, second source dropped WL->>RF: finding body to exec WL->>RF: finding query to exec Note over RF: two findings, both provenances preserved per KTD6Tri-review fixes (applied on top — 10 separate commits)
A
/pr-tri-review(Codex + 5 Claude lanes, critic-gated) surfaced findings, all fixed as one commit each (planned with ce-plan + a deepening pass):Class.methodresolves as a symbol instead of a silent file miss.taintModelVersion); FIFO head-cursor worklist + dedup-before-chainHops; O(1) member-read dedup; sharedpointKey; dead-export removal + constant retargeting; shared test harness; limit-rejection hardening.Named contract exclusions (documented in the tool description + module docs)
Intra-procedural only. Closure/callback flows, property/field-sensitive flows, guard-style sanitizers, implicit/control-dependence flows, and cross-function flows are not modeled (M4 #2084 / Epic B #2088). CommonJS aliasing is partially modeled. Exception-path over-approximation may add benign noise.
Validation
tsc --noEmitclean; 441 targeted tests green; bench--checkPASS (6 scenarios, no pre-existing fingerprint drift); root prettier clean; eslint 0 errors. Plan:docs/plans/2026-06-11-002-feat-taint-m3-plan.md.🤖 Generated with Claude Code