
feat(install): toolchain-free tree-sitter via vendored prebuilds#2113

Merged
magyargergo merged 25 commits into
abhigyanpatwari:mainfrom
magyargergo:feat/tree-sitter-prebuilds
Jun 9, 2026

@magyargergo

Collaborator

Warning

DRAFT — DO NOT MERGE until gitnexus/vendor/tree-sitter-kotlin/prebuilds/ is populated by the build-tree-sitter-prebuilds workflow. Until then Kotlin is vendored with an empty prebuilds/ and no source-build fallback, so a fresh install has no Kotlin (even on toolchain hosts). dart/proto remain fully functional throughout.

Goal

Eliminate the C/C++-toolchain requirement at install for every tree-sitter grammar GitNexus uses ("no operational risk for any tree-sitter"), by generating and vendoring native prebuilds — mirroring the existing vendored tree-sitter-swift.

Why this scope (audit-driven)

Of 15 grammars, only 3 are at risk: dart + proto (vendored but compiled from source at postinstall) and kotlin (third-party optionalDependency, source-only). The other 10 already ship all 6 upstream prebuilds and stay dependency-review-tracked, so they remain npm dependencies. swift already vendors upstream prebuilds. Because dart/proto are already off the dependency graph, only kotlin newly leaves it (the sole new CVE-tracking blind spot; mitigated by a recommended drift-check job).

What this PR does

  • .github/workflows/build-tree-sitter-prebuilds.yml — registry-parameterized workflow building {dart,proto,kotlin} × {linux,darwin,win32}-{x64,arm64} natively (no cross-compile), validating each .node loads and parses on its arch, then opening a PR that vendors them. A guard job gates the heavy matrix to run only on dispatch or a real grammar-version change — ordinary code PRs cost zero matrix minutes.
  • dart/proto — prefer a committed prebuild; fall back to today's source build when none matches → no regression before prebuilds exist.
  • kotlin → vendored (Swift parity; supersedes the optionalDependency mechanism from the merged #2110, "fix(install): graceful Kotlin optional-grammar install + accurate toolchain docs"). The ~23 MB parser.c is not vendored (the workflow builds from the published npm package); only node-types.json + bindings + prebuilds are. Removed from optionalDependencies; lockfile regenerated; probe / parser-loader note / README+devcontainer docs / the #2110 tests all updated.
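To make the size of that build matrix concrete, here is a minimal sketch that enumerates it. The constant and function names are illustrative assumptions, not the workflow's actual registry format.

```javascript
// Hypothetical names: this is not the registry used by the real workflow.
const GRAMMARS = ['dart', 'proto', 'kotlin'];
const PLATFORMS = ['linux', 'darwin', 'win32'];
const ARCHES = ['x64', 'arm64'];

// Every platform-arch tuple a prebuild must exist for (six in total).
function platformTuples() {
  return PLATFORMS.flatMap((platform) =>
    ARCHES.map((arch) => `${platform}-${arch}`),
  );
}

// One native build leg per grammar per tuple.
function buildMatrix(grammars = GRAMMARS) {
  return grammars.flatMap((grammar) =>
    platformTuples().map((tuple) => ({ grammar, tuple })),
  );
}

console.log(platformTuples().length); // 6
console.log(buildMatrix().length); // 18
```

With three grammars the matrix has 18 legs, which is why the guard job keeps it from running on ordinary code PRs.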

Verification (local)

tsc --noEmit clean · 44 unit tests pass (incl. updated cli-commands + build-tree-sitter-kotlin-probe) · prettier + eslint clean · all build scripts node --check + exit-0 smoke · workflow YAML + embedded scripts validated. The actual cross-platform prebuild load is only provable by dispatching the workflow (that's the point of the first run).

To complete (open decisions)

  1. Confirm arm64 hosted runners (ubuntu-24.04-arm, windows-11-arm) are available to this repo's plan — else those legs queue forever and no prebuild PR lands.
  2. Pin 3 placeholder action SHAs (setup-python, attest-build-provenance) + zizmor/Scorecard allowlist.
  3. RELEASE_APP_* token must have Contents+PRs write for the aggregate job (publish.yml pattern).
  4. Dispatch the workflow → it builds the binaries and opens the vendoring PR → populate prebuilds/ here → un-draft.
  5. Optional: a weekly drift-check job (recovers the kotlin CVE signal) and closing the tree-sitter-c 4/6-prebuild gap when the tree-sitter 0.21→0.23 runtime bump lands.

🤖 Generated with Claude Code

… prebuilds

Eliminate the C/C++-toolchain requirement at install for the at-risk grammars
(dart, proto, kotlin) by generating + vendoring native prebuilds, mirroring the
existing vendored tree-sitter-swift. The 10 grammars that already ship 6 upstream
prebuilds stay npm dependencies (toolchain-free AND dependency-review-tracked).

- .github/workflows/build-tree-sitter-prebuilds.yml: a registry-parameterized
  workflow that builds {dart,proto,kotlin} x {linux,darwin,win32}-{x64,arm64}
  prebuilds natively, validates each loads + parses on its arch, and opens a PR
  vendoring them. A `guard` job gates the heavy matrix to run ONLY on dispatch
  or a real grammar-version change — ordinary code PRs cost zero matrix minutes.
- dart/proto: prefer a committed prebuild; fall back to today's source build
  when none matches (no behavior change until prebuilds are vendored).
- kotlin: vendor it (Swift parity) instead of compiling the third-party
  optionalDependency from source at the user's install — supersedes abhigyanpatwari#2110's
  optionalDependency mechanism. The ~23 MB parser.c is NOT vendored (the
  workflow builds from the published package); only node-types + bindings +
  prebuilds are. Removed from optionalDependencies; lock regenerated; probe,
  parser-loader note, README/.devcontainer docs, and the abhigyanpatwari#2110 tests updated.

DO NOT MERGE until vendor/tree-sitter-kotlin/prebuilds/ is populated by the
build-tree-sitter-prebuilds workflow: until then Kotlin is unavailable (vendored
with no source-build fallback). dart/proto remain fully functional throughout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 9, 2026


@magyargergo is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

CI Report

All checks passed

Pipeline Status

| Stage | Status | Details |
| --- | --- | --- |
| ✅ Typecheck | success | tsc --noEmit |
| ✅ Tests | success | unit tests, 3 platforms |
| ✅ E2E | success | gitnexus-web changes only |

Test Results

| Tests | Passed | Failed | Skipped | Duration |
| --- | --- | --- | --- | --- |
| 10841 | 10825 | 0 | 16 | 534s |

✅ All 10825 tests passed

16 test(s) skipped:
  • COBOL pipeline benchmark > scales with file count
  • C++ ADL emit benchmark > emit phase scales sub-quadratically with co-scaled files and sites
  • C++ pipeline benchmark > scales with file count
  • C# pipeline benchmark > scales with file count — namespaces spread across the solution
  • C# pipeline benchmark > scales with file count — all types in one (global) namespace bucket
  • C# pipeline benchmark > scales with file count — all types in one (named) namespace bucket
  • Go pipeline benchmark > scales with file count (workers enabled)
  • Go pipeline benchmark — worker pool (issue #1848) > does not quarantine the large generated Go file on sub-batch idle timeout
  • Go structural interface detection benchmark > scales linearly with interface × struct count
  • Go structural interface detection split-phase benchmark > separates index-build and detection time
  • PHP pipeline benchmark > scales with file count (workers enabled)
  • Ruby pipeline benchmark > scales with file count (workers enabled)
  • Rust pipeline benchmark > scales with file count (workers enabled)
  • Vue pipeline benchmark > scales with component count
  • run.cjs direct-exec entrypoint (#1945) > resolves a .cmd shim via the Windows shell branch, passing args and exit code
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 75.07% 35450/47221 N/A% 🟢 ███████████████░░░░░
Branches 62.81% 21904/34869 N/A% 🟢 ████████████░░░░░░░░
Functions 80.81% 3829/4738 N/A% 🟢 ████████████████░░░░
Lines 78.87% 32053/40639 N/A% 🟢 ███████████████░░░░░

📋 View full run · Generated by CI

Regression guard so a toolchain-less install can never silently lose a tree-sitter
language on a supported platform-arch:

- Vendored grammars (vendor/tree-sitter-*): every one MUST ship a loadable N-API
  prebuild for all 6 tuples {linux,darwin,win32}-{x64,arm64}. Asserts the
  napi_register_module_v1 entry symbol in each .node (cross-platform, no need to
  run the binary). Currently RED for dart/proto/kotlin until the
  build-tree-sitter-prebuilds workflow populates their prebuilds/ — this is the
  must-fill-before-merge gate (swift already passes 6/6).
- npm-dependency grammars: asserts upstream ships 6/6 N-API too, catching a
  future platform drop. tree-sitter-c is allow-listed at 4/6 (missing
  linux-arm64/win32-arm64) pending abhigyanpatwari#2116; the guard also fails if that gap is
  silently closed (prompting allow-list removal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@magyargergo

Collaborator Author

Added: full prebuild/ABI coverage verification + regression guard

Verified every tree-sitter grammar for native-binding coverage across {linux,darwin,win32}-{x64,arm64} on the ABI we support (Node ≥22 / N-API; tree-sitter runtime 0.21.1, language ABI 14). Method: inspected the installed tarball contents and checked each .node for the N-API entry symbol napi_register_module_v1 on all six platforms (darwin/win32 included, by binary inspection, with no need to run them).

Result: 10 npm grammars + vendored swift = full 6/6 N-API. The only gaps are dart/proto/kotlin (0/6 — what this PR fixes) and tree-sitter-c 4/6 (missing linux-arm64/win32-arm64) → filed as #2116.

New regression guard (gitnexus/test/unit/prebuild-coverage.test.ts):

  • Every vendored grammar must ship a loadable N-API prebuild for all 6 tuples. This is intentionally RED for dart/proto/kotlin right now — it is the must-fill-before-merge gate. Once the build-tree-sitter-prebuilds workflow populates prebuilds/, it goes green. (swift already passes.)
  • npm grammars are checked too (catches a future upstream platform drop); tree-sitter-c is allow-listed at 4/6 pending #2116 (tree-sitter-c ships only 4/6 native prebuilds, missing linux-arm64 and win32-arm64, so a toolchain is needed on ARM), and the guard fails if that gap is silently closed (prompting allow-list removal).

So the tests / ubuntu / coverage check will show the 3 vendored failures until the binaries land — that's the gate working, not a broken test.
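The symbol check described above can be approximated in a few lines. This is a sketch under the assumption that the export name is stored as plain ASCII bytes in the binary (the usual case for ELF, Mach-O, and PE symbol tables); `hasNapiEntrySymbol` is a hypothetical helper, not the code in prebuild-coverage.test.ts.

```javascript
// Sketch only: a byte search, not a real object-file parser.
const NAPI_ENTRY_SYMBOL = 'napi_register_module_v1';

// `binary` is a Buffer. In a real check it would come from
// fs.readFileSync(pathToNodeFile); here synthetic bytes keep the example
// self-contained.
function hasNapiEntrySymbol(binary) {
  return binary.includes(Buffer.from(NAPI_ENTRY_SYMBOL, 'latin1'));
}

// Synthetic stand-ins for a .node file: an ELF magic number plus a string table.
const napiLike = Buffer.concat([
  Buffer.from([0x7f, 0x45, 0x4c, 0x46]),
  Buffer.from('\0napi_register_module_v1\0', 'latin1'),
]);
const nonNapiLike = Buffer.concat([
  Buffer.from([0x7f, 0x45, 0x4c, 0x46]),
  Buffer.from('\0some_other_entry_point\0', 'latin1'),
]);

console.log(hasNapiEntrySymbol(napiLike)); // true
console.log(hasNapiEntrySymbol(nonNapiLike)); // false
```

Because the check only reads bytes, it can inspect a darwin or win32 binary from a Linux CI host, which matches the "no need to run them" point above.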

magyargergo and others added 7 commits June 9, 2026 10:27
…builds (abhigyanpatwari#2116)

tree-sitter-c is the one grammar dependency upstream ships incomplete prebuilds
for (4/6 — no linux-arm64/win32-arm64), AND it is a REQUIRED grammar: its own
`install` (node-gyp-build) compiles from source when no prebuild matches and
exits non-zero, so on a toolchain-less ARM host `npm install gitnexus` HARD-FAILS
at the c step — during npm's dependency phase, before any GitNexus postinstall
runs (so a postinstall "supplement" can't help).

Fix: vendor c prebuild-only at the pinned 0.21.4 (Kotlin pattern), with all six
prebuilds GitNexus-cross-built, and drop it from `dependencies`:
- vendor/tree-sitter-c/ (bindings + node-types + manifest + prebuilds); build
  probe scripts/build-tree-sitter-c.cjs; added to the build workflow registry
  (kind 'npm' — built from c@0.21.4 source).
- materialize-vendor-grammars.cjs: c is REQUIRED, so it is always materialized,
  even under GITNEXUS_SKIP_OPTIONAL_GRAMMARS (it needs no toolchain).
- Removed from package.json dependencies + lockfile (nothing else needs npm c —
  tree-sitter-cpp's dep on c is dev-only and not installed). Preserves the abhigyanpatwari#1242
  ABI pin: vendoring 0.21.4 keeps the good ABI while closing the ARM gap.
- parser-loader note + the prebuild-coverage guard + a cli-commands assertion
  updated; c moves from the npm-gap allow-list into the vendored 6/6 cohort.

Verified: tsc clean, 31 unit tests pass, c loads/parses; the guard is RED for
c/dart/proto/kotlin until the workflow populates prebuilds (the must-fill gate).
Closes the operational risk in abhigyanpatwari#2116.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… pre-prebuilds

The vendored prebuild-only grammars (c, kotlin) had empty prebuilds/ until the
build-tree-sitter-prebuilds workflow runs, so they could not load in CI — and
C is hard-required by cross-platform tests (tree-sitter-languages/parsing on
ubuntu+macos+windows), which I cannot pre-build for macos/windows locally. The
robust fix is a source-build fallback that works on every CI runner (all have a
toolchain), mirroring dart/proto:

- Vendor the grammar source (binding.gyp + src/) for c and kotlin; their build
  scripts now PREFER a committed prebuild (toolchain-free) and fall back to
  `node-gyp rebuild` from the vendored source when no prebuild matches. Verified
  both compile against the hoisted node-addon-api@^8 and the runtime loads.
- prebuild-coverage guard is now bootstrap-tolerant: a grammar that vendors its
  source (binding.gyp) may have an incomplete prebuild set (the workflow fills
  it); a prebuild-only grammar (swift) still must ship all six. Any present
  prebuild must still be N-API. Guard goes green; it re-tightens per-grammar as
  the workflow populates prebuilds.
- actionlint: silence a false-positive SC2016 (JS template literals inside the
  single-quoted `node -e` validate block).

Note: kotlin's generated parser.c is large (~23 MB on disk; compresses heavily
in git). Once the workflow populates all six kotlin prebuilds, the source serves
only as the fallback and could be slimmed if desired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`npm prune --omit=dev` in the gitnexus CLI image drops anything not in
package.json's dependency tree — including the VENDORED tree-sitter grammars
(materialized by postinstall, not declared deps) and their built bindings. The
`serve` image analyzes/parses repos at runtime, so re-run the grammar postinstall
after the prune (in the toolchain-equipped builder) to restore them. Load-bearing
for tree-sitter-c, a core REQUIRED grammar now vendored (abhigyanpatwari#2116): as a former
dependency it survived prune; vendored, it would not. Also restores
swift/dart/proto/kotlin, which were silently pruned from the image before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d pipeline

Swift was the last grammar handled differently — it shipped only upstream
prebuilds, while c/dart/proto/kotlin vendor their grammar source and use a
prefer-prebuild -> source-build-fallback activation script. Vendor swift's
source so all five are handled identically (one uniform build path).

- vendor/tree-sitter-swift: add binding.gyp (win-hardened), bindings/node/
  binding.cc, src/parser.c (ABI-14 default, ~18 MB), src/scanner.c, and
  src/tree_sitter/ headers. The 6/6 prebuilds are retained. The legacy
  parser_abi13.c alternate is intentionally not vendored.
- build-tree-sitter-swift.cjs: rewrite the prebuild probe into the dart-style
  prefer-prebuild then source-build fallback (keeps the GITNEXUS_SKIP gate and
  the never-exit-non-zero postinstall invariant).
- build-tree-sitter-prebuilds.yml: register swift (kind 'vendored'); add its
  package.json to the version-gated pull_request paths and a validate snippet.
- prebuild-coverage guard auto-moves swift into the source-fallback cohort
  (binding.gyp now present); refresh the stale "swift is prebuild-only" comments.
- tests: add build-tree-sitter-swift-probe.test.ts; fix the pre-existing
  build-tree-sitter-kotlin-probe.test.ts breakage (it still asserted the old
  probe strings after kotlin's dart-style conversion); assert swift's vendored
  source in cli-commands.test.ts.
- docs: README / .devcontainer / kotlin vendor README — swift's prebuilds are
  now GitNexus-cross-built from vendored source like the rest, not upstream-only.

Verified: swift source-builds against node-addon-api@8 -> N-API binary -> loads
against the pinned tree-sitter@0.21.1 (ABI 14) -> parses cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ge guard

Vendoring grammar source (parser.c) alongside the prebuilds means the npm
tarball now carries ~50 MB of generated source it almost never compiles (every
supported platform-arch has a prebuild). Prepare to drop it from the published
package once all prebuilds exist — safely.

- .npmignore: add a GATED, commented-out "lean publish" block that excludes the
  source-build inputs (parser.c/scanner.c/tree_sitter/binding.gyp/binding.cc) but
  keeps prebuilds/ + the runtime files. Uncommenting ships prebuilds-only.
- scripts/assert-publish-grammar-coverage.cjs: a prepack guard that refuses to
  pack/publish if the source exclusion is active while any vendored grammar still
  lacks 6/6 prebuilds (which would ship a grammar with no loadable binding). Wired
  into `prepack` (runs on npm pack + publish, incl. the publish.yml dry-run) and
  exposed as `npm run assert-publish-coverage`.
- test: pure-core decision cases + a real-repo publish-safety check that fails CI
  if .npmignore is activated prematurely.

Net: the prebuilds already publish today (files: ["vendor"]); this makes the
future switch to a prebuilds-only tarball a one-line uncomment that can't ship a
dead grammar. The guard currently reports "source + prebuilds" (only swift has
6/6 prebuilds so far) and passes.
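The guard's decision can be pictured as a small pure function. This is a sketch under assumptions: `checkPublishCoverage`, its argument shape, and the `'prebuilds-only'` label are hypothetical, and the prebuild counts are illustrative values consistent with "only swift has 6/6 so far".

```javascript
// Hypothetical decision core; the real script's API and messages may differ.
const REQUIRED_PREBUILDS = 6; // {linux,darwin,win32} x {x64,arm64}

// Refuse to publish only when source is excluded AND some grammar lacks 6/6.
function checkPublishCoverage({ sourceExcluded, prebuildCounts }) {
  const incomplete = Object.entries(prebuildCounts)
    .filter(([, count]) => count < REQUIRED_PREBUILDS)
    .map(([name]) => name);
  const mode = sourceExcluded ? 'prebuilds-only' : 'source + prebuilds';
  const ok = !(sourceExcluded && incomplete.length > 0);
  return { ok, mode, incomplete };
}

// Illustrative counts, not measured values.
const counts = { c: 0, dart: 0, proto: 0, kotlin: 0, swift: 6 };

const today = checkPublishCoverage({ sourceExcluded: false, prebuildCounts: counts });
console.log(today.mode, today.ok); // source + prebuilds true

const premature = checkPublishCoverage({ sourceExcluded: true, prebuildCounts: counts });
console.log(premature.ok, premature.incomplete.join(','));
```

Keeping the decision pure is what lets the "pure-core decision cases" be unit-tested without packing anything.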

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The per-grammar activation scripts (c/dart/proto/swift/kotlin) were ~95%
identical — same prefer-prebuild → source-build → never-fail flow, differing only
in name, target_name, required-vs-optional, and the display label in warnings.

- scripts/build-tree-sitter-grammars.cjs: one registry-driven script. Bare call
  builds all (postinstall); `... <name>` builds only the named grammars (so the
  probe test can isolate one). c is `required: true` (ignores the opt-out gate);
  the rest honor GITNEXUS_SKIP_OPTIONAL_GRAMMARS. Per-grammar try/catch + a final
  process.exit(0) preserve the postinstall never-exit-non-zero invariant.
- package.json: postinstall is now `materialize && build-tree-sitter-grammars.cjs`
  (was five chained `build-tree-sitter-<name>.cjs` calls).
- tests: replace the two near-identical *-probe.test.ts files with one
  parameterized build-tree-sitter-grammars-probe.test.ts that also covers the
  required-vs-optional opt-out split and an unknown-grammar arg.
- update cli-commands.test.ts postinstall assertions + the vendor c/kotlin/swift
  README + swift provenance to reference the consolidated script.

Behavior is preserved (warnings normalized to one consistent format). Removes 5
scripts + 1 test file; adds 1 script + 1 test.
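The consolidated flow described above (prefer prebuild, fall back to a source build, never fail the install) might look roughly like the sketch below. The registry shape, the result labels, the injected `hasPrebuild`/`sourceBuild` callbacks, and the handling of unknown names are all assumptions made for illustration; the sketch also assumes the opt-out is triggered by the value `1`.

```javascript
// Hypothetical registry: only `c` is required, per the commit message.
const REGISTRY = [
  { name: 'c', required: true },
  { name: 'dart', required: false },
  { name: 'proto', required: false },
  { name: 'swift', required: false },
  { name: 'kotlin', required: false },
];

// Returns a per-grammar outcome and never throws, mirroring the
// never-exit-non-zero invariant (the real script ends with process.exit(0)).
function activateGrammars({ names = [], env = {}, hasPrebuild, sourceBuild, warn = () => {} }) {
  const skipOptional = env.GITNEXUS_SKIP_OPTIONAL_GRAMMARS === '1';
  const selected = names.length > 0 ? names : REGISTRY.map((g) => g.name);
  const results = {};
  for (const name of selected) {
    const grammar = REGISTRY.find((g) => g.name === name);
    if (!grammar) {
      warn(`unknown grammar: ${name}`);
      results[name] = 'unknown';
      continue;
    }
    if (skipOptional && !grammar.required) {
      results[name] = 'skipped'; // required grammars ignore the opt-out gate
      continue;
    }
    try {
      if (hasPrebuild(name)) {
        results[name] = 'prebuild'; // toolchain-free path
        continue;
      }
      sourceBuild(name); // e.g. `node-gyp rebuild` in the real script
      results[name] = 'source';
    } catch (err) {
      warn(`tree-sitter-${name}: ${err.message}`);
      results[name] = 'failed';
    }
  }
  return results;
}

const outcome = activateGrammars({
  env: { GITNEXUS_SKIP_OPTIONAL_GRAMMARS: '1' },
  hasPrebuild: (name) => name === 'c',
  sourceBuild: () => {},
});
console.log(outcome.c, outcome.kotlin); // prebuild skipped
```

Passing names (for example `['kotlin']`) restricts the run to those grammars, which is the hook the parameterized probe test relies on.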

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@magyargergo magyargergo left a comment

Collaborator Author

Tri-review (3 methods, Codex live) — toolchain-free vendored tree-sitter

Methods & engines. GitNexus swarm + Compound-Engineering personas (both Claude) + Codex (the one independent engine, live) + two requested personas (DevOps/release-pipeline, npm native-build). 4 of 6 Claude lanes returned full structured findings (correctness, adversarial, DevOps, npm-native); 2 (risk, test/CI) ended mid-investigation, their domains covered by the others. Claude-lane agreement = "consistent across personas," not independent; the strong signal is Codex+Claude.

What's solid (the reviews validated this). The consolidated build script's required-vs-optional gate + never-exit-non-zero invariant; the Docker prune → re-postinstall → COPY chain (builder has the toolchain, so the .node + node-gyp-build reach runtime); the publish coverage-guard correctly blocking a lean tarball while prebuilds are incomplete; the cost-gating (an ordinary PR triggers zero native matrix — confirmed); least-privilege workflow perms + persist-credentials: false; N-API / ABI-14 consistency. Genuine care here.

🔴 Headline — P1 (resolve before un-drafting/merge)

Unguarded static import C from 'tree-sitter-c' crashes analyze at module-load when C has no binding. tree-sitter-c is now vendored prebuild-only and removed from dependencies, but it has 0/6 committed prebuilds (only .gitkeep). On a toolchain-less host (source-build fails) or under npm install --ignore-scripts (materialize never runs), C ends up with no binding. Three sites import it with a hard static ESM import that runs before parser-loader's optional/degradation machinery — parse-worker.ts:7, core/ingestion/languages/c/query.ts:2 (eager on the main thread via languages/index.ts → c-cpp.ts → c/index.ts), core/group/extractors/include-extractor.ts:5 — so the import throws ERR_MODULE_NOT_FOUND at module-load and crashes the whole run, instead of the clean "C disabled, other languages fine" the parser-loader entry intends. This is the exact bug class #2091/#2093 fixed for Swift/Dart/Kotlin via the lazy getLanguageGrammar pattern; C was left static because it used to be an always-present npm dep.

  • Mechanism: adversarial (Claude) lane + coordinator code-read + the #2091/#2093 precedent. Codex independently flagged the --ignore-scripts failure path (its F2), corroborating the install-failure trigger — but it framed the result as "parser loader fails," not the static-import module-load crash, so the crash mechanism itself is single-Claude-lane + coordinator. [code-read]
  • Fix: commit C's six prebuilds before un-drafting, or convert the three C imports to the lazy/guarded pattern Swift/Dart/Kotlin already use.
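The second fix option, a guarded load plus a conditional map entry, can be sketched as follows. `tryLoadGrammar` and the injected loader functions are hypothetical stand-ins; in a CommonJS module the loader would be something like `() => require('tree-sitter-c')`, and the repo's own `getLanguageGrammar` / `_require` helpers may be shaped differently.

```javascript
// Hypothetical helper: returns the grammar, or null when its binding is absent.
// The loader is injected so this sketch runs without the real packages.
function tryLoadGrammar(load) {
  try {
    return load();
  } catch {
    return null; // degrade instead of crashing at module load
  }
}

// Simulate a host where tree-sitter-c has no binding.
const C = tryLoadGrammar(() => {
  const err = new Error("Cannot find package 'tree-sitter-c'");
  err.code = 'ERR_MODULE_NOT_FOUND';
  throw err;
});
// Simulate a grammar that loads fine (placeholder object, not a real grammar).
const JavaScript = tryLoadGrammar(() => ({ name: 'javascript' }));

// Conditional spread: an absent grammar is simply left out of the map.
const languageMap = {
  ...(JavaScript ? { javascript: JavaScript } : {}),
  ...(C ? { c: C } : {}),
};

console.log(Object.keys(languageMap)); // [ 'javascript' ]
```

The point of the pattern is that a missing binding turns into a missing map entry, so the other languages keep working.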

🔴 P0 — supply chain (pre-merge)

attest-build-provenance is pinned to bd77c077… with the comment # v2.4.0 # PLACEHOLDER-PIN, but the real v2.4.0 is e8998f94… (coordinator-verified via the GitHub API). bd77c077… is an untagged mid-stream commit, so the SLSA-attestation step runs unvetted action code and the comment misrepresents what runs. The PR's own "# PLACEHOLDER-PIN — verify before merge" marker flags it as unfinished; the specific finding is that the SHA is wrong, not merely unverified. The sibling setup-python@a26af69b… # v5.6.0 PLACEHOLDER-PIN is, by contrast, a correct pin (coordinator-verified); only its comment is stale. (Inline on the workflow line.)

Other findings (body)

  • P1 (DevOps): aggregate hard-fails with no graceful skip if RELEASE_APP_ID/RELEASE_APP_PRIVATE_KEY are unset; the "open a PR with prebuilds" feature can't run until the App is provisioned, and the failure comes only after a full 6-runner build has spent its budget. Gate on secret presence + fall back to open_pr=false artifacts. [code-read]
  • P2 (Codex F3 + correctness + npm-native — STRONG) — the publish guard detects source-exclusion only via the vendor/**/src/parser.c toggle; a partial .npmignore edit (e.g. excluding binding.gyp but leaving parser.c commented) passes the guard while shipping an unbuildable grammar. Validate the full source-build set, or assert against npm pack --dry-run --json.
  • P2 (Codex F5 + adversarial + correctness — STRONG) — test false-confidence: gitnexus/test/unit/prebuild-coverage.test.ts's strict 6/6 assertion is dormant whenever binding.gyp exists (all grammars have it), so a missing prebuild passes CI silently; and parser-loader-abi.test only drives the lazy loader, so it's blind to the P1 crash. CI is green precisely because CI hosts have a toolchain. Add a per-grammar "fully-prebuilt" gate + a smoke that imports the real static-import surface.
  • P2 (npm-native + adversarial) — "toolchain-free" is aspirational today: only Swift has 6/6 prebuilds; c/dart/proto/kotlin are source-build-only until build-tree-sitter-prebuilds.yml runs (itself blocked on the P0 SHA + new runners + RELEASE_APP secrets). This transitional state is what makes the P1 live.
  • P2/P3 (DevOps): gitnexus/package-lock.json in the workflow paths fires the (cheap) guard on most dependency PRs; the new arm runners (esp. windows-11-arm) are unproven in this repo (the failure mode is safe: fail-fast: false plus the 6/6 aggregate assertion refuse a partial set); the 30-min build timeout is tight for the 23.6 MB kotlin parser.c.
  • P3 (Codex F4, verify): aggregate's if: inputs.open_pr != false sees a null inputs.open_pr on pull_request events; GHA null-coercion makes the direction non-obvious (over-fire per Codex's read, or under-fire the documented version-change-PR → prebuild-PR flow). Make it explicitly event-gated.
  • P3 (npm-native) — promote node-gyp-build/node-addon-api to gitnexus's own dependencies (currently safe under --omit=optional only via the required tree-sitter's transitive edge — see Refuted); add the /std:c11 /utf-8 win block to c's binding.gyp for parity (inert today — no non-ASCII bytes in any parser.c/scanner.c); AGENTS.md:177 still calls kotlin/swift "optional" (stale).
  • P3 (correctness) — pre-existing materialize double-rename-failure edge could leave a grammar unmaterialized (very low probability).

✅ Validated / refuted (the reviews doing their job)

  • REFUTED — Codex F1 (P1: node-gyp-build optionalDependency → --omit=optional breaks C). The required tree-sitter@0.21.1 declares node-addon-api + node-gyp-build as regular dependencies (lockfile verified), as do 12 other regular tree-sitter-* deps — so node-gyp-build survives --omit=optional and the npm-11 arborist prune, and every vendored grammar's require("node-gyp-build") resolves. The independent engine's plausible P1 is a false positive; two Claude lanes + the coordinator's lockfile check refute it. (The P3 hardening note above is the residual.)
  • Docker image, the publish guard's "can't ship a fully-dead grammar," and GITNEXUS_SKIP_OPTIONAL_GRAMMARS=1 were all probed and found safe.

CI: ABI gates ×3, packaged-install smoke (ubuntu+windows), lint/format/typecheck/actionlint/zizmor/CodeQL/gitleaks/Trivy all green; build-matrix + "Vendor prebuilds" correctly skip (no version change); a few pending; Vercel = deploy-auth (ignore).

Coverage: read the substantive diff (scripts, parser-loader, workflow, Dockerfile, .npmignore, package.json, vendor binding.gyp/package.json); the generated parser.c/node-types.json/binaries were not line-read.

Automated multi-tool digest (3 methods, Codex live). Verify before acting.

Comment thread .github/workflows/build-tree-sitter-prebuilds.yml Outdated
'usually indicates a corrupted install, an unsupported Node version, ' +
'or a native ABI mismatch with the bundled tree-sitter runtime. ' +
'Try `npm rebuild tree-sitter-c` or reinstalling, then re-run analyze. ' +
'C parsing disabled: vendored `tree-sitter-c` (under ' +

Collaborator Author

P1 (resolve before un-drafting) — this C degradation is bypassed by unguarded static imports. This optional / severity: error entry is meant to turn a missing tree-sitter-c binding into a clean "C disabled, other languages fine." But tree-sitter-c is now vendored prebuild-only with 0/6 committed prebuilds, and three sites import it with a hard static ESM import that runs before this loader and never reaches it: parse-worker.ts:7, core/ingestion/languages/c/query.ts:2 (eager on the main thread via languages/index.ts → c-cpp.ts → c/index.ts), and core/group/extractors/include-extractor.ts:5.

On a toolchain-less host (source-build fails) or under npm install --ignore-scripts (materialize never runs), C has no binding → those imports throw ERR_MODULE_NOT_FOUND at module-load → the whole analyze crashes, never reaching this degradation. This is the exact bug class #2091/#2093 fixed for Swift/Dart/Kotlin via the lazy getLanguageGrammar pattern; C was left static because it used to be an always-present npm dep.

Fix: commit C's six prebuilds before un-drafting, OR convert the three C imports to the lazy/guarded pattern. (Anchored here on the bypassed degradation handler — the actual crash sites are the three import lines above. Adversarial lane traced the mechanism; Codex independently flagged the --ignore-scripts path; verified by code-read + the #2091/#2093 precedent.) [code-read]

tree-sitter-c is now vendored prebuild-only (abhigyanpatwari#2116) with 0/6 committed
prebuilds, so on a toolchain-less or `--ignore-scripts` install C has no native
binding. Three modules loaded it via a hard top-level `import C from
'tree-sitter-c'`, which throws ERR_MODULE_NOT_FOUND at module-load — crashing
`analyze` before parser-loader's optional/severity:error degradation can run.
This is the abhigyanpatwari#2091/abhigyanpatwari#2093 bug class (previously fixed for swift/dart/kotlin); C was
left static because it used to be an always-present npm dependency.

- languages/c/query.ts: load via the lazy guarded getLanguageGrammar(C), mirroring
  swift/query.ts; the main-thread isLanguageAvailable filter ensures the getters
  are reached only when C is present.
- workers/parse-worker.ts: guarded `_require('tree-sitter-c')` + conditional
  languageMap spread, like swift/dart/kotlin.
- group/extractors/include-extractor.ts: guarded `_require`; getLanguageForFile
  returns null for .c/.h when absent, so C include-extraction degrades to a no-op
  (C++ unaffected).
- extend the registry-import-closure regression test (abhigyanpatwari#2091/abhigyanpatwari#2093) to assert C
  also loads lazily at registry static-import time.

The workflow pinned actions/attest-build-provenance@bd77c077… commented
`# v2.4.0`, but v2.4.0 is e8998f94… (verified via the GitHub API); bd77c077…
is an untagged mid-stream commit, so the SLSA-attestation step ran unvetted
action code and the comment misrepresented what runs. Repin to the real
v2.4.0 commit and drop the `# PLACEHOLDER-PIN` markers on both this line and
the setup-python pin (a26af69b… is already the correct v5.6.0 — only its
comment was stale). Update the header NOTE accordingly.
…absent

The aggregate job mints a GitHub App token as its first step; with
RELEASE_APP_ID/RELEASE_APP_PRIVATE_KEY unset it hard-failed AFTER a full
(up-to-6-runner) native build. Since the `secrets` context isn't available in
a job-level `if:`, the guard job now computes a `release_app` boolean output
(a step can read secrets) and emits an actionable `::notice::`; aggregate
gates on it and skips cleanly, while the build job's artifacts still upload
(run with open_pr=false for artifacts-only).
…en build timeout

`gitnexus/package-lock.json` changes on nearly every dependency PR, so it
fired the prebuild workflow's guard job on unrelated churn (the matrix stayed
correctly skipped — `gitnexus/package.json` already covers the transition-window
pin, so removing the lock only drops guard noise). Also bump the native build
job timeout 30 -> 45 min for headroom compiling the 23 MB kotlin / 18 MB swift
parser.c, especially under arm emulation.

`inputs.open_pr` is null on pull_request events, and the prior
`inputs.open_pr != false` leg relied on GHA's direction-ambiguous null
coercion (Codex F4) to decide whether to open the prebuild PR. Gate
explicitly on the event: a non-fork pull_request that bumped a grammar
version opens the prebuild PR (the documented flow), and `open_pr` is only
consulted on workflow_dispatch — so a manual run with open_pr=false stays
artifacts-only and no event's behavior rests on coercion.
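The event gating described above can be modeled as a plain predicate, which makes the intended truth table easy to check. This is a sketch, not the workflow expression itself: the function and parameter names are hypothetical, and it assumes that a dispatch run without an explicit `open_pr=false` opens the PR and that any other event never does.

```javascript
// Hypothetical model of the aggregate job's gate (the real one is a GHA `if:`).
function shouldOpenPrebuildPr({ eventName, isFork = false, versionChanged = false, openPr }) {
  if (eventName === 'workflow_dispatch') {
    // The input is consulted only here; open_pr=false means artifacts-only.
    return openPr !== false;
  }
  if (eventName === 'pull_request') {
    // `openPr` is null/undefined on this event and is deliberately ignored.
    return !isFork && versionChanged;
  }
  return false; // assumption: no other event opens a prebuild PR
}

console.log(shouldOpenPrebuildPr({ eventName: 'pull_request', versionChanged: true })); // true
console.log(shouldOpenPrebuildPr({ eventName: 'workflow_dispatch', openPr: false })); // false
```

Because each branch names its event, no outcome depends on how a null input coerces.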
…e guard

The publish guard inferred "is source shipped?" from a single .npmignore toggle
line, which a partial/out-of-order edit could defeat (exclude binding.gyp but
leave parser.c → unbuildable yet "source-shipping"). It now inspects the
EFFECTIVE tarball via `npm pack --dry-run --ignore-scripts --json` (the
--ignore-scripts avoids re-entering this guard through prepack): a grammar
"ships source" only when EVERY on-disk source-build input (binding.gyp +
binding.cc + parser.c + scanner.c when present + a tree_sitter header) is
actually in the packed file list.

This also surfaced that the gated lean-publish .npmignore block was inert:
package.json's `files: ["vendor"]` allow-list overrides .npmignore for the
vendored subtree, so those exclusion lines never dropped anything. Replace the
dead toggle with documentation of the real mechanism (narrow the `files` field)
and note that the guard enforces safety on the effective pack regardless of how
the package is slimmed.
…erage

The strict 6/6 prebuild assertion was dormant whenever a grammar vendors source
(binding.gyp) — which is every grammar — so a dropped prebuild passed CI
silently. Add a FULLY_PREBUILT allowlist of grammars GitNexus has committed 6/6
for (today: swift); those must keep all six even with a source fallback, so
losing one now fails CI. Grammars graduate into the set as the
build-tree-sitter-prebuilds workflow lands their binaries. (The static-import
degradation smoke is covered by the registry-import-closure regression test
extended in the C lazy-load commit.)
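A minimal sketch of the allowlist rule, assuming prebuild directories are named `<platform>-<arch>` for the six targets the build workflow covers. The helper names (`missingPrebuilds`, `checkGrammar`) and the shape of their arguments are hypothetical and are not taken from the actual test.

```javascript
// The six targets built by the prebuild workflow.
const TARGETS = [
  'linux-x64', 'linux-arm64',
  'darwin-x64', 'darwin-arm64',
  'win32-x64', 'win32-arm64',
];

// Grammars that must keep all six prebuilds (today: swift).
const FULLY_PREBUILT = new Set(['swift']);

// Hypothetical helper: which targets have no prebuild directory?
function missingPrebuilds(presentTargets) {
  return TARGETS.filter((target) => !presentTargets.includes(target));
}

// Hypothetical check: a source fallback excuses gaps only for grammars
// that are NOT on the FULLY_PREBUILT allowlist.
function checkGrammar(name, presentTargets, hasSourceFallback) {
  const missing = missingPrebuilds(presentTargets);
  if (FULLY_PREBUILT.has(name)) {
    return { ok: missing.length === 0, missing };
  }
  return { ok: hasSourceFallback || missing.length === 0, missing };
}
```

With this shape, graduating a grammar is a one-line change to the set, and the strict 6/6 assertion then applies to it even though it still vendors binding.gyp.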
…ncies

Every vendored grammar's index.js does `require("node-gyp-build")` at runtime
to load even a prebuilt .node, so node-gyp-build is runtime-load-critical (and
node-addon-api is needed for the source-build fallback). They were
optionalDependencies, surviving `--omit=optional` only via the required
tree-sitter's transitive edge — correct today but fragile. Promote both to
regular dependencies so the contract is explicit (optionalDependencies is now
empty and removed). Lock the contract with a cli-commands assertion.
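The contract could be locked with a check along these lines. This is only a sketch: `dependencyContractErrors` is a hypothetical helper operating on a parsed package.json object, not the actual cli-commands assertion.

```javascript
// Packages every vendored grammar needs at runtime or for the source fallback.
const RUNTIME_CRITICAL = ['node-gyp-build', 'node-addon-api'];

// Hypothetical helper: return a list of contract violations for a parsed
// package.json object (an empty list means the contract holds).
function dependencyContractErrors(pkg) {
  const errors = [];
  const deps = pkg.dependencies ?? {};
  const optional = pkg.optionalDependencies ?? {};
  for (const name of RUNTIME_CRITICAL) {
    if (!(name in deps)) errors.push(`${name} must be a regular dependency`);
    if (name in optional) errors.push(`${name} must not be an optional dependency`);
  }
  return errors;
}
```

Checking both conditions catches a regression in either direction: a package dropped from `dependencies`, or one moved back under `optionalDependencies`.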
…ng.gyp

c's binding.gyp used an unconditional `cflags_c: ["-std=c11"]`, while
kotlin/swift gate MSVC flags behind an `OS=='win'` condition (/std:c11 /utf-8).
Inert today (no non-ASCII bytes in c's parser.c, and node-gyp ignores cflags_c
on MSVC anyway), but align the three so a future source-build fallback on
Windows behaves consistently.

AGENTS.md still said postinstall "patches tree-sitter-swift, builds
tree-sitter-proto" and that only kotlin/swift are "optional". Update to the
vendored-uniform model: postinstall materializes the vendored grammars and
prefers a committed prebuild (source-build only when none matches); c is
required while dart/proto/swift/kotlin are optional + skippable via
GITNEXUS_SKIP_OPTIONAL_GRAMMARS=1, with non-fatal warnings only on a
toolchain-less host with no matching prebuild.
…lize rollback

If renameSync(partial, dest) failed AND the rollback renameSync(backup, dest)
also failed, the grammar was left unmaterialized (node_modules/<name> missing)
with only a generic "could not materialize" warning — the recoverable backup at
<dest>.materialize-bak was unmentioned. Emit a CRITICAL warning naming the
backup path and the recovery command on that double-failure, and document that
the fail-soft catch removes only the scratch `partial`, never the `backup`
(which may be the sole recoverable copy). Never-throw / exit-0 contract intact.
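The double-failure path can be sketched as below. This is an illustration under assumptions, not the postinstall script itself: `swapIntoPlace` and its return values are hypothetical, `rename` and `warn` are injected so the sketch stays side-effect free, and it assumes the existing `dest` was already moved to `backup` before the swap is attempted.

```javascript
// Hypothetical sketch of the swap with rollback. Never throws.
// `rename` would be fs.renameSync in a real script; `warn` would be console.warn.
function swapIntoPlace({ partial, dest, backup, rename, warn }) {
  try {
    rename(partial, dest);
    return 'materialized';
  } catch (swapError) {
    try {
      // Roll back: restore the previous copy from the backup.
      rename(backup, dest);
      warn(`could not materialize ${dest}; previous copy restored`);
      return 'rolled-back';
    } catch (rollbackError) {
      // Double failure: dest is missing, and the backup may be the only
      // recoverable copy, so name it and leave it untouched.
      warn(`CRITICAL: ${dest} is missing; recoverable backup kept at ${backup}. ` +
        `To recover, move ${backup} back to ${dest}.`);
      return 'unmaterialized';
    }
  }
}
```

Any cleanup that runs afterwards should delete only `partial`, which matches the rule that the fail-soft catch never removes the backup.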
@magyargergo linked an issue Jun 9, 2026 that may be closed by this pull request

The prepack guard shelled out to `npm pack --dry-run --ignore-scripts --json`,
but the `--ignore-scripts` flag is not reliably honored by npm pack's
prepare/prepack lifecycle on the CI npm — so build.js ran, polluted the --json
stdout with `[build] …`, and the guard's JSON.parse threw. That broke every
`npm pack` (packaged-install-smoke on ubuntu+windows) and failed the guard's own
real-repo unit test (the only coverage-job failure). Force script-skipping via
the reliable `npm_config_ignore_scripts` env config (also removes the prepack
re-entry/recursion risk) and parse defensively from the JSON-array start.
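A minimal sketch of the two pieces described above, assuming `npm pack --json` prints a pretty-printed array of objects. The helper names `packEnv` and `parsePackJson` are hypothetical. Note that a naive search for the first `[` would land on a `[build] …` log line, so the sketch looks for a line-leading `[` that is followed by an object.

```javascript
// Hypothetical helper: environment for the inner `npm pack`, forcing
// script-skipping through npm's env-based config instead of the CLI flag.
function packEnv(baseEnv) {
  return { ...baseEnv, npm_config_ignore_scripts: 'true' };
}

// Hypothetical helper: parse `npm pack --json` output even when log lines
// such as "[build] ..." were printed before the JSON array.
function parsePackJson(stdout) {
  // A "[" at the start of a line, followed (after whitespace) by "{".
  // "[build] ..." does not match because "b" follows the bracket.
  const start = stdout.search(/^\[\s*\{/m);
  if (start === -1) {
    throw new Error('no JSON array found in npm pack output');
  }
  return JSON.parse(stdout.slice(start));
}
```

The sketch does not handle noise printed after the array or an empty `[]` result; a real guard would need to decide how to treat both.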
…ot `npm pack`

The npm-pack-based guard timed out in CI: `npm pack`'s prepare/prepack lifecycle
is not skipped by `--ignore-scripts` (flag or env config) on the CI npm, so the
inner pack ran the full build (~20s+) — fine for the slow smoke job, but it blew
past vitest's 30s test timeout in the coverage job (and risked re-entering this
prepack guard).

Replace it with a deterministic, fast (~0.1s) check that needs no subprocess:
since `files: ["vendor"]` OVERRIDES `.npmignore` for the vendored subtree (so
`.npmignore` can never drop vendored source — verified), the ONLY lever that can
exclude source is narrowing the package.json `files` field. The guard now reads
`files` directly: a grammar "ships source" iff `files` includes the vendor
subtree AND the grammar carries a buildable source set on disk. A lean publish
that narrows `files` while a grammar lacks 6/6 prebuilds still fails the gate.
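The `files`-based rule can be sketched as follows. This is a simplified illustration: `filesCoverVendor` and `publishGate` are hypothetical names, the match is plain path-prefix logic rather than npm's full glob semantics, and the on-disk source check and prebuild count are passed in as already-computed values.

```javascript
// Hypothetical helper: does any `files` entry include the grammar's whole
// vendored directory? Simplified prefix matching, not npm's glob rules.
function filesCoverVendor(files, grammar) {
  const target = `vendor/${grammar}`;
  return files.some((entry) => {
    const normalized = entry.replace(/\/+$/, ''); // drop trailing slashes
    return normalized === target || target.startsWith(`${normalized}/`);
  });
}

// Hypothetical gate: a grammar "ships source" iff `files` covers its vendored
// subtree AND a buildable source set exists on disk. Without shipped source,
// all six prebuilds are required.
function publishGate({ files, grammar, hasBuildableSource, prebuildCount }) {
  const shipsSource = filesCoverVendor(files, grammar) && hasBuildableSource;
  return { shipsSource, ok: shipsSource || prebuildCount === 6 };
}
```

For example, narrowing `files` to a grammar's `prebuilds` subdirectory makes `shipsSource` false, so the gate passes only when that grammar has all six prebuilds.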
@magyargergo marked this pull request as ready for review June 9, 2026 15:39
@magyargergo changed the title from "feat(install): toolchain-free tree-sitter via vendored prebuilds" to "[DRAFT] feat(install): toolchain-free tree-sitter via vendored prebuilds" Jun 9, 2026

Adds a weekly (+ dispatchable) workflow that checks each vendored grammar against
its source-of-origin (npm for swift/kotlin, the GitHub default branch for
dart/proto; c is excluded — held at 0.21.4 for ABI safety) and opens a PR
re-vendoring any update that is ABI-COMPATIBLE with the pinned tree-sitter@0.21.1
(LANGUAGE_VERSION 13-14).

ABI awareness is the point: most upstreams have moved to ABI 15 (newer
tree-sitter), so a blind "bump to latest" would open PRs that can't build. The
monitor fetches the candidate source, reads its parser.c LANGUAGE_VERSION, and
only re-vendors 13/14 — incompatible updates are reported (notice + job summary),
never applied. (Confirmed live: dart/proto upstreams are ABI 15 today and are
correctly held; swift/kotlin are current.)

The re-vendor refreshes only the source-build inputs + runtime entrypoints,
preserving the GitNexus-hardened binding.gyp / README / prebuilds; the version
bump then triggers build-tree-sitter-prebuilds.yml, whose ABI-validation is the
final safety net so a subtly-wrong re-vendor can't silently ship. PR creation is
gated on the RELEASE_APP secret (skips with a notice if absent), mirroring the
build aggregate. Unit test locks the ABI gate; the script is import-safe.
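The ABI gate can be illustrated with a short sketch. Generated tree-sitter parsers declare their ABI as a `#define LANGUAGE_VERSION <n>` line in parser.c; the helper names below (`readLanguageVersion`, `isAbiCompatible`) are hypothetical and are not the monitor script's actual exports.

```javascript
// Hypothetical helper: extract the ABI number from the text of a parser.c.
// Returns null when no LANGUAGE_VERSION define is found.
function readLanguageVersion(parserSource) {
  const match = /^#define\s+LANGUAGE_VERSION\s+(\d+)/m.exec(parserSource);
  return match ? Number(match[1]) : null;
}

// Only ABI 13 and 14 can be loaded by the pinned tree-sitter@0.21.1.
function isAbiCompatible(version) {
  return version === 13 || version === 14;
}
```

A candidate whose version cannot be read at all is treated as incompatible here, since `isAbiCompatible(null)` is false. That errs on the side of reporting rather than applying.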
Comment thread .github/scripts/update-vendored-grammars.mjs Fixed

c was excluded from the update monitor, so an upstream c update went unnoticed.
Include it, but as report-only via a `hold`: c is ABI-pinned at 0.21.4
(abhigyanpatwari#1242/abhigyanpatwari#858) and must not auto-bump without a tree-sitter runtime upgrade, so an
available c update is detected + surfaced (notice + job summary) but never
auto-PR'd — even if it were ABI-13/14. `--apply c` refuses defensively. (Live:
upstream c is 0.24.1 / ABI 15 today, so c is doubly held — reported, not applied.)
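How a `hold` interacts with the ABI check could look like the sketch below. The registry shape and the names `planUpdate` and `applyGrammar` are hypothetical and are used only to illustrate the report-only behavior.

```javascript
// Hypothetical decision helper: what should the monitor do with a grammar?
// Returns 'none', 'report', or 'apply'.
function planUpdate({ hasUpdate, abi, hold }) {
  if (!hasUpdate) return 'none';
  // A held grammar is only ever reported, even when the ABI would be fine.
  if (hold) return 'report';
  return abi === 13 || abi === 14 ? 'apply' : 'report';
}

// Hypothetical guard for an explicit apply request: refuse held grammars.
function applyGrammar(name, registry) {
  if (registry[name] && registry[name].hold) {
    throw new Error(`${name} is on hold and cannot be auto-applied`);
  }
  return `would re-vendor ${name}`;
}
```

Checking `hold` before the ABI is what keeps a held grammar report-only even if a future upstream release happened to be ABI 13 or 14.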
CodeQL flagged the GitHub-tarball fetch — it used `bash -c "gh api …/tarball/$ref
> src.tgz && tar xzf src.tgz"`, interpolating the API-derived ref into a shell
command (the shell-command-injection family: "this shell command depends on an
uncontrolled file name"). Replace it with a shell-free path: capture `gh api`'s
binary tarball as a Buffer via execFileSync, write it to a fixed file, and
extract with execFileSync('tar', …). No shell, no injection surface. Verified the
dart/proto fetch + ABI read still work.
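A shell-free fetch along the lines described might look like this. It is a sketch under assumptions: the function names, the `apiPrefix` parameter (standing in for the part of the API path elided above), and the `maxBuffer` value are all hypothetical. The key property is that the ref travels as a single argv element and is never parsed by a shell.

```javascript
const { execFileSync } = require('node:child_process');
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: argv for `gh api <apiPrefix>/tarball/<ref>`.
// The ref is one array element, so shell metacharacters in it are inert.
function ghTarballArgs(apiPrefix, ref) {
  return ['api', `${apiPrefix}/tarball/${ref}`];
}

// Hypothetical helper: download the tarball to a fixed file name and extract it.
function fetchAndExtract(apiPrefix, ref, workDir) {
  const tarballPath = path.join(workDir, 'src.tgz'); // fixed, not derived from ref
  // With no `encoding` option, execFileSync returns the raw stdout as a Buffer.
  const tarball = execFileSync('gh', ghTarballArgs(apiPrefix, ref), {
    maxBuffer: 256 * 1024 * 1024, // assumed ceiling; the default is too small
  });
  fs.writeFileSync(tarballPath, tarball);
  execFileSync('tar', ['-xzf', tarballPath, '-C', workDir]);
  return tarballPath;
}
```

Since neither call goes through `bash -c`, there is no command string for an API-derived value to be interpolated into.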
magyargergo added a commit to magyargergo/GitNexus that referenced this pull request Jun 9, 2026
abhigyanpatwari#2113 added the full vendored-grammar + prebuild + monitor feature with no
changelog entry; document it under Unreleased ahead of the next release.
@magyargergo force-pushed the feat/tree-sitter-prebuilds branch from 3ed551d to 089dad6 Compare June 9, 2026 16:57
@magyargergo merged commit cef63dd into abhigyanpatwari:main Jun 9, 2026
72 of 73 checks passed
@magyargergo deleted the feat/tree-sitter-prebuilds branch June 9, 2026 17:16
Development

Successfully merging this pull request may close these issues.

Request: Package prebuilt tree-sitter binaries in v1.6.7+

2 participants