fix(cli): ship tokenizers linux-arm64-gnu binding in CLI tarball #3960

Merged
saddlepaddle merged 1 commit into main from peppermint-duchess on May 1, 2026

Conversation

@saddlepaddle
Collaborator

@saddlepaddle saddlepaddle commented May 1, 2026

Summary

  • Ships @anush008/tokenizers-linux-arm64-gnu in the linux-arm64 CLI tarball so superset start no longer crashes with Cannot find module '@anush008/tokenizers-linux-arm64-gnu'. #3921 ("fix(cli): ship onnxruntime-node + tokenizers natives in linux-x64 tarball") added the binding for linux-x64 but missed arm64.
  • Pins the 0.6.0 binding directly on @superset/host-service as an optionalDependency (gated to os: linux, cpu: arm64), wires it through build-dist.ts and the host-service bundler externals.
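
A minimal sketch of the failure mode, assuming the JS loader derives its native package name from the host platform and architecture. The function `bindingPackageFor` and the `BINDING_SUFFIX` table are hypothetical and do not reproduce the real `@anush008/tokenizers` loader; only package names mentioned in this PR are used.

```typescript
// Hypothetical platform-keyed lookup; NOT the actual @anush008/tokenizers loader.
// Keys are `${platform}-${arch}`, values are the binding package suffixes named in this PR.
const BINDING_SUFFIX: Record<string, string> = {
  "linux-x64": "linux-x64-gnu",
  "linux-arm64": "linux-arm64-gnu",
};

// Returns the package a loader of this shape would try to require.
// If that package is absent from node_modules, require() fails with
// "Cannot find module ...", which is the crash this PR addresses.
function bindingPackageFor(platform: string, arch: string): string {
  const suffix = BINDING_SUFFIX[`${platform}-${arch}`];
  if (suffix === undefined) {
    throw new Error(`no native binding known for ${platform}-${arch}`);
  }
  return `@anush008/tokenizers-${suffix}`;
}
```

Under these assumptions, `bindingPackageFor("linux", "arm64")` yields the package name that was missing from the tarball.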

Why this shape

Three layers had to line up — the issue's diff alone wouldn't pass CI:

  1. Build pipeline: TARGET_NATIVE_PACKAGES["linux-arm64"] and the host-service Bun.build externals didn't list the binding.
  2. @anush008/tokenizers@0.0.0 (pinned transitively via fastembed → @anush008/tokenizers@^0.0.0) lists no linux-arm64-gnu in optionalDependencies, so bun install couldn't resolve it from the parent.
  3. npm registry: @anush008/tokenizers-linux-arm64-gnu@0.0.0 was never published; only 0.5.0 and 0.6.0 exist.
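
The "three layers" can be sketched as a consistency check. This is an illustration only: the `BindingLayers` shape and `missingLayers` helper are hypothetical, and the real `build-dist.ts`, `build.ts`, and `package.json` contain more entries than shown here.

```typescript
// Hypothetical model of the three places the binding has to appear.
interface BindingLayers {
  nativePackages: string[]; // e.g. TARGET_NATIVE_PACKAGES["linux-arm64"]
  bundlerExternals: string[]; // host-service bundler external list
  optionalDependencies: Record<string, string>; // host-service package.json
}

const BINDING = "@anush008/tokenizers-linux-arm64-gnu";

// Reports which layers do not mention the package yet.
function missingLayers(pkg: string, layers: BindingLayers): string[] {
  const missing: string[] = [];
  if (!layers.nativePackages.includes(pkg)) missing.push("TARGET_NATIVE_PACKAGES");
  if (!layers.bundlerExternals.includes(pkg)) missing.push("bundler externals");
  if (!(pkg in layers.optionalDependencies)) missing.push("optionalDependencies");
  return missing;
}

// State after this PR (lists trimmed to the entries named in the PR).
const afterThisPr: BindingLayers = {
  nativePackages: [
    "@libsql/linux-arm64-gnu",
    "@parcel/watcher-linux-arm64-glibc",
    BINDING,
  ],
  bundlerExternals: [BINDING],
  optionalDependencies: { [BINDING]: "0.6.0" },
};

console.log(missingLayers(BINDING, afterThisPr));
```

With all three layers populated the helper returns an empty list; dropping any one of them reproduces a partial fix of the kind described above.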

Adding the 0.6.0 binding as a direct optional dep on host-service sidesteps (2) and (3): bun installs it on linux-arm64 CI runners and skips it everywhere else via os/cpu filtering. The 0.6.0 native binary is ABI-compatible with the 0.0.0 JS loader's require('@anush008/tokenizers-linux-arm64-gnu') path for the surface fastembed actually uses.
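
The os/cpu filtering mentioned above can be modelled roughly as follows. This is a simplified sketch of the documented `os` / `cpu` semantics for `package.json` (allow-list entries plus `!`-prefixed deny entries); `shouldInstall` and `PlatformConstraints` are hypothetical names, and Bun's actual resolver is not reproduced here.

```typescript
// Simplified model of package.json "os" / "cpu" constraints.
interface PlatformConstraints {
  os?: string[];
  cpu?: string[];
}

// A value matches when it is not denied ("!value") and, if any allow
// entries exist, it is one of them. An absent or empty list matches all.
function matchesList(list: string[] | undefined, value: string): boolean {
  if (list === undefined || list.length === 0) return true;
  const denied = list.filter((e) => e.startsWith("!")).map((e) => e.slice(1));
  if (denied.includes(value)) return false;
  const allowed = list.filter((e) => !e.startsWith("!"));
  return allowed.length === 0 || allowed.includes(value);
}

function shouldInstall(c: PlatformConstraints, platform: string, arch: string): boolean {
  return matchesList(c.os, platform) && matchesList(c.cpu, arch);
}

// Constraints described in this PR for the arm64 binding.
const arm64Binding: PlatformConstraints = { os: ["linux"], cpu: ["arm64"] };

console.log(shouldInstall(arm64Binding, "linux", "arm64"));
console.log(shouldInstall(arm64Binding, "darwin", "arm64"));
```

Under this model the optional dependency is installed only on linux/arm64 and is skipped on darwin and linux-x64, which is why those targets are unaffected.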

Test plan

Verified end-to-end in a linux/arm64 docker container (Apple Silicon → native arm64):

  • ABI compat: 0.6.0 binding loaded behind 0.0.0 JS loader resolves Tokenizer.fromString → setTruncation → setPadding → new AddedToken(...) → addAddedTokens → async encode → getIds/getTokens (the fastembed surface).
  • bun install --frozen --ignore-scripts resolves and installs @anush008/tokenizers-linux-arm64-gnu@0.6.0.
  • bun run build:dist --target=linux-arm64 succeeds; tarball contains both lib/node_modules/@anush008/tokenizers/index.js and lib/node_modules/@anush008/tokenizers-linux-arm64-gnu/tokenizers.linux-arm64-gnu.node.
  • NODE_PATH=$DIST/lib/node_modules $DIST/lib/node -e 'require("@anush008/tokenizers")' loads cleanly.
  • bin/superset --version → 0.2.1. bin/superset-host boots past the previous MODULE_NOT_FOUND and only fails on runtime env-var validation (ORGANIZATION_ID, HOST_DB_PATH), which is unrelated.
  • bun run lint clean; bun run typecheck clean for @superset/host-service + @superset/cli.
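
The tarball check in the test plan could be automated along these lines. This is a sketch only: `missingTarballEntries` is a hypothetical helper, and it assumes the entry list has already been obtained elsewhere (for example from a `tar -tzf` listing); the two required paths are the ones quoted in the test plan.

```typescript
// Paths the test plan expects inside the linux-arm64 tarball.
const REQUIRED_ENTRIES: string[] = [
  "lib/node_modules/@anush008/tokenizers/index.js",
  "lib/node_modules/@anush008/tokenizers-linux-arm64-gnu/tokenizers.linux-arm64-gnu.node",
];

// Returns the required paths that no tarball entry ends with.
// Matching by suffix tolerates a leading "./" or a top-level directory prefix.
function missingTarballEntries(
  entries: string[],
  required: string[] = REQUIRED_ENTRIES,
): string[] {
  return required.filter(
    (req) => !entries.some((e) => e === req || e.endsWith(`/${req}`)),
  );
}

// Example listing with a hypothetical top-level directory name.
const listing = [
  "superset/lib/node_modules/@anush008/tokenizers/index.js",
  "superset/lib/node_modules/@anush008/tokenizers-linux-arm64-gnu/tokenizers.linux-arm64-gnu.node",
];

console.log(missingTarballEntries(listing));
```

A non-empty result would indicate that the JS loader or the native `.node` file did not make it into the archive.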

Notes

  • darwin and linux-x64 unaffected: optionalDependencies skips on os/cpu mismatch and the bundler external list is platform-agnostic.
  • 0.6.0 is the most recent published binding; the maintainer-side fix proposed in the issue (republish 0.0.0 with arm64 in opt-deps) is no longer needed.

Closes #3951.


Summary by cubic

Ships @anush008/tokenizers-linux-arm64-gnu in the linux-arm64 CLI tarball to fix the MODULE_NOT_FOUND crash on superset start. Pins the 0.6.0 binding in @superset/host-service and updates the build to bundle it on arm64 Linux.

  • Bug Fixes
    • Add @anush008/tokenizers-linux-arm64-gnu@0.6.0 as an optionalDependency in @superset/host-service (os: linux, cpu: arm64).
    • Include the binding in TARGET_NATIVE_PACKAGES["linux-arm64"] and host-service bundler externals.
    • Tarball now ships the native .node; darwin and linux-x64 remain unaffected.

Written for commit 6dfde38. Summary will update on new commits.

Summary by CodeRabbit

  • Chores
    • Extended support for Linux ARM64 systems by including required tokenizer packages and updating build configurations.

…rball

The linux-arm64 CLI tarball crashed on `superset start` with
`Cannot find module '@anush008/tokenizers-linux-arm64-gnu'`. #3921 added
the binding for linux-x64 but missed arm64. Three layers had to line up:

1. The build script needed the binding in TARGET_NATIVE_PACKAGES and the
   host-service bundler needed it as an external.
2. `@anush008/tokenizers@0.0.0` (pinned via fastembed → ^0.0.0) lists no
   linux-arm64 variant in optionalDependencies, so even with (1) bun
   couldn't resolve it.
3. `@anush008/tokenizers-linux-arm64-gnu@0.0.0` doesn't exist on npm
   (only 0.5.0 and 0.6.0 are published).

Pin the 0.6.0 binding directly as an optionalDependency on host-service
so bun installs it on linux-arm64 and skips it elsewhere via os/cpu
filtering. The 0.6.0 native binary is ABI-compatible with the 0.0.0 JS
loader for the surface fastembed uses (Tokenizer.fromString,
setTruncation, setPadding, AddedToken, async encode, getIds, getTokens).

Verified end-to-end in linux/arm64 docker: `bun install --frozen` picks
up the binding, `build:dist --target=linux-arm64` ships it in
lib/node_modules, the bundled Node loads `require("@anush008/tokenizers")`
cleanly, and `superset-host` boots past the previous MODULE_NOT_FOUND.

Closes #3951.
@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between e65c9f5 and 6dfde38.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • packages/cli/scripts/build-dist.ts
  • packages/host-service/build.ts
  • packages/host-service/package.json

📝 Walkthrough

Adds support for the Linux ARM64 native tokenizer package across the CLI distribution build pipeline and host-service dependencies, ensuring the @anush008/tokenizers-linux-arm64-gnu binary is included in the ARM64 distribution and not bundled into the host-service output.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Linux ARM64 Native Tokenizer Package: packages/cli/scripts/build-dist.ts, packages/host-service/build.ts, packages/host-service/package.json | Adds @anush008/tokenizers-linux-arm64-gnu to the linux-arm64 native packages list, bundler externals, and optional dependencies to ensure the tokenizer binary is available on ARM64 hosts. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A tokenizer hops onto ARM's strong feet,
From aarch64 hosts, the journey's now complete,
No more MODULE_NOT_FOUND in the night,
Linux ARM64 now boots just right,

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: shipping the Linux ARM64 tokenizers binding in the CLI tarball. |
| Description check | ✅ Passed | The description is comprehensive, covering summary, rationale, testing, and notes; all required template sections are present and filled out. |
| Linked Issues check | ✅ Passed | The PR fulfills all coding objectives from #3951: it updates the build pipeline to include the arm64 binding in both build-dist.ts and build.ts, pins the 0.6.0 binding as optionalDependency, and includes the native in the tarball. |
| Out of Scope Changes check | ✅ Passed | All three code changes are directly scoped to shipping the linux-arm64 tokenizers binding; no unrelated modifications are present. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@greptile-apps
Contributor

greptile-apps Bot commented May 1, 2026

Greptile Summary

This PR fixes a MODULE_NOT_FOUND crash on linux-arm64 by shipping the @anush008/tokenizers-linux-arm64-gnu native binding in the CLI tarball, mirroring the fix previously applied for linux-x64. Three coordinated changes align the build pipeline: the binding is added to the host-service optionalDependencies (gated by os/cpu), the Bun bundler external list, and the TARGET_NATIVE_PACKAGES copy map for the linux-arm64 target.

Confidence Score: 5/5

Safe to merge — the three-layer fix is correct and well-tested; only minor documentation and lockfile noise remain.

All changes are additive and symmetric with the existing linux-x64 / darwin patterns. The only findings are P2: an unrelated SDK version bump in the lockfile and a missing inline comment explaining the 0.6.0 ↔ 0.0.0 cross-version pairing. Neither blocks correctness.

The unrelated @superset/sdk version bump in bun.lock is worth a second look to confirm it is intentional.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/cli/scripts/build-dist.ts | Adds @anush008/tokenizers-linux-arm64-gnu to TARGET_NATIVE_PACKAGES["linux-arm64"] so the binding is copied into the linux-arm64 tarball; mirrors the existing x64 and darwin entries exactly. |
| packages/host-service/build.ts | Adds @anush008/tokenizers-linux-arm64-gnu to the Bun bundler external list so it is not inlined and remains resolvable at runtime; symmetric with the other platform-specific tokenizer bindings. |
| packages/host-service/package.json | Adds @anush008/tokenizers-linux-arm64-gnu@0.6.0 as an optionalDependency; note the version is 0.6.0 rather than the 0.0.0 used by the other platform bindings, due to a publishing gap. The cross-version pairing is explained in the PR but undocumented in the file. |
| bun.lock | Lock file updated to reflect the new optional dep and its os: linux, cpu: arm64 platform constraints; also contains an unrelated @superset/sdk version bump from 0.0.1-alpha.0 to 0.0.1-alpha.7. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["bun run build:dist --target=linux-arm64"] --> B["buildHostService()\nbuild.ts externals list"]
    B --> C["@anush008/tokenizers-linux-arm64-gnu\nmarked external — not inlined"]
    A --> D["copyNativePackages()\nTARGET_NATIVE_PACKAGES[linux-arm64]"]
    D --> E["@libsql/linux-arm64-gnu"]
    D --> F["@parcel/watcher-linux-arm64-glibc"]
    D --> G["@anush008/tokenizers-linux-arm64-gnu\n(0.6.0 — NEW)"]
    G --> H["lib/node_modules/@anush008/\ntokenizers-linux-arm64-gnu/\ntokenizers.linux-arm64-gnu.node"]
    A --> I["superset-linux-arm64.tar.gz"]
    H --> I
    J["@anush008/tokenizers@0.0.0\n(JS loader)"] -->|"require('@anush008/tokenizers-linux-arm64-gnu')"| H

Reviews (1): Last reviewed commit: "fix(cli): ship @anush008/tokenizers-linu..." | Re-trigger Greptile

Comment thread bun.lock
"packages/sdk": {
"name": "@superset/sdk",
- "version": "0.0.1-alpha.0",
+ "version": "0.0.1-alpha.7",
Contributor

P2 Unrelated SDK version bump in lockfile

The @superset/sdk workspace version changed from 0.0.1-alpha.0 to 0.0.1-alpha.7 in this lockfile update. This change is unrelated to the tokenizers arm64 fix and appears to be lockfile drift from running bun install after an out-of-band SDK package.json bump. If this wasn't intentional, consider reverting this hunk to keep the diff focused — otherwise, it could obscure version history for the SDK package.
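
One way to surface this kind of drift before review is to compare workspace versions between two lockfile snapshots. The sketch below is hypothetical: `changedWorkspaceVersions` and the `WorkspaceVersions` map are illustrative names, and how the versions are extracted from `bun.lock` is deliberately left out.

```typescript
// Workspace package name -> version, as read from a lockfile snapshot.
type WorkspaceVersions = Record<string, string>;

interface VersionChange {
  name: string;
  from: string;
  to: string;
}

// Lists workspaces present in both snapshots whose version differs.
function changedWorkspaceVersions(
  before: WorkspaceVersions,
  after: WorkspaceVersions,
): VersionChange[] {
  const changes: VersionChange[] = [];
  for (const name of Object.keys(before)) {
    const to = after[name];
    if (to !== undefined && to !== before[name]) {
      changes.push({ name, from: before[name], to });
    }
  }
  return changes;
}

// The bump flagged in this review comment.
const drift = changedWorkspaceVersions(
  { "@superset/sdk": "0.0.1-alpha.0" },
  { "@superset/sdk": "0.0.1-alpha.7" },
);
console.log(drift);
```

Any entry in the result that the PR description does not account for is a candidate for reverting or for a note explaining why it is there.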

Comment on lines +87 to 89
"optionalDependencies": {
"@anush008/tokenizers-linux-arm64-gnu": "0.6.0"
}
Contributor

P2 Cross-version native binding pairing

The arm64 binding is pinned at 0.6.0, while the other platform bindings (linux-x64-gnu, darwin-universal, win32-x64-msvc) all resolve to 0.0.0. The @anush008/tokenizers@0.0.0 JS loader calls require('@anush008/tokenizers-linux-arm64-gnu') without version-gating, so the 0.6.0 native binary must expose a compatible ABI — which the author verified for the current fastembed surface. If @anush008/tokenizers is ever updated and the JS-to-native boundary changes, arm64 could break silently while other platforms continue to work. Adding a comment here documenting why 0.6.0 is used instead of 0.0.0 would help future maintainers.
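
Because `package.json` is plain JSON and cannot carry comments, one option is to record the pairing in code and check it in CI. The following is a sketch under that assumption: `VERIFIED_PAIRINGS` and `isVerifiedPairing` are hypothetical, and treating equal versions as compatible is itself an assumption rather than something the PR states.

```typescript
// Records why a loader/binding version mismatch is accepted.
interface Pairing {
  loader: string; // @anush008/tokenizers version (JS loader)
  binding: string; // native binding package version
  reason: string;
}

const VERIFIED_PAIRINGS: Pairing[] = [
  {
    loader: "0.0.0",
    binding: "0.6.0",
    reason:
      "linux-arm64-gnu@0.0.0 was never published; 0.6.0 was checked against the fastembed surface",
  },
];

// Assumption: identical versions are compatible; any other combination
// must be listed explicitly in VERIFIED_PAIRINGS.
function isVerifiedPairing(loader: string, binding: string): boolean {
  if (loader === binding) return true;
  return VERIFIED_PAIRINGS.some(
    (p) => p.loader === loader && p.binding === binding,
  );
}

console.log(isVerifiedPairing("0.0.0", "0.6.0"));
console.log(isVerifiedPairing("0.1.0", "0.6.0"));
```

A check like this would fail as soon as the loader version moves without the arm64 pairing being re-verified, which addresses the silent-break scenario described in the comment.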

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files

@github-actions
Contributor

github-actions Bot commented May 1, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

@saddlepaddle saddlepaddle merged commit 8c36fb8 into main May 1, 2026
14 of 15 checks passed
@saddlepaddle saddlepaddle mentioned this pull request May 1, 2026
3 tasks
saddlepaddle added a commit that referenced this pull request May 1, 2026
Patch release for the linux-arm64 startup crash (#3960 since v0.2.1).
Push cli-v0.2.2 after this lands to fire the release pipeline.
@Kitenite Kitenite deleted the peppermint-duchess branch May 6, 2026 04:51
MocA-Love added a commit to MocA-Love/superset that referenced this pull request May 29, 2026
Non-applicable to current fork structure: superset-sh#3960 and superset-sh#4068 require linux-arm64/full CLI dist targets that this fork does not ship; superset-sh#4678 targets a relay deploy script intentionally absent from the fork; superset-sh#4694 requires DuckDB native packaging but the fork has no DuckDB runtime dependency; superset-sh#4822 targets removed v1 import modal paths; superset-sh#4826 assumes upstream release-cli.yml while this fork uses build-cli.yml with draft release semantics.

Development

Successfully merging this pull request may close these issues.

bug: linux-arm64 tarball missing @anush008/tokenizers-linux-arm64-gnu native binding

1 participant