fix(cli): ship tokenizers linux-arm64-gnu binding in CLI tarball#3960
Conversation
…rball The linux-arm64 CLI tarball crashed on `superset start` with `Cannot find module '@anush008/tokenizers-linux-arm64-gnu'`. #3921 added the binding for linux-x64 but missed arm64. Three layers had to line up: 1. The build script needed the binding in TARGET_NATIVE_PACKAGES and the host-service bundler needed it as an external. 2. `@anush008/tokenizers@0.0.0` (pinned via fastembed → ^0.0.0) lists no linux-arm64 variant in optionalDependencies, so even with (1) bun couldn't resolve it. 3. `@anush008/tokenizers-linux-arm64-gnu@0.0.0` doesn't exist on npm (only 0.5.0 and 0.6.0 are published). Pin the 0.6.0 binding directly as an optionalDependency on host-service so bun installs it on linux-arm64 and skips it elsewhere via os/cpu filtering. The 0.6.0 native binary is ABI-compatible with the 0.0.0 JS loader for the surface fastembed uses (Tokenizer.fromString, setTruncation, setPadding, AddedToken, async encode, getIds, getTokens). Verified end-to-end in linux/arm64 docker: `bun install --frozen` picks up the binding, `build:dist --target=linux-arm64` ships it in lib/node_modules, the bundled Node loads `require("@anush008/tokenizers")` cleanly, and `superset-host` boots past the previous MODULE_NOT_FOUND. Closes #3951.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds support for the Linux ARM64 native tokenizer package across the CLI distribution build pipeline and host-service dependencies, ensuring the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 6/8 reviews remaining, refill in 12 minutes and 48 seconds.Comment |
Greptile SummaryThis PR fixes a Confidence Score: 5/5Safe to merge — the three-layer fix is correct and well-tested; only minor documentation and lockfile noise remain. All changes are additive and symmetric with the existing linux-x64 / darwin patterns. The only findings are P2: an unrelated SDK version bump in the lockfile and a missing inline comment explaining the 0.6.0 ↔ 0.0.0 cross-version pairing. Neither blocks correctness. The unrelated
|
| Filename | Overview |
|---|---|
| packages/cli/scripts/build-dist.ts | Adds @anush008/tokenizers-linux-arm64-gnu to TARGET_NATIVE_PACKAGES["linux-arm64"] so the binding is copied into the linux-arm64 tarball; mirrors the existing x64 and darwin entries exactly. |
| packages/host-service/build.ts | Adds @anush008/tokenizers-linux-arm64-gnu to the Bun bundler external list so it is not inlined and remains resolvable at runtime; symmetric with the other platform-specific tokenizer bindings. |
| packages/host-service/package.json | Adds @anush008/tokenizers-linux-arm64-gnu@0.6.0 as an optionalDependency; note the version is 0.6.0 rather than 0.0.0 used by the other platform bindings due to a publishing gap — cross-version pairing is explained in the PR but undocumented in the file. |
| bun.lock | Lock file updated to reflect the new optional dep and its os: linux, cpu: arm64 platform constraints; also contains an unrelated @superset/sdk version bump from 0.0.1-alpha.0 to 0.0.1-alpha.7. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["bun run build:dist --target=linux-arm64"] --> B["buildHostService()\nbuild.ts externals list"]
B --> C["@anush008/tokenizers-linux-arm64-gnu\nmarked external — not inlined"]
A --> D["copyNativePackages()\nTARGET_NATIVE_PACKAGES[linux-arm64]"]
D --> E["@libsql/linux-arm64-gnu"]
D --> F["@parcel/watcher-linux-arm64-glibc"]
D --> G["@anush008/tokenizers-linux-arm64-gnu\n(0.6.0 — NEW)"]
G --> H["lib/node_modules/@anush008/\ntokenizers-linux-arm64-gnu/\ntokenizers.linux-arm64-gnu.node"]
A --> I["superset-linux-arm64.tar.gz"]
H --> I
J["@anush008/tokenizers@0.0.0\n(JS loader)"] -->|"require('@anush008/tokenizers-linux-arm64-gnu')"| H
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
bun.lock:902
**Unrelated SDK version bump in lockfile**
The `@superset/sdk` workspace version changed from `0.0.1-alpha.0` to `0.0.1-alpha.7` in this lockfile update. This change is unrelated to the tokenizers arm64 fix and appears to be lockfile drift from running `bun install` after an out-of-band SDK `package.json` bump. If this wasn't intentional, consider reverting this hunk to keep the diff focused — otherwise, it could obscure version history for the SDK package.
### Issue 2 of 2
packages/host-service/package.json:87-89
**Cross-version native binding pairing**
The arm64 binding is pinned at `0.6.0`, while the other platform bindings (`linux-x64-gnu`, `darwin-universal`, `win32-x64-msvc`) all resolve to `0.0.0`. The `@anush008/tokenizers@0.0.0` JS loader calls `require('@anush008/tokenizers-linux-arm64-gnu')` without version-gating, so the `0.6.0` native binary must expose a compatible ABI — which the author verified for the current fastembed surface. If `@anush008/tokenizers` is ever updated and the JS-to-native boundary changes, arm64 could break silently while other platforms continue to work. Adding a comment here documenting why `0.6.0` is used instead of `0.0.0` would help future maintainers.
Reviews (1): Last reviewed commit: "fix(cli): ship @anush008/tokenizers-linu..." | Re-trigger Greptile
| "packages/sdk": { | ||
| "name": "@superset/sdk", | ||
| "version": "0.0.1-alpha.0", | ||
| "version": "0.0.1-alpha.7", |
There was a problem hiding this comment.
Unrelated SDK version bump in lockfile
The @superset/sdk workspace version changed from 0.0.1-alpha.0 to 0.0.1-alpha.7 in this lockfile update. This change is unrelated to the tokenizers arm64 fix and appears to be lockfile drift from running bun install after an out-of-band SDK package.json bump. If this wasn't intentional, consider reverting this hunk to keep the diff focused — otherwise, it could obscure version history for the SDK package.
Prompt To Fix With AI
This is a comment left during a code review.
Path: bun.lock
Line: 902
Comment:
**Unrelated SDK version bump in lockfile**
The `@superset/sdk` workspace version changed from `0.0.1-alpha.0` to `0.0.1-alpha.7` in this lockfile update. This change is unrelated to the tokenizers arm64 fix and appears to be lockfile drift from running `bun install` after an out-of-band SDK `package.json` bump. If this wasn't intentional, consider reverting this hunk to keep the diff focused — otherwise, it could obscure version history for the SDK package.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| "optionalDependencies": { | ||
| "@anush008/tokenizers-linux-arm64-gnu": "0.6.0" | ||
| } |
There was a problem hiding this comment.
Cross-version native binding pairing
The arm64 binding is pinned at 0.6.0, while the other platform bindings (linux-x64-gnu, darwin-universal, win32-x64-msvc) all resolve to 0.0.0. The @anush008/tokenizers@0.0.0 JS loader calls require('@anush008/tokenizers-linux-arm64-gnu') without version-gating, so the 0.6.0 native binary must expose a compatible ABI — which the author verified for the current fastembed surface. If @anush008/tokenizers is ever updated and the JS-to-native boundary changes, arm64 could break silently while other platforms continue to work. Adding a comment here documenting why 0.6.0 is used instead of 0.0.0 would help future maintainers.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/package.json
Line: 87-89
Comment:
**Cross-version native binding pairing**
The arm64 binding is pinned at `0.6.0`, while the other platform bindings (`linux-x64-gnu`, `darwin-universal`, `win32-x64-msvc`) all resolve to `0.0.0`. The `@anush008/tokenizers@0.0.0` JS loader calls `require('@anush008/tokenizers-linux-arm64-gnu')` without version-gating, so the `0.6.0` native binary must expose a compatible ABI — which the author verified for the current fastembed surface. If `@anush008/tokenizers` is ever updated and the JS-to-native boundary changes, arm64 could break silently while other platforms continue to work. Adding a comment here documenting why `0.6.0` is used instead of `0.0.0` would help future maintainers.
How can I resolve this? If you propose a fix, please make it concise.
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
Patch release for the linux-arm64 startup crash (#3960 since v0.2.1). Push cli-v0.2.2 after this lands to fire the release pipeline.
Non-applicable to current fork structure: superset-sh#3960 and superset-sh#4068 require linux-arm64/full CLI dist targets that this fork does not ship; superset-sh#4678 targets a relay deploy script intentionally absent from the fork; superset-sh#4694 requires DuckDB native packaging but the fork has no DuckDB runtime dependency; superset-sh#4822 targets removed v1 import modal paths; superset-sh#4826 assumes upstream release-cli.yml while this fork uses build-cli.yml with draft release semantics.
Summary
@anush008/tokenizers-linux-arm64-gnuin the linux-arm64 CLI tarball sosuperset startno longer crashes withCannot find module '@anush008/tokenizers-linux-arm64-gnu'. fix(cli): ship onnxruntime-node + tokenizers natives in linux-x64 tarball #3921 added the binding for linux-x64 but missed arm64.@superset/host-serviceas anoptionalDependency(gated toos: linux, cpu: arm64), wires it throughbuild-dist.tsand the host-service bundler externals.Why this shape
Three layers had to line up — the issue's diff alone wouldn't pass CI:
TARGET_NATIVE_PACKAGES["linux-arm64"]and the host-serviceBun.buildexternals didn't list the binding.@anush008/tokenizers@0.0.0(pinned transitively viafastembed → @anush008/tokenizers@^0.0.0) lists nolinux-arm64-gnuinoptionalDependencies, sobun installcouldn't resolve it from the parent.@anush008/tokenizers-linux-arm64-gnu@0.0.0was never published; only0.5.0and0.6.0exist.Adding the
0.6.0binding as a direct optional dep on host-service sidesteps (2) and (3): bun installs it on linux-arm64 CI runners and skips it everywhere else viaos/cpufiltering. The0.6.0native binary is ABI-compatible with the0.0.0JS loader'srequire('@anush008/tokenizers-linux-arm64-gnu')path for the surface fastembed actually uses.Test plan
Verified end-to-end in a
linux/arm64docker container (Apple Silicon → native arm64):0.6.0binding loaded behind0.0.0JS loader resolvesTokenizer.fromString→setTruncation→setPadding→new AddedToken(...)→addAddedTokens→ asyncencode→getIds/getTokens(the fastembed surface).bun install --frozen --ignore-scriptsresolves and installs@anush008/tokenizers-linux-arm64-gnu@0.6.0.bun run build:dist --target=linux-arm64succeeds; tarball contains bothlib/node_modules/@anush008/tokenizers/index.jsandlib/node_modules/@anush008/tokenizers-linux-arm64-gnu/tokenizers.linux-arm64-gnu.node.NODE_PATH=$DIST/lib/node_modules $DIST/lib/node -e 'require("@anush008/tokenizers")'loads cleanly.bin/superset --version→0.2.1.bin/superset-hostboots past the previousMODULE_NOT_FOUNDand only fails on runtime env-var validation (ORGANIZATION_ID,HOST_DB_PATH) — unrelated.bun run lintclean;bun run typecheckclean for@superset/host-service+@superset/cli.Notes
optionalDependenciesskips on os/cpu mismatch and the bundlerexternallist is platform-agnostic.0.6.0is the most recent published binding; the maintainer-side fix proposed in the issue (republish0.0.0with arm64 in opt-deps) is no longer needed.Closes #3951.
Summary by cubic
Ships
@anush008/tokenizers-linux-arm64-gnuin the linux-arm64 CLI tarball to fix theMODULE_NOT_FOUNDcrash onsuperset start. Pins the0.6.0binding in@superset/host-serviceand updates the build to bundle it on arm64 Linux.@anush008/tokenizers-linux-arm64-gnu@0.6.0as anoptionalDependencyin@superset/host-service(os: linux, cpu: arm64).TARGET_NATIVE_PACKAGES["linux-arm64"]and host-service bundler externals..node; darwin and linux-x64 remain unaffected.Written for commit 6dfde38. Summary will update on new commits.
Summary by CodeRabbit