ICU: per-item zstd decompression hook for compressed libicudata#31200
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds ICU zstd decompression support for Linux, a comprehensive ECMA-402 Intl test suite with locale fixtures, and updates WebKit build configuration. The decompression hook memoizes buffers keyed by input pointer and conditionally uses weak zstd dictionary symbols. The Intl test suite covers DisplayNames, NumberFormat, DateTimeFormat, Collator, Segmenter, and related APIs with snapshot testing and an exhaustive locale sweep verifying outputs across 600+ locales. WebKit configuration is updated to use preview tags with distinct cache key derivation. ChangesICU Decompression Hook
Intl Behavior Test Suite
WebKit Build Configuration Update
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/web/intl/icu-locales.txt`:
- Line 434: Remove the stray non-locale entry "res-index" from the icu-locales
fixture file so it no longer gets filtered out by Intl.getCanonicalLocales;
alternatively, replace "res-index" with a valid BCP47 locale tag if it was
intended to be a test case, and re-generate the fixture so
test/js/web/intl/intl.test.ts will exercise actual Intl behavior instead of
discarding this entry.
- Around line 291-292: The fixture includes deprecated BCP‑47 tags (e.g.
"in"/"in-ID", "iw"/"iw-IL", "mo", "no*", "sh*", "tl*") but the test harness
(Intl.getCanonicalLocales in test/js/web/intl/intl.test.ts) doesn't assert
alias→canonical mappings; either (A) add explicit assertions in intl.test.ts
that for each deprecated tag Intl.getCanonicalLocales(tag) returns the expected
modern tag(s) (build a small map of deprecated→canonical and assert equality) or
(B) remove/trim deprecated entries from test/js/web/intl/icu-locales.txt so the
fixture only contains canonical/modern tags—pick one approach and apply
consistently across the listed deprecated symbols.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f1e23297-222c-4b5b-9798-99d7df2bd552
⛔ Files ignored due to path filters (1)
test/js/web/intl/__snapshots__/intl.test.ts.snapis excluded by!**/*.snap
📒 Files selected for processing (3)
src/jsc/bindings/bun_icu_decompress.cpptest/js/web/intl/icu-locales.txttest/js/web/intl/intl.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/build/deps/webkit.ts`:
- Around line 6-11: Add a build-time guard that rejects WEBKIT_VERSION values
beginning with "autobuild-preview-" unless an explicit override flag is present:
detect the exported constant WEBKIT_VERSION at startup (e.g., in the configure
or CI entrypoint) and if it matches /^autobuild-preview-/ exit non‑zero with a
clear error unless an environment variable or CLI flag like ALLOW_PREVIEW_WEBKIT
/ --allow-preview-webkit is set; ensure the error message references
WEBKIT_VERSION and instructs how to set the override for temporary testing so
CI/local builds fail fast on accidental preview pins.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 85628d3e-9043-433a-ae81-6d57b4024b61
📒 Files selected for processing (1)
scripts/build/deps/webkit.ts
… preview WEBKIT_VERSION, drop res-index from locale fixture
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/jsc/bindings/bun_icu_decompress.cpp (1)
87-93:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle
ZSTD_createDCtxallocation failure.
ZSTD_createDCtx()returnsnullptron memory exhaustion. Withm_dctxnull, subsequent calls toZSTD_decompressDCtxorZSTD_decompress_usingDDictwill dereference null, causing UB/crash.Add a null check in
decompress()to fail gracefully.Proposed fix
const void* decompress(const void* p, int32_t* length) { + if (!m_dctx) [[unlikely]] + return p; + Locker locker { m_lock };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/jsc/bindings/bun_icu_decompress.cpp` around lines 87 - 93, The constructor may leave m_dctx null if ZSTD_createDCtx() fails; update ICUDecompressor::decompress() to check if m_dctx is nullptr before calling ZSTD_decompressDCtx or ZSTD_decompress_usingDDict and fail gracefully (e.g., return an error/false or set an error code) instead of dereferencing it; reference the ICUDecompressor class, the m_dctx member, the decompress() method, and the ZSTD_createDCtx / ZSTD_decompressDCtx / ZSTD_decompress_usingDDict calls so you add an early null check and an appropriate error path (and ensure any allocated m_ddict is still cleaned up as before).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/build/deps/webkit.ts`:
- Around line 83-91: The preview-guard uses the unnormalized value but the URL
uses a normalized tag, so inputs like "preview-pr-..." bypass the check; compute
the normalized tag first (e.g., const tag = version.startsWith("autobuild-") ?
version : `autobuild-${version}`) and then check
tag.startsWith("autobuild-preview-") against cfg.allowPreviewWebkit, throwing
the BuildError if needed, and use that same tag to build the download URL
(references: version, tag, cfg.allowPreviewWebkit, BuildError).
---
Outside diff comments:
In `@src/jsc/bindings/bun_icu_decompress.cpp`:
- Around line 87-93: The constructor may leave m_dctx null if ZSTD_createDCtx()
fails; update ICUDecompressor::decompress() to check if m_dctx is nullptr before
calling ZSTD_decompressDCtx or ZSTD_decompress_usingDDict and fail gracefully
(e.g., return an error/false or set an error code) instead of dereferencing it;
reference the ICUDecompressor class, the m_dctx member, the decompress() method,
and the ZSTD_createDCtx / ZSTD_decompressDCtx / ZSTD_decompress_usingDDict calls
so you add an early null check and an appropriate error path (and ensure any
allocated m_ddict is still cleaned up as before).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b5e251ac-2724-4221-9912-664ed762dd05
📒 Files selected for processing (6)
scripts/build.tsscripts/build/config.tsscripts/build/deps/webkit.tssrc/jsc/bindings/bun_icu_decompress.cpptest/js/web/intl/icu-locales.txttest/js/web/intl/intl.test.ts
💤 Files with no reviewable changes (1)
- test/js/web/intl/icu-locales.txt
| if (version.startsWith("autobuild-preview-") && !cfg.allowPreviewWebkit) { | ||
| throw new BuildError( | ||
| `WEBKIT_VERSION is pinned to a PR preview tag (${version}). ` + | ||
| `Preview prebuilts are deleted when the PR closes. ` + | ||
| `Pass --allow-preview-webkit=on to build anyway, or revert to a stable hash before merging.` | ||
| ); | ||
| } | ||
| const tag = version.startsWith("autobuild-") ? version : `autobuild-${version}`; | ||
| return `https://github.com/oven-sh/WebKit/releases/download/${tag}/${name}.tar.gz`; |
There was a problem hiding this comment.
Normalize preview detection before enforcing the guard.
At Line 83, the guard checks raw cfg.webkitVersion, but at Line 90 the value is normalized into tag. This allows bypassing the guard with inputs like --webkit-version=preview-pr-237-..., which still become preview tags after normalization.
Suggested fix
function prebuiltUrl(cfg: Config): string {
const os = cfg.windows ? "windows" : cfg.darwin ? "macos" : cfg.freebsd ? "freebsd" : "linux";
const arch = cfg.arm64 ? "arm64" : "amd64";
const name = `bun-webkit-${os}-${arch}${prebuiltSuffix(cfg)}`;
const version = cfg.webkitVersion;
- if (version.startsWith("autobuild-preview-") && !cfg.allowPreviewWebkit) {
+ const tag = version.startsWith("autobuild-") ? version : `autobuild-${version}`;
+ if (tag.startsWith("autobuild-preview-") && !cfg.allowPreviewWebkit) {
throw new BuildError(
`WEBKIT_VERSION is pinned to a PR preview tag (${version}). ` +
`Preview prebuilts are deleted when the PR closes. ` +
`Pass --allow-preview-webkit=on to build anyway, or revert to a stable hash before merging.`
);
}
- const tag = version.startsWith("autobuild-") ? version : `autobuild-${version}`;
return `https://github.com/oven-sh/WebKit/releases/download/${tag}/${name}.tar.gz`;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/build/deps/webkit.ts` around lines 83 - 91, The preview-guard uses
the unnormalized value but the URL uses a normalized tag, so inputs like
"preview-pr-..." bypass the check; compute the normalized tag first (e.g., const
tag = version.startsWith("autobuild-") ? version : `autobuild-${version}`) and
then check tag.startsWith("autobuild-preview-") against cfg.allowPreviewWebkit,
throwing the BuildError if needed, and use that same tag to build the download
URL (references: version, tag, cfg.allowPreviewWebkit, BuildError).
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🟡
scripts/build/deps/webkit.ts:99-101— The fix only partially achieves its stated goal: after strippingautobuild-,"preview-pr-237-72c07745".slice(0, 16)yields"preview-pr-237-7"— the fixed 15-charpreview-pr-237-prefix leaves only one hex char of the sha in the cache key, so two preview pins of the same PR whose shas share a first hex digit (1/16 chance) still collapse to the same dir. Since this is just a cache dirname, you could drop the.slice(0, 16)entirely forautobuild--prefixed pins (or slice more chars / use the fullcfg.webkitVersion).Extended reasoning...
What the bug is
Commit 277eca8 addressed the earlier review comment by stripping
/^autobuild-/before slicing inprebuiltDestDir()(scripts/build/deps/webkit.ts:99-101):// Strip the autobuild- prefix (matches prebuiltUrl) so two different // preview/tag pins don't collapse to the same "autobuild-previe" cache key. const version16 = cfg.webkitVersion.replace(/^autobuild-/, "").slice(0, 16);
The comment says the goal is "so two different preview/tag pins don't collapse to the same cache key." But for the actual preview-tag format used in this PR, the fix leaves only a single hex character of entropy.
Step-by-step proof
Take the value currently pinned at line 11:
"autobuild-preview-pr-237-72c07745"..replace(/^autobuild-/, "")→"preview-pr-237-72c07745"(23 chars)..slice(0, 16)→"preview-pr-237-7".- The fixed prefix
"preview-pr-237-"is exactly 15 characters, so only the first hex digit of the commit sha (7) contributes toversion16.
Now consider two preview builds of the same WebKit PR — say
autobuild-preview-pr-237-72c07745and a hypotheticalautobuild-preview-pr-237-7a3f0012. Both reduce toversion16 = "preview-pr-237-7", andprebuiltDestDir()returns the same path for both. With 16 possible first hex digits, any two same-PR preview pins have a 1/16 chance of colliding.The two concrete pins this PR has actually cycled through (
cbfdad14→c,72c07745→7) happen to differ in the first digit, so the immediate collision that motivated the original comment is resolved — but only by luck, not by construction.Why existing code only partially mitigates it
- The
Source.identitystamp (webkit.ts:199) does carry the fullcfg.webkitVersion + suffix, so within a single worktree a stale extraction would be detected and re-fetched. But per the original review comment's analysis, the multi-worktree-shared-$BUN_INSTALL/build-cachescenario can still see ninja'srestatskip the fetch when the collidingdestDir's outputs already exist with newer mtimes. - The new
--allow-preview-webkit=onguard (webkit.ts:83-89) makes preview pins explicitly opt-in, so this only affects developers who deliberately enable it. - The normal
autobuild-<40hexsha>post-merge form is unaffected: strippingautobuild-leaves a 40-hex sha, and.slice(0, 16)gives the full intended 16 hex chars. - Different PR numbers (
preview-pr-237-vspreview-pr-238-) don't collide — only same-PR re-pins do.
Impact
Low. This is a residual 1/16 same-PR collision in an opt-in, dev-only flow gated behind
--allow-preview-webkit=on, and theWEBKIT_VERSIONvalue carrying it is marked DO-NOT-MERGE. But the line was just changed specifically to fix this class of collision, and its own comment overstates what it achieves, so it's worth tightening while you're here.How to fix
Any of these one-liners fully satisfies the comment's stated intent:
- Drop the slice for tag-form pins:
const v = cfg.webkitVersion.replace(/^autobuild-/, ""); const version16 = v.length === 40 ? v.slice(0, 16) : v;— bare 40-hex shas keep the short form, anything else (preview tags) uses the full string. - Slice more characters (e.g.
.slice(0, 32)coverspreview-pr-NNN-+ 8 hex). - Just use the full
cfg.webkitVersionin the dirname — it's a cache path, length doesn't matter.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/build/deps/webkit.ts`:
- Around line 86-89: Update the comment above the version slicing to accurately
reflect the implemented logic: it checks cfg.webkitVersion for any "autobuild-"
prefix (not only "autobuild-preview-") and uses the full remainder of that
prefix, otherwise it takes the first 16 characters of the version string;
reference the variables v and version16 and the cfg.webkitVersion check so
reviewers know the behavior matches the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: afd20132-2676-4338-b941-3ec8cb0da0a0
📒 Files selected for processing (1)
scripts/build/deps/webkit.ts
| // For 40-hex shas, 16 chars is plenty. For autobuild-preview-* tags, the | ||
| // meaningful sha is at the end, so use the whole thing. | ||
| const v = cfg.webkitVersion; | ||
| const version16 = v.startsWith("autobuild-") ? v.slice("autobuild-".length) : v.slice(0, 16); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider clarifying the comment to match code behavior.
The comment mentions "autobuild-preview-* tags" specifically, but the code checks for any "autobuild-" prefix (line 89). While preview tags are the primary use case, the logic applies to any autobuild-prefixed version string.
📝 Suggested comment refinement
- // For 40-hex shas, 16 chars is plenty. For autobuild-preview-* tags, the
- // meaningful sha is at the end, so use the whole thing.
+ // For 40-hex shas, 16 chars is plenty. For autobuild-prefixed tags
+ // (e.g., preview tags), use the full suffix after the prefix.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // For 40-hex shas, 16 chars is plenty. For autobuild-preview-* tags, the | |
| // meaningful sha is at the end, so use the whole thing. | |
| const v = cfg.webkitVersion; | |
| const version16 = v.startsWith("autobuild-") ? v.slice("autobuild-".length) : v.slice(0, 16); | |
| // For 40-hex shas, 16 chars is plenty. For autobuild-prefixed tags | |
| // (e.g., preview tags), use the full suffix after the prefix. | |
| const v = cfg.webkitVersion; | |
| const version16 = v.startsWith("autobuild-") ? v.slice("autobuild-".length) : v.slice(0, 16); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/build/deps/webkit.ts` around lines 86 - 89, Update the comment above
the version slicing to accurately reflect the implemented logic: it checks
cfg.webkitVersion for any "autobuild-" prefix (not only "autobuild-preview-")
and uses the full remainder of that prefix, otherwise it takes the first 16
characters of the version string; reference the variables v and version16 and
the cfg.webkitVersion check so reviewers know the behavior matches the code.
There was a problem hiding this comment.
My earlier concerns (debug tryAlignedMalloc assert, preview WEBKIT_VERSION pin, prebuiltDestDir cache-key collision) are all addressed and no new issues turned up, but this adds new runtime C++ on the ICU data-load path tied to a coordinated WebKit bump — worth a human pass before merging.
Extended reasoning...
Overview
This PR is the Bun-side half of oven-sh/WebKit#237: it adds src/jsc/bindings/bun_icu_decompress.cpp (a LazyNeverDestroyed singleton with a WTF::Lock-guarded HashMap cache that zstd-decompresses per-item ICU display-name data into 16-aligned mimalloc), bumps WEBKIT_VERSION to the now-merged 782504c968e2… sha, adds a ~300-line Intl test suite with 20 snapshots and a 599-locale sweep fixture, and tweaks prebuiltDestDir() in scripts/build/deps/webkit.ts to handle autobuild--prefixed version strings.
Since my last inline comments the author has: rounded the alloc size with WTF::roundUpToMultipleOf<16> (fixing the debug assert), replaced the preview autobuild-preview-pr-237-* pin with the real merged sha, fixed the prebuiltDestDir cache-key truncation, gated the whole hook body under #if OS(LINUX), and added/reverted the allowPreviewWebkit guard per discussion. All prior review threads are substantively resolved.
Security risks
The hook only runs on pointers handed to it by ICU's patched udata.cpp from .rodata, gated by a ZSTD_MAGICNUMBER check, with zstd doing bounded decompression into a buffer sized from the frame header. Inputs are not user-controlled (they're baked into libicudata.a at WebKit build time), so there's no injection or untrusted-data surface. The weak-symbol pattern and OS(LINUX) gate keep non-ELF platforms out of scope. I don't see a security concern here.
Level of scrutiny
High. This is new C++ on a path that every Intl.DisplayNames / NumberFormat({style:'unit'}) / DateTimeFormat({timeZoneName}) call will traverse once the WebKit bump activates it, with process-lifetime caching, locking, and a deliberate RssFile→RssAnon memory tradeoff. It's also tightly coupled to an external repo's data layout (the repacked libicudata.a and shared zstd dict from oven-sh/WebKit#237). The test coverage is extensive and the design has already been iterated with Jarred, but the combination of runtime hot-path code + cross-repo ABI coordination + a WEBKIT_VERSION bump is well outside the "mechanical change" bar for auto-approval.
Other factors
- Jarred reviewed an earlier revision and his structural feedback was incorporated; he hasn't signed off on the final shape.
- robobun reported CI failures on
9b70a2ae; the subsequent2900814c(the realWEBKIT_VERSIONbump) hasn't been reflected in the CI status comment yet. - The one remaining unresolved CodeRabbit thread (normalize-before-guard in
prebuiltUrl) is moot since the guard it referenced was reverted. - The
prebuiltDestDirchange slightly alters cache-dir naming for the normal 40-hex-sha case too (no behavioral difference, but worth a glance).
Runtime side of oven-sh/WebKit#237 (now merged as
782504c968e2). That change repacks ICU'slibicudata.awith most items per-item zstd-compressed and patchesudata.cppto callbun_icu_maybe_decompressbetween TOC lookup and header validation. This PR provides that function, the test coverage, and bumpsWEBKIT_VERSION.What's in this PR
src/jsc/bindings/bun_icu_decompress.cpp—Bun::ICUDecompressorsingleton (LazyNeverDestroyed+std::call_once,WTF::Lock/HashMap). The hook checksZSTD_MAGICNUMBERand returns immediately for raw items; zstd frames are decompressed once with the sharedDDictinto 16-aligned mimalloc and cached by.rodataaddress. Linux-only (#if OS(LINUX)).test/js/web/intl/— 31 tests / 5,902 assertions. Snapshot tests for 12 locales × everyIntl.*API (captured against unmodified ICU); structural sweep over all locales × 5 trees; plusicu-locales.txt(the full locale list).scripts/build/deps/webkit.ts—WEBKIT_VERSION→782504c968e2; fixprebuiltDestDircache key forautobuild-*tags.Runtime cost
Nothing is removed — same ICU code, same data, just stored compressed and decoded on first read, then cached for the process lifetime.
new Intl.NumberFormat("ru", {style:"unit"})Intl.*("en"),Date.toString(), URL parsing,String.normalize, regex\p{}are unchanged (kept uncompressed). AllIntloutputs byte-identical — the snapshots are the unmodified-ICU reference.No popular i18n package iterates more than the app's own locale at runtime (luxon, @formatjs/intl, i18next, dayjs, date-fns all checked); the every-locale row is the absolute upper bound, not a realistic workload.
30 fresh processes per binary, median; baseline =
autobuild-0d85951a, this PR =autobuild-782504c968e2.Memory
Total RSS is flat. Up to ~11 MB shifts from evictable
.rodatato pinned heap as items are decompressed — reachable only viaDisplayNames/unit/timeZoneName/currencyDisplay:"name"across many locales.Scope
Linux glibc + musl only. Windows (
build-icu.ps1, ICU 73.2), FreeBSD, Android Dockerfiles, and macOS (systemlibicucore) are untouched. The hook compiles everywhere but is dead code where the prebuilt'sudata.cppisn't patched.