Optimize Buffer.toString('hex'/'base64') for large buffers#31421
Conversation
Restore a SIMD hex encoder (Highway kernel, runtime-dispatched) for encode_bytes_to_hex, and encode large base64/base64url outputs into a mimalloc-backed buffer wrapped in an external WTF string instead of cycling large blocks through WTF's string allocator.
|
Warning Review limit reached
More reviews will be available in 55 minutes and 34 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughThis PR adds Highway-based SIMD acceleration for hex encoding with a 64-byte threshold, refactors base64 encoding into a length-aware helper, registers symbols across ARM64 and x64 platforms, and introduces comprehensive tests validating encoding correctness at SIMD and scalar boundaries. ChangesHex SIMD and Encoding Refactoring
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Updated 1:32 AM PT - May 26th, 2026
✅ @robobun, your commit 5f0e161821d79b253994f0e9af607a36edd2ffba passed in 🧪 To try this PR locally: bunx bun-pr 31421That installs a local version of the PR into your bun-31421 --bun |
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/highway/lib.rs`:
- Around line 265-275: The function encode_hex_lower currently uses
debug_assert! which is removed in release builds and can allow out-of-bounds FFI
writes; replace the debug-only check with a runtime capacity check: compute
needed = src.len().checked_mul(2) (handle overflow by treating as too-large),
then assert!(dst.len() >= needed) or return a Result indicating insufficient
capacity so the public API cannot call highway_encode_hex_lower with a too-small
dst; update encode_hex_lower to perform this checked validation before the
unsafe call to highway_encode_hex_lower and include a clear error message
referencing src/dst sizes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 11eb8c5c-77bf-47db-8ff2-b490d6fe9928
📒 Files selected for processing (8)
scripts/verify-baseline-static/allowlist-aarch64.txtscripts/verify-baseline-static/allowlist-x64-windows.txtscripts/verify-baseline-static/allowlist-x64.txtsrc/bun_core/string/immutable.rssrc/highway/lib.rssrc/jsc/bindings/highway_strings.cppsrc/runtime/webcore/encoding.rstest/js/node/buffer.test.js
There was a problem hiding this comment.
I didn't find any correctness issues, but this adds a new Highway SIMD kernel plus an allocation-strategy change on a hot path (and the ISA allowlist entries are partly speculative per the description), so it's worth a human pass.
Extended reasoning...
Overview
This PR restores large-buffer Buffer.toString('hex'/'base64') throughput that regressed in the Zig→Rust port. It adds a new EncodeHexLowerImpl Highway kernel in highway_strings.cpp (nibble split → TableLookupBytes → StoreInterleaved2), wires it through a Rust FFI wrapper in src/highway/lib.rs and a 64-byte threshold in encode_bytes_to_hex, and refactors the base64/base64url arms of to_bun_string_comptime into a helper that switches to a mimalloc-backed external WTF string above 32 KB of output. Supporting changes: per-target symbol entries in three verify-baseline-static allowlists, and a substantial new test block in buffer.test.js (byte-exact reference comparisons across vector-width boundaries, unaligned views, SHA-256 digests of a 110 KB buffer, and a timing-ratio regression guard).
Security risks
None apparent. Inputs are in-process Buffer bytes; outputs are freshly allocated strings. The one memory-safety surface — the FFI write into dst — is now guarded by a runtime assert!(dst.len() / 2 >= src.len()) (added after CodeRabbit's comment), and the only caller slices dst to exactly 2 * to_read. The C++ kernel's HWY_RESTRICT aliasing assumption is satisfied by the &/&mut split on the Rust side. The fill_spare usage for the large-base64 path correctly reserves to_len on an empty Vec before slicing spare[..to_len].
Level of scrutiny
This warrants human review rather than bot approval. It is a new hand-written SIMD kernel that will run on every ≥64-byte hex encode across SSE/AVX2/AVX-512/SVE targets, relying on Highway semantics (TableLookupBytes index-range behaviour, StoreInterleaved2 write bounds, ShiftRight<4> on u8) that look correct to me but are exactly the kind of thing a maintainer familiar with this file's other kernels should eyeball. The base64 change swaps the allocation/ownership model on a hot path (uninitialized WTF string → external mimalloc string with finalizer), which is a design choice worth a human nod. The PR description also explicitly flags that the AVX10_2 allowlist feature set is extrapolated and may need adjusting once the baseline CI lanes report.
Other factors
Both prior review comments (my subarray end-index nit and CodeRabbit's debug_assert!→assert! request) have been addressed in follow-up commits and the threads are resolved. Test coverage is thorough and includes round-trip decode checks. The new timing-based regression guard (hex ≤ 6× latin1) is skipped on debug/ASAN and uses a median of 13 GC-isolated samples with wide margin, which seems reasonable but is the kind of thing that occasionally flakes on shared CI — worth a maintainer's judgment on whether that belongs in the suite vs. the bench scripts.
…VX3_DL hex allowlist ceiling The in-suite timing baseline gets skewed by the surrounding tests' heap and GC activity, so run the hex-vs-latin1 measurement in a fresh process. The Windows baseline ISA scan reports AVX512DQ/AVX512VL encodings in the AVX3_DL hex kernel, so include them in that entry's feature ceiling.
There was a problem hiding this comment.
No bugs found, but this adds a new Highway SIMD kernel with unsafe FFI plumbing, changes the large-buffer base64 allocation strategy, and touches ISA allowlists across three platforms — worth a human pass on the kernel and the 32 KB / 64 B thresholds.
Extended reasoning...
Overview
This PR restores pre-Rust-port performance for Buffer.toString('hex'|'base64'|'base64url') on large buffers. It adds a new Highway kernel EncodeHexLowerImpl in highway_strings.cpp (nibble split → TableLookupBytes → StoreInterleaved2), wires it through a new extern "C" shim and a safe Rust wrapper in src/highway/lib.rs, and dispatches to it from encode_bytes_to_hex for inputs ≥ 64 bytes. Separately, to_bun_string_comptime's base64/base64url arms are refactored into encode_base64_to_bun_string, which switches from create_uninitialized_latin1 to a mimalloc-backed Vec + external WTF string when the encoded output is ≥ 32 KB. Three verify-baseline-static allowlists gain per-target EncodeHexLowerImpl entries, and buffer.test.js gains a ~190-line correctness suite plus a subprocess-based throughput regression guard.
Security risks
None apparent. The new FFI write path (highway_encode_hex_lower) is guarded by a runtime assert!(dst.len() / 2 >= src.len()) in the safe wrapper (added in 55b86a4 after CodeRabbit feedback), and the only caller slices dst to exactly 2 * src.len(). The HWY_RESTRICT on input/output is satisfied because the Rust wrapper takes &/&mut (non-overlapping by construction). The kernel's index vectors (ShiftRight<4> and And(..., 0x0F)) are bounded to [0,15], so TableLookupBytes behaves identically across PSHUFB / NEON TBL / AVX-512 VPERMB without high-bit zeroing surprises. No user-controlled data reaches allowlists or build config.
Level of scrutiny
High. Buffer.toString is one of the hottest user-facing paths in the runtime, and this change introduces (a) a brand-new per-ISA-target SIMD kernel that will execute on every ≥64-byte hex encode, (b) a heuristic allocation-strategy fork at 32 KB that trades WTF-string allocation for an external-string finalizer, and (c) hand-written ISA feature-ceiling entries in the baseline allowlists — one of which (N_AVX3_DL on Windows) already needed a follow-up commit after CI flagged a mismatch, and the PR description notes the AVX10_2 entry was extrapolated rather than measured. None of this is mechanical.
Other factors
Test coverage is strong: byte-exact comparison against pure-JS reference encoders across 0–130 and vector-width ±1 boundaries, unaligned subarray views, a 110 KB buffer cross-checked via SHA-256 against Node and the JS reference, and decode round-trips. The throughput guard runs in a subprocess with GC disabled and uses a 6× ratio with wide margin, but timing-based tests are always a flake risk worth a second opinion. Both prior review threads (my subarray nit and CodeRabbit's debug_assert! → assert!) are resolved. CodeRabbit's walkthrough suggests Jarred-Sumner / alii as reviewers, which seems right for Highway + allocator changes.
…t ceilings The Windows baseline ISA scan reports the same AVX512DQ/AVX512VL encodings in the AVX3_SPR and AVX3_ZEN4 variants of EncodeHexLowerImpl as in AVX3_DL.
There was a problem hiding this comment.
No issues found after the fixes, but this adds a new Highway SIMD kernel, an unsafe FFI boundary, and changes the large-buffer base64 allocation strategy — worth a human pass on the kernel correctness and the 32 KB / 64 B thresholds.
Extended reasoning...
Overview
This PR restores large-buffer Buffer.toString('hex'/'base64') performance to (and past) the pre-Rust-port baseline. It adds a new EncodeHexLowerImpl Highway kernel in highway_strings.cpp with the standard HWY_EXPORT/HWY_DYNAMIC_DISPATCH plumbing, a Rust FFI wrapper in bun_highway, a 64-byte dispatch threshold in encode_bytes_to_hex, and a refactored encode_base64_to_bun_string that switches to a mimalloc-backed external string above 32 KB output. It also registers the new per-target symbols in all three verify-baseline-static allowlists and adds ~190 lines of byte-exact reference tests plus a throughput regression guard.
Security risks
None identified. The new unsafe surface is the highway_encode_hex_lower FFI call, which is now guarded by a runtime assert!(dst.len() / 2 >= src.len()) (upgraded from debug_assert! per CodeRabbit feedback), and the only caller slices dst to exactly 2 * src.len(). The base64 path's fill_spare usage commits only the bytes the encoder reports as written. Inputs are arbitrary user buffer bytes, but the encoders are pure data transforms with fixed-ratio output sizes.
Level of scrutiny
Medium-high. This is a hot-path performance change in Buffer.prototype.toString that introduces hand-written SIMD across six x86 targets and four ARM SVE targets, plus an allocation-strategy change whose correctness depends on create_external_globally_allocated_latin1 ownership semantics. The kernel itself (ShiftRight<4> / And 0x0F / TableLookupBytes / StoreInterleaved2) is a standard nibble-interlace and the scalar tail is straightforward, but SIMD kernels and ISA allowlist ceilings are exactly the kind of thing where a second pair of eyes pays off — the PR already needed two follow-up commits to get the Windows SPR/ZEN4 ceilings right, and the description notes the AVX10_2 entry is extrapolated rather than measured.
Other factors
Test coverage is strong: byte-for-byte comparison against pure-JS reference encoders across 0–130 and vector-width ±1 boundaries, unaligned subarray views (my earlier nit about the subarray end index was fixed in 77f5e52), SHA-256 cross-checks against a 110 KB buffer, and round-trip decodes. The throughput guard runs in a subprocess and is skipped on debug/ASAN. Both prior review comments (mine and CodeRabbit's) are resolved. Still, the combination of new SIMD code, unsafe FFI, magic thresholds (64 B, 32 KB), and hand-maintained per-ISA allowlist entries puts this outside what I'd auto-approve without human sign-off.
Jarred-Sumner
left a comment
There was a problem hiding this comment.
What about the buffer.write code path?
|
I did measure it to make sure there isn't a lurking port regression on that side. Same methodology as the PR benchmarks (110 000-byte payload, median of GC-isolated samples, three interleaved rounds on a noisy shared box — ranges below are min–max of the three rounds):
So: write-hex and write/from-base64 are at parity with (or slightly ahead of) the Zig build — hex decode was a scalar loop in Zig too, and base64 decode is the same simdutf call in both. Two natural follow-ups I'd keep out of this PR to keep it reviewable, but am happy to do next:
|
|
CI note: the only red check on the current head is |
* oven/main (3 new commits): build: support cross-compiling Windows (x64 and arm64) from Linux (oven-sh#31300) Bun.serve: restore per-request GC memory accounting to fix elevated RSS under HTTP load (oven-sh#31422) Optimize Buffer.toString('hex'/'base64') for large buffers (oven-sh#31421) Auto-merged: scripts/build.ts, scripts/build/bun.ts, scripts/build/deps/webkit.ts, scripts/build/flags.ts, scripts/build/rust.ts, scripts/build/source.ts, scripts/build/tools.ts Resolved conflict in scripts/build/config.ts: kept OHOS closing block + upstream Windows cross-compilation block
* oven/main (3 new commits): build: support cross-compiling Windows (x64 and arm64) from Linux (oven-sh#31300) Bun.serve: restore per-request GC memory accounting to fix elevated RSS under HTTP load (oven-sh#31422) Optimize Buffer.toString('hex'/'base64') for large buffers (oven-sh#31421) Auto-merged: scripts/build.ts, scripts/build/bun.ts, scripts/build/deps/webkit.ts, scripts/build/flags.ts, scripts/build/rust.ts, scripts/build/source.ts, scripts/build/tools.ts Resolved conflict in scripts/build/config.ts: kept OHOS closing block + upstream Windows cross-compilation block
### What does this PR do? `Buffer.from(str, "hex")` (and `buf.write(str, "hex")`) currently decode two characters per loop iteration through a lookup table. This PR adds Highway SIMD kernels for hex decoding and routes the shared decoder through them, so whole vectors of input characters are validated and converted at a time. - **`DecodeHex8Impl` / `DecodeHex16Impl`** in `highway_strings.cpp` (same `HWY_EXPORT` + `HWY_DYNAMIC_DISPATCH` pattern as the existing kernels): classify digits with two range checks, compute nibble values arithmetically, pair them with `ConcatEven`/`ConcatOdd`, and stop at the first vector block containing a non-hex character — the scalar tail then pinpoints the exact pair, preserving Node's "stop at the first invalid pair" semantics byte-for-byte. A capped 128-bit cleanup loop handles the 16–63-pair remainder on wide-vector targets so digest-sized inputs (md5/sha256 strings) still vectorize on AVX-512 machines. - **`bun_highway::decode_hex` / `decode_hex_u16`** expose the kernels to Rust. - **`_decode_hex_to_bytes`** (the shared helper behind `Buffer.from`, `buf.write`, CSRF token parsing, etc.) takes the Highway path for inputs of at least 16 byte pairs, for both Latin-1 and UTF-16 source strings; shorter inputs keep the existing scalar loop. The UTF-16 kernel classifies on the full code unit, so units above 0xFF whose low byte looks like a hex digit are still rejected, matching the scalar decoder. - `verify-baseline-static` allowlists gain entries for the new dispatch targets, with feature ceilings taken from the tool's own report on the compiled kernels. Complementary to #31421, which covers the encode direction (`toString('hex')`); the two touch the same files but different functions. Related to #8782 (adopting Highway for runtime-dispatched SIMD) — this covers the hex decode path only, so it is not marked as closing that issue. ### Benchmarks `bench/snippets/buffer-from-hex.mjs` (added in this PR), median of 3 interleaved rounds on linux x64 (Ice Lake, AVX-512), comparing a release build of `main` (4324120) with this PR at the same commit: | input | main | this PR | speedup | |---|---|---|---| | 32 chars → 16 B | 237 ns | 235 ns | ~1.0x | | 64 chars → 32 B | 255 ns | 237 ns | 1.08x | | 128 chars → 64 B | 273 ns | 243 ns | 1.12x | | 1 KiB chars → 512 B | 796 ns | 375 ns | 2.1x | | 128 KiB chars → 64 KB | 65.4 µs | 9.8 µs | 6.7x | | 1 MiB chars → 512 KB | 513 µs | 88 µs | 5.8x | | 16 MiB chars → 8 MB | 8.50 ms | 2.22 ms | 3.8x | Small inputs are dominated by the fixed `Buffer.from` overhead, so they sit at parity; the decode itself is the win for everything from a few hundred characters up. ### How did you verify your code works? - `bun bd test test/js/node/buffer.test.js` — 472 pass with the debug (ASAN) build; also passes with `USE_SYSTEM_BUN=1` since this is a behavior-preserving optimization. `test/js/bun/util/csrf.test.ts`, `buffer-utf16`, `buffer-concat`, and `buffer-from-encoding-leak` (with the ASAN quarantine accounted for) also pass. - New regression coverage in `test/js/node/buffer.test.js` sweeps the SIMD block boundaries: every length around the vector widths (15–1024 pairs, with and without a trailing lone digit), an invalid character at every position of an 80-pair input for ten different invalid bytes, UTF-16 code units above 0xFF whose low byte is a valid hex digit placed at block boundaries, and `buf.write` destination-limit behavior — all cross-checked against a plain JS reference decoder, on both the Latin-1 and two-byte string paths. These tests pass before and after this change by design (there is no functional delta to assert); the before/after evidence for the optimization itself is the benchmark table above. - A standalone harness compiled the kernels for every Highway target and checked them against a scalar reference for all input lengths 0–600, invalid characters at every position, and the UTF-16 high-byte cases, including "never writes past the reported count". - `cargo clippy -p bun_core -p bun_highway` is clean. - The baseline-ISA allowlist entries are the feature sets `verify-baseline-static` reports for the new kernels when compiled with the same LLVM toolchain; if the CI baseline builds report different sets I'll update the entries to match. Co-authored-by: Jarred Sumner <jarred@jarredsumner.com>
What does this PR do?
Buffer.prototype.toStringregressed vs the last Zig-based release for large buffers on the binary-to-text encodings (bench/snippets/buffer-to-string.mjs:hexof a 110 KB buffer ~3.7× slower,base64/base64url~17% slower on a quiet Sapphire Rapids host). Small-buffer cases were already faster than Zig and stay on their current paths.Cause
hex: the Rust port ofencode_bytes_to_hexreplaced the Zig@Vector(16, u8)nibble-interlace fast path with a scalar, bounds-checked table loop (bytes_to_hex_lower) — the port even left aPERF(port)note about it. Every byte of a 110 KB buffer goes through that loop.base64/base64url: the encode itself issimdutf::binary_to_base64in both the Zig and Rust builds, so the kernel was never the problem. The Zig implementation encoded into adefault_allocator(mimalloc) buffer and handed it to an external WTF string; the Rust port switched toBunString::create_uninitialized_latin1, which cycles a ~146 KB block through WTF's string allocator on every call. mimalloc reuses blocks of that size much more cheaply than WTF's allocator does.Fix
EncodeHexLowerImplHighway kernel inhighway_strings.cpp(sameHWY_EXPORT+HWY_DYNAMIC_DISPATCHpattern as the existing kernels): split each byte into nibbles, map both nibble vectors through the 16-byte hex-digit table withTableLookupBytes, and write the interleaved digits withStoreInterleaved2. Exposed asbun_highway::encode_hex_lower;bun_core::strings::encode_bytes_to_hexdispatches to it for inputs ≥ 64 bytes and keeps the scalar LUT loop below that (small buffers were already ahead of Zig). Per-target symbols added to theverify-baseline-staticallowlists.Buffer.toString("base64"/"base64url")(to_bun_string_comptime) now encodes outputs ≥ 32 KB into a mimalloc-backed buffer wrapped in an external WTF string — the strategy the Zig implementation used — via a sharedencode_base64_to_bun_stringhelper. Smaller outputs keep the uninitialized-WTF-string path, which is what makes the 16-byte cases faster than Zig today.Output bytes are unchanged for every encoding and size (see verification).
Benchmarks
Shared/noisy Ice Lake Xeon container, medians of interleaved rounds; ratios are the signal, not absolute numbers.
Zigis bun 1.3.14,beforeis canary at 49c97de,afteris this PR (both release builds from this tree).bench/snippets/buffer-to-string.mjs:Buffer(110000).toString('hex')Buffer(16).toString('hex')Buffer(110000).toString('base64')Buffer(110000).toString('base64url')Buffer(16).toString('base64')Buffer(110000).toString('ascii')The 110 KB base64 delta (~1 µs on the reporting host) is below this container's noise floor, so it reads as a wash above; the allocation-strategy effect is visible once the output is big enough to clear the noise, in
bench/snippets/buffer-base64.mjs(toStringdirection, two rounds):Hex is ~3.8× faster than before this change and ~30–40% faster than the Zig build at 110 KB; large base64 recovers the gap to Zig (and beats it at 8 MB).
How did you verify your code works?
bun bd test test/js/node/buffer.test.js— 468 pass, including the newBuffer.prototype.toString binary-to-text encodingsblock: byte-for-byte comparison against pure-JS reference encoders across every length 0–130 plus 192/256/512/1024 ±1 (vector-width boundaries and the scalar tail), unalignedsubarrayviews, everylength % 3padding shape, and a 110 000-byte buffer checked against SHA-256 digests that were cross-verified against bun 1.3.14 (pre-rewrite Zig) and the JS reference.test-buffer-tostring.js,test-buffer-alloc.js,test-stringbytes-external.js(~500 KB hex/base64 encode+decode round-trips), andtest/js/node/string_decoder/string-decoder.test.js— all pass with the debug (ASAN) build.fs.readFile(path, "hex" | "base64" | "base64url")spot-checked againstBuffer.toStringoutput (the readFile hex path sharesencode_bytes_to_hex).cargo clippy -p bun_core -p bun_highway -p bun_runtimeclean.scripts/verify-baseline-staticagainst the locally builtbun-profileand copying the feature sets it reported per target. The local toolchain doesn't compile theAVX10_2target, so that entry mirrors the SPR/DL ceilings (+AVX512VL); if the baseline/aarch64 CI lanes report different sets I'll update the entries to match, same as theCopyAsciiPrefixImpladdition did.toString('hex')of a 110 KB buffer stays within 6× the cost of atoString('latin1')copy of the same buffer (median of GC-isolated samples). The scalar per-byte loop sits at ≥10× that baseline while the SIMD kernel stays at ~2–3×, so this test fails on the pre-SIMD implementation and passes with this PR; it is skipped on debug/ASAN builds, where the unoptimized, instrumented native kernels make timing ratios meaningless. The full before/after picture is the benchmark tables above, reproducible viabench/snippets/buffer-to-string.mjsandbench/snippets/buffer-base64.mjs.