deserializeStr experiments: fromCharCode batching and pre-computation tricks by joshuaisaact · Pull Request #1 · overlookmotel/oxc-raw-str-bench

joshuaisaact · 2026-03-30T20:24:13Z

Why

Exploring how far deserializeStr can be pushed on an M4 Mac. This is the hottest function in the NAPI raw transfer deserialization path -- every string in the AST goes through it.

What we tried

Starting from the PR #20834 baseline (firstNonAsciiPos + threshold 9, 56.4ms across 25 fixtures):

The fromCharCode batching discovery (exp4-exp13, 56ms -> 33ms):

The baseline builds short strings byte-by-byte with out += fromCodePoint(c). Turns out a single String.fromCharCode(b0, b1, b2, ...) call is dramatically faster -- V8 can allocate the string in one shot instead of concatenating. We check if all bytes are ASCII first, then dispatch through a switch on length. Raising the threshold from 9 to 48 kept improving things. Each step was a clear win. Past 48 the gains flatlined.
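A minimal sketch of the batching idea described above, cut down to lengths 0-4 so it stays readable (the experiments extended the switch up to 48). The names `isAscii` and `decodeShortStr` are hypothetical and not taken from the repo; the fallback to `TextDecoder` for everything else is an assumption about how the remaining cases are handled.

```javascript
const { fromCharCode } = String;
const textDecoder = new TextDecoder("utf-8");

// Hypothetical helper: true if every byte in [pos, pos + len) is below 0x80.
function isAscii(uint8, pos, len) {
  for (let i = pos, end = pos + len; i < end; i++) {
    if (uint8[i] >= 0x80) return false;
  }
  return true;
}

// Sketch only: check ASCII upfront, then build the string with a single
// fromCharCode call whose argument count is fixed per switch case.
function decodeShortStr(uint8, pos, len) {
  if (len <= 4 && isAscii(uint8, pos, len)) {
    switch (len) {
      case 0:
        return "";
      case 1:
        return fromCharCode(uint8[pos]);
      case 2:
        return fromCharCode(uint8[pos], uint8[pos + 1]);
      case 3:
        return fromCharCode(uint8[pos], uint8[pos + 1], uint8[pos + 2]);
      case 4:
        return fromCharCode(uint8[pos], uint8[pos + 1], uint8[pos + 2], uint8[pos + 3]);
    }
  }
  // Longer or non-ASCII strings: fall back to the native UTF-8 decoder.
  return textDecoder.decode(uint8.subarray(pos, pos + len));
}
```

Each case is a direct call with a known number of arguments, which is the property the description credits for the speed-up.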

We also tried fromCharCode.apply(null, uint8.subarray(...)) as a cleaner alternative to the switch -- it was 2x slower. The subarray allocation + .apply overhead kills it. The ugly switch wins because V8 sees a direct call with a known argument count.

The latin1 pre-decode trick (exp15, 33ms -> 19ms):

Decoding the entire buffer as latin1 in setup() gives a string where byte offsets map 1:1 to character offsets. For any ASCII string, bufferAsAscii.substr(pos, len) is a direct slice -- no TextDecoder, no byte scanning. This is where it started getting good but also where setup cost started mattering.
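Roughly, the trick looks like the sketch below. It assumes Node.js and uses the documented `buf.toString("latin1")` for the one-off decode; the experiment itself may have decoded differently. `setup`, `deserializeStr`, and the `state` object here are simplified stand-ins, not the repo's actual code, and the per-byte ASCII scan is still present at this stage.

```javascript
const utf8Decoder = new TextDecoder("utf-8");

// Decode the whole buffer once as Latin1: one character per byte, so
// byte offsets and character offsets coincide.
function setup(uint8) {
  const bufferAsAscii = Buffer.from(uint8.buffer, uint8.byteOffset, uint8.byteLength)
    .toString("latin1");
  return { uint8, bufferAsAscii };
}

function deserializeStr(state, pos, len) {
  const { uint8, bufferAsAscii } = state;
  // If any byte is non-ASCII, the Latin1 slice would be wrong, so decode as UTF-8.
  for (let i = pos, end = pos + len; i < end; i++) {
    if (uint8[i] >= 0x80) return utf8Decoder.decode(uint8.subarray(pos, end));
  }
  // All ASCII: slice directly using the original byte offsets.
  return bufferAsAscii.substr(pos, len);
}
```

The Latin1 string is a full extra copy of the buffer, which is the setup and memory cost mentioned above.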

The cumulative count trick (exp29-exp32, 19ms -> 4.2ms):

Pre-computing a prefix count of non-ASCII bytes (for every offset, how many non-ASCII bytes precede it) makes "is this range all ASCII?" an O(1) check: two array lookups and a comparison. This eliminated the per-byte ASCII scan that dominated the heavy files. typescript went from 6.6ms to 1.2ms.
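As a sketch of the cumulative array (function names are hypothetical): `counts[i]` holds the number of non-ASCII bytes in `[0, i)`, so a range is all ASCII exactly when the count is the same at both ends.

```javascript
// Build the cumulative count once in setup. This is the Uint32Array of
// length bufferLength + 1, i.e. 4 bytes of extra memory per buffer byte.
function buildNonAsciiCounts(uint8) {
  const counts = new Uint32Array(uint8.length + 1);
  for (let i = 0; i < uint8.length; i++) {
    // uint8[i] >>> 7 is 1 for bytes >= 0x80 and 0 for ASCII bytes.
    counts[i + 1] = counts[i] + (uint8[i] >>> 7);
  }
  return counts;
}

// O(1) check: no non-ASCII bytes were counted between pos and pos + len.
function isAsciiRange(counts, pos, len) {
  return counts[pos + len] === counts[pos];
}
```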

Where it went off the rails

The 56ms -> 4ms headline number is real but dishonest. We moved most of the work into setup(), which the benchmark doesn't time. The cumulative array is a Uint32Array(bufferLength + 1) -- for a 10MB source file that's 40MB of extra memory. The latin1 pre-decode is another full copy of the buffer as a string. You'd never ship this in production.

We also burned a bunch of experiments trying to build a byteToChar mapping for source strings past firstNonAsciiPos (exp20, exp22a-c). The idea was sound -- map byte offsets to character offsets so sourceText.substr works everywhere -- but it kept breaking on edge cases: UTF-16 surrogate pairs, malformed UTF-8, strings that span the source/strData boundary. We abandoned it after 3 failed attempts.

What's actually shippable

exp13 (33ms, -42% vs baseline) is the honest win. Zero setup cost, zero extra memory, same fast paths as the baseline. The only change is how 1-48 byte non-source ASCII strings are built: check ASCII upfront, then one fromCharCode call via a switch on length. The switch is ugly but V8 loves it.

The latin1 pre-decode (exp15) might be worth it if the memory budget allows -- it's one extra string copy of the buffer, which is modest. Whether the cumulative array is worth it depends on how many strings you're decoding and how expensive setup is in the real pipeline.

@overlookmotel would love your take on which of these (if any) are worth pulling into the real deserializer. The commits are all in here if you want to poke at individual experiments.

References

oxc PR #20834 (the prior experiment round this builds on)

joshuaisaact · 2026-03-30T20:40:38Z

Think there's a LOT of unviable stuff in here, but some could be viable...

joshuaisaact · 2026-03-30T20:50:55Z

Exp 13 looks viable - will split out into another PR

overlookmotel · 2026-03-31T14:10:36Z

Thanks for diving into this!

I'm unclear what the current version of versions/experiment.mjs is doing. Cum??

Would you be able to ask Claude to write a summary of the various things he tried and why they were rejected?

Please feel free to make a PR adding a ton of different versions to the versions directory. Please just follow the template (comment the code, and a comment at top explaining which other version it builds on, and what the change made is). To avoid a ludicrously large benchmark table, we can always look at specific ones with the FILTER var (see README).

By the way, ultimately the best solution will likely be to get the UTF8 to UTF16 translation table that we already have on Rust side over to JS, so every string in the source code can take the sourceText.substr(...) fast path. But it's a bit of a palaver to make the changes to do that, so anything we can do on JS side for now is a win.

Here's the Rust-side code, if you're interested:
https://github.com/oxc-project/oxc/blob/af72b802be621fbea6e6ca1fbfc9a685c978b6fc/crates/oxc_ast_visit/src/utf8_to_utf16/translation.rs

overlookmotel · 2026-03-31T18:22:36Z

Oh I understand the cum now. Decoding to latin1 string is a masterstroke!

I've added a simpler version which doesn't have as high setup cost to the benchmarks. It's the winner so far.

I've not finessed the ideal switch-over point.

joshuaisaact · 2026-04-01T05:07:03Z

My bad on raising the PR too soon. Got overexcited!

Interesting.... I'll have a mess around with it today too

overlookmotel · 2026-04-01T08:40:15Z

Latin1 has changed the game! We can probably tweak it a bit more but I doubt there are any more fundamental breakthroughs (I think) left to find, without having more extensive setup work, which I think we should probably avoid. So I imagine latin-source64 will be the base of the final solution (probably with the latin-*-chunk64 optimization added in).

Would be interested if you can find any way to finesse it though. There's probably a few more % to be had from finding the best switch-over points, and maybe branch reorganisation.

Also relevant: oxc-project/oxc#20923 which skips calling deserializeStr entirely in some cases.

… table Scans the source region once in setup() to build a sparse translation table mapping multi-byte UTF-8 character positions to cumulative byte-vs-codeunit drift. deserializeStr() binary searches this table to convert byte offsets to UTF-16 offsets, extending sourceText.substr() to all source strings — not just those in the ASCII prefix. Benchmarks show 25-65% improvement over current on non-ASCII files, though the dense cumulative array approach (experiment.mjs) remains faster due to O(1) lookups vs O(log k) binary search.
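One way the sparse table and binary search could look, as a sketch rather than the commit's actual code. It assumes well-formed UTF-8 and offsets that fall on character boundaries; `buildTable`, `byteToUtf16Offset`, and `sliceSource` are hypothetical names. Each table entry records where a multi-byte character starts and the cumulative drift (UTF-8 bytes minus UTF-16 code units) after it.

```javascript
// Scan the source region once, recording only multi-byte characters.
function buildTable(uint8, sourceEndPos) {
  const starts = [];
  const drifts = [];
  let drift = 0;
  for (let i = 0; i < sourceEndPos; ) {
    const b = uint8[i];
    if (b < 0x80) {
      i++;
      continue;
    }
    // Assumes well-formed UTF-8: the lead byte determines the sequence length.
    const n = b >= 0xf0 ? 4 : b >= 0xe0 ? 3 : 2;
    // 4-byte sequences become a surrogate pair (2 code units), others 1 unit.
    drift += n - (n === 4 ? 2 : 1);
    starts.push(i);
    drifts.push(drift);
    i += n;
  }
  return { starts: Uint32Array.from(starts), drifts: Uint32Array.from(drifts) };
}

// Binary search for the number of multi-byte characters starting before bytePos,
// then subtract the drift accumulated by the last of them.
function byteToUtf16Offset(table, bytePos) {
  const { starts, drifts } = table;
  let lo = 0;
  let hi = starts.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (starts[mid] < bytePos) lo = mid + 1;
    else hi = mid;
  }
  return lo === 0 ? bytePos : bytePos - drifts[lo - 1];
}

function sliceSource(sourceText, table, pos, len) {
  return sourceText.slice(byteToUtf16Offset(table, pos), byteToUtf16Offset(table, pos + len));
}
```

The table only grows with the number of multi-byte characters k, which is why lookups are O(log k) instead of the O(1) of the dense cumulative array.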

A string starting in the source region but extending past sourceEndPos would get truncated by sourceText.substr(). Changed the guard from pos < sourceEndPos to pos + len <= sourceEndPos so boundary-spanning strings correctly fall through to the TextDecoder path.

overlookmotel · 2026-04-01T14:09:06Z

FYI it's an invariant of how strings are constructed that they cannot cross the boundary between source text region and other strings region. I saw comment in PR description about strings being found which did cross the boundary. I think that must have been a bug in the deserializeStr impl. You can run pnpm run verify to check all versions produce identical output to the original.

Also, just FYI, I've updated the fixtures after oxc-project/oxc#20923, which means some files have a lot fewer deserializeStr calls now. It doesn't seem to alter the results considerably though.

overlookmotel · 2026-04-03T22:02:10Z

I think there's probably more we can do, but I'm off for a few days and was keen for some of this work to get into Monday's release. So I've merged the current winner latin-slice-onebyte64 into Oxc (oxc-project/oxc#21021 + the other PRs in that stack).

Would be very happy to receive further improvements though.

@joshuaisaact

…ransfer (#21021)

Improve perf of deserializing strings in raw transfer. This PR combines several optimizations, which have been tested and benchmarked in https://github.com/overlookmotel/oxc-raw-str-bench. This PR implements the version "latin-slice-onebyte64" from that repo, which is the current winner.

String deserialization is the main bottleneck in raw transfer, so speeding it up will likely make a large impact on deserialization overall.

This work follows on from #20834 which produced a major speed-up in many files by making files which contain some non-ASCII characters take the fast path of slicing `sourceText` more often. This PR tackles the remainder - speeding up the fallback path where the fast path can't be taken.

## Optimizations

The optimizations in this PR are:

### Latin1

When source is not 100% ASCII, decode source text from buffer as Latin1. A Latin1-decoded string represents each UTF-8 byte as a single Latin1 character, so it can be indexed into using UTF-8 offsets.

So when we can't slice the string from `sourceText` because the UTF-8 and UTF-16 offsets differ (after any non-ASCII character), loop through the string's bytes and check if they're all ASCII. If they are, the string can be sliced from `sourceTextLatin` instead, with the original UTF-8 offsets. This is way faster than calling `textDecoder.decode`, as it avoids a call into C++.

[Benchmarks show](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-latin-vs-latin-source64) speed up of 55% on average, and up to 70% on some files.

### Latin1 decoding method

It turns out that `new TextDecoder("latin1").decode(arr)` doesn't actually decode to Latin1! Per the WHATWG Encoding Standard, "latin1" is mapped to "windows-1252". The result is that with `TextDecoder("latin1")`:

1. `decode` is quite complicated, requiring a scan of the bytes to determine if they're all ASCII, followed by a 2nd pass to do the actual `windows-1252` decoding. If the string *does* contain any non-ASCII characters (which it always does in our usecase), NodeJS implements the decoding in JS, not native code. Slow.
2. `decode` produces a 2-byte-per-char string (`TWO_BYTE` in V8), which takes more memory, and is slower for all operations on it e.g. string comparison, hashing for use as an object key etc.

Instead, use `Buffer.prototype.latin1Slice` which:

1. Does a pure Latin1 decode, which is just a single `memcpy` call.
2. Produces a 1-byte-per-char string (`ONE_BYTE` in V8).

`latin1Slice` involves a call into C++, but we only do it once per file, so this cost is tiny in context of deserializing the whole AST.

### Latin1 string slicing

In the fast path, slice from the Latin1-decoded string, instead of `sourceText`. In the fast path, we know that all bytes of source comprising the string are ASCII, so no further checks are required.

This makes no difference on benchmarks for `deserializeStr` itself, but it may have beneficial effects downstream for code (e.g. lint rules) which access strings in the AST, e.g. `Identifier` names.

Because Latin1-decoded source text is `ONE_BYTE`-encoded, slices of it are too. In comparison, slices of `sourceText` may be `ONE_BYTE` or `TWO_BYTE`. If a file's source is pure ASCII, it'll be `ONE_BYTE`; if source contains any non-ASCII characters, it'll be `TWO_BYTE`. Files in a repo will likely be a mix of both, which makes strings returned from `deserializeStr` and placed in the AST a mix too. This in turn makes functions (e.g. lint rule visitors) polymorphic. V8 cannot optimize them as aggressively as if they see only `ONE_BYTE` strings.

We cannot make sure that all strings returned by `deserializeStr` are `ONE_BYTE`. Some strings may contain non-ASCII characters, and they *have* to be represented in `TWO_BYTE` form. But we can minimize it - now only strings which *themselves* contain non-ASCII characters are `TWO_BYTE`, whereas before they would be if the source text as a whole contained a single non-ASCII byte. Code which accesses `Identifier` names, for example, will exclusively see `ONE_BYTE` strings and will be more heavily optimized, because Unicode `Identifier`s are rarer than hen's teeth in real-world code.

### Remove string-concatenation loop

Previously strings which are outside of source text were assembled byte-by-byte in a loop via concatenation. Instead, check that all the bytes are ASCII first, copy them into an array and pass that array to `String.fromCharCode` with `fromCharCode.apply(null, array)`. To avoid allocating a fresh array every time, hold a stock of arrays for all string lengths that this path can require, and reuse them.

This is a variation on the approach that #20883 took, but without the massive switch. This produces much tighter assembly, and avoids regressing the fast path due to making `deserializeStr` a very large function.

Despite the complexity, and multiple operations, [this is up to 3x faster](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-switch30) than the switch approach, and gives an average 30% speed-up.

### Increase native call threshold

The above optimizations make the slow path much faster. This shifts the tipping point at which it's faster to make a native call to `TextDecoder.decode` from 9 bytes to 64 bytes. Most strings now avoid the native call and stay in JS code which is heavily optimized by Turbofan.

The tipping point of 64 is something of a guesstimate. Benchmarking shows it's in the right ballpark, but we could finesse it, and probably squeeze out another couple of %.

## Credit

The Latin1 string technique was cooked up by @joshuaisaact in overlookmotel/oxc-raw-str-bench#1. All credit to him for this masterstroke which cracks the whole problem!
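A rough sketch of the "stock of arrays" idea from the "Remove string-concatenation loop" section, not the merged implementation. `pooledArrays`, `MAX_JS_LEN`, and `decodeOutsideSource` are hypothetical names; whether the real cut-off is inclusive at 64 is an assumption, and this sketch folds the ASCII check and the copy into a single loop.

```javascript
const { fromCharCode } = String;
const textDecoder = new TextDecoder("utf-8");

// Assumed threshold: strings up to this many bytes stay in JS.
const MAX_JS_LEN = 64;

// One preallocated array per possible length, reused across calls so the
// hot path never allocates a fresh argument array.
const pooledArrays = [];
for (let len = 0; len <= MAX_JS_LEN; len++) {
  pooledArrays.push(new Array(len).fill(0));
}

function decodeOutsideSource(uint8, pos, len) {
  if (len <= MAX_JS_LEN) {
    const arr = pooledArrays[len];
    let i = 0;
    for (; i < len; i++) {
      const b = uint8[pos + i];
      if (b >= 0x80) break; // non-ASCII byte: give up on the JS path
      arr[i] = b;
    }
    // Every byte was ASCII, so char codes equal byte values.
    if (i === len) return fromCharCode.apply(null, arr);
  }
  // Long or non-ASCII strings: native UTF-8 decode.
  return textDecoder.decode(uint8.subarray(pos, pos + len));
}
```

Because the pooled array for a given length is fully overwritten before each `apply`, stale contents from an earlier call never leak into the result.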

joshuaisaact added 18 commits March 30, 2026 19:47

baseline: PR #20834 + firstNonAsciiPos (threshold 9)

5ab8da0

exp1: fromCharCode + inline textDecoder.decode

e1360e2

exp4: batch fromCharCode with switch on len

cfedd0b

exp6: extend batch fromCharCode to threshold 12

695f269

exp7: extend batch fromCharCode to threshold 16

e4b31fb

exp8: extend batch fromCharCode to threshold 24

842dce4

exp10: extend batch fromCharCode to threshold 32

3ef3cd0

exp13: extend batch fromCharCode to threshold 48

e08bc8d

exp15: pre-decode buffer as latin1, substr for ASCII strings

04fdac8

exp19: strDataIsAscii flag + streamlined branching

5628183

exp24: minimize branching with firstNonAsciiBufPos

ebfbf99

exp24b: fix source boundary + separate source/non-source paths

a07cdd4

exp26: add lastNonAsciiSrcEnd for tail-ASCII fast path

69d3b2d

exp29: cumulative non-ASCII count for O(1) ASCII range check

86720dc

exp30: simplified - only cumulative count, no first/lastNonAscii

e274000

exp31: maximum simplification - no sourceText, pure cumulative + bufferAsAscii

bc3c7bf

exp31b: restore sourceText path for ASCII source boundary strings

b94307c

exp32: micro-opts - remove len===0 check, inline pos read

c668176

joshuaisaact mentioned this pull request Mar 30, 2026

perf(napi/parser): use batched fromCharCode for short ASCII string deserialization oxc-project/oxc#20883

Closed

overlookmotel force-pushed the main branch 3 times, most recently from 1c3a0e3 to 0040360 on March 31, 2026 16:47

joshuaisaact added 2 commits April 1, 2026 10:20

overlookmotel force-pushed the main branch 2 times, most recently from 45ecc3e to 4f96275 on April 3, 2026 19:42

overlookmotel mentioned this pull request Apr 3, 2026

perf(napi/parser, linter/plugins): speed up decoding strings in raw transfer oxc-project/oxc#21021

Merged

joshuaisaact wants to merge 20 commits into overlookmotel:main from joshuaisaact:autoresearch/mar30
