deserializeStr experiments: fromCharCode batching and pre-computation tricks#1

Draft
joshuaisaact wants to merge 20 commits into
overlookmotel:mainfrom
joshuaisaact:autoresearch/mar30

@joshuaisaact

Why

Exploring how far deserializeStr can be pushed on M4 Mac. This is the hottest function in the NAPI raw transfer deserialization path -- every string in the AST goes through it.

What we tried

Starting from the PR #20834 baseline (firstNonAsciiPos + threshold 9, 56.4ms across 25 fixtures):

The fromCharCode batching discovery (exp4-exp13, 56ms -> 33ms):

The baseline builds short strings byte-by-byte with out += fromCodePoint(c). Turns out a single String.fromCharCode(b0, b1, b2, ...) call is dramatically faster -- V8 can allocate the string in one shot instead of concatenating. We check if all bytes are ASCII first, then dispatch through a switch on length. Raising the threshold from 9 to 48 kept improving things. Each step was a clear win. Past 48 the gains flatlined.
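A minimal sketch of the check-then-switch idea, shortened to lengths 0-4 for readability. The helper name `asciiStr` and the `null` "fall back to another path" convention are assumptions for illustration, not the code from the experiment commits.

```javascript
// Hypothetical helper: decode `len` bytes starting at `pos`, or return null
// if any byte is non-ASCII or the length is not covered by the switch.
function asciiStr(uint8, pos, len) {
  // Check ASCII upfront: any byte >= 128 disqualifies the string.
  for (let i = pos; i < pos + len; i++) {
    if (uint8[i] >= 128) return null;
  }
  // One direct call per length, so each call site has a fixed argument count.
  switch (len) {
    case 0: return "";
    case 1: return String.fromCharCode(uint8[pos]);
    case 2: return String.fromCharCode(uint8[pos], uint8[pos + 1]);
    case 3: return String.fromCharCode(uint8[pos], uint8[pos + 1], uint8[pos + 2]);
    case 4: return String.fromCharCode(uint8[pos], uint8[pos + 1], uint8[pos + 2], uint8[pos + 3]);
    // The experiment continues this pattern up to the threshold (48).
    default: return null;
  }
}

const bytes = new TextEncoder().encode("let x = 1; // é");
console.log(asciiStr(bytes, 0, 3)); // "let"
```

A caller would treat `null` as "use the existing slow path" (concatenation or `TextDecoder`).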

We also tried fromCharCode.apply(null, uint8.subarray(...)) as a cleaner alternative to the switch -- it was 2x slower. The subarray allocation + .apply overhead kills it. The ugly switch wins because V8 sees a direct call with a known argument count.
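For comparison, the `.apply` variant looks roughly like this. The function name is made up for the example, and it assumes the caller has already verified the bytes are ASCII.

```javascript
// Hypothetical helper: every call allocates a new Uint8Array view via
// `subarray`, then spreads it into arguments through `.apply`.
function asciiStrApply(uint8, pos, len) {
  return String.fromCharCode.apply(null, uint8.subarray(pos, pos + len));
}

const bytes = new TextEncoder().encode("let x = 1;");
console.log(asciiStrApply(bytes, 0, 3)); // "let"
```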

The latin1 pre-decode trick (exp15, 33ms -> 19ms):

Decoding the entire buffer as latin1 in setup() gives a string where byte offsets map 1:1 to character offsets. For any ASCII string, bufferAsAscii.substr(pos, len) is a direct slice -- no TextDecoder, no byte scanning. This is where it started getting good but also where setup cost started mattering.
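A small sketch of why the offsets line up, assuming a Node.js `Buffer`. It uses the documented `toString("latin1")` call, and the sample source text is invented for the example.

```javascript
const source = "const café = 'ok';";
const uint8 = new TextEncoder().encode(source);

// setup(): decode every byte as exactly one Latin1 character (once per buffer).
const bufferAsAscii = Buffer.from(uint8.buffer, uint8.byteOffset, uint8.length).toString("latin1");

// "é" is 2 bytes in UTF-8 but 1 UTF-16 code unit, so byte offsets run one
// ahead of UTF-16 offsets for everything after it.
const pos = 15; // byte offset of "ok"
const len = 2;

console.log(bufferAsAscii.substr(pos, len)); // "ok"
console.log(source.substr(pos, len)); // "k'" (wrong: UTF-16 offsets differ)
```

The slice from `bufferAsAscii` is only valid when the requested range is all ASCII. A range containing the bytes of "é" would come back as two Latin1 characters, so it still needs the ASCII check.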

The cumulative count trick (exp29-exp32, 19ms -> 4.2ms):

Pre-computing a prefix sum of non-ASCII byte counts (entry i holds the number of non-ASCII bytes before offset i) makes "is this range all ASCII?" an O(1) check (two array lookups). This eliminated the per-byte ASCII scan that dominated the heavy files. typescript went from 6.6ms to 1.2ms.
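A sketch of the cumulative array, assuming it stores the running count of non-ASCII bytes. The function names `buildCumulative` and `isAsciiRange` are illustrative, not taken from experiment.mjs.

```javascript
// setup(): cumulative[i] = number of non-ASCII bytes in uint8[0..i).
function buildCumulative(uint8) {
  const cumulative = new Uint32Array(uint8.length + 1);
  for (let i = 0; i < uint8.length; i++) {
    // `>> 7` yields 1 for bytes >= 128 and 0 for ASCII bytes.
    cumulative[i + 1] = cumulative[i] + (uint8[i] >> 7);
  }
  return cumulative;
}

// A range is all ASCII if the count did not change across it: two lookups.
function isAsciiRange(cumulative, pos, len) {
  return cumulative[pos + len] === cumulative[pos];
}

const uint8 = new TextEncoder().encode("const café = 'ok';");
const cumulative = buildCumulative(uint8);
console.log(isAsciiRange(cumulative, 15, 2)); // true  ("ok")
console.log(isAsciiRange(cumulative, 6, 5)); // false ("café")
```

The `Uint32Array` here is the same `bufferLength + 1` allocation discussed below, which is where the memory cost comes from.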

Where it went off the rails

The 56ms -> 4ms headline number is real but dishonest. We moved most of the work into setup(), which the benchmark doesn't time. The cumulative array is a Uint32Array(bufferLength + 1) -- for a 10MB source file that's 40MB of extra memory. The latin1 pre-decode is another full copy of the buffer as a string. You'd never ship this in production.

We also burned a bunch of experiments trying to build a byteToChar mapping for source strings past firstNonAsciiPos (exp20, exp22a-c). The idea was sound -- map byte offsets to character offsets so sourceText.substr works everywhere -- but it kept breaking on edge cases: UTF-16 surrogate pairs, malformed UTF-8, strings that span the source/strData boundary. We abandoned it after 3 failed attempts.

What's actually shippable

exp13 (33ms, -42% vs baseline) is the honest win. Zero setup cost, zero extra memory, same fast paths as the baseline. The only change is how 1-48 byte non-source ASCII strings are built: check ASCII upfront, then one fromCharCode call via a switch on length. The switch is ugly but V8 loves it.

The latin1 pre-decode (exp15) might be worth it if the memory budget allows -- it's one extra string copy of the buffer, which is modest. Whether the cumulative array is worth it depends on how many strings you're decoding and how expensive setup is in the real pipeline.

@overlookmotel would love your take on which of these (if any) are worth pulling into the real deserializer. The commits are all in here if you want to poke at individual experiments.

References

  • oxc PR #20834 (the prior experiment round this builds on)

@joshuaisaact
Author

Think there's a LOT of unviable stuff in here, but some could be viable...

@joshuaisaact
Author

Exp 13 looks viable - will split out into another PR

@overlookmotel
Owner

overlookmotel commented Mar 31, 2026

Thanks for diving into this!

I'm unclear what the current version of versions/experiment.mjs is doing. Cum??

Would you be able to ask Claude to write a summary of the various things he tried and why they were rejected?

Please feel free to make a PR adding a ton of different versions to the versions directory. Please just follow the template (comment the code, and a comment at top explaining which other version it builds on, and what the change made is). To avoid a ludicrously large benchmark table, we can always look at specific ones with the FILTER var (see README).

By the way, ultimately the best solution will likely be to get the UTF8 to UTF16 translation table that we already have on Rust side over to JS, so every string in the source code can take the sourceText.substr(...) fast path. But it's a bit of a palaver to make the changes to do that, so anything we can do on JS side for now is a win.

Here's the Rust-side code, if you're interested:
https://github.com/oxc-project/oxc/blob/af72b802be621fbea6e6ca1fbfc9a685c978b6fc/crates/oxc_ast_visit/src/utf8_to_utf16/translation.rs

@overlookmotel overlookmotel force-pushed the main branch 3 times, most recently from 1c3a0e3 to 0040360 Compare March 31, 2026 16:47
@overlookmotel
Owner

overlookmotel commented Mar 31, 2026

Oh I understand the cum now. Decoding to latin1 string is a masterstroke!

I've added a simpler version which doesn't have as high setup cost to the benchmarks. It's the winner so far.

I've not finessed the ideal switch-over point.

@joshuaisaact
Author

My bad on raising a PR too soon. Got overexcited!

Interesting.... I'll have a mess around with it today too

@overlookmotel
Owner

overlookmotel commented Apr 1, 2026

Latin1 has changed the game! We can probably tweak it a bit more but I doubt there are any more fundamental breakthroughs (I think) left to find, without having more extensive setup work, which I think we should probably avoid. So I imagine latin-source64 will be the base of the final solution (probably with the latin-*-chunk64 optimization added in).

Would be interested if you can find any way to finesse it though. There's probably a few more % to be had from finding the best switch-over points, and maybe branch reorganisation.

Also relevant: oxc-project/oxc#20923 which skips calling deserializeStr entirely in some cases.

… table

Scans the source region once in setup() to build a sparse translation
table mapping multi-byte UTF-8 character positions to cumulative
byte-vs-codeunit drift. deserializeStr() binary searches this table to
convert byte offsets to UTF-16 offsets, extending sourceText.substr()
to all source strings — not just those in the ASCII prefix.

Benchmarks show 25-65% improvement over current on non-ASCII files,
though the dense cumulative array approach (experiment.mjs) remains
faster due to O(1) lookups vs O(log k) binary search.

A string starting in the source region but extending past sourceEndPos
would get truncated by sourceText.substr(). Changed the guard from
pos < sourceEndPos to pos + len <= sourceEndPos so boundary-spanning
strings correctly fall through to the TextDecoder path.
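A rough sketch of the sparse table and binary search described in these commit messages. It assumes valid UTF-8 and offsets that fall on character boundaries. The names (`buildDriftTable`, `byteToUtf16`, `sliceSource`) and the table layout are guesses for illustration, not the committed code.

```javascript
// setup(): one entry per multi-byte character in the source region.
// byteEnds[k] = byte offset just past the character,
// drifts[k]   = cumulative (UTF-8 bytes - UTF-16 code units) up to that point.
function buildDriftTable(uint8, sourceEndPos) {
  const byteEnds = [];
  const drifts = [];
  let drift = 0;
  for (let i = 0; i < sourceEndPos; ) {
    const b = uint8[i];
    if (b < 0x80) { i++; continue; }
    const byteLen = b >= 0xf0 ? 4 : b >= 0xe0 ? 3 : 2;
    // 2 bytes -> 1 unit (+1), 3 bytes -> 1 unit (+2), 4 bytes -> 2 units (+2).
    drift += byteLen === 2 ? 1 : 2;
    i += byteLen;
    byteEnds.push(i);
    drifts.push(drift);
  }
  return { byteEnds, drifts };
}

// Binary search for the number of entries with byteEnd <= pos: O(log k).
function byteToUtf16({ byteEnds, drifts }, pos) {
  let lo = 0;
  let hi = byteEnds.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (byteEnds[mid] <= pos) lo = mid + 1;
    else hi = mid;
  }
  return lo === 0 ? pos : pos - drifts[lo - 1];
}

function sliceSource(sourceText, table, sourceEndPos, pos, len) {
  // Guard from the second commit: the whole string must lie inside the source region.
  if (pos + len > sourceEndPos) return null;
  return sourceText.slice(byteToUtf16(table, pos), byteToUtf16(table, pos + len));
}

const sourceText = "a😀é = 'ok'";
const uint8 = new TextEncoder().encode(sourceText);
const table = buildDriftTable(uint8, uint8.length);
console.log(sliceSource(sourceText, table, uint8.length, 11, 2)); // "ok"
```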
@overlookmotel
Owner

overlookmotel commented Apr 1, 2026

FYI it's an invariant of how strings are constructed that they cannot cross the boundary between source text region and other strings region. I saw comment in PR description about strings being found which did cross the boundary. I think that must have been a bug in the deserializeStr impl. You can run pnpm run verify to check all versions produce identical output to the original.

Also, just FYI, I've updated the fixtures after oxc-project/oxc#20923, which means some files have a lot less deserializeStr calls now. It doesn't seem to alter the results considerably though.

@overlookmotel
Owner

overlookmotel commented Apr 3, 2026

I think there's probably more we can do, but I'm off for a few days and was keen for some of this work to get into Monday's release. So I've merged the current winner latin-slice-onebyte64 into Oxc (oxc-project/oxc#21021 + the other PRs in that stack).

Would be very happy to receive further improvements though.

graphite-app Bot pushed a commit to oxc-project/oxc that referenced this pull request Apr 3, 2026
…ransfer (#21021)

Improve perf of deserializing strings in raw transfer. This PR combines several optimizations, which have been tested and benchmarked in https://github.com/overlookmotel/oxc-raw-str-bench. This PR implements the version "latin-slice-onebyte64" from that repo, which is the current winner.

String deserialization is the main bottleneck in raw transfer, so speeding it up will likely make a large impact on deserialization overall.

This work follows on from #20834 which produced a major speed-up in many files by making files which contain some non-ASCII characters take the fast path of slicing `sourceText` more often.

This PR tackles the remainder - speeding up the fallback path where the fast path can't be taken.

## Optimizations

The optimizations in this PR are:

### Latin1

When source is not 100% ASCII, decode source text from buffer as Latin1.

A Latin1-decoded string represents each UTF-8 byte as a single Latin1 character, so it can be indexed into using UTF-8 offsets.

So when we can't slice the string from `sourceText` because the UTF-8 and UTF-16 offsets differ (after any non-ASCII character), loop through the string's bytes and check if they're all ASCII. If they are, the string can be sliced from `sourceTextLatin` instead, with the original UTF-8 offsets.
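A simplified sketch of that control flow, assuming Node.js. The factory `makeDeserializeStr` and its exact branch layout are illustrative assumptions. The real deserializer has additional paths (strings outside the source region, the length threshold) that are left out here.

```javascript
function makeDeserializeStr(uint8, sourceEndPos) {
  const buf = Buffer.from(uint8.buffer, uint8.byteOffset, uint8.length);
  const sourceText = buf.toString("utf8", 0, sourceEndPos);
  // One Latin1 character per UTF-8 byte, so UTF-8 offsets index it directly.
  const sourceTextLatin = buf.latin1Slice(0, sourceEndPos);
  const textDecoder = new TextDecoder();

  // Before the first non-ASCII byte, UTF-8 and UTF-16 offsets are identical.
  let firstNonAsciiPos = 0;
  while (firstNonAsciiPos < sourceEndPos && uint8[firstNonAsciiPos] < 128) firstNonAsciiPos++;

  return function deserializeStr(pos, len) {
    const end = pos + len;
    if (end <= firstNonAsciiPos) return sourceText.substr(pos, len);
    if (end <= sourceEndPos) {
      // Offsets have diverged: scan the bytes, and slice the Latin1 string if all ASCII.
      let i = pos;
      while (i < end && uint8[i] < 128) i++;
      if (i === end) return sourceTextLatin.substr(pos, len);
    }
    // String contains non-ASCII bytes: fall back to the native decoder.
    return textDecoder.decode(uint8.subarray(pos, end));
  };
}

const uint8 = new TextEncoder().encode("const café = 'ok';");
const deserializeStr = makeDeserializeStr(uint8, uint8.length);
console.log(deserializeStr(15, 2)); // "ok", sliced from the Latin1 string
```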

This is way faster than calling `textDecoder.decode`, as it avoids a call into C++. [Benchmarks show](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-latin-vs-latin-source64) speed up of 55% on average, and up to 70% on some files.

### Latin1 decoding method

It turns out that `new TextDecoder("latin1").decode(arr)` doesn't actually decode to Latin1!

Per the WHATWG Encoding Standard, "latin1" is mapped to "windows-1252".

The result is that with `TextDecoder("latin1")`:

1. `decode` is quite complicated, requiring a scan of the bytes to determine if they're all ASCII, followed by a 2nd pass to do the actual `windows-1252` decoding. If the string *does* contain any non-ASCII characters (which it always does in our use case), NodeJS implements the decoding in JS, not native code. Slow.
2. `decode` produces a 2-byte-per-char string (`TWO_BYTE` in V8), which takes more memory, and is slower for all operations on it e.g. string comparison, hashing for use as an object key etc.

Instead, use `Buffer.prototype.latin1Slice` which:

1. Does a pure Latin1 decode, which is just a single `memcpy` call.
2. Produces a 1-byte-per-char string (`ONE_BYTE` in V8).

`latin1Slice` involves a call into C++, but we only do it once per file, so this cost is tiny in context of deserializing the whole AST.
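A small sketch contrasting the two calls, assuming Node.js. The sample bytes are made up, and no output is asserted for the `TextDecoder` side, since the point is only that it goes through the `windows-1252` decoder rather than a byte-for-byte mapping.

```javascript
// "a", the two UTF-8 bytes of "é", and a lone 0x80 byte.
const bytes = Uint8Array.of(0x61, 0xc3, 0xa9, 0x80);
const buf = Buffer.from(bytes.buffer, bytes.byteOffset, bytes.length);

// Pure Latin1: every byte becomes the code unit with the same numeric value.
const latin = buf.latin1Slice(0, bytes.length);
console.log(latin.length); // 4
console.log(latin.charCodeAt(3).toString(16)); // "80"

// The WHATWG label "latin1" selects windows-1252, which remaps bytes 0x80-0x9F.
const decoder = new TextDecoder("latin1");
console.log(decoder.encoding);
console.log(decoder.decode(bytes).charCodeAt(3).toString(16));
```

Because each byte maps to the code unit of the same value, `latin.charCodeAt(i) === bytes[i]` holds for every index, which is what makes UTF-8 offsets usable on the decoded string.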

### Latin1 string slicing

In the fast path, slice from the Latin1-decoded string, instead of `sourceText`. In the fast path, we know that all bytes of source comprising the string are ASCII, so no further checks are required.

This makes no difference on benchmarks for `deserializeStr` itself, but it may have beneficial effects downstream for code (e.g. lint rules) which access strings in the AST, e.g. `Identifier` names.

Because Latin1-decoded source text is `ONE_BYTE`-encoded, slices of it are too. In comparison, slices of `sourceText` may be `ONE_BYTE` or `TWO_BYTE`: if a file's source is pure ASCII, it'll be `ONE_BYTE`; if the source contains any non-ASCII characters, it'll be `TWO_BYTE`. Files in a repo will likely be a mix of both, which makes strings returned from `deserializeStr` and placed in the AST a mix too. This in turn makes functions (e.g. lint rule visitors) polymorphic, and V8 cannot optimize them as aggressively as it could if they only saw `ONE_BYTE` strings.

We cannot make sure that all strings returned by `deserializeStr` are `ONE_BYTE`. Some strings may contain non-ASCII characters, and they *have* to be represented in `TWO_BYTE` form. But we can minimize it: now only strings which *themselves* contain non-ASCII characters are `TWO_BYTE`, whereas before they would all be `TWO_BYTE` if the source text as a whole contained a single non-ASCII byte.

Code which accesses `Identifier` names, for example, will almost exclusively see `ONE_BYTE` strings and will be more heavily optimized, because Unicode `Identifier`s are rarer than hen's teeth in real-world code.

### Remove string-concatenation loop

Previously strings which are outside of source text were assembled byte-by-byte in a loop via concatenation.

Instead, check that all the bytes are ASCII first, copy them into an array and pass that array to `String.fromCharCode` with `fromCharCode.apply(null, array)`.

To avoid allocating a fresh array every time, hold a stock of arrays for all string lengths that this path can require, and reuse them.
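A minimal sketch of the pooled-array approach. The names (`arrays`, `asciiStrPooled`, `MAX_LEN`) are illustrative, and the ASCII check is merged into the copy loop here for brevity, which may differ from the merged implementation.

```javascript
const MAX_LEN = 64;

// One reusable array per possible string length, allocated once up front.
const arrays = [];
for (let len = 0; len <= MAX_LEN; len++) arrays.push(new Array(len).fill(0));

// Hypothetical helper: returns null when the caller should use another path.
function asciiStrPooled(uint8, pos, len) {
  if (len > MAX_LEN) return null;
  const arr = arrays[len];
  for (let i = 0; i < len; i++) {
    const b = uint8[pos + i];
    if (b >= 128) return null; // non-ASCII: bail out
    arr[i] = b;
  }
  // The array's length equals the string's length, so no per-call allocation or trimming.
  return String.fromCharCode.apply(null, arr);
}

const bytes = new TextEncoder().encode("let x = 1; // é");
console.log(asciiStrPooled(bytes, 0, 10)); // "let x = 1;"
```

The pooled arrays are overwritten on every call, which is safe only because `fromCharCode` copies the values into a new string before the next call reuses the array.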

This is a variation on the approach that #20883 took, but without the massive switch. This produces much tighter assembly, and avoids regressing the fast path due to making `deserializeStr` a very large function.

Despite the complexity, and multiple operations, [this is up to 3x faster](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-switch30) than the switch approach, and gives an average 30% speed-up.

### Increase native call threshold

The above optimizations make the slow path much faster. This shifts the tipping point at which it's faster to make a native call to `TextDecoder.decode` from 9 bytes to 64 bytes. Most strings now avoid the native call and stay in JS code which is heavily optimized by TurboFan.

The tipping point of 64 is something of a guesstimate. Benchmarking shows it's in the right ballpark, but we could finesse it, and probably squeeze out another couple of %.
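One way to experiment with the tipping point is to make it a parameter, sketched below. `makeDecoder` is a hypothetical factory, and the JS path uses `subarray` + `.apply` purely to keep the example short, not because it is the fastest option.

```javascript
const textDecoder = new TextDecoder();

// Hypothetical factory: `threshold` is the longest string (in bytes) decoded in JS.
function makeDecoder(threshold) {
  return function decode(uint8, pos, len) {
    const end = pos + len;
    if (len <= threshold) {
      let i = pos;
      while (i < end && uint8[i] < 128) i++;
      // All ASCII and short enough: stay in JS.
      if (i === end) return String.fromCharCode.apply(null, uint8.subarray(pos, end));
    }
    // Too long, or contains non-ASCII bytes: make the native call.
    return textDecoder.decode(uint8.subarray(pos, end));
  };
}

const bytes = new TextEncoder().encode("x".repeat(40) + "é");
const decode9 = makeDecoder(9);
const decode64 = makeDecoder(64);

// Both thresholds must produce identical strings; only the path taken differs.
console.log(decode9(bytes, 0, 40) === decode64(bytes, 0, 40)); // true
```

Any threshold change should leave output identical, so a benchmark can vary it freely while a check like `pnpm run verify` confirms correctness.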

## Credit

The Latin1 string technique was cooked up by @joshuaisaact in overlookmotel/oxc-raw-str-bench#1. All credit to him for this masterstroke which cracks the whole problem!