deserializeStr experiments: fromCharCode batching and pre-computation tricks #1
joshuaisaact wants to merge 20 commits into
|
Think there's a LOT of unviable stuff in here, but some could be viable... |
|
Exp 13 looks viable - will split out into another PR |
|
Thanks for diving into this! I'm unclear what the current version of versions/experiment.mjs is doing. Cum?? Would you be able to ask Claude to write a summary of the various things he tried and why they were rejected? Please feel free to make a PR adding a ton of different versions to the By the way, ultimately the best solution will likely be to get the UTF8 to UTF16 translation table that we already have on Rust side over to JS, so every string in the source code can take the Here's the Rust-side code, if you're interested: |
1c3a0e3 to
0040360
|
Oh I understand the cum now. Decoding to latin1 string is a masterstroke! I've added a simpler version which doesn't have as high setup cost to the benchmarks. It's the winner so far. I've not finessed the ideal switch-over point. |
|
My bad on raising a PR too soon. Got overexcited! Interesting.... I'll have a mess around with it today too |
|
Latin1 has changed the game! We can probably tweak it a bit more but I doubt there are any more fundamental breakthroughs (I think) left to find, without having more extensive setup work, which I think we should probably avoid. So I imagine Would be interested if you can find any way to finesse it though. There's probably a few more % to be had from finding the best switch-over points, and maybe branch reorganisation. Also relevant: oxc-project/oxc#20923 which skips calling |
… table Scans the source region once in setup() to build a sparse translation table mapping multi-byte UTF-8 character positions to cumulative byte-vs-codeunit drift. deserializeStr() binary searches this table to convert byte offsets to UTF-16 offsets, extending sourceText.substr() to all source strings — not just those in the ASCII prefix. Benchmarks show 25-65% improvement over current on non-ASCII files, though the dense cumulative array approach (experiment.mjs) remains faster due to O(1) lookups vs O(log k) binary search.
A string starting in the source region but extending past sourceEndPos would get truncated by sourceText.substr(). Changed the guard from pos < sourceEndPos to pos + len <= sourceEndPos so boundary-spanning strings correctly fall through to the TextDecoder path.
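
The sparse translation table described in the first commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the helper names `buildDriftTable` and `byteToUtf16Offset` are hypothetical, the input is assumed to be well-formed UTF-8, and offsets passed to the lookup are assumed to fall on character boundaries.

```javascript
// Hypothetical sketch: one entry per multi-byte UTF-8 character.
function buildDriftTable(bytes) {
  const ends = [];   // byte offset just past each multi-byte character
  const drifts = []; // cumulative (UTF-8 bytes - UTF-16 code units) at that offset
  let drift = 0;
  for (let i = 0; i < bytes.length; ) {
    const b = bytes[i];
    if (b < 0x80) {
      i += 1; // ASCII: 1 byte, 1 code unit, no drift
      continue;
    }
    // The lead byte gives the sequence length (2, 3 or 4 bytes).
    const len = b >= 0xf0 ? 4 : b >= 0xe0 ? 3 : 2;
    // 4-byte sequences decode to a surrogate pair (2 code units), others to 1.
    drift += len - (len === 4 ? 2 : 1);
    i += len;
    ends.push(i);
    drifts.push(drift);
  }
  return { ends: Uint32Array.from(ends), drifts: Uint32Array.from(drifts) };
}

// O(log k) lookup, where k is the number of multi-byte characters.
function byteToUtf16Offset(table, pos) {
  const { ends, drifts } = table;
  let lo = 0;
  let hi = ends.length;
  // Find how many entries end at or before `pos`.
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ends[mid] <= pos) lo = mid + 1;
    else hi = mid;
  }
  return lo === 0 ? pos : pos - drifts[lo - 1];
}

const sourceText = 'const é = "日本"; let x = "😀!";';
const bytes = new TextEncoder().encode(sourceText);
const table = buildDriftTable(bytes);

// Slice a string out of `sourceText` using its UTF-8 byte offset and length.
const bytePos = Buffer.from(sourceText).indexOf("let");
const start = byteToUtf16Offset(table, bytePos);
const end = byteToUtf16Offset(table, bytePos + 3);
console.log(sourceText.slice(start, end));
```

The table only grows with the number of multi-byte characters, which is the memory trade-off against the dense cumulative array: lookups cost a binary search instead of two array reads.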
|
FYI it's an invariant of how strings are constructed that they cannot cross the boundary between the source text region and the other strings region. I saw a comment in the PR description about strings being found which did cross the boundary. I think that must have been a bug in the Also, just FYI, I've updated the fixtures after oxc-project/oxc#20923, which means some files have a lot less |
45ecc3e to
4f96275
|
I think there's probably more we can do, but I'm off for a few days and was keen for some of this work to get into Monday's release. So I've merged the current winner Would be very happy to receive further improvements though. |
…ransfer (#21021)

Improve perf of deserializing strings in raw transfer.

This PR combines several optimizations, which have been tested and benchmarked in https://github.com/overlookmotel/oxc-raw-str-bench. This PR implements the version "latin-slice-onebyte64" from that repo, which is the current winner.

String deserialization is the main bottleneck in raw transfer, so speeding it up will likely make a large impact on deserialization overall.

This work follows on from #20834 which produced a major speed-up in many files by making files which contain some non-ASCII characters take the fast path of slicing `sourceText` more often. This PR tackles the remainder - speeding up the fallback path where the fast path can't be taken.

## Optimizations

The optimizations in this PR are:

### Latin1

When source is not 100% ASCII, decode source text from buffer as Latin1. A Latin1-decoded string represents each UTF-8 byte as a single Latin1 character, so it can be indexed into using UTF-8 offsets.

So when we can't slice the string from `sourceText` because the UTF-8 and UTF-16 offsets differ (after any non-ASCII character), loop through the string's bytes and check if they're all ASCII. If they are, the string can be sliced from `sourceTextLatin` instead, with the original UTF-8 offsets.

This is way faster than calling `textDecoder.decode`, as it avoids a call into C++.

[Benchmarks show](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-latin-vs-latin-source64) a speed-up of 55% on average, and up to 70% on some files.

### Latin1 decoding method

It turns out that `new TextDecoder("latin1").decode(arr)` doesn't actually decode to Latin1! Per the WHATWG Encoding Standard, "latin1" is mapped to "windows-1252".

The result is that with `TextDecoder("latin1")`:

1. `decode` is quite complicated, requiring a first scan of the bytes to determine if they're all ASCII, followed by a 2nd pass to do the actual `windows-1252` decoding. If the string *does* contain any non-ASCII characters (which it always does in our use case), NodeJS implements the decoding in JS, not native code. Slow.
2. `decode` produces a 2-byte-per-char string (`TWO_BYTE` in V8), which takes more memory, and is slower for all operations on it e.g. string comparison, hashing for use as an object key etc.

Instead, use `Buffer.prototype.latin1Slice` which:

1. Does a pure Latin1 decode, which is just a single `memcpy` call.
2. Produces a 1-byte-per-char string (`ONE_BYTE` in V8).

`latin1Slice` involves a call into C++, but we only do it once per file, so this cost is tiny in the context of deserializing the whole AST.

### Latin1 string slicing

In the fast path, slice from the Latin1-decoded string instead of `sourceText`. In the fast path we know that all bytes of source comprising the string are ASCII, so no further checks are required.

This makes no difference on benchmarks for `deserializeStr` itself, but it may have beneficial effects downstream for code (e.g. lint rules) which accesses strings in the AST, e.g. `Identifier` names.

Because Latin1-decoded source text is `ONE_BYTE`-encoded, slices of it are too. In comparison, slices of `sourceText` may be `ONE_BYTE` or `TWO_BYTE`. If a file's source is pure ASCII, it'll be `ONE_BYTE`; if the source contains any non-ASCII characters, it'll be `TWO_BYTE`. Files in a repo will likely be a mix of both, which makes strings returned from `deserializeStr` and placed in the AST a mix too. This in turn makes functions (e.g. lint rule visitors) polymorphic. V8 cannot optimize them as aggressively as if they see only `ONE_BYTE` strings.

We cannot make sure that all strings returned by `deserializeStr` are `ONE_BYTE`. Some strings may contain non-ASCII characters, and they *have* to be represented in `TWO_BYTE` form. But we can minimize it - now only strings which *themselves* contain non-ASCII characters are `TWO_BYTE`, whereas before they all would be if the source text as a whole contained even a single non-ASCII byte.

Code which accesses `Identifier` names, for example, will almost exclusively see `ONE_BYTE` strings and will be more heavily optimized, because Unicode `Identifier`s are rarer than hen's teeth in real-world code.

### Remove string-concatenation loop

Previously strings which are outside of source text were assembled byte-by-byte in a loop via concatenation.

Instead, check that all the bytes are ASCII first, copy them into an array and pass that array to `String.fromCharCode` with `fromCharCode.apply(null, array)`. To avoid allocating a fresh array every time, hold a stock of arrays for all string lengths that this path can require, and reuse them.

This is a variation on the approach that #20883 took, but without the massive switch. This produces much tighter assembly, and avoids regressing the fast path due to making `deserializeStr` a very large function.

Despite the complexity, and multiple operations, [this is up to 3x faster](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-switch30) than the switch approach, and gives an average 30% speed-up.

### Increase native call threshold

The above optimizations make the slow path much faster. This shifts the tipping point at which it's faster to make a native call to `TextDecoder.decode` from 9 bytes to 64 bytes. Most strings now avoid the native call and stay in JS code which is heavily optimized by Turbofan.

The tipping point of 64 is something of a guesstimate. Benchmarking shows it's in the right ballpark, but we could finesse it, and probably squeeze out another couple of %.

## Credit

The Latin1 string technique was cooked up by @joshuaisaact in overlookmotel/oxc-raw-str-bench#1. All credit to him for this masterstroke which cracks the whole problem!
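
As a rough illustration of the Latin1 technique described in the commit message above, here is a minimal Node.js sketch. The names `sourceBytes`, `sourceLatin` and `sliceIfAscii` are hypothetical and the real `deserializeStr` is structured differently; the only API assumed is `Buffer.prototype.latin1Slice`, which the commit message itself names.

```javascript
// Hypothetical sketch, not the merged implementation.
const sourceText = 'let café = "naïve"; let answer = 42;';
const sourceBytes = Buffer.from(sourceText, "utf8");

// One native call per file: every UTF-8 byte becomes exactly one Latin1 character,
// so UTF-8 byte offsets can index into this string directly.
const sourceLatin = sourceBytes.latin1Slice(0, sourceBytes.length);

// Return the string at byte range [pos, pos + len) if it is pure ASCII,
// or null so the caller can fall back to a full UTF-8 decode.
function sliceIfAscii(pos, len) {
  const end = pos + len;
  for (let i = pos; i < end; i++) {
    if (sourceBytes[i] >= 0x80) return null;
  }
  return sourceLatin.slice(pos, end);
}

// "answer" sits after two 2-byte characters, so its byte offset is 2 larger
// than its UTF-16 offset and `sourceText` cannot be sliced with it.
const bytePos = sourceBytes.indexOf("answer");
console.log(sliceIfAscii(bytePos, 6));
console.log(sourceText.slice(bytePos, bytePos + 6));
```

Slicing `sourceText` with the byte offset returns the wrong characters here, while the Latin1 string returns the intended identifier because its indices are byte offsets by construction.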
Why

Exploring how far `deserializeStr` can be pushed on M4 Mac. This is the hottest function in the NAPI raw transfer deserialization path -- every string in the AST goes through it.

What we tried

Starting from the PR #20834 baseline (`firstNonAsciiPos` + threshold 9, 56.4ms across 25 fixtures):

The fromCharCode batching discovery (exp4-exp13, 56ms -> 33ms):

The baseline builds short strings byte-by-byte with `out += fromCodePoint(c)`. Turns out a single `String.fromCharCode(b0, b1, b2, ...)` call is dramatically faster -- V8 can allocate the string in one shot instead of concatenating. We check if all bytes are ASCII first, then dispatch through a switch on length. Raising the threshold from 9 to 48 kept improving things. Each step was a clear win. Past 48 the gains flatlined.

We also tried `fromCharCode.apply(null, uint8.subarray(...))` as a cleaner alternative to the switch -- it was 2x slower. The subarray allocation + `.apply` overhead kills it. The ugly switch wins because V8 sees a direct call with a known argument count.

The latin1 pre-decode trick (exp15, 33ms -> 19ms):

Decoding the entire buffer as latin1 in `setup()` gives a string where byte offsets map 1:1 to character offsets. For any ASCII string, `bufferAsAscii.substr(pos, len)` is a direct slice -- no TextDecoder, no byte scanning. This is where it started getting good but also where setup cost started mattering.

The cumulative count trick (exp29-exp32, 19ms -> 4.2ms):

Pre-computing a prefix sum that counts the non-ASCII bytes before each position makes "is this range all ASCII?" an O(1) check (two array lookups). This eliminated the per-byte ASCII scan that dominated the heavy files. typescript went from 6.6ms to 1.2ms.
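
A minimal sketch of the cumulative count idea, assuming the array layout is one `Uint32Array` slot per byte plus one. The names `buildNonAsciiPrefix` and `isAsciiRange` are hypothetical, not taken from the experiment files.

```javascript
// Hypothetical sketch: cum[i] = number of non-ASCII bytes in bytes[0 .. i).
function buildNonAsciiPrefix(bytes) {
  const cum = new Uint32Array(bytes.length + 1);
  for (let i = 0; i < bytes.length; i++) {
    // For a byte value (0-255), `>>> 7` is 1 when the high bit is set, else 0.
    cum[i + 1] = cum[i] + (bytes[i] >>> 7);
  }
  return cum;
}

// O(1): the range is all ASCII when no non-ASCII bytes were counted inside it.
function isAsciiRange(cum, pos, len) {
  return cum[pos + len] === cum[pos];
}

const bytes = new TextEncoder().encode("abc é def");
const cum = buildNonAsciiPrefix(bytes);

console.log(isAsciiRange(cum, 0, 4)); // "abc " -> true
console.log(isAsciiRange(cum, 3, 3)); // " é" (3 bytes) -> false
console.log(isAsciiRange(cum, 6, 4)); // " def" -> true
```

The array costs 4 bytes per source byte, which is where the memory overhead of this approach comes from.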
Where it went off the rails

The 56ms -> 4ms headline number is real but dishonest. We moved most of the work into `setup()`, which the benchmark doesn't time. The cumulative array is a `Uint32Array(bufferLength + 1)` -- for a 10MB source file that's 40MB of extra memory. The latin1 pre-decode is another full copy of the buffer as a string. You'd never ship this in production.

We also burned a bunch of experiments trying to build a `byteToChar` mapping for source strings past `firstNonAsciiPos` (exp20, exp22a-c). The idea was sound -- map byte offsets to character offsets so `sourceText.substr` works everywhere -- but it kept breaking on edge cases: UTF-16 surrogate pairs, malformed UTF-8, strings that span the source/strData boundary. We abandoned it after 3 failed attempts.

What's actually shippable

exp13 (33ms, -42% vs baseline) is the honest win. Zero setup cost, zero extra memory, same fast paths as the baseline. The only change is how 1-48 byte non-source ASCII strings are built: check ASCII upfront, then one `fromCharCode` call via a switch on length. The switch is ugly but V8 loves it.

The latin1 pre-decode (exp15) might be worth it if the memory budget allows -- it's one extra string copy of the buffer, which is modest. Whether the cumulative array is worth it depends on how many strings you're decoding and how expensive setup is in the real pipeline.
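
For readers who have not opened the commits, the shape of the exp13 approach is roughly the following. This is a cut-down sketch covering lengths 1 to 4 only (the description above says the real switch goes up to 48), and `decodeShortAscii` is a hypothetical name.

```javascript
const { fromCharCode } = String;

// Hypothetical sketch: returns the decoded string, or null when the caller
// should fall back to another path (non-ASCII bytes or unsupported length).
function decodeShortAscii(uint8, pos, len) {
  // Check ASCII upfront so every byte maps directly to one char code.
  const end = pos + len;
  for (let i = pos; i < end; i++) {
    if (uint8[i] >= 0x80) return null;
  }
  // One direct call per length, each with a fixed argument count.
  switch (len) {
    case 1:
      return fromCharCode(uint8[pos]);
    case 2:
      return fromCharCode(uint8[pos], uint8[pos + 1]);
    case 3:
      return fromCharCode(uint8[pos], uint8[pos + 1], uint8[pos + 2]);
    case 4:
      return fromCharCode(uint8[pos], uint8[pos + 1], uint8[pos + 2], uint8[pos + 3]);
    default:
      return null;
  }
}

const bytes = new TextEncoder().encode("type é");
console.log(decodeShortAscii(bytes, 0, 4));
console.log(decodeShortAscii(bytes, 5, 2));
```

The first call returns the ASCII string in one `fromCharCode` call; the second covers the two bytes of `é` and returns null, so the caller would decode it another way.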
@overlookmotel would love your take on which of these (if any) are worth pulling into the real deserializer. The commits are all in here if you want to poke at individual experiments.