perf(napi/parser): use batched fromCharCode for short ASCII string deserialization #20883

joshuaisaact wants to merge 1 commit into oxc-project:main
Conversation
Replace the byte-by-byte `fromCodePoint` concat loop with an upfront ASCII check followed by a single `fromCharCode` call dispatched via `switch (len)` for strings of 1-48 bytes. The native-call threshold is raised from 9 to 48 bytes.
This is really, really ugly. But it is faster. Open to being told to close it due to the diff size for these gains.
Ugly is fine! Speed is all that matters. But could we please iterate until we find a final winner in https://github.com/overlookmotel/oxc-raw-str-bench before making a PR here? That repo allows measuring the difference between different versions more accurately. I've made a bunch of changes to that repo today. Notably, I discovered that the benchmark had a skew - it was advantaging earlier versions. So let's try this version again there now that the skew is fixed. I've also added some versions of my own before I saw this one. I suspect your monstrous switch is going to beat my current best version, but there are some smaller tweaks I made which should probably be merged in too. I'm closing this for now, just because I think we have a few more % to wring out of this.
…ransfer (#21021)

Improve perf of deserializing strings in raw transfer.

This PR combines several optimizations, which have been tested and benchmarked in https://github.com/overlookmotel/oxc-raw-str-bench. This PR implements the version "latin-slice-onebyte64" from that repo, which is the current winner.

String deserialization is the main bottleneck in raw transfer, so speeding it up will likely make a large impact on deserialization overall.

This work follows on from #20834, which produced a major speed-up in many files by making files which contain some non-ASCII characters take the fast path of slicing `sourceText` more often. This PR tackles the remainder: speeding up the fallback path where the fast path can't be taken.

## Optimizations

The optimizations in this PR are:

### Latin1

When source is not 100% ASCII, decode source text from buffer as Latin1. A Latin1-decoded string represents each UTF-8 byte as a single Latin1 character, so it can be indexed into using UTF-8 offsets.

So when we can't slice the string from `sourceText` because the UTF-8 and UTF-16 offsets differ (after any non-ASCII character), loop through the string's bytes and check if they're all ASCII. If they are, the string can be sliced from `sourceTextLatin` instead, with the original UTF-8 offsets.

This is way faster than calling `textDecoder.decode`, as it avoids a call into C++. [Benchmarks show](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-latin-vs-latin-source64) a speed-up of 55% on average, and up to 70% on some files.

### Latin1 decoding method

It turns out that `new TextDecoder("latin1").decode(arr)` doesn't actually decode to Latin1! Per the WHATWG Encoding Standard, the "latin1" label is mapped to "windows-1252". The result is that with `TextDecoder("latin1")`:

1. `decode` is quite complicated, requiring a 2-pass decode: a scan of the bytes to determine if they're all ASCII, followed by a 2nd pass to do the actual windows-1252 decoding. If the string *does* contain any non-ASCII characters (which it always does in our usecase), NodeJS implements the decoding in JS, not native code. Slow.
2. `decode` produces a 2-byte-per-char string (`TWO_BYTE` in V8), which takes more memory, and is slower for all operations on it, e.g. string comparison, hashing for use as an object key, etc.

Instead, use `Buffer.prototype.latin1Slice`, which:

1. Does a pure Latin1 decode, which is just a single `memcpy` call.
2. Produces a 1-byte-per-char string (`ONE_BYTE` in V8).

`latin1Slice` involves a call into C++, but we only do it once per file, so this cost is tiny in context of deserializing the whole AST.

### Latin1 string slicing

In the fast path, slice from the Latin1-decoded string, instead of `sourceText`. In the fast path, we know that all bytes of source comprising the string are ASCII, so no further checks are required.

This makes no difference on benchmarks for `deserializeStr` itself, but it may have beneficial effects downstream for code (e.g. lint rules) which accesses strings in the AST, e.g. `Identifier` names.

Because Latin1-decoded source text is `ONE_BYTE`-encoded, slices of it are too. In comparison, slices of `sourceText` may be `ONE_BYTE` or `TWO_BYTE`: if a file's source is pure ASCII, it'll be `ONE_BYTE`; if source contains any non-ASCII characters, it'll be `TWO_BYTE`. Files in a repo will likely be a mix of both, which makes strings returned from `deserializeStr` and placed in the AST a mix too. This in turn makes functions (e.g. lint rule visitors) polymorphic. V8 cannot optimize them as aggressively as if they see only `ONE_BYTE` strings.

We cannot make sure that all strings returned by `deserializeStr` are `ONE_BYTE`. Some strings may contain non-ASCII characters, and they *have* to be represented in `TWO_BYTE` form.
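The windows-1252 mapping is easy to observe in Node. A minimal sketch (note that `latin1Slice` is an internal, undocumented method on Node's `Buffer`): byte `0x80` is U+0080 in true Latin1, but U+20AC (the euro sign) in windows-1252.

```javascript
// Byte 0x80 decodes differently under the two methods:
// - WHATWG maps the "latin1" label to windows-1252, where 0x80 is U+20AC ("€")
// - Buffer.prototype.latin1Slice does a true Latin1 decode, giving U+0080
const bytes = Uint8Array.of(0x80);

const viaTextDecoder = new TextDecoder("latin1").decode(bytes);
const viaLatin1Slice = Buffer.from(bytes).latin1Slice(0, bytes.length);

console.log(viaTextDecoder === "\u20ac"); // true: "€", not Latin1
console.log(viaLatin1Slice === "\u0080"); // true: real Latin1
```

Because every possible byte maps to a single Latin1 character, `latin1Slice` can never fail or reinterpret bytes, which is what makes indexing the decoded string by UTF-8 offset safe.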
But we can minimize it. Now only strings which *themselves* contain non-ASCII characters are `TWO_BYTE`, whereas before they would be if the source text as a whole contained a single non-ASCII byte. Code which accesses `Identifier` names, for example, will exclusively see `ONE_BYTE` strings and will be more heavily optimized, because Unicode `Identifier`s are rarer than hen's teeth in real-world code.

### Remove string-concatenation loop

Previously, strings which are outside of source text were assembled byte-by-byte in a loop via concatenation. Instead, check that all the bytes are ASCII first, copy them into an array, and pass that array to `String.fromCharCode` with `fromCharCode.apply(null, array)`. To avoid allocating a fresh array every time, hold a stock of arrays for all string lengths that this path can require, and reuse them.

This is a variation on the approach that #20883 took, but without the massive switch. This produces much tighter assembly, and avoids regressing the fast path due to making `deserializeStr` a very large function. Despite the complexity, and multiple operations, [this is up to 3x faster](https://github.com/overlookmotel/oxc-raw-str-bench/blob/4f96275efa9a35d5d27615abb27f21a137149cc0/README.md#apply30-vs-switch30) than the switch approach, and gives an average 30% speed-up.

### Increase native call threshold

The above optimizations make the slow path much faster. This shifts the tipping point at which it's faster to make a native call to `TextDecoder.decode` from 9 bytes to 64 bytes. Most strings now avoid the native call and stay in JS code which is heavily optimized by Turbofan.

The tipping point of 64 is something of a guesstimate. Benchmarking shows it's in the right ballpark, but we could finesse it, and probably squeeze out another couple of %.

## Credit

The Latin1 string technique was cooked up by @joshuaisaact in overlookmotel/oxc-raw-str-bench#1. All credit to him for this masterstroke, which cracks the whole problem!
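The array-reuse technique described above can be sketched as follows. This is an illustrative sketch, not the generated code: the names (`scratchArrays`, `decodeShortAscii`) and the threshold constant are assumptions based on the PR description.

```javascript
// Sketch of the array-reuse approach: one reusable scratch array per string
// length, filled with byte values and decoded via a single fromCharCode call.
const MAX_APPLY_LEN = 64; // assumed native-call threshold from the PR
const scratchArrays = [];
for (let len = 0; len <= MAX_APPLY_LEN; len++) {
  scratchArrays.push(new Array(len));
}
const { fromCharCode } = String;

// Decode `len` bytes of `uint8` starting at `pos`, assuming the caller has
// already verified they are all ASCII and that len <= MAX_APPLY_LEN.
function decodeShortAscii(uint8, pos, len) {
  const arr = scratchArrays[len];
  for (let i = 0; i < len; i++) arr[i] = uint8[pos + i];
  // One call with `len` arguments; no per-byte string concatenation.
  return fromCharCode.apply(null, arr);
}

const buf = new TextEncoder().encode("hello world");
console.log(decodeShortAscii(buf, 0, 5)); // "hello"
```

Reusing a fixed-length array per string length means `fromCharCode.apply` always sees an array whose length matches the argument count, with zero allocation per call.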
AI Disclosure: Developed with Claude Code (Opus), and an auto-claude research loop here: overlookmotel/oxc-raw-str-bench#1
Why
@overlookmotel pointed me at his benchmark repo after #20834 merged and suggested I set my robot on it. After the `firstNonAsciiPos` and `sourceText.substr` fast paths from that PR, the remaining hot path is the byte-by-byte `out += fromCodePoint(c)` loop for short strings that miss those fast paths: non-source strings, or source strings past the first non-ASCII byte.
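For reference, that loop has roughly this shape (a simplified sketch of the ASCII case, with an illustrative name, not the exact generated code):

```javascript
// Simplified sketch of the old fallback: build the string one byte at a time
// via concatenation. Each `+=` forces V8 to materialize an intermediate
// string (or rope), which is the cost the batched approach removes.
function decodeByConcat(uint8, pos, len) {
  let out = "";
  for (let i = 0; i < len; i++) {
    out += String.fromCodePoint(uint8[pos + i]);
  }
  return out;
}

const buf = new TextEncoder().encode("abc");
console.log(decodeByConcat(buf, 0, 3)); // "abc"
```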
What
One change to the generator (`tasks/ast_tools/src/generators/raw_transfer.rs`); the 9 generated JS files are the mechanical output of `cargo run -p oxc_ast_tools`.

The old approach builds short strings one byte at a time with `+=`. The new approach checks if all bytes are ASCII upfront, then calls `fromCharCode` once with all byte values as direct arguments, dispatched through a `switch (len)` for lengths 1-48. Strings over 48 bytes or containing non-ASCII fall through to `TextDecoder` as before.
V8 can allocate a string in one shot from a single `fromCharCode` call with a known argument count, which is significantly faster than repeated concatenation. We also tried `fromCharCode.apply(null, uint8.subarray(...))` as a cleaner alternative, but it was ~2x slower: the subarray allocation and `.apply` overhead eat the gain. The switch is verbose, but it's what V8 wants.
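The switch dispatch looks roughly like this (abbreviated to 3 cases with an illustrative function name; the generated code covers lengths 1-48):

```javascript
// Each case calls fromCharCode with a statically known argument count,
// letting V8 allocate the result string in one shot.
function decodeShortAsciiSwitch(uint8, pos, len) {
  switch (len) {
    case 1:
      return String.fromCharCode(uint8[pos]);
    case 2:
      return String.fromCharCode(uint8[pos], uint8[pos + 1]);
    case 3:
      return String.fromCharCode(uint8[pos], uint8[pos + 1], uint8[pos + 2]);
    // ...cases 4-48 continue the same pattern in the real generated code...
    default:
      return null; // sketch only: caller falls back to TextDecoder
  }
}

const buf = new TextEncoder().encode("abc");
console.log(decodeShortAsciiSwitch(buf, 0, 3)); // "abc"
```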
Everything from #20834 (`firstNonAsciiPos`, the `sourceText.substr` fast path, threshold logic) is untouched.
Deserialization-only benchmarks on M4 Mac (5 rounds x 30 iters, dropping round 1 for JIT warmup):
`checker.ts` doesn't improve because most strings are in the ASCII prefix (already handled by `sourceText.substr`). The wins are on files where non-ASCII appears early (`cal.com.tsx` at 0.1%, `antd.js` at 1.3%), so `firstNonAsciiPos` doesn't help and short strings fall through to the concat loop, which is now the switch.

Zero setup cost, zero extra memory, no new allocations. The only downside is generated code size (~48 switch cases per deserializer file).
Bench repo vs real parser
Worth noting: @overlookmotel's bench repo uses small tightly-packed buffers per fixture, while the real parser uses a 2GB fixed-size allocator. Some of the more aggressive optimisations we tried (latin1 pre-decode, cumulative non-ASCII prefix sums) showed huge gains in the bench repo but don't translate directly because they'd be operating on the full 2GB buffer. This PR only includes the approach that works identically in both environments. Full experiment log in overlookmotel/oxc-raw-str-bench#1.
References