perf(parser): Explore packed tokens #10933
Conversation
CodSpeed Instrumentation Performance Report: merging #10933 will degrade performance by 3.85%.
Hi Tom! Long time no see! Very promising performance improvements:
I'll get some of the API changes into main first to make this PR easier to review. A previous attempt was #8153, which made similar API changes. I forgot to do that first ...
Curious how you chatted with o3 ;-)
The goal of this PR is to reduce the diff for #10933
After this modification, can it still parse this library properly? https://www.npmjs.com/package/@shikijs/engine-oniguruma?activeTab=code Based on experimentation and calculation, the shikijs oniguruma library is approximately 600KB in size, and 20 bits can address about 1MB of data, so it's safe.
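As a quick sanity check on that capacity claim, here is a hypothetical worked calculation (not code from the PR): a 20-bit unsigned offset can address 2^20 bytes, which comfortably covers a ~600KB file.

```rust
fn main() {
    // A 20-bit unsigned offset can address 2^20 distinct byte positions.
    let max_offset: u32 = 1 << 20;
    assert_eq!(max_offset, 1_048_576); // ~1 MiB

    // The @shikijs/engine-oniguruma bundle is roughly 600 KB,
    // so every byte offset within it fits in 20 bits.
    let library_size: u32 = 600 * 1024;
    assert!(library_size < max_offset);

    println!("20-bit limit: {max_offset} bytes, library: {library_size} bytes");
}
```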
Pack the parser tokens into a single u128
@magic-akari thanks for that example. I was aware of this new constraint, but struggled to identify where someone would actually have such a large single individual token. From some iteration locally, it seems that moving to u128 performs similarly. I also couldn't get stable results between using …
Just conversationally exploring, rough overall structure
Relevant snippet of convo
Yes — Hermes, Facebook/Meta's JS engine, is famous for its focus on startup time, memory compactness, and ahead-of-time execution, especially on low-power devices. There are several powerful tricks in its lexer/parser/compiler that you could absolutely adapt to your high-performance parser. Let's break them down:

1. Compact Token & AST Representation

Hermes: AST nodes are tight structs, often with union-like variants. Apply to your language: … Benefit: faster parsing, smaller memory footprint, better locality.

2. Pre-parsed Bytecode: Skipping Parse at Runtime

...

4. Bit-Packed Tokens: What, Why, and How

Why: …

Idea: Represent a `Token` as a single `u32` or `u64`.

Layout: …
Accessors:

```rust
impl PackedToken {
    fn kind(self) -> TokenKind {
        TokenKind::from_u8((self.0 & 0xFF) as u8)
    }
    fn start(self) -> u32 {
        ((self.0 >> 8) & 0xFF_FFFF) as u32
    }
    fn end(self) -> u32 {
        ((self.0 >> 32) & 0xFF_FFFF) as u32
    }
}
```

Optionally make `TokenKind: u8` with `#[repr(u8)]` to simplify.
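The quoted accessors imply a matching constructor. A minimal self-contained sketch of the idea, with an 8-bit kind and 24-bit start/end offsets (the `PackedToken` and `TokenKind` types here are illustrative, not Oxc's actual definitions):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u8)]
#[allow(dead_code)]
enum TokenKind {
    Ident = 0,
    Number = 1,
}

#[derive(Clone, Copy)]
struct PackedToken(u64);

impl PackedToken {
    // Layout: kind in bits 0..8, start in bits 8..32, end in bits 32..56.
    fn new(kind: TokenKind, start: u32, end: u32) -> Self {
        debug_assert!(start < (1 << 24) && end < (1 << 24));
        PackedToken((kind as u64) | ((start as u64) << 8) | ((end as u64) << 32))
    }
    fn kind(self) -> u8 {
        (self.0 & 0xFF) as u8
    }
    fn start(self) -> u32 {
        ((self.0 >> 8) & 0xFF_FFFF) as u32
    }
    fn end(self) -> u32 {
        ((self.0 >> 32) & 0xFF_FFFF) as u32
    }
}

fn main() {
    let t = PackedToken::new(TokenKind::Number, 10, 42);
    assert_eq!(t.kind(), TokenKind::Number as u8);
    assert_eq!(t.start(), 10);
    assert_eq!(t.end(), 42);
    println!("kind={} start={} end={}", t.kind(), t.start(), t.end());
}
```

The whole token is one scalar, so copying it is a single register move and comparing or defaulting it touches no individual fields.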
From CodSpeed it looks like this has now shifted much of the work over to the lexer (more so with u128 than u64). This might need a bit more work on the lexing side to improve.
overlookmotel left a comment:
Great!
On Mac Mini M4, I'm getting this result from benchmarks:
```
lexer/checker.ts    time: [3.5806 ms 3.5878 ms 3.5953 ms]
                    change: [+2.8823% +3.1912% +3.4929%] (p = 0.00 < 0.05)
                    Performance has regressed.
lexer/cal.com.tsx   time: [1.4846 ms 1.4899 ms 1.4961 ms]
                    change: [+3.5426% +4.5244% +5.7089%] (p = 0.00 < 0.05)
                    Performance has regressed.
lexer/RadixUI.jsx   time: [1.5299 µs 1.5328 µs 1.5358 µs]
                    change: [-4.0315% -3.7330% -3.4303%] (p = 0.00 < 0.05)
                    Performance has improved.
lexer/pdf.mjs       time: [918.41 µs 920.58 µs 923.02 µs]
                    change: [+1.2432% +1.7523% +2.2282%] (p = 0.00 < 0.05)
                    Performance has regressed.
lexer/antd.js       time: [6.8587 ms 6.8660 ms 6.8756 ms]
                    change: [+3.8498% +3.9941% +4.1547%] (p = 0.00 < 0.05)
                    Performance has regressed.
parser/checker.ts   time: [7.1211 ms 7.1489 ms 7.1802 ms]
                    change: [-3.8394% -3.3014% -2.7636%] (p = 0.00 < 0.05)
                    Performance has improved.
parser/cal.com.tsx  time: [3.5647 ms 3.5728 ms 3.5822 ms]
                    change: [-6.5004% -6.2345% -5.9451%] (p = 0.00 < 0.05)
                    Performance has improved.
parser/RadixUI.jsx  time: [4.3214 µs 4.3289 µs 4.3367 µs]
                    change: [-12.434% -12.188% -11.924%] (p = 0.00 < 0.05)
                    Performance has improved.
parser/pdf.mjs      time: [2.0171 ms 2.0206 ms 2.0243 ms]
                    change: [-6.4245% -5.8924% -5.4479%] (p = 0.00 < 0.05)
                    Performance has improved.
parser/antd.js      time: [13.776 ms 13.836 ms 13.905 ms]
                    change: [-4.3647% -3.9299% -3.5010%] (p = 0.00 < 0.05)
                    Performance has improved.
```
+12% perf improvement on the RadixUI benchmark - which I think is the one that's most representative of real-world code that gets thrown at Oxc, as all the other benchmark files are huge.
I'm really surprised that this change nets such a gain. I've previously tried making `Token` `#[repr(C, align(16))]` with padding `u8` fields, so all 16 bytes are initialized, i.e. so the memory layout of `Token` is the same as a `u128`, with all the fields in the same locations as here. I'd have expected the compiler to boil that down to exactly the same assembly as this. But it seems not! 🤷
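The earlier experiment described above can be sketched roughly like this (field names and widths are hypothetical, not Oxc's actual `Token`): both types occupy 16 aligned bytes, yet the padded struct is still several fields to the compiler, while the `u128` newtype is one scalar it can move and compare as a unit.

```rust
// Padded multi-field layout: 16 bytes, 16-byte aligned, no implicit padding.
#[repr(C, align(16))]
#[derive(Clone, Copy, Default)]
#[allow(dead_code)]
struct PaddedToken {
    kind: u8,
    is_on_new_line: bool,
    _pad: [u8; 2], // explicit padding so all 16 bytes are initialized
    start: u32,
    end: u32,
    _pad2: u32,
}

// The u128 approach: the same 16 bytes, but a single scalar value.
#[derive(Clone, Copy, Default)]
struct PackedToken(u128);

fn main() {
    assert_eq!(std::mem::size_of::<PaddedToken>(), 16);
    assert_eq!(std::mem::align_of::<PaddedToken>(), 16);
    assert_eq!(std::mem::size_of::<PackedToken>(), 16);
    println!("both layouts are 16 bytes");
}
```

One plausible reading of the surprise is that identical memory layout does not force identical codegen: the optimizer may still reason field-by-field about the struct.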
Follow-on after #10933. Cosmetic change only. Reformat comments to our standard style, with backticks around type/field names.
Thank you @tomgasson! Can't wait to see what you find next.
Follow-on after #10933. It's preferable to include explicit type params in calls to `mem::transmute` rather than relying on type inference. This ensures any later changes won't inadvertently cause UB. Also expand the safety comment.
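The point about explicit type parameters can be illustrated with a hypothetical token type (not the PR's actual code): with `transmute::<Src, Dst>` spelled out, a later change to either type's size makes the call fail to compile, rather than letting inference silently pick a different reinterpretation.

```rust
// Hypothetical 16-byte token with explicit padding, so the struct
// has no uninitialized implicit-padding bytes.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Token {
    kind: u8,
    flags: u8,
    _pad: [u8; 2],
    start: u32,
    end: u32,
    _pad2: u32,
}

fn main() {
    let t = Token { kind: 1, flags: 0, _pad: [0; 2], start: 5, end: 9, _pad2: 0 };

    // SAFETY: `Token` is `#[repr(C)]`, exactly 16 bytes with all bytes
    // initialized, so reinterpreting it as a `u128` is sound.
    // The explicit `::<Token, u128>` means this stops compiling if either
    // type's size ever changes, instead of becoming a different transmute.
    let bits = unsafe { std::mem::transmute::<Token, u128>(t) };
    let back = unsafe { std::mem::transmute::<u128, Token>(bits) };
    assert_eq!(back, t);
    println!("round-trip ok: {bits:#034x}");
}
```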
This is the kind of thing you could raise as an issue on the Rust repository.
Follow-on after #10933. Make `u8` value of `Kind::Eof` (`Kind::default()`) be 0. This makes `Token::default()` a little cheaper, as all bits of default `Token` are 0.
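A minimal sketch of that change (illustrative names, not Oxc's actual enum): with the default variant's discriminant at 0, every field of a default token is zero, so the compiler can construct `Token::default()` with a plain zeroing store.

```rust
#[repr(u8)]
#[derive(Clone, Copy, Default, PartialEq, Debug)]
enum Kind {
    #[default]
    Eof = 0, // default variant has discriminant 0
    Ident = 1,
    Number = 2,
}

#[derive(Clone, Copy, Default, PartialEq, Debug)]
struct Token {
    kind: Kind,
    start: u32,
    end: u32,
}

fn main() {
    // Every field of the default token is zero bits.
    let t = Token::default();
    assert_eq!(t.kind, Kind::Eof);
    assert_eq!(t.start, 0);
    assert_eq!(t.end, 0);
    println!("{t:?}");
}
```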


Pack the parser tokens into a single u128.
I was musing with o3 about compiler performance, and it told me that Hermes used packed tokens to improve parser performance so I thought I'd try and see if that would also work here. In local benchmarks it appears to have a decent improvement.
closes #9926