perf(parser): Explore packed tokens #10933

Merged
overlookmotel merged 1 commit into oxc-project:main from tomgasson:packed_token
May 13, 2025

Conversation

@tomgasson
Contributor

@tomgasson tomgasson commented May 10, 2025

Pack the parser tokens into a single u128.

I was musing with o3 about compiler performance, and it told me that Hermes used packed tokens to improve parser performance so I thought I'd try and see if that would also work here. In local benchmarks it appears to have a decent improvement.

closes #9926
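The core idea can be sketched as follows. This is a minimal illustration under assumed field widths (kind, start, end, flags), not the PR's exact layout: all fields are packed into one integer with shifts and masks, so a token is a single `Copy` value instead of a multi-field struct.

```rust
// Hypothetical sketch of a packed token. Assumed layout (not oxc's actual one):
//   bits 0..8    token kind
//   bits 8..40   start offset
//   bits 40..72  end offset
//   bits 72..80  flags
#[derive(Clone, Copy, PartialEq, Debug)]
struct PackedToken(u128);

impl PackedToken {
    fn new(kind: u8, start: u32, end: u32, flags: u8) -> Self {
        Self(
            kind as u128
                | (start as u128) << 8
                | (end as u128) << 40
                | (flags as u128) << 72,
        )
    }

    fn kind(self) -> u8 {
        (self.0 & 0xFF) as u8
    }

    fn start(self) -> u32 {
        ((self.0 >> 8) & 0xFFFF_FFFF) as u32
    }

    fn end(self) -> u32 {
        ((self.0 >> 40) & 0xFFFF_FFFF) as u32
    }

    fn flags(self) -> u8 {
        ((self.0 >> 72) & 0xFF) as u8
    }
}
```

Because the whole token fits in one register-sized value, copying and comparing tokens is cheap and token buffers stay dense in cache.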


@github-actions github-actions bot added the A-parser (Area - Parser) and C-performance (Category - Solution not expected to change functional behavior, only performance) labels May 10, 2025
@camc314 camc314 requested a review from overlookmotel May 10, 2025 14:15
@codspeed-hq

codspeed-hq bot commented May 10, 2025

CodSpeed Instrumentation Performance Report

Merging #10933 will degrade performance by 3.85%

Comparing tomgasson:packed_token (18cc339) with main (a86cbb3)

Summary

⚡ 2 improvements
❌ 4 regressions
✅ 30 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark            BASE     HEAD      Change
lexer[antd.js]       24.1 ms  25 ms     -3.72%
lexer[cal.com.tsx]   5.8 ms   6.1 ms    -3.63%
lexer[checker.ts]    14.4 ms  14.9 ms   -3.2%
lexer[pdf.mjs]       3.9 ms   4 ms      -3.85%
parser[antd.js]      106 ms   102.8 ms  +3.07%
parser[cal.com.tsx]  28.7 ms  27.8 ms   +3.38%

@Boshen
Member

Boshen commented May 10, 2025

Hi Tom! Long time no see!

Very promising performance improvements:

[screenshot: benchmark results]

I'll get some of the API changes into main first to make this PR easier to review.

The previous attempt was #8153, which made similar API changes. I forgot to do that first ...

@Boshen Boshen self-assigned this May 10, 2025
@Boshen
Member

Boshen commented May 10, 2025

I was musing with o3 about compiler performance

Curious how you chatted with o3 ;-)

Boshen pushed a commit that referenced this pull request May 11, 2025
The goal of this PR is to reduce the diff for #10933
graphite-app bot pushed a commit that referenced this pull request May 11, 2025
The goal of this PR is to reduce the diff for #10933
@Boshen Boshen assigned overlookmotel and unassigned Boshen May 11, 2025
@magic-akari
Contributor

magic-akari commented May 11, 2025

After this modification, can it parse this library properly?
Just to highlight, there's a Base64-encoded WASM file embedded directly in the source code as a very long string.

https://www.npmjs.com/package/@shikijs/engine-oniguruma?activeTab=code
https://cdn.jsdelivr.net/npm/@shikijs/engine-oniguruma@3.4.0/dist/wasm-inlined.mjs


Based on experimentation and calculation, the shikijs oniguruma library size is approximately 600KB, and 20 bits correspond to about 1MB of data, so it's safe.
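The arithmetic behind that safety margin is simple to verify; a hypothetical helper (not part of the PR) makes it concrete:

```rust
/// Maximum value a 20-bit offset/length field can hold:
/// 2^20 - 1 = 1,048,575, i.e. roughly 1 MiB of source text.
const MAX_20_BIT: u32 = (1 << 20) - 1;

/// Would a single token of `len_bytes` bytes fit in a 20-bit field?
fn fits_in_20_bits(len_bytes: u32) -> bool {
    len_bytes <= MAX_20_BIT
}
```

The ~600 KB inlined WASM string is one giant token, yet it still sits comfortably under the 2^20-byte ceiling.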

Pack the parser tokens into a single u128
@tomgasson
Contributor Author

@magic-akari thanks for that example. I was aware of this new constraint, but struggled to identify where someone would actually have such a large individual token.

From some iteration locally, it seems that moving to u128 performs similarly.

I also couldn't get stable results between using start & len vs start & end, so I've reverted to start & end to reduce what's changing.

[screenshot: benchmark results]
baseline: main

@tomgasson
Contributor Author

Curious how you chatted with o3 ;-)

Just conversational exploration. The rough overall structure:

  • I want to build a language from scratch (in rust), ask me what features it should have
  • [answer questions like JS]
  • Read about the following [oxc, hermes, v8, javascriptcore, typescript, flow, hhvm, JVM, c#, go, mold, wild, 1brc, ripgrep, ninja, n2, buck2, bazel, simdjson, simdutf, ada, sep] (filling context with high perf insights)
  • Let's build my language. Starting at the [lexer|parser|printer]
  • [simple example]
  • What techniques is X using which I could use?

Relevant snippet of convo


Consider Hermes (Facebook/meta). Are there any tricks we can take from there to apply to our lexer/parser?

Yes — Hermes, Facebook/Meta’s JS engine, is famous for its focus on startup time, memory compactness, and ahead-of-time execution, especially on low-power devices. There are several powerful tricks in its lexer/parser/compiler that you could absolutely adapt to your high-performance parser.

Let’s break them down:

1. Compact Token & AST Representation

Hermes:

  • Uses bit-packed token headers (u32 or u64):
    • kind: 8 bits
    • start_offset: 24 bits
    • flags/metadata: remaining bits

AST nodes are tight structs, often with union-like variants.

Apply to your language:

  • You can bit-pack token metadata:
    • Use u32 with {kind: u8, start: u24}
  • Replace heap-allocated structs with:
    • flat arenas
    • small, Copy AST nodes
    • Possibly #[repr(C)] + enum-discriminant layouts for traversals

Benefit: Faster parsing, smaller memory footprint, better locality

2. Pre-parsed Bytecode: Skipping Parse at Runtime

...

Tell me more about bit-packing token

4. Bit-Packed Tokens: What, Why, and How

Why:

  • Your lexer is blazing fast — but each token is 12–24 bytes
  • That adds up fast in large files (>10K tokens)
  • Packing reduces memory, improves cache locality

Idea:

Represent a Token as a single u32 or u64:

Layout:

Field         Bits  Example
Kind          8     0–255 enum
Start offset  24    byte offset
End offset    24    byte offset
Flags         8     custom

Accessors:

// Assuming a u64-backed token with layout {kind: 8, start: 24, end: 24, flags: 8}:
#[derive(Clone, Copy)]
struct PackedToken(u64);

impl PackedToken {
    fn kind(self) -> TokenKind {
        TokenKind::from_u8((self.0 & 0xFF) as u8)
    }
    fn start(self) -> u32 {
        ((self.0 >> 8) & 0xFF_FFFF) as u32
    }
    fn end(self) -> u32 {
        ((self.0 >> 32) & 0xFF_FFFF) as u32
    }
}

Optionally make TokenKind: u8 with #[repr(u8)] to simplify.
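The snippet above leans on a `TokenKind::from_u8` helper that isn't spelled out. A minimal sketch of what it could look like with `#[repr(u8)]` (variant names here are illustrative, not Hermes' or oxc's actual enum):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u8)]
enum TokenKind {
    Eof = 0,
    Identifier = 1,
    Number = 2,
    // ... remaining kinds
}

impl TokenKind {
    /// Safe decoding: fall back to `Eof` for out-of-range discriminants.
    /// (A `transmute`-based version would be faster but needs an unsafe
    /// invariant that the byte is always a valid discriminant.)
    fn from_u8(byte: u8) -> TokenKind {
        match byte {
            1 => TokenKind::Identifier,
            2 => TokenKind::Number,
            _ => TokenKind::Eof,
        }
    }
}
```

With `#[repr(u8)]`, the enum is guaranteed to be one byte, so `kind as u8` and `from_u8` round-trip without any packing logic.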

@Boshen Boshen mentioned this pull request May 12, 2025
@tomgasson
Contributor Author

From CodSpeed it looks like this has now shifted much of the work over to the lexer (more so with u128 than u64). This might need a bit more work on the lexing side to improve.

Member

@overlookmotel overlookmotel left a comment


Great!

On Mac Mini M4, I'm getting this result from benchmarks:

lexer/checker.ts        time:   [3.5806 ms 3.5878 ms 3.5953 ms]
                        change: [+2.8823% +3.1912% +3.4929%] (p = 0.00 < 0.05)
                        Performance has regressed.
lexer/cal.com.tsx       time:   [1.4846 ms 1.4899 ms 1.4961 ms]
                        change: [+3.5426% +4.5244% +5.7089%] (p = 0.00 < 0.05)
                        Performance has regressed.
lexer/RadixUI.jsx       time:   [1.5299 µs 1.5328 µs 1.5358 µs]
                        change: [−4.0315% −3.7330% −3.4303%] (p = 0.00 < 0.05)
                        Performance has improved.
lexer/pdf.mjs           time:   [918.41 µs 920.58 µs 923.02 µs]
                        change: [+1.2432% +1.7523% +2.2282%] (p = 0.00 < 0.05)
                        Performance has regressed.
lexer/antd.js           time:   [6.8587 ms 6.8660 ms 6.8756 ms]
                        change: [+3.8498% +3.9941% +4.1547%] (p = 0.00 < 0.05)
                        Performance has regressed.

parser/checker.ts       time:   [7.1211 ms 7.1489 ms 7.1802 ms]
                        change: [−3.8394% −3.3014% −2.7636%] (p = 0.00 < 0.05)
                        Performance has improved.
parser/cal.com.tsx      time:   [3.5647 ms 3.5728 ms 3.5822 ms]
                        change: [−6.5004% −6.2345% −5.9451%] (p = 0.00 < 0.05)
                        Performance has improved.
parser/RadixUI.jsx      time:   [4.3214 µs 4.3289 µs 4.3367 µs]
                        change: [−12.434% −12.188% −11.924%] (p = 0.00 < 0.05)
                        Performance has improved.
parser/pdf.mjs          time:   [2.0171 ms 2.0206 ms 2.0243 ms]
                        change: [−6.4245% −5.8924% −5.4479%] (p = 0.00 < 0.05)
                        Performance has improved.
parser/antd.js          time:   [13.776 ms 13.836 ms 13.905 ms]
                        change: [−4.3647% −3.9299% −3.5010%] (p = 0.00 < 0.05)
                        Performance has improved.

+12% perf improvement on the RadixUI benchmark - which I think is the one that's most representative of real-world code that gets thrown at Oxc, as all the other benchmark files are huge.

I'm really surprised that this change nets such a gain. I've previously tried making Token #[repr(C, align(16))] with padding u8 fields, so all 16 bytes are initialized, i.e. so the memory layout of Token is the same as a u128, with all the fields in the same locations as here. I'd have expected the compiler to boil that down to exactly the same assembly as this. But it seems not! 🤷
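For reference, the padded-struct variant described above can be sketched like this (field names and widths are illustrative, not oxc's actual `Token` definition):

```rust
// Illustrative sketch of a 16-byte, 16-aligned token struct whose layout
// mirrors a u128: every byte, including padding, is explicitly initialized.
#[repr(C, align(16))]
#[derive(Clone, Copy, Default)]
struct Token {
    kind: u8,            // offset 0
    flags: u8,           // offset 1
    _padding: [u8; 2],   // offsets 2..4, explicit instead of compiler padding
    start: u32,          // offset 4
    end: u32,            // offset 8
    _padding2: u32,      // offsets 12..16
}
```

Size and alignment both come out to 16 bytes, matching `u128`, yet per the comment above the generated code apparently still differed from the real `u128` version.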

@overlookmotel overlookmotel added the 0-merge Merge with Graphite Merge Queue label May 13, 2025
@graphite-app
Contributor

graphite-app bot commented May 13, 2025

Merge activity

  • May 13, 3:16 PM UTC: @tomgasson we removed the merge queue label because we could not find a Graphite account associated with your GitHub profile.

You must have a Graphite account in order to use the merge queue. Create an account and try again using this link

@graphite-app graphite-app bot removed the 0-merge Merge with Graphite Merge Queue label May 13, 2025
@overlookmotel overlookmotel merged commit cfd1ed3 into oxc-project:main May 13, 2025
28 checks passed
graphite-app bot pushed a commit that referenced this pull request May 13, 2025
Follow-on after #10933. That PR made some cosmetic changes to the tests for `Token` which weren't necessary. Revert them.
graphite-app bot pushed a commit that referenced this pull request May 13, 2025
Follow-on after #10933. Cosmetic change only. Reformat comments to our standard style, with backticks around type/field names.
@Boshen
Copy link
Member

Boshen commented May 14, 2025

Thank you @tomgasson! Can't wait to see what you find next.

graphite-app bot pushed a commit that referenced this pull request May 14, 2025
Follow-on after #10933. It's preferable to include explicit type params in calls to `mem::transmute` rather than relying on type inference. This ensures any later changes won't inadvertently cause UB.

Also expand the safety comment.
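The point about explicit type parameters on `mem::transmute` can be illustrated with a standalone sketch (the `PackedToken` here is a placeholder type, not the actual oxc code):

```rust
#[derive(Clone, Copy)]
#[repr(transparent)]
struct PackedToken(u128);

fn to_bits(token: PackedToken) -> u128 {
    // Explicit type parameters instead of inference: if either type's
    // definition changes later, this line stops compiling rather than
    // silently transmuting between unintended types.
    // SAFETY: `PackedToken` is `#[repr(transparent)]` over `u128`,
    // so the two types have identical size and layout.
    unsafe { std::mem::transmute::<PackedToken, u128>(token) }
}
```

With inferred parameters, a later field change could redirect the transmute to a different (same-sized) type without any compiler diagnostic; the turbofish pins both ends down.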
@Conaclos

I'm really surprised that this change nets such a gain. I've previously tried making Token #[repr(C, align(16))] with padding u8 fields, so all 16 bytes are initialized. i.e. so the memory layout of Token is same as a u128 and with all the fields in same location as here. I'd have expected the compiler to boil that down to exactly the same assembly as this. But it seems not!

This is the kind of thing you could raise as an issue on the Rust repository.

graphite-app bot pushed a commit that referenced this pull request May 14, 2025
Follow-on after #10933. Make `u8` value of `Kind::Eof` (`Kind::default()`) be 0. This makes `Token::default()` a little cheaper, as all bits of default `Token` are 0.