
feat(parser): add token collection option#17024

Closed
lilnasy wants to merge 3 commits into 12-18-test_benchmarks_add_parser_tokens_benchmark from 12-17-feat_oxc_parser_store_tokens_in_lexer_

Conversation

@lilnasy
Contributor

@lilnasy lilnasy commented Dec 17, 2025

Part of #16207. This is just to measure the performance impact of unconditionally storing tokens in an oxc_allocator::Vec.

@github-actions github-actions bot added A-parser Area - Parser C-enhancement Category - New feature or request labels Dec 17, 2025
Contributor Author

lilnasy commented Dec 17, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • 0-merge - adds this PR to the back of the merge queue
  • hotfix - for urgent hot fixes, skip the queue and merge this PR next

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@codspeed-hq

codspeed-hq bot commented Dec 17, 2025

CodSpeed Performance Report

Merging #17024 will degrade performance by 29.01%

Comparing 12-17-feat_oxc_parser_store_tokens_in_lexer_ (d9d0d73) with 12-18-test_benchmarks_add_parser_tokens_benchmark (08c5680)

Summary

❌ 12 regressions
✅ 34 untouched
⏩ 3 skipped [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Mode       | Benchmark                              | BASE     | HEAD       | Change  |
|------------|----------------------------------------|----------|------------|---------|
| Simulation | parser_tokens[react.development.js]    | 1.2 ms   | 1.8 ms     | -29.01% |
| Simulation | parser[RadixUIAdoptionSection.jsx]     | 80.6 µs  | 88.1 µs    | -8.48%  |
| Simulation | parser[binder.ts]                      | 3.2 ms   | 3.4 ms     | -6.32%  |
| Simulation | parser_tokens[RadixUIAdoptionSection.jsx] | 80.6 µs | 97.1 µs   | -17.03% |
| Simulation | parser_tokens[cal.com.tsx]             | 25.8 ms  | 32.8 ms    | -21.16% |
| Simulation | parser[cal.com.tsx]                    | 25.8 ms  | 27.7 ms    | -6.85%  |
| Simulation | parser[react.development.js]           | 1.2 ms   | 1.3 ms     | -6.23%  |
| Simulation | parser_tokens[binder.ts]               | 3.2 ms   | 4.3 ms     | -26.25% |
| Simulation | lexer[cal.com.tsx]                     | 5.5 ms   | 7.5 ms     | -27.47% |
| Simulation | lexer[RadixUIAdoptionSection.jsx]      | 21 µs    | 26.3 µs    | -20.05% |
| Simulation | lexer[binder.ts]                       | 884.6 µs | 1,149.6 µs | -23.05% |
| Simulation | lexer[react.development.js]            | 357.9 µs | 465.9 µs   | -23.18% |

Footnotes

  1. 3 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@lilnasy lilnasy force-pushed the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch from e35dc7e to f58229f on December 17, 2025 23:46
@lilnasy lilnasy marked this pull request as ready for review December 18, 2025 00:26
Copilot AI review requested due to automatic review settings December 18, 2025 00:26
Contributor

Copilot AI left a comment


Pull request overview

This PR implements unconditional token storage in the Lexer as part of measuring the performance impact for issue #16207. The changes add a new tokens field to store all tokens encountered during parsing in an arena-allocated vector, and expose this collection through the ParserReturn struct.

  • Adds token storage infrastructure to the Lexer struct
  • Exports Token and Kind types publicly from the parser crate
  • Optimizes error collection by using append instead of extend

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
tasks/track_memory_allocations/allocs_parser.snap Updates memory allocation benchmarks showing minimal increase in arena allocations and reallocations from storing tokens
crates/oxc_parser/src/lib.rs Adds public exports for Token and Kind, adds tokens field to ParserReturn, and optimizes error handling with append
crates/oxc_parser/src/lexer/mod.rs Adds tokens field to Lexer, updates checkpoint/rewind to track token vector length, and implements tokens() method to extract collected tokens


@Boshen Boshen self-requested a review December 18, 2025 03:10
@Boshen Boshen self-assigned this Dec 18, 2025
@lilnasy lilnasy marked this pull request as draft December 18, 2025 05:31
@overlookmotel
Member

overlookmotel commented Dec 18, 2025

I discussed with Boshen today. He wants to abandon the ParserConfig trait approach for now. Plan is to configure whether to store tokens or not with a runtime option. Boshen feels it's OK to take a small slowdown in compiler pipeline in return for less complicated code. If we find the perf hit is too much, we can look at bringing in ParserConfig trait later on.

Just to be transparent: Personally I am not entirely on board with this decision - Boshen and I have different priorities. I prioritize perf over all else (pretty much), whereas Boshen puts more weight on avoiding complex generics (readable code) and keeping compile times down. But we follow our fearless captain! So the decision is made.

I'd like to apologise to you. I thought that the approach was agreed already, but it seems not. So I've sent you down a pointless path of getting into the ParserConfig stuff (which was not trivial) and then reversed direction. Doing that is not cool - we should have reached agreement on the design first. I'm sorry.

So... here's what I suggest:

  • Add tokens: bool field to ParseOptions to enable/disable collecting tokens (default false).
  • Add tokens: ArenaVec<'a, Token> to ParserReturn (which will be empty if collecting tokens is disabled).
  • Stack this PR on top of test(benchmarks): add parser_tokens benchmark #17047.
  • Alter the parser_tokens benchmark to pass tokens: true in options.

I've added the parser_tokens benchmark in a separate PR, so we'll see the costs clearly in CodSpeed on this PR - it'll show cost to both compiler pipeline, and to linter.
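The runtime-option shape described in the bullet points above can be sketched as follows. This is illustrative only: std `Vec` stands in for oxc's arena-allocated `Vec`, the `Token` fields are simplified, and the "lexer" loop that records one token per whitespace-separated word is a pure stand-in for real lexing.

```rust
#[derive(Default, Clone, Copy)]
pub struct ParseOptions {
    pub parse_regular_expression: bool,
    /// Default `false`: tokens are not collected and `ParserReturn::tokens` stays empty.
    pub collect_tokens: bool,
}

#[derive(Debug, PartialEq)]
pub struct Token {
    pub start: u32,
    pub end: u32,
}

pub struct ParserReturn {
    /// Empty unless `collect_tokens` was enabled.
    pub tokens: Vec<Token>,
}

pub fn parse(source_text: &str, options: ParseOptions) -> ParserReturn {
    let mut tokens = Vec::new();
    if options.collect_tokens {
        let mut pos = 0usize;
        for word in source_text.split_whitespace() {
            // Locate each word to recover byte offsets (stand-in for real lexing).
            let start = pos + source_text[pos..].find(word).unwrap();
            let end = start + word.len();
            tokens.push(Token { start: start as u32, end: end as u32 });
            pos = end;
        }
    }
    ParserReturn { tokens }
}
```

With the option off, callers pay only a branch per token plus an empty `Vec`, which is the trade-off being measured in this PR's benchmarks.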


I think we can improve perf when collecting tokens. Simplest thing would be to pre-allocate capacity in tokens Vec for source_text.len() tokens. There can never be more tokens than there are bytes in source text, so this will guarantee that the Vec doesn't have to grow during the parsing process.

Growing a Vec is really expensive (especially if it's large) as all the contents of the Vec (pre-growth) have to be copied to the new allocation (post-growth). You can see this cost in the CodSpeed flame graphs for lexer benchmarks (RawVec::finish_grow::grow).

Pre-allocating a lot of space is a speed vs memory usage trade-off. I imagine it'll be worth it.
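A minimal sketch of that pre-allocation idea, using plain std `Vec` (oxc would use the arena equivalent, `oxc_allocator::Vec::with_capacity_in`). Since every token spans at least one byte, `source_text.len()` is an upper bound on the token count, so the buffer is guaranteed never to grow mid-parse:

```rust
fn collect_tokens_preallocated(source_text: &str) -> Vec<u32> {
    // Upper bound: one token per byte of source text, so no reallocation can occur.
    let mut tokens = Vec::with_capacity(source_text.len());
    let initial_ptr = tokens.as_ptr();

    for (index, _word) in source_text.split_whitespace().enumerate() {
        tokens.push(index as u32); // stand-in for pushing a real Token
    }

    // The buffer never moved: no grow-and-copy happened during the loop.
    assert_eq!(tokens.as_ptr(), initial_ptr);
    tokens
}
```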

But... since you have Graphite now, please make that optimization in a separate PR on top of this one, so we can see the effect it has on benchmarks in isolation.

@lilnasy lilnasy force-pushed the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch 2 times, most recently from e7c19d8 to 4ecf1ad on December 18, 2025 23:26
@github-actions github-actions bot added A-linter Area - Linter A-cli Area - CLI A-formatter Area - Formatter labels Dec 18, 2025
@lilnasy lilnasy force-pushed the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch 4 times, most recently from 1ed0bc7 to 125e885 on December 19, 2025 00:32
@lilnasy lilnasy changed the title feat(oxc_parser): store tokens in Lexer feat(parser): add token collection option Dec 19, 2025
@lilnasy lilnasy changed the base branch from main to 12-18-test_benchmarks_add_parser_tokens_benchmark December 19, 2025 00:47
@github-actions github-actions bot added A-ast-tools Area - AST tools A-editor Area - Editor and Language Server labels Dec 19, 2025
@github-actions github-actions bot added the A-linter-plugins Area - Linter JS plugins label Dec 19, 2025
@lilnasy lilnasy force-pushed the 12-18-test_benchmarks_add_parser_tokens_benchmark branch from 3296252 to e7d7cab on December 19, 2025 00:48
@lilnasy lilnasy changed the base branch from 12-18-test_benchmarks_add_parser_tokens_benchmark to graphite-base/17024 December 19, 2025 00:51
@lilnasy lilnasy force-pushed the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch from 125e885 to 3ee6e9e on December 19, 2025 00:52
@lilnasy lilnasy force-pushed the graphite-base/17024 branch from 3296252 to 6d053b4 on December 19, 2025 00:52
@lilnasy lilnasy changed the base branch from graphite-base/17024 to main December 19, 2025 00:52
@lilnasy lilnasy changed the base branch from main to graphite-base/17024 December 19, 2025 00:58
@lilnasy lilnasy force-pushed the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch from 3ee6e9e to 7635cc4 on December 19, 2025 00:58
@lilnasy lilnasy changed the base branch from graphite-base/17024 to 12-18-test_benchmarks_add_parser_tokens_benchmark December 19, 2025 00:58
@lilnasy lilnasy force-pushed the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch from 7635cc4 to f59f6c8 on December 19, 2025 01:14
@lilnasy
Contributor Author

lilnasy commented Dec 19, 2025

Simplest thing would be to pre-allocate capacity in tokens Vec for source_text.len() tokens.

#17095 shows it improves performance by ~23% after this change degrades it by ~28%, but that is misleading because of how CodSpeed does its maths. When I had the preallocation change in this PR, the degradation only dropped to ~27%.

Parser::new(&allocator, source_text, source_type)
    .with_options(ParseOptions {
        parse_regular_expression: true,
        collect_tokens: true,
Member


This should be false here, and true on the other benchmark - bench_parser_with_tokens.

Member


But ergh! So the cost to parser with runtime option in parse-transform-minify-print pipeline is 6%-8%. That's a lot. We might have to bring back ParserConfig! :)

Member


I've pushed a commit to switch round which benchmark gets collect_tokens: true. I just want the bench results to be clear so I can discuss with Boshen.

I've not restacked rest of the stack - didn't want to touch any of the rest.

Member


Let's not worry about performance for now. If you're able to get it working and conformance tests passing, then we'll loop back and fix the perf. If necessary, we may have to go back to ParserConfig trait.

Comment on lines +274 to +275
let backtrace = std::backtrace::Backtrace::capture();
panic!("Can't retrieve tokens because they were not collected\n{backtrace}");
Member


I've never seen Backtrace used before. I think panic! automatically produces a backtrace, so it's not required. Any reason why you added this?

Contributor Author


I wasn't seeing the names of the methods in the call stack until I added this. It's possible I missed something.

We don't have to keep this, I needed it just for debugging.


Member

@overlookmotel overlookmotel Dec 19, 2025


Ah I see. Usually running with RUST_BACKTRACE=1 gives you stack traces.

Turning on debug temporarily can also help sometimes:

oxc/Cargo.toml

Lines 247 to 255 in 5a7fcd1

[profile.dev]
# Disabling debug info speeds up local and CI builds,
# and we don't rely on it for debugging that much.
debug = false
[profile.test]
# Disabling debug info speeds up local and CI builds,
# and we don't rely on it for debugging that much.
debug = false

Let me know if neither of those works.

Comment on lines 470 to 486
- errors.extend(self.lexer.errors);
- errors.extend(self.errors);
+ errors.append(&mut self.lexer.errors);
+ errors.append(&mut self.errors);
Member


This change is likely good, but it's incidental to this PR. Want to make a separate PR for this?

Contributor Author


It may be incidental for self.errors, but extend() moves self.lexer.errors out of self.lexer, and that partial move prevented the self.lexer.tokens() call below.
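A minimal, self-contained illustration of that ownership point (hypothetical two-field Lexer, not oxc's actual type): `extend(vec)` consumes the source Vec, a partial move out of its owner, while `append(&mut vec)` only borrows it mutably and drains it in place.

```rust
struct Lexer {
    errors: Vec<String>,
    tokens: Vec<u32>,
}

impl Lexer {
    // Consumes the whole Lexer, like the tokens() call discussed above.
    fn tokens(self) -> Vec<u32> {
        self.tokens
    }
}

fn main() {
    let mut lexer = Lexer {
        errors: vec!["unexpected token".to_string()],
        tokens: vec![1, 2, 3],
    };
    let mut errors: Vec<String> = Vec::new();

    // `errors.extend(lexer.errors)` would partially move out of `lexer`,
    // making the `lexer.tokens()` call below a compile error.
    // `append` borrows `lexer.errors` mutably and leaves `lexer` intact.
    errors.append(&mut lexer.errors);

    assert!(lexer.errors.is_empty());
    assert_eq!(errors.len(), 1);
    assert_eq!(lexer.tokens(), vec![1, 2, 3]); // still usable
}
```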

Member


Actually, maybe it's not good. I think extend is slightly cheaper.

But this code could be optimized in other way.

  1. module_record_errors.len() is not included in reservation.
  2. Could extend one of the existing Vecs instead of creating a new one.

Something like:

if !self.source_type.is_typescript() {
    module_record_errors.clear();
}

let mut errors = self.lexer.errors;
errors.reserve(self.errors.len() + module_record_errors.len());
errors.extend(self.errors);
errors.extend(module_record_errors);

Member

@overlookmotel overlookmotel Dec 19, 2025


It may be incidental for self.errors, but extend() moves self.lexer.errors out of self.lexer, and that partial move prevented the self.lexer.tokens() call below.

Ah ha! Sorry I was wrong. It's not incidental.

Member

@overlookmotel overlookmotel Dec 19, 2025


But could you just move the self.lexer.tokens() call to earlier, before the error-handling code, and then leave it using extend()?

@lilnasy lilnasy force-pushed the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch from bdd6736 to d9d0d73 on December 19, 2025 14:59
@lilnasy lilnasy force-pushed the 12-18-test_benchmarks_add_parser_tokens_benchmark branch from f525989 to 08c5680 on December 19, 2025 14:59
@Boshen Boshen removed their assignment Feb 6, 2026
@camc314 camc314 closed this Feb 19, 2026
@overlookmotel overlookmotel deleted the 12-17-feat_oxc_parser_store_tokens_in_lexer_ branch February 27, 2026 00:42

Labels

A-ast-tools Area - AST tools A-cli Area - CLI A-editor Area - Editor and Language Server A-formatter Area - Formatter A-linter Area - Linter A-linter-plugins Area - Linter JS plugins A-parser Area - Parser C-enhancement Category - New feature or request
