feat(parser): add token collection option #17024
lilnasy wants to merge 3 commits into `12-18-test_benchmarks_add_parser_tokens_benchmark`
Conversation
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
CodSpeed Performance Report: merging #17024 will degrade performance by 29.01%.
Force-pushed from e35dc7e to f58229f
Pull request overview
This PR implements unconditional token storage in the Lexer as part of measuring the performance impact for issue #16207. The changes add a new tokens field to store all tokens encountered during parsing in an arena-allocated vector, and expose this collection through the ParserReturn struct.
- Adds token storage infrastructure to the `Lexer` struct
- Exports `Token` and `Kind` types publicly from the parser crate
- Optimizes error collection by using `append` instead of `extend`
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `tasks/track_memory_allocations/allocs_parser.snap` | Updates memory allocation benchmarks, showing a minimal increase in arena allocations and reallocations from storing tokens |
| `crates/oxc_parser/src/lib.rs` | Adds public exports for `Token` and `Kind`, adds a `tokens` field to `ParserReturn`, and optimizes error handling with `append` |
| `crates/oxc_parser/src/lexer/mod.rs` | Adds a `tokens` field to `Lexer`, updates checkpoint/rewind to track token vector length, and implements a `tokens()` method to extract collected tokens |
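The checkpoint/rewind behaviour described above can be sketched with a toy model (assumed shape, not the actual oxc `Lexer`, which stores arena-allocated tokens and much more state): a checkpoint records the token vector's length, and rewinding truncates back to it so speculative parsing leaves no stray tokens behind.

```rust
// Toy model of checkpoint/rewind with token collection. The names
// Checkpoint, push_token, and the Token shape are hypothetical.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Token {
    kind: u8,
    start: u32,
}

struct Checkpoint {
    position: u32,
    tokens_len: usize, // length of the token vec at checkpoint time
}

struct Lexer {
    position: u32,
    tokens: Vec<Token>,
}

impl Lexer {
    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { position: self.position, tokens_len: self.tokens.len() }
    }

    fn rewind(&mut self, cp: Checkpoint) {
        self.position = cp.position;
        // Drop tokens produced after the checkpoint so speculative
        // parsing does not leave duplicates in the collection.
        self.tokens.truncate(cp.tokens_len);
    }

    fn push_token(&mut self, kind: u8) {
        self.tokens.push(Token { kind, start: self.position });
        self.position += 1;
    }
}

fn main() {
    let mut lexer = Lexer { position: 0, tokens: Vec::new() };
    lexer.push_token(1);
    let cp = lexer.checkpoint();
    lexer.push_token(2);
    lexer.push_token(3);
    lexer.rewind(cp);
    assert_eq!(lexer.tokens.len(), 1);
    assert_eq!(lexer.position, 1);
    println!("ok");
}
```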
I discussed with Boshen today. He wants to abandon the `ParserConfig` approach.

Just to be transparent: personally, I am not entirely on board with this decision - Boshen and I have different priorities. I prioritize perf over all else (pretty much), whereas Boshen puts more weight on avoiding complex generics (readable code) and keeping compile times down. But we follow our fearless captain! So the decision is made.

I'd like to apologise to you. I thought that the approach was agreed already, but it seems not. So I've sent you down a pointless path of getting into the `ParserConfig` trait.

So... here's what I suggest:
I've added the …

I think we can improve perf when collecting tokens. The simplest thing would be to pre-allocate capacity in `tokens`. Growing a `Vec` repeatedly has a cost; pre-allocating a lot of space is a speed vs memory usage trade-off. I imagine it'll be worth it.

But... since you have Graphite now, please make that optimization in a separate PR on top of this one, so we can see the effect it has on benchmarks in isolation.
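The pre-allocation idea above can be sketched like this. The ratio (one token per ~4 source bytes) is an assumed heuristic for illustration, not a measured constant from the oxc codebase, and the "lexer" is a stand-in that treats whitespace-separated words as tokens.

```rust
// Hypothetical sketch: pre-allocating token capacity from source length
// so the vector rarely needs to grow mid-parse.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Token {
    start: u32,
    end: u32,
}

/// Guess a token capacity (assumed heuristic: ~1 token per 4 bytes).
fn estimated_token_capacity(source_len: usize) -> usize {
    source_len / 4 + 16
}

fn collect_tokens(source: &str) -> Vec<Token> {
    // Reserve up front: one large allocation instead of repeated doubling.
    let mut tokens = Vec::with_capacity(estimated_token_capacity(source.len()));
    // Stand-in "lexer": each whitespace-separated word is a token.
    let mut offset = 0u32;
    for word in source.split_whitespace() {
        let start = source[offset as usize..].find(word).unwrap() as u32 + offset;
        let end = start + word.len() as u32;
        tokens.push(Token { start, end });
        offset = end;
    }
    tokens
}

fn main() {
    let toks = collect_tokens("let x = 1");
    assert_eq!(toks.len(), 4);
    println!("{toks:?}");
}
```

`Vec::reserve` on an existing vector would achieve the same effect when a better estimate becomes available mid-parse.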
Force-pushed from e7c19d8 to 4ecf1ad
Force-pushed from 1ed0bc7 to 125e885
Force-pushed from 3296252 to e7d7cab
Force-pushed from 125e885 to 3ee6e9e
Force-pushed from 3296252 to 6d053b4
Force-pushed from 3ee6e9e to 7635cc4
Force-pushed from 7635cc4 to f59f6c8
#17095 shows it improves performance by ~23% after this change degrades it by 28%, but that is misleading because of how CodSpeed does its maths. When I had the preallocation change in this PR, the degradation dropped to only ~27%.
`tasks/benchmark/benches/parser.rs` (outdated)

```rust
Parser::new(&allocator, source_text, source_type)
    .with_options(ParseOptions {
        parse_regular_expression: true,
        collect_tokens: true,
```
This should be `false` here, and `true` on the other benchmark - `bench_parser_with_tokens`.
But ergh! So the cost to the parser of the runtime option in the parse-transform-minify-print pipeline is 6%-8%. That's a lot. We might have to bring back `ParserConfig`! :)
I've pushed a commit to switch round which benchmark gets `collect_tokens: true`. I just want the bench results to be clear so I can discuss with Boshen.
I've not restacked the rest of the stack - didn't want to touch any of the rest.
Let's not worry about performance for now. If you're able to get it working and conformance tests passing, then we'll loop back and fix the perf. If necessary, we may have to go back to ParserConfig trait.
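The trade-off being debated here can be sketched in miniature (all names hypothetical, not the actual oxc API): a runtime flag costs a branch on every token, while a const generic - the "`ParserConfig` trait" direction - moves the decision to compile time at the price of monomorphizing the lexer twice, hence the compile-time and readability concerns mentioned above.

```rust
// Sketch of runtime-flag vs compile-time-generic token collection.
// In the COLLECT_TOKENS = false instantiation, the branch and the
// push are removed entirely by the compiler.

struct Lexer<const COLLECT_TOKENS: bool> {
    tokens: Vec<u32>, // stand-in for real Token values
}

impl<const COLLECT_TOKENS: bool> Lexer<COLLECT_TOKENS> {
    fn new() -> Self {
        Self { tokens: Vec::new() }
    }

    fn on_token(&mut self, tok: u32) {
        // Resolved at compile time, unlike `if self.options.collect_tokens`,
        // which would be checked on every token at runtime.
        if COLLECT_TOKENS {
            self.tokens.push(tok);
        }
    }
}

fn main() {
    let mut collecting = Lexer::<true>::new();
    let mut plain = Lexer::<false>::new();
    for t in 0..3 {
        collecting.on_token(t);
        plain.on_token(t);
    }
    assert_eq!(collecting.tokens.len(), 3);
    assert_eq!(plain.tokens.len(), 0);
    println!("ok");
}
```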
```rust
let backtrace = std::backtrace::Backtrace::capture();
panic!("Can't retrieve tokens because they were not collected\n{backtrace}");
```
I've never seen `Backtrace` used before. I think `panic!` automatically produces a backtrace, so it's not required. Any reason why you added this?
I wasn't seeing the names of the methods in the call stack until I added this. It's possible I missed something.
We don't have to keep this, I needed it just for debugging.
Ah I see. Usually running with `RUST_BACKTRACE=1` gives you stack traces.
Turning on debug temporarily can also help sometimes:
Lines 247 to 255 in 5a7fcd1
Let me know if neither of those works.
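For reference, a minimal sketch of the pattern from the diff above (the `tokens_or_panic` helper is hypothetical): `Backtrace::capture()` only records frames when `RUST_BACKTRACE` or `RUST_LIB_BACKTRACE` is set, which may be why the method names were missing, while `Backtrace::force_capture()` records regardless of the environment.

```rust
use std::backtrace::Backtrace;

// Hypothetical helper mirroring the panic-with-backtrace in the diff.
fn tokens_or_panic(collected: Option<Vec<u32>>) -> Vec<u32> {
    collected.unwrap_or_else(|| {
        // force_capture() records frames even without RUST_BACKTRACE=1;
        // capture() (used in the diff) respects the env vars.
        let backtrace = Backtrace::force_capture();
        panic!("Can't retrieve tokens because they were not collected\n{backtrace}");
    })
}

fn main() {
    // capture() is disabled unless RUST_BACKTRACE / RUST_LIB_BACKTRACE is set.
    let lazy = Backtrace::capture();
    println!("capture() status: {:?}", lazy.status());

    // force_capture() always walks the stack.
    let forced = Backtrace::force_capture();
    assert!(!forced.to_string().is_empty());

    assert_eq!(tokens_or_panic(Some(vec![1, 2])), vec![1, 2]);
}
```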
`crates/oxc_parser/src/lib.rs` (outdated)

```diff
-errors.extend(self.lexer.errors);
-errors.extend(self.errors);
+errors.append(&mut self.lexer.errors);
+errors.append(&mut self.errors);
```
This change is likely good, but it's incidental to this PR. Want to make a separate PR for this?
It may be incidental for `self.errors`, but `extend()` moves `self.lexer` while consuming `self.lexer.errors`. That prevented the `self.lexer.tokens()` call below.
Actually, maybe it's not good. `extend` is, I think, slightly cheaper.

But this code could be optimized in another way:

- `module_record_errors.len()` is not included in the reservation.
- Could extend one of the existing `Vec`s instead of creating a new one.

Something like:

```rust
if !self.source_type.is_typescript() {
    module_record_errors.clear();
}
errors = self.lexer.errors;
errors.reserve(self.errors.len() + module_record_errors.len());
errors.extend(self.errors);
errors.extend(module_record_errors);
```
> It may be incidental for `self.errors`, but `extend()` moves `self.lexer` while consuming `self.lexer.errors`. That prevented the `self.lexer.tokens()` call below.
Ah ha! Sorry I was wrong. It's not incidental.
But could you just move the `self.lexer.tokens()` call to earlier, before the error-handling code, and then leave it as using `extend()`?
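A minimal std-only illustration of the difference being debated (the types are stand-ins, not the oxc ones): `append` moves the elements but only borrows the source vector mutably, leaving the owning struct fully usable, whereas `extend(self.lexer.errors)` moves the `errors` field out, partially moving the lexer and making a later `self.lexer.tokens()` call a compile error.

```rust
// append vs extend on plain Vecs, independent of the oxc types.

struct Lexer {
    errors: Vec<String>,
    tokens: Vec<u32>,
}

impl Lexer {
    fn take_errors_by_append(&mut self, out: &mut Vec<String>) {
        // append drains self.errors in place; self is not moved.
        out.append(&mut self.errors);
    }
}

fn main() {
    let mut lexer = Lexer {
        errors: vec!["unexpected token".to_string()],
        tokens: vec![1, 2, 3],
    };

    let mut all_errors: Vec<String> = Vec::new();
    lexer.take_errors_by_append(&mut all_errors);

    // The lexer was not moved: tokens are still reachable.
    assert_eq!(all_errors.len(), 1);
    assert!(lexer.errors.is_empty());
    assert_eq!(lexer.tokens, vec![1, 2, 3]);

    // By contrast, `all_errors.extend(lexer.errors)` would move the
    // `errors` field out of `lexer`, so any later method call taking
    // `&self` or `&mut self` would fail to compile.
    println!("ok");
}
```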
Force-pushed from bdd6736 to d9d0d73
Force-pushed from f525989 to 08c5680

Part of #16207. This is just to measure the performance impact of unconditionally storing tokens in an `oxc_allocator::Vec`.