refactor(es/parser): Split parser into also-lex/parse-only by bvanjoi · Pull Request #10399 · swc-project/swc

bvanjoi · 2025-04-21T12:42:33Z

This constitutes a substantial code revision. We should first evaluate whether it yields measurable performance enhancements.

changeset-bot · 2025-04-21T12:42:38Z

🦋 Changeset detected

Latest commit: 3db38aa

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

codspeed-hq · 2025-04-21T12:46:53Z

CodSpeed Performance Report

Merging #10399 will create unknown performance changes

_{Comparing bvanjoi:pref (3db38aa) with main (26289ab)}

Summary

🆕 152 new benchmarks

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
🆕	`deserialization of serde`	N/A	2.5 µs	N/A
🆕	`serialization of serde`	N/A	2.9 µs	N/A
🆕	`css/visitor/compare/clone`	N/A	5 ms	N/A
🆕	`css/visitor/compare/fold_span`	N/A	6 ms	N/A
🆕	`css/visitor/compare/fold_span_panic`	N/A	6 ms	N/A
🆕	`css/visitor/compare/visit_mut_span`	N/A	5.4 ms	N/A
🆕	`css/visitor/compare/visit_mut_span_panic`	N/A	5.4 ms	N/A
🆕	`css/lexer/bootstrap_5_1_3`	N/A	11.2 ms	N/A
🆕	`css/lexer/foundation_6_7_4`	N/A	9.1 ms	N/A
🆕	`css/lexer/tailwind_3_1_1`	N/A	1.8 ms	N/A
🆕	`css/parser/bootstrap_5_1_3`	N/A	48.5 ms	N/A
🆕	`css/parser/foundation_6_7_4`	N/A	38.4 ms	N/A
🆕	`css/parser/tailwind_3_1_1`	N/A	7.5 ms	N/A
🆕	`es/codegen/colors`	N/A	70.5 µs	N/A
🆕	`es/codegen/large`	N/A	647.9 µs	N/A
🆕	`es/codegen/with-parser/colors`	N/A	183.4 µs	N/A
🆕	`es/codegen/with-parser/large`	N/A	1.4 ms	N/A
🆕	`es/fast-lexer/angular`	N/A	9.2 ms	N/A
🆕	`es/fast-lexer/backbone`	N/A	1.2 ms	N/A
🆕	`es/fast-lexer/jquery`	N/A	6.1 ms	N/A
...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

bvanjoi · 2025-04-22T07:18:02Z

The performance gains demonstrate a compelling case for implementation.

Let me explain my idea:

First, split Token into TokenKind (the type of token, which is still referred to as Token in this PR) and TokenValue. This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing).
And then, these changes mean there are now two different types of Token between swc_ecma_lexer and swc_ecma_parser: one contains the TokenValue, while the other does not.
The points above mean that using a single Parser between swc_ecma_lexer and swc_ecma_parser is more complicated. We would need to create a trait to handle different types of Token and manage their behaviors, and the ROI for this seems low.
So, I made the following changes: I copied the swc_ecma_parser/parser into swc_ecma_lexer, meaning that swc_ecma_lexer now includes the Parser itself. I also created a swc_ecma_parser/lexer folder, which contains a lexer that serves only the swc_ecma_parser/Parser and is not exposed externally. With these changes, if you want to generate the correct tokens, you should use swc_ecma_lexer/{Lexer, Parser}, while the swc_ecma_parser/Parser is now only for quickly generating the AST.
Based on these changes, I’ve removed some tests that ensured the lexer worked correctly in swc_ecma_parser and switched to using swc_ecma_lexer in swc_fast_ts_strip.
Another change is simply replacing Token with TokenKind and TokenValue.

@kdy1 Do you agree with these changes? If so, I will fix CI and make it ready for review.

CPunisher · 2025-04-22T08:22:00Z

I have seen the two lexers and two parsers in swc_ecma_lexer and swc_ecma_parser. Do we need maintain all of them or one pair is WIP while another is released?

bvanjoi · 2025-04-22T10:07:23Z

Do we need maintain all of them

Yes, I think so. Perhaps we can find a better way to integrate it.

kdy1 · 2025-04-22T10:10:12Z

Why do we need two separate parser/lexer/token types?
What's the problem you see in the token definition in #10397 ?

bvanjoi · 2025-04-22T10:25:40Z

Why do we need two separate token types?

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

Why do we need two separate lexer ?

Since the lexical analysis implementations differ, swc_ecma_lexer/lexer guarantees token correctness while swc_ecma_parser/lexer does not maintain the same level of validation.

Why do we need two separate parser?

Using a single parser to handle different lexical logic introduces unnecessary development complexity. While I've separated them for now, I believe there should be a better way to consolidate this logic.

CPunisher · 2025-04-22T11:52:51Z

The value field defined in Token(source) is not required for parsing.

I'm not sure why TokenValue is not required for parsing. For example, doesn't an identifier token value required for constructing identifier ast node?

CPunisher · 2025-04-22T11:55:36Z

This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing)

I guess this is the root cause of the splitting. But I'm not too clear about it. Could you please provide more detailed examples?

bvanjoi · 2025-04-22T12:05:54Z

I'm not sure why TokenValue is not required for parsing.

TokenValue is not required in the Token. For example:

a + b
^        --> `token: ident`, `lexer.state.token_value: a`
    ^    --> `token: ident`, `lexer.state.token_value: b`
    
// rather than
a + b
^        --> `token: token_kind: ident, token_value: a`
    ^    --> `token: token_kind: ident, token_value: b` // actually, we dont need to store `token_value` in each `token`

CPunisher · 2025-04-22T13:27:53Z

Since both of swc_ecma_parser and swc_ecma_lexer contain Lexer and Parser, under what circumstances should we use swc_ecma_parser? In other words, what's the difference between them now?

kdy1 · 2025-04-22T13:46:13Z

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

I understood, but I think the gain is too small if we have to maintain two tokens/lexers/parsers. My guess is that #10397 is the sweet point, but not sure.

bvanjoi · 2025-04-22T14:31:11Z

In other words, what's the difference between them now

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I think the gain is too small if we have to maintain two tokens/lexers/parsers

I suggest considering landing this change for three reasons:

Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type). I think with this change we can eventually achieve the same performance as oxc/parser.
Strict token validation is not commonly needed, and sacrificing performance improvements for it may not be a good trade-off.
We can reuse code by creating a shared crate (e.g., swc_ecma_lexer_common) to store common functions and traits.

CPunisher · 2025-04-22T14:46:43Z

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser. If Strict token validation is not neccessary for generating correct AST, why we still keep swc_ecma_lexer/Parser in the project?

Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type).

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed? If not, why can't do so?

bvanjoi · 2025-04-22T14:59:29Z

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser.

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

why we still keep swc_ecma_lexer/Parser in the project

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed

I'm having difficulty understanding the intended meaning here. Could you please clarify?

CPunisher · 2025-04-22T15:17:51Z

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

Learnt, and I agree this is a very good idea.

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

I see. Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

I'm having difficulty understanding the intended meaning here. Could you please clarify?

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser (but maybe not yet finished)?

bvanjoi · 2025-04-22T15:37:54Z

Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

Certainly not - the current token and parser state combination is sufficient for producing a correct AST. This approach aligns with many established parser implementations, including TypeScript's.

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser

I agree with this perspective in principle, but I don't believe creating a separate crate would actually reduce the workload. My primary goal is to optimize swc_ecma_parser for faster AST generation while maintaining correctness.

To summarize these changes:

Main branch: swc_ecma_parser generates both AST and tokens.
After this PR:
- swc_ecma_lexer takes over token generation.
- swc_ecma_parser focuses solely on AST production, and it no longer exposes internal tokens.

kdy1 · 2025-04-22T15:40:41Z

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

CPunisher · 2025-04-22T15:46:12Z

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

Agree. I expect finally we should eliminate all duplication and get a clear lexer and a clear parser (just like now, and faster), but we should still keep the code in the process as maintainable as possible, which also bring less burden for bug fixing.

kdy1 · 2025-04-22T15:52:13Z

If it's inevitable at the moment, I think we can merge this PR as is and refactor these with other PRs, but I think we should at least have concrete plan for future deduplication.

bvanjoi · 2025-04-22T15:56:53Z

I'm worried about the maintenance burden.

I am also worried about the maintenance cost: At present, it seems troublesome to abstract swc_ecma_lexer/Parser and swc_ecma_parser/Parser into an independent trait. My suggestion is:

Create a library swc_ecma_lexer_common, and then put the common parts of swc_ecma_lexer and swc_ecma_parser into it, and swc_ecma_parser will no longer depend on swc_ecma_lexer.
Gradually sort out the reusable logic of the two and implement it in swc_ecma_lexer_common.

What do you think?

kdy1 · 2025-04-22T16:02:00Z

It sounds acceptable for now, but I want to hear opinion from @CPunisher, too.

### What? Update swc_core to v26 ### Why? swc-project/swc#10399 made the parser 7% faster

bvanjoi requested review from a team as code owners April 21, 2025 12:42

bvanjoi force-pushed the pref branch 5 times, most recently from 2c80e1a to c2b1b12 Compare April 22, 2025 03:39

bvanjoi force-pushed the pref branch from 8867c2d to f4bb503 Compare April 22, 2025 12:57

bvanjoi and others added 11 commits May 17, 2025 16:12

chore(es/parser): common parse pattern, part3

014f754

chore(es/parser): common parse_ty, part7

96f69ab

chore(es/parser): common parse_ty, part8

dca14f0

chore\(es/parser\): common parse stmt, part2

4388351

chore(es/parser): common parse expr

9fd1182

chore(es/parser): common parse stmt

08f540e

chore(es/parser): common parse class and fn

32c48fa

Create twenty-rocks-deny.md

97c19b3

refactor(es/lexer): delete Korean comment

a712751

refactor(ecma/lexer): self by value

a0b476e

refactor(es/lexer): add lifetime for Input tarit

0c6e879

bvanjoi dismissed kdy1’s stale review via 0c6e879 May 17, 2025 08:43

bvanjoi force-pushed the pref branch from d1196a6 to 0c6e879 Compare May 17, 2025 08:43

bvanjoi requested review from a team as code owners May 17, 2025 08:43

Create kind-weeks-study.md

3db38aa

kdy1 approved these changes May 17, 2025

View reviewed changes

kdy1 merged commit 26289ab into swc-project:main May 17, 2025
18 checks passed

kdy1 modified the milestones: Planned, v1.11.28 May 20, 2025

This was referenced May 20, 2025

build: Update swc_core to v26 wooorm/mdxjs-rs#69

Merged

build: Update swc_core to v26.0.0 vercel/next.js#79411

Merged

kdy1 added a commit to vercel/next.js that referenced this pull request May 20, 2025

build: Update swc_core to v26.0.0 (#79411)

deb49c4

### What? Update swc_core to v26 ### Why? swc-project/swc#10399 made the parser 7% faster

kdy1 mentioned this pull request May 30, 2025

fix: change visibility of lexer::token exports from crate to public #10537

Closed

swc-project locked as resolved and limited conversation to collaborators Jun 20, 2025

kdy1 modified the milestones: v1.11.28, Planned Jun 21, 2025

Uh oh!

Comments

Conversation

bvanjoi commented Apr 21, 2025

Uh oh!

changeset-bot bot commented Apr 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

codspeed-hq bot commented Apr 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CodSpeed Performance Report

Merging #10399 will create unknown performance changes

Summary

Benchmarks breakdown

Uh oh!

bvanjoi commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CPunisher commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bvanjoi commented Apr 22, 2025

Uh oh!

kdy1 commented Apr 22, 2025

Uh oh!

bvanjoi commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CPunisher commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CPunisher commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bvanjoi commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CPunisher commented Apr 22, 2025

Uh oh!

kdy1 commented Apr 22, 2025

Uh oh!

bvanjoi commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CPunisher commented Apr 22, 2025

Uh oh!

bvanjoi commented Apr 22, 2025

Uh oh!

CPunisher commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bvanjoi commented Apr 22, 2025

Uh oh!

kdy1 commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CPunisher commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kdy1 commented Apr 22, 2025

Uh oh!

bvanjoi commented Apr 22, 2025

Uh oh!

kdy1 commented Apr 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

changeset-bot bot commented Apr 21, 2025 •

edited

Loading

codspeed-hq bot commented Apr 21, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

kdy1 commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading