Skip to content

Comments

refactor(es/parser): Split parser into also-lex/parse-only#10399

Merged
kdy1 merged 84 commits intoswc-project:mainfrom
bvanjoi:pref
May 17, 2025
Merged

refactor(es/parser): Split parser into also-lex/parse-only#10399
kdy1 merged 84 commits intoswc-project:mainfrom
bvanjoi:pref

Conversation

@bvanjoi
Copy link
Contributor

@bvanjoi bvanjoi commented Apr 21, 2025

This constitutes a substantial code revision. We should first evaluate whether it yields measurable performance enhancements.

@bvanjoi bvanjoi requested review from a team as code owners April 21, 2025 12:42
@changeset-bot
Copy link

changeset-bot bot commented Apr 21, 2025

🦋 Changeset detected

Latest commit: 3db38aa

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@codspeed-hq
Copy link

codspeed-hq bot commented Apr 21, 2025

CodSpeed Performance Report

Merging #10399 will create unknown performance changes

Comparing bvanjoi:pref (3db38aa) with main (26289ab)

Summary

🆕 152 new benchmarks

Benchmarks breakdown

Benchmark BASE HEAD Change
🆕 deserialization of serde N/A 2.5 µs N/A
🆕 serialization of serde N/A 2.9 µs N/A
🆕 css/visitor/compare/clone N/A 5 ms N/A
🆕 css/visitor/compare/fold_span N/A 6 ms N/A
🆕 css/visitor/compare/fold_span_panic N/A 6 ms N/A
🆕 css/visitor/compare/visit_mut_span N/A 5.4 ms N/A
🆕 css/visitor/compare/visit_mut_span_panic N/A 5.4 ms N/A
🆕 css/lexer/bootstrap_5_1_3 N/A 11.2 ms N/A
🆕 css/lexer/foundation_6_7_4 N/A 9.1 ms N/A
🆕 css/lexer/tailwind_3_1_1 N/A 1.8 ms N/A
🆕 css/parser/bootstrap_5_1_3 N/A 48.5 ms N/A
🆕 css/parser/foundation_6_7_4 N/A 38.4 ms N/A
🆕 css/parser/tailwind_3_1_1 N/A 7.5 ms N/A
🆕 es/codegen/colors N/A 70.5 µs N/A
🆕 es/codegen/large N/A 647.9 µs N/A
🆕 es/codegen/with-parser/colors N/A 183.4 µs N/A
🆕 es/codegen/with-parser/large N/A 1.4 ms N/A
🆕 es/fast-lexer/angular N/A 9.2 ms N/A
🆕 es/fast-lexer/backbone N/A 1.2 ms N/A
🆕 es/fast-lexer/jquery N/A 6.1 ms N/A
... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

@bvanjoi bvanjoi force-pushed the pref branch 5 times, most recently from 2c80e1a to c2b1b12 Compare April 22, 2025 03:39
@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

The performance gains demonstrate a compelling case for implementation.

Let me explain my idea:

  • First, split Token into TokenKind (the type of token, which is still referred to as Token in this PR) and TokenValue. This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing).
  • And then, these changes mean there are now two different types of Token between swc_ecma_lexer and swc_ecma_parser: one contains the TokenValue, while the other does not.
  • The points above mean that using a single Parser between swc_ecma_lexer and swc_ecma_parser is more complicated. We would need to create a trait to handle different types of Token and manage their behaviors, and the ROI for this seems low.
  • So, I made the following changes: I copied the swc_ecma_parser/parser into swc_ecma_lexer, meaning that swc_ecma_lexer now includes the Parser itself. I also created a swc_ecma_parser/lexer folder, which contains a lexer that serves only the swc_ecma_parser/Parser and is not exposed externally. With these changes, if you want to generate the correct tokens, you should use swc_ecma_lexer/{Lexer, Parser}, while the swc_ecma_parser/Parser is now only for quickly generating the AST.
  • Based on these changes, I’ve removed some tests that ensured the lexer worked correctly in swc_ecma_parser and switched to using swc_ecma_lexer in swc_fast_ts_strip.
  • Another change is simply replacing Token with TokenKind and TokenValue.

@kdy1 Do you agree with these changes? If so, I will fix CI and make it ready for review.

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

I have seen the two lexers and two parsers in swc_ecma_lexer and swc_ecma_parser. Do we need maintain all of them or one pair is WIP while another is released?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

Do we need maintain all of them

Yes, I think so. Perhaps we can find a better way to integrate it.

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

Why do we need two separate parser/lexer/token types?
What's the problem you see in the token definition in #10397 ?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

Why do we need two separate token types?

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

Why do we need two separate lexer ?

Since the lexical analysis implementations differ, swc_ecma_lexer/lexer guarantees token correctness while swc_ecma_parser/lexer does not maintain the same level of validation.

Why do we need two separate parser?

Using a single parser to handle different lexical logic introduces unnecessary development complexity. While I've separated them for now, I believe there should be a better way to consolidate this logic.

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

The value field defined in Token(source) is not required for parsing.

I'm not sure why TokenValue is not required for parsing. For example, doesn't an identifier token value required for constructing identifier ast node?

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

  • This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing)

I guess this is the root cause of the splitting. But I'm not too clear about it. Could you please provide more detailed examples?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

I'm not sure why TokenValue is not required for parsing.

TokenValue is not required in the Token. For example:

a + b
^        --> `token: ident`, `lexer.state.token_value: a`
    ^    --> `token: ident`, `lexer.state.token_value: b`
    
// rather than
a + b
^        --> `token: token_kind: ident, token_value: a`
    ^    --> `token: token_kind: ident, token_value: b` // actually, we dont need to store `token_value` in each `token`

@CPunisher
Copy link
Member

Since both of swc_ecma_parser and swc_ecma_lexer contain Lexer and Parser, under what circumstances should we use swc_ecma_parser? In other words, what's the difference between them now?

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

I understood, but I think the gain is too small if we have to maintain two tokens/lexers/parsers. My guess is that #10397 is the sweet point, but not sure.

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

In other words, what's the difference between them now

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I think the gain is too small if we have to maintain two tokens/lexers/parsers

I suggest considering landing this change for three reasons:

  • ​Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type). I think with this change we can eventually achieve the same performance as oxc/parser.
  • Strict token validation is not commonly needed, and sacrificing performance improvements for it may not be a good trade-off.
  • ​We can reuse code by creating a shared crate (e.g., swc_ecma_lexer_common) to store common functions and traits.

@CPunisher
Copy link
Member

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser. If Strict token validation is not neccessary for generating correct AST, why we still keep swc_ecma_lexer/Parser in the project?

  • ​Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type).

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed? If not, why can't do so?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser.

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

why we still keep swc_ecma_lexer/Parser in the project

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed

I'm having difficulty understanding the intended meaning here. Could you please clarify?

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

Learnt, and I agree this is a very good idea.

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

I see. Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

I'm having difficulty understanding the intended meaning here. Could you please clarify?

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser (but maybe not yet finished)?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

Certainly not - the current token and parser state combination is sufficient for producing a correct AST. This approach aligns with many established parser implementations, including TypeScript's.

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser

I agree with this perspective in principle, but I don't believe creating a separate crate would actually reduce the workload. My primary goal is to optimize swc_ecma_parser for faster AST generation while maintaining correctness.

To summarize these changes:

  • Main branch: swc_ecma_parser generates both AST and tokens.
  • After this PR:
    • swc_ecma_lexer takes over token generation.
    • swc_ecma_parser focuses solely on AST production, and it no longer exposes internal tokens.

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

Agree. I expect finally we should eliminate all duplication and get a clear lexer and a clear parser (just like now, and faster), but we should still keep the code in the process as maintainable as possible, which also bring less burden for bug fixing.

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

If it's inevitable at the moment, I think we can merge this PR as is and refactor these with other PRs, but I think we should at least have concrete plan for future deduplication.

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

I'm worried about the maintenance burden.

I am also worried about the maintenance cost: At present, it seems troublesome to abstract swc_ecma_lexer/Parser and swc_ecma_parser/Parser into an independent trait. My suggestion is:

  • Create a library swc_ecma_lexer_common, and then put the common parts of swc_ecma_lexer and swc_ecma_parser into it, and swc_ecma_parser will no longer depend on swc_ecma_lexer.
  • Gradually sort out the reusable logic of the two and implement it in swc_ecma_lexer_common.

What do you think?

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

It sounds acceptable for now, but I want to hear opinion from @CPunisher, too.

@kdy1 kdy1 merged commit 26289ab into swc-project:main May 17, 2025
18 checks passed
@kdy1 kdy1 modified the milestones: Planned, v1.11.28 May 20, 2025
kdy1 added a commit to vercel/next.js that referenced this pull request May 20, 2025
### What?

Update swc_core to v26

### Why?

swc-project/swc#10399 made the parser 7% faster
@swc-project swc-project locked as resolved and limited conversation to collaborators Jun 20, 2025
@kdy1 kdy1 modified the milestones: v1.11.28, Planned Jun 21, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants