feat(linter/plugins): serialize rust-parsed tokens #17025

lilnasy wants to merge 1 commit into `12-19-perf_oxc_parser_preallocate_tokens_vector`.
Conversation
CodSpeed Performance Report: merging #17025 will degrade performance by 25.15%.
```rust
#[estree(skip)]
pub tokens: Vec<'a, Token<'a>>,
```
Can we avoid adding this field to Program? Tokens can be returned in ParserReturn instead.
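The suggestion above can be sketched as follows. This is a hypothetical, self-contained illustration with simplified stand-in types, not the real oxc definitions: the point is only that the token list travels on the parser's return value, so `Program` stays a pure AST type.

```rust
// Stand-in types (not the real oxc definitions).
struct Span { start: u32, end: u32 }
struct Token { span: Span }
struct Program; // AST root: no `tokens` field here.

// Tokens are returned alongside the AST instead of being stored on it.
struct ParserReturn {
    program: Program,
    tokens: Vec<Token>,
}

fn parse(source: &str) -> ParserReturn {
    // Stand-in lexer: one token covering the whole source.
    let tokens = vec![Token { span: Span { start: 0, end: source.len() as u32 } }];
    ParserReturn { program: Program, tokens }
}

fn main() {
    let ret = parse("let x");
    assert_eq!(ret.tokens.len(), 1);
    assert_eq!(ret.tokens[0].span.end, 5);
    println!("tokens: {}", ret.tokens.len());
}
```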
```rust
use oxc_allocator::CloneIn;
use oxc_ast_macros::{ast, ast_meta};
use oxc_estree::ESTree;
use oxc_span::{Atom, ContentEq, GetSpan, GetSpanMut, Span};

#[ast]
#[generate_derive(CloneIn, ContentEq, ESTree, GetSpan, GetSpanMut)]
#[estree(add_fields(value = TokenValue), no_type, no_ts_def, no_parent)]
#[derive(Debug)]
/// Represents a token in the source code.
pub struct Token<'a> {
    /// Span.
    #[span]
    pub span: Span,
    /// Type.
    pub r#type: Atom<'a>,
    /// Flags.
    pub flags: Option<Atom<'a>>,
    /// Pattern.
    pub pattern: Option<Atom<'a>>,
}

/// Custom deserializer for `value` field of `Token`.
#[ast_meta]
#[generate_derive(CloneIn, ContentEq, ESTree)]
#[estree(ts_type = "string", raw_deser = "SOURCE_TEXT.slice(THIS.start, THIS.end)")]
pub struct TokenValue<'a, 'b>(pub &'b Token<'a>);
```
We shouldn't need this. We want the JS-side code to read the original `Vec<Token>` directly, without converting it.

Instead:

- Add the `#[ast]` attribute to the existing `Token` struct (and remove the `#[repr(transparent)]` attr - the `#[ast]` macro adds it automatically).
- Add `#[ast] #[generate_derive(ESTree)] #[estree(rename = "TokenKind")]` attributes to the existing `Kind` (token kind) enum.
- Implement the `ESTree` trait on `Token`.
- Add `crates/oxc_parser/src/lexer/token.rs` and `crates/oxc_parser/src/lexer/kind.rs` to the list of files that codegen processes in `ast_tools`.
- Add another type `#[ast] struct Tokens<'a>(Vec<'a, Token<'a>>);` in the same file. We don't need to use that type, but adding it should make `ast_tools` generate a `deserializeVecToken` function.
- Add a `TOKEN` flag to `FLAG_NAMES` in the `raw_transfer.rs` generator.
- Export `deserializeVecToken` from the deserializer when the `TOKEN` flag is enabled.
The `ESTree` impl on `Token` will need to be manually defined (including the `#[estree(raw_deser)]` attr). Shout if you have any difficulty with this. Writing `raw_deser` implementations is pretty dreadful - my fault, it's terribly designed - and you'll need to write the byte offsets manually (since `Token` doesn't have real fields).

But you should be able to use `deserializeTokenKind`, which codegen will generate, and also use the generated `ESTree` impl for `Kind`.

The conversion from `Kind` (token kind) to the ESTree token type is the tricky part. You should be able to get codegen to do most of the work by adding `#[estree(rename = "Keyword")]` (etc.) to all the variants of `Kind`.
No doubt, you didn't need me to explain every one of the steps above. I've just included them for clarity, to (hopefully) save you a little time. Please excuse me if I'm "teaching grandpa to suck eggs".
I've laid a bit of groundwork for this in #17050 and #17052.
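The `Kind`-to-ESTree-type conversion described above amounts to a plain mapping. Here is a self-contained sketch with a made-up, heavily reduced `Kind` enum (the real one lives in `crates/oxc_parser/src/lexer/kind.rs` and has many more variants); the string values are the standard ESLint/TSESLint token `type` names, which is what the codegen `#[estree(rename = ...)]` attributes would produce:

```rust
// Hypothetical, much-reduced stand-in for the real `Kind` enum.
#[derive(Clone, Copy)]
enum Kind { Ident, Str, Num, LParen, Let, RegExp }

// Maps a token kind to the ESTree/TSESLint token `type` string.
fn to_tseslint_type(kind: Kind) -> &'static str {
    match kind {
        Kind::Ident => "Identifier",
        Kind::Str => "String",
        Kind::Num => "Numeric",
        Kind::LParen => "Punctuator",
        Kind::Let => "Keyword",
        Kind::RegExp => "RegularExpression",
    }
}

fn main() {
    assert_eq!(to_tseslint_type(Kind::Let), "Keyword");
    println!("{}", to_tseslint_type(Kind::Ident)); // prints "Identifier"
}
```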
Oh, actually, maybe we can get codegen to generate everything for us by:

- Don't put the `#[ast]` attr on `Token`.
- Instead, add this:

```rust
/// Dummy type to communicate the content of `Token` to `oxc_ast_tools`.
#[ast(foreign = Token)]
#[generate_derive(ESTree)]
#[expect(dead_code)]
struct TokenAlias {
    pub span: Span,
    pub kind: Kind,
    #[estree(skip)]
    pub _align: U128Align,
}

/// Zero-sized type which has alignment of `u128`
#[repr(transparent)]
struct U128Align([u128; 0]);
```

This tells the codegen "treat `Token` as if it's defined like this".

You'd need to add a "special case" to codegen for `U128Align` (search `ast_tools` for `PointerAlign`).

EDIT: No, this is a bad idea. We need to implement `ESTree` and `raw_deser` manually to handle the special case of extra fields on regexp tokens.
```rust
let tokens = self.ast.vec_from_iter(self.lexer.tokens().iter().filter_map(|token| {
    token.kind().to_tseslint_type().map(|ty| ast::Token {
        span: token.span(),
        r#type: self.ast.atom(ty),
        flags: None,
        pattern: None,
    })
}));
```
This is probably where the perf regression is coming from. Should be able to remove it once my other comments are actioned.
… structs (#17052)

The `#[ast]` macro adds `#[repr(C)]` to structs. For structs with a single field, use `#[repr(transparent)]` instead. This does not alter the memory layout, but it can change the ABI, which can make it more performant to pass such structs to functions in some cases, as it can change the registers used for passing. This will help with making lexer tokens serializable (#17025), since `Token` will have `#[ast]` added to it and we need it to continue to be `#[repr(transparent)]`.
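The layout claim in that commit message is easy to check directly: a `#[repr(transparent)]` wrapper has exactly the size and alignment of its single field. A minimal sketch with an illustrative type (not oxc's):

```rust
use std::mem::{align_of, size_of};

// A single-field struct: `#[repr(transparent)]` guarantees it has
// exactly the layout and ABI of its inner type.
#[repr(transparent)]
struct Wrapper(u64);

fn main() {
    assert_eq!(size_of::<Wrapper>(), size_of::<u64>());
    assert_eq!(align_of::<Wrapper>(), align_of::<u64>());
    println!("size={}", size_of::<Wrapper>()); // prints "size=8"
}
```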
Previously `u128` was not supported in the `assert_layouts` generator, because Rust changed the alignment of `u128` from 8 to 16, so the alignment depended on the Rust version. That change happened some time ago, and all Rust versions above our MSRV have `u128` with an alignment of 16, so we can now support `u128` (and also `i128`, `NonZeroU128`, `NonZeroI128`). Also support `u128` in the `raw_transfer` and `raw_transfer_lazy` generators. This will help with making lexer tokens serializable (#17025), since `Token` contains a `u128`.
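These alignment facts can be verified on any toolchain above the stated MSRV. Note also that a zero-length array such as `[u128; 0]` is zero-sized but keeps the element type's alignment, which is what makes an align-marker field like the `U128Align([u128; 0])` idea above cost nothing:

```rust
use std::mem::{align_of, size_of};

fn main() {
    // On all Rust versions above oxc's MSRV, `u128` is 16-byte aligned.
    assert_eq!(size_of::<u128>(), 16);
    assert_eq!(align_of::<u128>(), 16);
    // A zero-length array is zero-sized but keeps the element alignment,
    // so a marker field wrapping `[u128; 0]` adds no space to a struct.
    assert_eq!(size_of::<[u128; 0]>(), 0);
    assert_eq!(align_of::<[u128; 0]>(), 16);
    println!("u128: size={} align={}", size_of::<u128>(), align_of::<u128>());
}
```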

No description provided.