☂️ Attach trivia to tokens #1720

xunilrj · 2021-10-27T08:26:46Z

Description

part of: #1718

Pending Decisions

storing number of spaces/tabs (swift) or a string containing all whitespace (Roslyn)?
Token { kind: Ident, text: "foo", trailing_trivia: Trivia::spaces(1), span: Span(4) } (one option)
modelling trivia as cached references (similar to token/nodes) or as enum (Space, Tab, NewLine, SingleLineComment, - - MultiLineComment) and store them by value?
unifying GreenNode , GreenToken and GreenTrivia into a single GreenNode(Roslyn) or keep as is (Rust analyzser, Swift)?

Tasks

Modify lexer to generate two tokens for "\n ". "\n" will be appended to the trailing trivia of the last token, and " " will be prepended to the leading trivia of the next token. See Break newline and space "\n " into two tokens #1716
Fix lexer, because we actually want the "newline" token to always be in the leading, not in the trailing. Including the EOF token.
Changes to Rowan. See trivia attached to GreenTokens #1783 and typed api access to each piece of the trivia #1798
Changes to the parser. See LosslessTreeSinker attaching trivia to tokens #1801

Research

Lexer merges newline and trailing spaces

let text = "let a = 1; // nice variable \n /*hey*/ let b = 2; // another nice variable";
let (mut tokens, mut errors) = tokenize(text, 0);
let mut i = 0;
for token in &tokens {
    println!("{:?} {:?}", &text[i..(i + token.len)], token);
    i += token.len;
}

generates

"let" Token { kind: IDENT, len: 3 }
" " Token { kind: WHITESPACE, len: 1 }
"a" Token { kind: IDENT, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"=" Token { kind: EQ, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"1" Token { kind: NUMBER, len: 1 }
";" Token { kind: SEMICOLON, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"// nice variable " Token { kind: COMMENT, len: 17 }
"\n " Token { kind: WHITESPACE, len: 2 }
"/*hey*/" Token { kind: COMMENT, len: 7 }
" " Token { kind: WHITESPACE, len: 1 }
"let" Token { kind: IDENT, len: 3 }
" " Token { kind: WHITESPACE, len: 1 }
"b" Token { kind: IDENT, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"=" Token { kind: EQ, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"2" Token { kind: NUMBER, len: 1 }
";" Token { kind: SEMICOLON, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"// another nice variable" Token { kind: COMMENT, len: 24 }
"" Token { kind: EOF, len: 0 }

LosslessTreeSink

It seems possible to prepend and append trivia to tokens inside the LosslessTreeSink doing the following:

Removing eat_trivia and eat_n_trivia in LosslessTreeSink::start_node
LosslessTreeSink::do_token can eat and append trivia until finds a "new line"
After "new line", we can store the trivia somewhere and prepend them into the next token.

Doing this generates a trivia free and clean tree:

Node([email protected])
        Node([email protected])
        Token([email protected] "let")
                Node([email protected])
                        Node([email protected])
                                Node([email protected])
                                Token([email protected] "a")
                Token([email protected] "=")
                        Node([email protected])
                        Token([email protected] "1")
        Token([email protected] ";")
        Node([email protected])
        Token([email protected] "let")
                Node([email protected])
                        Node([email protected])
                                Node([email protected])
                                Token([email protected] "b")
                Token([email protected] "=")
                        Node([email protected])
                        Token([email protected] "2")
        Token([email protected] ";")

Before the tree was:

Node([email protected])
        Node([email protected])
        Token([email protected] "let")
        Token([email protected] " ")
                Node([email protected])
                        Node([email protected])
                                Node([email protected])
                                Token([email protected] "a")
                Token([email protected] " ")
                Token([email protected] "=")
                Token([email protected] " ")
                        Node([email protected])
                        Token([email protected] "1")
        Token([email protected] ";")
Token([email protected] " ")
Token([email protected] "// nice variable ")
Token([email protected] "\n ")
        Node([email protected])
        Token([email protected] "/*hey*/")
        Token([email protected] " ")
        Token([email protected] "let")
        Token([email protected] " ")
                Node([email protected])
                        Node([email protected])
                                Node([email protected])
                                Token([email protected] "b")
                Token([email protected] " ")
                Token([email protected] "=")
                Token([email protected] " ")
                        Node([email protected])
                        Token([email protected] "2")
        Token([email protected] ";")
Token([email protected] " ")
Token([email protected] "// another nice variable")

To eliminate the trivia `SyntaxKind`

The trivia processing in the LosslessTreeSink demands that the trivia are identifiable by its SyntaxKind. Therefore, to eliminate the SyntaxKind, we need to move this processing to the lexer.

In this case, the change would need to be done at Lexer::lex_token and Lexer::lex_template. Unfortunately, impl Iterator for Lexer<'_> is too late because it already works with Tokens.

The lexer is implemented with multiple functions returning LexerReturn, like

fn read_shebang(&mut self) -> LexerReturn {...}
fn read_regex(&mut self) -> LexerReturn {...}
fn resolve_plus(&mut self) -> LexerReturn {...}
fn lex_token(&mut self) -> LexerReturn {...}
fn lex_template(&mut self) -> LexerReturn {...}
...

LexerReturn is pub type LexerReturn = (Token, Option<Diagnostic>);. So we would need to consume the trailing trivia inside these functions. Boring, but doable.

Another option is to allow these functions to parse just its token and find leading/trailing trivia outside, in a more general function.

Here lex_token and lex_template seems to be the best option.

Another possible advantage of doing this in the lexer is that we can implement this as an option in the lexer:

pub struct Lexer<'src> {
	bytes: &'src [u8],
	cur: usize,
	state: LexerState,
	pub file_id: usize,
	returned_eof: bool,
        attach_trivia_to_tokens: bool // <- look here
}

I did a quick test draining all trivia from the lexer to test if the parser works, and it does work.

let text = "let a = 1; // nice variable \n /*hey*/ let b = 2; // another nice variable";
let (mut tokens, mut errors) = tokenize(text, 0);

let tokens: Vec<_> = tokens
    .drain(..)
    .filter(|x| (x.kind != SyntaxKind::WHITESPACE) && (x.kind != SyntaxKind::COMMENT))
    .collect();

The only complication I found is the Token::len. Now it contains the token length ("let" = 3). In this case we would need a Token::complete_length() (" let " = 5). We would need this inside the LosslessTreeSink to correctly point to the original &str.

Another implication of eliminating SyntaxKindwould be harder to have the trivia inside the token as enums. We can create an additional enum for this, of course, but maybe it is cheaper to have trivia as &str.

How to store the trivia inside the Green tree?

Swift

Swift has just pointers and lengths to trivia. The downside is that the trivia "loses" its parsing and needs to be parsed again. Swift documentation even reminds you that you should cache the parsing.

https://github.com/apple/swift/blob/7123d2614b5f222d03b3762cb110d27a9dd98e24/include/swift/Syntax/RawSyntax.h#L185

 struct TokenData {
    /// The pointers to the leading/trailing trivia and token texts. If their
    /// lengths are greater than 0, these always reside in the node's \c Arena.
    const char *LeadingTrivia;
    const char *TokenText;
    const char *TrailingTrivia;
    uint32_t LeadingTriviaLength;
    uint32_t TokenLength;
    uint32_t TrailingTriviaLength;
    /// The kind of token this "token" node represents.
    tok TokenKind;
  };

https://github.com/apple/swift/blob/7123d2614b5f222d03b3762cb110d27a9dd98e24/include/swift/Syntax/RawSyntax.h#L447

  /// Return pieces that make up the leading trivia of the token.
  /// Note that this triggers trivia parsing which may be expensive. If the
  /// trivia pieces are required multiple times, consider caching them.
  Trivia getLeadingTriviaPieces() const;

This solution looks interesting because we can keep the GreenToken with a fixed size and the vast majority of cases trivia will be trivially reparsed (you saw this pun coming 😛).

The problem is that we want out Green tree to be language independent. In this case to reparse the trivia we would need to know from which language the trivia was generated from. We would need to tag with a enum, or carry a function pointer to the trivia parser.

C#

Roslyn, surprisingly or not, uses a more C#-ish approach. There is no storage for the trivia inside the GreenNode.

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/Core/Portable/Syntax/GreenNode.cs

internal abstract class GreenNode : IObjectWritable
{
        ...
        public virtual GreenNode? GetLeadingTriviaCore() { return null; }
        public virtual GreenNode? GetTrailingTriviaCore() { return null; }
        ...
}

Trivia has its own class:

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/CSharp/Portable/Syntax/InternalSyntax/SyntaxTrivia.cs

internal class SyntaxTrivia : CSharpSyntaxNode
{
        public readonly string Text;
    ...
}

Just for reference, CSharpSyntaxNode is GreenNode.

internal abstract class CSharpSyntaxNode : GreenNode
{
    ...
}

So SyntaxTrivia is GreenNode. And we have

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/CSharp/Portable/Syntax/InternalSyntax/SyntaxToken.SyntaxIdentifierWithTrivia.cs

 internal class SyntaxIdentifierWithTrivia : SyntaxIdentifierExtended
 {
     private readonly GreenNode _leading;
     private readonly GreenNode _trailing;
}

that is used as

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/CSharp/Portable/Syntax/InternalSyntax/SyntaxToken.SyntaxIdentifierWithTrivia.cs

internal partial class SyntaxToken
{
    ...
    public override SyntaxToken TokenWithLeadingTrivia(GreenNode trivia)
    {
        return new SyntaxIdentifierWithTrivia(this.contextualKind, this.TextField, this.valueText, trivia, _trailing, this.GetDiagnostics(), this.GetAnnotations());
     }
    ...
}

My current preference is:

1 - GreenToken will have an enum with two cases: the most common one (a limited number of whitespaces); and a Vec with fallback to complex cases;
2 - SyntaxToken will expose trivia as a wrapper to the GreenToken. We will put helper methods here in the future.

Trivia Memory Comsuption

We decided to "flat" the storage to decrease memory comsuption

pub enum Trivia {
	Whitespace(std::num::NonZeroUsize),
  Comment(std::num::NonZeroUsize),
}

pub enum GreenTokenTrivia {
	None,
	One(Trivia),
	Many(Box<Vec<Trivia>>),
}

print-type-size type: `Trivia`: 16 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Whitespace`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Comment`: 8 bytes
print-type-size         field `.0`: 8 bytes

print-type-size type: `GreenTokenTrivia`: 24 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `One`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `Many`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `None`: 0 bytes

pub enum GreenTokenTrivia {
	None,
	Whitespace(std::num::NonZeroUsize),
  Comment(std::num::NonZeroUsize),
	Many(Box<Vec<Trivia>>),
}

print-type-size type: `GreenTokenTrivia`: 16 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Whitespace`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Comment`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Many`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `None`: 0 bytes

The Box<Vec<...>> is polemic. There is even a lint rule to avoid it. Because of this and other issues we have have this discussion to address how we can improve trivia storage.

Discussions

#1809

PRs

#1716
#1738 (Old PR. Will be abandoned).
#1783
#1798
#1801

The text was updated successfully, but these errors were encountered:

ematipico · 2021-10-27T10:45:04Z

Fork rowan

This is done already, although we should decide on a different name that is different from rowan_rome?

xunilrj · 2021-10-27T14:26:12Z

Fork rowan

This is done already, although we should decide on a different name that is different from rowan_rome?

Our crates are lacking a pattern. rslint fork, for example, does not have "rome" as prefix. I would vote for no prefixes, to be a par with other crates. We can revisit this later, if needed.

ematipico · 2021-10-27T17:09:47Z

Fork rowan

This is done already, although we should decide on a different name that is different from rowan_rome?

Our crates are lacking a pattern. rslint fork, for example, does not have "rome" as prefix. I would vote for no prefixes, to be a par with other crates. We can revisit this later, if needed.

Yeah makes sense. Let's take down the prefix.

mrkldshv · 2021-10-27T22:01:03Z

Hi! Can I work on removing rome prefixes from crates?

xunilrj · 2021-10-28T16:14:41Z

@mrkldshv hum... We have another issue about the final name of these crates: #1677. Would be nice to bump the PR and check if it needs to be updated.

mrkldshv · 2021-10-28T19:13:12Z

@xunilrj I see, thanks for reply. I will try to update this PR.

MichaReiser · 2021-10-29T09:00:24Z

@mrkldshv we started a discord discussion to align on our crate naming. I suggest waiting on the outcome before working on the PR. I'm also not sure if it makes sense to build on top of my stale PR. You probably have an easier time starting from scratch.

mrkldshv · 2021-10-29T10:41:01Z

@MichaReiser thanks for the information. I'll wait for outcome and probably create new PR based on the decision.

xunilrj added the umbrella Issue to track a collection of other issues label Oct 27, 2021

xunilrj self-assigned this Oct 27, 2021

xunilrj mentioned this issue Oct 27, 2021

☂️ Rome Parser Architecture Changes #1718

Closed

3 tasks

This was referenced Nov 1, 2021

☂️ Weekly goals - week 43 #1717

Closed

Weekly goals - week 44 #1737

Closed

This was referenced Nov 2, 2021

Attaching trivia to GreenToken #1738

Closed

Break newline and space "\n " into two tokens #1716

Merged

ematipico mentioned this issue Nov 8, 2021

☂️ Weekly goals - week 45 #1754

Closed

xunilrj mentioned this issue Nov 12, 2021

trivia attached to GreenTokens #1783

Merged

5 tasks

ematipico mentioned this issue Nov 17, 2021

☂️ Weekly goals - week 46 #1796

Closed

This was referenced Nov 17, 2021

typed api access to each piece of the trivia #1798

Merged

LosslessTreeSinker attaching trivia to tokens #1801

Merged

xunilrj closed this as completed Nov 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

☂️ Attach trivia to tokens #1720

☂️ Attach trivia to tokens #1720

xunilrj commented Oct 27, 2021 •

edited

Loading

ematipico commented Oct 27, 2021

xunilrj commented Oct 27, 2021

ematipico commented Oct 27, 2021

mrkldshv commented Oct 27, 2021

xunilrj commented Oct 28, 2021

mrkldshv commented Oct 28, 2021

MichaReiser commented Oct 29, 2021

mrkldshv commented Oct 29, 2021

☂️ Attach trivia to tokens #1720

☂️ Attach trivia to tokens #1720

Comments

xunilrj commented Oct 27, 2021 • edited Loading

Description

Pending Decisions

Tasks

Research

Lexer merges newline and trailing spaces

LosslessTreeSink

To eliminate the trivia SyntaxKind

How to store the trivia inside the Green tree?

Swift

C#

Trivia Memory Comsuption

Discussions

PRs

ematipico commented Oct 27, 2021

xunilrj commented Oct 27, 2021

ematipico commented Oct 27, 2021

mrkldshv commented Oct 27, 2021

xunilrj commented Oct 28, 2021

mrkldshv commented Oct 28, 2021

MichaReiser commented Oct 29, 2021

mrkldshv commented Oct 29, 2021

xunilrj commented Oct 27, 2021 •

edited

Loading

To eliminate the trivia `SyntaxKind`