Skip to content
This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

☂️ Attach trivia to tokens #1720

Closed
4 tasks done
xunilrj opened this issue Oct 27, 2021 · 8 comments
Closed
4 tasks done

☂️ Attach trivia to tokens #1720

xunilrj opened this issue Oct 27, 2021 · 8 comments
Assignees
Labels
umbrella Issue to track a collection of other issues

Comments

@xunilrj
Copy link
Contributor

xunilrj commented Oct 27, 2021

Description

part of: #1718

Pending Decisions

  • storing number of spaces/tabs (swift) or a string containing all whitespace (Roslyn)?
  • Token { kind: Ident, text: "foo", trailing_trivia: Trivia::spaces(1), span: Span(4) } (one option)
    modelling trivia as cached references (similar to token/nodes) or as enum (Space, Tab, NewLine, SingleLineComment, - - MultiLineComment) and store them by value?
  • unifying GreenNode , GreenToken and GreenTrivia into a single GreenNode(Roslyn) or keep as is (Rust analyzser, Swift)?

Tasks

Research

Lexer merges newline and trailing spaces

let text = "let a = 1; // nice variable \n /*hey*/ let b = 2; // another nice variable";
let (mut tokens, mut errors) = tokenize(text, 0);
let mut i = 0;
for token in &tokens {
    println!("{:?} {:?}", &text[i..(i + token.len)], token);
    i += token.len;
}

generates

"let" Token { kind: IDENT, len: 3 }
" " Token { kind: WHITESPACE, len: 1 }
"a" Token { kind: IDENT, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"=" Token { kind: EQ, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"1" Token { kind: NUMBER, len: 1 }
";" Token { kind: SEMICOLON, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"// nice variable " Token { kind: COMMENT, len: 17 }
"\n " Token { kind: WHITESPACE, len: 2 }
"/*hey*/" Token { kind: COMMENT, len: 7 }
" " Token { kind: WHITESPACE, len: 1 }
"let" Token { kind: IDENT, len: 3 }
" " Token { kind: WHITESPACE, len: 1 }
"b" Token { kind: IDENT, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"=" Token { kind: EQ, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"2" Token { kind: NUMBER, len: 1 }
";" Token { kind: SEMICOLON, len: 1 }
" " Token { kind: WHITESPACE, len: 1 }
"// another nice variable" Token { kind: COMMENT, len: 24 }
"" Token { kind: EOF, len: 0 }

LosslessTreeSink

It seems possible to prepend and append trivia to tokens inside the LosslessTreeSink doing the following:

  • Removing eat_trivia and eat_n_trivia in LosslessTreeSink::start_node
  • LosslessTreeSink::do_token can eat and append trivia until finds a "new line"
  • After "new line", we can store the trivia somewhere and prepend them into the next token.

Doing this generates a trivia free and clean tree:

Before the tree was:

To eliminate the trivia SyntaxKind

The trivia processing in the LosslessTreeSink demands that the trivia are identifiable by its SyntaxKind. Therefore, to eliminate the SyntaxKind, we need to move this processing to the lexer.

In this case, the change would need to be done at Lexer::lex_token and Lexer::lex_template. Unfortunately, impl Iterator for Lexer<'_> is too late because it already works with Tokens.

The lexer is implemented with multiple functions returning LexerReturn, like

fn read_shebang(&mut self) -> LexerReturn {...}
fn read_regex(&mut self) -> LexerReturn {...}
fn resolve_plus(&mut self) -> LexerReturn {...}
fn lex_token(&mut self) -> LexerReturn {...}
fn lex_template(&mut self) -> LexerReturn {...}
...

LexerReturn is pub type LexerReturn = (Token, Option<Diagnostic>);. So we would need to consume the trailing trivia inside these functions. Boring, but doable.

Another option is to allow these functions to parse just its token and find leading/trailing trivia outside, in a more general function.

Here lex_token and lex_template seems to be the best option.

Another possible advantage of doing this in the lexer is that we can implement this as an option in the lexer:

pub struct Lexer<'src> {
	bytes: &'src [u8],
	cur: usize,
	state: LexerState,
	pub file_id: usize,
	returned_eof: bool,
        attach_trivia_to_tokens: bool // <- look here
}

I did a quick test draining all trivia from the lexer to test if the parser works, and it does work.

let text = "let a = 1; // nice variable \n /*hey*/ let b = 2; // another nice variable";
let (mut tokens, mut errors) = tokenize(text, 0);

let tokens: Vec<_> = tokens
    .drain(..)
    .filter(|x| (x.kind != SyntaxKind::WHITESPACE) && (x.kind != SyntaxKind::COMMENT))
    .collect();

The only complication I found is the Token::len. Now it contains the token length ("let" = 3). In this case we would need a Token::complete_length() (" let " = 5). We would need this inside the LosslessTreeSink to correctly point to the original &str.

Another implication of eliminating SyntaxKindwould be harder to have the trivia inside the token as enums. We can create an additional enum for this, of course, but maybe it is cheaper to have trivia as &str.

How to store the trivia inside the Green tree?

Swift

Swift has just pointers and lengths to trivia. The downside is that the trivia "loses" its parsing and needs to be parsed again. Swift documentation even reminds you that you should cache the parsing.

https://github.com/apple/swift/blob/7123d2614b5f222d03b3762cb110d27a9dd98e24/include/swift/Syntax/RawSyntax.h#L185

 struct TokenData {
    /// The pointers to the leading/trailing trivia and token texts. If their
    /// lengths are greater than 0, these always reside in the node's \c Arena.
    const char *LeadingTrivia;
    const char *TokenText;
    const char *TrailingTrivia;
    uint32_t LeadingTriviaLength;
    uint32_t TokenLength;
    uint32_t TrailingTriviaLength;
    /// The kind of token this "token" node represents.
    tok TokenKind;
  };

https://github.com/apple/swift/blob/7123d2614b5f222d03b3762cb110d27a9dd98e24/include/swift/Syntax/RawSyntax.h#L447

  /// Return pieces that make up the leading trivia of the token.
  /// Note that this triggers trivia parsing which may be expensive. If the
  /// trivia pieces are required multiple times, consider caching them.
  Trivia getLeadingTriviaPieces() const;

This solution looks interesting because we can keep the GreenToken with a fixed size and the vast majority of cases trivia will be trivially reparsed (you saw this pun coming 😛).

The problem is that we want out Green tree to be language independent. In this case to reparse the trivia we would need to know from which language the trivia was generated from. We would need to tag with a enum, or carry a function pointer to the trivia parser.

C#

Roslyn, surprisingly or not, uses a more C#-ish approach. There is no storage for the trivia inside the GreenNode.

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/Core/Portable/Syntax/GreenNode.cs

internal abstract class GreenNode : IObjectWritable
{
        ...
        public virtual GreenNode? GetLeadingTriviaCore() { return null; }
        public virtual GreenNode? GetTrailingTriviaCore() { return null; }
        ...
}

Trivia has its own class:

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/CSharp/Portable/Syntax/InternalSyntax/SyntaxTrivia.cs

internal class SyntaxTrivia : CSharpSyntaxNode
{
        public readonly string Text;
    ...
}

Just for reference, CSharpSyntaxNode is GreenNode.

internal abstract class CSharpSyntaxNode : GreenNode
{
    ...
}

So SyntaxTrivia is GreenNode. And we have

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/CSharp/Portable/Syntax/InternalSyntax/SyntaxToken.SyntaxIdentifierWithTrivia.cs

 internal class SyntaxIdentifierWithTrivia : SyntaxIdentifierExtended
 {
     private readonly GreenNode _leading;
     private readonly GreenNode _trailing;
}

that is used as

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/CSharp/Portable/Syntax/InternalSyntax/SyntaxToken.SyntaxIdentifierWithTrivia.cs

internal partial class SyntaxToken
{
    ...
    public override SyntaxToken TokenWithLeadingTrivia(GreenNode trivia)
    {
        return new SyntaxIdentifierWithTrivia(this.contextualKind, this.TextField, this.valueText, trivia, _trailing, this.GetDiagnostics(), this.GetAnnotations());
     }
    ...
}

My current preference is:

1 - GreenToken will have an enum with two cases: the most common one (a limited number of whitespaces); and a Vec with fallback to complex cases;
2 - SyntaxToken will expose trivia as a wrapper to the GreenToken. We will put helper methods here in the future.

Trivia Memory Comsuption

We decided to "flat" the storage to decrease memory comsuption

pub enum Trivia {
	Whitespace(std::num::NonZeroUsize),
  Comment(std::num::NonZeroUsize),
}

pub enum GreenTokenTrivia {
	None,
	One(Trivia),
	Many(Box<Vec<Trivia>>),
}

print-type-size type: `Trivia`: 16 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Whitespace`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Comment`: 8 bytes
print-type-size         field `.0`: 8 bytes

print-type-size type: `GreenTokenTrivia`: 24 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `One`: 16 bytes
print-type-size         field `.0`: 16 bytes
print-type-size     variant `Many`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `None`: 0 bytes

pub enum GreenTokenTrivia {
	None,
	Whitespace(std::num::NonZeroUsize),
  Comment(std::num::NonZeroUsize),
	Many(Box<Vec<Trivia>>),
}

print-type-size type: `GreenTokenTrivia`: 16 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Whitespace`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Comment`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `Many`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `None`: 0 bytes

The Box<Vec<...>> is polemic. There is even a lint rule to avoid it. Because of this and other issues we have have this discussion to address how we can improve trivia storage.

Discussions

#1809

PRs

#1716
#1738 (Old PR. Will be abandoned).
#1783
#1798
#1801

@xunilrj xunilrj added the umbrella Issue to track a collection of other issues label Oct 27, 2021
@xunilrj xunilrj self-assigned this Oct 27, 2021
@ematipico
Copy link
Contributor

Fork rowan

This is done already, although we should decide on a different name that is different from rowan_rome?

@xunilrj
Copy link
Contributor Author

xunilrj commented Oct 27, 2021

Fork rowan

This is done already, although we should decide on a different name that is different from rowan_rome?

Our crates are lacking a pattern. rslint fork, for example, does not have "rome" as prefix. I would vote for no prefixes, to be a par with other crates. We can revisit this later, if needed.

@ematipico
Copy link
Contributor

Fork rowan

This is done already, although we should decide on a different name that is different from rowan_rome?

Our crates are lacking a pattern. rslint fork, for example, does not have "rome" as prefix. I would vote for no prefixes, to be a par with other crates. We can revisit this later, if needed.

Yeah makes sense. Let's take down the prefix.

@mrkldshv
Copy link
Contributor

Hi! Can I work on removing rome prefixes from crates?

@xunilrj
Copy link
Contributor Author

xunilrj commented Oct 28, 2021

@mrkldshv hum... We have another issue about the final name of these crates: #1677. Would be nice to bump the PR and check if it needs to be updated.

@mrkldshv
Copy link
Contributor

@xunilrj I see, thanks for reply. I will try to update this PR.

@MichaReiser
Copy link
Contributor

@mrkldshv we started a discord discussion to align on our crate naming. I suggest waiting on the outcome before working on the PR. I'm also not sure if it makes sense to build on top of my stale PR. You probably have an easier time starting from scratch.

@mrkldshv
Copy link
Contributor

@MichaReiser thanks for the information. I'll wait for outcome and probably create new PR based on the decision.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
umbrella Issue to track a collection of other issues
Projects
None yet
Development

No branches or pull requests

4 participants