Trivia Storage #1809

xunilrj · 2021-11-22T09:59:25Z

Issue #1720 contains all the details of the decisions taken when we migrated the trivia to be attached to tokens.
Now we need to understand the performance impact and decide whether, and how, we want to improve it.

Today each token contains all its trivia. A statement like "\tlet a = 0;" is tokenized as: "[\tlet ][a ][= ][0][;]".
This means that "\tlet " together with its SyntaxKind is the key to the green cache.
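
To make the cache-key point concrete, here is a minimal sketch. It assumes a simplified cache keyed by `(kind, full token text)`; `Kind`, `distinct_cache_entries` and the two helper functions are hypothetical names for illustration, not rowan's API.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for `SyntaxKind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Kind {
    LetKw,
    Ident,
}

/// Counts how many distinct entries a cache keyed by `(kind, text)` would hold.
pub fn distinct_cache_entries(tokens: &[(Kind, &str)]) -> usize {
    tokens
        .iter()
        .copied()
        .collect::<HashSet<(Kind, &str)>>()
        .len()
}

/// The `let` keyword at three indentation levels, with trivia attached to the token.
pub fn let_tokens_with_trivia() -> Vec<(Kind, &'static str)> {
    vec![
        (Kind::LetKw, "let "),
        (Kind::LetKw, "\tlet "),
        (Kind::LetKw, "\t\tlet "),
    ]
}

/// The same three tokens if trivia were not part of the key.
pub fn let_tokens_without_trivia() -> Vec<(Kind, &'static str)> {
    vec![(Kind::LetKw, "let"); 3]
}
```

With trivia in the key, each indentation level produces its own cache entry for the same keyword. Without it, all three occurrences share one entry.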

The improvements we are aiming for are:
1 - Can we use less memory?
2 - Can we be faster?
3 - Can we cache more?

But first we need to understand the following:

Given the changes we made, GreenToken is 32 bytes larger than before. Has this increased memory consumption?
To access the token text, for example "let", we now have to calculate its slice inside the whole token string slice. Has this decreased performance? Other operations also became more complex.
The trivia storage has a Box<Vec<...>>. Is this double indirection creating performance problems?
Would it be better to cache trivia? The vast majority of trivia would be: "" (empty), " " (one space), "\t" (one tab), "\t\t" (two tabs), etc.
Can we store trivia more cheaply inside the GreenNode?
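
The slice calculation mentioned above can be sketched as follows. This is only an illustration under assumptions: `TokenWithTrivia` and its fields are hypothetical, and widths are taken to be byte lengths.

```rust
/// Hypothetical token that stores its full text (leading trivia + text +
/// trailing trivia) together with the byte widths of both trivia runs.
pub struct TokenWithTrivia {
    full_text: String,
    leading_width: usize,
    trailing_width: usize,
}

impl TokenWithTrivia {
    pub fn new(leading: &str, text: &str, trailing: &str) -> Self {
        Self {
            full_text: format!("{}{}{}", leading, text, trailing),
            leading_width: leading.len(),
            trailing_width: trailing.len(),
        }
    }

    /// Text including trivia: a plain borrow of the stored string.
    pub fn text_with_trivia(&self) -> &str {
        &self.full_text
    }

    /// Text without trivia: needs an offset computation and a
    /// bounds-checked sub-slice on every call.
    pub fn text(&self) -> &str {
        let end = self.full_text.len() - self.trailing_width;
        &self.full_text[self.leading_width..end]
    }
}
```

For `TokenWithTrivia::new("\t", "let", " ")`, `text_with_trivia()` returns `"\tlet "` and `text()` returns `"let"`. Whether the extra arithmetic matters in practice is exactly what the measurements would need to show.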

MichaReiser · 2021-11-22T10:21:27Z

Thanks, @xunilrj, for starting this discussion. Doing some real-world measurements is certainly the right start.

I want to briefly sum up the main concerns I have with the current approach and outline an alternative storage layout.

Extensibility

One consideration I want to add is the flexibility of the design. For example, what would it mean if we start distinguishing between single-line and multi-line comments, or if we introduce a newline trivia kind? Introducing a newline trivia kind would reduce the cases where we can use the optimized Whitespace variant, because we would have to use the Many variant instead.

tools/crates/rome_rowan/src/green/token.rs

Lines 17 to 24 in 9b74b52

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
#[allow(clippy::box_vec)]
pub enum GreenTokenTrivia {
    None,
    Whitespace(usize),
    Comments(usize),
    Many(Box<Vec<TriviaPiece>>),
}
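
As a rough illustration of the trade-off behind the `Box<Vec<...>>` in the `Many` variant, the sketch below compares the enum size with and without the box. `Piece` is a stand-in, since the real `TriviaPiece` fields are not shown in this thread, and `TriviaBoxed`, `TriviaInline` and `sizes` are hypothetical names.

```rust
use std::mem::size_of;

/// Stand-in for rowan's `TriviaPiece`; the fields are an assumption.
#[allow(dead_code)]
pub struct Piece {
    kind: u8,
    width: usize,
}

/// Mirrors the quoted enum: the `Many` case boxes the `Vec`.
#[allow(dead_code)]
pub enum TriviaBoxed {
    None,
    Whitespace(usize),
    Comments(usize),
    Many(Box<Vec<Piece>>),
}

/// Variant for comparison: the `Vec` is stored inline.
#[allow(dead_code)]
pub enum TriviaInline {
    None,
    Whitespace(usize),
    Comments(usize),
    Many(Vec<Piece>),
}

/// Returns `(size of boxed enum, size of inline enum)` in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<TriviaBoxed>(), size_of::<TriviaInline>())
}
```

A `Box<Vec<T>>` is one pointer wide while a `Vec<T>` is three words, so boxing keeps every token's trivia field smaller. The cost is that reading the pieces in the `Many` case follows two pointers (box, then the vector's buffer) instead of one.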

Familiarity

The other thing I would consider is how familiar a design is for people working on rowan. Tokens and Nodes use the HeaderSlice pattern, but trivia now uses a different storage representation (which is ok, if it brings some clear wins over HeaderSlice).

Total memory consumption

Rowan sacrifices some performance in favour of lower total memory consumption by trying to re-use data structures like nodes and tokens. This is still true, but we reduced what can be cached: tokens are now less likely to be found in the cache because they include their leading and trailing trivia, and we don't cache the trivia on its own.
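
Caching trivia on its own could look roughly like the sketch below. It is a generic string interner built on `Rc<str>`, not how rowan's cache works; `TriviaInterner`, `intern` and `distinct` are hypothetical names.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical interner: hands out one shared allocation per distinct trivia text.
#[derive(Default)]
pub struct TriviaInterner {
    seen: HashSet<Rc<str>>,
}

impl TriviaInterner {
    /// Returns the shared copy of `text`, allocating only on first sight.
    pub fn intern(&mut self, text: &str) -> Rc<str> {
        if let Some(existing) = self.seen.get(text) {
            return Rc::clone(existing);
        }
        let fresh: Rc<str> = Rc::from(text);
        self.seen.insert(Rc::clone(&fresh));
        fresh
    }

    /// Number of distinct trivia strings stored so far.
    pub fn distinct(&self) -> usize {
        self.seen.len()
    }
}
```

Because most trivia is one of a handful of strings (empty, one space, one or two tabs), a file with thousands of tokens would end up with only a few interned entries, and tokens could share them regardless of which keyword or identifier they belong to.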

Alternative storage layout

An alternative approach would be to use a GreenTrivia type using a HeaderSlice.

#[repr(u8)]
#[derive(Debug, Clone, Copy)]
pub enum GreenTriviaPieceKind {
    Whitespace,
    NewLine,
    Comment,
}

#[derive(Debug, Clone)]
pub struct GreenTriviaPiece {
    width: usize,
    kind: GreenTriviaPieceKind,
}

#[derive(Debug, Clone)]
pub struct GreenTriviaHead {
    // full width of all trivia it contains so that we don't need to iterate over all trivia
    width: usize,
}

#[derive(Clone)]
pub struct GreenTrivia {
    ptr: ThinArc<GreenTriviaHead, GreenTriviaPiece>,
}

impl GreenTrivia {
    fn header(&self) -> &GreenTriviaHead {
        &self.ptr.header
    }
    
    pub fn width(&self) -> usize {
        self.header().width
    }
    
    pub fn slice(&self) -> &[GreenTriviaPiece] {
        &self.ptr.slice()
    }
}

impl std::fmt::Debug for GreenTrivia {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("GreenTrivia")
            .field("width", &self.width())
            .field("n_trivia", &self.slice().len())
            .finish()
    }
}

#[repr(u16)]
pub enum SyntaxKind {
    LIST,
    WHITESPACE,
    COMMENT
}

pub struct GreenToken {}

pub struct GreenTokenHead {
    kind: SyntaxKind,
    leading_trivia: Option<GreenTrivia>,
    trailing_trivia: Option<GreenTrivia>,
}


fn main() {
    println!("GreenTriviaPiece: {}", std::mem::size_of::<GreenTriviaPiece>());
    println!("GreenTrivia: {}", std::mem::size_of::<GreenTrivia>());
    println!("GreenTokenHead: {}", std::mem::size_of::<GreenTokenHead>());
    
}

Playground

MichaReiser · 2021-11-29T14:41:34Z

@xunilrj what are your plans around trivia?

5 replies

xunilrj Dec 21, 2021
Author

I think some of these improvements would come from sharing between files, right? I think we can postpone the analysis until we start parsing multiple files. Then we can see what makes sense.

MichaReiser Dec 21, 2021

I believe some of these improvements would even be beneficial for a single file. Also, having a new line kind would be beneficial in the formatter to not have to search through the whole trivia text to count the lines between two statements.
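
A small sketch of the formatter benefit, assuming each `NewLine` piece represents exactly one line break. `PieceKind`, `Piece` and both functions are hypothetical and only loosely modelled on the `GreenTriviaPiece` proposal above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PieceKind {
    Whitespace,
    NewLine,
    Comment,
}

#[derive(Debug, Clone, Copy)]
pub struct Piece {
    pub kind: PieceKind,
    pub width: usize,
}

/// Without a `NewLine` kind: scan the whole trivia text byte by byte.
pub fn count_lines_by_scanning(trivia_text: &str) -> usize {
    trivia_text.bytes().filter(|b| *b == b'\n').count()
}

/// With a `NewLine` kind: look only at the piece metadata, never at the text.
pub fn count_lines_by_kind(pieces: &[Piece]) -> usize {
    pieces
        .iter()
        .filter(|p| p.kind == PieceKind::NewLine)
        .count()
}
```

For trivia such as `"\n\n\t// c\n"`, both functions report three line breaks, but the second one touches five small pieces instead of every byte of the text, which matters most when the trivia contains long comments.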

It would also be nice to avoid the Vec allocation inside the get_trivia operation in the tree sink, which turns out to be quite expensive.

I started a short prototype and changing it was surprisingly easy, except that TriviaPiece currently doesn't hold on to the text. The text is needed to implement any caching; otherwise any two comments of the same length would be considered equal.
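
The caching problem can be reproduced with a minimal sketch. The key layouts below are assumptions for illustration; `KeyWithoutText`, `KeyWithText` and the two counting functions are hypothetical, not the prototype's code.

```rust
use std::collections::HashSet;

/// Cache key built from metadata only: a kind flag and a width.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct KeyWithoutText {
    pub is_comment: bool,
    pub width: usize,
}

/// Cache key that also carries the text.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct KeyWithText {
    pub is_comment: bool,
    pub text: String,
}

/// Distinct cache entries when comments are keyed by kind and width only.
pub fn distinct_without_text(comments: &[&str]) -> usize {
    comments
        .iter()
        .map(|c| KeyWithoutText { is_comment: true, width: c.len() })
        .collect::<HashSet<_>>()
        .len()
}

/// Distinct cache entries when the comment text is part of the key.
pub fn distinct_with_text(comments: &[&str]) -> usize {
    comments
        .iter()
        .map(|c| KeyWithText { is_comment: true, text: c.to_string() })
        .collect::<HashSet<_>>()
        .len()
}
```

Given `["// foo", "// bar"]`, the metadata-only key collapses both comments into a single entry, so a cache lookup for `// bar` would wrongly return the trivia for `// foo`. Including the text keeps them apart.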

xunilrj Dec 21, 2021
Author

+1 on the NewLine.
I will take a look at get_trivia.

MichaReiser Dec 21, 2021

But I agree, I don't think this is urgent right now, and we may come back to this discussion in the future.

xunilrj Jan 13, 2022
Author

Trivia improvements were done here: #1901
I think we can wait until multiple files are being parsed to assess other improvements.
