Replies: 3 comments 1 reply
-
IMO, the slowness comes from fetch_next_token, which splits a text token into individual characters.
-
@jaytaph judging from your comments in Slack, I gather you traced the problem back to two issues that were resolved? Or does this discussion still need further investigation?
-
There are two main points that would need to change (I've tested them, and they bring parsing the https://adbook.com front page down from 45-50 seconds to 0.391 seconds).
I'm currently setting up a branch (once the current one is merged) with these changes, to see if and how we can fix the few remaining parse tests that will now fail. I think we can detect these edge cases in the tokenizer and deal with them manually.
-
@CharlesChen0823 @emwalker
We've noticed that with large blobs of JavaScript, the tokenizer takes a long time to process them. This is mostly because we now tokenize each character separately. In earlier setups of the tokenizer, a character or comment token could consist of multiple characters, even a whole script. Now each character of such a script is parsed individually, which causes a lot of overhead. This changed because in some scenarios the parser must work with separate newlines or spaces instead of whole text runs.
My suggestion would be to make the tokenizer "greedy": when we encounter characters that can create edge cases (like newlines, and sometimes spaces), we emit tokens for those specific characters, but when there isn't a chance of hitting these edge cases (for instance, when we deal with a complete script parsed via the RCDATA and ScriptData states), we can collect all the characters into a single text token and parse that only once. This would save a lot of time in most cases.
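A minimal sketch of what that greedy behavior could look like, purely illustrative and not Gosub's actual tokenizer API (the `Token`, `Tokenizer`, and `next_token` names here are hypothetical): a run of ordinary characters is collected into one text token, while edge-case characters are still emitted individually.

```rust
// Hypothetical greedy text tokenization sketch. Edge-case characters
// ('\n' and '<' in this toy example) get their own tokens; everything
// else is batched into a single Text token instead of one token per char.
#[derive(Debug)]
enum Token {
    Text(String), // a whole run of ordinary characters
    Newline,      // emitted separately so the parser can handle it
    TagOpen,      // stand-in for "leave this state and parse a tag"
}

struct Tokenizer {
    chars: Vec<char>,
    pos: usize,
}

impl Tokenizer {
    fn new(input: &str) -> Self {
        Self { chars: input.chars().collect(), pos: 0 }
    }

    fn next_token(&mut self) -> Option<Token> {
        let c = *self.chars.get(self.pos)?;
        match c {
            '\n' => { self.pos += 1; Some(Token::Newline) }
            '<'  => { self.pos += 1; Some(Token::TagOpen) }
            _ => {
                // Greedy path: consume everything up to the next edge case.
                let start = self.pos;
                while self.pos < self.chars.len()
                    && !matches!(self.chars[self.pos], '\n' | '<')
                {
                    self.pos += 1;
                }
                Some(Token::Text(self.chars[start..self.pos].iter().collect()))
            }
        }
    }
}

fn main() {
    let mut t = Tokenizer::new("let x = 1;\nconsole.log(x);<");
    while let Some(tok) = t.next_token() {
        println!("{:?}", tok); // Text("let x = 1;"), Newline, Text("console.log(x);"), TagOpen
    }
}
```

For a large inline script this turns tens of thousands of per-character token emissions into a handful of text tokens, which is the kind of reduction behind the 45-50s to ~0.4s measurement mentioned above.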