Replies: 3 comments 1 reply
-
IMO, the slowness comes from fetch_next_token, which splits a text token into individual characters.
-
@jaytaph judging from your comments in Slack, I gather you traced the problem back to two issues that were resolved? Or does this discussion still need further investigation?
-
There are two main points that would need to change (I've tested them, and they bring parsing the https://adbook.com front page down from 45-50 seconds to 0.391 seconds).
I'm currently setting up a branch (once the current one is merged) with these changes, to see if and how we can fix the few remaining parse tests that will now fail. I think we can detect these edge cases in the tokenizer and deal with them manually.
-
@CharlesChen0823 @emwalker
We've noticed that with large blobs of JavaScript, the tokenizer takes a long time to process them. This is mostly because we now tokenize each character separately. In earlier setups of the tokenizer, a character or comment token could consist of multiple characters, even a whole script. Now each character of such a script is parsed individually, which causes a lot of overhead. This changed because in some scenarios the parser must work with separate newlines or spaces instead of whole text runs.
My suggestion would be to make the tokenizer "greedy": when we encounter characters that can create edge cases (like newlines, and sometimes spaces), we emit tokens for those specific characters, but when there isn't a chance of hitting these edge cases (for instance, when we deal with a complete script parsed via the RCDATA and ScriptData states), we can collect all the characters into a single text token and parse that only once. This would save a lot of time in most cases.
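A minimal sketch of what that greedy behavior could look like, purely illustrative and not Gosub's actual tokenizer API (the `Token`, `Tokenizer`, and `next_token` names here are hypothetical): a run of ordinary characters is collected into one text token, while edge-case characters are still emitted individually.

```rust
// Hypothetical greedy text tokenization sketch. Edge-case characters
// ('\n' and '<' in this toy example) get their own tokens; everything
// else is batched into a single Text token instead of one token per char.
#[derive(Debug)]
enum Token {
    Text(String), // a whole run of ordinary characters
    Newline,      // emitted separately so the parser can handle it
    TagOpen,      // stand-in for "leave this state and parse a tag"
}

struct Tokenizer {
    chars: Vec<char>,
    pos: usize,
}

impl Tokenizer {
    fn new(input: &str) -> Self {
        Self { chars: input.chars().collect(), pos: 0 }
    }

    fn next_token(&mut self) -> Option<Token> {
        let c = *self.chars.get(self.pos)?;
        match c {
            '\n' => { self.pos += 1; Some(Token::Newline) }
            '<'  => { self.pos += 1; Some(Token::TagOpen) }
            _ => {
                // Greedy path: consume everything up to the next edge case.
                let start = self.pos;
                while self.pos < self.chars.len()
                    && !matches!(self.chars[self.pos], '\n' | '<')
                {
                    self.pos += 1;
                }
                Some(Token::Text(self.chars[start..self.pos].iter().collect()))
            }
        }
    }
}

fn main() {
    let mut t = Tokenizer::new("let x = 1;\nconsole.log(x);<");
    while let Some(tok) = t.next_token() {
        println!("{:?}", tok); // Text("let x = 1;"), Newline, Text("console.log(x);"), TagOpen
    }
}
```

For a large inline script this turns tens of thousands of per-character token emissions into a handful of text tokens, which is the kind of reduction behind the 45-50s to ~0.4s measurement mentioned above.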