In existing browsers, how does the parser interact with the DOM? (research thread) #116

emwalker · 2023-10-08T00:44:14Z

emwalker
Oct 8, 2023
Collaborator

In the case of Servo, Gecko, Chromium, and other open source browsers, how do the tokenizer, parser and DOM interact? Is there an intermediate representation, in which an IR tree is first built up and then converted into DOM nodes? Is the DOM passed around and mutated in place without the benefit of an IR? At what point does script execution take over, and what does that handoff look like? What are the different collaborating classes and structs that are involved to make all of this happen?

Servo
Gecko
Chromium
WebKit

emwalker · 2023-10-08T01:52:25Z

emwalker
Oct 8, 2023
Collaborator Author

Servo

$ gh repo clone servo/servo

HTML5 parsing begins in ScriptThread.load
Parsing is carried out by ServoParser, which has fields for a Document and a Tokenizer, starting with the ServoParser::parse_html_document associated function
Work for the parser is first queued up
If the parser is not in a suspended state, parse_sync is then called
The parse_sync function delegates to do_parse_sync, which next invokes tokenization, feeding the work that has been queued up so far
There are three tokenizers that are available. In the HTML5 code path, html::Tokenizer is the one that will be used.
html::Tokenizer has an inner field to which it delegates, which holds an html5ever::tokenizer::Tokenizer
html5ever is another Servo project
html5ever::tokenizer::Tokenizer is passed an html5ever::tree_builder::TreeBuilder as an argument
As feed is called on Servo's tokenizer, Servo then calls feed on the html5ever tokenizer it holds
Through a roundabout call chain, the html5ever tokenizer will call tokenizer.step, calling process_token on the "sink" it was provided (possibly the tree builder, or possibly Servo's Sink struct)
The tree builder (an html5ever struct) will in turn call various methods on the Sink that Servo provided
Sink holds a reference to the document and is responsible for mutating it as tokens arrive
Gradually, a document is built up. If at some point a script is encountered, tokenization exits early, presumably so the script can mutate the Document
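
To make the chain above concrete, here is a minimal sketch of a tokenizer that delegates to a tree builder, which in turn delegates to a sink that owns the document. All names (TokenSink, TreeSink, Step, and so on) are hypothetical stand-ins and not the html5ever or Servo API, and the "tokenizer" simply drains a queue of ready-made tokens. The one behavior it tries to capture is the early exit when a script end tag is seen.

```rust
use std::collections::VecDeque;

// Hypothetical types; not the html5ever or Servo API.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    StartTag(String),
    EndTag(String),
    Text(String),
}

/// What the tokenizer should do after a token has been handled.
#[derive(Debug, PartialEq)]
enum Step {
    Continue,
    /// A script element was closed; stop so the script can run.
    Script,
}

/// Anything that can receive tokens from the tokenizer.
trait TokenSink {
    fn process_token(&mut self, token: Token) -> Step;
}

/// Receives tree-level instructions and mutates the document.
trait TreeSink {
    fn append_element(&mut self, name: &str);
    fn append_text(&mut self, text: &str);
    fn pop(&mut self);
}

/// Stand-in for the document: records nodes as indented lines.
#[derive(Default)]
struct Document {
    lines: Vec<String>,
    depth: usize,
}

impl TreeSink for Document {
    fn append_element(&mut self, name: &str) {
        self.lines.push(format!("{}<{}>", "  ".repeat(self.depth), name));
        self.depth += 1;
    }
    fn append_text(&mut self, text: &str) {
        self.lines.push(format!("{}#text {}", "  ".repeat(self.depth), text));
    }
    fn pop(&mut self) {
        self.depth = self.depth.saturating_sub(1);
    }
}

/// The tree builder is itself a token sink, and it owns a tree sink.
struct TreeBuilder<S: TreeSink> {
    sink: S,
}

impl<S: TreeSink> TokenSink for TreeBuilder<S> {
    fn process_token(&mut self, token: Token) -> Step {
        match token {
            Token::StartTag(name) => {
                self.sink.append_element(&name);
                Step::Continue
            }
            Token::Text(text) => {
                self.sink.append_text(&text);
                Step::Continue
            }
            Token::EndTag(name) => {
                self.sink.pop();
                if name == "script" { Step::Script } else { Step::Continue }
            }
        }
    }
}

/// Toy tokenizer: the queue holds tokens that are already built.
struct Tokenizer<S: TokenSink> {
    sink: S,
    queue: VecDeque<Token>,
}

impl<S: TokenSink> Tokenizer<S> {
    /// Feed queued tokens to the sink, returning early on a script.
    fn feed(&mut self) -> Step {
        while let Some(token) = self.queue.pop_front() {
            if self.sink.process_token(token) == Step::Script {
                return Step::Script;
            }
        }
        Step::Continue
    }
}
```

When feed returns Step::Script, the remaining tokens stay in the queue, so the caller can run the script and then call feed again to resume.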

Not sure if this is helpful for others. I learned a little about the various collaborators involved in Servo, which provides useful context.

3 replies

jaytaph Oct 10, 2023
Maintainer

I'm curious about the sink part. The sink contains the document that is being filled by the parser, which itself is added to the sink as well. This probably means that the tokenizer just emits tokens, but the parser could be an HTML5 parser, an XML parser, or even an old-school HTML4 parser or anything else.

It seems that we currently have a more pipelined system: input-stream -> tokenizer -> html5parser -> document output

I can understand that when we deal with XML, we would need to have a second pipeline, or somehow dynamically change the parser from html5 to xml, while still relying on the same output.

It might be a good idea to separate our output from the parser through a sink-type system as well. Basically we emit data (which might even be nodes?) directly to the output, which in turn decides how to deal with it (generate a simple node-tree as we do now, convert it to a dom tree etc?).
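
A rough sketch of what such a sink-type output could look like, assuming the parser emits simple events and each output decides what to do with them. ParseEvent, OutputSink, NodeTree, Serializer and run_parser are all made-up names for illustration, and run_parser only replays a prepared list of events in place of a real parser.

```rust
// Hypothetical event type the parser would emit instead of building a tree.
#[derive(Debug, Clone, PartialEq)]
enum ParseEvent {
    Open(String),
    Text(String),
    Close,
}

/// The only thing the parser needs to know about its output.
trait OutputSink {
    fn emit(&mut self, event: ParseEvent);
}

#[derive(Debug, PartialEq)]
enum NodeKind {
    Root,
    Element(String),
    Text(String),
}

struct Node {
    kind: NodeKind,
    children: Vec<usize>,
}

/// Output 1: a simple node tree stored in an arena; index 0 is the root.
struct NodeTree {
    nodes: Vec<Node>,
    open: Vec<usize>,
}

impl NodeTree {
    fn new() -> Self {
        let root = Node { kind: NodeKind::Root, children: Vec::new() };
        NodeTree { nodes: vec![root], open: vec![0] }
    }

    /// Add a node under the innermost open element and return its index.
    fn add(&mut self, kind: NodeKind) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { kind, children: Vec::new() });
        let parent = *self.open.last().expect("root is never popped");
        self.nodes[parent].children.push(id);
        id
    }
}

impl OutputSink for NodeTree {
    fn emit(&mut self, event: ParseEvent) {
        match event {
            ParseEvent::Open(name) => {
                let id = self.add(NodeKind::Element(name));
                self.open.push(id);
            }
            ParseEvent::Text(text) => {
                self.add(NodeKind::Text(text));
            }
            ParseEvent::Close => {
                if self.open.len() > 1 {
                    self.open.pop();
                }
            }
        }
    }
}

/// Output 2: write the events straight back out as markup.
#[derive(Default)]
struct Serializer {
    out: String,
    open: Vec<String>,
}

impl OutputSink for Serializer {
    fn emit(&mut self, event: ParseEvent) {
        match event {
            ParseEvent::Open(name) => {
                self.out.push_str(&format!("<{}>", name));
                self.open.push(name);
            }
            ParseEvent::Text(text) => self.out.push_str(&text),
            ParseEvent::Close => {
                if let Some(name) = self.open.pop() {
                    self.out.push_str(&format!("</{}>", name));
                }
            }
        }
    }
}

/// Stand-in for a parser: it replays events into whatever sink it is given.
fn run_parser(events: &[ParseEvent], sink: &mut dyn OutputSink) {
    for event in events {
        sink.emit(event.clone());
    }
}
```

The same run_parser call works with either output, which is the property we would want if an HTML5 parser and an XML parser were to share one output.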

jaytaph Oct 10, 2023
Maintainer

Since we have tokenization and parsing in separate stages (the tokenizer actually fills up a token queue, which is read by the parser), we might even be able to run the tokenizer and the parser on separate threads. The tokenizer only fills the queue until it reaches the end of the stream, and the parser just waits until something is on the queue to parse, or, if nothing is there, does "other stuff" (tm). I'm not sure how badly this is needed in the end, as there is probably not much to do in this phase of processing, but we might be able to get "full speed" if we manage to get the tokenizer onto a different core from the parser in these cases. (Again, not sure if this makes any sense, but it is a relatively easy thing to do, and maybe even to benchmark.)
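
As a starting point for such an experiment, here is a minimal sketch that uses a std::sync::mpsc channel as the token queue, with the tokenizer on a spawned thread and the parser on the calling thread. The tokenizer and parser are toy stand-ins (tags and text only), and every function and type name here is an assumption made for illustration.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
enum Token {
    Tag(String),
    Text(String),
    Eof,
}

/// Toy tokenizer: splits on '<' and '>' only. No attributes, comments, or entities.
fn tokenize(input: &str, tx: mpsc::Sender<Token>) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix('<') {
            let end = after.find('>').unwrap_or(after.len());
            tx.send(Token::Tag(after[..end].to_string())).expect("parser hung up");
            rest = after.get(end + 1..).unwrap_or("");
        } else {
            let end = rest.find('<').unwrap_or(rest.len());
            tx.send(Token::Text(rest[..end].to_string())).expect("parser hung up");
            rest = &rest[end..];
        }
    }
    tx.send(Token::Eof).expect("parser hung up");
}

/// Toy parser: blocks on the queue and records an indented outline.
fn parse(rx: mpsc::Receiver<Token>) -> Vec<String> {
    let mut outline = Vec::new();
    let mut depth = 0usize;
    for token in rx {
        match token {
            // An end tag closes the innermost open element.
            Token::Tag(name) if name.starts_with('/') => {
                depth = depth.saturating_sub(1);
            }
            Token::Tag(name) => {
                outline.push(format!("{}{}", "  ".repeat(depth), name));
                depth += 1;
            }
            Token::Text(text) => {
                outline.push(format!("{}#text {}", "  ".repeat(depth), text));
            }
            Token::Eof => break,
        }
    }
    outline
}

/// Run the tokenizer on its own thread and the parser on the caller's thread.
fn parse_on_two_threads(input: &str) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let owned = input.to_string();
    let tokenizer = thread::spawn(move || tokenize(&owned, tx));
    let outline = parse(rx);
    tokenizer.join().expect("tokenizer thread panicked");
    outline
}
```

mpsc::channel is unbounded, so a fast tokenizer can run arbitrarily far ahead of the parser; mpsc::sync_channel with a fixed capacity would add backpressure. A benchmark would also need to account for script execution, which can force the tokenizer to stop and wait.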

emwalker Oct 11, 2023
Collaborator Author

I suspect we'll eventually want to split out some additional structs to help with the tokenization -> parser interaction. For example, in Gecko (and Chromium), there's a "tree builder" that aggregates tokens and creates operations or tasks that are then applied as mutations to the DOM.

I do not think this kind of refactoring is urgent, and it can be driven by the need in front of us at some point in the future. (Right now I'm just trying to map out the territory and get a sense of all of the moving parts in a real system.)

emwalker · 2023-10-08T17:44:42Z

emwalker
Oct 8, 2023
Collaborator Author

Gecko

The Gecko engine is written primarily in C++ and makes heavy use of the observer pattern and interface-like superclasses. There also appear to be two code paths, one of them deprecated, depending on whether older HTML or HTML5 is being parsed. Gecko is also used at the core of Firefox, which is a full-featured browser with all of the functionality one takes for granted. All of this together makes the code more difficult to follow than Servo for someone new to it. What follows is very tentative.

$ gh repo clone mozilla/gecko-dev

Loading a page from a URL

Let's start in the vicinity of the docshell class, which is an important coordinating class in the Gecko engine. (The code involved here is gnarly.)

nsDocShell::LoadURI invokes nsDocShell::InternalLoad
nsDocShell::InternalLoad invokes nsDocShell::DoURILoad
nsDocShell::DoURILoad invokes nsDocShell::OpenInitializedChannel
nsDocShell::OpenInitializedChannel invokes nsURILoader::OpenURI
nsURILoader::OpenURI invokes (?) nsExtProtocolChannel::AsyncOpen
nsExtProtocolChannel::AsyncOpen invokes nsExtProtocolChannel::OpenURL
nsExtProtocolChannel::OpenURL invokes listener->OnStartRequest, passing in a channel
The channel in this case might be a ParentProcessDocumentChannel, in which case channel->AsyncOpen(...), above, would call ParentProcessDocumentChannel::AsyncOpen
ParentProcessDocumentChannel::AsyncOpen invokes DocumentLoadListener::OpenDocument, receiving a promise
DocumentLoadListener::OpenDocument invokes DocumentLoadListener::Open
DocumentLoadListener::Open invokes documentContext->StartDocumentLoad
It is unclear what documentContext is an instance of; it might be an instance of CanonicalBrowsingContext
Through magic of some kind, let's say that DocumentLoadListener::Open will have transitively invoked nsHTMLDocument::StartDocumentLoad
An nsHtml5Parser parser is initialized, since we're parsing HTML5
mParser->Parse(uri) is invoked
In the parsing function, work that has been queued up in buffers is processed in several stages. In a first stage, tokenization happens piecemeal, as portions of the buffer are processed
As tokenization happens, an nsHtml5TreeBuilder tree builder class is used to modify the document
nsHtml5TreeBuilder has a state machine that tracks the current state of token processing. It is a C++ class generated from a Java class elsewhere in the project.
Somewhere nsHtml5TreeBuilder must be registered to receive token events, because it might end up in a broken state after the code attempts to tokenize the next segment of the buffer
If we reach a point in the token stream where a script might need to intervene, the tree builder is flushed, which moves ops to an executor, at which point the executor might be marked as broken
The executor is an instance of nsHtml5TreeOpExecutor loaded from a field of nsHtml5Parser
If the executor is not in a broken state, executor->FlushDocumentWrite() is called
nsHtml5TreeOpExecutor::FlushDocumentWrite iterates over a queue of operations that have accumulated and calls Perform on them (implementation of Perform here)
An nsHtml5TreeOperation is an operation like CreateHTMLElement, which accepts an nsHtml5DocumentBuilder
HoldElement is called on the nsHtml5DocumentBuilder instance when the new (element) node is created. HoldElement appends to a list of elements in the document builder.
nsHtml5DocumentBuilder is an interface that our nsHtml5TreeOpExecutor from earlier implements, so our tree builder is probably just the nsHtml5TreeOpExecutor instance that was instantiated in the guise of an interface
It looks like the document itself might be modified by the CreateHTMLElement operation as well, although the connection is a little unclear

3 replies

emwalker Oct 8, 2023
Collaborator Author

One of the impressions I got from tracing these calls is that there is a multi-stage parsing pipeline, despite the need to update the DOM in place using the contents of a stream. There's a tokenization phase, which drives a tree builder to collect operations on the DOM, and then the tree builder is flushed, which passes the operations to an executor, which then applies them to the DOM (create an element, create a comment, etc.). Once the executor is done, control can be passed over to any JS scripts that might be next in the document flow.
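
Here is a minimal sketch of that builder, flush and executor split. It is only loosely modeled on my reading of Gecko: TreeOp, TreeBuilder, Executor and Document are simplified, hypothetical types, not Gecko's classes. The point is that the builder never touches the document; it only records operations that the executor applies later.

```rust
// Hypothetical operation type; real tree ops would cover many more cases.
#[derive(Debug, Clone, PartialEq)]
enum TreeOp {
    CreateElement { id: usize, name: String },
    CreateComment { id: usize, text: String },
    Append { child: usize, parent: usize },
}

struct Node {
    name: String,
    children: Vec<usize>,
}

/// Stand-in for the DOM: an arena where a node's index is its id.
struct Document {
    nodes: Vec<Node>,
}

impl Document {
    fn new() -> Self {
        let root = Node { name: "#document".to_string(), children: Vec::new() };
        Document { nodes: vec![root] }
    }

    fn insert(&mut self, id: usize, name: String) {
        // Assumes ops arrive in creation order, so ids match arena indexes.
        debug_assert_eq!(id, self.nodes.len());
        self.nodes.push(Node { name, children: Vec::new() });
    }
}

/// Driven by the tokenizer; records ops but never touches the document.
struct TreeBuilder {
    ops: Vec<TreeOp>,
    next_id: usize,
    open: Vec<usize>,
}

impl TreeBuilder {
    fn new() -> Self {
        // Id 0 is reserved for the document node.
        TreeBuilder { ops: Vec::new(), next_id: 1, open: vec![0] }
    }

    fn alloc(&mut self) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    fn current(&self) -> usize {
        *self.open.last().expect("document stays open")
    }

    fn start_tag(&mut self, name: &str) {
        let id = self.alloc();
        let parent = self.current();
        self.ops.push(TreeOp::CreateElement { id, name: name.to_string() });
        self.ops.push(TreeOp::Append { child: id, parent });
        self.open.push(id);
    }

    fn comment(&mut self, text: &str) {
        let id = self.alloc();
        let parent = self.current();
        self.ops.push(TreeOp::CreateComment { id, text: text.to_string() });
        self.ops.push(TreeOp::Append { child: id, parent });
    }

    fn end_tag(&mut self) {
        if self.open.len() > 1 {
            self.open.pop();
        }
    }

    /// Hand all recorded ops to the executor, leaving the builder empty.
    fn flush(&mut self, executor: &mut Executor) {
        executor.queue.append(&mut self.ops);
    }
}

/// Applies queued ops to the document.
#[derive(Default)]
struct Executor {
    queue: Vec<TreeOp>,
}

impl Executor {
    /// Apply every queued op and return how many were performed.
    fn run(&mut self, doc: &mut Document) -> usize {
        let count = self.queue.len();
        for op in self.queue.drain(..) {
            match op {
                TreeOp::CreateElement { id, name } => doc.insert(id, name),
                TreeOp::CreateComment { id, text } => doc.insert(id, format!("<!--{}-->", text)),
                TreeOp::Append { child, parent } => doc.nodes[parent].children.push(child),
            }
        }
        count
    }
}
```

Until flush and run are called, the document stays exactly as it was, which gives a well-defined point at which a script can take over.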

Kiyoshika Oct 9, 2023
Collaborator

I wonder how much overhead all these steps introduce. Obviously I don't know the exact reasoning why some browsers are set up the way they are, but sometimes it seems like (and this is mainly my PTSD using the AWS C++ SDK) you need 500 classes to do something.

Right now we are taking a very minimalistic approach of: Read Stream -> Tokenize -> Parse + Modify DOM in place, using just a few structs. I am curious to know what struggles that's going to cause for us. I think abstractions are a goldilocks problem where very little abstraction can make code a little hard to follow but too much abstraction also makes the code hard to follow, so it's finding some type of middle ground to make code easy enough to maintain/follow but also not bottleneck us somehow.

We are also very very early so I'm sure we will have to introduce more objects to handle certain things, but I'm hoping to avoid the AWS syndrome

emwalker Oct 9, 2023
Collaborator Author

"I think abstractions are a goldilocks problem where very little abstraction can make code a little hard to follow but too much abstraction also makes the code hard to follow, so it's finding some type of middle ground to make code easy enough to maintain/follow but also not bottleneck us somehow."

This rings true to me as well. I think the trick will be to go back to the Mozilla and Chromium code bases and do our research when new problems come up. Some of the code in the Gecko project will suffer from accidental complexity that they introduced but which isn't really needed. And some of the complexity will be necessary complexity inherent in the problem domain. We have the challenge and opportunity to approach things with fresh eyes, but we should also do our research as we go and avoid reinventing the wheel badly.

(I do not know that the Servo code base is something to emulate. I'm still trying to get a sense of what they got right and what they got wrong.)

emwalker · 2023-10-11T01:23:59Z

emwalker
Oct 11, 2023
Collaborator Author

Chromium

Chromium is a huge code base. It took me hours to download the project, and then forever to compile it (just for fun). Somehow it is easier to navigate than Gecko. To someone new to C++ like myself, Gecko gives the vague impression of being more abstract and hard to reason about in some ways. Much of the Chromium code base deals with the user interface. Chromium delegates its HTML handling to Blink, a rendering engine. It is in Blink that tokenization, parsing, DOM tree building and related things happen.

I downloaded and compiled Chromium by following these instructions, which were mostly straightforward. Along the way I noticed that there are thousands of interesting markdown docs on relevant topics that I might want to skim at some point (HTML here).

2 replies

emwalker Oct 11, 2023
Collaborator Author

After compiling Chromium, I was pleased to see it actually start up:

emwalker Oct 11, 2023
Collaborator Author

The call stack that leads to the parsing of an HTML document is deep, and we can skip a lot of it and start near the handoff to Blink. For the curious, the main browser loop seems to start here.

For the parsing of an HTML document, we can omit a lot of stuff and begin with Blink in one of the code paths that loads a document:

At a high level, in LocalFrame::ForceSynchronousDocumentInstall, a parser is instantiated and then bytes are appended. At the end of the parsing, parser->Finish() is called.
HTMLDocumentParser::AppendBytes calls DecodedDataDocumentParser::AppendBytes
DecodedDataDocumentParser::AppendBytes calls DecodedDataDocumentParser::UpdateDocument
DecodedDataDocumentParser::UpdateDocument calls DecodedDataDocumentParser::AppendDecodedData
DecodedDataDocumentParser::AppendDecodedData appears to call HTMLDocumentParser::Append
HTMLDocumentParser::Append calls HTMLDocumentParser::FinishAppend
HTMLDocumentParser::FinishAppend calls HTMLDocumentParser::PumpTokenizerIfPossible
HTMLDocumentParser::PumpTokenizerIfPossible calls HTMLDocumentParser::PumpTokenizer
HTMLDocumentParser::PumpTokenizer enters a tokenization loop
The tokenization loop calls HTMLDocumentParser::ConstructTreeFromToken
HTMLDocumentParser::ConstructTreeFromToken calls what appears to be HTMLTreeBuilder::ConstructTree
HTMLTreeBuilder::ConstructTree calls HTMLTreeBuilder::ProcessToken
HTMLTreeBuilder::ProcessToken calls (e.g.) HTMLTreeBuilder::ProcessStartTag and eventually HTMLTreeBuilder::ProcessEndTag
But before doing all of that, HTMLTreeBuilder::ProcessToken appears to call HTMLConstructionSite::Flush
HTMLConstructionSite::Flush calls HTMLConstructionSite::ExecuteQueuedTasks
HTMLConstructionSite::ExecuteQueuedTasks calls HTMLConstructionSite::ExecuteTask
HTMLConstructionSite::ExecuteTask calls (e.g.) ExecuteInsertTask
ExecuteInsertTask calls Insert
Insert calls ContainerNode::ParserAppendChild
ContainerNode::ParserAppendChild calls Document::adoptNode
Popping back up the stack, we eventually fall out of the tokenization loop and return to the caller, at which point we might attempt to end parsing

At a high level, I gather that there's a document that contains the DOM; there's a parser that works with the document and delegates to a tokenizer; the tokenizer is "pumped" when there is an opportunity, and it has a complex state machine that constructs tokens; there's a tree builder that processes the token stream and turns the tokens into tasks; the tasks are queued up, and then at the appropriate time they're executed against the document, doing things you would see in JavaScript, such as inserting new elements.
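
The flush-before-process ordering noted in the call chain can be sketched as follows. This is a simplified model with hypothetical names (ConstructionSite, InsertTask, attach_later), not Blink's actual classes: nodes are created immediately, but attaching them to a parent is queued as a task, and the queue is drained at the start of the next process_token call.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    StartTag(String),
    Characters(String),
    EndTag,
}

struct Node {
    label: String,
    children: Vec<usize>,
}

/// A queued insertion of an already-created node.
struct InsertTask {
    parent: usize,
    child: usize,
}

struct ConstructionSite {
    nodes: Vec<Node>, // arena; index 0 is the document
    open: Vec<usize>, // stack of open elements
    queue: Vec<InsertTask>,
}

impl ConstructionSite {
    fn new() -> Self {
        let document = Node { label: "#document".to_string(), children: Vec::new() };
        ConstructionSite { nodes: vec![document], open: vec![0], queue: Vec::new() }
    }

    /// Create a detached node and return its index.
    fn create(&mut self, label: String) -> usize {
        self.nodes.push(Node { label, children: Vec::new() });
        self.nodes.len() - 1
    }

    /// Queue the insertion under the current open element.
    fn attach_later(&mut self, child: usize) {
        let parent = *self.open.last().expect("document stays open");
        self.queue.push(InsertTask { parent, child });
    }

    /// Execute every queued task and return how many ran.
    fn flush(&mut self) -> usize {
        let count = self.queue.len();
        for task in self.queue.drain(..) {
            self.nodes[task.parent].children.push(task.child);
        }
        count
    }
}

struct TreeBuilder {
    site: ConstructionSite,
}

impl TreeBuilder {
    fn process_token(&mut self, token: Token) {
        // Tasks queued for the previous token are applied first.
        self.site.flush();
        match token {
            Token::StartTag(name) => {
                let id = self.site.create(name);
                self.site.attach_later(id);
                self.site.open.push(id);
            }
            Token::Characters(data) => {
                let id = self.site.create(format!("#text {}", data));
                self.site.attach_later(id);
            }
            Token::EndTag => {
                if self.site.open.len() > 1 {
                    self.site.open.pop();
                }
            }
        }
    }

    /// Called when parsing ends so the last queued tasks are not lost.
    fn finish(&mut self) {
        self.site.flush();
    }
}
```

One consequence visible in this model is that the tree lags one token behind the builder: the node created for a token is only attached when the next token arrives or when finish is called.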

emwalker · 2023-10-15T14:47:47Z

emwalker
Oct 15, 2023
Collaborator Author

WebKit

WebKit is the core of Safari and, in some form, Chrome. Its origins are in the KHTML and KJS libraries from KDE, a Linux desktop environment. It is the starting point for Chrome's Blink rendering engine, which is a fork of WebKit's WebCore component. I recall hearing that the Chrome team went with this approach because WebKit was very fast.

With some effort I downloaded the sources for WebKit and compiled them.

$ gh repo clone WebKit/WebKit

It was a little fiddly to compile the project. I ended up installing a bunch of development libraries, discovering some that would be hard to track down, and then running these commands:

$ cmake -DPORT=GTK -DUSE_JPEGXL=OFF -DUSE_OPENJPEG=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja -DUSE_WOFF2=OFF -DUSE_LCMS=OFF -DUSE_LIBBACKTRACE=OFF
$ ninja

It took several hours to compile the sources to a binary. To run the binary, I had to set LD_LIBRARY_PATH so that the shared libraries that were built could be found:

$ LD_LIBRARY_PATH=`pwd`/lib /usr/local/libexec/webkit2gtk-4.1/MiniBrowser

2 replies

emwalker Oct 15, 2023
Collaborator Author

I was a little worried that it was going to be too involved to get a browser up and running, but then the command I tried worked, and one started up:

emwalker Oct 15, 2023
Collaborator Author

WebKit, via its WebCore component, follows a very similar path to the one seen in Chrome's Blink, which isn't surprising, since Blink is derived from WebCore:

WebContentReader::readHTML calls createFragmentFromMarkup
createFragmentFromMarkup calls DocumentFragment::parseHTML
DocumentFragment::parseHTML calls HTMLDocumentParser::parseDocumentFragment
HTMLDocumentParser::parseDocumentFragment calls HTMLDocumentParser::insert
HTMLDocumentParser::insert calls HTMLDocumentParser::pumpTokenizerIfPossible
HTMLDocumentParser::pumpTokenizerIfPossible calls HTMLDocumentParser::pumpTokenizer
HTMLDocumentParser::pumpTokenizer calls HTMLDocumentParser::pumpTokenizerLoop
HTMLDocumentParser::pumpTokenizerLoop calls HTMLTokenizer::nextToken, which delegates to HTMLTokenizer::processToken
HTMLTokenizer::processToken starts the tokenization state machine, and eventually a token is yielded
HTMLDocumentParser::pumpTokenizerLoop then calls HTMLDocumentParser::constructTreeFromHTMLToken
HTMLDocumentParser::constructTreeFromHTMLToken calls HTMLTreeBuilder::constructTree, delegating to HTMLTreeBuilder
HTMLTreeBuilder::constructTree calls HTMLTreeBuilder::processToken, entering the tree building state machine
For example, processToken might first go into HTMLTreeBuilder::processStartTag, and then with a later token it might go into HTMLTreeBuilder::processEndTag
Eventually the tree building state machine comes to a stopping point, having queued up some tasks, which it then executes by calling HTMLConstructionSite::executeQueuedTasks, delegating to HTMLConstructionSite
For each enqueued task, HTMLConstructionSite calls executeTask
At this point executeTask might do one of several things, depending on the task. An example is an insertion task, in which case it calls executeInsertTask
executeInsertTask then calls insert
insert then might call ContainerNode::parserAppendChild, which then does something you would expect to happen if you appended a node in JavaScript
Eventually we fall out of this task execution and unwind back up the stack through HTMLDocumentParser::constructTreeFromHTMLToken, HTMLDocumentParser::pumpTokenizerLoop, HTMLDocumentParser::pumpTokenizer, HTMLDocumentParser::pumpTokenizerIfPossible and HTMLDocumentParser::insert, and wrap things up.

Here we have an HTML fragment in a string that is then passed to a DocumentFragment class to parse as HTML. The DocumentFragment instantiates an HTMLDocumentParser, which then "pumps" an HTMLTokenizer class, which will tokenize as much as possible before coming to a stopping point. As new tokens are received from the token stream, an HTMLTreeBuilder then goes through its own state machine and builds and enqueues tasks to execute later on. These tasks include things you would see in JavaScript, such as appending a node to a location in the DOM. Once the tree builder comes to a stopping point, the tasks are executed by an HTMLConstructionSite class, which executes each task in the queue. Eventually we come to a stopping point in the parsing of the HTML fragment and clean up the helper classes, including the parser, leaving the DOM in a new state.
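
The fragment case can be sketched in a few lines, assuming a toy markup format of bare start tags, end tags and text. parse_fragment, close_top and append_fragment are hypothetical helpers and not WebKit's API. The fragment is parsed into a detached list of nodes first, and the existing tree only changes when the fragment is spliced in.

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Toy fragment parser: understands `<name>`, `</name>` and text only.
/// End tag names are not checked; any end tag closes the innermost element.
fn parse_fragment(markup: &str) -> Vec<Node> {
    // Each frame is an open element plus the children collected so far.
    // The bottom frame (empty name) stands in for the fragment itself.
    let mut stack: Vec<(String, Vec<Node>)> = vec![(String::new(), Vec::new())];
    let mut rest = markup;
    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix('<') {
            let end = after.find('>').unwrap_or(after.len());
            let tag = &after[..end];
            if tag.starts_with('/') {
                close_top(&mut stack);
            } else {
                stack.push((tag.to_string(), Vec::new()));
            }
            rest = after.get(end + 1..).unwrap_or("");
        } else {
            let end = rest.find('<').unwrap_or(rest.len());
            let text = Node::Text(rest[..end].to_string());
            stack.last_mut().expect("fragment frame").1.push(text);
            rest = &rest[end..];
        }
    }
    // Close anything still open at the end of the input.
    while stack.len() > 1 {
        close_top(&mut stack);
    }
    stack.pop().expect("fragment frame").1
}

/// Pop the innermost open element and attach it to its parent frame.
fn close_top(stack: &mut Vec<(String, Vec<Node>)>) {
    if stack.len() > 1 {
        let (name, children) = stack.pop().expect("length checked above");
        let parent = stack.last_mut().expect("fragment frame");
        parent.1.push(Node::Element { name, children });
    }
}

/// Move the fragment's nodes into an existing element.
/// Returns false if the target cannot have children.
fn append_fragment(target: &mut Node, fragment: Vec<Node>) -> bool {
    match target {
        Node::Element { children, .. } => {
            children.extend(fragment);
            true
        }
        Node::Text(_) => false,
    }
}
```

For example, parse_fragment("<b>x</b>y") yields two detached nodes, and a later append_fragment call moves them under a chosen element in one step.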
