Proposal on splitting up `Node.data` into structs #108

Kiyoshika · 2023-10-07T01:50:54Z

Kiyoshika
Oct 7, 2023
Collaborator

After some discussion with @emwalker, we briefly talked about the idea of redesigning the current structure of Node.data.

Currently we are storing data directly in the NodeData enum like so:

/// Different type of node data
#[derive(Debug, PartialEq, Clone)]
pub enum NodeData {
    Document,
    Text {
        value: String,
    },
    Comment {
        value: String,
    },
    Element {
        name: String,
        attributes: HashMap<String, String>,
    },
}

This gets a little messy when we start adding methods that only apply to a particular type of node. For example, when I introduced all the attribute methods, we ended up with these nasty checks in every method:

if self.type_of() != NodeType::Element {
    return Err(ATTRIBUTE_NODETYPE_ERR_MSG.into());
}

This will only get worse as we add more methods specific to different node types (text nodes, element, any others in the future.)

I did some brainstorming tonight and have a proposal:

We create different structs for each node type and wrap that in the enum:

pub enum NodeData {
    Document(DocumentNodeData),
    Text(TextNodeData),
    Comment(CommentNodeData),
    Element(ElementNodeData)
}

and the construction will be changed to (for example on the Element type):

pub fn new_element(args) -> Self {
    Node {
        // ... other fields
        data: NodeData::Element(ElementNodeData::new(args)),
        // ... other fields
    }
}

The dedicated structs will have their specific methods (this has the advantage of not polluting the Node struct as well):

use std::collections::HashMap;

#[derive(Debug, PartialEq, Clone)]
pub struct ElementNodeData {
    pub name: String,
    pub attributes: HashMap<String, String>,
}

impl Default for ElementNodeData {
    fn default() -> Self {
        Self::new()
    }   
}

impl ElementNodeData {
    pub fn new() -> Self {
        ElementNodeData {
            name: "".to_string(),
            attributes: HashMap::new(),
        }   
    }   

    // note that this no longer returns a Result<> like it does currently since it's no longer needed.
    pub fn insert_attribute(&mut self, name: &str, value: &str)  {
        // implementation without the nasty type check shown earlier
    }   

    // other methods specific to Element
}

Then when it comes to actual usage (for example, fetching a node and adding an attribute; side note, I'm writing this by hand and not actually compiling so there are likely errors in below syntax)

if let Some(node) = document.get_node_by_id_mut(NodeId::from(4)).expect("node") {
    // if fetched node is not Element type, nothing happens.
    // optionally, we could log a warning in an else clause
    if let NodeData::Element(element) = &mut node.data {
        element.insert_attribute("class", "hello world");
    }
}

In current state, it looks more like the following:

if let Some(node) = document.get_node_by_id_mut(NodeId::from(4)).expect("node") {
    let result = node.insert_attribute("class", "hello world");
    // result = Err() if type is not Element. Could probably use "if let Ok(_) = node.insert_..."
}

The current implementation of insert_attribute is:

/// Add or update an attribute
pub fn insert_attribute(&mut self, name: &str, value: &str) -> Result<(), String> {
    if self.type_of() != NodeType::Element {
        return Err(ATTRIBUTE_NODETYPE_ERR_MSG.into());
    }   

    if let NodeData::Element { attributes, .. } = &mut self.data {
        attributes.insert(name.to_owned(), value.to_owned());
    }   

    Ok(())
}

But with the proposed approach it could be simplified to:

// NOTE: this method would be inside ElementNodeData and no longer Node
pub fn insert_attribute(&mut self, name: &str, value: &str) { // <-- no longer returning Result<> because it's now unneeded
    self.attributes.insert(name.to_owned(), value.to_owned());
}

I think this would help remove bloat in the Node struct both now and in the future as well as significantly simplify the methods by removing the boilerplate type checks.
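To make the proposal concrete, here is a minimal, self-contained sketch of how the pieces could fit together. It is only an illustration under a few assumptions: Node is reduced to a single data field, only two variants are shown, ElementNodeData::new takes a name (the version above takes no arguments), and set_class is a hypothetical helper that does not exist in the codebase.

```rust
use std::collections::HashMap;

/// Data specific to element nodes (fields mirror the proposal).
#[derive(Debug, PartialEq, Clone, Default)]
pub struct ElementNodeData {
    pub name: String,
    pub attributes: HashMap<String, String>,
}

impl ElementNodeData {
    /// Assumption: takes the element name; the real constructor may differ.
    pub fn new(name: &str) -> Self {
        Self {
            name: name.to_owned(),
            attributes: HashMap::new(),
        }
    }

    /// No Result needed: this can only ever be called on element data.
    pub fn insert_attribute(&mut self, name: &str, value: &str) {
        self.attributes.insert(name.to_owned(), value.to_owned());
    }
}

/// Data specific to text nodes.
#[derive(Debug, PartialEq, Clone, Default)]
pub struct TextNodeData {
    pub value: String,
}

/// Only two variants are shown to keep the sketch short.
#[derive(Debug, PartialEq, Clone)]
pub enum NodeData {
    Text(TextNodeData),
    Element(ElementNodeData),
}

/// Simplified stand-in for the real Node struct.
#[derive(Debug, PartialEq, Clone)]
pub struct Node {
    pub data: NodeData,
}

/// Hypothetical caller-side helper: sets `class` if the node is an element
/// and reports whether anything was changed.
pub fn set_class(node: &mut Node, class: &str) -> bool {
    match &mut node.data {
        NodeData::Element(element) => {
            element.insert_attribute("class", class);
            true
        }
        _ => false,
    }
}
```

The type check moves from inside every method to the single place where the caller matches on node.data, which is the trade-off being proposed.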

This would require a bit of rework so I wanted to have an open discussion before I started any serious work on it. If we are good with this idea, I will open an issue based off this discussion and assign it to myself.

emwalker · 2023-10-07T02:03:06Z

emwalker
Oct 7, 2023
Collaborator

This was one of two approaches that came to my mind as I was thinking about this problem. The other approach is called the typestate pattern. (This video is well worth watching when you have the time.)

I think the enum variant + wrapped specialized struct will be a significant improvement over the current approach. The typestate pattern is a little advanced, and I worry that if we let the code get too far ahead of people's learning, it may make the project really hard for people to work on. Eventually we might get there.
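For comparison, a rough typestate-style sketch might look like the following. All names here (TypedNode, Element, Text) are made up for illustration and are not from the codebase; the point is only that the node kind lives in the type parameter, so calling insert_attribute on a text node would fail to compile rather than fail at runtime.

```rust
use std::collections::HashMap;

/// Kind-specific data for elements (hypothetical type).
pub struct Element {
    pub name: String,
    pub attributes: HashMap<String, String>,
}

/// Kind-specific data for text nodes (hypothetical type).
pub struct Text {
    pub value: String,
}

/// A node whose kind is part of its type.
pub struct TypedNode<K> {
    pub kind: K,
}

impl TypedNode<Element> {
    pub fn new_element(name: &str) -> Self {
        TypedNode {
            kind: Element {
                name: name.to_owned(),
                attributes: HashMap::new(),
            },
        }
    }

    /// Only exists for element nodes, so no runtime type check is needed.
    pub fn insert_attribute(&mut self, name: &str, value: &str) {
        self.kind.attributes.insert(name.to_owned(), value.to_owned());
    }
}

impl TypedNode<Text> {
    pub fn new_text(value: &str) -> Self {
        TypedNode {
            kind: Text {
                value: value.to_owned(),
            },
        }
    }
}
```

One cost of this style is that a tree holding mixed node kinds still needs an enum or a trait object to store them side by side, which is part of why the enum approach is the simpler first step.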

Anyway, I think your proposal is taking things in the right direction.

0 replies

emwalker · 2023-10-07T02:07:21Z

emwalker
Oct 7, 2023
Collaborator

I also wonder whether there's a way to avoid exposing the full attribute API on the enum variant, which requires an internal if let NodeData::Element(element) = node.data { match and a possible ignore if there is no match. Is there a way to require callers to work with the wrapped struct directly (make it part of the public API)? Effectively, you'd reduce the calling surface area of NodeData and only have the methods on ElementNodeData.

Callers might have to do if let themselves, but that might be ok.

10 replies

emwalker Oct 7, 2023
Collaborator

The concrete change would be that the API surface of Node would be reduced significantly.

Anyone working with Node will have to match on node.data_mut() or something.

let attr = HashMap::new();
let mut node = Node::new_element("name", attr.clone(), HTML_NAMESPACE);

match &mut node.data {
    NodeData::Element(element) => {
        element.insert_attribute("key", "value");
    }
    NodeData::Document(document) => {
        // etc.
    }
    // ...
}

Kiyoshika Oct 7, 2023
Collaborator Author

I see, I think that's how I mostly had it set up in the proposal. The usage example I gave was an example caller fetching a node from the DOM and using if let but should also be able to match on node.data directly as well

emwalker Oct 7, 2023
Collaborator

Ok, then we're probably thinking of the same thing.

Kiyoshika Oct 7, 2023
Collaborator Author

I was initially seeing if I could shortcut it like node.data.insert_attribute(...) where it would ignore the call if the type is not Element, but I don't think there's any way in Rust to do that, so this is probably our best option for now

emwalker Oct 7, 2023
Collaborator

One issue you might run into is that the CSS spec might specify a common set of methods for all of the different node data types (Element, Document, etc.). For that, you might be able to get away with a shared trait that each of the inner structs implements.

If you find yourself needing to do a lot of copying and pasting to get the shared trait impls to work, you might be able to attempt a blanket implementation.
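A small sketch of what that could look like, with hypothetical trait names (NodeDataKind and Summary are invented here, not taken from any spec or from the codebase): each inner struct implements the shared trait once, and a blanket implementation then gives every implementor a second trait without any copying and pasting.

```rust
/// Behaviour shared by every inner node-data struct (hypothetical trait).
pub trait NodeDataKind {
    /// Short name of the node kind, e.g. "element".
    fn kind_name(&self) -> &'static str;

    /// Provided method: every implementor gets this without extra code.
    fn describe(&self) -> String {
        format!("<{} node>", self.kind_name())
    }
}

pub struct DocumentNodeData;

pub struct TextNodeData {
    pub value: String,
}

impl NodeDataKind for DocumentNodeData {
    fn kind_name(&self) -> &'static str {
        "document"
    }
}

impl NodeDataKind for TextNodeData {
    fn kind_name(&self) -> &'static str {
        "text"
    }
}

/// A second trait that should be available on all node-data types.
pub trait Summary {
    fn summary(&self) -> String;
}

/// Blanket implementation: anything implementing NodeDataKind
/// automatically implements Summary as well.
impl<T: NodeDataKind> Summary for T {
    fn summary(&self) -> String {
        format!("{} node", self.kind_name())
    }
}
```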

jaytaph · 2023-10-07T13:30:21Z

jaytaph
Oct 7, 2023
Maintainer

I'm fine with making the node data a bit easier to handle. However, keep in mind that we still don't know if this node structure will be the actual output of the parser.

When parsing a document, script functions can interrupt the parsing and mutate the dom-tree as it is at that point, before the parser continues with the parsing of the rest of the document. This would mean we need to generate a dom-tree DURING the parsing, and not convert a node-tree to a dom-tree after the parsing is completed.

It might be worthwhile to see if we can convert the current node structure into a DOM system, and maybe even see if we can have something like spidermonkey or v8 (prefer spidermonkey) incorporated to deal with script tags. (I'm currently not worried about parallelism, as javascript is blocking the parser anyway, but we have to think about this sooner rather than later as well.)

3 replies

Kiyoshika Oct 7, 2023
Collaborator Author

I’m essentially trying to convert the current Document into a tree that can be mutated during parsing. Soon I will also have to store a reference to the document in the node itself to handle more complicated operations like insert_before and insert_after, which I experimented with but am planning for a later PR.

I’ll have to do some more research on spidermonkey/v8 to see how they currently work with the tree during parsing but I think we might be able to pull it off where we’re headed

emwalker Oct 7, 2023
Collaborator

I’m essentially trying to convert the current Document into a tree that can be mutated during parsing.

This vaguely sounds like an intermediate representation (IR), i.e., a tree structure that you manipulate and optimize before generating a DOM. In that case, we might be premature in attempting to go straight to DOM nodes as we currently are.

emwalker Oct 7, 2023
Collaborator

Depending on what is needed to get the DOM in the right shape, you might have several intermediate tree structures, each of which is manipulated in several passes and changed around:

Text input → IR1 → IR2 → IR3 → DOM with indexing suitable for CSS querying

In this flow, IR1, IR2, IR3 are intermediate representations with their own structs and tree formats.

Kiyoshika · 2023-10-07T16:52:11Z

Kiyoshika
Oct 7, 2023
Collaborator Author

Yeah, it could be considered an IR at the moment. It could be abstracted later to make it easier on the UA to render (if necessary), but right now my thought was to make the current Document flexible enough for the parser to handle changes in "real time". So if we had something like

<html>
  <head></head>
  <body>
    <p id="myid">Some text</p>
    <script>
      document.getElementById("myid").style.background = 'red';
    </script>
    <p id="otherid">More text</p>
    <script>
      document.getElementById("myid").style.background = 'blue';
      
      const newP = document.createElement("p");
      const newPText = document.createTextNode("even more text");
      newP.appendChild(newPText);
      
      const otherId = document.getElementById("otherid");
      const parentNode = otherId.parentNode;
      parentNode.insertBefore(newP, otherId);
    </script>
    <p>Final text</p>
  </body>
</html>

Then basically the steps are:

1. create p node and add it to Document
   - from here, the Document reads the id attribute and ties it to this NodeId to make it queryable
2. read script and parse/compile/etc and give instructions to parser (or however it works)
   - fetch myid from Document and modify its style
3. create new p node and add it to Document (now a sibling of first p)
   - Document reads the id and ties it to its NodeId
4. read script and parse/compile/etc and give instructions to parser
   - fetch myid from Document and modify its style
   - create new p and append text node inside it (but do not add into Document yet)
   - fetch otherid from Document and get its parent node
   - call parent_node.insert_before(...) to insert script-generated p before otherid node
     - this will update the relevant sibling nodes
5. create p node and add it to Document (which is under otherid now)
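The steps above could be sketched with a small arena-style tree. This is not the actual Document implementation: NodeId is a plain usize here instead of the real NodeId type, the method names create_node, append_child and get_by_html_id are invented for the example, and the id attribute is registered when the node is created rather than when it is attached, purely to keep things short.

```rust
use std::collections::HashMap;

/// Stand-in for the real NodeId type: an index into the node arena.
pub type NodeId = usize;

#[derive(Debug)]
pub struct Node {
    pub name: String,
    pub parent: Option<NodeId>,
    pub children: Vec<NodeId>,
}

#[derive(Debug, Default)]
pub struct Document {
    nodes: Vec<Node>,
    /// Maps an `id` attribute value to the node that carries it.
    named_ids: HashMap<String, NodeId>,
}

impl Document {
    /// Creates a detached node; `html_id` is its optional `id` attribute.
    pub fn create_node(&mut self, name: &str, html_id: Option<&str>) -> NodeId {
        let id = self.nodes.len();
        self.nodes.push(Node {
            name: name.to_owned(),
            parent: None,
            children: Vec::new(),
        });
        if let Some(html_id) = html_id {
            self.named_ids.insert(html_id.to_owned(), id);
        }
        id
    }

    /// Attaches `child` as the last child of `parent`.
    pub fn append_child(&mut self, parent: NodeId, child: NodeId) {
        self.nodes[child].parent = Some(parent);
        self.nodes[parent].children.push(child);
    }

    /// Inserts `new_node` as the sibling immediately before `reference`.
    /// Returns false if `reference` is not attached to a parent.
    pub fn insert_before(&mut self, new_node: NodeId, reference: NodeId) -> bool {
        let Some(parent) = self.nodes[reference].parent else {
            return false;
        };
        let siblings = &self.nodes[parent].children;
        let Some(pos) = siblings.iter().position(|&c| c == reference) else {
            return false;
        };
        self.nodes[parent].children.insert(pos, new_node);
        self.nodes[new_node].parent = Some(parent);
        true
    }

    /// Looks a node up by its `id` attribute, like getElementById.
    pub fn get_by_html_id(&self, html_id: &str) -> Option<NodeId> {
        self.named_ids.get(html_id).copied()
    }

    pub fn children(&self, id: NodeId) -> &[NodeId] {
        &self.nodes[id].children
    }
}
```

Because nodes refer to each other by index rather than by reference, a script can mutate the tree between parser steps without fighting the borrow checker.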

Obviously we're a long way from JavaScript, but I at least wanted to start building the foundation for a mutable tree that the parser can touch, so that it is an accurate representation of the current source (essentially what you stated)

3 replies

Kiyoshika Oct 7, 2023
Collaborator Author

We may have a separate DomTree with a simpler API that outputs from the parser instead of Document if it's easier for the UA, and Document would then be used internally by the parser

emwalker Oct 7, 2023
Collaborator

I wonder about JS script execution. Is it deferred until some point later, when most of the document has been received, or does it occur in the parsing flow, once the closing </script> tag is seen?

emwalker Oct 7, 2023
Collaborator

We may have a separate DomTree with a simpler API that outputs from the parser instead of Document if it's easier for the UA, and Document would then be used internally by the parser

In my experience so far, refactoring is one of Rust's superpowers. So probably safe to proceed with what you have, with the possibility of safely refactoring later. (Assuming we don't have a bunch of unsafe { ... } code at that point.)

jaytaph · 2023-10-07T17:03:55Z

jaytaph
Oct 7, 2023
Maintainer

I wonder about JS script execution. Is it deferred until some point later, when most of the document has been received, or does it occur in the parsing flow, once the closing </script> tag is seen?

After a closing tag of a script, the parser is interrupted by the javascript code. At that point, javascript has access to the dom as it is currently (partially) generated. Once the javascript has finished, it may or may not have modified the dom, and the parser continues with the next token in the html stream. Note that the html stream never changes so it can safely continue.
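As a toy illustration of that flow (not the real parser; the Token type, the flat Vec<String> "tree" and the on_script callback are all invented for this sketch), the parse loop can hand the partially built tree to a callback as soon as it sees the closing script tag, and then carry on with the next token:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    StartTag(String),
    EndTag(String),
    Text(String),
}

/// Builds a flat list of node labels from `tokens`. After every closing
/// script tag, `on_script` receives the script source and the partial
/// tree, and may mutate the tree before parsing resumes.
pub fn parse(
    tokens: &[Token],
    mut on_script: impl FnMut(&str, &mut Vec<String>),
) -> Vec<String> {
    let mut tree = Vec::new();
    let mut script_source = String::new();
    let mut in_script = false;

    for token in tokens {
        match token {
            Token::StartTag(name) if name == "script" => {
                in_script = true;
                script_source.clear();
            }
            Token::EndTag(name) if name == "script" => {
                in_script = false;
                // The parser is interrupted here; the token stream itself
                // is untouched, so the loop simply continues afterwards.
                on_script(&script_source, &mut tree);
            }
            Token::Text(text) if in_script => script_source.push_str(text),
            Token::StartTag(name) => tree.push(format!("<{name}>")),
            Token::Text(text) => tree.push(text.clone()),
            Token::EndTag(_) => {}
        }
    }
    tree
}
```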

3 replies

emwalker Oct 7, 2023
Collaborator

Your description makes me think that we might not want to attempt an intermediate representation, since the JS evaluation will presumably make use of the final DOM interface.

emwalker Oct 7, 2023
Collaborator

I guess you could have an iteration on the current approach:

Raw tree during HTML parsing
Raw tree passed into a wrapper with the full DOM interface during JS evaluation
Wrapper is discarded once we're out of the script, and we're back to the raw tree
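Roughly, that iteration might look like the sketch below. Everything in it is hypothetical: RawTree and DomView are made-up types, and the closure passed to run_script merely stands in for whatever the JS engine would do. The wrapper borrows the raw tree mutably only for the duration of the script and is dropped afterwards.

```rust
use std::collections::HashMap;

/// Raw tree the HTML parser builds (heavily simplified stand-in).
#[derive(Default)]
pub struct RawTree {
    pub tags: Vec<String>,
    pub ids: HashMap<String, usize>,
    pub styles: HashMap<usize, String>,
}

/// Short-lived wrapper exposing a DOM-like interface over the raw tree.
pub struct DomView<'a> {
    tree: &'a mut RawTree,
}

impl<'a> DomView<'a> {
    pub fn new(tree: &'a mut RawTree) -> Self {
        DomView { tree }
    }

    pub fn get_element_by_id(&self, id: &str) -> Option<usize> {
        self.tree.ids.get(id).copied()
    }

    pub fn set_background(&mut self, node: usize, colour: &str) {
        self.tree.styles.insert(node, format!("background: {colour}"));
    }
}

/// Stand-in for script evaluation: the closure plays the JS engine's role.
pub fn run_script(tree: &mut RawTree, script: impl FnOnce(&mut DomView<'_>)) {
    let mut view = DomView::new(tree);
    script(&mut view);
    // `view` is dropped here, so the parser gets the raw tree back.
}
```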

Now I want to see what the other browser engines are doing.

emwalker Oct 8, 2023
Collaborator

Now I want to see what the other browser engines are doing.

Started a new discussion thread for this.

jaytaph · 2023-10-07T17:04:54Z

jaytaph
Oct 7, 2023
Maintainer

https://github.com/gosub-browser/gosub-engine/blob/main/src/html5_parser/parser.rs#L698

0 replies

Kiyoshika · 2023-10-07T17:06:30Z

Kiyoshika
Oct 7, 2023
Collaborator Author

Yeah, I still need to do some reading on JavaScript engines but my initial simple-minded thought is that when hitting the end script tag it's essentially converted into instructions that the parser can use to interact with the dom tree

8 replies

emwalker Oct 9, 2023
Collaborator

I kind of wish performance was a priority even now. I worry that we could find ourselves with a slow program later down the road and then find it hard to fix things because it will be a big effort. I'm thinking of Servo here. It was after playing around with Servo and seeing its bad performance that I started looking for another browser project.

jaytaph Oct 9, 2023
Maintainer

The thing is that we do not have enough experience with browsers and their architecture in general to even compete with slower browsers. By trying to compartmentalize the different elements, it might be possible to pick one part, optimize it, and put it back. I would not even be surprised if we find ourselves rewriting things from scratch after we get more insights later on, especially if more people would join after having "something" to show.

However, it would also be good to keep memory footprint, security and speed in mind, but I think getting a system up and running at all will be difficult enough. I don't want to have it as an afterthought, but I'd rather have a slow system than no system at all. If we can speed up things along the way, I'm all for it, but it isn't my main priority at the moment.

blaumeise20 Oct 9, 2023

I just found this library: https://github.com/wasmerio/rusty_jsc
I think something similar to this but fitted to our needs should be pretty easily accomplishable. I can maybe look into it this week if you want.

emwalker Oct 9, 2023
Collaborator

@blaumeise20 I like the idea of exploring that project and whether it can do something here. In addition to looking at what would be involved in using it in this project, I'd be curious to know what your thoughts are on the pros and cons if we do go with it.

emwalker Oct 9, 2023
Collaborator

@jaytaph you make good points. I can see rewriting components from scratch once we get some traction and better understand the challenges. But if we have to redo the whole architecture at some point because we overlooked something major, I suspect it will have been from a failure to properly look at what has already been done. So there's some balance to be struck here, as I'm thinking you'll agree.
