Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convenience functions for traversing DOM? #18

Open
dist1ll opened this issue Jan 25, 2022 · 8 comments
Open

Convenience functions for traversing DOM? #18

dist1ll opened this issue Jan 25, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@dist1ll
Copy link

dist1ll commented Jan 25, 2022

Assume we have the following structure

<div class="x">
  <div class="y">
    <div class="z"></div>
    <div class="z"></div>
    <div class="z"></div>
  </div>
</div>

<div class="z"></div>
<div class="z"></div>
<div class="z"></div>

Let's say I want to get all nested divs of class z that are in x, but not the outside ones. Right now that's quite cumbersome, because query_selector() only works on VDom. Then I have to implement a DFS, using Node::children continuously.

Suggestion: NodeHandle should also have a query_selector method, so that we can more easily traverse nested divs. I think it makes sense from an API perspective, because a node essentially spawns a subtree, so it should have the same functions that a VDom has.

Alternative: An alternative solution would be to extend the syntax of the query_selector method to search for nested divs. Something like query_selector("div.x > div.y > div.z").

@y21 y21 added the enhancement New feature or request label Jan 25, 2022
@dist1ll
Copy link
Author

dist1ll commented Jan 25, 2022

It seems like you already planned to select descendants

Descendant(Box<Selector<'a>>, Box<Selector<'a>>),

The match case is just not implemented in matches. This would probably be a good first step, right?

@y21
Copy link
Owner

y21 commented Jan 25, 2022

The match case is just not implemented in matches. This would probably be a good first step, right?

Yep, though I'm not sure about the complexity of adding this to the query selector. It's probably worth mentioning somewhere in the docs that not all of the query selector API is fully implemented yet (Selector::Descendant and Selector::Parent is missing).

Suggestion: NodeHandle should also have a query_selector method,

I agree, Node (or HTMLTag) should have a query_selector function like VDom does, and that seems fairly straight forward to implement. In fact, I've been working on a minor "rewrite" that mostly changes how the parser works (internally), and I've already got the query_selector function implemented for HTMLTags along with a few other things (like being able to parse a query selector independent of a parser, so it can be reused multiple times for different VDoms or Nodes). I will try to get this done soon.

@dist1ll
Copy link
Author

dist1ll commented Jan 25, 2022

Yep, though I'm not sure about the complexity of adding this to the query selector. It's probably worth mentioning somewhere in the docs that not all of the query selector API is fully implemented yet (Selector::Descendant and Selector::Parent is missing).

Oh right, the matches has to decide if a given node fulfils the requirement. That's of course more difficult then (since nodes/htmltags don't have a parent reference). Also the fact we're using an iterator makes this tricky.

I agree, Node (or HTMLTag) should have a query_selector function like VDom does, and that seems fairly straight forward to implement. [...] I've already got the query_selector function implemented for HTMLTags along with a few other things

That's great! If Node/HTMLTag has that functionality, that already gives us powerful composability.

@dist1ll
Copy link
Author

dist1ll commented Feb 1, 2022

Have you considered putting a &VDom reference into the NodeHandler struct? Because we need to give the parser to the select_query function every time, and this makes writing chainable APIs difficult. You can find some extensions that I wrote here:

https://github.com/dist1ll/hltv-rust/blob/main/src/tl_extensions.rs

I created a RichNode struct there

#[derive(Clone, Copy)]
pub struct RichNode<'a> {
    pub d: &'a VDom<'a>,
    pub n: Option<NodeHandle>,
}

I've used this extension to create an ergonomic find function:

// Example of using .find and .get_attr
fn parse_team(h: RichNode, team_id: &str) -> Option<Team> {
    Some(Team{
        // get node as tag, and parse attribute to correct type
        id:  h.get_attr(team_id).unwrap_or(None)?,
        // use .find to find children with the given class name
        name: h.find(team_id).find("matchTeamName").inner_text()?,
    })
}
  1. What do you think about putting a VDom reference into the NodeHandler?

  2. The .find in my example only matches class strings right now (and doesn't do a proper query select like your API). I think it would be cool if you could make the module tl::queryselector public, so I could reuse your logic for this abstraction.

@y21
Copy link
Owner

y21 commented Feb 2, 2022

What do you think about putting a VDom reference into the NodeHandler?

I've thought of this before and I'm not sure if this would work (if I understood this correctly). This sounds like we would end up with self referential structs if Node or NodeHandle had a reference to the VDom. (Also, I guess we'd need to store a &Parser if anything, since a VDom does not exist at parsing/node creation time)

struct Node<'input, 'parser> {
  raw: &'input [u8]
  children: Vec<NodeHandle<'input, 'parser>>
}
struct NodeHandle<'input, 'parser> {
  idx: usize,
  parser: &'parser VDom<'input>
}
struct Parser<'input> {
  nodes: Vec<Node<'input, '???>> // can't annotate the 'parser ('self) lifetime
}
struct VDom<'input> {
  parser: Parser<'input>
}

The reason why NodeHandle exists in the first place is to avoid a situation where Node stores its children as references to other Nodes (or a reference to anything from the parser/vdom for that matter), and to make it "detached" (no lifetime to the source string or parser, so it can be freely stored without running into the problem where you can't name the lifetime). Instead, the parser has a Vec of all nodes and each node stores children as indices (NodeHandle) into the vector, which is why methods on the HTMLTag struct need a &Parser. From what I've seen it seems like a common strategy to model these kinds of "graphs" by using vector indices, and it makes it fairly trivial to implement.

struct Node<'input> {
  raw: &'input [u8]
  children: Vec<NodeHandle>
}
struct NodeHandle {
  idx: usize,
}
struct Parser<'input> {
  nodes: Vec<Node<'input>>
}
struct VDom<'input> {
  parser: Parser<'input>
}

I agree that this makes chaining rather awkward and possibly unidiomatic, but I'm not sure there is a nice and easy solution.
I think a good long term solution would be to make the parser generic over a node sink (see #21), with some default sink implementations (e.g. one that remembers id and class attributes, and stores them in a map for get_element_by_id to run at constant time) and sink combinators. This would make it possible for users of the library to implement their own VDom (kind of) or extend it in any way you want (and also effectively creating their own Nodes). Users could make changes to the sink struct as the parser streams nodes and tell the parser to break at any point, which could be much more efficient in certain cases where you only need to extract one specific part of the HTML document and don't need anything else.

I think it would be cool if you could make the module tl::queryselector public

yes, that module will be public in the next release

@dist1ll
Copy link
Author

dist1ll commented Feb 2, 2022

Thanks for explaining your thoughts, I see now how that approach creates some problems. I really like the idea of a sink though

@y21
Copy link
Owner

y21 commented Feb 3, 2022

0.6.0 is on crates.io with some of your requested features:

let dom = tl::parse(r#"
<div class="x">
  <div class="y">
    <div class="z"></div>
    <div class="z"></div>
    <div class="z"></div>
  </div>
</div>

<div class="z"></div>
<div class="z"></div>
<div class="z"></div>
"#, Default::default()).unwrap();

let x_element = dom
  .get_elements_by_class_name("x")
  .next()
  .unwrap()
  .get(dom.parser())
  .unwrap()
  .as_tag()
  .unwrap();

let zs = x_element.query_selector(".z").unwrap();

assert_eq!(zs.count(), 3);

I have also refactored the HTMLTag children API a bit. Instead of having HTMLTag::children() return an iterator over the direct subnodes (not nested ones, which made it a little less useful than it could be), it now returns a struct (Children) containing a few helper methods:

  • Children::all: returns a slice containing all subnodes (including nested tags). Calling ::all() on .x in the HTML above would return [div.y, div.z, div.z, div.z] (ignoring the whitespaces)
  • Children::top: returns a slice containing only the "topmost", direct subnodes (which is what old HTMLTag::children did, before). Calling ::top() on .x in the HTML above would return [div.y] (ignoring the whitespaces), because that's the only immediate subnode
  • Children::{start, end, boundaries}: lower-level methods for getting the starting/ending boundary (indices) of the children. This is used by Children::all to build up the slice

@dist1ll
Copy link
Author

dist1ll commented Feb 3, 2022

wow good stuff! I think those are great improvements! I'll try out the new features, cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants