html_element() cannot select itself #382

Open
JosiahParry opened this issue Dec 22, 2023 · 4 comments
Comments

@JosiahParry

After using html_children(), the contents cannot be accessed using html_element() or html_elements().

I would not be surprised if this is user error, I'm just not sure where.

library(rvest)
html <- minimal_html(r"{
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
}")

html_elements(html, ".div-class .my-class")
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

div_children <- html_elements(html, ".div-class") |> 
  html_children() 
  
div_children
#> {xml_nodeset (2)}
#> [1] <h1 class="my-class">Hello</h1>
#> [2] <h2 class="subclass">World</h2>

html_elements(div_children, ".my-class")
#> {xml_nodeset (0)}

Created on 2023-12-22 with reprex v2.0.2

@rossellhayes

rossellhayes commented Dec 22, 2023

When a CSS selector is passed to html_elements(), it is converted to XPath with rvest:::make_selector(). make_selector() always prefixes the XPath with .//, so it can find nodes at every level except the top level. Because div_children is the result of html_children(), the nodes in question are top-level. To work around this, you can bypass rvest's prefixing by converting the selector to XPath yourself:

html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

Created on 2023-12-22 with reprex v2.0.2

I believe this issue could be fixed in rvest by changing https://github.com/tidyverse/rvest/blob/main/R/selectors.R#L99C41-L99C41 to prefix descendant-or-self:: rather than .//. However, this test suggests that not being able to select top-level nodes is a purposeful design decision.
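
To make the difference between the two prefixes concrete, here is a minimal sketch that skips rvest's selector conversion entirely and runs two hand-written XPath expressions through xml2::xml_find_all(). The attribute test [@class='my-class'] is a simplification I chose for illustration; it is not the exact XPath that rvest or selectr generates.

```r
library(rvest)
library(xml2)

html <- minimal_html('
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
')

# The <h1> and <h2> become the context nodes of the nodeset
div_children <- html_children(html_elements(html, ".div-class"))

# ".//" only looks strictly below each context node, so the <h1> itself is skipped
below <- xml_find_all(div_children, ".//*[@class='my-class']")

# "descendant-or-self::" also tests each context node itself
with_self <- xml_find_all(div_children, "descendant-or-self::*[@class='my-class']")

length(below)
length(with_self)
```

The first query should come back empty and the second should return the h1, which matches the behavior seen above with html_elements().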

@JosiahParry
Author

@rossellhayes well that beats my suggestion:

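library(xml2)  # provides xml_new_root() and xml_add_child()

# wrap the children in a temporary root so they are no longer top-level
x <- xml_new_root("tmp")

for (child in div_children) {
  xml_add_child(x, child)
}

# the selector now matches because the nodes sit below the new root
html_elements(x, ".my-class")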

@JosiahParry
Author

JosiahParry commented Dec 22, 2023

Thanks @rossellhayes. I'm wondering whether something more is going on here that I'm not grasping, or whether this is an actual bug, whereas the previous behavior was not, per your findings in the test.

Using the selectr package to identify nodes does not permit removal from the document with xml2::xml_remove(). I wonder if this is another case (or the same) in which top level items are treated differently.

library(xml2)
library(rvest)
html <- minimal_html(r"{
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
}")

html_elements(html, ".div-class .my-class")
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

div_children <- html_elements(html, ".div-class") |> 
  html_children() 

# select using selectr
html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

# remove the node using xml_remove
xml2::xml_remove(
  html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
)

# see if it's still there
html_elements(div_children, xpath = selectr::css_to_xpath(".my-class"))
#> {xml_nodeset (1)}
#> [1] <h1 class="my-class">Hello</h1>

# repeat at the top level html
xml2::xml_remove(html_elements(html, ".div-class .my-class"))

# see if it is still there
html_elements(html, ".div-class .my-class")
#> {xml_nodeset (0)}

Created on 2023-12-22 with reprex v2.0.2

EDIT: ignore me. It seems free = TRUE must be set when it's a subset of nodes

# remove the node using xml_remove
xml2::xml_remove(
  html_elements(div_children, xpath = selectr::css_to_xpath(".my-class")),
  free = TRUE
)

@hadley
Member

hadley commented Jan 23, 2024

html_elements() selects child elements, and is designed not to select the elements themselves (otherwise this can make recursing over a document very tricky). This is probably worth a clarifying sentence in the docs.
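
Given that design, two ways to stay within rvest's intended usage are sketched below: select from the parent with a child combinator, or filter the nodeset you already hold by attribute. This is only an illustration, not an official recommendation; the exact-match test on class is an assumption that only holds when each element has a single class.

```r
library(rvest)

html <- minimal_html('
<div class="div-class">
  <h1 class="my-class">Hello</h1>
  <h2 class="subclass">World</h2>
</div>
')

# Option 1: select from the parent, so the target is a descendant of the context
from_parent <- html_elements(html, ".div-class > .my-class")

# Option 2: filter the existing nodeset by its class attribute
# (exact match; elements with several classes would need a different test)
div_children <- html_children(html_elements(html, ".div-class"))
filtered <- div_children[html_attr(div_children, "class") %in% "my-class"]

html_text(from_parent)
html_text(filtered)
```

Both approaches should yield the h1 containing "Hello" without calling selectr directly.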

hadley changed the title from "html_element() cannot select on children" to "html_element() cannot select itself" on Feb 27, 2024