Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Add matches_breadcrumbs() to HTML Processor for better querying #5243

Closed

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Sep 19, 2023

Trac ticket: #59400

Inside a next_tag() loop it can be challenging to use breadcrumbs because they are only exposed inside the call to next_tag() via the $query arg.

In this patch a new method, matches_breadcrumbs() is exposed which allows for querying within the next_tag() loop for more complicated queries.

This method exposes a wildcard * operator to allow matching any HTML tag that the currently-matched tag is a child or descendant of.

@dmsnell dmsnell force-pushed the html-api/add-matches-breadcrumbs branch 2 times, most recently from ab6952c to 48b631d Compare September 19, 2023 21:58
@dmsnell dmsnell marked this pull request as ready for review September 19, 2023 23:01
Comment on lines +423 to +427
* At some point this function _may_ support a `**` syntax for matching any number
* of unspecified tags in the breadcrumb stack. This has been intentionally left
* out, however, to keep this function simple and to avoid introducing backtracking,
* which could open up surprising performance breakdowns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's say you did want to implement ** behavior but limited to 5 levels, where img is, say, inside of main. To do that, you currently would have to do:

for ( $asterisk_count = 0; $asterisk_count <= 5; $asterisk_count++ ) {
    if (
        $processor->matches_breadcrumbs(
            array(
                'main',
                ...array_fill( 0, $asterisk_count, '*' )
                'img' 
            ) 
        ) 
    ) {
        return true;
    }
}
return false;

Is that right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha yeah this would work and accurately conveys the backtracking problem, though I believe that there are some more performant backtracking mechanisms that don't have to re-process as much as this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's still possible that I'm way overthinking this, confusing it with a different kind of match, where it would matter which of the inner or outer elements are matching.

for a pattern like main, **, div, **, img though I want to make sure this doesn't break, or maybe it's better to compare that against main, *, div, **, img

for the first one, with two ** groups, I read it as "walk up the stack of open elements; if there is a DIV somewhere and also a MAIN somewhere then this matches."

for the second one though we have a tighter constraint above the simpler one. if we have this breadcrumb HTML > BODY > MAIN > SECTION > DIV > DIV > IMG then we have to make sure that the ** group only matches the last DIV, otherwise main, *, div will fail to match because we were too greedy.

This is where I think we must start backtracking: after that higher DIV fails the rest of the match. That's a very recursive problem too. If we start by being un-greedy though, we stop at the lower DIV and it fails the rest of the match, so we have to track that ** group and backtrack on the pattern vs. on the content. We have to remember each double wildcard group and go back to reattempt matches after adjusting our position in the stack.

The probably probably starts if * allowed for a zero match, but when I tossed it in there it didn't seem to have any connotation of that since this is like a filesystem wildcard where those expectations are that * in a path is something and not nothing.

Copy link
Member Author

@dmsnell dmsnell Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking all this through pushes me to leave this for userspace for now. we expose get_breadcrumbs() and so someone is implement this on their own until we find a good solution for it.

function preg_matches_breadcrumbs( string $pattern, array $breadcrumbs ) {
	return false !== preg_match( $pattern, implode( ' > ', $breadcrumbs ) );
}

if ( preg_matches_breadcrumbs( '~ > MAIN > .* IMG$~', $processor->get_breadcrumbs() ) ) {
	…
}

this format is surprisingly useful now that I hack it up

src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
@dmsnell dmsnell force-pushed the html-api/add-matches-breadcrumbs branch 3 times, most recently from d2c276e to 90a2787 Compare September 21, 2023 23:20

if ( '*' !== $crumb && $this->get_tag() !== strtoupper( $crumb ) ) {
return false;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is bothering me because I don't currently understand why it's necessary. it seems like we should have the currently-matched tag on the stack of open elements, but we still need to check this. then, it seems like we'd be checking twice (in the first iteration of the loop below), but we aren't and all the tests are passing.

* May also contain the wildcard `*` which matches a single element, e.g. `array( 'SECTION', '*' )`.
* @return bool Whether the currently-matched tag is found at the given nested structure.
*/
public function matches_breadcrumbs( $breadcrumbs ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is out of scope for this PR, but could the $breadcrumbs format be expanded beyond just a list of tag names to instead be a list of queries (or the like)? That is, it would be very useful to check not just that the tag name matches but more so that the element has a given ID or a class name. Otherwise, the issue is that tag name alone may not be specific enough to be useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, let's say I have the HTML for the entire page and I need to add fetchpriority=high to the featured image. In Chrome DevTools, the featured image has the selector #page > div.single-featured-image-header > img. In tag name breadcrumbs, this is ['div', 'div', 'img'] which is very unspecific. It would be great to be able to pass in something like:

array(
	array("id" => "page"),
	array(
		"tag" => "div",
		"class" => "single-featured-image-header",
	),
	array("tag" => "img"),
}

Copy link
Member Author

@dmsnell dmsnell Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's definitely out of scope but it's something I've been thinking about a lot recently. it extends to all sorts of challenges in querying and I've been exploring the use of bookmarks to accomplish these needs. one thing we probably don't want to do here is jump around in the document while jumping around, meaning either we track everything about every token (memory bloat) or find a way to only trap what we care about when.

the HTML Processor is using $current_token to hold not only a tag name, but also a full reference to the location in the document for where something is. this is actually required to properly run the parsing, as certain operations imply closing tags of not only a given name, but a specific tag.

the idea would be something like this, where we still leave this up to userspace so that the performance of the processor remains more predictable, and additional processing/memory can be exposed explicitly in consuming code.

while ( $processor->next_tag() ) {
	if (
		'DIV' === $processor->get_tag() &&
		$processor->has_class( 'single-featured-image-header' )
	) {
		$processor->set_bookmark( 'wrapper' );
		continue;
	}

	if (
		'IMG' === $processor->get_tag() &&
		$processor->is_bookmark_open( 'wrapper' )
	) {
		$processor->set_attribute( 'fetchpriority', 'high' );
	}

and I believe that we could create a variety of relationship functions like this: is_child_of( $bookmark_name ), is_descendant_of( $bookmark_name ), is_sibling_of( $bookmark_name ).

internally there's a callback for when popping an element off of the stack of open elements, and I have been thinking of exposing that to userspace. I think it should be safe, but I have questions about how best to represent that. this is what led me to the is_ functions with bookmarks.

…erying

Inside a `next_tag()` loop it can be challenging to use breadcrumbs because
they are only exposed inside the call to `next_tag()` via the `$query` arg.

In this patch a new method, `matches_breadcrumbs()` is exposed which allows
for querying within the `next_tag()` loop for more complicated queries.

This method exposes a wildcard `*` operator to allow matching _any HTML tag_
that the currently-matched tag is a child or descendant of.
 - add wildcard to docblock description of breadcrumbs parameters
 - revert to scanning breadcrumb array with end/current/prev to handle sparse arrays

Props: @westonruter
@ockham ockham force-pushed the html-api/add-matches-breadcrumbs branch from 65759fc to cd79cdd Compare September 26, 2023 07:23
@ockham
Copy link
Contributor

ockham commented Sep 26, 2023

Rebased.

@ockham
Copy link
Contributor

ockham commented Sep 26, 2023

Committed to Core in https://core.trac.wordpress.org/changeset/56702.

@ockham ockham closed this Sep 26, 2023
@dmsnell dmsnell deleted the html-api/add-matches-breadcrumbs branch September 26, 2023 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants