Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Handle parsing changes in foreign content. #6006

Closed

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Feb 1, 2024

Trac ticket: Core-61576

Status

This is undergoing final review and preparation.

Earlier versions

  • First draft (ba2c889)

    • By implementing the rules at the top of Tree Construction/Dispatch, we don't need $skip_next_foreign_content_processing, because we pop nodes off of the stack of open elements until we're back at an html element, or an HTML integration point, or a MathML integration point. Returning to step() will choose insertion mode instead of foreign content and prevent an infinite loop.
  • Second draft (3e4bb4d)

    • In this draft the attribute and tag names are properly cased, but it throws off the logic mechanisms. Since attributes and tags are still parsed according to HTML rules (ASCII-case-insensitive), it makes sense to continue reporting them in all caps. Perhaps a new method such as get_token_display_name() would provide the remapping.
  • Third draft (@sirreal's work)

    • Jon found a way to fix the namespace issues by rearranging the assignment of a namespace to the tokens and the detection of integration points.

Notes

  • There's an interplay we have to figure out with casing of tag names. For all of our comparison functions we check the upper-case variants, but there's value in reporting the foreign content tag names as lower-cased or as mixed case.
  • There are some SVG tag names that are supposed to report in mixed case. For both SVG and MathML there are attributes which are supposed to report in mixed case. Unlike my first assumption, however, they still all parse with ASCII case insensitivity. That is, <circle clippath=1 clipPath=2 CLIPPATH=3 cliPPath=4> has a single attribute named clipPath and whose value is 1. This is convenient because we don't have to change the Tag Processor, but inconvenient when things are mostly based on lower-case attribute names.
  • We probably need to repeal the idea that tag names are upper-case and add another bit to communicate tag name vs. doctype declaration. Maybe? Can we test that math:mi or math:MI is in the math namespace? There can be no HTML tag named math:mi (it would be MATH:MI).
  • Could be handy to have something like base_class_get_tag() or private function comparison_tag_name() etc… to report an upper-case tag name, while preserving the case-variants required in foreign content to outside calls.
  • Tracking the namespace seems a bit fishy. When we're in the Tag Processor we can know that if we encounter an SVG or MATH tag when we're in the html namespace, that it should change the namespace to svg or math, respectively, and lower-case the tag names. However, the role of integration points and parsing things in the insertion mode is still vague.
  • Foreign content is not an insertion mode. This is important when handling HTML tags inside HTML integration points. This is very important for get_modifiable_text() which doesn't know if a text node inside foreign content is being processed as foreign content or in the insertion mode, where it determines if NULL bytes should be replaced or removed.
  • It's probably necessary to start over again with these insights and more in mind.

Description

We should reliable detect foreign content and we need to do it in the Tag Processor, specifically because of the rules for CDATA sections. The HTML Processor needs this as well to determine if things like self-closing flags for HTML elements should be respected.

- Tests: 1498, Assertions: 787, Skipped: 711.
+ Tests: 1498, Assertions: 930, Skipped: 568.

Unlocks:

  • proper detection of CDATA
  • parsing of SVG and MathML
  • indicating if one should expect a closing tag or not
Screenshot 2024-02-01 at 9 49 43 PM

Screenshot 2024-02-02 at 12 13 47 AM

Copy link

github-actions bot commented Feb 1, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Copy link

github-actions bot commented Feb 1, 2024

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

'A' === $html[ $at + 7 ] &&
'[' === $html[ $at + 8 ]
) {
$closer_at = strpos( $html, ']]>', $at + 1 );

This comment was marked as resolved.

As part of work to add more spec support to the HTML API, this patch adds
support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and
introduce additional complexity into the processor, but is important in
order to avoid getting lost when inside these elements.

Developed in WordPress#6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.
@dmsnell dmsnell force-pushed the html-api/detect-foreign-content branch from 3e4bb4d to 68a6b4a Compare July 30, 2024 23:38
Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

in_array( $token_name, array( 'IFRAME', 'NOEMBED', 'NOFRAMES', 'SCRIPT', 'STYLE', 'TEXTAREA', 'TITLE', 'XMP' ), true )
in_array( $token_name, array( 'IFRAME', 'NOEMBED', 'NOFRAMES', 'SCRIPT', 'STYLE', 'TEXTAREA', 'TITLE', 'XMP' ), true ) ||
// Self-closing elements in foreign content.
( isset( $node ) && 'html' !== $node->namespace && $node->has_self_closing_flag )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is a performance benefit for doing this, I think generally better to check for equality with null since the variable is set. But even better here to check if it is the instance of the expected class, right?

Suggested change
( isset( $node ) && 'html' !== $node->namespace && $node->has_self_closing_flag )
( $node instanceof WP_HTML_Token && 'html' !== $node->namespace && $node->has_self_closing_flag )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I think I did this for a reason, but I can't remember. likely I had other code in transition and had non-null non-token instances passed around

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bookmarking this, the isset( $node ) seems like it would make this function return true under some circumstances where we would not expect that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was much discussion over this little function, and a lot of effort to "fix" it that only resulted in broken code.

I'd like to propose we continue with it as written and come back afterwards to clean it up.

right now, all of the "fixes" are only used in test code when expects_closer() is called without a $node passed in. the real problem is that there's confusion in what this function is communicating and how step() pops elements off of the stack of open elements.

we recently realized that we need to pop elements off not in step(), but in the parsing rules where they indicate as much. I'd rather we ship this odd bit now and create the more-sweeping fix in one go instead of stewing on code we think must be wrong but which we don't have a failing test case for.


I've added a type annotation to the function, ensuring that isset() is sufficient, and I've added a @todo annotation earmarking this for review. I think what's going on is a complicated interaction with step() where this looks wrong but isn't.

By the way, this function returns "expects closer" for all closing tags, which is "wrong" but also part of the bigger issue not created in this PR.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR to address this: #7162

@@ -2237,8 +2260,12 @@ private function step_in_body(): bool {
* These ought to be handled in the attribute methods.
*/

$this->bail( 'Cannot process MATH element, opening foreign content.' );
break;
$this->change_parsing_namespace( 'math' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the parsing namespace changed back to html when encountering </SVG> or </MATH>?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble figuring out from the spec where the namespace is assigned for any other token besides SVG and MATH, as I'm having failures with a math ANNOTATION-XML tag

* > An end tag whose name is "script", if the current node is an SVG script element.
*/
if ( $this->is_tag_closer() && 'SCRIPT' === $this->state->current_token->node_name && 'svg' === $this->state->current_token->namespace ) {
$this->bail( 'Cannot parse SCRIPT tags inside SVG elements.' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be ignored since the scripting flag is not enabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can't be ignored because there are rules that we would (might) need to follow and I didn't want to dive into that at this point. perhaps the rules aren't that complicated. it would require looking into the spec, determining what is required, and if we have enough state tracking already to handle it.

if ( 'svg' === $this->get_namespace() ) {
return (
'DESC' === $tag_name ||
'FOREIGNoBJECT' === $tag_name ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the lower-case o?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a TyPO 😄

Comment on lines +4857 to +4863
if ( is_string( $tag_name ) ) {
$tag_name = strtoupper( $tag_name );
} else {
$tag_name = 'html' === $tag_name->namespace
? strtoupper( $tag_name->node_name )
: "{$tag_name->namespace} {$tag_name->node_name}";
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the normal case is now the object:

Suggested change
if ( is_string( $tag_name ) ) {
$tag_name = strtoupper( $tag_name );
} else {
$tag_name = 'html' === $tag_name->namespace
? strtoupper( $tag_name->node_name )
: "{$tag_name->namespace} {$tag_name->node_name}";
}
if ( $tag_name instanceof WP_HTML_Token ) {
$tag_name = 'html' === $tag_name->namespace
? strtoupper( $tag_name->node_name )
: "{$tag_name->namespace} {$tag_name->node_name}";
} else {
$tag_name = strtoupper( $tag_name );
}

Also, should $tag_name be renamed $node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah probably. thanks for the review - there are still some logical error in here I'm tracking, specifically, trying to figure out the right place to assign namespaces and avoid infinite loops when reprocessing nodes from the foreign content rules.

@dmsnell
Copy link
Member Author

dmsnell commented Jul 31, 2024

@sirreal @westonruter I've pushed some updates that incorporate trunk and a few other fixes. notably, I had the wrong code for adjusted_current_node().

break;
}

echo "\e[90mPopping a \e[35m{$current_node->namespace} \e[34m{$current_node->node_name}\e[90m from the stack.\e[m\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug code I'm presuming.

Suggested change
echo "\e[90mPopping a \e[35m{$current_node->namespace} \e[34m{$current_node->node_name}\e[90m from the stack.\e[m\n";

Comment on lines 4030 to 4031
echo "\e[90mACI is a \e[35m{$this->state->current_token->namespace}\e[90m \e[34m{$this->state->current_token->node_name}\e[m\n";
echo "\e[90mInserted a \e[35m{$this->state->current_token->namespace}\e[90m \e[34m{$this->state->current_token->node_name}\e[m\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug code I'm presuming.

Suggested change
echo "\e[90mACI is a \e[35m{$this->state->current_token->namespace}\e[90m \e[34m{$this->state->current_token->node_name}\e[m\n";
echo "\e[90mInserted a \e[35m{$this->state->current_token->namespace}\e[90m \e[34m{$this->state->current_token->node_name}\e[m\n";

@sirreal
Copy link
Member

sirreal commented Aug 1, 2024

This is very tricky. I'll share some cases that are problematic right now:

Update: I've debugged all of these cases and pushed some changes to fix them.

Fixed

<svg><a/>Text after "svg:a" self closing tag.

Errors with Cannot run adoption agency when "any other end tag" is required.

<svg></svg><![CDATA[ bogus comment ]]>

svg :svg ! CDATA  bogus comment

<svg><foreignObject><![CDATA[ bogus comment ]]>

svg foreignObject ! CDATA  bogus comment

<svg><title><div>

svg title div

<svg><title><rect><div>

svg title rect div

<svg><title><svg><div>

svg title svg div

<math><mi><div></div></mi><mi>

math mi div :div :mi mi

<math><mi><svg><foreignObject><div>

math mi svg foreignObject div

<svg><foreignObject>\x00filler\x00text (null bytes)

svg foreignObject x00filler x00text

<svg></br><foo>

svg :br foo

<math><mtext><b>

math mtext b

sirreal and others added 4 commits August 1, 2024 11:56
The push/pop handlers are a good opportunity to change parsing namespace
as elements are encountered and moved onto the stack. This helps to keep
the parsing namespace up to date without requiring state changes spread
all over parsing code.
@dmsnell dmsnell force-pushed the html-api/detect-foreign-content branch from f4373ff to a1e48e3 Compare August 8, 2024 05:28
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for the relevant foreign elements in the HTML algorithms within
the stack of open elements. Although the HTML Processor cannot yet step
into these elements, the format of how they will be represented was
determined in the related work from which this patch is extracted.

This patch extracted from WordPress#6006.

Developed in https://github.com/wordpress/wordpress-develop/pull/
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for the relevant foreign elements in the HTML algorithms within
the stack of open elements. Although the HTML Processor cannot yet step
into these elements, the format of how they will be represented was
determined in the related work from which this patch is extracted.

This patch extracted from WordPress#6006.

Developed in WordPress#7157
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.
@dmsnell dmsnell marked this pull request as ready for review August 8, 2024 06:37
The specification says to run run the "any other end tag" steps under
certain other conditions. A goto and label are used for this. When
moving to the label, the "any other end tag" condition does not apply,
the goal is to run the steps regardless.

Move the label into the conditional block so that the condition is not
erroneously checked.

This caused a failure to return a value when stepping on
`<svg><script/>`.
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ready to me, I'm very excited to add this support!

The amount of html5lib-tests that run (and all pass) gives me confident in this work:

OK, but incomplete, skipped, or risky tests!
Tests: 1498, Assertions: 930, Skipped: 568.

pento pushed a commit that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and
introduce additional complexity into the processor, but is important in
order to avoid getting lost when inside these elements.

Developed in #6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell, westonruter.
See #61576.


git-svn-id: https://develop.svn.wordpress.org/trunk@58867 602fd350-edb4-49c9-b593-d223f7449a82
@dmsnell
Copy link
Member Author

dmsnell commented Aug 8, 2024

Merged in [58867]
de084d7

@dmsnell dmsnell closed this Aug 8, 2024
@dmsnell dmsnell deleted the html-api/detect-foreign-content branch August 8, 2024 07:25
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and
introduce additional complexity into the processor, but is important in
order to avoid getting lost when inside these elements.

Developed in WordPress/wordpress-develop#6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell, westonruter.
See #61576.

Built from https://develop.svn.wordpress.org/trunk@58867


git-svn-id: http://core.svn.wordpress.org/trunk@58263 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and
introduce additional complexity into the processor, but is important in
order to avoid getting lost when inside these elements.

Developed in WordPress/wordpress-develop#6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell, westonruter.
See #61576.

Built from https://develop.svn.wordpress.org/trunk@58867


git-svn-id: https://core.svn.wordpress.org/trunk@58263 1a063a9b-81f0-0310-95a4-ce76da25c4cd
pento pushed a commit that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and introduce additional complexity into the processor, but is important in order to avoid getting lost when inside these elements.

This patch follows the first by deleting the empty files, which were mistakenly left in during the initial merge.

Developed in #6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58867].

Props: dmsnell, jonsurrell, westonruter.
See #61576.


git-svn-id: https://develop.svn.wordpress.org/trunk@58868 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and introduce additional complexity into the processor, but is important in order to avoid getting lost when inside these elements.

This patch follows the first by deleting the empty files, which were mistakenly left in during the initial merge.

Developed in WordPress/wordpress-develop#6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58867].

Props: dmsnell, jonsurrell, westonruter.
See #61576.

Built from https://develop.svn.wordpress.org/trunk@58868


git-svn-id: http://core.svn.wordpress.org/trunk@58264 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Aug 8, 2024
As part of work to add more spec support to the HTML API, this patch adds support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and introduce additional complexity into the processor, but is important in order to avoid getting lost when inside these elements.

This patch follows the first by deleting the empty files, which were mistakenly left in during the initial merge.

Developed in WordPress/wordpress-develop#6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58867].

Props: dmsnell, jonsurrell, westonruter.
See #61576.

Built from https://develop.svn.wordpress.org/trunk@58868


git-svn-id: https://core.svn.wordpress.org/trunk@58264 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants