-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Interactivity API store JSON encoding #6520
Conversation
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
src/wp-includes/interactivity-api/class-wp-interactivity-api.php
Outdated
Show resolved
Hide resolved
@westonruter brought up a point about Thanks for flagging that! This is interesting feedback, do you have more information on what an attack might look like? I decided to investigate more deeply, so I reviewed reviewing the standard and as far as I can tell to break out of the script you need a literal sequence like To close a script tag, we must find:
Based on the above, my interpretation is that we need to make sure that Here are the relevant JSON encoding flags:
If my analysis so far is correct, either omitting That is to say, it seems that neither flag ( My intuition is to use One thing that concerns me is a sequence of states in a script tag that begins
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
Exactly this.
My intuition as well.
I've not looked too much into this either, but I've also had that in the back of my head when considering
I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the |
Good thought. I'm unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:
|
@sirreal Did you try sending Chrome, Safari, and Firefox all show parse errors when going to https://xhtml-test-page.glitch.me/?breakxml |
$expected = <<<"JSON" | ||
{"config":{"myPlugin":{"chars":"&\\u003C\\u003E/"}},"state":{"myPlugin":{"ampersand":"&","less-than sign":"\\u003C","greater-than sign":"\\u003E","solidus":"/","line separator":"\\u2028","paragraph separator":"\\u2029","flag of england":"\\ud83c\\udff4\\udb40\\udc67\\udb40\\udc62\\udb40\\udc65\\udb40\\udc6e\\udb40\\udc67\\udb40\\udc7f","malicious script closer":"\\u003C/script\\u003E","entity-encoded malicious script closer":"</script>"}}} | ||
JSON; | ||
$this->assertEquals( $expected, $interactivity_data_string[1] ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea: This could use the HTML Processor to step over the tokens in $interactivity_data_markup
as an additional check to ensure that the malicious script closer does not prematurely close the script
. Maybe use the PHP's DOM API as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the same thing as I worked with these tests. However, in this case I think it makes sense to match and work with literal strings. I don't have much confidence in any of the parsers to do what I expect and not try to interpret any of this as HTML markup. I want to know exactly what characters are output and don't want any entities to be transformed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HTML Processor doesn't support SCRIPT tags right now, and the tag processor is much more rudimentary. I'm not sure either is ready to handle what you suggest. @dmsnell thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case the Tag Processor should be inscrutable for testing SCRIPT
elements, because it does consume the entire thing in one go. you can compare get_modifiable_text()
to your expectation to see what was inside the script
Based on that, here's a PHP file that illustrates a script injection that would be prevented by <?php
header('Content-Type: application/xhtml+xml; charset=UTF-8');
$data = "";alert('eek');"";
?><?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
<head>
<script>
var data = <?php echo json_encode( $data, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES ); ?>;
</script>
</head>
<body>
</body>
</html> Attacking the |
But you could still if your malicious code looked like |
If you can figure out a way to get that |
Fascinating! This was helpful. I'll add In XHTML the entities will be decoded in script tags, although this seems to be a source of data corruption more than anything else. I don't think they can be abused to break out of the script tag - that's their purpose: However, it can be dangerous not to escape
Here's the demo script I was playing with: <?php
$xhtml = isset($_GET['xhtml']);
header( 'Content-Type: '
. ( $xhtml ? 'application/xhtml+xml' : 'text/html' )
. '; charset=UTF-8'
);
// PLAY WITH THE VALUE HERE
$s = "<";
$flags = 0
// always these
| JSON_HEX_TAG
| JSON_UNESCAPED_SLASHES
// these with UTF-8
| JSON_UNESCAPED_UNICODE
| JSON_UNESCAPED_LINE_TERMINATORS
| 0 ;
if ( isset( $_GET['amp'] ) ) {
$flags |= JSON_HEX_AMP;
}
if ( $xhtml ) {
echo '<!DOCTYPE html>'
. PHP_EOL
. '<html xmlns="http://www.w3.org/1999/xhtml">';
} else {
echo '<!DOCTYPE html>'
. PHP_EOL
. '<html>';
}
?>
<body>
<script id='1' type="application/json"><?php echo json_encode($s,$flags); ?></script>
<script>
const j = JSON.parse(document.getElementById('1').textContent);
console.log("%o", j);
</script>
</body>
</html>
|
Does anyone know of themes to test that don't support I'm wondering if a check like this is sufficient: if ( ! current_theme_supports( 'html5' ) ) {
$json_encode_flags |= JSON_HEX_AMP;
} Of if this need to check whether in WP Admin as well 🤔 |
Given that pages written XHTML is almost never served as actual XHTML (see GoogleChromeLabs/wpp-research#74), I think we should always assume HTML5. |
The WP Admin is considered HTML5, at least according to |
This is a great discussion, and I think I'd need far more time to digest what's going on, but here are a few initial thoughts.
I'm not sure where this comes from. It's hard to succinctly say other than to say we cannot produce a closing SCRIPT tag. You can see this with the Tag Processor, and note that it mirrors what you see in a browser. $p = new WP_HTML_Tag_Processor( '<script>This is </script😄> and this is </script>' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is </script😄> and this is " One final note is that once we match the $p = new WP_HTML_Tag_Processor( '<script>This is a </script closing tag="</script>">' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is a " The double-escaped script state can be very confusing, largely because of the wording around it. It's specifically there to allow printing SCRIPT elements from JavaScript. You not only need to open an HTML comment with These two discussions cover it better than the spec does IMO. $p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> </script> --> "
$p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> --> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> --> " You can see here how either a closing $p = new WP_HTML_Tag_Processor( '<script>This <!-- <script> <!-- <script> </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This <!-- <script> <!-- <script> </script> --> "
Personally I'd rather see us move forward and focus on HTML5 because it's so extremely rare in practice to find XHTML. Yes it's possible to send the content type header, and yes if you serve as a page as We can keep in mind that HTML cannot be fully expressed through XML and this is one of the failures of XHTML. Why this is important to me is that once we start escaping content within a This is one of the reasons I discourage use of $dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><meta charset=utf8><body><script>This <!-- <script> </script> --> </script> alert(1);</body></html>" );
echo $dom->saveHTML();
// Output:
// <!DOCTYPE html>
// <html><head><meta charset="utf8"></head><body><script>This <!-- <script> </script> --> alert(1);</body></html> In this case we instructed When |
This was perhaps a poor attempt at expressing that we need to find at least |
Given all the discussion and just how difficult it is to serve XHTML, is the consensus that It doesn't seem to be a security issue. There's only an edge case that the data may not be interpreted correctly or the data may cause the xhtml page to be invalid due to malformed cahracter references. And that's all if the page is actually served correctly as XHTML. |
Themes that do not declare
If you use a |
Thanks @sabernhardt. I did some testing with twentyten and it doesn't declare html5 support, but it certainly doesn't render an xhtml page. Unescaped I'm fairly convinced at this point that we don't need I think most discussion is settled and this is ready for review. |
If you want to test with a theme that uses an XHTML doctype in the |
Thanks again @sabernhardt. I tried with Kubrick and Colorway, and those themes still are not rendered as XHTML, no issues without JSON_HEX_AMP encoding. |
Co-authored-by: Weston Ruter <[email protected]>
Previously, the flags were set as if UTF-8 were conditionally present, but in most cases the blog character set will be UTF-8 and so flipping the logic makes it clearer that this is a "happy path" and also remove some logic that isn't necessary most of the time. I've reworded and combined the comment explaining the flags to be more specific about whey they are needed, and removed some discussions about encodings that I thought muddied the important details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sirreal I merged trunk
and made a commit to use the new is_utf8_charset()
Additionally I rearranged the comment and logic for swapping flags because I felt like it would be nicer if we expressed UTF-8 as the default and only removed the other flags if we're in a different charset. I removed some wording in the comment that I felt was more confusing about the encoding than helpful, since JSON is by spec required to be UTF-8 and by convention is universally so.
My commit is only to speed up the process, not demand the change. Please let me know how you feel. If you and @westonruter are still on board, I can merge this in soon.
Thanks @dmsnell, I'm happy with your changes 👍 |
The Interactivity API has been rendering client data in a SCRIPT element with the type `application/json` so that it's not executed as a script, but is available to one. The data runs through `wp_json_encode()` and is encoded with some flags to ensure that potentially-dangerous characters are escaped. However, this can lead to some challenges. Eagerly escaping when not necessary can make the data difficult to comprehend when reading the output HTML. For example, all non-ASCII Unicode characters are escaped with their code point equivalent. This results in `\ud83c\udd70` instead of `🅰`. In this patch, the flags for JSON encoding are refined to ensure what's necessary while relaxing other rules (leaving in those Unicode characters if the blog charset is UTF-8). This makes for Interactivity API data that's quicker as a human reader to decipher and diagnose. In summary: - This data is JSON encoded and printed in a `<script type="application/json">` tag. - If we ensure that `<` is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element's `textContent`. - All other escaping becomes unnecessary at that point, including unicode escaping if the page uses the UTF-8 charset (the same encoding as JSON). See #6433 (review) Developed in #6520 Discussed in https://core.trac.wordpress.org/ticket/61170 Fixes: #61170 Follow-up to: [57563]. Props: bjorsch, dmsnell, jonsurrell, sabernhardt, westonruter. git-svn-id: https://develop.svn.wordpress.org/trunk@58159 602fd350-edb4-49c9-b593-d223f7449a82
The Interactivity API has been rendering client data in a SCRIPT element with the type `application/json` so that it's not executed as a script, but is available to one. The data runs through `wp_json_encode()` and is encoded with some flags to ensure that potentially-dangerous characters are escaped. However, this can lead to some challenges. Eagerly escaping when not necessary can make the data difficult to comprehend when reading the output HTML. For example, all non-ASCII Unicode characters are escaped with their code point equivalent. This results in `\ud83c\udd70` instead of `🅰`. In this patch, the flags for JSON encoding are refined to ensure what's necessary while relaxing other rules (leaving in those Unicode characters if the blog charset is UTF-8). This makes for Interactivity API data that's quicker as a human reader to decipher and diagnose. In summary: - This data is JSON encoded and printed in a `<script type="application/json">` tag. - If we ensure that `<` is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element's `textContent`. - All other escaping becomes unnecessary at that point, including unicode escaping if the page uses the UTF-8 charset (the same encoding as JSON). See WordPress/wordpress-develop#6433 (review) Developed in WordPress/wordpress-develop#6520 Discussed in https://core.trac.wordpress.org/ticket/61170 Fixes: #61170 Follow-up to: [57563]. Props: bjorsch, dmsnell, jonsurrell, sabernhardt, westonruter. Built from https://develop.svn.wordpress.org/trunk@58159 git-svn-id: http://core.svn.wordpress.org/trunk@57622 1a063a9b-81f0-0310-95a4-ce76da25c4cd
The Interactivity API has been rendering client data in a SCRIPT element with the type `application/json` so that it's not executed as a script, but is available to one. The data runs through `wp_json_encode()` and is encoded with some flags to ensure that potentially-dangerous characters are escaped. However, this can lead to some challenges. Eagerly escaping when not necessary can make the data difficult to comprehend when reading the output HTML. For example, all non-ASCII Unicode characters are escaped with their code point equivalent. This results in `\ud83c\udd70` instead of `🅰`. In this patch, the flags for JSON encoding are refined to ensure what's necessary while relaxing other rules (leaving in those Unicode characters if the blog charset is UTF-8). This makes for Interactivity API data that's quicker as a human reader to decipher and diagnose. In summary: - This data is JSON encoded and printed in a `<script type="application/json">` tag. - If we ensure that `<` is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element's `textContent`. - All other escaping becomes unnecessary at that point, including unicode escaping if the page uses the UTF-8 charset (the same encoding as JSON). See WordPress/wordpress-develop#6433 (review) Developed in WordPress/wordpress-develop#6520 Discussed in https://core.trac.wordpress.org/ticket/61170 Fixes: #61170 Follow-up to: [57563]. Props: bjorsch, dmsnell, jonsurrell, sabernhardt, westonruter. Built from https://develop.svn.wordpress.org/trunk@58159 git-svn-id: https://core.svn.wordpress.org/trunk@57622 1a063a9b-81f0-0310-95a4-ce76da25c4cd
Improve JSON serialization of the Interactivity API store. Based on this conversation: #6433 (review)
There is a more detailed analysis below, but in summary:
<script type="application/json">
tag.<
is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element'stextContent
.By using as much utf-8 text without redundant escaping, the output generally becomes smaller and more readable.
Props: @anomiex
Trac ticket: https://core.trac.wordpress.org/ticket/61170
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.