Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added host validation for urls #983

Merged
merged 18 commits into from
Mar 8, 2018
Merged

added host validation for urls #983

merged 18 commits into from
Mar 8, 2018

Conversation

rubengonzalezmrf
Copy link

@rubengonzalezmrf rubengonzalezmrf commented Mar 1, 2018

Fixes #977.

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. A few changes needed.

Also, doesn't look like the unit test assertions are working: https://travis-ci.org/Automattic/amp-wp/jobs/347682592#L279

'<a class="foo" href="http://foo bar">value</a>',
'<a class="foo" href="">value</a>',
),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test cases for examples found in #977 (comment)

@@ -28,6 +28,7 @@ class AMP_Allowed_Tags_Generated {
'value_url' => array(
'allow_empty' => true,
'allow_relative' => true,
'valid_host' => true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't going to work since this file is generated from the spec via amphtml-update.py, so your change here will be lost.

* If given attribute's value is a URL with a host, the host must
* be valid
*/
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::VALID_HOST ] ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of introducing AMP_Rule_Spec::VALID_HOST here, which isn't actually part of the spec, I suggest you validate the host whenever isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ). Should it not be validated for every attribute that is a URL value?

@@ -1122,6 +1138,40 @@ private function check_attr_spec_rule_value_regex_casei( $node, $attr_name, $att
return AMP_Rule_Spec::NOT_APPLICABLE;
}

/**
* Check if attribute has a valid host value
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a @since 0.7 here

$urls_to_test = explode( ',', $attr_value );

foreach ( $urls_to_test as $url ) {
$url_host = AMP_WP_Utils::parse_url( urldecode( $url ), PHP_URL_HOST );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in #876, AMP_WP_Utils is obsolete and you can use wp_parse_url() directly.


foreach ( $urls_to_test as $url ) {
$url_host = AMP_WP_Utils::parse_url( urldecode( $url ), PHP_URL_HOST );
if ( $url_host && preg_match( '/[!"#$%&\'()*+,\/:;<=>?@[\]^`{|}~\s]/i', $url_host ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should think that a whitelist should be used rather than a blacklist here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to think like you, but the amp validator is doing it this way, looking for blacklisted chars, so i thought it would be better to follow the same approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonruter
Copy link
Member

I'm getting the same failures locally.

@westonruter
Copy link
Member

If I add post_content that includes:

<a class="foo" href="http://foo bar">value</a>

<a class="foo" href="mail to:[email protected]">value</a>

I get out:

<p><a class="foo" href="">value</a></p>

<p><a class="foo" href="mail%20to:[email protected]">value</a></p>

So it seems the mail to check is failing.

}

// Check if the protocol contains invalid chars.
$url_scheme = AMP_WP_Utils::parse_url( $url, PHP_URL_SCHEME );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use wp_parse_url() instead of AMP_WP_Utils::parse_url() here.

if ( $node->hasAttribute( $attr_name ) ) {
$attr_value = $node->getAttribute( $attr_name );
$attr_value = preg_replace( '/\s*,\s*/', ',', $attr_value );
$urls_to_test = explode( ',', $attr_value );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the comma-separated list of URLs. Naturally this shouldn't be allowed in an href or src. What elements do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also combine these two lines into:

$urls_to_test = preg_split( '/\s*,\s*/', $attr_value );

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an image with a srcset attribute would have a list of comma separated urls, so i thought it could be a good idea to check all of them

@rubengonzalezmrf
Copy link
Author

I have post content now with the same 2 examples you added, and i get out the expected results. But unit tests fails anyway

if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) ) {
if ( $node->hasAttribute( $attr_name ) ) {
$attr_value = $node->getAttribute( $attr_name );
$urls_to_test = preg_split( '/\s*,\s*/', $attr_value );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since only srcset has comma-separated list of URLs, maybe this should check of 'srcset' === $attr_name for whether to split. Maybe that's overkill. My only concern would be if a non-srcset URL happening to have a comma in it and that this would result in the attribute being invalid.

The alternative could be to do this:

$urls_to_test = preg_split( '#\s*,\s*(?=https?://)#', $attr_value );

That will ensure that it will only split where a comma is followed by another URL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that check_attr_spec_rule_allowed_protocol and check_attr_spec_rule_disallowed_relative do simply comma-separated splits. So probably not something to worry about here then.

@westonruter
Copy link
Member

But unit tests fails anyway

The tests may not be failing. The jobs are failing due to a PHPCS error:

includes/sanitizers/class-amp-tag-and-attribute-sanitizer.php:1171:36: error - Use Yoda Condition checks, you must. (WordPress.PHP.YodaConditions.NotYoda)

https://travis-ci.org/Automattic/amp-wp/jobs/350210562#L216

* Fix phpcs yoda error.
* Fix phpunit failures in AMP_Tag_And_Attribute_Sanitizer_Attr_Spec_Rules_Test.
'a_with_wrong_host' => array(
'<a class="foo" href="http://foo bar">value</a>',
'<a class="foo" href="">value</a>',
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should add a test here for scheme-less URLs, like: <a class="foo" href="//bad domain with a space.com/foo">value</a>

}

// Check if the protocol contains invalid chars.
$dots_pos = strpos( $url, ':' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be using $scheme = wp_parse_url( $url, PHP_URL_SCHEME ) to obtain the schema instead of ':'? What if a URL contains a port, then this could match in the case of a scheme-less URL like //example.com:80/foo/ or (more likely) the URL contains a colon in the path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, you could omit PHP_URL_HOST param above and then just check for $parsed_url['scheme'].

Copy link
Author

@rubengonzalezmrf rubengonzalezmrf Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that mail to:[email protected] won't be correctly parsed by wp_parse_url, the scheme is null.
If the url contains a port number the code will check that no invalid chars are being used before the port number which will pass if the url is correct.

$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $attr_name ) );
foreach ( $urls_to_test as $url ) {
// Check if the host contains invalid chars.
$url_host = wp_parse_url( urldecode( $url ), PHP_URL_HOST );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use of urldecode() here. If I have a URL like https://%65%78%61%6d%70%6c%65%2e%63%6f%6d/ then this should be valid in AMP, as it is just https://example.com/ 👍

@westonruter westonruter added this to the v0.7 milestone Mar 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants