Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent validation errors for URLs with leading or trailing spaces #4166

Merged
merged 9 commits into from
Jan 24, 2020

Conversation

kienstra
Copy link
Contributor

@kienstra kienstra commented Jan 23, 2020

Summary

Prevents a validation error if a URL has a leading or trailing space.

Fixes #4117

Background

As Weston found, a leading space in a URL can cause a validation error:

<a href=" https://wp.org">Click Here</a>

Unknown error (INVALID_URL_PROTOCOL)

Preventing that error with uncovered an INVALID_URL error, which needed trim() here:

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

As Weston found, something like:
<a href=" https://wp.org"> would have
a validation error.
Those trim( $url ) calls were passed
to parse_protocol().
So do something similar inside parse_protocol(),
don't match leading space.
@googlebot googlebot added the cla: yes Signed the Google CLA label Jan 23, 2020
There is a matching group of ([^/]),
so there probably isn't a need for the lookahead.
Instead of changing this,
simply trim() the URL before passing it
to parse_protocol()
@@ -1626,7 +1626,7 @@ private function check_attr_spec_rule_value_regex_casei( DOMElement $node, $attr
private function check_attr_spec_rule_valid_url( DOMElement $node, $attr_name, $attr_spec_rule ) {
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) && $node->hasAttribute( $attr_name ) ) {
foreach ( $this->extract_attribute_urls( $node->getAttributeNode( $attr_name ) ) as $url ) {
$url = urldecode( $url );
$url = urldecode( trim( $url ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the AMP validator does this same trimming up front: https://github.com/ampproject/amphtml/blob/af1e3a550feeafd732226202b8d1f26dcefefa18/validator/engine/parse-url.js#L214-L215

However, it is also doing one more thing:

    // Browsers remove Tab/CR/LF from the entire URL, so we do too.
    unparsed = unparsed.replace(/[\t\r\n]/g, '');

I think that should be incorporated here as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This being the case, instead of calling trim() directly, let's create a new private method like normalize_url_from_attribute_value() which does both the trimming and the tab/newline removal and then use that method everywhere that you are currently using trim() instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, it'll probably be several hours before I could do that, if that's alright.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

911602d adds a method to normalize the URLs, sorry for the delay.

@westonruter westonruter added this to the v1.5 milestone Jan 23, 2020
As Weston mentioned, there might not be a need
for it to run earlier.
parse_protocol() no longer trims
space on its own.
Use this instead of trim()
for the URLs in attributes.
@westonruter
Copy link
Member

@kienstra as opposed to adding empty commits, why not just restart the failed jobs in Travis?

In any case, please rebase the branch to remove the throwaway commits when done.

@kienstra
Copy link
Contributor Author

kienstra commented Jan 24, 2020

@kienstra as opposed to adding empty commits, why not just restart the failed jobs in Travis?

Yeah, that's a good idea. I had restarted the jobs several times, but pushed an empty commit just to see.

In any case, please rebase the branch to remove the throwaway commits when done.

Sure.

@kienstra
Copy link
Contributor Author

This build passed, but I'll trigger the external-http jobs once more to make sure.

@kienstra
Copy link
Contributor Author

Another external-https build passed. It's still a little troubling that they failed a few times with this PR. I'll trigger it again.

@westonruter
Copy link
Member

I've noticed the external-http requests being flaky, so I don't think it's anything to do with your code.

],
'a_with_mail_host' => [
'<a class="foo" href="mail to:[email protected]">value</a>',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the space in:

...is what caused this be invalid.

Passing <a class="foo" href="mailto:[email protected]"> to the AMP validator shows that it's valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be resolved by implementing my change here: #4166 (comment)

The mail to protocol should still be invalid, so this chunk should be reverted.

This code: <a class="foo" href="mail to:[email protected]"> should be invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, good point. That's applied in f193019

* @return string The normalized URL.
*/
private function normalize_url_from_attribute_value( $url ) {
return preg_replace( '/\s/', '', $url );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite right.

Only tabs and newlines should be stripped. See https://github.com/ampproject/amphtml/blob/af1e3a550feeafd732226202b8d1f26dcefefa18/validator/engine/parse-url.js#L217-L218

So this, along with the trimming, should be:

Suggested change
return preg_replace( '/\s/', '', $url );
return preg_replace( '/[\t\r\n]/', '', trim( $url ) );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll make that change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's applied in f193019

Use this from the AMP validator,
instead of a simple regex, also use trim().
The behavior of the new parse_url_ method
is changed, so these can change back.
@westonruter westonruter merged commit 277ba23 into develop Jan 24, 2020
@westonruter westonruter deleted the fix/href-validation branch January 24, 2020 06:50
@kienstra
Copy link
Contributor Author

Thanks! 🎉

@westonruter
Copy link
Member

Interestingly, the change here suppresses a validation error from occurring. In particular, consider a theme that is outputting invalid HTML like this:

		<h4 class="name">
			<a href="<a href='https://wpthemetestdata.wordpress.com/' rel='external nofollow ugc' class='url'>themedemos</a>">
				themedemos
			</a>
		</h4>

Before the change here, this results in an AMP validation error:

image

After the change here, no validation error is raised and the HTML markup output by the sanitizer is:

		<h4 class="name">
			<a href="&lt;a_href=%27https%3A%2F%2Fwpthemetestdata.wordpress.com%2F%27+rel%3D%27external+nofollow+ugc%27+class%3D%27url%27%3Ethemedemos%3C%2Fa%3E&amp;amp" rel="amphtml">
				themedemos
			</a>
		</h4>

It turns out this is valid AMP, so this is not a bug. But it shows some difference in handling, however.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Signed the Google CLA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Initial whitespace in href causes validation error
3 participants