Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix libxml issues with parsing character encoding #1071

Merged
merged 3 commits into from
Apr 13, 2018
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions includes/utils/class-amp-dom-utils.php
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,18 @@ function( $matches ) {
);
}

/*
* Add a pre-HTML5-style declaration of the encoding since libxml<2.8 doesn't recognize
* HTML5's meta charset. See <https://bugzilla.gnome.org/show_bug.cgi?id=655218>.
*/
if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) {
$document = preg_replace(
'#<meta[^>]+charset="([^"]+)"#i',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this may be missing the trailing > so the replacement would seem to result in doubled >> being added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my reply below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where there is <meta charset=utf-8> with no quote marks? Sure this is probably unlikely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm, yeah. I'll include a test case for that and for single quotes as well. I should also account for possible whitespace around the =.

'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the $0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm not replacing the original meta charset tag, just inserting the new meta tag before it. (That's also why I don't include the > in he regex, per your previous comment.)

I see that at a glance it looks like the aim is to replace the whole original meta tag, so I should probably add a comment to clarify what's going on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I understand now. In that case you might want to use a lookahead match instead, like so:

$document = preg_replace(
    '#(?=<meta[^>]+charset="([^"]+)")#i',
    '<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">',
    $document
);

Not really a big deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, I should have thought of that. My only defense is that it was after midnight. :-P

$document
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add a limit of 1.

}

/*
* Wrap in dummy tags, since XML needs one parent node.
* It also makes it easier to loop through nodes.
Expand All @@ -123,6 +135,14 @@ function( $matches ) {
return false;
}

/*
* Remove pre-HTML5-style encoding declaration if added above.
*/
$meta_http_equiv_element = $dom->getElementById( 'meta-http-equiv' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be made a tiny more performant by first checking version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ). The result of the first call could be put inside a variable to re-use here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, you can see there is another such check for the libxml version on line 99. So all of these checks could re-use this same var, e.g.:

$is_old_libxml = version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, wrapping this in a version check makes sense to me, as does introducing a variable. I'd be inclined to name the variable something more specific, like $is_pre_2_8_libxml, but if you have a specific preference for the name I'll go with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either is fine with me. Since we're only testing for one version of libxml currently, the a simple “old“ name should suffice.

if ( $meta_http_equiv_element ) {
$meta_http_equiv_element->parentNode->removeChild( $meta_http_equiv_element );
}

return $dom;
}

Expand Down