Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix libxml issues with parsing character encoding #1071

Merged
merged 3 commits into from
Apr 13, 2018

Conversation

douglyuckling
Copy link
Contributor

@douglyuckling douglyuckling commented Apr 12, 2018

Fixes #1067.

@douglyuckling
Copy link
Contributor Author

By the way, the libxml bug doesn't clearly state what version it was fixed in, but the commit ID is in the comments there so I used git tag --contains with that commit ID to determine that the fix was first included in libxml 2.8.0.

@westonruter westonruter self-requested a review April 12, 2018 06:38
@westonruter westonruter self-assigned this Apr 12, 2018
@westonruter westonruter added this to the v0.7 milestone Apr 12, 2018
Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you also add a unit test to verify the fix?

/*
* Remove pre-HTML5-style encoding declaration if added above.
*/
$meta_http_equiv_element = $dom->getElementById( 'meta-http-equiv' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be made a tiny more performant by first checking version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ). The result of the first call could be put inside a variable to re-use here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, you can see there is another such check for the libxml version on line 99. So all of these checks could re-use this same var, e.g.:

$is_old_libxml = version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, wrapping this in a version check makes sense to me, as does introducing a variable. I'd be inclined to name the variable something more specific, like $is_pre_2_8_libxml, but if you have a specific preference for the name I'll go with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either is fine with me. Since we're only testing for one version of libxml currently, the a simple “old“ name should suffice.

*/
if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) {
$document = preg_replace(
'#<meta[^>]+charset="([^"]+)"#i',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this may be missing the trailing > so the replacement would seem to result in doubled >> being added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my reply below.

if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) {
$document = preg_replace(
'#<meta[^>]+charset="([^"]+)"#i',
'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the $0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm not replacing the original meta charset tag, just inserting the new meta tag before it. (That's also why I don't include the > in he regex, per your previous comment.)

I see that at a glance it looks like the aim is to replace the whole original meta tag, so I should probably add a comment to clarify what's going on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I understand now. In that case you might want to use a lookahead match instead, like so:

$document = preg_replace(
    '#(?=<meta[^>]+charset="([^"]+)")#i',
    '<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">',
    $document
);

Not really a big deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, I should have thought of that. My only defense is that it was after midnight. :-P

'#<meta[^>]+charset="([^"]+)"#i',
'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0',
$document
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add a limit of 1.

*/
if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) {
$document = preg_replace(
'#<meta[^>]+charset="([^"]+)"#i',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where there is <meta charset=utf-8> with no quote marks? Sure this is probably unlikely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm, yeah. I'll include a test case for that and for single quotes as well. I should also account for possible whitespace around the =.

@douglyuckling
Copy link
Contributor Author

Regarding adding a unit test, that will require actually setting up a proper PHP development environment. :-P (So far I've gotten away with just FTPing my hacked PHP files up to a dev clone of the blog on our hosting provider.)

I'll see about setting up my dev environment tonight. Is there a guide for dev environment setup you can recommend? The instructions in the theme handbook seem like a good start, but they don't cover the composer and wp commands I see mentioned in contributing.md.

Sorry, in my day job I do JVM-based web development, but until now I haven't really touched PHP in well over a decade. Speaking of my day job, I should probably get back to that...

@westonruter
Copy link
Member

No worries. I can add a unit test instead if you like.

@douglyuckling
Copy link
Contributor Author

If you're willing, that would be much appreciated.

It occurs to me there should also maybe be a check after the call to loadHTML to make sure that $dom->encoding isn't empty. If it is, there should maybe be a warning logged somewhere that the encoding couldn't be detected and therefore special characters will likely likely be corrupted in the output. That could be a huge time saver for anyone trying to track down encoding issues in the future (e.g. a different libxml bug, or something the above regex doesn't account for).

@westonruter
Copy link
Member

For unit testing, the easiest way to do that is to use the VVV environment. Clone the plugin into the wordpress-develop src install. Then when SSH'ed into the vagrant box, you can run phpunit. A bit of that is documented in:

https://github.com/Automattic/amp-wp/blob/develop/contributing.md#amp-contributing-guide
https://github.com/Automattic/amp-wp/blob/develop/contributing.md#phpunit-testing

It should be fleshed out a bit more.

@westonruter
Copy link
Member

I'll make the changes.

@westonruter
Copy link
Member

If it is, there should maybe be a warning logged somewhere that the encoding couldn't be detected and therefore special characters will likely likely be corrupted in the output. That could be a huge time saver for anyone trying to track down encoding issues in the future (e.g. a different libxml bug, or something the above regex doesn't account for).

It's a good concern but I think that it shouldn't ever happen in our case here because we explicitly are inserting a meta charset if one is not originally present:

https://github.com/Automattic/amp-wp/blob/40d1945778c9917ed258e74595d504ad31719d39/includes/class-amp-theme-support.php#L1009-L1021

@westonruter
Copy link
Member

westonruter commented Apr 12, 2018

@douglyuckling Would you please test 3c2d510 to make sure it works on your environment?

* Group libxml back-compat fixes.
* Improve meta charset search pattern.
* Ensure only one replacement is made.
* Only do removal of meta tag if originally added.
* Add unit test.
@westonruter westonruter changed the title Fix libxml issues with parsing characater encoding Fix libxml issues with parsing character encoding Apr 12, 2018
@douglyuckling
Copy link
Contributor Author

Yep, works like a charm! Thanks!

@westonruter westonruter merged commit a8afc75 into ampproject:0.7 Apr 13, 2018
@douglyuckling douglyuckling deleted the bugfix/1067 branch April 13, 2018 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants