Fix libxml issues with parsing character encoding #1071

douglyuckling · 2018-04-12T04:20:26Z

douglyuckling · 2018-04-12T04:55:33Z

By the way, the libxml bug doesn't clearly state what version it was fixed in, but the commit ID is in the comments there so I used git tag --contains with that commit ID to determine that the fix was first included in libxml 2.8.0.

westonruter

Would you also add a unit test to verify the fix?

westonruter · 2018-04-12T06:40:45Z

includes/utils/class-amp-dom-utils.php

+		/*
+		 * Remove pre-HTML5-style encoding declaration if added above.
+		 */
+		$meta_http_equiv_element = $dom->getElementById( 'meta-http-equiv' );


This could be made a tiny bit more performant by first checking version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ). The result of the first call could be put inside a variable to re-use here.

In fact, you can see there is another such check for the libxml version on line 99. So all of these checks could re-use this same var, e.g.:

$is_old_libxml = version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' );

Okay, wrapping this in a version check makes sense to me, as does introducing a variable. I'd be inclined to name the variable something more specific, like $is_pre_2_8_libxml, but if you have a specific preference for the name I'll go with that.

Either is fine with me. Since we're only testing for one version of libxml currently, a simple “old” name should suffice.
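The reuse discussed in this thread could look roughly like the sketch below. It is not the code from the PR: the helper name sketch_is_old_libxml and its optional $version parameter are hypothetical and exist only so the check can be exercised with arbitrary version strings. Only version_compare() and LIBXML_DOTTED_VERSION come from the discussion.

```php
<?php
// Hypothetical helper (not from the PR): evaluate the libxml version check in one place.
function sketch_is_old_libxml( $version = LIBXML_DOTTED_VERSION ) {
	// True for any libxml release before 2.8.
	return version_compare( $version, '2.8', '<' );
}

// Compute the result once and reuse it for every back-compat branch.
$is_old_libxml = sketch_is_old_libxml();

if ( $is_old_libxml ) {
	// Before parsing: apply the workaround (insert the http-equiv meta tag).
}

// ... parse the document here ...

if ( $is_old_libxml ) {
	// After parsing: undo the workaround (remove the inserted meta tag).
}
```

Storing the result in one variable keeps the version threshold in a single place, so a later change to the cutoff only touches one line.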

westonruter · 2018-04-12T06:41:54Z

includes/utils/class-amp-dom-utils.php

+		 */
+		if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) {
+			$document = preg_replace(
+				'#<meta[^>]+charset="([^"]+)"#i',


It looks like this may be missing the trailing > so the replacement would seem to result in doubled >> being added?

See my reply below.

westonruter · 2018-04-12T06:42:12Z

includes/utils/class-amp-dom-utils.php

+		if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) {
+			$document = preg_replace(
+				'#<meta[^>]+charset="([^"]+)"#i',
+				'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0',


Why the $0?

Because I'm not replacing the original meta charset tag, just inserting the new meta tag before it. (That's also why I don't include the > in the regex, per your previous comment.)

I see that at a glance it looks like the aim is to replace the whole original meta tag, so I should probably add a comment to clarify what's going on.

Thanks. I understand now. In that case you might want to use a lookahead match instead, like so:

$document = preg_replace( '#(?=<meta[^>]+charset="([^"]+)")#i', '<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">', $document );

Not really a big deal.

Oh, yeah, I should have thought of that. My only defense is that it was after midnight. :-P
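For readers comparing the two approaches in this thread, here is a small sketch showing that the $0 version and the lookahead version produce the same output on a simple input. The sample $html string and the variable names are made up for illustration; the two patterns and the replacement text are the ones quoted above.

```php
<?php
// Sample input (an assumption for illustration only).
$html = '<html><head><meta charset="utf-8"><title>Test</title></head><body></body></html>';
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">';

// Variant 1: the pattern consumes the start of the original tag,
// so the replacement re-emits it with $0 after the new tag.
$with_backreference = preg_replace( '#<meta[^>]+charset="([^"]+)"#i', $meta . '$0', $html );

// Variant 2: the lookahead is zero-width, so nothing is consumed and the
// replacement is simply inserted in front of the original tag.
$with_lookahead = preg_replace( '#(?=<meta[^>]+charset="([^"]+)")#i', $meta, $html );
```

The lookahead form makes it more obvious that the original tag is left untouched, which is why neither pattern needs the trailing >.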

westonruter · 2018-04-12T15:51:34Z

includes/utils/class-amp-dom-utils.php

+				'#<meta[^>]+charset="([^"]+)"#i',
+				'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0',
+				$document
+			);


Might want to add a limit of 1.
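A quick sketch of what the limit changes, assuming an input that happens to contain two meta charset tags (the sample string and variable names are hypothetical). The fourth argument of preg_replace() is the maximum number of replacements and the fifth receives the number actually made.

```php
<?php
// Hypothetical input with two charset declarations.
$html = '<head><meta charset="utf-8"></head><body><meta charset="utf-8"></body>';

$pattern     = '#(?=<meta[^>]+charset="([^"]+)")#i';
$replacement = '<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">';

// No limit (-1): a companion tag is inserted before every match.
$unlimited = preg_replace( $pattern, $replacement, $html, -1, $count_unlimited );

// Limit of 1: only the first match gets a companion tag.
$limited = preg_replace( $pattern, $replacement, $html, 1, $count_limited );
```

With the limit in place, the document never ends up with more than one element carrying the meta-http-equiv id, which keeps the later lookup by id unambiguous.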

westonruter · 2018-04-12T15:53:39Z

includes/utils/class-amp-dom-utils.php

+		 */
+		if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) {
+			$document = preg_replace(
+				'#<meta[^>]+charset="([^"]+)"#i',


What about the case where there is <meta charset=utf-8> with no quote marks? Sure, this is probably unlikely.

Mm, yeah. I'll include a test case for that and for single quotes as well. I should also account for possible whitespace around the =.
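One way the cases mentioned here (unquoted values, single quotes, whitespace around the =) might be covered is sketched below. This is an illustrative pattern, not the one that ended up in the PR. The helper name sketch_add_http_equiv_meta is hypothetical, and the pattern assumes charset is the first attribute of the meta tag, which is narrower than the original [^>]+ form.

```php
<?php
// Hypothetical helper (not the PR's code): insert an http-equiv meta tag before
// the first <meta charset> tag. Accepts double-quoted, single-quoted, or unquoted
// values and optional whitespace around "=".
// Assumption: charset is the first attribute of the tag.
function sketch_add_http_equiv_meta( $document ) {
	return preg_replace(
		'#(?=<meta\s+charset\s*=\s*["\']?([a-z0-9_.:-]+))#i',
		'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">',
		$document,
		1 // Make at most one replacement.
	);
}

// Example inputs covering the variations discussed above.
$samples = array(
	'<meta charset="utf-8">',
	"<meta charset='utf-8'>",
	'<meta charset=utf-8>',
	'<meta charset = "utf-8">',
);
$results = array_map( 'sketch_add_http_equiv_meta', $samples );
```

Because the capture group only accepts characters that can appear in an encoding label, the quotes never leak into the charset=$1 part of the inserted tag.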

douglyuckling · 2018-04-12T16:09:47Z

Regarding adding a unit test, that will require actually setting up a proper PHP development environment. :-P (So far I've gotten away with just FTPing my hacked PHP files up to a dev clone of the blog on our hosting provider.)

I'll see about setting up my dev environment tonight. Is there a guide for dev environment setup you can recommend? The instructions in the theme handbook seem like a good start, but they don't cover the composer and wp commands I see mentioned in contributing.md.

Sorry, in my day job I do JVM-based web development, but until now I haven't really touched PHP in well over a decade. Speaking of my day job, I should probably get back to that...

westonruter · 2018-04-12T16:11:18Z

No worries. I can add a unit test instead if you like.

douglyuckling · 2018-04-12T16:30:06Z

If you're willing, that would be much appreciated.

It occurs to me there should also maybe be a check after the call to loadHTML to make sure that $dom->encoding isn't empty. If it is, there should maybe be a warning logged somewhere that the encoding couldn't be detected and therefore special characters will likely be corrupted in the output. That could be a huge time saver for anyone trying to track down encoding issues in the future (e.g. a different libxml bug, or something the above regex doesn't account for).
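A minimal sketch of such a check, assuming PHP's dom extension is available. The helper name sketch_has_detected_encoding, the sample markup, and the log message are all hypothetical; the only parts taken from the discussion are the call to loadHTML and the $dom->encoding property. Using error_log() as the destination is likewise just one possible choice.

```php
<?php
// Hypothetical helper: report whether an encoding is recorded on the document.
function sketch_has_detected_encoding( DOMDocument $dom ) {
	return ! empty( $dom->encoding );
}

$dom = new DOMDocument();

// The @ suppresses libxml parse warnings for this illustration only.
$loaded = @$dom->loadHTML( '<html><head><meta charset="utf-8"></head><body><p>Hello</p></body></html>' );

if ( $loaded && ! sketch_has_detected_encoding( $dom ) ) {
	// Hypothetical message; where to log is a project decision.
	error_log( 'Document encoding could not be detected; special characters may be corrupted in the output.' );
}
```

The check is cheap, and it would surface both the known libxml bug and any input the charset pattern fails to match.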

westonruter · 2018-04-12T16:31:35Z

For unit testing, the easiest way to do that is to use the VVV environment. Clone the plugin into the wordpress-develop src install. Then when SSH'ed into the vagrant box, you can run phpunit. A bit of that is documented in:

https://github.com/Automattic/amp-wp/blob/develop/contributing.md#amp-contributing-guide
https://github.com/Automattic/amp-wp/blob/develop/contributing.md#phpunit-testing

It should be fleshed out a bit more.

westonruter · 2018-04-12T18:03:29Z

I'll make the changes.

westonruter · 2018-04-12T18:25:44Z

If it is, there should maybe be a warning logged somewhere that the encoding couldn't be detected and therefore special characters will likely be corrupted in the output. That could be a huge time saver for anyone trying to track down encoding issues in the future (e.g. a different libxml bug, or something the above regex doesn't account for).

It's a good concern, but I think it shouldn't ever happen in our case because we are explicitly inserting a meta charset if one is not originally present:

https://github.com/Automattic/amp-wp/blob/40d1945778c9917ed258e74595d504ad31719d39/includes/class-amp-theme-support.php#L1009-L1021

westonruter · 2018-04-12T18:31:02Z

@douglyuckling Would you please test 3c2d510 to make sure it works on your environment?

* Group libxml back-compat fixes.
* Improve meta charset search pattern.
* Ensure only one replacement is made.
* Only do removal of meta tag if originally added.
* Add unit test.

douglyuckling · 2018-04-13T02:06:50Z

Yep, works like a charm! Thanks!

douglyuckling added 2 commits April 12, 2018 00:13

Fix libxml issues with parsing characater encoding (ampproject#1067)

009dc39

Fix code style issues reported by the build

5dc6ef3

westonruter self-requested a review April 12, 2018 06:38

westonruter self-assigned this Apr 12, 2018

westonruter added this to the v0.7 milestone Apr 12, 2018

westonruter requested changes Apr 12, 2018

westonruter reviewed Apr 12, 2018

westonruter approved these changes Apr 12, 2018

Improve handling of old versions of libxml

3c2d510

* Group libxml back-compat fixes.
* Improve meta charset search pattern.
* Ensure only one replacement is made.
* Only do removal of meta tag if originally added.
* Add unit test.

westonruter force-pushed the bugfix/1067 branch from aad19e1 to 3c2d510 Compare April 12, 2018 18:54

westonruter changed the title ~~Fix libxml issues with parsing characater encoding~~ Fix libxml issues with parsing character encoding Apr 12, 2018

westonruter merged commit a8afc75 into ampproject:0.7 Apr 13, 2018

douglyuckling deleted the bugfix/1067 branch April 13, 2018 02:08
