-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix libxml issues with parsing character encoding #1071
Conversation
By the way, the libxml bug doesn't clearly state what version it was fixed in, but the commit ID is in the comments there so I used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you also add a unit test to verify the fix?
/* | ||
* Remove pre-HTML5-style encoding declaration if added above. | ||
*/ | ||
$meta_http_equiv_element = $dom->getElementById( 'meta-http-equiv' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be made a tiny more performant by first checking version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' )
. The result of the first call could be put inside a variable to re-use here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, you can see there is another such check for the libxml version on line 99. So all of these checks could re-use this same var, e.g.:
$is_old_libxml = version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, wrapping this in a version check makes sense to me, as does introducing a variable. I'd be inclined to name the variable something more specific, like $is_pre_2_8_libxml
, but if you have a specific preference for the name I'll go with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either is fine with me. Since we're only testing for one version of libxml currently, the a simple “old“ name should suffice.
*/ | ||
if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) { | ||
$document = preg_replace( | ||
'#<meta[^>]+charset="([^"]+)"#i', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this may be missing the trailing >
so the replacement would seem to result in doubled >>
being added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my reply below.
if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) { | ||
$document = preg_replace( | ||
'#<meta[^>]+charset="([^"]+)"#i', | ||
'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the $0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm not replacing the original meta charset
tag, just inserting the new meta
tag before it. (That's also why I don't include the >
in he regex, per your previous comment.)
I see that at a glance it looks like the aim is to replace the whole original meta tag, so I should probably add a comment to clarify what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I understand now. In that case you might want to use a lookahead match instead, like so:
$document = preg_replace(
'#(?=<meta[^>]+charset="([^"]+)")#i',
'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">',
$document
);
Not really a big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah, I should have thought of that. My only defense is that it was after midnight. :-P
'#<meta[^>]+charset="([^"]+)"#i', | ||
'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0', | ||
$document | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add a limit of 1
.
*/ | ||
if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) { | ||
$document = preg_replace( | ||
'#<meta[^>]+charset="([^"]+)"#i', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the case where there is <meta charset=utf-8>
with no quote marks? Sure this is probably unlikely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mm, yeah. I'll include a test case for that and for single quotes as well. I should also account for possible whitespace around the =
.
Regarding adding a unit test, that will require actually setting up a proper PHP development environment. :-P (So far I've gotten away with just FTPing my hacked PHP files up to a dev clone of the blog on our hosting provider.) I'll see about setting up my dev environment tonight. Is there a guide for dev environment setup you can recommend? The instructions in the theme handbook seem like a good start, but they don't cover the Sorry, in my day job I do JVM-based web development, but until now I haven't really touched PHP in well over a decade. Speaking of my day job, I should probably get back to that... |
No worries. I can add a unit test instead if you like. |
If you're willing, that would be much appreciated. It occurs to me there should also maybe be a check after the call to |
For unit testing, the easiest way to do that is to use the VVV environment. Clone the plugin into the https://github.com/Automattic/amp-wp/blob/develop/contributing.md#amp-contributing-guide It should be fleshed out a bit more. |
I'll make the changes. |
It's a good concern but I think that it shouldn't ever happen in our case here because we explicitly are inserting a |
@douglyuckling Would you please test 3c2d510 to make sure it works on your environment? |
* Group libxml back-compat fixes. * Improve meta charset search pattern. * Ensure only one replacement is made. * Only do removal of meta tag if originally added. * Add unit test.
aad19e1
to
3c2d510
Compare
Yep, works like a charm! Thanks! |
Fixes #1067.