-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix libxml issues with parsing character encoding #1071
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -108,6 +108,18 @@ function( $matches ) { | |
); | ||
} | ||
|
||
/* | ||
* Add a pre-HTML5-style declaration of the encoding since libxml<2.8 doesn't recognize | ||
* HTML5's meta charset. See <https://bugzilla.gnome.org/show_bug.cgi?id=655218>. | ||
*/ | ||
if ( version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ) ) { | ||
$document = preg_replace( | ||
'#<meta[^>]+charset="([^"]+)"#i', | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about the case where there is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mm, yeah. I'll include a test case for that and for single quotes as well. I should also account for possible whitespace around the |
||
'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">$0', | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because I'm not replacing the original I see that at a glance it looks like the aim is to replace the whole original meta tag, so I should probably add a comment to clarify what's going on. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks. I understand now. In that case you might want to use a lookahead match instead, like so: $document = preg_replace(
'#(?=<meta[^>]+charset="([^"]+)")#i',
'<meta http-equiv="Content-Type" content="text/html; charset=$1" id="meta-http-equiv">',
$document
); Not really a big deal. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, yeah, I should have thought of that. My only defense is that it was after midnight. :-P |
||
$document | ||
); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Might want to add a limit of |
||
} | ||
|
||
/* | ||
* Wrap in dummy tags, since XML needs one parent node. | ||
* It also makes it easier to loop through nodes. | ||
|
@@ -123,6 +135,14 @@ function( $matches ) { | |
return false; | ||
} | ||
|
||
/* | ||
* Remove pre-HTML5-style encoding declaration if added above. | ||
*/ | ||
$meta_http_equiv_element = $dom->getElementById( 'meta-http-equiv' ); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This could be made a tiny more performant by first checking There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In fact, you can see there is another such check for the libxml version on line 99. So all of these checks could re-use this same var, e.g.: $is_old_libxml = version_compare( LIBXML_DOTTED_VERSION, '2.8', '<' ); There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, wrapping this in a version check makes sense to me, as does introducing a variable. I'd be inclined to name the variable something more specific, like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Either is fine with me. Since we're only testing for one version of libxml currently, the a simple “old“ name should suffice. |
||
if ( $meta_http_equiv_element ) { | ||
$meta_http_equiv_element->parentNode->removeChild( $meta_http_equiv_element ); | ||
} | ||
|
||
return $dom; | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this may be missing the trailing
>
so the replacement would seem to result in doubled>>
being added?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my reply below.