Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid lossy html entities encoding by setting charset #24645

Merged
merged 8 commits into from
Aug 19, 2020

Conversation

lsl
Copy link
Contributor

@lsl lsl commented Aug 19, 2020

Description

This change specifies the content type and charset of the html passed into DomDocument as utf-8.

Replaces the mb_convert_encoding call which encodes UTF-8 as HTML-ENTITIES before handing off to DomDocument.

This change avoids the need to later revert the encoding back to UTF-8 afterwards using mb_convert_encoding. This secondary mb_convert_encoding call was converting not only the UTF-8 characters that were converted earlier but also any pre-existing entity encoded html stored inside block content.

This issue was originally raised here: Automattic/wp-calypso#44897 as I wasn't sure of the root cause at the time, originally thinking it may be because of the way Jetpack is injecting html into the data-image-description attributes.

There are more situations where this can be a problem such as encoded html entities existing inside block content then being decoded breaking html validation.

How has this been tested?

  • The npm run test-php unittests cover this somewhat, will add more soon.
  • https://3v4l.org/EaZv9 contains illustrates a comparison of the different approaches available

Screenshots

Before

After

Types of changes

  • Bug fix

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • My code has proper inline documentation.
  • I've included developer documentation if appropriate.
  • I've updated all React Native files affected by any refactorings/renamings in this PR.

(re)Fixes #24445 (Mostly maintaining the same behavior introduced in #24447)

Sets the charset of the html passed into DomDocument to utf-8.

Replaces the mb_convert_encoding call replacing utf-8 with html entities
before handing off to DomDocument.

This avoids the need to later convert back to utf-8 from html entities
afterward. This secondary mb_convert_encoding call was converting not
only the utf-8 we converted earlier but also entity encoding html stored
inside data-* or other attributes of html elements.

Fixes Automattic/wp-calypso#44897
Maintains the fix for WordPress#24445 (WordPress#24447)
@ockham
Copy link
Contributor

ockham commented Aug 19, 2020

Looking good so far. We should just add some unit test coverage for the cases this fixes 👍

@mcsf @youknowriad Can we get this into 8.8? The fix we got into 8.7.1 broke cases where server-side rendered blocks inject HTML (e.g. Automattic/wp-calypso#44897) 😕

cc/ @sirreal @WunderBart @fullofcaffeine

@lsl
Copy link
Contributor Author

lsl commented Aug 19, 2020

We should just add some unit test coverage for the cases this fixes +1

Will do - got sidetracked today. Just added some screenshots that might help explain things a bit better to anyone stopping by.

@ockham ockham marked this pull request as ready for review August 19, 2020 16:46
@ockham
Copy link
Contributor

ockham commented Aug 19, 2020

Added unit tests (based on https://3v4l.org/EaZv9) so we can get this merged in time for the 8.8 release.

Copy link
Contributor

@ockham ockham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. I've touched this, but the unit tests give us a pretty good guarantee that this covers a very broad class of content that was previously causing encoding issues.

@ockham ockham merged commit ff67c4a into WordPress:master Aug 19, 2020
@github-actions
Copy link

Congratulations on your first merged pull request, @lsl! We'd like to credit you for your contribution in the post announcing the next WordPress release, but we can't find a WordPress.org profile associated with your GitHub account. When you have a moment, visit the following URL and click "link your GitHub account" under "GitHub Username" to link your accounts:

https://profiles.wordpress.org/me/profile/edit/

And if you don't have a WordPress.org account, you can create one on this page:

https://login.wordpress.org/register

Kudos!

@github-actions github-actions bot added the First-time Contributor Pull request opened by a first-time contributor to Gutenberg repository label Aug 19, 2020
@github-actions github-actions bot added this to the Gutenberg 8.9 milestone Aug 19, 2020
jorgefilipecosta pushed a commit that referenced this pull request Aug 19, 2020
This change specifies the content type and charset of the html passed into `DomDocument` as `utf-8`.

Replaces the `mb_convert_encoding` call which encodes `UTF-8` as `HTML-ENTITIES` before handing off to `DomDocument`.

This change avoids the need to later revert the encoding back to `UTF-8` afterwards using `mb_convert_encoding`. This secondary `mb_convert_encoding` call was converting not only the `UTF-8` characters that were converted earlier but also any pre-existing entity encoded html stored inside block content.

This issue was originally raised here: Automattic/wp-calypso#44897 as I wasn't sure of the root cause at the time, originally thinking it may be because of the way [Jetpack is injecting](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L434) html into the [`data-image-description` attributes](https://github.com/Automattic/jetpack/blob/dcfa5ca8bdfc31aacec107aec27bb24357d6cdac/modules/carousel/jetpack-carousel.php#L485). 

There are more situations where this can be a problem such as encoded html entities existing inside block content then being decoded breaking html validation.

Co-authored-by: Bernie Reiter <[email protected]>
@lsl lsl deleted the fix/charset branch August 20, 2020 02:40
@sirreal
Copy link
Member

sirreal commented Aug 20, 2020

Thanks for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
First-time Contributor Pull request opened by a first-time contributor to Gutenberg repository
Projects
None yet
5 participants