
Conversation

@cholletk

Adding UTF-8 decoding prior to editing the recipe.

(and my IDE removed a lot of spaces on empty lines, sorry about that)

@codecov

codecov bot commented Dec 21, 2020

Codecov Report

Merging #456 (b166c75) into master (21de294) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

```diff
@@             Coverage Diff             @@
##             master    #456      +/-   ##
===========================================
- Coverage      0.91%   0.91%   -0.01%
  Complexity      414     414
===========================================
  Files            13      13
  Lines          1307    1308       +1
===========================================
  Hits             12      12
- Misses         1295    1296       +1
```
| Flag | Coverage Δ | Complexity Δ |
|------|-----------|--------------|
| integration | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| unittests | 0.91% <0.00%> (-0.01%) | 0.00 <0.00> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ | Complexity Δ |
|----------------|-----------|--------------|
| lib/Service/RecipeService.php | 0.00% <0.00%> (ø) | 237.00 <0.00> (ø) |

@christianlupus
Collaborator

OK, I just looked at your PR.

First, you committed quite a few whitespace changes all over the files. This is not good, as it will cause merge conflicts with other changes in the future.
In general, this is no big deal; I will rebase/rework these changes before merging into master.

More interesting is the fact that you unconditionally run the HTML input through utf8_decode. I see the following cases here:

  1. The page uses UTF8 and claims ISO-8859-1 or similar.
  2. The page uses UTF8 and claims so.
  3. The page uses ISO-8859-1 and claims so.
  4. Any other encoding like ISO-8859-6 is either used or claimed.

Your fix obviously works for case 1. However, it is only a workaround for a broken web page.
For cases 2 and 3 the manual is not 100% clear, but the user comments suggest that the string gets corrupted. I assume double-decoding a string will destroy it.
If any encoding other than ISO-8859-1 is in play, you would have to use iconv() to translate between arbitrary charsets.
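
A minimal sketch of what that could look like (a hypothetical helper, not the code from this PR):

```php
<?php
// Hypothetical helper: convert fetched HTML to UTF-8, trusting a declared
// charset first and falling back to a (best-effort) detection pass.
function toUtf8(string $html, ?string $declaredCharset = null): string
{
    $charset = $declaredCharset
        ?: mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true)
        ?: 'UTF-8';

    if (strcasecmp($charset, 'UTF-8') === 0) {
        // Already UTF-8; decoding it a second time would corrupt it (case 2).
        return $html;
    }

    // iconv() translates between arbitrary charsets; //IGNORE drops bytes
    // that have no representation in the target charset instead of failing.
    $converted = iconv($charset, 'UTF-8//IGNORE', $html);
    return $converted === false ? $html : $converted;
}
```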

@cholletk, am I missing a point here, or do you have more in-depth knowledge of your PR so you can explain what the idea was?

@christianlupus linked an issue Dec 22, 2020 that may be closed by this pull request
@cholletk
Author

I agree with you on the different use cases. I think the cleanest way to fix this would be to use the charset declared in the headers to perform the conversion with iconv.
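
For illustration, reading the declared charset from the Content-Type header could look like this (a hypothetical sketch):

```php
<?php
// Hypothetical sketch: pull the charset out of a Content-Type header value
// such as "text/html; charset=ISO-8859-1".
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._-]+)/i', $contentType, $matches)) {
        return strtoupper($matches[1]);
    }
    return null; // no charset declared; the caller has to fall back to detection
}
```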

One thing I don't understand right now is why the content has to be converted to ISO-8859-1 (and where this charset is defined, so we could do something adaptable). Do you have any idea about that?

@christianlupus
Collaborator

> One thing I don't understand right now is why the content has to be converted to ISO-8859-1 (and where this charset is defined, so we could do something adaptable). Do you have any idea about that?

I'd say ISO-8859-1 is not a requirement. Most probably, the DOM parser does the charset conversion according to the internal structure; if that were not true, I assume there would be many more complaints about wrong characters by now. In a small PHP test I parsed an HTML file with different charsets. It seems that ISO-8859-1 (or something similar to it) is the fallback if no valid charset is found within the first 1024 bytes. If the charset is found before that limit, the conversion is done by the DOMDocument decoder; if not, the fallback (whatever it is) takes effect, causing the weird letters to appear.

Internally, PHP normally uses single-byte characters. When dealing with Unicode strings, you need to use the mb_* functions and be very careful about what you are doing. I guess that at least part of the parser is written in PHP and thus uses single-byte chars by default.
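
A tiny example to make that pitfall concrete:

```php
<?php
// strlen() counts bytes, mb_strlen() counts characters:
$s = "héllo";                      // "é" takes two bytes in UTF-8
var_dump(strlen($s));              // int(6)
var_dump(mb_strlen($s, 'UTF-8'));  // int(5)
```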

We could use this trick to make the parsing aware of the correct charset...
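
A minimal sketch of that trick, assuming the input has already been converted to UTF-8 beforehand (e.g. with iconv):

```php
<?php
// Hypothetical sketch: force DOMDocument to treat its input as UTF-8 by
// prepending an XML declaration, so libxml's own charset detection (and its
// ISO-8859-1-ish fallback) never kicks in.
function parseHtml(string $utf8Html): DOMDocument
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages are rarely valid HTML
    $dom->loadHTML('<?xml encoding="UTF-8">' . $utf8Html);
    libxml_clear_errors();
    return $dom;
}
```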

@damda58

damda58 commented Sep 27, 2021

[screenshot]
Not fixed for me.
The recipe is at this link: https://www.hellofresh.fr/recipes/burger-de-boeuf-oignons-confits-lard-comte-60882ee1a3d000224e2703f2/

Would it not be possible to let the user choose whether or not to enable UTF-8 decoding before importing?

@christianlupus
Collaborator

> Would it not be possible to let the user choose whether or not to enable UTF-8 decoding before importing?

The thing is not UTF-8 vs. non-UTF-8. There are dozens of encodings out there, and we have to be able to handle them all.

@damda58

damda58 commented Sep 27, 2021

In the source code of HelloFresh pages, the charset can be found in the metadata. Would it not be possible to recover it from there?

```html
<!DOCTYPE html>
<html lang="fr">
<head>
    ...
    <meta data-react-helmet="true" name="charset" content="utf-8"/>
    ...
```

@christianlupus
Collaborator

christianlupus commented Sep 28, 2021

There is the meta tag that contains the information. Then there is a possibly present Content-Type header that might contain the info as well. I do not know which one takes precedence if both are present and contradict each other.

So, in general, you are right: it can be recovered, just like all browsers do it. However, we have to actually parse it, not just force a utf8_decode() on the result.
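
For example (a hypothetical sketch; I am assuming here that the header wins over the meta tag, as it typically does in browsers):

```php
<?php
// Hypothetical sketch: prefer the charset from the Content-Type header and
// only fall back to a declaration found in the document itself.
function resolveCharset(?string $headerCharset, string $html): string
{
    if ($headerCharset !== null && $headerCharset !== '') {
        return $headerCharset;
    }
    // Matches <meta charset="..."> and the http-equiv variant whose
    // content attribute contains "charset=...".
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8'; // sensible default on the modern web
}
```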

@christianlupus
Collaborator

christianlupus commented Jun 30, 2022

Superseded by #1057.



Development

Successfully merging this pull request may close these issues.

Encoding errors with recipes from hellofresh.ch
