
Conversation

@cholletk

Adding UTF-8 decoding prior to editing the recipe.

(and my IDE removed a lot of spaces on empty lines, sorry about that)

@codecov

codecov bot commented Dec 21, 2020

Codecov Report

Merging #456 (b166c75) into master (21de294) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

```diff
@@             Coverage Diff             @@
##             master    #456      +/-   ##
===========================================
- Coverage      0.91%   0.91%   -0.01%
  Complexity      414     414
===========================================
  Files            13      13
  Lines          1307    1308       +1
===========================================
  Hits             12      12
- Misses         1295    1296       +1
```
| Flag | Coverage Δ | Complexity Δ |
|------|-----------|--------------|
| integration | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| unittests | 0.91% <0.00%> (-0.01%) | 0.00 <0.00> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ | Complexity Δ |
|----------------|-----------|--------------|
| lib/Service/RecipeService.php | 0.00% <0.00%> (ø) | 237.00 <0.00> (ø) |

@christianlupus
Collaborator

OK, I just looked at your PR.

First, you committed quite a few whitespace changes all over the files. This is not good, as it will cause merge conflicts with other changes in the future.
In general, this is no big deal; I will rebase/rework these changes before merging into master.

More interesting is the fact that you unconditionally run the HTML input through utf8_decode. I see the following cases here:

  1. The page uses UTF8 and claims ISO-8859-1 or similar.
  2. The page uses UTF8 and claims so.
  3. The page uses ISO-8859-1 and claims so.
  4. Any other encoding like ISO-8859-6 is either used or claimed.

Your fix obviously works for case 1. However, it is only a workaround for a broken web page.
For cases 2 and 3 the manual is not 100% clear, but the user comments suggest that the string gets corrupted. I assume double-decoding a string will destroy it.
If any encoding other than ISO-8859-1 is in play, you would have to use iconv() to translate between arbitrary charsets.
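
A minimal sketch of what that could look like (a hypothetical helper, not the code from this PR):

```php
<?php
// Hypothetical helper: convert fetched HTML to UTF-8, trusting a declared
// charset first and falling back to a (best-effort) detection pass.
function toUtf8(string $html, ?string $declaredCharset = null): string
{
    $charset = $declaredCharset
        ?: mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true)
        ?: 'UTF-8';

    if (strcasecmp($charset, 'UTF-8') === 0) {
        // Already UTF-8; decoding it a second time would corrupt it (case 2).
        return $html;
    }

    // iconv() translates between arbitrary charsets; //IGNORE drops bytes
    // that have no representation in the target charset instead of failing.
    $converted = iconv($charset, 'UTF-8//IGNORE', $html);
    return $converted === false ? $html : $converted;
}
```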

@cholletk, am I missing a point here, or do you have more in-depth knowledge of your PR so you can explain what the idea was?

@christianlupus linked an issue Dec 22, 2020 that may be closed by this pull request
@cholletk
Author

I agree with you on the different use cases. I think the cleanest way to fix this would be to use the charset declared in the headers to perform the conversion with iconv.
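
For illustration, reading the declared charset from the Content-Type header could look like this (a hypothetical sketch):

```php
<?php
// Hypothetical sketch: pull the charset out of a Content-Type header value
// such as "text/html; charset=ISO-8859-1".
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9._-]+)/i', $contentType, $matches)) {
        return strtoupper($matches[1]);
    }
    return null; // no charset declared; the caller has to fall back to detection
}
```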

One thing I don't understand right now is why the content has to be converted to ISO-8859-1 (and where this charset is defined, so we could do something adaptable). Do you have any idea about that?

@christianlupus
Collaborator

> One thing I don't understand right now is why the content has to be converted to ISO-8859-1 (and where this charset is defined, so we could do something adaptable). Do you have any idea about that?

I'd say ISO-8859-1 is not a requirement. Most probably, the DOM parser does the charset conversion according to the internal structure; if that were not true, I assume there would be many more complaints about wrong characters by now. In a small PHP test I parsed an HTML file with different charsets. It seems that ISO-8859-1 (or something similar to it) is the fallback if no valid charset is found within the first 1024 bytes. If the charset is found before that limit, the conversion is done by the DOMDocument decoder; if not, the fallback (whatever it is) takes effect, causing the weird letters to appear.

Internally, PHP normally uses single-byte characters. When dealing with Unicode strings, you need to use the mb_* functions and be very careful about what you are doing. I guess that at least part of the parser is written in PHP and thus uses single-byte chars by default.
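
A tiny example to make that pitfall concrete:

```php
<?php
// strlen() counts bytes, mb_strlen() counts characters:
$s = "héllo";                      // "é" takes two bytes in UTF-8
var_dump(strlen($s));              // int(6)
var_dump(mb_strlen($s, 'UTF-8'));  // int(5)
```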

We could use this trick to make the parsing aware of the correct charset...
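
A minimal sketch of that trick, assuming the input has already been converted to UTF-8 beforehand (e.g. with iconv):

```php
<?php
// Hypothetical sketch: force DOMDocument to treat its input as UTF-8 by
// prepending an XML declaration, so libxml's own charset detection (and its
// ISO-8859-1-ish fallback) never kicks in.
function parseHtml(string $utf8Html): DOMDocument
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages are rarely valid HTML
    $dom->loadHTML('<?xml encoding="UTF-8">' . $utf8Html);
    libxml_clear_errors();
    return $dom;
}
```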

@damda58

damda58 commented Sep 27, 2021

[screenshot]
Not fixed for me.
The recipe is at this link: https://www.hellofresh.fr/recipes/burger-de-boeuf-oignons-confits-lard-comte-60882ee1a3d000224e2703f2/

Would it not be possible to let the user choose whether or not to enable UTF-8 decoding before importing?

@christianlupus
Collaborator

> Would it not be possible to let the user choose whether or not to enable UTF-8 decoding before importing?

The thing is not UTF-8 vs. non-UTF-8. There are dozens of encodings out there, and we have to be able to handle them all.

@damda58

damda58 commented Sep 27, 2021

In the source code of HelloFresh pages, the charset can be found in the metadata. Would it not be possible to recover it from there?

```html
<!DOCTYPE html>
<html lang="fr">
<head>
    ...
    <meta data-react-helmet="true" name="charset" content="utf-8"/>
    ...
```

@christianlupus
Collaborator

christianlupus commented Sep 28, 2021

There is the meta tag that contains the information. Then there is a possibly present Content-Type header that might contain the info as well. I do not know which one takes precedence if both are present and contradict each other.

So, in general, you are right: it can be recovered, just like all browsers do it. However, we have to actually parse it, not just force a utf8_decode() on the result.
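
For example (a hypothetical sketch; I am assuming here that the header wins over the meta tag, as it typically does in browsers):

```php
<?php
// Hypothetical sketch: prefer the charset from the Content-Type header and
// only fall back to a declaration found in the document itself.
function resolveCharset(?string $headerCharset, string $html): string
{
    if ($headerCharset !== null && $headerCharset !== '') {
        return $headerCharset;
    }
    // Matches <meta charset="..."> and the http-equiv variant whose
    // content attribute contains "charset=...".
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8'; // sensible default on the modern web
}
```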

@christianlupus
Collaborator

christianlupus commented Jun 30, 2022

Superseded by #1057.



Development

Successfully merging this pull request may close these issues.

Encoding errors with recipes from hellofresh.ch
