Fix encoding issue when source is encoded in utf8 (issue #284) #456
Codecov Report
@@            Coverage Diff             @@
##           master    #456     +/-   ##
===========================================
- Coverage     0.91%    0.91%   -0.01%
  Complexity     414      414
===========================================
  Files           13       13
  Lines         1307     1308      +1
===========================================
  Hits            12       12
- Misses        1295     1296      +1
Signed-off-by: kchollet <[email protected]>
|
OK, I just looked at your PR. First, you committed quite a lot of whitespace changes all over the files. This is not good, as it will cause merge conflicts with other changes in the future. More interesting is the fact that you unconditionally parse the HTML input through
Your fix obviously works for case 1. However, this is only a workaround for a broken web page. @cholletk, am I missing a point of your PR here? If you have more in-depth knowledge, could you explain what the idea was?
|
I agree with you on the different use cases. I think the cleanest way to fix this would be to use the charset declared in the headers to perform the conversion with iconv. One thing I don't understand right now is why the content has to be converted to ISO-8859-1 (and how this charset is defined, so that we could do something adaptable). Do you have any idea about that?
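To make the header-based idea above concrete, here is a minimal sketch in Python rather than the project's PHP, so it illustrates the approach and is not the actual fix. The function names `charset_from_content_type` and `decode_body` are hypothetical, and the ISO-8859-1 fallback is an assumption taken from this discussion.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type header, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() parses the header parameters and lowercases the result.
    return msg.get_content_charset()


def decode_body(raw, content_type, fallback="iso-8859-1"):
    """Decode raw bytes with the declared charset, falling back if it is unusable."""
    charset = charset_from_content_type(content_type) or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or bytes that do not match the declaration.
        return raw.decode(fallback, errors="replace")
```

With this sketch, a UTF-8 body served with `text/html; charset=UTF-8` is decoded as UTF-8, while a response without a declared charset is treated as ISO-8859-1.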
I'd say there is no requirement to go for ISO-8859-1. Most probably, the DOM parser does the charset conversion according to its internal structure. If that were not true, I assume there would be many more complaints about wrong characters by now. In a small test using PHP, I parsed an HTML file with different charsets. It seems that ISO-8859-1 (or something similar to it) is the fallback if no valid charset is found within the first 1024 bytes. If the charset is found by then, the conversion is done by the Internally, PHP normally uses single-byte characters. If dealing with Unicode strings, you need to use the We could use this trick to make the parsing aware of the correct charset...
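The behaviour described above (look for a charset declaration near the start of the document, otherwise fall back) can be sketched as follows. This is a Python illustration under stated assumptions, not the PHP parser's real algorithm: the regular expression is a simplification, the name `sniff_meta_charset` is hypothetical, and the 1024-byte limit is taken from the observation in this comment.

```python
import re

# Simplified pattern: a <meta ...> tag containing charset=NAME, with or without quotes.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=1024):
    """Return the charset declared in a <meta> tag within the first `limit` bytes, or None."""
    match = META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

A caller would use the returned name when one is found and apply a fallback such as ISO-8859-1 otherwise. A declaration that appears after the first 1024 bytes is deliberately ignored here, mirroring the behaviour observed in the test.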
|
Would it not be possible to let the user choose whether or not to enable UTF8 decoding before importing? |
The issue is not UTF-8 versus non-UTF-8. There are dozens of encodings out there, and we have to be able to handle them all.
|
In the source code of HelloFresh pages, we find the charset in the metadata. Is it not possible to recover it here?
|
There is the So, in general, you are right. It can be recovered, just as all browsers do. However, we have to parse it and not only force a
|
Superseded by #1057.

Add UTF-8 decoding prior to editing the recipe.
(My IDE also removed a lot of whitespace on empty lines, sorry about that.)