-
-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain #346
Conversation
Code was expecting five elements, but comment claimed that three elements was the correct number.
} | ||
|
||
var buf bytes.Buffer | ||
buf.WriteString("# ") | ||
for i := 0; i < 1024; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, let's use strings.Repeat
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
An upcoming commit will soon add a test file for which the suffix is important, in addition to the prefix. The middle of the file will contain long text that is not desirable to test. Thus, testing suffixes is helpful. Note that all current tests have no suffixes, so all cases should pass.
As discussed in #309: Since #184 we have been using DetermineEncoding to deal with the case of UTF-16 files. That was a reasonable fix for exercism/exercism#2303. DetermineEncoding only looks at the first 1024 bytes of a file. If it can't determine an encoding, it defaults to windows-1252. This causes undesirable behaviour for files with Unicode characters but also only ASCII in their first 1024 characters - they get interpreted as windows-1252, mangling the Unicode characters. This commit takes advantage of the fact that DetermineEncoding reports whether it is *certain* about its encoding guess. If it is uncertain, we default to UTF-8 instead of windows-1252. Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it is certain. Therefore, behaviour for UTF-16 files is preserved (existing tests would have caught it if behaviour were accidentally altered). A new fixture file is attached that tests this case - the test fails without the attached code change. I find it unlikely that DetermineEncoding would have returned anything other than UTF-16, UTF-8, or windows-1252 since it was made to examine HTML documents and thus examine the content-type (we always pass text/plain) and the meta tags (unlikely to be present in a non-HTML Exercism submission). The risk of this change is that anyone who **actually** wanted to submit in windows-1252 will now be unable to, but I doubt that anyone is in this constituency, and discussion in #309 seems to be in favor of nudging them toward UTF-8 anyway. Closes #309
Good explanation of the changes and the test coverage looks good too. Code looks good. The tests pass on my OSX(10.12) machine. Thanks @petertseng for this PR. |
I'd say let's go for it. |
Going for it :) |
As discussed in #309: Since #184 we have been using DetermineEncoding to
deal with the case of UTF-16 files.
That was a reasonable fix for exercism/exercism#2303.
DetermineEncoding only looks at the first 1024 bytes of a file. If it
can't determine an encoding, it defaults to windows-1252.
This causes undesirable behaviour for files with Unicode characters but
also only ASCII in their first 1024 characters - they get interpreted as
windows-1252, mangling the Unicode characters.
This commit takes advantage of the fact that DetermineEncoding reports
whether it is certain about its encoding guess. If it is uncertain, we
default to UTF-8 instead of windows-1252.
Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it
is certain. Therefore, behaviour for UTF-16 files is preserved (existing
tests would have caught it if behaviour were accidentally altered).
A new fixture file is attached that tests this case - the test fails
without the attached code change.
I find it unlikely that DetermineEncoding would have returned anything
other than UTF-16, UTF-8, or windows-1252 since it was made to examine
HTML documents and thus examine the content-type (we always pass
text/plain) and the meta tags (unlikely to be present in a non-HTML
Exercism submission).
The risk of this change is that anyone who actually wanted to submit
in windows-1252 will now be unable to, but I doubt that anyone is in
this constituency, and discussion in #309 seems to be in favor of
nudging them toward UTF-8 anyway.
Closes #309