Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain #346

Merged
merged 3 commits into from
Sep 26, 2016
Merged

iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain #346

merged 3 commits into from
Sep 26, 2016

Conversation

petertseng
Copy link
Member

@petertseng petertseng commented Sep 5, 2016

As discussed in #309: Since #184 we have been using DetermineEncoding to
deal with the case of UTF-16 files.
That was a reasonable fix for exercism/exercism#2303.

DetermineEncoding only looks at the first 1024 bytes of a file. If it
can't determine an encoding, it defaults to windows-1252.

This causes undesirable behaviour for files with Unicode characters but
also only ASCII in their first 1024 characters - they get interpreted as
windows-1252, mangling the Unicode characters.

This commit takes advantage of the fact that DetermineEncoding reports
whether it is certain about its encoding guess. If it is uncertain, we
default to UTF-8 instead of windows-1252.

Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it
is certain. Therefore, behaviour for UTF-16 files is preserved (existing
tests would have caught it if behaviour were accidentally altered).

A new fixture file is attached that tests this case - the test fails
without the attached code change.

I find it unlikely that DetermineEncoding would have returned anything
other than UTF-16, UTF-8, or windows-1252 since it was made to examine
HTML documents and thus examine the content-type (we always pass
text/plain) and the meta tags (unlikely to be present in a non-HTML
Exercism submission).

The risk of this change is that anyone who actually wanted to submit
in windows-1252 will now be unable to, but I doubt that anyone is in
this constituency, and discussion in #309 seems to be in favor of
nudging them toward UTF-8 anyway.

Closes #309

Code was expecting five elements, but comment claimed that three
elements was the correct number.
}

var buf bytes.Buffer
buf.WriteString("# ")
for i := 0; i < 1024; i++ {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, let's use strings.Repeat instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

An upcoming commit will soon add a test file for which the suffix is
important, in addition to the prefix. The middle of the file will
contain long text that is not desirable to test. Thus, testing suffixes
is helpful.

Note that all current tests have no suffixes, so all cases should pass.
As discussed in #309: Since #184 we have been using DetermineEncoding to
deal with the case of UTF-16 files.
That was a reasonable fix for exercism/exercism#2303.

DetermineEncoding only looks at the first 1024 bytes of a file. If it
can't determine an encoding, it defaults to windows-1252.

This causes undesirable behaviour for files with Unicode characters but
also only ASCII in their first 1024 characters - they get interpreted as
windows-1252, mangling the Unicode characters.

This commit takes advantage of the fact that DetermineEncoding reports
whether it is *certain* about its encoding guess. If it is uncertain, we
default to UTF-8 instead of windows-1252.

Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it
is certain. Therefore, behaviour for UTF-16 files is preserved (existing
tests would have caught it if behaviour were accidentally altered).

A new fixture file is attached that tests this case - the test fails
without the attached code change.

I find it unlikely that DetermineEncoding would have returned anything
other than UTF-16, UTF-8, or windows-1252 since it was made to examine
HTML documents and thus examine the content-type (we always pass
text/plain) and the meta tags (unlikely to be present in a non-HTML
Exercism submission).

The risk of this change is that anyone who **actually** wanted to submit
in windows-1252 will now be unable to, but I doubt that anyone is in
this constituency, and discussion in #309 seems to be in favor of
nudging them toward UTF-8 anyway.

Closes #309
@Tonkpils
Copy link
Contributor

These changes seem ok to me though I'd be more comfortable if @lcowell or @kytrinyx double checked on these. The test makes me comfortable enough to merge this though I'd feel more confident with a second pair of eyes on this.

@lcowell
Copy link
Contributor

lcowell commented Sep 20, 2016

Good explanation of the changes and the test coverage looks good too. Code looks good. The tests pass on my OSX(10.12) machine. Thanks @petertseng for this PR.

@kytrinyx
Copy link
Member

I'd say let's go for it.

@kytrinyx
Copy link
Member

Going for it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Files that have Unicode chars only after the 1024th byte get their Unicode mangled
4 participants