iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain #346

petertseng · 2016-09-05T01:04:06Z

As discussed in #309: Since #184 we have been using DetermineEncoding to
deal with the case of UTF-16 files.
That was a reasonable fix for exercism/exercism#2303.

DetermineEncoding only looks at the first 1024 bytes of a file. If it
can't determine an encoding, it defaults to windows-1252.

This causes undesirable behaviour for files that contain non-ASCII
characters but have only ASCII in their first 1024 bytes: they get
interpreted as windows-1252, mangling the non-ASCII characters.
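
To make the mangling concrete, here is a minimal sketch rather than the code from this change. `decodeLatin1` is a hypothetical helper that maps each byte to one code point. That agrees with windows-1252 for bytes 0xA0-0xFF, which covers the sample here, but it is not a full windows-1252 decoder.

```go
package main

import "fmt"

// decodeLatin1 interprets each byte as a single code point. For bytes in
// the range 0xA0-0xFF this matches windows-1252, which is all this sample
// needs. Hypothetical helper for illustration only.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// In UTF-8, "é" is the two bytes 0xC3 0xA9.
	utf8Bytes := []byte("café")
	// Read byte-by-byte, those become two separate characters.
	fmt.Println(decodeLatin1(utf8Bytes)) // cafÃ©
}
```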

This commit takes advantage of the fact that DetermineEncoding reports
whether it is certain about its encoding guess. If it is uncertain, we
default to UTF-8 instead of windows-1252.
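
The fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached diff: `guessEncoding` is a hypothetical, much-simplified stand-in for DetermineEncoding that only looks at BOMs, and `chooseEncoding` is an invented name for the decision itself.

```go
package main

import (
	"bytes"
	"fmt"
)

// guessEncoding is a hypothetical stand-in for DetermineEncoding. It only
// inspects byte order marks and otherwise returns an uncertain
// windows-1252 guess; the real function does more than this.
func guessEncoding(content []byte) (name string, certain bool) {
	switch {
	case bytes.HasPrefix(content, []byte{0xFE, 0xFF}):
		return "utf-16be", true
	case bytes.HasPrefix(content, []byte{0xFF, 0xFE}):
		return "utf-16le", true
	case bytes.HasPrefix(content, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8", true
	}
	return "windows-1252", false
}

// chooseEncoding keeps a certain guess and otherwise falls back to UTF-8.
func chooseEncoding(content []byte) string {
	name, certain := guessEncoding(content)
	if !certain {
		return "utf-8"
	}
	return name
}

func main() {
	fmt.Println(chooseEncoding([]byte{0xFF, 0xFE, 'h', 0x00})) // utf-16le
	fmt.Println(chooseEncoding([]byte("plain ASCII")))          // utf-8
}
```

The point of the sketch is that the BOM path is untouched: only the uncertain branch changes its answer.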

Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it
is certain. Therefore, behaviour for UTF-16 files is preserved (existing
tests would have caught it if behaviour were accidentally altered).

A new fixture file is attached that tests this case - the test fails
without the attached code change.

I find it unlikely that DetermineEncoding would have returned anything
other than UTF-16, UTF-8, or windows-1252 since it was made to examine
HTML documents and thus examine the content-type (we always pass
text/plain) and the meta tags (unlikely to be present in a non-HTML
Exercism submission).

The risk of this change is that anyone who actually wanted to submit
in windows-1252 will now be unable to, but I doubt that anyone is in
this constituency, and discussion in #309 seems to be in favor of
nudging them toward UTF-8 anyway.

Closes #309

petertseng · 2016-09-05T01:07:31Z

api/iteration_test.go

 	}

+	var buf bytes.Buffer
+	buf.WriteString("# ")
+	for i := 0; i < 1024; i++ {


nah, let's use strings.Repeat instead.

An upcoming commit will soon add a test file for which the suffix is important, in addition to the prefix. The middle of the file will contain long text that is not desirable to test. Thus, testing suffixes is helpful. Note that all current tests have no suffixes, so all cases should pass.

Tonkpils · 2016-09-19T01:43:23Z

These changes seem OK to me, and the test makes me comfortable enough to merge this, though I'd feel more confident if @lcowell or @kytrinyx double-checked it.

lcowell · 2016-09-20T15:14:07Z

Good explanation of the changes, and the test coverage looks good too. Code looks good. The tests pass on my OS X (10.12) machine. Thanks @petertseng for this PR.

kytrinyx · 2016-09-21T20:57:46Z

I'd say let's go for it.

kytrinyx · 2016-09-26T19:26:19Z

Going for it :)

iteration_test: show correct expectation in test fail message

a5046fb

Code was expecting five elements, but comment claimed that three elements was the correct number.

petertseng reviewed Sep 5, 2016
petertseng added 2 commits September 4, 2016 23:14

kytrinyx merged commit 7fd3c92 into exercism:master Sep 26, 2016

petertseng deleted the encoding branch September 26, 2016 19:41

petertseng mentioned this pull request Feb 20, 2017

UTF-8 interpreted as ISO-8859-1 on submissions page exercism/exercism#2523

Closed

kytrinyx mentioned this pull request Aug 6, 2017

Nextercism & Encoding exercism/discussions#180

Closed
