Avoid breaking code units offset on binary encoding #3168
So maybe the best option in that case is to treat the source as UTF-8 bytes, because BINARY is simply inaccurate and UTF-8 is by far the most likely actual encoding of the file? |
We considered that as an option, and I think we might take that approach in a follow-up here. The issue is that if the file is Shift_JIS encoded (or uses some other encoding), forcing it into UTF-8 is going to cause problems, even more so than replacing invalid/undef. For example:
    # encoding: binary
    "イ"
If you force that into UTF-8, it's going to be invalid for the encoding. Another approach that we are experimenting with is something to the effect of:
    def for(source)
      if source.ascii_only?
        ASCIISource.new(source)
      elsif source.encoding == Encoding::BINARY
        # Try UTF-8 first, then every other known encoding, and keep the
        # first one under which the bytes are valid.
        if (encoding = [Encoding::UTF_8, *Encoding.list].find { source.force_encoding(_1).valid_encoding? })
          new(source.force_encoding(encoding))
        else
          BrokenSource.new(source)
        end
      else
        new(source)
      end
    end
where |
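The Shift_JIS concern raised in this comment can be checked directly in Ruby. The following is only an illustrative sketch using core String and Encoding methods; the variable names are made up for the example, and it assumes a Ruby build that ships the Shift_JIS transcoder (standard builds do).

```ruby
# "イ" written with an escape so the example does not depend on the
# encoding of the file it is pasted into.
katakana = "\u30A4"

# Simulate the raw bytes of a Shift_JIS file that was read as binary.
sjis_bytes = katakana.encode(Encoding::Shift_JIS).b

# Reinterpret the same bytes under two candidate encodings.
as_utf8 = sjis_bytes.dup.force_encoding(Encoding::UTF_8)
as_sjis = sjis_bytes.dup.force_encoding(Encoding::Shift_JIS)

puts as_utf8.valid_encoding? # forcing UTF-8 leaves an invalid string
puts as_sjis.valid_encoding? # the real encoding validates
```

Probing candidate encodings with valid_encoding?, as the experimental snippet does, is one way to avoid committing to UTF-8 when the bytes do not actually form valid UTF-8.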
True, but does this ever happen? I wonder what editors do in such a case (since most would still display non-ASCII characters). Maybe they just assume the filesystem encoding which is UTF-8? (at least on Unix, not sure on Windows) I like your |
Co-authored-by: Kevin Newton <[email protected]>
3769d3d to 25a4cf6
I'll follow up in another PR |
When a file uses binary encoding and contains multibyte characters, trying to call encode with one of the accepted LSP encodings (UTF-8, UTF-16 or UTF-32) will fail because it's not a valid conversion. In situations like these, we want to avoid breaking, even if we can't provide the correct locations for nodes. In the included test, you can see that the locations all use the number of bytes rather than the number of code units.
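The failure mode and the fallback described above can be sketched roughly as follows. This is a hypothetical helper written for illustration, not the code from this PR: the name code_units_offset_or_bytes, its signature, and the choice to rescue EncodingError and return the byte offset are all assumptions.

```ruby
# Hypothetical helper: convert a byte offset into a code unit offset for the
# given target encoding, falling back to the byte offset when the source
# cannot be transcoded (for example, binary source with multibyte bytes).
def code_units_offset_or_bytes(source, byte_offset, encoding)
  # Size in bytes of one code unit for each supported target encoding.
  unit_sizes = {
    Encoding::UTF_8 => 1,
    Encoding::UTF_16LE => 2,
    Encoding::UTF_32LE => 4
  }
  unit_size = unit_sizes.fetch(encoding)

  # Transcode everything before the offset and count its code units.
  prefix = source.byteslice(0, byte_offset)
  prefix.encode(encoding).bytesize / unit_size
rescue EncodingError
  # Not a valid conversion: report bytes rather than raising.
  byte_offset
end

utf8_source = "a\u30A4b"      # "aイb": 1 + 3 + 1 bytes in UTF-8
binary_source = utf8_source.b # same bytes, tagged as binary

puts code_units_offset_or_bytes(utf8_source, 4, Encoding::UTF_16LE)
puts code_units_offset_or_bytes(binary_source, 4, Encoding::UTF_16LE)
```

With the UTF-8 source, the offset after "イ" is 2 UTF-16 code units; with the binary copy, the conversion raises and the helper returns the byte offset 4 unchanged, which mirrors the byte-based locations mentioned for the included test.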