Avoid breaking code units offset on binary encoding #3168

vinistock · 2024-10-08T14:49:32Z

When a file is using binary encoding and contains multibyte characters, trying to call encode with one of the accepted LSP encodings (UTF-8, UTF-16 or UTF-32) will fail because it's not a valid conversion.

In situations like these, we want to avoid breaking, even if we can't provide the correct locations for nodes. In the included test, you can see that the locations are all using the number of bytes, rather than the code units.

eregon · 2024-10-08T15:29:31Z

# encoding: binary files with non-US-ASCII characters are kind of a broken concept/nasty edge case.
In both the example here and in Shopify/ruby-lsp#2674 the actual encoding of the file is UTF-8, otherwise we wouldn't be able to show the ∂φ/∂x or smileys.

So maybe the best in that case is to consider the source as UTF-8 bytes, because BINARY is simply inaccurate and UTF-8 is by far the most likely actual encoding of the file?
I think it would then return the correct offsets for those two examples, yes?

test/prism/ruby/location_test.rb

kddnewton · 2024-10-08T16:16:21Z

So maybe the best in that case is to consider the source as UTF-8 bytes, because BINARY is simply inaccurate and UTF-8 is by far the most likely actual encoding of the file? I think it would then return the correct offsets for those two examples, yes?

We considered that as an option. I think that we might take that approach in a follow-up here. The issue is, if the file is Shift_JIS encoded (or some other encoding) then forcing it into UTF-8 is going to result in problems, even more so than replacing invalid/undef. For example:

# encoding: binary
"イ"

If you force that into UTF-8, it's going to invalid for the encoding. Another approach that we are experimenting with is something to the effect of:

def for(source)
  if source.ascii_only?
    ASCIISource.new(source)
  elsif source.binary?
    if (encoding = [Encoding::UTF_8, *Encoding.list].find { source.force_encoding(_1).valid_encoding? })
      new(source.force_encoding(encoding)
    else
      BrokenSource.new(source)
    end
  else
    new(source)
  end
end

where BrokenSource would include the invalid/undef options and presumably warn the user that something is amiss. I think we can do that in a follow-up though.

eregon · 2024-10-08T16:46:44Z

The issue is, if the file is Shift_JIS encoded (or some other encoding) then forcing it into UTF-8 is going to result in problems, even more so than replacing invalid/undef.

True, but does this ever happen?
At least so far all # encoding: binary files with non-US-ASCII characters I have seen are actually UTF-8.
And the reason for that # encoding: binary BTW in at least some of these ruby/spec files is to workaround a bug in the parser gem (used by RuboCop, used in ruby/spec CI), which couldn't handle things like "\xAA\xBB" with AA and BB some broken UTF-8:
https://github.com/whitequark/parser?tab=readme-ov-file#invalid-characters-inside-comments-and-literals

I wonder what editors do in such a case (since most would still display non-ASCII characters). Maybe they just assume the filesystem encoding which is UTF-8? (at least on Unix, not sure on Windows)
I think Ideally the LSP in this case would ask the editor which encoding it guessed and that would be passed to Prism, but I suspect there is no easy way to find that out.

I like your def for(source) approach, although I think we should go straight to BrokenSource if it's not valid UTF-8.

Co-authored-by: Kevin Newton <[email protected]>

kddnewton · 2024-10-09T14:07:03Z

I'll follow up in another PR

vinistock mentioned this pull request Oct 8, 2024

Error during indexing: "\xE2" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to UTF-16LE Shopify/ruby-lsp#2674

Closed

eregon reviewed Oct 8, 2024

View reviewed changes

test/prism/ruby/location_test.rb Outdated Show resolved Hide resolved

Avoid breaking code units offset on binary encoding

25a4cf6

Co-authored-by: Kevin Newton <[email protected]>

vinistock force-pushed the vs-avoid-breaking-on-binary-encoding branch from 3769d3d to 25a4cf6 Compare October 8, 2024 19:29

kddnewton merged commit 6f77baa into ruby:main Oct 9, 2024
54 checks passed

kddnewton deleted the vs-avoid-breaking-on-binary-encoding branch October 9, 2024 14:07

herwinw mentioned this pull request Oct 15, 2024

Upgrade to Prism 1.2.0 natalie-lang/natalie#2266

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid breaking code units offset on binary encoding #3168

Avoid breaking code units offset on binary encoding #3168

vinistock commented Oct 8, 2024

eregon commented Oct 8, 2024

kddnewton commented Oct 8, 2024

eregon commented Oct 8, 2024

kddnewton commented Oct 9, 2024

Avoid breaking code units offset on binary encoding #3168

Avoid breaking code units offset on binary encoding #3168

Conversation

vinistock commented Oct 8, 2024

eregon commented Oct 8, 2024

kddnewton commented Oct 8, 2024

eregon commented Oct 8, 2024

kddnewton commented Oct 9, 2024