Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid breaking code units offset on binary encoding #3168

Merged
merged 1 commit into from
Oct 9, 2024

Conversation

vinistock
Copy link
Contributor

When a file is using binary encoding and contains multibyte characters, trying to call encode with one of the accepted LSP encodings (UTF-8, UTF-16 or UTF-32) will fail because it's not a valid conversion.

In situations like these, we want to avoid breaking, even if we can't provide the correct locations for nodes. In the included test, you can see that the locations are all using the number of bytes, rather than the code units.

@eregon
Copy link
Member

eregon commented Oct 8, 2024

# encoding: binary files with non-US-ASCII characters are kind of a broken concept/nasty edge case.
In both the example here and in Shopify/ruby-lsp#2674 the actual encoding of the file is UTF-8, otherwise we wouldn't be able to show the ∂φ/∂x or smileys.

So maybe the best in that case is to consider the source as UTF-8 bytes, because BINARY is simply inaccurate and UTF-8 is by far the most likely actual encoding of the file?
I think it would then return the correct offsets for those two examples, yes?

@kddnewton
Copy link
Collaborator

So maybe the best in that case is to consider the source as UTF-8 bytes, because BINARY is simply inaccurate and UTF-8 is by far the most likely actual encoding of the file? I think it would then return the correct offsets for those two examples, yes?

We considered that as an option. I think that we might take that approach in a follow-up here. The issue is, if the file is Shift_JIS encoded (or some other encoding) then forcing it into UTF-8 is going to result in problems, even more so than replacing invalid/undef. For example:

# encoding: binary
"イ"

If you force that into UTF-8, it's going to invalid for the encoding. Another approach that we are experimenting with is something to the effect of:

def for(source)
  if source.ascii_only?
    ASCIISource.new(source)
  elsif source.binary?
    if (encoding = [Encoding::UTF_8, *Encoding.list].find { source.force_encoding(_1).valid_encoding? })
      new(source.force_encoding(encoding)
    else
      BrokenSource.new(source)
    end
  else
    new(source)
  end
end

where BrokenSource would include the invalid/undef options and presumably warn the user that something is amiss. I think we can do that in a follow-up though.

@eregon
Copy link
Member

eregon commented Oct 8, 2024

The issue is, if the file is Shift_JIS encoded (or some other encoding) then forcing it into UTF-8 is going to result in problems, even more so than replacing invalid/undef.

True, but does this ever happen?
At least so far all # encoding: binary files with non-US-ASCII characters I have seen are actually UTF-8.
And the reason for that # encoding: binary BTW in at least some of these ruby/spec files is to workaround a bug in the parser gem (used by RuboCop, used in ruby/spec CI), which couldn't handle things like "\xAA\xBB" with AA and BB some broken UTF-8:
https://github.com/whitequark/parser?tab=readme-ov-file#invalid-characters-inside-comments-and-literals

I wonder what editors do in such a case (since most would still display non-ASCII characters). Maybe they just assume the filesystem encoding which is UTF-8? (at least on Unix, not sure on Windows)
I think Ideally the LSP in this case would ask the editor which encoding it guessed and that would be passed to Prism, but I suspect there is no easy way to find that out.

I like your def for(source) approach, although I think we should go straight to BrokenSource if it's not valid UTF-8.

@vinistock vinistock force-pushed the vs-avoid-breaking-on-binary-encoding branch from 3769d3d to 25a4cf6 Compare October 8, 2024 19:29
@kddnewton
Copy link
Collaborator

I'll follow up in another PR

@kddnewton kddnewton merged commit 6f77baa into ruby:main Oct 9, 2024
54 checks passed
@kddnewton kddnewton deleted the vs-avoid-breaking-on-binary-encoding branch October 9, 2024 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants