Commit

Avoid breaking code units offset on binary encoding
Co-authored-by: Kevin Newton <[email protected]>
vinistock and kddnewton committed Oct 8, 2024
1 parent 1653317 commit 3769d3d
Showing 2 changed files with 20 additions and 1 deletion.
2 changes: 1 addition & 1 deletion lib/prism/parse_result.rb
@@ -90,7 +90,7 @@ def character_column(byte_offset)
# concept of code units that differs from the number of characters in other
# encodings, it is not captured here.
def code_units_offset(byte_offset, encoding)
- byteslice = (source.byteslice(0, byte_offset) or raise).encode(encoding)
+ byteslice = (source.byteslice(0, byte_offset) or raise).encode(encoding, invalid: :replace, undef: :replace)

if encoding == Encoding::UTF_16LE || encoding == Encoding::UTF_16BE
byteslice.bytesize / 2
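
To illustrate why the replacement options matter, here is a minimal standalone sketch rather than Prism's own code. It assumes a source tagged as binary (ASCII-8BIT) that contains the UTF-8 bytes of an emoji. The helper name `code_units` is hypothetical, and its `else` branch is an assumption because the hunk above is truncated before that point.

```ruby
# Bytes of a 4-byte UTF-8 emoji, tagged as binary the way a source with
# `# -*- encoding: binary -*-` would be.
source = "\u{1F600} + 1".b

# Hypothetical helper mirroring the hunk above. The `else` branch is an
# assumption, since the diff does not show it.
def code_units(slice, encoding, **options)
  encoded = slice.encode(encoding, **options)

  if encoding == Encoding::UTF_16LE || encoding == Encoding::UTF_16BE
    encoded.bytesize / 2
  else
    encoded.length
  end
end

# The first four bytes are the emoji itself.
slice = source.byteslice(0, 4)

# Without replacement options, bytes above 0x7F have no mapping out of
# binary, so the conversion raises.
strict_error =
  begin
    code_units(slice, Encoding::UTF_16LE)
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# With the options added in this commit, each unmappable byte becomes one
# replacement character instead of raising.
lenient_units = code_units(slice, Encoding::UTF_16LE, invalid: :replace, undef: :replace)
```

In this sketch `strict_error` ends up holding the exception class and `lenient_units` is a plain integer, which is the behavior change the one-line diff is after: the offset calculation no longer raises on binary sources.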
19 changes: 19 additions & 0 deletions test/prism/ruby/location_test.rb
@@ -140,6 +140,25 @@ def test_code_units
assert_equal 7, location.end_code_units_column(Encoding::UTF_32LE)
end

def test_code_units_handles_binary_encoding_with_multibyte_characters
# If the encoding is set to binary and the source contains multibyte
# characters, we avoid breaking the code unit offsets, but they will
# still be incorrect.

program = Prism.parse(<<~RUBY).value
# -*- encoding: binary -*-
😀 + 😀\n😍 ||= 😍
RUBY

# first 😀
location = program.statements.body.first.receiver.location

assert_equal 4, location.end_code_units_column(Encoding::UTF_8)
assert_equal 4, location.end_code_units_column(Encoding::UTF_16LE)
assert_equal 4, location.end_code_units_column(Encoding::UTF_32LE)
end

def test_chop
location = Prism.parse("foo").value.location
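
The comment in `test_code_units_handles_binary_encoding_with_multibyte_characters` says the offsets "will still be incorrect". A small sketch of that discrepancy, using plain Ruby strings rather than Prism (the lambda name `utf16_units` is hypothetical):

```ruby
emoji = "\u{1F600}" # one character, four bytes in UTF-8
binary = emoji.b    # same bytes, tagged as binary

# Count UTF-16 code units: each unit occupies two bytes.
utf16_units = ->(string, **options) do
  string.encode(Encoding::UTF_16LE, **options).bytesize / 2
end

# Interpreted as UTF-8, the emoji encodes as a surrogate pair,
# so it takes two UTF-16 code units.
correct_units = utf16_units.call(emoji)

# Interpreted as binary, each of the four bytes is replaced on its own,
# so the count follows the byte length instead.
binary_units = utf16_units.call(binary, invalid: :replace, undef: :replace)
```

Here `correct_units` and `binary_units` differ, which is why the test asserts the same column for every encoding: under a binary magic comment the result tracks bytes, not the code units an editor would expect.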

