Skip to content

Conversation

@lnicola
Copy link
Member

@lnicola lnicola commented May 3, 2020

I'm still not sure that utf16_to_utf8_col is correct for code points from Supplementary Planes. These have two UTF-16 code units, and I feel we're not going to count them correctly.

Fixes the crash in #4263 (comment).

@matklad
Copy link
Contributor

matklad commented May 3, 2020

These have two UTF-16 code units, and I feel we're not going to count them correctly.

I believe we should count them wrongly -- it's not really UTF16, it's "the bogus encoding JavaScript uses for strings". Ie, it considers half a surrogate a character.

bors r+

However I wonder... can we just generate the test?

I mean, can we write a loop that exhaustively checks ALL character boundaries in the input? https://doc.rust-lang.org/stable/std/primitive.char.html#method.encode_utf16 exists, so I imagine we can write a brute-force loop and an index-based loop?

@bors
Copy link
Contributor

bors bot commented May 3, 2020

@bors bors bot merged commit 682c079 into rust-lang:master May 3, 2020
@lnicola
Copy link
Member Author

lnicola commented May 3, 2020

I wanted to test something, but rustfmt changes my a𐐏b string to a𐐏'b (there's an extra '), so I'm not even sure what language this is.

@lnicola lnicola deleted the branch May 3, 2020 09:07
@lnicola
Copy link
Member Author

lnicola commented May 3, 2020

I believe we should count them wrongly -- it's not really UTF16, it's "the bogus encoding JavaScript uses for strings". Ie, it considers half a surrogate a character.

No, look:

        // 0x0061 0xD801 0xDC37 0x0062
        // 0x61 0xF0 0x90 0x90 0xB7 0x62
        let col_index = LineIndex::new("a𐐏b");
        dbg!(&col_index);
        for i in 0..4 {
            eprintln!("{} => {}", i, u32::from(col_index.utf16_to_utf8_col(0, i)))
        }

JS:

"a𐐏b".length
4

So JS is counting code units, with a being code unit 0, 𐐏 code units 1, 2 and b code unit 3. So I expect the code above to output

0 => 0 // 0x0061 to 0x61
1 => 1 // 0xD801 0xDC37 to 0xF0 0x90 0x90 0xB7
2 => _ // probably won't happen
3 => 5 0x0062 to 0x62

But it prints

0 => 0
1 => 1
2 => 5
3 => 6

Because the - 1 in col += u32::from(c.len()) - 1 is probably wrong.

@matklad
Copy link
Contributor

matklad commented May 3, 2020

I've verified that a𐐏b also has length 4 in the protoocl (to double check that it indeed just uses js.lenght)

bors bot added a commit that referenced this pull request May 5, 2020
4325: Fix column conversion for supplementary plane characters r=matklad a=lnicola

Fixes #4276 (comment).

Co-authored-by: Laurențiu Nicola <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants