Don't count start of non-ASCII characters as being inside of them #4276

lnicola · 2020-05-03T06:56:57Z

I'm still not sure that utf16_to_utf8_col is correct for code points from Supplementary Planes. These have two UTF-16 code units, and I feel we're not going to count them correctly.

Fixes the crash in #4263 (comment).

matklad · 2020-05-03T08:56:46Z

> These have two UTF-16 code units, and I feel we're not going to count them correctly.

I believe we should count them wrongly -- it's not really UTF-16, it's "the bogus encoding JavaScript uses for strings". I.e., it considers half of a surrogate pair to be a character.

bors r+

However I wonder... can we just generate the test?

I mean, can we write a loop that exhaustively checks ALL character boundaries in the input? https://doc.rust-lang.org/stable/std/primitive.char.html#method.encode_utf16 exists, so I imagine we can write a brute-force loop and an index-based loop?
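A minimal sketch of what such a brute-force reference could look like, assuming we only care about a single line. The helper name `boundary_columns` is made up for illustration and is not part of the PR; it relies only on `char::encode_utf16` and `str::char_indices` from the standard library.

```rust
/// Brute-force reference: for every char boundary in `line`, pair the
/// UTF-16 column (code units before the boundary) with the UTF-8 column
/// (bytes before the boundary). Hypothetical helper, not from the PR.
fn boundary_columns(line: &str) -> Vec<(u32, u32)> {
    let mut utf16_col = 0u32;
    let mut out = Vec::new();
    for (utf8_col, c) in line.char_indices() {
        out.push((utf16_col, utf8_col as u32));
        // `encode_utf16` writes one or two code units into the buffer
        // and returns the slice it filled.
        let mut buf = [0u16; 2];
        utf16_col += c.encode_utf16(&mut buf).len() as u32;
    }
    // The end of the line is a boundary too.
    out.push((utf16_col, line.len() as u32));
    out
}

fn main() {
    for (utf16, utf8) in boundary_columns("a𐐏b") {
        println!("{} => {}", utf16, utf8);
    }
}
```

A test could then loop over these pairs and compare each one against what the index-based conversion returns for the same UTF-16 column.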

bors · 2020-05-03T09:04:26Z

Build succeeded:

lnicola · 2020-05-03T09:07:31Z

I wanted to test something, but rustfmt changes my a𐐏b string to a𐐏'b (there's an extra '), so I'm not even sure what language this is.

lnicola · 2020-05-03T09:16:18Z

> I believe we should count them wrongly -- it's not really UTF-16, it's "the bogus encoding JavaScript uses for strings". I.e., it considers half of a surrogate pair to be a character.

No, look:

        // 0x0061 0xD801 0xDC37 0x0062
        // 0x61 0xF0 0x90 0x90 0xB7 0x62
        let col_index = LineIndex::new("a𐐏b");
        dbg!(&col_index);
        for i in 0..4 {
            eprintln!("{} => {}", i, u32::from(col_index.utf16_to_utf8_col(0, i)))
        }

JS:

"a𐐏b".length
4

So JS is counting code units, with a being code unit 0, 𐐏 being code units 1 and 2, and b being code unit 3. So I expect the code above to output:

0 => 0 // 0x0061 to 0x61
1 => 1 // 0xD801 0xDC37 to 0xF0 0x90 0x90 0xB7
2 => _ // probably won't happen
3 => 5 // 0x0062 to 0x62

But it prints

0 => 0
1 => 1
2 => 5
3 => 6

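That's because the - 1 in col += u32::from(c.len()) - 1 is probably wrong: it assumes every non-ASCII character takes one UTF-16 code unit, but a supplementary-plane character takes two.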
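To make the arithmetic concrete, here is a standalone sketch of the conversion that subtracts the character's UTF-16 length instead of a constant 1. It is not the actual LineIndex code: `WideChar`, `wide_chars` and the free function `utf16_to_utf8_col` below are hypothetical stand-ins, and deriving the UTF-16 length from the UTF-8 length is an assumption that holds because only 4-byte UTF-8 sequences encode supplementary-plane code points.

```rust
/// Hypothetical record of one non-ASCII char on a line, as UTF-8 byte offsets.
struct WideChar {
    start: u32,
    end: u32,
}

impl WideChar {
    fn len_utf8(&self) -> u32 {
        self.end - self.start
    }

    /// 4-byte UTF-8 sequences are supplementary-plane chars, which need a
    /// surrogate pair (2 code units) in UTF-16; everything else needs 1.
    fn len_utf16(&self) -> u32 {
        if self.len_utf8() == 4 { 2 } else { 1 }
    }
}

/// Collect the non-ASCII chars of a single line.
fn wide_chars(line: &str) -> Vec<WideChar> {
    line.char_indices()
        .filter(|(_, c)| !c.is_ascii())
        .map(|(i, c)| WideChar { start: i as u32, end: (i + c.len_utf8()) as u32 })
        .collect()
}

/// Convert a UTF-16 column to a UTF-8 column by widening `col` once for
/// every non-ASCII char that starts strictly before it.
fn utf16_to_utf8_col(chars: &[WideChar], mut col: u32) -> u32 {
    for c in chars {
        if col > c.start {
            col += c.len_utf8() - c.len_utf16();
        } else {
            break;
        }
    }
    col
}

fn main() {
    let chars = wide_chars("a𐐏b");
    for i in 0..4 {
        println!("{} => {}", i, utf16_to_utf8_col(&chars, i));
    }
}
```

With this adjustment, column 3 (the b) maps to byte 5 as expected. Column 2 points into the middle of the surrogate pair, so whatever it maps to is not a meaningful position.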

matklad · 2020-05-03T09:28:05Z

I've verified that a𐐏b also has length 4 in the protocol (to double-check that it indeed just uses JS's .length).
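The same count can be reproduced on the Rust side. This is only a small check, with `utf16_len` being a made-up helper name; it uses `str::encode_utf16` to count code units the way JS's .length does.

```rust
/// Number of UTF-16 code units in `s`, i.e. what JS reports as `.length`.
/// Hypothetical helper for illustration.
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    let s = "a𐐏b";
    println!("UTF-16 code units: {}", utf16_len(s));
    println!("UTF-8 bytes:       {}", s.len());
    println!("chars:             {}", s.chars().count());
}
```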

4325: Fix column conversion for supplementary plane characters r=matklad a=lnicola Fixes #4276 (comment). Co-authored-by: Laurențiu Nicola <[email protected]>

Don't count start of non-ASCII characters as being inside of them

16d3bb9

lnicola mentioned this pull request May 3, 2020

Incremental text sync panics #4263

Closed

bors bot merged commit 682c079 into rust-lang:master May 3, 2020

lnicola deleted the メ branch May 3, 2020 09:07

lnicola mentioned this pull request May 5, 2020

Fix column conversion for supplementary plane characters #4325

Merged

bors bot added a commit that referenced this pull request May 5, 2020

Merge #4325

a4778dd

