Incorrect word boundary detection #743

Lucretiel · 2021-02-05T01:54:45Z

What version of regex are you using?

1.4.3

Describe the bug at a high level.

When scanning the string " abc를" with the pattern r"\babc\b", it returns 0 matches. It appears that the word boundary operator is failing to detect a boundary between "abc" and "를"; testing this string with https://util.unicode.org/UnicodeJsps/breaks.jsp indicates that a word boundary should be present there.

What are the steps to reproduce the behavior?

See above

What is the actual behavior?

The regular expression returns 0 matches

What is the expected behavior?

The regular expression returns a single match, "abc"

Additional information

This may be a duplicate of #579

Lucretiel · 2021-02-07T00:09:23Z

Never mind; regex's behavior is correct in this case (see unicode-rs/unicode-segmentation#90)

BurntSushi · 2021-02-07T17:04:16Z

Right, this is indeed correct. The regex crate implements word boundaries in accordance with UTS#18 RL1.4. That is, a word boundary occurs when one side is \W and the other is \w. 를 matches \w.
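To make the rule above concrete, here is a minimal sketch of "one side is \w and the other is \W" using only the Rust standard library. It is not the regex crate's implementation: the helper names `is_word_char` and `word_boundaries` are hypothetical, and `char::is_alphanumeric` plus `_` is only an approximation of the RL1.4 definition of \w (it does not cover Join_Control, marks, or other connector punctuation).

```rust
/// Rough stand-in for `\w` (an assumption for illustration, not RL1.4 exactly):
/// alphabetic or numeric characters, plus the underscore.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns the byte offsets in `text` where a word character meets a
/// non-word character. The start and end of the text count as non-word.
fn word_boundaries(text: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut prev_is_word = false; // nothing precedes the start of the text

    for (offset, c) in text.char_indices() {
        let cur_is_word = is_word_char(c);
        if cur_is_word != prev_is_word {
            offsets.push(offset);
        }
        prev_is_word = cur_is_word;
    }

    // A trailing word character is followed by the end of the text.
    if prev_is_word {
        offsets.push(text.len());
    }
    offsets
}
```

Under this approximation, `word_boundaries(" abc를")` finds a boundary before "a" and another at the end of the string, but none between "c" and "를", because both are treated as word characters. That is why `\babc\b` has nothing to match there.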

Digging deeper, \w is defined as the union (according to RL1.4) of the following properties:

\p{Alphabetic}
\p{Join_Control}
\p{gc:Mark}
\p{gc:Decimal_Number}
\p{gc:Connector_Punctuation}

This program shows that 를 is in the Alphabetic property. Alphabetic is itself generated from several other properties.

This program shows that 를 is in the Lo or Other_Letter general category (which is one of the constituents of the Alphabetic property). According to UnicodeData.txt in the UCD 13.0.0 archive, 를 is a Hangul syllable codepoint:

AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;

And each one of those Hangul codepoints is assigned to the Other_Letter general category.
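The range check implied by those two UnicodeData.txt entries can be sketched with the standard library alone. The constant and function names below are hypothetical, and the sketch assumes that 를 is the precomposed syllable rather than a decomposed jamo sequence. It relies on `char::is_alphabetic`, which the standard library documents as following the Unicode Alphabetic property.

```rust
// First and last codepoints of the Hangul syllable range, taken from the
// two UnicodeData.txt entries quoted above.
const HANGUL_SYLLABLE_FIRST: u32 = 0xAC00;
const HANGUL_SYLLABLE_LAST: u32 = 0xD7A3;

/// Returns true if `c` falls inside the Hangul syllable range.
fn is_hangul_syllable(c: char) -> bool {
    (HANGUL_SYLLABLE_FIRST..=HANGUL_SYLLABLE_LAST).contains(&(c as u32))
}
```

With this helper, `is_hangul_syllable('를')` and `'를'.is_alphabetic()` should both hold, which is consistent with the Lo classification described above.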

Thus, regex satisfies the UTS#18 spec here correctly as far as I can tell.

More generally though, Unicode does permit custom tailoring for particular locales. I don't know enough about Hangul syllables to say what is appropriate here, but custom tailoring is definitely beyond the scope of the regex crate.

Lucretiel closed this as completed Feb 7, 2021

Lucretiel mentioned this issue Feb 8, 2021

Bug in Word Segmentation demo unicode-org/unicodetools#39

Open
