Incorrect word boundary detection #743

Closed
Lucretiel opened this issue Feb 5, 2021 · 2 comments

@Lucretiel
Contributor

What version of regex are you using?

1.4.3

Describe the bug at a high level.

When scanning the string " abc를" with the pattern r"\babc\b", it returns 0 matches. It appears that the word boundary operator is failing to detect a boundary between "abc" and "를"; testing this string with https://util.unicode.org/UnicodeJsps/breaks.jsp indicates that a word boundary should be present there.

What are the steps to reproduce the behavior?

See above

What is the actual behavior?

The regular expression returns 0 matches

What is the expected behavior?

The regular expression returns a single match, "abc"

Additional information

This may be a duplicate of #579

@Lucretiel
Contributor Author

Never mind; regex's behavior is correct in this case (see unicode-rs/unicode-segmentation#90)

@BurntSushi
Member

Right, this is indeed correct. The regex crate implements word boundaries in accordance with UTS#18 RL1.4. That is, a word boundary occurs when one side is \W and the other is \w. Here, 를 matches \w, and so does the preceding c, so there is no boundary between them.
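The boundary rule described above can be sketched with the standard library alone. This is only an illustration, not the regex crate's implementation: the names `is_word_char` and `word_boundaries` are hypothetical, and `char::is_alphanumeric` is used as a rough stand-in for \w (it is not the exact RL1.4 definition, though it agrees with it for the characters in this issue).

```rust
/// Rough approximation of `\w`. ASSUMPTION: this is not the exact
/// UTS#18 RL1.4 set; it only covers alphabetic and numeric characters
/// plus the underscore.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns the byte offsets where one side is a word character and the
/// other side is not. The start and end of the text count as non-word.
fn word_boundaries(text: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut prev_is_word = false; // nothing precedes the start of the text
    for (i, c) in text.char_indices() {
        let cur_is_word = is_word_char(c);
        if cur_is_word != prev_is_word {
            offsets.push(i);
        }
        prev_is_word = cur_is_word;
    }
    if prev_is_word {
        offsets.push(text.len()); // boundary at the end of the text
    }
    offsets
}

fn main() {
    // 'c' ends at byte 4, where '를' starts. Both are word characters,
    // so no boundary is reported at offset 4.
    println!("{:?}", word_boundaries(" abc를")); // prints [1, 7]
    // With a trailing space instead, offset 4 is a boundary.
    println!("{:?}", word_boundaries(" abc ")); // prints [1, 4]
}
```

Under this rule, the pattern from the report cannot match: its trailing \b would have to hold at offset 4, which is not a boundary.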

Digging deeper, \w is defined as the union (according to RL1.4) of the following properties:

  • \p{Alphabetic}
  • \p{Join_Control}
  • \p{gc:Mark}
  • \p{gc:Decimal_Number}
  • \p{gc:Connector_Punctuation}

This program shows that 를 is in the Alphabetic property. Alphabetic is itself generated from several other properties.

This program shows that 를 is in the Lo or Other_Letter general category (which is one of the constituents of the Alphabetic property). According to UnicodeData.txt in the UCD 13.0.0 archive, 를 is a Hangul syllable codepoint:

AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;

And each one of those Hangul codepoints is assigned to the Other_Letter general category.
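As a quick sanity check of the range quoted above, here is a minimal sketch using only the standard library. The helper name `is_hangul_syllable` is hypothetical, and the range bounds are taken directly from the two UnicodeData.txt lines. `char::is_alphabetic` is the standard library's check for the Alphabetic property.

```rust
/// True if `c` lies in the Hangul Syllables range AC00..D7A3 listed in
/// UnicodeData.txt. The function name is illustrative only.
fn is_hangul_syllable(c: char) -> bool {
    matches!(c, '\u{AC00}'..='\u{D7A3}')
}

fn main() {
    let c = '를';
    println!("codepoint: U+{:04X}", c as u32); // U+B97C
    println!("hangul syllable: {}", is_hangul_syllable(c));
    println!("alphabetic: {}", c.is_alphabetic());
}
```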

Thus, regex satisfies the UTS#18 spec here correctly as far as I can tell.

More generally though, Unicode does permit custom tailoring for particular locales. I don't know enough about Hangul syllables to say what is appropriate here, but custom tailoring is definitely beyond the scope of the regex crate.
