Incorrect word boundary detection #743
Comments
Never mind;
Right, this is indeed correct. The regex crate implements word boundaries in accordance with UTS#18 RL1.4. That is, a word boundary occurs when one side is Digging deeper,
This program shows that This program shows that
And each one of those Hangul codepoints is assigned to the Thus, More generally though, Unicode does permit custom tailoring for particular locales. I don't know enough about Hangul syllables to say what is appropriate here, but custom tailoring is definitely beyond the scope of the
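For anyone landing here later, the simple word boundary rule from UTS#18 RL1.4 can be sketched with the standard library alone. This is only an illustration, not the regex crate's implementation: `is_word_char` and `is_word_boundary` are hypothetical helpers, and `char::is_alphanumeric` plus `_` is an assumed stand-in for `\w`, whose real definition covers more Unicode properties. The `unicode` flag is likewise an assumption, added to contrast Unicode-aware and ASCII-only behavior.

```rust
/// Rough stand-in for `\w` (assumption: the real class is defined by
/// Unicode properties; this approximation is enough for the example).
fn is_word_char(c: char, unicode: bool) -> bool {
    if unicode {
        c.is_alphanumeric() || c == '_'
    } else {
        c.is_ascii_alphanumeric() || c == '_'
    }
}

/// Returns true if byte offset `at` is a simple word boundary in `s`:
/// exactly one of the two neighbouring characters is a word character.
/// The start and end of the input count as non-word.
/// `at` must lie on a char boundary, otherwise the slicing panics.
fn is_word_boundary(s: &str, at: usize, unicode: bool) -> bool {
    let before = s[..at]
        .chars()
        .next_back()
        .map_or(false, |c| is_word_char(c, unicode));
    let after = s[at..]
        .chars()
        .next()
        .map_or(false, |c| is_word_char(c, unicode));
    before != after
}

fn main() {
    let haystack = " abc를";
    // Byte offset just past "abc", i.e. between 'c' and '를'.
    let end_of_abc = haystack.find("abc").unwrap() + "abc".len();

    // Unicode-aware: both neighbours are word characters, so no boundary.
    println!("unicode: {}", is_word_boundary(haystack, end_of_abc, true));
    // ASCII-only: '를' is not a word character, so there is a boundary.
    println!("ascii:   {}", is_word_boundary(haystack, end_of_abc, false));
}
```

Under the Unicode-aware rule there is no boundary between "abc" and "를", which is consistent with the pattern from the report finding no match. If ASCII-only behavior is what you actually want, the regex crate lets you write the boundary as `(?-u:\b)`.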
What version of regex are you using?
1.4.3
Describe the bug at a high level.
When scanning the string " abc를" with the pattern r"\babc\b", it returns 0 matches. It appears that the word boundary operator is failing to detect a boundary between "abc" and "를"; testing this string with https://util.unicode.org/UnicodeJsps/breaks.jsp indicates that a word boundary should be present there.
What are the steps to reproduce the behavior?
See above
What is the actual behavior?
The regular expression returns 0 matches
What is the expected behavior?
The regular expression returns a single match, "abc"
Additional information
This may be a duplicate of #579