CPython 3.11.7 breaks `regex` module compatible pattern width calculations
#1376
Comments
I made a PR with a fix, although I would suggest not relying on automatically generated priorities. Just adding an explicit priority would also fix this for your grammar.
Wow, thanks for the quick turnaround! We'll look into setting explicit priorities too, thanks for the suggestion 😄
@MichaelSquires I'm currently on vacation, so no promises, but I'll try to do it next week.
Thanks for letting me know. Enjoy your vacation!
Hi @erezsh, I hope you enjoyed the holidays! Wondering if you think we'll see a new release with this bugfix soon?
@MichaelSquires Thanks for the reminder. I'll try to do it this Monday.
Released version 1.1.9. Sorry for the delay!
Excellent news! Thank you for your support!
Describe the bug
This CPython change was recently released in CPython 3.11.7: python/cpython#109859. You'll see in `Lib/re/_parser.py` that `MAXWIDTH` (`1 << 64`) is now the maximum width of a regex pattern, not `MAXREPEAT` (`2**32 - 1`). Lark uses `MAXREPEAT` (utils.py#L154) as the assumed maximum width when it can't use the `re` module to calculate the width. This was all fine until CPython 3.11.7 was released earlier today and it started breaking our custom grammar.

Our specific issue is that terminal ordering changed, because the lengths of the regex patterns are used to determine priority. Since 3.11.7, regex patterns that are only compatible with the `regex` module have a much shorter length than `re` module compatible patterns.

This issue (or one very similar to it) was discussed just a few weeks ago on GitHub: mrabarnett/mrab-regex#513.
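For anyone who wants to see the mismatch without installing Lark, here is a rough standard-library sketch of the behaviour described above. It is not Lark's actual code: `approx_width` is a hypothetical helper, the `(1, MAXREPEAT)` fallback is an assumption that mirrors the description in this report, and both patterns are illustrative stand-ins rather than our real grammar.

```python
import re

try:
    from re import _parser as sre_parse  # CPython 3.11+: re is a package
except ImportError:
    import sre_parse  # older CPython versions

MAXREPEAT = sre_parse.MAXREPEAT


def approx_width(pattern):
    """Return a (min, max) width, guessing when the stdlib parser rejects the pattern."""
    try:
        lo, hi = sre_parse.parse(pattern).getwidth()
        return (int(lo), int(hi))
    except re.error:
        # Assumed fallback, modelled on the behaviour described in this issue:
        # the pattern is not `re` compatible, so cap the width at MAXREPEAT.
        return (1, int(MAXREPEAT))


# Illustrative patterns only (not the grammar from this report).
patterns = {
    "NUM": r"(?<NUMVAL>\d)",  # named group without "P": rejected by `re`
    "STR": r"\w+",            # plain `re` compatible pattern
}

widths = {name: approx_width(p) for name, p in patterns.items()}

# Order terminals by descending maximum width, as a stand-in for
# width-based priority; ties keep their original order.
by_max_width = sorted(widths, key=lambda name: -widths[name][1])

print(widths)
print(by_max_width)
```

On interpreters that include the CPython change, the `re` compatible pattern should report a larger maximum than the fallback value, so it sorts first. On earlier interpreters the two maxima should be equal and the original order is kept.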
To Reproduce
Here's a minimized test. In this example, `NUM` uses a named capture group without the "P" (`(?<NUMVAL>\d)`), which is supported by the `regex` module only. Because it's not supported by the builtin `re` module, its length is assumed to be (min, max) of `(1, MAXREPEAT)` where MAXREPEAT = `2**32 - 1`. `STR` is `re` module compatible, so the `get_width()` call succeeds and its length is calculated as `(1, MAXWIDTH)` where MAXWIDTH = `1 << 64`.

Please let me know if you have any questions about this.
Thanks!