Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPython 3.11.7 breaks regex module compatible pattern width calculations #1376

Closed
MichaelSquires opened this issue Dec 7, 2023 · 9 comments

Comments

@MichaelSquires
Copy link

Describe the bug

This CPython change recently got released in CPython 3.11.7: python/cpython#109859. You'll see in Lib/re/_parser.py that MAXWIDTH (1 << 64) is now the maximum width of a regex pattern, not MAXREPEAT (2**32 - 1). Lark uses MAXREPEAT (utils.py#L154) to assume the maximum width when it can't use the re module to calculate the width. This was all fine until CPython 3.11.7 was released earlier today and it started breaking our custom grammar.

Our specific issue is that terminal ordering changed because the lengths of the regex patterns are used to determine priority. Since 3.11.7, regex patterns that are only compatible with the regex module have a much shorter length than re module compatible patterns.

This issue (or very similar to it) was discussed just a few weeks ago on github: mrabarnett/mrab-regex#513.

To Reproduce

Here's a minimized test. In this example, NUM uses a named capture group without the "P" ((?P<NUMVAL>\d)) which is supported by the regex module only. Because it's not supported by the builtin re module, it's length is assumed to be (min, max) of (1, MAXREPEAT) where MAXREPEAT = 2^^32 -1. For STR, it is re module compatible so the get_width() call succeeds and its length is calculated as (1, MAXWIDTH) where MAXWIDTH = 1 << 64.

    def test_terminal_widths(self):
        g = r"""
        start: expr
        expr: NUM | STR
        NUM: /(?<NUMVAL>\d)+/
        STR: /\w+/
        """
        p = Lark(g, regex=True, parser='lalr')
        term0 = p.terminals[0]
        term1 = p.terminals[1]
        self.assertEqual(term0.priority, term1.priority)
        self.assertEqual(term0.pattern.min_width, term1.pattern.min_width)
        self.assertEqual(term0.pattern.max_width, term1.pattern.max_width)

Please let me know if you have any questions about this.

Thanks!

@MegaIng
Copy link
Member

MegaIng commented Dec 7, 2023

I made a PR with a fix, although I would suggest not relying on automatically generated priories. Just adding an explicit priority would also fix this for your grammar.

@MichaelSquires
Copy link
Author

Wow, thanks for the quick turnaround!

We'll look into setting explicit priorities too, thanks for the suggestion 😄 .

@MichaelSquires
Copy link
Author

This issue is resolved with #1377 being merged in, I'll go ahead and close it. Thank you for your help/support!

@erezsh @MegaIng Any estimates on when we might see a release with this fix?

@erezsh
Copy link
Member

erezsh commented Dec 9, 2023

@MichaelSquires I'm currently on vacation, so no promises, but I'll try to do it next week.

@MichaelSquires
Copy link
Author

@MichaelSquires I'm currently on vacation, so no promises, but I'll try to do it next week.

Thanks for letting me know. Enjoy your vacation!

@MichaelSquires
Copy link
Author

Hi @erezsh, I hope you enjoyed the holidays! Wondering if you think we'll see a new release with this bugfix soon?

@erezsh
Copy link
Member

erezsh commented Jan 5, 2024

@MichaelSquires Thanks for the reminder. I'll try to do it this Monday.

@erezsh
Copy link
Member

erezsh commented Jan 10, 2024

Released version 1.1.9

Sorry for the delay!

@MichaelSquires
Copy link
Author

Excellent news! Thank you for your support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants