More strict rules for group numbers and names in RE #91760

Closed
serhiy-storchaka opened this issue Apr 20, 2022 · 2 comments · Fixed by #91792 or #91794
Labels
3.11 (only security fixes), topic-regex, type-feature (a feature request or enhancement)

Comments

@serhiy-storchaka
Member

There were unintentional changes in parsing regular expressions between Python 2 and Python 3.

  1. Group references.

    In patterns and replacement strings you can refer to a group by its number using the syntax \N, where N is a 1-2 digit decimal number. The number must not start with 0, because it would then be parsed as an octal escape sequence. The group number can also be used in the conditional expression (?(N)...) in patterns and in references \g<N> in replacement strings. Interestingly, in Python 3 it does not have to be a plain sequence of decimal digits. The following are allowed in the group number:

    • Initial zero: \g<01>.
    • Spaces around the number: \g< 1 >.
    • Underscores: \g<1_2>.
    • Non-decimal digits: \g<¹>.
    • Non-ASCII decimal digits: \g<१>.

    All this is purely an implementation artifact. After \g< we search for the nearest > and pass the substring between < and > to int(). In another implementation we could search for the longest sequence of decimal digits, and all the above examples (except maybe the first one) would be filtered out automatically.

  2. Group names.

    In (?P<name>...), (?P=name), (?(name)...) and \g<name> we can refer to groups by name. To avoid ambiguity there is a limitation: the name must follow the rules for identifiers. In Python 2 this means that it may contain only letters, digits and underscores and must start with a non-digit. Letters and digits are ASCII-only: [A-Za-z] and [0-9].

    In Python 3 identifiers can contain non-ASCII letters and digits. That is good. But in bytes patterns and replacement strings the codes \xaa, \xb2, \xb3, \xb5, \xb9, \xba, \xc0-\xd6, \xd8-\xf6, \xf8-\xff are also allowed in the group name. They correspond to the characters ª²³µ¹ºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ after decoding.

    It is an implementation artifact too. Bytes patterns and replacement strings are decoded with the Latin-1 encoding for parsing, which simplifies and speeds up the code. There is no other reason why letters and digits in the range U+0080-U+00FF are allowed.

    Note that in Python 3 a bytes literal can contain only literal characters in the ASCII range. Codes outside this range must be written as octal or hexadecimal escape sequences, so supporting non-ASCII letters and digits does not add to readability.
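Both artifacts can be reproduced without touching the regex parser at all. The following is a minimal sketch, not the parser's actual code: it only shows what int() accepts and what Latin-1 decoding does to a byte above 0x7F. The sample values and the variable names raw_name and decoded_name are illustrative assumptions.

```python
# Item 1: the text between "<" and ">" was handed to int(), which accepts
# more than a plain run of ASCII digits.
group_number_samples = ["01", " 1 ", "1_2", "\u0967"]  # "\u0967" is DEVANAGARI DIGIT ONE
parsed_numbers = [int(text) for text in group_number_samples]
for text, number in zip(group_number_samples, parsed_numbers):
    print(repr(text), "->", number)

# Item 2: bytes are decoded as Latin-1 before parsing, so a byte such as
# 0xE9 becomes the letter "é", which is a valid identifier character.
raw_name = b"caf\xe9"  # illustrative group name, not taken from the issue
decoded_name = raw_name.decode("latin-1")
print(decoded_name, decoded_name.isidentifier())
```

The example deliberately avoids calling re with these forms, because whether they are accepted, deprecated or rejected depends on the Python version.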

Since the above "features" are not intentional, are not supported by most other RE engines (except regex, which is also written in Python), are not tested, and could change as a result of refactoring the parser, I suggest introducing stricter rules for group numbers and names.

  1. A group number should contain only ASCII decimal digits in the range [0-9]. An initial 0 is not allowed except for group number 0.
  2. A group name in a bytes pattern or replacement string should contain only ASCII letters and digits.
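
As a rough illustration of the two rules as proposed in this comment, here is a hedged sketch of standalone validators. The function names is_strict_group_number and is_strict_bytes_group_name are hypothetical and do not exist in the re module. The sketch also assumes the identifier rule described earlier still applies to names (underscores allowed, no leading digit).

```python
def is_strict_group_number(text: str) -> bool:
    """Rule 1: ASCII decimal digits only; no initial 0 except for "0" itself."""
    if not (text.isascii() and text.isdigit()):
        return False
    return text == "0" or not text.startswith("0")


def is_strict_bytes_group_name(name: bytes) -> bool:
    """Rule 2: an identifier built only from ASCII letters, digits and underscores."""
    try:
        decoded = name.decode("ascii")  # rejects every byte above 0x7F
    except UnicodeDecodeError:
        return False
    return decoded.isidentifier()


print(is_strict_group_number("12"), is_strict_group_number(" 1 "))
print(is_strict_bytes_group_name(b"name_1"), is_strict_bytes_group_name(b"caf\xe9"))
```

Decoding with the ascii codec instead of Latin-1 is the design choice that makes rule 2 fall out naturally: any byte that would have become a non-ASCII letter simply fails to decode.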

The question: do we need a deprecation period for this? I have written code for both options (with deprecation and with error) and will create PRs tomorrow.

@serhiy-storchaka serhiy-storchaka added type-feature A feature request or enhancement topic-regex 3.11 only security fixes labels Apr 20, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 21, 2022
…group names in RE

Only sequence of ASCII digits not starting with 0 (except group 0) is
now accepted as a numerical reference.
The group name in bytes patterns and replacement strings can now only
contain ASCII letters and digits and underscore.
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 21, 2022
…id in future

Only sequence of ASCII digits not starting with 0 (except group 0) will be
accepted as a numerical reference.
The group name in bytes patterns and replacement strings could only
contain ASCII letters and digits and underscore.
@ezio-melotti
Member

The question: do we need a deprecation period for this?

A deprecation period of at least one release would be good.

@serhiy-storchaka
Member Author

Could you please review the PR?

I am not sure about forbidding the initial zero. On the one hand, \012 is not a backreference to group 12 but a character with code 0o12, so there may be some confusion. On the other hand, PCRE (according to online testers) allows the initial zero and interprets the number as decimal. So I will perhaps remove this check.
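
The source of the possible confusion can be seen in a small sketch. It uses only re.sub and int(); the variable names are illustrative.

```python
import re

# In a replacement string, \012 is an octal escape (character code 0o12),
# not a reference to group 12.
octal_result = re.sub(r"a", r"\012", "a")
print(repr(octal_result))  # '\n'

# Read as a decimal number, the same digits mean 12, which is the
# PCRE-like interpretation of a group number with an initial zero.
decimal_value = int("012")
print(decimal_value)  # 12
```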

serhiy-storchaka added a commit that referenced this issue Apr 30, 2022
…future (GH-91794)

Only sequence of ASCII digits will be accepted as a numerical reference.
The group name in bytes patterns and replacement strings could only
contain ASCII letters and digits and underscore.
serhiy-storchaka added a commit that referenced this issue May 8, 2022
…names in RE (GH-91792)

Only sequence of ASCII digits is now accepted as a numerical reference.
The group name in bytes patterns and replacement strings can now only
contain ASCII letters and digits and underscore.
3 participants
@serhiy-storchaka @ezio-melotti and others