Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix capturing group numbering and a few more issues in regex parsing #392

Merged
merged 4 commits into from
Aug 6, 2023

Conversation

adams85
Copy link
Collaborator

@adams85 adams85 commented Aug 5, 2023

The JS regex engine assigns numbers to capturing groups sequentially (regardless of the group being named or not named) but .NET uses a different, weird approach:

[...] Captures that use parentheses are numbered automatically from left to right based on the order of the opening parentheses in the regular expression, starting from 1. However, named capture groups are always ordered last, after non-named capture groups. [...]

This could totally mess up numbered backreferences and replace pattern references. (See also sebastienros/jint#1603 (comment)). So, as a workaround, we wrap all named capturing groups in a non-named capturing group to force .NET to include all capturing groups in the resulting match in the expected order. (For example /.(?<a>a)(.)\1/ will be converted to @".((?<a>a))(.)\1", which make it work as expected).

Of course, this won't prevent named groups from being listed after the numbered ones but we can't really do anything about that other than returning the actual count of groups (RegExpParseResult.ActualRegexGroupCount). Using this information, consumers can discard the irrelevant groups. We also provide a method (RegExpParseResult.GetRegexGroupName) for querying the group name by number, which can be used as a replacement for Regex.GroupNameFromNumber.

This means that we can no longer get away with storing a plain Regex object in the AST or returning it from Scanner.AdaptRegExp. We need to put a wrapper (RegExpParseResult) around the Regex object to be able to store the related information. (RegExpParseResult also allows us to improve error reporting a bit.) How do you feel about this API change? (BTW, I tried to adjust Jint to this change and could do that without major problems. I'll open a PR over there soon so you can evaluate that side of the "equation" as well.)

Also fixed a minor bug related to lone surrogate matching. (Test262 test language/literals/regexp/u-surrogate-pairs-atom-escape-decimal.js will now pass.)

Copy link
Collaborator

@lahma lahma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as far as can tell and understand, both here and on Jint's side.

@adams85
Copy link
Collaborator Author

adams85 commented Aug 6, 2023

Thanks for the review!

@adams85 adams85 merged commit d29cb04 into main Aug 6, 2023
2 checks passed
@adams85 adams85 deleted the improve-regex-parsing branch August 6, 2023 12:30
@adams85 adams85 mentioned this pull request Aug 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants