Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete old code generation approach from RegexCompiler / source generator #62318

Merged
merged 4 commits into from
Dec 3, 2021

Conversation

stephentoub
Copy link
Member

In .NET Framework and up through .NET Core 3.1, the code generated for RegexOptions.Compiler was effectively an unrolled version of what RegexInterpreter would process. The RegexNode tree would be turned into a series of opcodes via RegexWriter; the interpreter would then sit in a loop processing those opcodes, and the RegexCompiler iterates through the opcodes generating code for each equivalent to what the interpreter would do but with some decisions made at compile-time rather than at run-time. This approach, however, leads to complicated code that's not pay-for-play (e.g. a big backtracking jump table that all compilations go through even if there's no backtracking), that doesn't factor in the shape of the tree (e.g. it's difficult to add optimizations based on interactions between nodes in the graph), and that doesn't read well when emitted as C# instead of IL as part of the source generator. In .NET 5, we started adding an alternative implementation that processed the RegexNode tree directly, addressing all of those cited issues; however, it only worked for a subset of expressions, namely those with little-to-no backtracking (e.g. non-atomic loops and alternations weren't supported). Since then, we've improved it to the point where everything other than RegexOptions.RightToLeft (which implicitly means lookbehinds as well) is supported, and we've agreed it's ok to drop compilation for those constructs; if they ever become an issue, we can add support for them via the new compilation scheme.

As such, this PR:

  • Deletes all of the code associated with the older code generation scheme
  • Updates the Regex ctor to fall back to selecting the interpreter if the expression can't be compiled
  • Updates the source generator to fall back to just emitting a cached use of Regex if the expression can't be compiled (and issuing a diagnostic in that case)
  • Adds several tests that now pass with the new scheme that didn't with the old (and that still don't with the interpreter)

It also addresses PR feedback from #62268, adds some more comments, adds some more checks for atomicity to reduce unnecessary code spit, and adds several tests from issues that are now addressed in the newer code gen.

cc: @joperezr, @danmoseley
Fixes #61902
Contributes to #58786
Contributes to #62094

@ghost
Copy link

ghost commented Dec 3, 2021

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

In .NET Framework and up through .NET Core 3.1, the code generated for RegexOptions.Compiler was effectively an unrolled version of what RegexInterpreter would process. The RegexNode tree would be turned into a series of opcodes via RegexWriter; the interpreter would then sit in a loop processing those opcodes, and the RegexCompiler iterates through the opcodes generating code for each equivalent to what the interpreter would do but with some decisions made at compile-time rather than at run-time. This approach, however, leads to complicated code that's not pay-for-play (e.g. a big backtracking jump table that all compilations go through even if there's no backtracking), that doesn't factor in the shape of the tree (e.g. it's difficult to add optimizations based on interactions between nodes in the graph), and that doesn't read well when emitted as C# instead of IL as part of the source generator. In .NET 5, we started adding an alternative implementation that processed the RegexNode tree directly, addressing all of those cited issues; however, it only worked for a subset of expressions, namely those with little-to-no backtracking (e.g. non-atomic loops and alternations weren't supported). Since then, we've improved it to the point where everything other than RegexOptions.RightToLeft (which implicitly means lookbehinds as well) is supported, and we've agreed it's ok to drop compilation for those constructs; if they ever become an issue, we can add support for them via the new compilation scheme.

As such, this PR:

  • Deletes all of the code associated with the older code generation scheme
  • Updates the Regex ctor to fall back to selecting the interpreter if the expression can't be compiled
  • Updates the source generator to fall back to just emitting a cached use of Regex if the expression can't be compiled (and issuing a diagnostic in that case)
  • Adds several tests that now pass with the new scheme that didn't with the old (and that still don't with the interpreter)

It also addresses PR feedback from #62268, adds some more comments, adds some more checks for atomicity to reduce unnecessary code spit, and adds several tests from issues that are now addressed in the newer code gen.

cc: @joperezr, @danmoseley
Fixes #61902
Contributes to #58786
Contributes to #62094

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

…ator

In .NET Framework and up through .NET Core 3.1, the code generated for RegexOptions.Compiler was effectively an unrolled version of what RegexInterpreter would process.  The RegexNode tree would be turned into a series of opcodes via RegexWriter; the interpreter would then sit in a loop processing those opcodes, and the RegexCompiler iterates through the opcodes generating code for each equivalent to what the interpreter would do but with some decisions made at compile-time rather than at run-time.  This approach, however, leads to complicated code that's not pay-for-play (e.g. a big backtracking jump table that all compilations go through even if there's no backtracking), that doesn't factor in the shape of the tree (e.g. it's difficult to add optimizations based on interactions between nodes in the graph), and that doesn't read well when emitted as C# instead of IL as part of the source generator.  In .NET 5, we started adding an alternative implementation that processed the RegexNode tree directly, addressing all of those cited issues; however, it only worked for a subset of expressions, namely those with little-to-no backtracking (e.g. non-atomic loops and alternations weren't supported).  Since then, we've improved it to the point where everything other than RegexOptions.RightToLeft (which implicitly means lookbehinds as well) is supported, and we've agreed it's ok to drop compilation for those constructs; if they ever become an issue, we can add support for them via the new compilation scheme.

As such, this PR:
- Deletes all of the code associated with the older code generation scheme
- Updates the Regex ctor to fall back to selecting the interpreter if the expression can't be compiled
- Updates the source generator to fall back to just emitting a cached use of Regex if the expression can't be compiled (and issuing a diagnostic in that case)
- Adds several tests that now pass with the new scheme that didn't with the old (and that still don't with the interpreter)
Also added some comments and renamed a few methods for consistency between RegexCompiler and RegexGenerator.Emitter
Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got to love PRs that delete around 4.5K lines of code, specially over two files that were very large already. :shipit:

@stephentoub stephentoub merged commit 80ba698 into dotnet:main Dec 3, 2021
@stephentoub stephentoub deleted the deleteoldcodegen branch December 3, 2021 22:09
@ghost ghost locked as resolved and limited conversation to collaborators Jan 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Finish "simplified" regex codegen strategy
3 participants