Allow `Regex<AnyRegexOutput>` to be used in the DSL. #504
base: main
What does this do to anchors? Do we need to fix up the search bounds?
What's the intended long-term design? I.e. what are the erasure instructions doing?
@@ -278,6 +278,10 @@ extension Instruction {
  ///
  case backreference

  case beginTypeErase

  case endTypeErase
What do these instructions do? What's their semantics?
Added comments. The processor will maintain a stack of capture lists. `beginTypeErase` will push a new list onto the capture stack, so that all subsequent captures get added to that list. `endTypeErase(_: ValReg)` will convert all elements of the current capture list to an `AnyRegexOutput`, move it to the given value register, and pop the stack. It's similar to how `matcher` works.
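To make the description above concrete, here is a minimal sketch of the capture stack as a toy model. Everything in it is hypothetical and for illustration only: `CaptureStack`, `ErasedOutput`, the `String` captures, and the `Int` register keys are stand-ins, not the engine's actual types.

```swift
// Hypothetical stand-in for an erased output value (not the real AnyRegexOutput).
struct ErasedOutput: Equatable {
  var elements: [String]
}

// Hypothetical model of the capture stack described above.
struct CaptureStack {
  // Each entry is one capture list; the last entry is the current scope.
  var lists: [[String]] = [[]]
  // Stand-in for value registers, keyed by a register number.
  var valueRegisters: [Int: ErasedOutput] = [:]

  // Records a capture in the current (innermost) list.
  mutating func capture(_ text: String) {
    lists[lists.count - 1].append(text)
  }

  // beginTypeErase: open a new scope so later captures land in a fresh list.
  mutating func beginTypeErase() {
    lists.append([])
  }

  // endTypeErase: erase the current list into one value, store it in the
  // given register, and pop back to the enclosing scope.
  mutating func endTypeErase(into register: Int) {
    precondition(lists.count > 1, "endTypeErase without a matching beginTypeErase")
    valueRegisters[register] = ErasedOutput(elements: lists.removeLast())
  }
}

var stack = CaptureStack()
stack.capture("johnappleseed")   // capture in the outer regex
stack.beginTypeErase()
stack.capture("12.")             // captures made by the nested regex
stack.capture("12")
stack.endTypeErase(into: 0)
```

After `endTypeErase`, the outer list holds only the outer capture, and the nested captures are reachable through the register as a single erased value.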
I'm not fully parsing this. Could you write some XFAIL tests that illustrate the work that remains to be done after this PR?
There are no XFAIL-able tests. This PR fixes the bug, but it's not as efficient as handling this in the bytecode directly. I'd be happy to chat in person also.
How would `beginTypeErase`/`endTypeErase` work in the context of backtracking? For that matter, how does this PR support any backtracking into ARO at all, i.e. isn't this forcing it to be atomic?
I think backtracking into ARO will be supported, since `savePoints` currently stores the capture list. It can be extended to store the capture stack.
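A rough sketch of what storing the capture stack in a save point could look like follows. The `Processor` and `SavePoint` types here are hypothetical toy models, not the engine's real ones; the only idea taken from the comment above is that a save point snapshots the whole stack rather than a single list.

```swift
// Hypothetical toy processor; positions are plain integers for simplicity.
struct Processor {
  var position = 0
  var captureStack: [[String]] = [[]]

  // A save point snapshots the position and the entire capture stack.
  struct SavePoint {
    let position: Int
    let captureStack: [[String]]
  }
  var savePoints: [SavePoint] = []

  mutating func save() {
    savePoints.append(SavePoint(position: position, captureStack: captureStack))
  }

  // Restores the most recent save point; returns false if none remains.
  mutating func backtrack() -> Bool {
    guard let savePoint = savePoints.popLast() else { return false }
    position = savePoint.position
    captureStack = savePoint.captureStack
    return true
  }
}

var processor = Processor()
processor.captureStack[0].append("outer")
processor.save()
processor.captureStack.append(["inner"])  // as if beginTypeErase ran, then a capture
processor.position = 5
let restored = processor.backtrack()      // discards the nested scope entirely
```

Because Swift arrays are value types, the snapshot is a plain copy, and restoring it drops any scopes pushed after the save point, including a half-finished type-erased scope.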
func testTypeErasedRegexInDSL() throws {
  do {
    let input = "johnappleseed: 12."
    let numberRegex = try! Regex(#"(\d+)\.?"#)
Can you add a test case where the dynamic regex begins with a `^`? I think that would help clarify the behavior you're proposing.
Added one. The idea is to fix up the search bounds so that `^` refers to the start of the input. It happens to work today because anchors currently use the base string's bounds, not the input slice's bounds.
Right, there is a bug here that another bug is masking, but when that other bug is fixed, how do we fix this bug?
You can illustrate the difference with a substring input, where the subject bounds are the substring's bounds and the search bounds are contained within.
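The distinction can be sketched without the regex engine at all. In the snippet below, `isAtSubjectStart` is a hypothetical helper (not a library API) that models a start-of-subject anchor under the assumption that the anchor should be decided by the subject's bounds, never by the base string's bounds or the narrower search bounds.

```swift
// Hypothetical helper: a start anchor holds only at the subject's own start.
func isAtSubjectStart(_ index: String.Index, subject: Substring) -> Bool {
  index == subject.startIndex
}

let base = "xx12ab"
let subject = base.dropFirst(2)                       // "12ab", shares indices with base
let searchStart = subject.index(subject.startIndex, offsetBy: 2)
let searchBounds = searchStart..<subject.endIndex     // covers "ab"

// Three candidate positions for a start anchor:
let atBaseStart = isAtSubjectStart(base.startIndex, subject: subject)
let atSubjectStart = isAtSubjectStart(subject.startIndex, subject: subject)
let atSearchStart = isAtSubjectStart(searchBounds.lowerBound, subject: subject)
```

With a plain `String` input the base start and the subject start coincide, which is why a test on a full string cannot tell the two behaviors apart; a substring input separates all three positions.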
Force-pushed from aa0ca5b to 63124f5.
Apply a workaround to allow `Regex<AnyRegexOutput>` to be used in the DSL. This workaround emits each nested `Regex<AnyRegexOutput>` as a custom matcher so that it's essentially treated as a separate compilation unit.

A proper fix for this is to introduce scoped type erasure in the matching engine so that all type erasure (including the top-level one) goes through this model. I left some stubs in (`beginTypeErase`, `endTypeErase`) which I'll implement in a follow-up.

Since this implementation is using `Executor.match(...) -> Regex.Match` in the regex compiler, we need to add availability annotations to the `Executor` and `Compiler` types.

Resolves rdar://94320030.
Hope this clarifies the intended design. Let me know if you have any feedback.
@swift-ci please test
There's a small bug in there and the fix should be relatively simple. I'll take a look tomorrow.
// endTypeErase
let program = try Compiler(tree: DSLTree(child)).emit()
let executor = Executor(program: program)
return emitMatcher { input, startIndex, range in
Matchers don't get the subject bounds, so I don't see how this would support anchors that refer to the subject bounds. We could have a different matcher interface or else add subject bounds as an additional parameter for the internal code (probably not API though).
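One way to picture the second option is a matcher closure that takes the subject bounds as an extra parameter. The `BoundedMatcher` type below is purely hypothetical (it is not the existing matcher interface), and `startAnchoredA` is a hand-written toy matcher for the pattern `^a`, assuming `^` means start of subject.

```swift
// Hypothetical internal matcher signature that also receives the subject bounds.
typealias BoundedMatcher = (
  _ input: String,
  _ startIndex: String.Index,
  _ searchBounds: Range<String.Index>,
  _ subjectBounds: Range<String.Index>
) -> String.Index?

// Toy matcher for `^a`: succeeds only at the start of the subject.
let startAnchoredA: BoundedMatcher = { input, start, searchBounds, subjectBounds in
  guard start == subjectBounds.lowerBound,
        start < searchBounds.upperBound,
        input[start] == "a" else { return nil }
  return input.index(after: start)
}

let text = "ba"
let second = text.index(after: text.startIndex)
let whole = text.startIndex..<text.endIndex
let tail = second..<text.endIndex

// Searching the tail of the full subject: the anchor must not match at "a".
let inFullSubject = startAnchoredA(text, second, tail, whole)
// Treating the tail as the subject itself: the anchor matches at "a".
let inTailSubject = startAnchoredA(text, second, tail, tail)
```

With only `(input, startIndex, range)` available, the two calls would be indistinguishable, which is the gap this comment points at.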
if case .typeErase = root {
  return self
}
return .init(node: .typeErase(root))
Can we do this on creation instead, or somewhere else?
var root = root
while case let .typeErase(child) = root {
  root = child
}
Why is this scan in here? It seems like you'd want to capture an ARO, is that not possible?
    ZeroOrMore(.whitespace)
    Capture { numberRegex }
  }
  XCTAssertNil(input.wholeMatch(of: regex))
Positive match tests with anchors?