-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add regexp_replace function #281
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this almost looks good to me.
extensions/functions_string.yaml
Outdated
specified using the `occurrence` argument. Specifying `1` means only the first occurrence | ||
will be replaced and specifying `0` means all occurrences will be replaced. The number of | ||
characters from the beginning of the string to begin starting to search for pattern matches | ||
can be specified using the `position` argument. Specifying `1` means to search for matches | ||
starting at the first character of the input string. The replacement string can capture | ||
groups using numbered backreferences. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to be more explicit about how > 1 works, e.g. "Specifying 1
means only the first occurrence will be replaced, 2
means the second occurrence, and so on. Specify 0
to replace all occurrences." and likewise for the string indices.
What happens if I were to pass negative numbers or 0 for the string index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. For string index, it states that it should be a positive non-zero integer now. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already understood that I'm not supposed to do it, but it still doesn't really specify what would happen if I would :) The same actually applies for what happens if the regex fails to compile, or the replacement has a back-reference that doesn't exist. Maybe this is overly pedantic and it should be up to the engine to decide whether to invoke MAYBE_HALT_AND_CATCH_FIRE
, throw a segfault, or do something more gentle, but otherwise I see three options:
- throw an error and fail the entire plan if any invocation fails;
- change nullability behavior to
SPECIFIED_OUTPUT
and always return nullable, so you can fail invocations independently by returning null; - allow either of the above by means of an option enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was originally under the assumption that the engine would decide, which I guess would open up a lot of variability between engine implementations. Do you have a preference for any of those three options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the third is most in line with other functions, like how Substrait is explicit about how integer overflow is to be handled. However, if all major execution engines happen to do 1 or 2 then we should do that, too (I wouldn't know). I don't have a strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like not that many actually have this as an argument that can be passed. MySQL does though and if you pass a negative input, it throws and error saying incorrect parameters.
How do you suggest we define the option that allows 1 or 2? [ DECLARED_OUTPUT, ERROR]? And would we need to explicitly set the nullability for the implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you suggest we define the option that allows 1 or 2? [ DECLARED_OUTPUT, ERROR]?
I was thinking something like an option named "on_error" with values ERROR
and NULL
.
And would we need to explicitly set the nullability for the implementations?
Yes, and that's actually the primary reason I'm kind of on the fence about it. There's no way to specify that the function should have MIRROR behavior for ERROR
and DECLARED_OUTPUT
behavior for NULL
, so you'd have to do DECLARED_OUTPUT
for both. That's kind of nasty and will probably not work well with the functions in existing query engines.
I may well be overthinking it. Maybe we should just add something like "Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range." to leave it up to the engine. Or just keep things as is and not specify anything, since that statement is arguably redundant (though I'd prefer to be overly explicit to show that we did actually think about these edge cases).
I guess I'll leave this up to you and @jacques-n (or anyone else stumbling upon this who might have better ideas).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with adding the error handling message you wrote there to the description. I'm also on the fence about potentially having this nullability situation here.
6f604ed
to
4047714
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll mark this as approved because I'm good with this, but whoever's merging this might want to have a look at the unresolved thread and weigh in before bashing "merge".
e2b31ae
to
a59c45a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richtia Needs rebase due to other merges. Also, can you add a CASE_INSENSITIVE_ASCII
option to the end of case_sensitivity
as we discussed offline at some point?
a59c45a
to
3c262f6
Compare
PR to regexp_replace function.
Addressed some of the comments made by @jvanstraten here: #256
Regular expression
modifiers
are made options here. If there's a better way, i'm open to suggestions.Once we can come to an agreement on how to handle this regex function, i'll submit another PR to add a few of the others.