Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add regexp_replace function #281

Merged
merged 2 commits into from
Sep 2, 2022

Conversation

richtia
Copy link
Member

@richtia richtia commented Aug 8, 2022

PR to regexp_replace function.

Addressed some of the comments made by @jvanstraten here: #256

Regular expression modifiers are made options here. If there's a better way, i'm open to suggestions.

Once we can come to an agreement on how to handle this regex function, i'll submit another PR to add a few of the others.

Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this almost looks good to me.

extensions/functions_string.yaml Outdated Show resolved Hide resolved
extensions/functions_string.yaml Outdated Show resolved Hide resolved
Comment on lines 261 to 266
specified using the `occurrence` argument. Specifying `1` means only the first occurrence
will be replaced and specifying `0` means all occurrences will be replaced. The number of
characters from the beginning of the string to begin starting to search for pattern matches
can be specified using the `position` argument. Specifying `1` means to search for matches
starting at the first character of the input string. The replacement string can capture
groups using numbered backreferences.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to be more explicit about how > 1 works, e.g. "Specifying 1 means only the first occurrence will be replaced, 2 means the second occurrence, and so on. Specify 0 to replace all occurrences." and likewise for the string indices.

What happens if I were to pass negative numbers or 0 for the string index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. For string index, it states that it should be a positive non-zero integer now. How does that sound?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already understood that I'm not supposed to do it, but it still doesn't really specify what would happen if I would :) The same actually applies for what happens if the regex fails to compile, or the replacement has a back-reference that doesn't exist. Maybe this is overly pedantic and it should be up to the engine to decide whether to invoke MAYBE_HALT_AND_CATCH_FIRE, throw a segfault, or do something more gentle, but otherwise I see three options:

  • throw an error and fail the entire plan if any invocation fails;
  • change nullability behavior to SPECIFIED_OUTPUT and always return nullable, so you can fail invocations independently by returning null;
  • allow either of the above by means of an option enum.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was originally under the assumption that the engine would decide, which I guess would open up a lot of variability between engine implementations. Do you have a preference for any of those three options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the third is most in line with other functions, like how Substrait is explicit about how integer overflow is to be handled. However, if all major execution engines happen to do 1 or 2 then we should do that, too (I wouldn't know). I don't have a strong opinion.

Copy link
Member Author

@richtia richtia Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like not that many actually have this as an argument that can be passed. MySQL does though and if you pass a negative input, it throws and error saying incorrect parameters.

How do you suggest we define the option that allows 1 or 2? [ DECLARED_OUTPUT, ERROR]? And would we need to explicitly set the nullability for the implementations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you suggest we define the option that allows 1 or 2? [ DECLARED_OUTPUT, ERROR]?

I was thinking something like an option named "on_error" with values ERROR and NULL.

And would we need to explicitly set the nullability for the implementations?

Yes, and that's actually the primary reason I'm kind of on the fence about it. There's no way to specify that the function should have MIRROR behavior for ERROR and DECLARED_OUTPUT behavior for NULL, so you'd have to do DECLARED_OUTPUT for both. That's kind of nasty and will probably not work well with the functions in existing query engines.

I may well be overthinking it. Maybe we should just add something like "Behavior is undefined if the regex fails to compile, the replacement contains an illegal back-reference, the occurrence value is out of range, or the position value is out of range." to leave it up to the engine. Or just keep things as is and not specify anything, since that statement is arguably redundant (though I'd prefer to be overly explicit to show that we did actually think about these edge cases).

I guess I'll leave this up to you and @jacques-n (or anyone else stumbling upon this who might have better ideas).

Copy link
Member Author

@richtia richtia Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with adding the error handling message you wrote there to the description. I'm also on the fence about potentially having this nullability situation here.

extensions/functions_string.yaml Show resolved Hide resolved
extensions/functions_string.yaml Outdated Show resolved Hide resolved
jvanstraten
jvanstraten previously approved these changes Aug 17, 2022
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll mark this as approved because I'm good with this, but whoever's merging this might want to have a look at the unresolved thread and weigh in before bashing "merge".

Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richtia Needs rebase due to other merges. Also, can you add a CASE_INSENSITIVE_ASCII option to the end of case_sensitivity as we discussed offline at some point?

@richtia
Copy link
Member Author

richtia commented Sep 2, 2022

@richtia Needs rebase due to other merges. Also, can you add a CASE_INSENSITIVE_ASCII option to the end of case_sensitivity as we discussed offline at some point?

Updated! This one just updates the options for regexp_replace. #293 handles the rest of them.

@jvanstraten jvanstraten merged commit 433d049 into substrait-io:main Sep 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants