Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add string containment functions #256

Merged
merged 6 commits into from
Jul 25, 2022

Conversation

richtia
Copy link
Member

@richtia richtia commented Jul 22, 2022

PR to add definitions for string containment functions.

description: The substring to search for.
return: i8
-
name: regexp_strpos
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these regular expression ones, I'm not sure whether the input and regex pattern can have mismatched types, so i just kept it consistent for now.

Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that many of my comments apply all over the place, this list is not exhaustive. Regexes in particular are a huge can of worms to open if we want any usable level of precision as to what needs to be supported and what doesn't. Maybe it's better to postpone the regex stuff to a later PR.

extensions/functions_string.yaml Show resolved Hide resolved
extensions/functions_string.yaml Show resolved Hide resolved
extensions/functions_string.yaml Outdated Show resolved Hide resolved
extensions/functions_string.yaml Outdated Show resolved Hide resolved
Comment on lines 252 to 253
- options: [ CASE_SENSITIVE, CASE_INSENSITIVE ]
required: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regular expressions generally have a lot more options than just case sensitivity. Dot-all/multiline comes to mind. I'm not sure how else we could represent the options, but if we do it this way and ever we'd want to add more options here we'd have to break the signature of the function.

extensions/functions_string.yaml Outdated Show resolved Hide resolved
extensions/functions_string.yaml Outdated Show resolved Hide resolved
-
name: regexp_strpos
description: >-
Return the position of the first occurrence of a regular expression pattern in another
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a regular expression pattern

Which flavor? There are many, and some vary greatly in terms of performance.

Comment on lines 257 to 259
- value: "string"
name: "pattern"
description: The regular expression pattern.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to consider requiring the regex to be constant. If not, we need to consider what happens if the pattern fails to compile on a row-by-row basis.

Comment on lines 480 to 482
- value: "string"
name: "replacement"
description: The string to replace the regular expression match with.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we support backreferences in replacements? Named or only numbered?

@richtia
Copy link
Member Author

richtia commented Jul 25, 2022

Note that many of my comments apply all over the place, this list is not exhaustive. Regexes in particular are a huge can of worms to open if we want any usable level of precision as to what needs to be supported and what doesn't. Maybe it's better to postpone the regex stuff to a later PR.

Thanks for all the input! It definitely seems like there should be a longer discussion/writeup before the regex stuff is attempted. I'll remove those from this PR. And we can use your initial comments on here later on.

return: i64
- name: replace
description: >-
Replace all occurrence of the substring with the replacement string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Replace all occurrence of the substring with the replacement string.
Replace all occurrences of the substring with the replacement string.

@jvanstraten
Copy link
Contributor

I'll leave the regex discussions as unresolved, but aside from above grammar nit and overlapping/non-overlapping option/description for count_substring I'm good with this.

@richtia
Copy link
Member Author

richtia commented Jul 25, 2022

I'll leave the regex discussions as unresolved, but aside from above grammar nit and overlapping/non-overlapping option/description for count_substring I'm good with this.

Updated both! I decided to stick with just the description update for the overlap/non-overlap for now, since I couldn't actually find anything like this that provides that option as part of a function.

jvanstraten
jvanstraten previously approved these changes Jul 25, 2022
- value: "varchar<L1>"
name: "input"
description: The input string.
- value: "varchar<L1>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the input string need to be the same length? If not, l1 shouldn't be repeated.

- value: "varchar<L1>"
name: "input"
description: Input string.
- value: "varchar<L1>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated both

jacques-n
jacques-n previously approved these changes Jul 25, 2022
Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @richtia !

@jacques-n
Copy link
Contributor

Looks like a check is failing. @richtia , can you address? Thx!

@richtia
Copy link
Member Author

richtia commented Jul 25, 2022

Looks like a check is failing. @richtia , can you address? Thx!

@jacques-n Yep. Just updated one of my commit messages due to linting issues.

- value: "varchar<L3>"
name: "replacement"
description: The replacement string.
return: "varchar<L1 - L2 + L3>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I agree with that one. The actual length depends on the number of replacements, this arbitrarily assumes there was one replacement. I'd just put L1 there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh...good point. Updated. Thanks!

@jacques-n jacques-n merged commit d6b9b34 into substrait-io:main Jul 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants