Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add string transform functions #267

Merged
merged 1 commit into from
Sep 2, 2022

Conversation

richtia
Copy link
Member

@richtia richtia commented Jul 26, 2022

PR to add definitions for string transform functions.

name: lower
description: >-
Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci
collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mainly looking here when writing this:
https://forums.mysql.com/read.php?103,187048,188748#msg-188748
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/

Follow up to @jacques-n comment here:
#245 (comment)

Not sure if these are the kinds of rules we're looking for, so open to more discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Rust defines ASCII versions as well because the Unicode rules are pretty horrible (because natural language is pretty horrible). Looking beyond the horizon at potential FPGA acceleration I am also horrified by this stuff. Maybe we can introduce an option for these functions to choose between full UTF support (preferred) and ASCII-only? I don't feel strongly about this, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the option, let me know if you think it's okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really fault you for this because in all likelihood nothing says anything about this, but I'd strongly prefer we use underscores for the option names. Dashes are an absolute PITA to parse in any DSL because they're ambiguous with subtractions. If there's precedent for dashes you can just leave it as is though, I guess...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch...I think all the other options actually do use underscores. I just changed it.

extensions/functions_string.yaml Outdated Show resolved Hide resolved
extensions/functions_string.yaml Outdated Show resolved Hide resolved
name: lower
description: >-
Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci
collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Rust defines ASCII versions as well because the Unicode rules are pretty horrible (because natural language is pretty horrible). Looking beyond the horizon at potential FPGA acceleration I am also horrified by this stuff. Maybe we can introduce an option for these functions to choose between full UTF support (preferred) and ASCII-only? I don't feel strongly about this, though.

extensions/functions_string.yaml Outdated Show resolved Hide resolved
return: i64
-
name: bit_length
description: Return the number of of bits in the input string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "of of"

Also... bits? How the heck is that defined? UTF-8 bytes times 8 or some code point bit count shenanigans?

Copy link
Member Author

@richtia richtia Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems to be bits lol. https://www.w3resource.com/PostgreSQL/bit-length-function.php

Do you think I should add any other details to the description here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs there don't list any reason why it's in bits rather than bytes... IMO it should just not exist in this form and should just be byte_length. When converting a bit_length(x) function in SQL to Substrait it can always be represented as multiply(byte_length(x), 8), so byte_length is more primitive. I could see the reason they used bit granularity being that with different storage compression methods you might end up with bit-level stuff, but Substrait doesn't define any of that stuff for strings; all it says is that it's UTF-8-encoded, so at best you can have a function that specifies the number of bytes needed for said UTF-8 encoding.

OTOH, there's precedent for following SQL spec nonsense (like sum([]) returning null). Is this specifically a Postgres function or is it part of the SQL spec?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Postgres and MySQL have bit_length, they both also have one called 'octet_length', which returns bytes:
https://www.w3resource.com/PostgreSQL/octet_length-function.php

I can add a separate byte_length function, so we have both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above function is a synonym for LENGTH().

Are their strings not UTF-8? I don't even know anymore.

I think I'm just going to abstain from this discussion because I increasingly feel like I don't have the background to make informed decisions about this. @jacques-n do you hold strong opinions about these functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's confusing because i've seen length defined 2 different ways. One returns the bytes, the other returns the number of characters (which is why I just updated the length function in this PR to be char_length).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd always include evaluation of the commercial databases, not just the OSS ones. E.g. Oracle and Sql Server. What do they have?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n SQL Server and Oracle have both bit_length and octet_length (for bytes). I've added octet_length to this PR as well.

https://docs.oracle.com/cd/B19188_01/doc/B15917/sqfunc.htm#i1005968
https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/string-functions?view=sql-server-ver16

@richtia richtia requested a review from jvanstraten July 27, 2022 16:33
jvanstraten
jvanstraten previously approved these changes Jul 28, 2022
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if no one has any further input on the open bit_length discussion.

jvanstraten
jvanstraten previously approved these changes Aug 15, 2022
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@richtia richtia force-pushed the string_transform_functions branch 2 times, most recently from c43ccf8 to adb8bf2 Compare August 24, 2022 15:49
@richtia
Copy link
Member Author

richtia commented Aug 24, 2022

LGTM

Just made the change so that concat is also variadic

jvanstraten
jvanstraten previously approved these changes Aug 24, 2022
@ianmcook
Copy link
Contributor

ianmcook commented Aug 25, 2022

I think concat and concat_ws should accept a minimum of one string argument (not two). PostgreSQL, MySQL, and other systems allow that.

@richtia richtia force-pushed the string_transform_functions branch 2 times, most recently from d42e26d to bbd9f5b Compare August 25, 2022 17:39
@richtia
Copy link
Member Author

richtia commented Aug 25, 2022

I think concat and concat_ws should accept a minimum of one string argument (not two). PostgreSQL, MySQL, and other systems allow that.

Updated. Thanks!

@jvanstraten
Copy link
Contributor

jvanstraten commented Aug 25, 2022

FWIW I accepted it as 2 minimum because 1 reduces to no operation, but I don't feel strongly either way. Arguably it further generalizes to returning the empty string for 0 arguments.

ETA: for compound types that would break the return type derivation, though.

jvanstraten
jvanstraten previously approved these changes Aug 25, 2022
@richtia
Copy link
Member Author

richtia commented Sep 1, 2022

Rebased and added a bunch missing function argument names

@richtia richtia requested review from jvanstraten and removed request for jacques-n September 1, 2022 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants