feat: add string transform functions #267

richtia · 2022-07-26T21:47:14Z

PR to add definitions for string transform functions.

extensions/functions_string.yaml

richtia · 2022-07-26T21:55:28Z

extensions/functions_string.yaml

+    name: lower
+    description: >-
+      Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci
+      collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.


I was mainly looking here when writing this:
https://forums.mysql.com/read.php?103,187048,188748#msg-188748
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/

Follow up to @jacques-n comment here:
#245 (comment)

Not sure if these are the kinds of rules we're looking for, so open to more discussion.

Hm. Rust defines ASCII versions as well because the Unicode rules are pretty horrible (because natural language is pretty horrible). Looking beyond the horizon at potential FPGA acceleration I am also horrified by this stuff. Maybe we can introduce an option for these functions to choose between full UTF support (preferred) and ASCII-only? I don't feel strongly about this, though.

I added the option, let me know if you think it's okay.

I can't really fault you for this because in all likelihood nothing says anything about this, but I'd strongly prefer we use underscores for the option names. Dashes are an absolute PITA to parse in any DSL because they're ambiguous with subtractions. If there's precedent for dashes you can just leave it as is though, I guess...

Thanks for the catch...I think all the other options actually do use underscores. I just changed it.

extensions/functions_string.yaml

jvanstraten · 2022-07-27T07:51:14Z

extensions/functions_string.yaml

+    name: lower
+    description: >-
+      Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci
+      collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/.


Hm. Rust defines ASCII versions as well because the Unicode rules are pretty horrible (because natural language is pretty horrible). Looking beyond the horizon at potential FPGA acceleration I am also horrified by this stuff. Maybe we can introduce an option for these functions to choose between full UTF support (preferred) and ASCII-only? I don't feel strongly about this, though.

extensions/functions_string.yaml

jvanstraten · 2022-07-27T07:53:52Z

extensions/functions_string.yaml

+        return: i64
+  -
+    name: bit_length
+    description: Return the number of of bits in the input string.


Typo: "of of"

Also... bits? How the heck is that defined? UTF-8 bytes times 8 or some code point bit count shenanigans?

Yeah, it seems to be bits lol. https://www.w3resource.com/PostgreSQL/bit-length-function.php

Do you think I should add any other details to the description here?

The docs there don't list any reason why it's in bits rather than bytes... IMO it should just not exist in this form and should just be byte_length. When converting a bit_length(x) function in SQL to Substrait it can always be represented as multiply(byte_length(x), 8), so byte_length is more primitive. I could see the reason they used bit granularity being that with different storage compression methods you might end up with bit-level stuff, but Substrait doesn't define any of that stuff for strings; all it says is that it's UTF-8-encoded, so at best you can have a function that specifies the number of bytes needed for said UTF-8 encoding.

OTOH, there's precedent for following SQL spec nonsense (like sum([]) returning null). Is this specifically a Postgres function or is it part of the SQL spec?

Postgres and MySQL have bit_length, they both also have one called 'octet_length', which returns bytes:
https://www.w3resource.com/PostgreSQL/octet_length-function.php

I can add a separate byte_length function, so we have both.

The above function is a synonym for LENGTH().

Are their strings not UTF-8? I don't even know anymore.

I think I'm just going to abstain from this discussion because I increasingly feel like I don't have the background to make informed decisions about this. @jacques-n do you hold strong opinions about these functions?

Yeah, it's confusing because i've seen length defined 2 different ways. One returns the bytes, the other returns the number of characters (which is why I just updated the length function in this PR to be char_length).

I think I'd always include evaluation of the commercial databases, not just the OSS ones. E.g. Oracle and Sql Server. What do they have?

@jacques-n SQL Server and Oracle have both bit_length and octet_length (for bytes). I've added octet_length to this PR as well.

https://docs.oracle.com/cd/B19188_01/doc/B15917/sqfunc.htm#i1005968
https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/string-functions?view=sql-server-ver16

jvanstraten

LGTM if no one has any further input on the open bit_length discussion.

jvanstraten

LGTM

richtia · 2022-08-24T15:54:01Z

LGTM

Just made the change so that concat is also variadic

ianmcook · 2022-08-25T17:15:21Z

I think concat and concat_ws should accept a minimum of one string argument (not two). PostgreSQL, MySQL, and other systems allow that.

richtia · 2022-08-25T17:42:22Z

I think concat and concat_ws should accept a minimum of one string argument (not two). PostgreSQL, MySQL, and other systems allow that.

Updated. Thanks!

jvanstraten · 2022-08-25T20:08:35Z

FWIW I accepted it as 2 minimum because 1 reduces to no operation, but I don't feel strongly either way. Arguably it further generalizes to returning the empty string for 0 arguments.

ETA: for compound types that would break the return type derivation, though.

richtia · 2022-09-01T19:05:57Z

Rebased and added a bunch missing function argument names

richtia commented Jul 26, 2022

View reviewed changes

extensions/functions_string.yaml Outdated Show resolved Hide resolved

richtia commented Jul 26, 2022

View reviewed changes

jvanstraten requested changes Jul 27, 2022

View reviewed changes

richtia requested a review from jvanstraten July 27, 2022 16:33

jvanstraten previously approved these changes Jul 28, 2022

View reviewed changes

richtia dismissed jvanstraten’s stale review via 6327ca5 July 28, 2022 17:05

richtia force-pushed the string_transform_functions branch from f1cb5ac to 6327ca5 Compare July 28, 2022 17:05

richtia requested review from jacques-n and jvanstraten July 28, 2022 17:08

jvanstraten previously approved these changes Aug 15, 2022

View reviewed changes

richtia dismissed jvanstraten’s stale review via 7b95418 August 24, 2022 15:27

richtia force-pushed the string_transform_functions branch 2 times, most recently from c43ccf8 to adb8bf2 Compare August 24, 2022 15:49

jvanstraten previously approved these changes Aug 24, 2022

View reviewed changes

richtia dismissed jvanstraten’s stale review via d42e26d August 25, 2022 17:39

richtia force-pushed the string_transform_functions branch 2 times, most recently from d42e26d to bbd9f5b Compare August 25, 2022 17:39

jvanstraten previously approved these changes Aug 25, 2022

View reviewed changes

richtia dismissed jvanstraten’s stale review via be4f84f September 1, 2022 19:03

richtia force-pushed the string_transform_functions branch from 7074a25 to be4f84f Compare September 1, 2022 19:03

feat: add string transform functions

16c982d

richtia force-pushed the string_transform_functions branch from be4f84f to 16c982d Compare September 1, 2022 19:04

richtia requested review from jvanstraten and removed request for jacques-n September 1, 2022 19:06

jvanstraten approved these changes Sep 2, 2022

View reviewed changes

jvanstraten merged commit ff2f7f1 into substrait-io:main Sep 2, 2022

ianmcook mentioned this pull request Nov 28, 2022

feat: add functions for arithmetic, rounding, logarithmic, and string transformations #230

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add string transform functions #267

feat: add string transform functions #267

richtia commented Jul 26, 2022

richtia Jul 26, 2022

jvanstraten Jul 27, 2022

richtia Jul 27, 2022

jvanstraten Jul 27, 2022

richtia Jul 27, 2022

jvanstraten Jul 27, 2022

jvanstraten Jul 27, 2022

richtia Jul 27, 2022 •

edited

Loading

jvanstraten Jul 27, 2022

richtia Jul 27, 2022

jvanstraten Jul 27, 2022

richtia Jul 27, 2022

jacques-n Jul 28, 2022

richtia Jul 28, 2022

jvanstraten left a comment

jvanstraten left a comment

richtia commented Aug 24, 2022

ianmcook commented Aug 25, 2022 •

edited

Loading

richtia commented Aug 25, 2022

jvanstraten commented Aug 25, 2022 •

edited

Loading

richtia commented Sep 1, 2022

feat: add string transform functions #267

feat: add string transform functions #267

Conversation

richtia commented Jul 26, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

richtia Jul 27, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jvanstraten left a comment

Choose a reason for hiding this comment

jvanstraten left a comment

Choose a reason for hiding this comment

richtia commented Aug 24, 2022

ianmcook commented Aug 25, 2022 • edited Loading

richtia commented Aug 25, 2022

jvanstraten commented Aug 25, 2022 • edited Loading

richtia commented Sep 1, 2022

richtia Jul 27, 2022 •

edited

Loading

ianmcook commented Aug 25, 2022 •

edited

Loading

jvanstraten commented Aug 25, 2022 •

edited

Loading