-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add string transform functions #267
feat: add string transform functions #267
Conversation
name: lower | ||
description: >- | ||
Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci | ||
collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mainly looking here when writing this:
https://forums.mysql.com/read.php?103,187048,188748#msg-188748
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/
Follow up to @jacques-n comment here:
#245 (comment)
Not sure if these are the kinds of rules we're looking for, so open to more discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Rust defines ASCII versions as well because the Unicode rules are pretty horrible (because natural language is pretty horrible). Looking beyond the horizon at potential FPGA acceleration I am also horrified by this stuff. Maybe we can introduce an option for these functions to choose between full UTF support (preferred) and ASCII-only? I don't feel strongly about this, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the option, let me know if you think it's okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't really fault you for this because in all likelihood nothing says anything about this, but I'd strongly prefer we use underscores for the option names. Dashes are an absolute PITA to parse in any DSL because they're ambiguous with subtractions. If there's precedent for dashes you can just leave it as is though, I guess...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch...I think all the other options actually do use underscores. I just changed it.
name: lower | ||
description: >- | ||
Transform the string to lower case characters. Implementation should follow the utf8_unicode_ci | ||
collations according to the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Rust defines ASCII versions as well because the Unicode rules are pretty horrible (because natural language is pretty horrible). Looking beyond the horizon at potential FPGA acceleration I am also horrified by this stuff. Maybe we can introduce an option for these functions to choose between full UTF support (preferred) and ASCII-only? I don't feel strongly about this, though.
extensions/functions_string.yaml
Outdated
return: i64 | ||
- | ||
name: bit_length | ||
description: Return the number of of bits in the input string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: "of of"
Also... bits? How the heck is that defined? UTF-8 bytes times 8 or some code point bit count shenanigans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it seems to be bits lol. https://www.w3resource.com/PostgreSQL/bit-length-function.php
Do you think I should add any other details to the description here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs there don't list any reason why it's in bits rather than bytes... IMO it should just not exist in this form and should just be byte_length
. When converting a bit_length(x)
function in SQL to Substrait it can always be represented as multiply(byte_length(x), 8)
, so byte_length
is more primitive. I could see the reason they used bit granularity being that with different storage compression methods you might end up with bit-level stuff, but Substrait doesn't define any of that stuff for string
s; all it says is that it's UTF-8-encoded, so at best you can have a function that specifies the number of bytes needed for said UTF-8 encoding.
OTOH, there's precedent for following SQL spec nonsense (like sum([])
returning null). Is this specifically a Postgres function or is it part of the SQL spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postgres and MySQL have bit_length, they both also have one called 'octet_length', which returns bytes:
https://www.w3resource.com/PostgreSQL/octet_length-function.php
I can add a separate byte_length function, so we have both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above function is a synonym for LENGTH().
Are their strings not UTF-8? I don't even know anymore.
I think I'm just going to abstain from this discussion because I increasingly feel like I don't have the background to make informed decisions about this. @jacques-n do you hold strong opinions about these functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's confusing because i've seen length defined 2 different ways. One returns the bytes, the other returns the number of characters (which is why I just updated the length function in this PR to be char_length).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd always include evaluation of the commercial databases, not just the OSS ones. E.g. Oracle and Sql Server. What do they have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacques-n SQL Server and Oracle have both bit_length and octet_length (for bytes). I've added octet_length to this PR as well.
https://docs.oracle.com/cd/B19188_01/doc/B15917/sqfunc.htm#i1005968
https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/string-functions?view=sql-server-ver16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if no one has any further input on the open bit_length
discussion.
f1cb5ac
to
6327ca5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
c43ccf8
to
adb8bf2
Compare
Just made the change so that concat is also variadic |
I think |
d42e26d
to
bbd9f5b
Compare
Updated. Thanks! |
FWIW I accepted it as 2 minimum because 1 reduces to no operation, but I don't feel strongly either way. Arguably it further generalizes to returning the empty string for 0 arguments. ETA: for compound types that would break the return type derivation, though. |
7074a25
to
be4f84f
Compare
be4f84f
to
16c982d
Compare
Rebased and added a bunch missing function argument names |
PR to add definitions for string transform functions.