ARROW-11188: [Rust] Support crypto functions from PostgreSQL dialect #9139
This PR probably needs to be rebased to pick up the fix for #9138, FYI.
22d0ec5 to acf8fe0
Thank you @alamb for the notice 👍 I've finished the PR, marked it as ready for review, and am awaiting review from the DataFusion team.
jorgecarleitao
left a comment
@ovr , thanks a lot, looks really good. 👍 I left some minor comments. :)
This will crash on null values, no?
A binary array can also be built from an iterator, as far as I remember.
Yes, it will crash.
I replaced this with Ok(array.iter().map(|x| x.map(|x| $FUNC(x))).collect()), without calling as_slice() directly on SHA2DigestOutput<D>, and it works. Weird... How is that possible?
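For readers following the null-handling fix, here is a minimal sketch of the pattern, not the PR's actual code. It uses a plain slice of Option values instead of an Arrow array, and `toy_digest` is a hypothetical, non-cryptographic stand-in (FNV-1a) for the real md5/SHA-2 calls so that the example has no dependencies. The point is only that mapping inside the Option lets nulls pass through untouched.

```rust
// Hypothetical stand-in for a real digest such as md5 or sha224.
// Assumption: the real code calls into a hashing crate; FNV-1a is used here
// only to keep the sketch dependency-free. It is NOT cryptographic.
fn toy_digest(input: &str) -> Vec<u8> {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for byte in input.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
    }
    hash.to_be_bytes().to_vec()
}

// Apply the digest to every non-null value; a null (None) stays null.
// This mirrors the shape `iter().map(|x| x.map(f)).collect()` from the thread.
fn digest_column(values: &[Option<&str>]) -> Vec<Option<Vec<u8>>> {
    values.iter().map(|v| v.map(toy_digest)).collect()
}

fn main() {
    let column = [Some("tom"), Some(""), None];
    for (input, output) in column.iter().zip(digest_column(&column)) {
        println!("{:?} -> {:?}", input, output);
    }
}
```

Because the closure only runs on `Some` values, a null input never reaches the digest function, which is why the iterator version does not crash.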
IMO we should have a test of one of these functions here, with and without nulls, and with an empty string.
Good catch. I added tests for nulls and empty strings, run via SQL execution.
Thank you, @jorgecarleitao, for your review. I've added tests and fixed the handling of null values. I've compared the output with PostgreSQL, and it works as expected.
select
md5('tom') AS md5_tom,
md5('') AS md5_empty_str,
md5(null) AS md5_null,
encode(sha224('tom'), 'hex') AS sha224_tom,
encode(sha224(''), 'hex') AS sha224_empty_str,
sha224(null) AS sha224_null;
[
{
"md5_tom": "34b7da764b21d298ef307d04d8152dc5",
"md5_empty_str": "d41d8cd98f00b204e9800998ecf8427e",
"md5_null": null,
"sha224_tom": "0bf6cb62649c42a9ae3876ab6f6d92ad36cb5414e495f8873292be4d",
"sha224_empty_str": "d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f",
"sha224_null": null
}
]
Thanks
| pin-project-lite= "^0.2.0" | ||
| tokio = { version = "0.2", features = ["macros", "blocking", "rt-core", "rt-threaded", "sync"] } | ||
| log = "^0.4" | ||
| md-5 = "^0.9.1" |
It seems to me like we might want to start offering a way to keep the number of required dependencies of DataFusion down. For example, in this case we could potentially put the use of crypto functions behind a feature flag.
I am not proposing to add the feature flag as part of this PR; I am trying to set the general direction of allowing users to pick the features they need and not pay the compilation time (or binary size) cost for those they don't.
What do you think, @jorgecarleitao and @andygrove?
I generally agree with you, @alamb. In this case, we want to support the Postgres dialect, so it makes sense to support these functions (and not to implement them ourselves, as they are security related).
In general, as long as the crates are small, I do not see a major issue. Our expensive dependencies are Tokio, crossbeam, etc., especially because they really increase the compile time (e.g. compared to the arrow crate).
We already offer a scalar UDF that has the same performance as our own expressions. So, I think that this is the most we can do here.
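To make the feature-flag idea above concrete, here is a rough sketch of how a build could refuse crypto functions when an optional feature is off. Everything in it is an assumption for illustration: the feature name `crypto_expressions` and the helper `check_function_available` are hypothetical and do not come from this PR, and only the two function names shown in this thread (md5, sha224) are listed.

```rust
// Hypothetical feature name; the thread does not settle on one.
// `cfg!` evaluates to a compile-time boolean, so this compiles whether or
// not the feature is enabled.
const CRYPTO_ENABLED: bool = cfg!(feature = "crypto_expressions");

/// Returns Ok(()) if the named SQL function is available in this build.
/// Hypothetical helper, not part of DataFusion.
fn check_function_available(name: &str) -> Result<(), String> {
    let is_crypto = matches!(name, "md5" | "sha224");
    if is_crypto && !CRYPTO_ENABLED {
        return Err(format!(
            "function '{}' requires the 'crypto_expressions' feature",
            name
        ));
    }
    Ok(())
}

fn main() {
    for name in ["upper", "md5"] {
        println!("{}: {:?}", name, check_function_available(name));
    }
}
```

In a real crate the hashing dependencies would also need to be declared optional so that disabling the feature actually removes them from the build; this sketch only covers the lookup side.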
| Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8]) | ||
| } | ||
| BuiltinScalarFunction::Rtrim => { | ||
| BuiltinScalarFunction::Upper |
👍 nice cleanup
Codecov Report
@@ Coverage Diff @@
## master #9139 +/- ##
==========================================
- Coverage 81.81% 81.77% -0.05%
==========================================
Files 214 215 +1
Lines 51373 51461 +88
==========================================
+ Hits 42033 42083 +50
- Misses 9340 9378 +38
Perhaps the PR description should be rewritten before merge.
I filed https://issues.apache.org/jira/browse/ARROW-11214 to track the feature flag idea.
Thanks again for the contribution @ovr ! |
Implemented functions: