Skip to content

Conversation

@ovr
Copy link
Contributor

@ovr ovr commented Jan 8, 2021

Implemented functions:

  • MD5 (return type string)
  • SHA224 (return type binary)
  • SHA256 (return type binary)
  • SHA384 (return type binary)
  • SHA512 (return type binary)

@ovr ovr force-pushed the crypto-functions branch from 4a06629 to 7f262d2 Compare January 8, 2021 17:03
@github-actions
Copy link

github-actions bot commented Jan 8, 2021

@alamb
Copy link
Contributor

alamb commented Jan 8, 2021

This PR probably needs to be rebased to pick up the fix for #9138 FYI

@ovr ovr force-pushed the crypto-functions branch 3 times, most recently from 22d0ec5 to acf8fe0 Compare January 8, 2021 19:38
@ovr ovr marked this pull request as ready for review January 8, 2021 19:38
@ovr
Copy link
Contributor Author

ovr commented Jan 8, 2021

Thank you @alamb for notice 👍

I've done with PR, marked it as ready for review and awaiting review from DF's team.

@ovr ovr force-pushed the crypto-functions branch from acf8fe0 to 14d759a Compare January 8, 2021 19:41
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ovr , thanks a lot, looks really good. 👍 I left some minor comments. :)

Comment on lines 99 to 103
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will crash on null values, no?

Binary can also be built from an iterator, afai remember.

Copy link
Contributor Author

@ovr ovr Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will crash.

Replaced this with Ok(array.iter().map(|x| x.map(|x| $FUNC(x))).collect()) without calling as_slice() directly on SHA2DigestOutput<D> and It works. Wierd... How is it possible?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should have a test of one of these functions here, with and without nulls, and with an empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I added tests for nulls and empty strings by SQL execution.

@ovr ovr force-pushed the crypto-functions branch from 0cb4653 to 3baf93c Compare January 9, 2021 12:58
@ovr
Copy link
Contributor Author

ovr commented Jan 9, 2021

Thank you, @jorgecarleitao, for your review. I've added tests + fix handling null values.

I've compared it with PostgreSQL, and It works as expected.

select
        md5('tom') AS md5_tom,
        md5('') AS md5_empty_str,
        md5(null) AS md5_null,
        encode(sha224('tom'), 'hex') AS sha224_tom,
        encode(sha224(''), 'hex') AS sha224_empty_str,
        sha224(null) AS sha224_null;
[
  {
    "md5_tom": "34b7da764b21d298ef307d04d8152dc5",
    "md5_empty_str": "d41d8cd98f00b204e9800998ecf8427e",
    "md5_null": null,
    "sha224_tom": "0bf6cb62649c42a9ae3876ab6f6d92ad36cb5414e495f8873292be4d",
    "sha224_empty_str": "d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f",
    "sha224_null": null
  }
]

Thanks

pin-project-lite= "^0.2.0"
tokio = { version = "0.2", features = ["macros", "blocking", "rt-core", "rt-threaded", "sync"] }
log = "^0.4"
md-5 = "^0.9.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me like we might want to start offering a way to keep the number of required dependencies of DataFusion down. For example, in this case we could potentially put the use of crypto functions behind a feature flag.

I am not proposing to add the feature flag as part of this PR, but more like trying to set the general direction of allowing users to pick features that they need and not have to pay compilation time (or binary size) cost for those they don't

What do you think @jorgecarleitao and @andygrove

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree with you, @alamb. In this case, we want to support posgres dialect, so it makes sense to support these functions (and not implement these ourselves, as they are even security related).

In general, as long as the crates are small, I do not see a major issue. Our expensive dependencies are Tokio, crossbeam, etc, specially because they really increase the compile time (e.g. compared to the arrow crate).

We already offer a scalar UDF that has the same performance as our own expressions. So, I think that this is the most we can do here.

Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
}
BuiltinScalarFunction::Rtrim => {
BuiltinScalarFunction::Upper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice cleanup

@codecov-io
Copy link

Codecov Report

Merging #9139 (871e18c) into master (08cccd6) will decrease coverage by 0.04%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9139      +/-   ##
==========================================
- Coverage   81.81%   81.77%   -0.05%     
==========================================
  Files         214      215       +1     
  Lines       51373    51461      +88     
==========================================
+ Hits        42033    42083      +50     
- Misses       9340     9378      +38     
Impacted Files Coverage Δ
rust/datafusion/src/logical_plan/expr.rs 76.92% <ø> (ø)
rust/datafusion/src/physical_plan/mod.rs 86.00% <ø> (ø)
rust/arrow/src/util/display.rs 48.33% <33.33%> (-0.79%) ⬇️
rust/datafusion/src/physical_plan/functions.rs 74.82% <60.00%> (-3.69%) ⬇️
...datafusion/src/physical_plan/crypto_expressions.rs 75.00% <75.00%> (ø)
rust/datafusion/tests/sql.rs 99.82% <100.00%> (+<0.01%) ⬆️
rust/datafusion/src/datasource/csv.rs 65.00% <0.00%> (-16.25%) ⬇️
rust/datafusion/src/datasource/parquet.rs 94.77% <0.00%> (-1.44%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08cccd6...871e18c. Read the comment docs.

@mqy
Copy link
Contributor

mqy commented Jan 11, 2021

Perhaps the PR description should be rewritten before merge.

@ovr ovr changed the title ARROW-11188: [Rust] Support crypto functions from PostgreSQL dialect … ARROW-11188: [Rust] Support crypto functions from PostgreSQL dialect Jan 11, 2021
@alamb
Copy link
Contributor

alamb commented Jan 11, 2021

I filed https://issues.apache.org/jira/browse/ARROW-11214 to track the feature flag idea

@alamb alamb closed this in 6da7718 Jan 11, 2021
@alamb
Copy link
Contributor

alamb commented Jan 11, 2021

Thanks again for the contribution @ovr !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants