ARROW-11188: [Rust] Support crypto functions from PostgreSQL dialect #9139
New file (`@@ -0,0 +1,106 @@`):

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

//! Crypto expressions

use md5::Md5;
use sha2::{
    digest::Output as SHA2DigestOutput, Digest as SHA2Digest, Sha224, Sha256, Sha384,
    Sha512,
};

use crate::error::{DataFusionError, Result};
use arrow::array::{
    ArrayRef, GenericBinaryArray, GenericStringArray, StringOffsetSizeTrait,
};

fn md5_process(input: &str) -> String {
    let mut digest = Md5::default();
    digest.update(&input);

    let mut result = String::new();

    for byte in &digest.finalize() {
        result.push_str(&format!("{:02x}", byte));
    }

    result
}

// Returns the owned digest output: a `&[u8]` cannot be returned because it
// would borrow from a value local to this function
fn sha_process<D: SHA2Digest + Default>(input: &str) -> SHA2DigestOutput<D> {
    let mut digest = D::default();
    digest.update(&input);

    digest.finalize()
}

macro_rules! crypto_unary_string_function {
    ($NAME:ident, $FUNC:expr) => {
        /// crypto function that accepts Utf8 or LargeUtf8 and returns Utf8 string
        pub fn $NAME<T: StringOffsetSizeTrait>(
            args: &[ArrayRef],
        ) -> Result<GenericStringArray<i32>> {
            if args.len() != 1 {
                return Err(DataFusionError::Internal(format!(
                    "{:?} args were supplied but {} takes exactly one argument",
                    args.len(),
                    String::from(stringify!($NAME)),
                )));
            }

            let array = args[0]
                .as_any()
                .downcast_ref::<GenericStringArray<T>>()
                .unwrap();

            // first map is the iterator, second is for the `Option<_>`
            Ok(array.iter().map(|x| x.map(|x| $FUNC(x))).collect())
        }
    };
}

macro_rules! crypto_unary_binary_function {
    ($NAME:ident, $FUNC:expr) => {
        /// crypto function that accepts Utf8 or LargeUtf8 and returns Binary
        pub fn $NAME<T: StringOffsetSizeTrait>(
            args: &[ArrayRef],
        ) -> Result<GenericBinaryArray<i32>> {
            if args.len() != 1 {
                return Err(DataFusionError::Internal(format!(
                    "{:?} args were supplied but {} takes exactly one argument",
                    args.len(),
                    String::from(stringify!($NAME)),
                )));
            }

            let array = args[0]
                .as_any()
                .downcast_ref::<GenericStringArray<T>>()
                .unwrap();

            // first map is the iterator, second is for the `Option<_>`
            Ok(array.iter().map(|x| x.map(|x| $FUNC(x))).collect())
        }
    };
}

crypto_unary_string_function!(md5, md5_process);
crypto_unary_binary_function!(sha224, sha_process::<Sha224>);
crypto_unary_binary_function!(sha256, sha_process::<Sha256>);
crypto_unary_binary_function!(sha384, sha_process::<Sha384>);
crypto_unary_binary_function!(sha512, sha_process::<Sha512>);
```
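Two patterns in the new file, hex-encoding the digest bytes and mapping a function over nullable values, can be tried out without the `md5`, `sha2`, or `arrow` crates. The following is a minimal sketch using only the standard library; `to_hex` and `map_nullable` are hypothetical names, and a slice of `Option<&str>` stands in for an Arrow string array.

```rust
use std::fmt::Write;

/// Hex-encode bytes as lowercase, two digits per byte
/// (the same `{:02x}` format string that `md5_process` uses).
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        // Writing into a String cannot fail, so the result is ignored.
        let _ = write!(out, "{:02x}", byte);
    }
    out
}

/// Apply `f` to every non-null value and keep nulls as `None`.
/// The outer `map` walks the values; the inner `map` acts on the `Option`.
fn map_nullable<F, R>(values: &[Option<&str>], f: F) -> Vec<Option<R>>
where
    F: Fn(&str) -> R,
{
    values.iter().map(|x| x.map(|s| f(s))).collect()
}
```

For instance, `map_nullable(&[Some("ab"), None], |s| to_hex(s.as_bytes()))` hex-encodes the first entry and leaves the second as `None`, which mirrors how the macros above propagate nulls.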
```diff
@@ -35,6 +35,7 @@ use super::{
 };
 use crate::error::{DataFusionError, Result};
 use crate::physical_plan::array_expressions;
+use crate::physical_plan::crypto_expressions;
 use crate::physical_plan::datetime_expressions;
 use crate::physical_plan::expressions::{nullif_func, SUPPORTED_NULLIF_TYPES};
 use crate::physical_plan::math_expressions;
```
```diff
@@ -136,6 +137,16 @@ pub enum BuiltinScalarFunction {
     NullIf,
     /// Date truncate
     DateTrunc,
+    /// MD5
+    MD5,
+    /// SHA224
+    SHA224,
+    /// SHA256
+    SHA256,
+    /// SHA384
+    SHA384,
+    /// SHA512
+    SHA512,
 }
 
 impl fmt::Display for BuiltinScalarFunction {
```
```diff
@@ -179,6 +190,11 @@ impl FromStr for BuiltinScalarFunction {
             "date_trunc" => BuiltinScalarFunction::DateTrunc,
             "array" => BuiltinScalarFunction::Array,
             "nullif" => BuiltinScalarFunction::NullIf,
+            "md5" => BuiltinScalarFunction::MD5,
+            "sha224" => BuiltinScalarFunction::SHA224,
+            "sha256" => BuiltinScalarFunction::SHA256,
+            "sha384" => BuiltinScalarFunction::SHA384,
+            "sha512" => BuiltinScalarFunction::SHA512,
             _ => {
                 return Err(DataFusionError::Plan(format!(
                     "There is no built-in function named {}",
```
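The name lookup in the hunk above relies on `FromStr`. A minimal, std-only sketch of the same idea follows; `CryptoFunction` is a hypothetical stand-in for the crypto variants of `BuiltinScalarFunction`, and a plain `String` replaces `DataFusionError::Plan`.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the crypto variants of `BuiltinScalarFunction`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CryptoFunction {
    MD5,
    SHA224,
    SHA256,
    SHA384,
    SHA512,
}

impl FromStr for CryptoFunction {
    // Assumption: a String stands in for the planner error type.
    type Err = String;

    fn from_str(name: &str) -> Result<Self, Self::Err> {
        Ok(match name {
            "md5" => CryptoFunction::MD5,
            "sha224" => CryptoFunction::SHA224,
            "sha256" => CryptoFunction::SHA256,
            "sha384" => CryptoFunction::SHA384,
            "sha512" => CryptoFunction::SHA512,
            // Unknown names are reported to the caller instead of panicking.
            _ => {
                return Err(format!(
                    "There is no built-in function named {}",
                    name
                ))
            }
        })
    }
}
```

With this in place, `"sha256".parse::<CryptoFunction>()` yields the matching variant, while an unregistered name such as `"sha1"` yields an error. The match in this sketch is case-sensitive.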
```diff
@@ -288,6 +304,56 @@ pub fn return_type(
             let coerced_types = data_types(arg_types, &signature(fun));
             coerced_types.map(|typs| typs[0].clone())
         }
+        BuiltinScalarFunction::MD5 => Ok(match arg_types[0] {
+            DataType::LargeUtf8 => DataType::LargeUtf8,
+            DataType::Utf8 => DataType::Utf8,
+            _ => {
+                // this error is internal as `data_types` should have captured this.
+                return Err(DataFusionError::Internal(
+                    "The md5 function can only accept strings.".to_string(),
+                ));
+            }
+        }),
+        BuiltinScalarFunction::SHA224 => Ok(match arg_types[0] {
+            DataType::LargeUtf8 => DataType::Binary,
+            DataType::Utf8 => DataType::Binary,
+            _ => {
+                // this error is internal as `data_types` should have captured this.
+                return Err(DataFusionError::Internal(
+                    "The sha224 function can only accept strings.".to_string(),
+                ));
+            }
+        }),
+        BuiltinScalarFunction::SHA256 => Ok(match arg_types[0] {
+            DataType::LargeUtf8 => DataType::Binary,
+            DataType::Utf8 => DataType::Binary,
+            _ => {
+                // this error is internal as `data_types` should have captured this.
+                return Err(DataFusionError::Internal(
+                    "The sha256 function can only accept strings.".to_string(),
+                ));
+            }
+        }),
+        BuiltinScalarFunction::SHA384 => Ok(match arg_types[0] {
+            DataType::LargeUtf8 => DataType::Binary,
+            DataType::Utf8 => DataType::Binary,
+            _ => {
+                // this error is internal as `data_types` should have captured this.
+                return Err(DataFusionError::Internal(
+                    "The sha384 function can only accept strings.".to_string(),
+                ));
+            }
+        }),
+        BuiltinScalarFunction::SHA512 => Ok(match arg_types[0] {
+            DataType::LargeUtf8 => DataType::Binary,
+            DataType::Utf8 => DataType::Binary,
+            _ => {
+                // this error is internal as `data_types` should have captured this.
+                return Err(DataFusionError::Internal(
+                    "The sha512 function can only accept strings.".to_string(),
+                ));
+            }
+        }),
         _ => Ok(DataType::Float64),
     }
 }
```
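The four SHA arms of `return_type` above differ only in the function name used in the error message. As an illustration, and not as the PR's code, the mapping could be written once. This sketch assumes a simplified `DataType` enum and a hypothetical helper `sha_return_type`; it uses `first()` rather than indexing, so an empty argument list produces an error instead of a panic.

```rust
/// Simplified stand-in for Arrow's `DataType` (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Binary,
    Int64,
}

/// Hypothetical helper: every SHA variant maps a string input to `Binary`.
fn sha_return_type(name: &str, arg_types: &[DataType]) -> Result<DataType, String> {
    match arg_types.first() {
        Some(DataType::Utf8) | Some(DataType::LargeUtf8) => Ok(DataType::Binary),
        // Covers both non-string inputs and a missing argument.
        _ => Err(format!("The {} function can only accept strings.", name)),
    }
}
```

A call such as `sha_return_type("sha256", &[DataType::LargeUtf8])` resolves to `Binary`, while an `Int64` argument or an empty slice is rejected with the same message format as in the diff.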
```diff
@@ -318,6 +384,46 @@ pub fn create_physical_expr(
         BuiltinScalarFunction::Abs => math_expressions::abs,
         BuiltinScalarFunction::Signum => math_expressions::signum,
         BuiltinScalarFunction::NullIf => nullif_func,
+        BuiltinScalarFunction::MD5 => |args| match args[0].data_type() {
+            DataType::Utf8 => Ok(Arc::new(crypto_expressions::md5::<i32>(args)?)),
+            DataType::LargeUtf8 => Ok(Arc::new(crypto_expressions::md5::<i64>(args)?)),
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function md5",
+                other,
+            ))),
+        },
+        BuiltinScalarFunction::SHA224 => |args| match args[0].data_type() {
+            DataType::Utf8 => Ok(Arc::new(crypto_expressions::sha224::<i32>(args)?)),
+            DataType::LargeUtf8 => Ok(Arc::new(crypto_expressions::sha224::<i64>(args)?)),
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function sha224",
+                other,
+            ))),
+        },
+        BuiltinScalarFunction::SHA256 => |args| match args[0].data_type() {
+            DataType::Utf8 => Ok(Arc::new(crypto_expressions::sha256::<i32>(args)?)),
+            DataType::LargeUtf8 => Ok(Arc::new(crypto_expressions::sha256::<i64>(args)?)),
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function sha256",
+                other,
+            ))),
+        },
+        BuiltinScalarFunction::SHA384 => |args| match args[0].data_type() {
+            DataType::Utf8 => Ok(Arc::new(crypto_expressions::sha384::<i32>(args)?)),
+            DataType::LargeUtf8 => Ok(Arc::new(crypto_expressions::sha384::<i64>(args)?)),
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function sha384",
+                other,
+            ))),
+        },
+        BuiltinScalarFunction::SHA512 => |args| match args[0].data_type() {
+            DataType::Utf8 => Ok(Arc::new(crypto_expressions::sha512::<i32>(args)?)),
+            DataType::LargeUtf8 => Ok(Arc::new(crypto_expressions::sha512::<i64>(args)?)),
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function sha512",
+                other,
+            ))),
+        },
         BuiltinScalarFunction::Length => |args| Ok(length(args[0].as_ref())?),
         BuiltinScalarFunction::Concat => {
             |args| Ok(Arc::new(string_expressions::concatenate(args)?))
```
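Each closure in the hunk above turns a runtime `DataType` into a compile-time offset type (`i32` for `Utf8`, `i64` for `LargeUtf8`). The mechanism can be sketched in isolation with the standard library only. Here `DataType` is a simplified stand-in, and `offset_width` is a hypothetical generic function that plays the role of a kernel such as `crypto_expressions::md5::<T>`.

```rust
/// Simplified stand-in for Arrow's `DataType` (assumption for this sketch).
#[derive(Debug, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Int64,
}

/// Hypothetical generic "kernel": reports the size in bytes of the offset type.
fn offset_width<T>() -> usize {
    std::mem::size_of::<T>()
}

/// Select the monomorphized kernel from the runtime data type.
fn dispatch(data_type: &DataType, name: &str) -> Result<usize, String> {
    match data_type {
        DataType::Utf8 => Ok(offset_width::<i32>()),
        DataType::LargeUtf8 => Ok(offset_width::<i64>()),
        other => Err(format!(
            "Unsupported data type {:?} for function {}",
            other, name
        )),
    }
}
```

The match is the only place where the runtime type is inspected; everything behind each arm is compiled for one concrete offset type.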
```diff
@@ -392,23 +498,18 @@ fn signature(fun: &BuiltinScalarFunction) -> Signature {
 
     // for now, the list is small, as we do not have many built-in functions.
     match fun {
-        BuiltinScalarFunction::Length => {
-            Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
-        }
         BuiltinScalarFunction::Concat => Signature::Variadic(vec![DataType::Utf8]),
-        BuiltinScalarFunction::Lower => {
-            Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
-        }
-        BuiltinScalarFunction::Upper => {
-            Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
-        }
-        BuiltinScalarFunction::Trim => {
-            Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
-        }
-        BuiltinScalarFunction::Ltrim => {
-            Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
-        }
-        BuiltinScalarFunction::Rtrim => {
+        BuiltinScalarFunction::Upper
```

Contributor: 👍 nice cleanup

```diff
+        | BuiltinScalarFunction::Lower
+        | BuiltinScalarFunction::Length
+        | BuiltinScalarFunction::Trim
+        | BuiltinScalarFunction::Ltrim
+        | BuiltinScalarFunction::Rtrim
+        | BuiltinScalarFunction::MD5
+        | BuiltinScalarFunction::SHA224
+        | BuiltinScalarFunction::SHA256
+        | BuiltinScalarFunction::SHA384
+        | BuiltinScalarFunction::SHA512 => {
             Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
         }
         BuiltinScalarFunction::ToTimestamp => Signature::Uniform(1, vec![DataType::Utf8]),
```
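The cleanup in this hunk replaces several identical arms with a single or-pattern. A reduced, std-only sketch of the same shape is given below; `Func`, `DataType`, and `Signature` are simplified stand-ins for the DataFusion types, not their real definitions.

```rust
/// Simplified stand-ins for the DataFusion types (assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
}

#[derive(Debug, PartialEq)]
enum Signature {
    Uniform(usize, Vec<DataType>),
    Variadic(Vec<DataType>),
}

enum Func {
    Upper,
    Lower,
    Md5,
    Sha256,
    Concat,
}

fn signature(fun: &Func) -> Signature {
    match fun {
        // One arm covers every function that takes a single string argument.
        Func::Upper | Func::Lower | Func::Md5 | Func::Sha256 => {
            Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
        }
        Func::Concat => Signature::Variadic(vec![DataType::Utf8]),
    }
}
```

Adding another single-string function then only requires one more alternative in the or-pattern, which is what the PR does for the five crypto variants.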
It seems to me that we might want to start offering a way to keep the number of required DataFusion dependencies down. For example, in this case we could potentially put the crypto functions behind a feature flag.
I am not proposing to add the feature flag as part of this PR; I am trying to set the general direction of letting users pick the features they need without paying the compile-time (or binary-size) cost of those they don't.
What do you think, @jorgecarleitao and @andygrove?
I generally agree with you, @alamb. In this case, we want to support the Postgres dialect, so it makes sense to support these functions (and not to implement them ourselves, since they are security-related).
In general, as long as the crates are small, I do not see a major issue. Our expensive dependencies are Tokio, crossbeam, etc., especially because they significantly increase compile time (e.g. compared to the arrow crate).
We already offer a scalar UDF that has the same performance as our own expressions, so I think that is the most we can do here.
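For reference, the feature-flag idea raised in this thread could look roughly like the sketch below. It is not part of the PR: the feature name `crypto_expressions`, the `crypto` module, and `is_builtin` are all hypothetical, and in a real crate the gated module would be the only place that imports `md5` and `sha2`, so those dependencies could be made optional.

```rust
/// Compiled only when the (hypothetical) `crypto_expressions` feature is enabled.
#[cfg(feature = "crypto_expressions")]
mod crypto {
    pub fn supported() -> &'static [&'static str] {
        &["md5", "sha224", "sha256", "sha384", "sha512"]
    }
}

/// Fallback when the feature is disabled: no crypto functions are registered.
#[cfg(not(feature = "crypto_expressions"))]
mod crypto {
    pub fn supported() -> &'static [&'static str] {
        &[]
    }
}

/// Hypothetical lookup: a few always-available functions plus the gated ones.
fn is_builtin(name: &str) -> bool {
    matches!(name, "upper" | "lower" | "length") || crypto::supported().contains(&name)
}
```

Built without the feature, `is_builtin("md5")` is `false` and the crypto code is not compiled at all; built with it, the same call returns `true`.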