ARROW-10969: [Rust][DataFusion] Implement basic String ANSI SQL Functions #8966
jorgecarleitao
left a comment
Overall, this looks really good. Thanks a lot for taking the time to implement these.
I left a lot of comments, but they are all pretty small, so please do not judge the correctness of this PR by their number: you understood and used all the APIs really well; the comments only concern edge cases and small consistency improvements.
let mut builder = StringBuilder::new(args.len());
for index in 0..args[0].len() {
    if string_args.is_null(index) {
        builder.append_null()?;
    } else {
        builder.append_value(&string_args.value(index).$FUNC())?;
    }
}
Ok(builder.finish())
The Arrow crate implements efficient IntoIterator and FromIterator that generally make the code simpler to read and more performant because it performs fewer bounds checks. I.e. something like
string_args.iter().map(|x| x.map(|x| x.$FUNC())).collect()
// (first map is the iterator, second is for the `Option<_>`)
will probably work.
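To illustrate the pattern without depending on the Arrow crate, here is a minimal std-only sketch. A slice of `Option<&str>` stands in for a nullable string array, and `map_nullable` is a hypothetical helper name, not an Arrow API.

```rust
/// Applies `f` to every non-null value and keeps nulls as `None`.
/// Assumption: a plain slice of `Option<&str>` stands in for an Arrow string array.
fn map_nullable<F>(values: &[Option<&str>], f: F) -> Vec<Option<String>>
where
    F: Fn(&str) -> String,
{
    values
        .iter()
        // The outer `map` walks the elements; the inner `map` acts on the
        // `Option`, so `None` (null) passes through untouched.
        .map(|value| value.map(|s| f(s)))
        .collect()
}
```

Calling `map_nullable(&[Some("TOM"), None], |s| s.to_lowercase())` keeps the null in place and lowercases the other value, with no explicit index or null check.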
This did work well but I have struggled to make it work with code that supports both Utf8 and LargeUtf8 types as the code does now. Maybe you could help here.
The trick is to use the generic GenericStringArray, of which StringArray and LargeStringArray are concrete types. Something like
fn op<T: StringOffsetSizeTrait>(array: GenericStringArray<T>) -> GenericStringArray<T> {
let array = array.downcast::<GenericStringArray<T>>().unwrap();
array.iter().map(|x| x.map(|x| x.$FUNC())).collect()
}
FromIterator and IntoIterator are implemented for the generic struct, and thus the compiler should be able to resolve these for both T = i32 and T = i64.
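The idea of writing the kernel once and letting the offset width be a type parameter can be sketched with the standard library alone. Everything below is a toy model under stated assumptions: `OffsetSize` and `StringColumn` are hypothetical stand-ins for Arrow's offset trait and `GenericStringArray`, and the layout is simplified.

```rust
use std::convert::TryFrom;
use std::iter::FromIterator;

/// Hypothetical stand-in for the offset-size trait: implemented for the two
/// widths that correspond to `Utf8` (i32) and `LargeUtf8` (i64) offsets.
trait OffsetSize: Copy + TryFrom<usize> {
    fn as_usize(self) -> usize;
}
impl OffsetSize for i32 {
    fn as_usize(self) -> usize { self as usize }
}
impl OffsetSize for i64 {
    fn as_usize(self) -> usize { self as usize }
}

/// Toy string column: one shared buffer, `len + 1` offsets, one validity flag per row.
struct StringColumn<O: OffsetSize> {
    offsets: Vec<O>,
    data: String,
    validity: Vec<bool>,
}

impl<O: OffsetSize> StringColumn<O> {
    /// Iterates rows as `Option<&str>`, with `None` for nulls.
    fn iter(&self) -> impl Iterator<Item = Option<&str>> + '_ {
        (0..self.validity.len()).map(move |i| {
            if self.validity[i] {
                let start = self.offsets[i].as_usize();
                let end = self.offsets[i + 1].as_usize();
                Some(&self.data[start..end])
            } else {
                None
            }
        })
    }
}

/// Building from an iterator is implemented once for every offset width.
impl<O: OffsetSize, S: AsRef<str>> FromIterator<Option<S>> for StringColumn<O> {
    fn from_iter<I: IntoIterator<Item = Option<S>>>(iter: I) -> Self {
        let to_offset = |n: usize| O::try_from(n).ok().expect("offset overflow");
        let mut column = StringColumn {
            offsets: vec![to_offset(0)],
            data: String::new(),
            validity: Vec::new(),
        };
        for value in iter {
            if let Some(s) = &value {
                column.data.push_str(s.as_ref());
            }
            column.validity.push(value.is_some());
            column.offsets.push(to_offset(column.data.len()));
        }
        column
    }
}

/// One generic kernel; the compiler instantiates it for `i32` and `i64`.
fn upper<O: OffsetSize>(column: &StringColumn<O>) -> StringColumn<O> {
    column.iter().map(|x| x.map(|s| s.to_uppercase())).collect()
}
```

Because both the iterator and the `FromIterator` impl are generic over `O`, `upper` needs no per-type macro arm; only the call site has to pick `i32` or `i64`.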
Thanks @jorgecarleitao. Your code makes a lot of sense and the macro is much cleaner; however, I am stuck at the next bit, which is how to pass in T. I can do it in functions::create_physical_expr like below, but this does not feel correct.
BuiltinScalarFunction::Lower => |args| match args[0].data_type() {
DataType::Utf8 => Ok(Arc::new(string_expressions::lower::<i32>(args)?)),
DataType::LargeUtf8 => Ok(Arc::new(string_expressions::lower::<i64>(args)?)),
other => Err(DataFusionError::Internal(format!(
"Unsupported data type {:?} for function lower",
other,
))),
},
Ok you can disregard this comment as it is exactly how length is implemented. Technically we should not expose length as a SQL function but we could rename the alias to be char_length and character_length based on the ANSI SQL spec: https://jakewheat.github.io/sql-overview/sql-2016-foundation-grammar.html#char-length-expression
}
BuiltinScalarFunction::Concat => Signature::Variadic(vec![DataType::Utf8]),
BuiltinScalarFunction::CharacterLength => {
    Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
The signature states that LargeUtf8 is supported, but the implementation only supports Utf8.
If we only use DataType::Utf8 here, the planner will coerce any LargeUtf8 to Utf8. :)
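The coercion behaviour described here can be modelled with a small std-only sketch. This is a toy, not DataFusion's planner: the `DataType` enum, `can_cast`, and `coerce` are hypothetical names, and the cast rule is an assumption made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Int64,
}

/// Toy cast rule (an assumption for this sketch): the two string types
/// can be cast into each other, nothing else can.
fn can_cast(from: DataType, to: DataType) -> bool {
    matches!(
        (from, to),
        (DataType::Utf8, DataType::LargeUtf8) | (DataType::LargeUtf8, DataType::Utf8)
    )
}

/// Returns the type the argument would be evaluated as, given the types a
/// function's signature accepts, or `None` if no coercion applies.
fn coerce(arg: DataType, accepted: &[DataType]) -> Option<DataType> {
    if accepted.contains(&arg) {
        // Already an accepted type: no cast is inserted.
        return Some(arg);
    }
    // Otherwise pick the first accepted type the argument can be cast to.
    accepted.iter().copied().find(|&target| can_cast(arg, target))
}
```

In this model, a signature that lists only `Utf8` turns a `LargeUtf8` argument into `Utf8`, while a signature that also lists `LargeUtf8` passes it through unchanged, so the implementation must then really handle it.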
I have reworked the macros significantly to now support Utf8 and LargeUtf8 functions going forward. It would be trivial to add more functions like ltrim which support both types.
let sql = "SELECT
    char_length('josé') AS char_length
    ,character_length('josé') AS character_length
    ,lower('TOM') AS lower
    ,upper('tom') AS upper
    ,trim(' tom ') AS trim
I think that this does not cover the null cases.
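As a rough idea of what a null-aware check could look like, here is a std-only sketch. It assumes `char_length` counts characters rather than bytes and models the column as a slice of `Option<&str>`; the function is illustrative and is not the code in this PR.

```rust
/// Character count per value; nulls stay null.
/// Assumption: the column is modelled as a slice of `Option<&str>`.
fn char_length(values: &[Option<&str>]) -> Vec<Option<usize>> {
    values
        .iter()
        // `chars().count()` counts Unicode scalar values, whereas `len()`
        // would return the number of UTF-8 bytes.
        .map(|value| value.map(|s| s.chars().count()))
        .collect()
}
```

With the input `[Some("jos\u{e9}"), None, Some("")]` this yields `[Some(4), None, Some(0)]`, even though `"jos\u{e9}".len()` is 5 bytes, which is why a test value such as 'josé' is useful alongside a NULL row.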
Yes.
I have raised https://issues.apache.org/jira/browse/ARROW-10970 to provide the ability to support SQL like:
SELECT char_length(NULL) AS char_length_null
So I can see if I can add that too.
@andygrove copying you in due to decision:
I have now added the NULL value to both the test cases and the planner.
This is where things get interesting. For this statement:
SELECT NULL
Spark implements a special NullType for this return type, but that creates a lot of side effects because things like the Parquet writer and JDBC drivers do not support this type.
I tested Postgres:
CREATE TABLE test AS
SELECT NULL;
The DDL for this table shows that column as a text type, so that is why I have applied the default utf8 type to Value(Null).
Thanks @jorgecarleitao and no problem with the number of comments. I will work through these and let you know.

This looks great. Thanks @seddonm1
Codecov Report
@@ Coverage Diff @@
## master #8966 +/- ##
==========================================
- Coverage 83.26% 83.16% -0.10%
==========================================
Files 196 200 +4
Lines 48192 48992 +800
==========================================
+ Hits 40125 40743 +618
- Misses 8067 8249 +182
@jorgecarleitao Thanks for your comments (they really help me learn). I have done a major refactor. Please pay close attention to the comments here: #8966 (comment) as I do not want to make decisions like that on my own.
jorgecarleitao
left a comment
Thanks a lot, @seddonm1 , really clean implementation.
Thanks @jorgecarleitao and thanks for your patience 👍

@seddonm1 , the API for built-in functions is relatively new and WIP. If you felt that it did not suit the needs or that it could be simpler / easier to use, please raise that concern. We anticipate that it will be more used as time progresses, and it is useful to check its design from time to time to make sure that its assumptions still hold.
Dandandan
left a comment
LGTM. Thanks @seddonm1 , great additions!
The instructions in the README are definitely enough information to easily add these kinds of functions (I updated slightly) and I think the API (like the The big question at a project level is which dialect of SQL to support. Adding
…ions

This PR implements some of the basic string functions as per the ANSI SQL specification.

To properly meet the ANSI specification work will need to be done on the `sqlparser` to support the verbose style that the ANSI spec has such as

```sql
trim(both 'xyz' from 'yxTomxx')
```

Closes apache#8966 from seddonm1/basic-string-functions

Authored-by: Mike Seddon
Signed-off-by: Jorge C. Leitao