
Conversation

@seddonm1 (Contributor)

This PR implements some of the basic string functions as per the ANSI SQL specification. To fully meet the ANSI specification, work will need to be done on the `sqlparser` to support the verbose style that the ANSI spec uses, such as:

```sql
trim(both 'xyz' from 'yxTomxx')
```

@jorgecarleitao (Member) left a comment:


Overall, this looks really good. Thanks a lot for taking the time to implement these.

I left a lot of comments, but they are all pretty small, so please do not take their number by the correctness of this: you understood and used all the APIs really well, they are just considering edge cases and small consistency improvements.

Comment on lines 128 to 136
```rust
let mut builder = StringBuilder::new(args.len());
for index in 0..args[0].len() {
    if string_args.is_null(index) {
        builder.append_null()?;
    } else {
        builder.append_value(&string_args.value(index).$FUNC())?;
    }
}
Ok(builder.finish())
```
@jorgecarleitao (Member):

The Arrow crate implements efficient `IntoIterator` and `FromIterator` that generally make the code simpler to read and more performant because they perform fewer bounds checks. I.e. something like

```rust
// the first map walks the iterator, the second maps over the `Option<_>`
string_args.iter().map(|x| x.map(|x| x.$FUNC())).collect()
```

will probably work.
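As a stdlib-only sketch of that pattern (illustrative, not the arrow API: a `Vec<Option<String>>` stands in for a `StringArray`, and `lower_all` is a hypothetical helper), the null-preserving double `map` looks like:

```rust
// Null-preserving map over a "column": the outer map walks the iterator,
// the inner map applies the function only to non-null entries, so nulls
// pass through untouched and no manual null checks are needed.
fn lower_all(values: &[Option<&str>]) -> Vec<Option<String>> {
    values.iter().map(|v| v.map(|s| s.to_lowercase())).collect()
}

fn main() {
    let out = lower_all(&[Some("TOM"), None, Some("José")]);
    assert_eq!(out, vec![Some("tom".to_string()), None, Some("josé".to_string())]);
}
```

In arrow, the final `collect()` would build the output array directly via its `FromIterator` implementation.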

@seddonm1 (Contributor, Author):

This did work well, but I have struggled to make it work with code that supports both Utf8 and LargeUtf8 types, as the code does now. Maybe you could help here.

@jorgecarleitao (Member):

The trick is to use the generic `GenericStringArray`, of which `StringArray` and `LargeStringArray` are concrete types. Something like

```rust
fn op<T: StringOffsetSizeTrait>(array: &dyn Array) -> GenericStringArray<T> {
    let array = array
        .as_any()
        .downcast_ref::<GenericStringArray<T>>()
        .unwrap();

    array.iter().map(|x| x.map(|x| x.$FUNC())).collect()
}
```

`FromIterator` and `IntoIterator` are implemented for the generic struct, so the compiler should be able to resolve these for both `T = i32` and `T = i64`.
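A stdlib-only analogue of why this works (the names and the `Into<i64>` bound are stand-ins for `StringOffsetSizeTrait`, not the arrow API): a single generic implementation is monomorphized once per offset type.

```rust
// One generic implementation; the compiler emits a separate copy for
// each concrete offset type used at the call sites (i32 for Utf8,
// i64 for LargeUtf8 in arrow terms).
fn op<O: Copy + Into<i64>>(values: &[Option<&str>]) -> Vec<Option<String>> {
    values.iter().map(|x| x.map(|s| s.to_uppercase())).collect()
}

fn main() {
    let v = [Some("tom"), None];
    // both instantiations share the same source but are distinct code
    assert_eq!(op::<i32>(&v), op::<i64>(&v));
}
```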

@seddonm1 (Contributor, Author), Dec 20, 2020:

Thanks @jorgecarleitao. Your code makes a lot of sense and the macro is much cleaner; however, I am stuck on the next bit, which is how to pass in T. I can do it in functions::create_physical_expr like below, but this does not feel correct.

```rust
BuiltinScalarFunction::Lower => |args| match args[0].data_type() {
    DataType::Utf8 => Ok(Arc::new(string_expressions::lower::<i32>(args)?)),
    DataType::LargeUtf8 => Ok(Arc::new(string_expressions::lower::<i64>(args)?)),
    other => Err(DataFusionError::Internal(format!(
        "Unsupported data type {:?} for function lower",
        other,
    ))),
},
```

@seddonm1 (Contributor, Author):

OK, you can disregard this comment, as it is exactly how length is implemented. Technically we should not expose length as a SQL function, but we could rename the alias to char_length and character_length based on the ANSI SQL spec: https://jakewheat.github.io/sql-overview/sql-2016-foundation-grammar.html#char-length-expression

```rust
}
BuiltinScalarFunction::Concat => Signature::Variadic(vec![DataType::Utf8]),
BuiltinScalarFunction::CharacterLength => {
    Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
```
@jorgecarleitao (Member):

The signature states that LargeUtf8 is supported, but the implementation only supports Utf8.

If we only use DataType::Utf8 here, the planner will coerce any LargeUtf8 to Utf8. :)
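A minimal sketch of that coercion idea (the `DataType`, `Uniform`, and `coerce` names here are simplified stand-ins, not DataFusion's actual Signature API): when a uniform signature lists only Utf8, any other string argument falls back to the first listed type.

```rust
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
}

// Simplified stand-in for Signature::Uniform: every argument must be
// coerced to one of the listed valid types.
struct Uniform {
    valid: Vec<DataType>,
}

fn coerce(arg: &DataType, sig: &Uniform) -> DataType {
    if sig.valid.contains(arg) {
        arg.clone() // already supported, keep as-is
    } else {
        sig.valid[0].clone() // planner coerces to the first listed type
    }
}

fn main() {
    let sig = Uniform { valid: vec![DataType::Utf8] };
    // a LargeUtf8 argument is coerced down to Utf8
    assert_eq!(coerce(&DataType::LargeUtf8, &sig), DataType::Utf8);
}
```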

@seddonm1 (Contributor, Author):

I have reworked the macros significantly so that functions now support both Utf8 and LargeUtf8 going forward. It would be trivial to add more functions like ltrim that support both types.

Comment on lines 1834 to 1839
```rust
let sql = "SELECT
char_length('josé') AS char_length
,character_length('josé') AS character_length
,lower('TOM') AS lower
,upper('tom') AS upper
,trim(' tom ') AS trim
```
@jorgecarleitao (Member):

I think that this does not cover the null cases.

@seddonm1 (Contributor, Author):

Yes.

I have raised https://issues.apache.org/jira/browse/ARROW-10970 to provide the ability to support SQL like:

```sql
SELECT char_length(NULL) AS char_length_null
```

so I can see if I can add that too.

@seddonm1 (Contributor, Author):

@andygrove, copying you in as there is a decision to make:

I have now added the NULL value to both the test cases and the planner.

This is where things get interesting. For this statement:

```sql
SELECT NULL
```

Spark implements a special NullType for this return type, but that creates a lot of side effects: things like the Parquet writer and JDBC drivers do not support this type.

I tested Postgres:

```sql
CREATE TABLE test AS
SELECT NULL;
```

The DDL for this table shows that column as a text type, which is why I have applied the default Utf8 type to Value(Null).
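A sketch of the typing rule just described (`SqlValue`, `DataType`, and `literal_type` are simplified stand-ins for the planner's types, not DataFusion code): an untyped NULL literal defaults to Utf8, matching Postgres resolving it to text.

```rust
#[derive(Debug, PartialEq)]
enum DataType {
    Utf8,
    Int64,
}

enum SqlValue {
    Null,
    Number(i64),
    Text(String),
}

fn literal_type(v: &SqlValue) -> DataType {
    match v {
        // an untyped NULL gets the default Utf8 type, like Postgres's text
        SqlValue::Null => DataType::Utf8,
        SqlValue::Number(_) => DataType::Int64,
        SqlValue::Text(_) => DataType::Utf8,
    }
}

fn main() {
    assert_eq!(literal_type(&SqlValue::Null), DataType::Utf8);
}
```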

@seddonm1 (Contributor, Author)

> Overall, this looks really good. Thanks a lot for taking the time to implement these.
>
> I left a lot of comments, but they are all pretty small, so please do not take their number by the correctness of this: you understood and used all the APIs really well, they are just considering edge cases and small consistency improvements.

Thanks @jorgecarleitao and no problem with the number of comments.

I will work through these and let you know.

@andygrove (Member)

This looks great. Thanks @seddonm1

@codecov-io

codecov-io commented Dec 20, 2020

Codecov Report

Merging #8966 (2f26b23) into master (519e9da) will decrease coverage by 0.09%.
The diff coverage is 74.46%.


```diff
@@            Coverage Diff             @@
##           master    #8966      +/-   ##
==========================================
- Coverage   83.26%   83.16%   -0.10%
==========================================
  Files         196      200       +4
  Lines       48192    48992     +800
==========================================
+ Hits        40125    40743     +618
- Misses       8067     8249     +182
```
| Impacted Files | Coverage Δ |
|---|---|
| rust/datafusion/src/logical_plan/expr.rs | 76.76% <ø> (ø) |
| rust/datafusion/src/physical_plan/functions.rs | 79.34% <65.71%> (-2.68%) ⬇️ |
| ...datafusion/src/physical_plan/string_expressions.rs | 87.50% <100.00%> (+1.78%) ⬆️ |
| rust/datafusion/src/sql/planner.rs | 84.81% <100.00%> (+3.08%) ⬆️ |
| rust/datafusion/tests/sql.rs | 99.83% <100.00%> (+<0.01%) ⬆️ |
| rust/arrow/src/compute/kernels/arithmetic.rs | 89.79% <0.00%> (-9.23%) ⬇️ |
| rust/arrow/src/compute/kernels/sort.rs | 93.42% <0.00%> (-2.34%) ⬇️ |
| rust/datafusion/src/optimizer/filter_push_down.rs | 97.65% <0.00%> (-1.74%) ⬇️ |
| rust/datafusion/src/logical_plan/builder.rs | 90.10% <0.00%> (-1.00%) ⬇️ |
| rust/arrow/src/compute/util.rs | 98.93% <0.00%> (-0.44%) ⬇️ |

... and 22 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5819943...2f26b23.

@seddonm1 (Contributor, Author)

@jorgecarleitao Thanks for your comments (they really help me learn); I have done a major refactor.

Please pay close attention to the comments here: #8966 (comment) as I do not want to make decisions like that on my own.

@jorgecarleitao (Member) left a comment:

Thanks a lot, @seddonm1 , really clean implementation.

@seddonm1 (Contributor, Author)

> Thanks a lot, @seddonm1 , really clean implementation.

Thanks @jorgecarleitao and thanks for your patience 👍

@jorgecarleitao (Member)

@seddonm1 , the API for built-in functions is relatively new and WIP. If you felt that it did not suit the needs or that it could be simpler / easier to use, please raise that concern. We anticipate that it will be more used as time progresses, and it is useful to check its design from time to time to make sure that its assumptions still hold.

@Dandandan (Contributor) left a comment:

LGTM. Thanks @seddonm1 , great additions!

@seddonm1 (Contributor, Author)

> @seddonm1 , the API for built-in functions is relatively new and WIP. If you felt that it did not suit the needs or that it could be simpler / easier to use, please raise that concern. We anticipate that it will be more used as time progresses, and it is useful to check its design from time to time to make sure that its assumptions still hold.

The instructions in the README are definitely enough information to easily add these kinds of functions (I updated them slightly), and I think the API (like the Variadic and Uniform signatures) is really intuitive, so I think you did a great job.

The big question at a project level is which dialect of SQL to support. Adding ltrim, for example, is trivial, but that is a Postgres-specific function: https://www.postgresql.org/docs/13/functions-string.html#FUNCTIONS-STRING-OTHER (it's in lots of other implementations too). If we decide on Postgres compatibility, then the existing DataFusion concat implementation already does not align with Postgres in how it handles NULL.
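To make that divergence concrete, here is a stdlib-only sketch of the two NULL semantics (the helper names are illustrative, not DataFusion code): Postgres's `concat` treats NULL arguments as empty strings, whereas null-propagating concatenation (the semantics of SQL `||`) returns NULL if any input is NULL.

```rust
// Postgres-style concat: NULL arguments are simply skipped.
fn concat_pg(args: &[Option<&str>]) -> String {
    args.iter().flatten().copied().collect()
}

// Null-propagating concat: any NULL input makes the whole result NULL.
fn concat_propagating(args: &[Option<&str>]) -> Option<String> {
    args.iter()
        .copied()
        .collect::<Option<Vec<_>>>()
        .map(|parts| parts.concat())
}

fn main() {
    let args = [Some("abc"), None, Some("def")];
    assert_eq!(concat_pg(&args), "abcdef");
    assert_eq!(concat_propagating(&args), None);
}
```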

Dandandan pushed a commit to Dandandan/arrow that referenced this pull request Dec 22, 2020
…ions

This PR implements some of the basic string functions as per the ANSI SQL specification. To properly meet the ANSI specification work will need to be done on the `sqlparser` to support the verbose style that the ANSI spec has such as
```sql
trim(both 'xyz' from 'yxTomxx')
```

Closes apache#8966 from seddonm1/basic-string-functions

Authored-by: Mike Seddon <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
@seddonm1 seddonm1 deleted the basic-string-functions branch January 4, 2021 22:48