
Conversation

@seddonm1 (Contributor)

This PR implements some of the basic string functions as per the ANSI SQL specification. To fully meet the ANSI specification, work will need to be done on the `sqlparser` to support the verbose style that the ANSI spec uses, such as:

```sql
trim(both 'xyz' from 'yxTomxx')
```

@jorgecarleitao (Member) left a comment:


Overall, this looks really good. Thanks a lot for taking the time to implement these.

I left a lot of comments, but they are all pretty small, so please do not take their number by the correctness of this: you understood and used all the APIs really well, they are just considering edge cases and small consistency improvements.

Comment on lines 128 to 136
```rust
let mut builder = StringBuilder::new(args.len());
for index in 0..args[0].len() {
    if string_args.is_null(index) {
        builder.append_null()?;
    } else {
        builder.append_value(&string_args.value(index).$FUNC())?;
    }
}
Ok(builder.finish())
```
@jorgecarleitao (Member):

The Arrow crate implements efficient `IntoIterator` and `FromIterator` that generally make the code simpler to read and more performant because they perform fewer bounds checks. I.e. something like

```rust
// the first map walks the iterator, the second maps over the `Option<_>`
string_args.iter().map(|x| x.map(|x| x.$FUNC())).collect()
```

will probably work.
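As a stdlib-only sketch of that pattern (illustrative, not the arrow API: a `Vec<Option<String>>` stands in for a `StringArray`, and `lower_all` is a hypothetical helper), the null-preserving double `map` looks like:

```rust
// Null-preserving map over a "column": the outer map walks the iterator,
// the inner map applies the function only to non-null entries, so nulls
// pass through untouched and no manual null checks are needed.
fn lower_all(values: &[Option<&str>]) -> Vec<Option<String>> {
    values.iter().map(|v| v.map(|s| s.to_lowercase())).collect()
}

fn main() {
    let out = lower_all(&[Some("TOM"), None, Some("José")]);
    assert_eq!(out, vec![Some("tom".to_string()), None, Some("josé".to_string())]);
}
```

In arrow, the final `collect()` would build the output array directly via its `FromIterator` implementation.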

@seddonm1 (Contributor, Author):

This did work well, but I have struggled to make it work with code that supports both Utf8 and LargeUtf8 types, as the code does now. Maybe you could help here.

@jorgecarleitao (Member):

The trick is to use the generic `GenericStringArray`, of which `StringArray` and `LargeStringArray` are concrete types. Something like

```rust
fn op<T: StringOffsetSizeTrait>(array: &dyn Array) -> GenericStringArray<T> {
    let array = array
        .as_any()
        .downcast_ref::<GenericStringArray<T>>()
        .unwrap();

    array.iter().map(|x| x.map(|x| x.$FUNC())).collect()
}
```

`FromIterator` and `IntoIterator` are implemented for the generic struct, so the compiler should be able to resolve these for both `T = i32` and `T = i64`.
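A stdlib-only analogue of why this works (the names and the `Into<i64>` bound are stand-ins for `StringOffsetSizeTrait`, not the arrow API): a single generic implementation is monomorphized once per offset type.

```rust
// One generic implementation; the compiler emits a separate copy for
// each concrete offset type used at the call sites (i32 for Utf8,
// i64 for LargeUtf8 in arrow terms).
fn op<O: Copy + Into<i64>>(values: &[Option<&str>]) -> Vec<Option<String>> {
    values.iter().map(|x| x.map(|s| s.to_uppercase())).collect()
}

fn main() {
    let v = [Some("tom"), None];
    // both instantiations share the same source but are distinct code
    assert_eq!(op::<i32>(&v), op::<i64>(&v));
}
```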

@seddonm1 (Contributor, Author), Dec 20, 2020:

Thanks @jorgecarleitao. Your code makes a lot of sense and the macro is much cleaner; however, I am stuck on the next bit, which is how to pass in T. I can do it in functions::create_physical_expr like below, but this does not feel correct.

```rust
BuiltinScalarFunction::Lower => |args| match args[0].data_type() {
    DataType::Utf8 => Ok(Arc::new(string_expressions::lower::<i32>(args)?)),
    DataType::LargeUtf8 => Ok(Arc::new(string_expressions::lower::<i64>(args)?)),
    other => Err(DataFusionError::Internal(format!(
        "Unsupported data type {:?} for function lower",
        other,
    ))),
},
```

@seddonm1 (Contributor, Author):

OK, you can disregard this comment, as it is exactly how length is implemented. Technically we should not expose length as a SQL function, but we could rename the alias to char_length and character_length based on the ANSI SQL spec: https://jakewheat.github.io/sql-overview/sql-2016-foundation-grammar.html#char-length-expression

```rust
}
BuiltinScalarFunction::Concat => Signature::Variadic(vec![DataType::Utf8]),
BuiltinScalarFunction::CharacterLength => {
    Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
```
@jorgecarleitao (Member):

The signature states that LargeUtf8 is supported, but the implementation only supports Utf8.

If we only use DataType::Utf8 here, the planner will coerce any LargeUtf8 to Utf8. :)
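A minimal sketch of that coercion idea (the `DataType`, `Uniform`, and `coerce` names here are simplified stand-ins, not DataFusion's actual Signature API): when a uniform signature lists only Utf8, any other string argument falls back to the first listed type.

```rust
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
}

// Simplified stand-in for Signature::Uniform: every argument must be
// coerced to one of the listed valid types.
struct Uniform {
    valid: Vec<DataType>,
}

fn coerce(arg: &DataType, sig: &Uniform) -> DataType {
    if sig.valid.contains(arg) {
        arg.clone() // already supported, keep as-is
    } else {
        sig.valid[0].clone() // planner coerces to the first listed type
    }
}

fn main() {
    let sig = Uniform { valid: vec![DataType::Utf8] };
    // a LargeUtf8 argument is coerced down to Utf8
    assert_eq!(coerce(&DataType::LargeUtf8, &sig), DataType::Utf8);
}
```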

@seddonm1 (Contributor, Author):

I have reworked the macros significantly so that functions now support both Utf8 and LargeUtf8 going forward. It would be trivial to add more functions like ltrim that support both types.

Comment on lines 1834 to 1839
```rust
let sql = "SELECT
char_length('josé') AS char_length
,character_length('josé') AS character_length
,lower('TOM') AS lower
,upper('tom') AS upper
,trim(' tom ') AS trim
```
@jorgecarleitao (Member):

I think that this does not cover the null cases.

@seddonm1 (Contributor, Author):

Yes.

I have raised https://issues.apache.org/jira/browse/ARROW-10970 to provide the ability to support SQL like:

```sql
SELECT char_length(NULL) AS char_length_null
```

so I can see if I can add that too.

@seddonm1 (Contributor, Author):

@andygrove, copying you in as there is a decision to make:

I have now added the NULL value to both the test cases and the planner.

This is where things get interesting. For this statement:

```sql
SELECT NULL
```

Spark implements a special NullType for this return type, but that creates a lot of side effects: things like the Parquet writer and JDBC drivers do not support this type.

I tested Postgres:

```sql
CREATE TABLE test AS
SELECT NULL;
```

The DDL for this table shows that column as a text type, which is why I have applied the default Utf8 type to Value(Null).
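A sketch of the typing rule just described (`SqlValue`, `DataType`, and `literal_type` are simplified stand-ins for the planner's types, not DataFusion code): an untyped NULL literal defaults to Utf8, matching Postgres resolving it to text.

```rust
#[derive(Debug, PartialEq)]
enum DataType {
    Utf8,
    Int64,
}

enum SqlValue {
    Null,
    Number(i64),
    Text(String),
}

fn literal_type(v: &SqlValue) -> DataType {
    match v {
        // an untyped NULL gets the default Utf8 type, like Postgres's text
        SqlValue::Null => DataType::Utf8,
        SqlValue::Number(_) => DataType::Int64,
        SqlValue::Text(_) => DataType::Utf8,
    }
}

fn main() {
    assert_eq!(literal_type(&SqlValue::Null), DataType::Utf8);
}
```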

@seddonm1 (Contributor, Author)

> Overall, this looks really good. Thanks a lot for taking the time to implement these.
>
> I left a lot of comments, but they are all pretty small, so please do not take their number by the correctness of this: you understood and used all the APIs really well, they are just considering edge cases and small consistency improvements.

Thanks @jorgecarleitao and no problem with the number of comments.

I will work through these and let you know.

@andygrove (Member)

This looks great. Thanks @seddonm1

@codecov-io

codecov-io commented Dec 20, 2020

Codecov Report

Merging #8966 (2f26b23) into master (519e9da) will decrease coverage by 0.09%.
The diff coverage is 74.46%.


```diff
@@            Coverage Diff             @@
##           master    #8966      +/-   ##
==========================================
- Coverage   83.26%   83.16%   -0.10%
==========================================
  Files         196      200       +4
  Lines       48192    48992     +800
==========================================
+ Hits        40125    40743     +618
- Misses       8067     8249     +182
```
| Impacted Files | Coverage Δ |
|---|---|
| rust/datafusion/src/logical_plan/expr.rs | 76.76% <ø> (ø) |
| rust/datafusion/src/physical_plan/functions.rs | 79.34% <65.71%> (-2.68%) ⬇️ |
| ...datafusion/src/physical_plan/string_expressions.rs | 87.50% <100.00%> (+1.78%) ⬆️ |
| rust/datafusion/src/sql/planner.rs | 84.81% <100.00%> (+3.08%) ⬆️ |
| rust/datafusion/tests/sql.rs | 99.83% <100.00%> (+<0.01%) ⬆️ |
| rust/arrow/src/compute/kernels/arithmetic.rs | 89.79% <0.00%> (-9.23%) ⬇️ |
| rust/arrow/src/compute/kernels/sort.rs | 93.42% <0.00%> (-2.34%) ⬇️ |
| rust/datafusion/src/optimizer/filter_push_down.rs | 97.65% <0.00%> (-1.74%) ⬇️ |
| rust/datafusion/src/logical_plan/builder.rs | 90.10% <0.00%> (-1.00%) ⬇️ |
| rust/arrow/src/compute/util.rs | 98.93% <0.00%> (-0.44%) ⬇️ |

... and 22 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5819943...2f26b23.

@seddonm1 (Contributor, Author)

@jorgecarleitao Thanks for your comments (they really help me learn); I have done a major refactor.

Please pay close attention to the comments here: #8966 (comment) as I do not want to make decisions like that on my own.

@jorgecarleitao (Member) left a comment:

Thanks a lot, @seddonm1 , really clean implementation.

@seddonm1 (Contributor, Author)

> Thanks a lot, @seddonm1 , really clean implementation.

Thanks @jorgecarleitao and thanks for your patience 👍

@jorgecarleitao (Member)

@seddonm1 , the API for built-in functions is relatively new and WIP. If you felt that it did not suit the needs or that it could be simpler / easier to use, please raise that concern. We anticipate that it will be more used as time progresses, and it is useful to check its design from time to time to make sure that its assumptions still hold.

@Dandandan (Contributor) left a comment:

LGTM. Thanks @seddonm1 , great additions!

@seddonm1 (Contributor, Author)

> @seddonm1 , the API for built-in functions is relatively new and WIP. If you felt that it did not suit the needs or that it could be simpler / easier to use, please raise that concern. We anticipate that it will be more used as time progresses, and it is useful to check its design from time to time to make sure that its assumptions still hold.

The instructions in the README are definitely enough information to easily add these kinds of functions (I updated them slightly), and I think the API (like the Variadic and Uniform signatures) is really intuitive, so I think you did a great job.

The big question at a project level is which dialect of SQL to support. Adding ltrim, for example, is trivial, but that is a Postgres-specific function: https://www.postgresql.org/docs/13/functions-string.html#FUNCTIONS-STRING-OTHER (it's in lots of other implementations too). If we decide on Postgres compatibility, then the existing DataFusion concat implementation already does not align with Postgres in how it handles NULL.
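To make that divergence concrete, here is a stdlib-only sketch of the two NULL semantics (the helper names are illustrative, not DataFusion code): Postgres's `concat` treats NULL arguments as empty strings, whereas null-propagating concatenation (the semantics of SQL `||`) returns NULL if any input is NULL.

```rust
// Postgres-style concat: NULL arguments are simply skipped.
fn concat_pg(args: &[Option<&str>]) -> String {
    args.iter().flatten().copied().collect()
}

// Null-propagating concat: any NULL input makes the whole result NULL.
fn concat_propagating(args: &[Option<&str>]) -> Option<String> {
    args.iter()
        .copied()
        .collect::<Option<Vec<_>>>()
        .map(|parts| parts.concat())
}

fn main() {
    let args = [Some("abc"), None, Some("def")];
    assert_eq!(concat_pg(&args), "abcdef");
    assert_eq!(concat_propagating(&args), None);
}
```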

Dandandan pushed a commit to Dandandan/arrow that referenced this pull request Dec 22, 2020
…ions

This PR implements some of the basic string functions as per the ANSI SQL specification. To properly meet the ANSI specification work will need to be done on the `sqlparser` to support the verbose style that the ANSI spec has such as
```sql
trim(both 'xyz' from 'yxTomxx')
```

Closes apache#8966 from seddonm1/basic-string-functions

Authored-by: Mike Seddon <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
@seddonm1 seddonm1 deleted the basic-string-functions branch January 4, 2021 22:48