Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the BTRIM scalar function to support Utf8View #11835

Closed
Tracked by #11790
tshauck opened this issue Aug 5, 2024 · 2 comments · Fixed by #11920
Closed
Tracked by #11790

Update the BTRIM scalar function to support Utf8View #11835

tshauck opened this issue Aug 5, 2024 · 2 comments · Fixed by #11920
Assignees
Labels
enhancement New feature or request

Comments

@tshauck
Copy link
Contributor

tshauck commented Aug 5, 2024

Part of #11752 and #11790

Currently, a call to BTRIM with a Utf8View datatypes induces a cast. After the change that fixes this issue, it should not.

## Ensure no casts for BTRIM
query TT
EXPLAIN SELECT
  BTRIM(column1_utf8view, 'foo') AS l
FROM test;
----
logical_plan
01)Projection: btrim(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS l
02)--TableScan: test projection=[column1_utf8view]

btrim is defined here: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/string/btrim.rs

Is your feature request related to a problem or challenge?

We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See #10918 for more background.

Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8 and when called with a StringView argument DataFusion will cast the argument back to DataType::Utf8 which is expensive.

To realize the full speed of StringView, we need to ensure that all string functions support the DataType::Utf8View directly.

Describe the solution you'd like

Update the function to support DataType::Utf8View directly

Describe alternatives you've considered

The typical steps are:

  1. Write some tests showing the function doesn't support Utf8View (see the tests in string_view.slt to ensure the arguments are not being cast
  2. Change the Signature of the function to accept Utf8View in addition to Utf8/LargeUtf8
  3. Update the implementation of the function to operate on Utf8View

Example PRs

Additional context

The documentation of string functions can be found here: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#string-functions

To test a function with StringView with datafusion-cli you can use an example like this (replacing starts_with with the relevant function)

> create table foo as values (arrow_cast('foo', 'Utf8View'), arrow_cast('bar', 'Utf8View'));
0 row(s) fetched.
Elapsed 0.043 seconds.

> select starts_with(column1, column2) from foo;
+--------------------------------------+
| starts_with(foo.column1,foo.column2) |
+--------------------------------------+
| false                                |
+--------------------------------------+
1 row(s) fetched.
Elapsed 0.015 seconds.

To see if it is using utf8 view, use EXPLAIN to see the plan and verify there is no CAST. In this example the CAST(column1@0 AS Utf8) indicates that the function is not using Utf8View natively

> explain select starts_with(column1, column2) from foo;
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                         |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: starts_with(CAST(foo.column1 AS Utf8), CAST(foo.column2 AS Utf8))                                                |
|               |   TableScan: foo projection=[column1, column2]                                                                               |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), CAST(column2@1 AS Utf8)) as starts_with(foo.column1,foo.column2)] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                              |
|               |                                                                                                                              |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.006 seconds.

It is also often good to test with a constant as well (likewise there should be no cast):

> explain select starts_with(column1, 'foo') from foo;
+---------------+----------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                     |
+---------------+----------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: starts_with(CAST(foo.column1 AS Utf8), Utf8("foo"))                                          |
|               |   TableScan: foo projection=[column1]                                                                    |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), foo) as starts_with(foo.column1,Utf8("foo"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                          |
|               |                                                                                                          |
+---------------+----------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

### Additional context

_No response_
@tshauck tshauck added the enhancement New feature or request label Aug 5, 2024
@tshauck tshauck changed the title Update the BTRIM scalar function to support Utf8View Update the BTRIM scalar function to support Utf8View Aug 5, 2024
@Kev1n8
Copy link
Contributor

Kev1n8 commented Aug 7, 2024

take

@Kev1n8
Copy link
Contributor

Kev1n8 commented Aug 8, 2024

I have a question about how this should be done. BTRIM would like to modify the input string, while Utf8View is not mutable. So is it correct that the input data_type can only be Utf8 or LargeUtf8? If so, should I leave the signature as

            signature: Signature::one_of(
                vec![
                    Exact(vec![Utf8]),
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![Utf8, Utf8View]),
                    // BTRIM wants to modify string, while StringView is not mutable, hence
                    // input should be cast to utf8
                    // Exact(vec![Utf8View]),
                    // Exact(vec![Utf8View, Utf8]),
                    // Exact(vec![Utf8View, Utf8View]),
                ],
                Volatility::Immutable,
            ),

or manually cast it in invoke(), which does not display CAST in plans:


            DataType::Utf8View => {
                let args = &[
                    args[0].cast_to(&DataType::Utf8, None)?,
                    args[1].clone(),
                ];
                make_scalar_function(
                    btrim::<i32>,
                    vec![Hint::Pad, Hint::AcceptsSingular],
                )(args)
            }

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants