Improve trim for string view #12395
FYI @Rachelint that #12383 is modifying
Thanks! I will push this forward once #12383 is merged.
Thank you @Rachelint -- this looks really nice and quite close 🙏
I left some comments, but I don't think they are required to merge this.
I do think we should have benchmark numbers showing this makes things faster in order to merge it. Could you please make a StringView based benchmark for trim -- perhaps in
// regarding copyright ownership. The ASF licenses this file |
Then we can run that benchmark and show that this PR improves the performance.
Thanks again!
@@ -82,7 +82,11 @@ impl ScalarUDFImpl for BTrimFunc {
    }

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
        utf8_to_str_type(&arg_types[0], "btrim")
        if arg_types[0] == DataType::Utf8View {
            Ok(DataType::Utf8View)
👍
Also, eventually it would be possible to return Utf8View when the input is Utf8 and save a copy as well.
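A minimal sketch of the return-type rule in this diff, for readers who want to try it in isolation. The `StrType` enum and `trim_return_type` function are hypothetical stand-ins, not the arrow `DataType` or the DataFusion `return_type` method; the real code delegates the non-view cases to `utf8_to_str_type`, which this sketch only approximates by passing the input type through.

```rust
/// Hypothetical stand-in for the string variants of arrow's `DataType`,
/// defined here only so the sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Return type of a trim function for a given input type (assumed rule):
/// a view input yields a view output, every other string type is kept as is.
fn trim_return_type(input: &StrType) -> StrType {
    if *input == StrType::Utf8View {
        StrType::Utf8View
    } else {
        // The real code calls `utf8_to_str_type` here; passing the type
        // through is a simplification for illustration.
        input.clone()
    }
}
```

Returning Utf8View for a Utf8 input, as suggested above, would amount to changing the `else` branch for that variant.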
use datafusion_common::cast::{as_generic_string_array, as_string_view_array};
use datafusion_common::Result;
use datafusion_common::{exec_err, ScalarValue};
use datafusion_expr::ColumnarValue;

/// Make a `u128` based on the given substr, start(offset to view.offset), and
/// push into to the given buffers
pub(crate) fn make_and_append_view(
🤔 I wonder if we should (as a follow on PR) propose adding this upstream to arrow-rs as it seems valuable for any trim related kernels on stringview
That sounds great! And #12383 (comment) can be solved if this becomes a function in arrow-rs.
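For context on what such a helper has to produce, here is a rough sketch of packing a string view into a `u128`, following the Arrow view layout: a 4-byte length, then either up to 12 inline bytes or a 4-byte prefix plus a buffer index and an offset. The name `make_view_sketch` and its parameters are assumptions for illustration; this is not the `make_and_append_view` from the PR, and it does not append to any buffers.

```rust
/// Pack a view for `substr`, assumed to start at byte `offset` of data
/// buffer number `buffer_index`. Hypothetical helper, for illustration only.
fn make_view_sketch(substr: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = substr.len() as u32;
    let mut bytes = [0u8; 16];
    // Bytes 0..4 always hold the length (little-endian).
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        // Short string: the data is stored inline, right after the length.
        bytes[4..4 + substr.len()].copy_from_slice(substr);
    } else {
        // Long string: 4-byte prefix, then where the data lives.
        bytes[4..8].copy_from_slice(&substr[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}
```

For a trimmed long string, only `len`, the prefix, and `offset` differ from the original view; the data buffer itself is untouched.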
@@ -81,7 +81,11 @@ impl ScalarUDFImpl for LtrimFunc {
    }

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
        utf8_to_str_type(&arg_types[0], "ltrim")
        if arg_types[0] == DataType::Utf8View {
            Ok(DataType::Utf8View)
Can we possibly add a .slt test to cover this (showing that the output type is now a view and some basic end to end tests (if not already done)?)
Ok, I am fixing tests and benchmarks now.
Fixed the test and introduced a benchmark in #12513.
#12395 (comment) shows the improvement numbers.
The benchmark PR still needs to be sorted out; I will do that later today.
I think maybe we should put LTrim/RTrim/BTrim in the same place (like trim.rs)?
For benchmarking, I would recommend PR #12111, for what it's worth.
Thanks, it is really helpful!
I ran the benchmark introduced in #12513 and saw about a 10~20% improvement for long strings (64 bytes). The highlight: as expected, the string view trim mainly reduces copying when the trimmed result is longer than 12 bytes.
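To illustrate why long results avoid copying, here is a small sketch that trims ASCII spaces by computing a new offset and length into the same data buffer. The function `btrim_offsets` is hypothetical and handles only spaces; the PR's actual trim logic supports arbitrary trim characters and builds real views.

```rust
/// Trim ASCII spaces from both ends of the string stored at
/// `buffer[offset..offset + len]`. Returns the `(offset, len)` of the
/// trimmed result inside the same buffer, so no bytes are copied.
/// Hypothetical helper, for illustration only.
fn btrim_offsets(buffer: &[u8], offset: usize, len: usize) -> (usize, usize) {
    let s = &buffer[offset..offset + len];
    // First non-space byte, or `len` if the string is all spaces.
    let start = s.iter().position(|b| *b != b' ').unwrap_or(len);
    // One past the last non-space byte, or `start` if there is none.
    let end = s.iter().rposition(|b| *b != b' ').map_or(start, |i| i + 1);
    (offset + start, end - start)
}
```

When the returned length exceeds 12, a view can point at the original buffer with the new offset; results of 12 bytes or fewer are stored inline in the view, so they still involve a small copy.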
The detailed, sorted benchmark results:
Which issue does this PR close?
Closes #12387
Rationale for this change
Similar to the string view version of substr, we can implement a string view version of trim to improve performance.
What changes are included in this PR?
Are these changes tested?
Tested by new unit tests and other existing tests.
Are there any user-facing changes?
No.