-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10959: [C++] Add scalar string join kernel #8990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8f6e26e to
609a388
Compare
d4608a9 to
356c300
Compare
609a388 to
d42058a
Compare
d42058a to
9c0836d
Compare
|
Failures seem unrelated. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @maartenbreddels . Here are some comments.
|
Regarding "join" in the name: Regarding "binary" in the name: It would also be good for the name to reflect that this operates on a list array. How about renaming it |
|
I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join". |
SQL—which probably has more users than all those languages combined 😁 |
Snark aside, that's a good point, and since users who are thinking in the SQL way will always be at least one abstraction layer removed from the C++ library, that makes me think "join" is fine 👍 |
|
In SQL it's called |
In SQL, |
|
@maartenbreddels Do you want to update this PR? |
|
I'm going to take this up. |
f06549e to
9353e4e
Compare
|
I adressed review comments, added a benchmark, and made |
|
Still a TODO to address (also support large_list inputs). Edit: done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep the public overloads to a minimum, these should probably just take Datums. Added https://issues.apache.org/jira/browse/ARROW-12953
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't lose anything for string kernels which always do their own allocation, but it's worth noting that *out is always a correctly shaped ArrayData. We could also safely write
| std::shared_ptr<Array> string_array; | |
| RETURN_NOT_OK(builder->Finish(&string_array)); | |
| *out = *string_array->data(); | |
| RETURN_NOT_OK(builder->FinishInternal(const_cast<std::shared_ptr<ArrayData>*>(&out->array()))); |
@jorisvandenbossche I've implemented this kernel as a binary (arity) kernel, so the input list array and the separator input string array can both be an array (see python test).
I did not implement the case where the input list is a scalar, and the separator an array, since I don't think that's very common.
And note that the kernel is named
binary_joinbecause it takes string-like and binary-like inputs.