Skip to content

Conversation

@maartenbreddels
Copy link
Contributor

@jorisvandenbossche I've implemented this kernel as a binary (arity) kernel, so the input list array and the separator input string array can both be an array (see python test).

I did not implement the case where the input list is a scalar, and the separator an array, since I don't think that's very common.

And note that the kernel is named binary_join because it takes string-like and binary-like inputs.

@github-actions
Copy link

@maartenbreddels
Copy link
Contributor Author

Failures seem unrelated.
@pitrou this is ready for review assuming @jorisvandenbossche agrees with the current behavior

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maartenbreddels . Here are some comments.

@ianmcook
Copy link
Member

Regarding "join" in the name:
+1 for consistency with Python's join function
-1 because as Arrow gains more database-like features, the word "join" is likely to confuse people. "concat" might be better.

Regarding "binary" in the name:
IMO that is likely to cause users to get the impression that this is not for strings.

It would also be good for the name to reflect that this operates on a list array.

How about renaming it list_concat?

@westonpace
Copy link
Member

I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join".

@ianmcook
Copy link
Member

I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join".

SQL—which probably has more users than all those languages combined 😁

@ianmcook
Copy link
Member

I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join".

SQL—which probably has more users than all those languages combined 😁

Snark aside, that's a good point, and since users who are thinking in the SQL way will always be at least one abstraction layer removed from the C++ library, that makes me think "join" is fine 👍

@jorisvandenbossche
Copy link
Member

In SQL it's called CONCAT? (https://www.w3schools.com/sql/func_mysql_concat.asp, although this doesn't have the concept of a join separator)

@ianmcook
Copy link
Member

In SQL it's called CONCAT? (https://www.w3schools.com/sql/func_mysql_concat.asp, although this doesn't have the concept of a join separator)

In SQL, concat_ws is the variant with a separator

@pitrou
Copy link
Member

pitrou commented May 10, 2021

@maartenbreddels Do you want to update this PR?

@pitrou
Copy link
Member

pitrou commented Jun 2, 2021

I'm going to take this up.

@pitrou pitrou force-pushed the ARROW-10959 branch 2 times, most recently from f06549e to 9353e4e Compare June 3, 2021 10:56
@pitrou
Copy link
Member

pitrou commented Jun 3, 2021

I adressed review comments, added a benchmark, and made binary_join 3x faster (in particular by preallocating data at the right size).

@pitrou
Copy link
Member

pitrou commented Jun 3, 2021

Still a TODO to address (also support large_list inputs).

Edit: done

@pitrou pitrou requested a review from bkietz June 3, 2021 11:52
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep the public overloads to a minimum, these should probably just take Datums. Added https://issues.apache.org/jira/browse/ARROW-12953

Comment on lines +2608 to +2875
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't lose anything for string kernels which always do their own allocation, but it's worth noting that *out is always a correctly shaped ArrayData. We could also safely write

Suggested change
std::shared_ptr<Array> string_array;
RETURN_NOT_OK(builder->Finish(&string_array));
*out = *string_array->data();
RETURN_NOT_OK(builder->FinishInternal(const_cast<std::shared_ptr<ArrayData>*>(&out->array())));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants