-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24636][SQL] Type coercion of arrays for array_join function #21620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| case ArrayJoin(arr, d, nr) if !ArrayType(StringType).acceptsType(arr.dataType) && | ||
| ArrayType.acceptsType(arr.dataType) => | ||
| val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull | ||
| ArrayJoin(Cast(arr, ArrayType(StringType, containsNull)), d, nr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not every type can be casted to StringType. What about using ImplicitTypeCasts.implicitCast in order to check if we can cast it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mgaido91,
to be honest, I've considered this option before submitting this PR. But I'm glad that you mentioned this approach. At least, we can discuss pros and cons of different solutions. Usage of ImplicitTypeCasts.implicitCast would enable conversion only from primitive types. I think it would be nice to support non-primitive types as well. WDYT?
Re: Casting to StringType: According to Cast.canCast method should be possible to cast any type to StringType:
line 42: case (_, StringType) => true
Or am I missing something? I hope test cases in .../typeCoercion/native/arrayJoin.sql cover to StringType conversions from all Spark types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure. I think it is arguable which is the right result of SELECT array_join(array(array('a', 'b'), array('c', 'd')), ';') for instance. With this PR, the result is [a, b];[c, d] but shouldn't it be [a;b];[c;d]? Moreover, Presto, which is the reference here, doesn't support nested arrays for instance:
presto> select array_join(array[array[1, 2, 3], array[3, 4, 5]], ';');
Query 20180625_090549_00003_bsbcg failed: Input type array(integer) not supported
So, I'd avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, no problem. Let's support just arrays of primitive types for now. Thanks!
|
Test build #92253 has finished for PR 21620 at commit
|
| case aj @ ArrayJoin(arr, d, nr) if !ArrayType(StringType).acceptsType(arr.dataType) && | ||
| ArrayType.acceptsType(arr.dataType) => | ||
| val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull | ||
| ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: usually we try to avoid long chains of functions, what about:
ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) match {
...
}
|
LGMT |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one comment.
| ArrayType.acceptsType(arr.dataType) => | ||
| val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull | ||
| ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) match { | ||
| case Some(finalDataType) => ArrayJoin(finalDataType, d, nr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
castedArr or something instead of finalDataType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spot on, thanks!
|
Test build #92297 has finished for PR 21620 at commit
|
|
Test build #92298 has finished for PR 21620 at commit
|
|
Test build #92299 has finished for PR 21620 at commit
|
|
Merged to master. |
| -- !query 9 | ||
| SELECT array_join(array(timestamp '2016-11-15 20:54:00.000', timestamp '2016-11-12 20:54:00.000'), ', ') | ||
| -- !query 9 schema | ||
| struct<array_join(array(TIMESTAMP('2016-11-15 20:54:00.0'), TIMESTAMP('2016-11-12 20:54:00.0')), , ):string> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the input array is very long, the automatically generated column name will be also super long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, yes, it will be. In general, if an expression has children: Seq[Expression] as its argument, the automatically generated column name will be long for now?
What changes were proposed in this pull request?
Presto's implementation accepts arbitrary arrays of primitive types as an input:
This PR proposes to implement a type coercion rule for
array_joinfunction that converts arrays of primitive as well as non-primitive types to arrays of string.How was this patch tested?
New test cases add into: