Skip to content

Conversation

@mn-mikke
Copy link
Contributor

What changes were proposed in this pull request?

Presto's implementation accepts arbitrary arrays of primitive types as an input:

presto> SELECT array_join(ARRAY [1, 2, 3], ', ');
_col0
---------
1, 2, 3
(1 row)

This PR proposes to implement a type coercion rule for array_join function that converts arrays of primitive as well as non-primitive types to arrays of string.

How was this patch tested?

New test cases add into:

  • sql-tests/inputs/typeCoercion/native/arrayJoin.sql
  • DataFrameFunctionsSuite.scala

@mn-mikke
Copy link
Contributor Author

cc @ueshin @mgaido91

case ArrayJoin(arr, d, nr) if !ArrayType(StringType).acceptsType(arr.dataType) &&
ArrayType.acceptsType(arr.dataType) =>
val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull
ArrayJoin(Cast(arr, ArrayType(StringType, containsNull)), d, nr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not every type can be casted to StringType. What about using ImplicitTypeCasts.implicitCast in order to check if we can cast it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mgaido91,
to be honest, I've considered this option before submitting this PR. But I'm glad that you mentioned this approach. At least, we can discuss pros and cons of different solutions. Usage of ImplicitTypeCasts.implicitCast would enable conversion only from primitive types. I think it would be nice to support non-primitive types as well. WDYT?

Re: Casting to StringType: According to Cast.canCast method should be possible to cast any type to StringType:
line 42: case (_, StringType) => true
Or am I missing something? I hope test cases in .../typeCoercion/native/arrayJoin.sql cover to StringType conversions from all Spark types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. I think it is arguable which is the right result of SELECT array_join(array(array('a', 'b'), array('c', 'd')), ';') for instance. With this PR, the result is [a, b];[c, d] but shouldn't it be [a;b];[c;d]? Moreover, Presto, which is the reference here, doesn't support nested arrays for instance:

presto> select array_join(array[array[1, 2, 3], array[3, 4, 5]], ';');
Query 20180625_090549_00003_bsbcg failed: Input type array(integer) not supported

So, I'd avoid that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, no problem. Let's support just arrays of primitive types for now. Thanks!

@SparkQA
Copy link

SparkQA commented Jun 23, 2018

Test build #92253 has finished for PR 21620 at commit 9801771.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case aj @ ArrayJoin(arr, d, nr) if !ArrayType(StringType).acceptsType(arr.dataType) &&
ArrayType.acceptsType(arr.dataType) =>
val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull
ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: usually we try to avoid long chains of functions, what about:

ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) match {
  ...
}

@mgaido91
Copy link
Contributor

LGMT

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one comment.

ArrayType.acceptsType(arr.dataType) =>
val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull
ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) match {
case Some(finalDataType) => ArrayJoin(finalDataType, d, nr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

castedArr or something instead of finalDataType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spot on, thanks!

@SparkQA
Copy link

SparkQA commented Jun 25, 2018

Test build #92297 has finished for PR 21620 at commit 435b7ed.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 25, 2018

Test build #92298 has finished for PR 21620 at commit 3082947.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 25, 2018

Test build #92299 has finished for PR 21620 at commit 2aef9eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@asfgit asfgit closed this in e07aee2 Jun 26, 2018
-- !query 9
SELECT array_join(array(timestamp '2016-11-15 20:54:00.000', timestamp '2016-11-12 20:54:00.000'), ', ')
-- !query 9 schema
struct<array_join(array(TIMESTAMP('2016-11-15 20:54:00.0'), TIMESTAMP('2016-11-12 20:54:00.0')), , ):string>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input array is very long, the automatically generated column name will be also super long?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yes, it will be. In general, if an expression has children: Seq[Expression] as its argument, the automatically generated column name will be long for now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants