Closed
@@ -536,6 +536,11 @@ object TypeCoercion {
case None => c
}

case ArrayJoin(arr, d, nr) if !ArrayType(StringType).acceptsType(arr.dataType) &&
ArrayType.acceptsType(arr.dataType) =>
val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull
ArrayJoin(Cast(arr, ArrayType(StringType, containsNull)), d, nr)
Contributor

Not every type can be cast to StringType. What about using ImplicitTypeCasts.implicitCast to check whether the cast is possible?

Contributor Author

Hi @mgaido91,
to be honest, I considered this option before submitting this PR, but I'm glad you mentioned it, since we can now discuss the pros and cons of the different solutions. Using ImplicitTypeCasts.implicitCast would enable conversion only from primitive types. I think it would be nice to support non-primitive types as well. WDYT?

Re: casting to StringType: according to the Cast.canCast method, it should be possible to cast any type to StringType:
line 42: case (_, StringType) => true
Or am I missing something? I hope the test cases in .../typeCoercion/native/arrayJoin.sql cover conversions to StringType from all Spark types.

Contributor

I am not sure. I think the right result of, for instance, SELECT array_join(array(array('a', 'b'), array('c', 'd')), ';') is arguable. With this PR, the result is [a, b];[c, d], but shouldn't it be [a;b];[c;d]? Moreover, Presto, which is the reference here, doesn't support nested arrays:

presto> select array_join(array[array[1, 2, 3], array[3, 4, 5]], ';');
Query 20180625_090549_00003_bsbcg failed: Input type array(integer) not supported

So, I'd avoid that.
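
To make the ambiguity concrete, here is a minimal plain-Scala sketch with no Spark dependency. The helpers `render`, `flatJoin` and `deepJoin` are hypothetical names used only for illustration; neither corresponds to an existing Spark function. `flatJoin` mimics what this PR produces (each inner array is stringified first), while `deepJoin` applies the delimiter inside the inner arrays as well.

```scala
// Hypothetical helpers for illustration only; none of these are Spark APIs.
object NestedJoinSketch {
  // Renders an inner array the way it appears in this PR's output, e.g. "[a, b]".
  def render(xs: Seq[String]): String = xs.mkString("[", ", ", "]")

  // Option A: stringify each inner array first, then join with the delimiter.
  def flatJoin(arr: Seq[Seq[String]], delimiter: String): String =
    arr.map(render).mkString(delimiter)

  // Option B: use the delimiter inside the inner arrays as well.
  def deepJoin(arr: Seq[Seq[String]], delimiter: String): String =
    arr.map(_.mkString("[", delimiter, "]")).mkString(delimiter)

  def main(args: Array[String]): Unit = {
    val nested = Seq(Seq("a", "b"), Seq("c", "d"))
    println(flatJoin(nested, ";")) // [a, b];[c, d]
    println(deepJoin(nested, ";")) // [a;b];[c;d]
  }
}
```

Both results are defensible, which is the reason for not committing to either of them in this PR.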

Contributor Author

OK, no problem. Let's support only arrays of primitive types for now. Thanks!
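
A rough sketch of the agreed behaviour, written against a toy type model rather than Spark's real classes. `DType`, `implicitlyCastableToString` and `coerceArrayJoinInput` are all hypothetical names; the sketch assumes, as discussed above, that an implicit-cast check accepts only primitive element types and that nested or complex element types are left uncoerced.

```scala
// Toy model for illustration; these types only mimic the shape of Spark's DataType hierarchy.
object PrimitiveOnlyCoercionSketch {
  sealed trait DType
  case object StringT extends DType
  case object IntT extends DType
  case object BooleanT extends DType
  case object DateT extends DType
  final case class ArrayT(element: DType, containsNull: Boolean) extends DType
  final case class StructT(fields: Seq[DType]) extends DType

  // Hypothetical stand-in for an implicit-cast check: only primitive element types qualify.
  def implicitlyCastableToString(t: DType): Boolean = t match {
    case StringT | IntT | BooleanT | DateT => true
    case _: ArrayT | _: StructT => false
  }

  // Returns the coerced input type for array_join, or None when the input is left unchanged.
  def coerceArrayJoinInput(t: DType): Option[DType] = t match {
    case ArrayT(StringT, _) => None // already array<string>, nothing to do
    case ArrayT(elem, containsNull) if implicitlyCastableToString(elem) =>
      Some(ArrayT(StringT, containsNull)) // keep the original nullability
    case _ => None // nested or complex elements: no cast is inserted
  }
}
```

With this shape, `array(2, 1)` would be cast to an array of strings, while `array(array('a', 'b'), ...)` would be left as is, so the expression's own input type check would be expected to reject it.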


case m @ CreateMap(children) if m.keys.length == m.values.length &&
(!haveSameType(m.keys) || !haveSameType(m.values)) =>
val newKeys = if (haveSameType(m.keys)) {
@@ -1237,6 +1237,7 @@ case class ArrayJoin(

override def dataType: DataType = StringType

override def prettyName: String = "array_join"
}

/**
@@ -0,0 +1,14 @@
SELECT array_join(array(true, false), ', ');
SELECT array_join(array(2Y, 1Y), ', ');
SELECT array_join(array(2S, 1S), ', ');
SELECT array_join(array(2, 1), ', ');
SELECT array_join(array(2L, 1L), ', ');
SELECT array_join(array(9223372036854775809, 9223372036854775808), ', ');
SELECT array_join(array(2.0D, 1.0D), ', ');
SELECT array_join(array(float(2.0), float(1.0)), ', ');
SELECT array_join(array(date '2016-03-14', date '2016-03-13'), ', ');
SELECT array_join(array(timestamp '2016-11-15 20:54:00.000', timestamp '2016-11-12 20:54:00.000'), ', ');
SELECT array_join(array('a', 'b'), ', ');
SELECT array_join(array(array('a', 'b'), array('c', 'd')), ', ');
SELECT array_join(array(struct('a', 1), struct('b', 2)), ', ');
SELECT array_join(array(map('a', 1), map('b', 2)), ', ');
@@ -0,0 +1,114 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 14


-- !query 0
SELECT array_join(array(true, false), ', ')
-- !query 0 schema
struct<array_join(array(true, false), , ):string>
-- !query 0 output
true, false


-- !query 1
SELECT array_join(array(2Y, 1Y), ', ')
-- !query 1 schema
struct<array_join(array(2, 1), , ):string>
-- !query 1 output
2, 1


-- !query 2
SELECT array_join(array(2S, 1S), ', ')
-- !query 2 schema
struct<array_join(array(2, 1), , ):string>
-- !query 2 output
2, 1


-- !query 3
SELECT array_join(array(2, 1), ', ')
-- !query 3 schema
struct<array_join(array(2, 1), , ):string>
-- !query 3 output
2, 1


-- !query 4
SELECT array_join(array(2L, 1L), ', ')
-- !query 4 schema
struct<array_join(array(2, 1), , ):string>
-- !query 4 output
2, 1


-- !query 5
SELECT array_join(array(9223372036854775809, 9223372036854775808), ', ')
-- !query 5 schema
struct<array_join(array(9223372036854775809, 9223372036854775808), , ):string>
-- !query 5 output
9223372036854775809, 9223372036854775808


-- !query 6
SELECT array_join(array(2.0D, 1.0D), ', ')
-- !query 6 schema
struct<array_join(array(2.0, 1.0), , ):string>
-- !query 6 output
2.0, 1.0


-- !query 7
SELECT array_join(array(float(2.0), float(1.0)), ', ')
-- !query 7 schema
struct<array_join(array(CAST(2.0 AS FLOAT), CAST(1.0 AS FLOAT)), , ):string>
-- !query 7 output
2.0, 1.0


-- !query 8
SELECT array_join(array(date '2016-03-14', date '2016-03-13'), ', ')
-- !query 8 schema
struct<array_join(array(DATE '2016-03-14', DATE '2016-03-13'), , ):string>
-- !query 8 output
2016-03-14, 2016-03-13


-- !query 9
SELECT array_join(array(timestamp '2016-11-15 20:54:00.000', timestamp '2016-11-12 20:54:00.000'), ', ')
-- !query 9 schema
struct<array_join(array(TIMESTAMP('2016-11-15 20:54:00.0'), TIMESTAMP('2016-11-12 20:54:00.0')), , ):string>
Member

If the input array is very long, will the automatically generated column name also be very long?

Member

Hm, yes, it will be. In general, if an expression takes children: Seq[Expression] as its argument, the automatically generated column name will be long, at least for now.
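
For illustration, a toy model of a name derived from the expression tree (this is not Spark's actual naming code; `Expr`, `Lit` and `Call` are hypothetical). It shows why the name grows with the number of children, and why the schemas in this file end in `, , )`: the `', '` delimiter literal is printed unquoted.

```scala
// Toy expression tree; the names are hypothetical and only mimic how a generated name is assembled.
object ColumnNameSketch {
  sealed trait Expr { def sql: String }
  final case class Lit(value: String) extends Expr {
    def sql: String = value // literals are printed without quotes
  }
  final case class Call(name: String, children: Seq[Expr]) extends Expr {
    // The name is built from every child, so it grows with the number of arguments.
    def sql: String = children.map(_.sql).mkString(s"$name(", ", ", ")")
  }

  def main(args: Array[String]): Unit = {
    val small = Call("array_join", Seq(Call("array", Seq(Lit("2"), Lit("1"))), Lit(", ")))
    println(small.sql) // array_join(array(2, 1), , )

    val big = Call("array_join", Seq(Call("array", (1 to 1000).map(i => Lit(i.toString))), Lit(", ")))
    println(big.sql.length > small.sql.length) // true
  }
}
```

An explicit alias on the column avoids depending on the generated name.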

-- !query 9 output
2016-11-15 20:54:00, 2016-11-12 20:54:00


-- !query 10
SELECT array_join(array('a', 'b'), ', ')
-- !query 10 schema
struct<array_join(array(a, b), , ):string>
-- !query 10 output
a, b


-- !query 11
SELECT array_join(array(array('a', 'b'), array('c', 'd')), ', ')
-- !query 11 schema
struct<array_join(array(array(a, b), array(c, d)), , ):string>
-- !query 11 output
[a, b], [c, d]


-- !query 12
SELECT array_join(array(struct('a', 1), struct('b', 2)), ', ')
-- !query 12 schema
struct<array_join(array(named_struct(col1, a, col2, 1), named_struct(col1, b, col2, 2)), , ):string>
-- !query 12 output
[a, 1], [b, 2]


-- !query 13
SELECT array_join(array(map('a', 1), map('b', 2)), ', ')
-- !query 13 schema
struct<array_join(array(map(a, 1), map(b, 2)), , ):string>
-- !query 13 output
[a -> 1], [b -> 2]
@@ -552,6 +552,23 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
checkAnswer(
df.selectExpr("array_join(x, delimiter, 'NULL')"),
Seq(Row("a,b"), Row("a,NULL,b"), Row("")))

val idf = Seq(Seq(1, 2, 3)).toDF("x")

checkAnswer(
idf.select(array_join(idf("x"), ", ")),
Seq(Row("1, 2, 3"))
)
checkAnswer(
idf.selectExpr("array_join(x, ', ')"),
Seq(Row("1, 2, 3"))
)
intercept[AnalysisException] {
idf.selectExpr("array_join(x, 1)")
}
intercept[AnalysisException] {
idf.selectExpr("array_join(x, ', ', 1)")
}
}

test("array_min function") {