Skip to content
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@

package org.apache.spark.sql.catalyst.expressions

import org.apache.spark.sql.catalyst.util.TypeUtils

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless?


/**
* Rewrites an expression using rules that are guaranteed preserve the result while attempting
* to remove cosmetic variations. Deterministic expressions that are `equal` after canonicalization
Expand Down Expand Up @@ -85,6 +87,14 @@ object Canonicalize {
case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)

// order the list in the In operator
// we can do this only if all the elements in the list are literals with the same datatype
case i @ In(value, list)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just reorder elements in list by hashCode?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea, thanks. I am doing it. It will allow also to cover more cases than simple literals. Thank you.

if i.inSetConvertible && list.map(_.dataType.asNullable).distinct.size == 1 =>

@dongjoon-hyun dongjoon-hyun May 18, 2018

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pinging me, @mgaido91 .
isSetConvertible ensures all types are literal. So, Literal.dataType.asNullable doesn't do anything for here in list.map(_.dataType.asNullable).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your comment @dongjoon-hyun, but I am not sure I agree with you. What if we have something like in (array(null, 1), array(1, 2, 3), array(3, 2, 1))? The first literal would contain an array which can contain nulls while the others would not be, so in this case we would have 2 distinct datatypes (because of nullability).
Am I missing something? Thanks.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check again. I might forget something here. :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Right. I confused those cases.

val literals = list.map(_.asInstanceOf[Literal])
val ordering = TypeUtils.getInterpretedOrdering(literals.head.dataType)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-ordering type, this will throw match error.

In(value, literals.sortBy(_.value)(ordering))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For complex literals like array, this doesn't work. Please add a test case for complex types and handle them.

scala> sql("select * from t where array(1,2) in (array(1,2),array(2,1))").queryExecution.logical.canonicalized.semanticHash()
res4: Int = -1398094385

scala> sql("select * from t where array(1,2) in (array(2,1),array(1,2))").queryExecution.logical.canonicalized.semanticHash()
res5: Int = -1233982198

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it works. The problem in your example is that array(1, 2) is not a literal, but it is a CreateArray expression. So we are not reordering it. Using literal arrays it works as shown in the updated UT.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. BTW, it comes from your example. Anyway, my bad. :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, it's my fault. Anyway, with @viirya's suggestion, also this case works properly. I updated the UT too. Thanks.


case _ => e
}
}
28 changes: 28 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Original file line number Diff line number Diff line change
Expand Up @@ -2265,4 +2265,32 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
val df = spark.range(1).select($"id", new Column(Uuid()))
checkAnswer(df, df.collect())
}

test("SPARK-24276: IN returns sameResult if the order of literals is different") {
val df = spark.range(1)
val p1 = df.where($"id".isin(1, 2))
val p2 = df.where($"id".isin(2, 1))
val p3 = df.where($"id".isin(1, 2, 3))

assert(p1.queryExecution.executedPlan.sameResult(p2.queryExecution.executedPlan))
assert(!p1.queryExecution.executedPlan.sameResult(p3.queryExecution.executedPlan))

val h1 = p1.queryExecution.logical.canonicalized.semanticHash()
val h2 = p2.queryExecution.logical.canonicalized.semanticHash()
val h3 = p3.queryExecution.logical.canonicalized.semanticHash()
assert(h1 == h2)
assert(h1 != h3)

val df2 = Seq(Array(1, 2)).toDF("id")
val arrays1 = df2.where($"id".isin(lit(Array(1, 2)), lit(Array(2, 1))))
val arrays2 = df2.where($"id".isin(lit(Array(2, 1)), lit(Array(1, 2))))
val arrays3 = df2.where($"id".isin(lit(Array(3, 2)), lit(Array(2, 1))))
assert(arrays1.queryExecution.executedPlan.sameResult(arrays2.queryExecution.executedPlan))
assert(!arrays1.queryExecution.executedPlan.sameResult(arrays3.queryExecution.executedPlan))
val arraysHash1 = arrays1.queryExecution.logical.canonicalized.semanticHash()
val arraysHash2 = arrays2.queryExecution.logical.canonicalized.semanticHash()
val arraysHash3 = arrays3.queryExecution.logical.canonicalized.semanticHash()
assert(arraysHash1 == arraysHash2)
assert(arraysHash1 != arraysHash3)
}
}