Conversation

@rednaxelafx
Contributor

What changes were proposed in this pull request?

Fix GenericArrayData.equals so that it respects the actual types of the elements.
For example, an instance that represents an array<int> and another instance that represents an array<long> should be considered incompatible, and thus equals should return false.

GenericArrayData doesn't keep any schema information by itself; instead it relies on the Java objects referenced by the elements of its array field to keep track of their own types. So the most straightforward way to respect those types is to call equals on the elements, instead of using Scala's == operator, whose cooperative numeric equality is not always desirable:

new java.lang.Integer(123) == new java.lang.Long(123L) // true in Scala
new java.lang.Integer(123).equals(new java.lang.Long(123L)) // false in Scala

How was this patch tested?

Added a unit test in ComplexDataSuite:

val array1 = new GenericArrayData(Array[Int](123))
val array2 = new GenericArrayData(Array[Long](123L))

assert(!array1.equals(array2))
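The gist of the fix can be sketched standalone (a hypothetical helper, not the actual Spark code): with elements typed as Any, Scala's == applies cooperative numeric equality and treats Integer(123) and Long(123L) as equal, whereas calling .equals respects the runtime type.

```scala
// Standalone sketch (not the actual Spark implementation) of the
// element-wise comparison after the fix. On values typed as Any,
// Scala's == uses cooperative numeric equality, so Integer(123) == Long(123L)
// would be true; .equals keeps the types distinct.
def elementsEqual(a: Array[Any], b: Array[Any]): Boolean = {
  if (a.length != b.length) return false
  var i = 0
  while (i < a.length) {
    val o1 = a(i)
    val o2 = b(i)
    if (o1 == null) {
      if (o2 != null) return false
    } else if (!o1.equals(o2)) {
      return false
    }
    i += 1
  }
  true
}

println(elementsEqual(Array[Any](Int.box(123)), Array[Any](Long.box(123L)))) // false
println(elementsEqual(Array[Any](Int.box(123)), Array[Any](Int.box(123))))   // true
```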
Member

@MaxGekk Jun 26, 2018


Can you check the positive case, when two arrays have the same element type and the same elements?

Member

Good catch! Can you test short and byte, too?


// GenericArrayData
scala> val arrayByte = new GenericArrayData(Array[Byte](123.toByte))
scala> val arrayShort = new GenericArrayData(Array[Short](123.toShort))
scala> val arrayInt = new GenericArrayData(Array[Int](123))
scala> val arrayLong = new GenericArrayData(Array[Long](123L))
scala> arrayByte.equals(arrayLong)
res8: Boolean = true

scala> arrayByte.equals(arrayInt)
res9: Boolean = true

scala> arrayShort.equals(arrayInt)
res10: Boolean = true

scala> arrayShort.equals(arrayLong)
res11: Boolean = true


// UnsafeArrayData
scala> val unsafeByte = ExpressionEncoder[Array[Byte]].resolveAndBind().toRow(Array[Byte](123.toByte)).getArray(0)
scala> val unsafeShort = ExpressionEncoder[Array[Short]].resolveAndBind().toRow(Array[Short](123.toShort)).getArray(0)
scala> val unsafeInt = ExpressionEncoder[Array[Int]].resolveAndBind().toRow(Array[Int](123)).getArray(0)
scala> val unsafeLong = ExpressionEncoder[Array[Long]].resolveAndBind().toRow(Array[Long](123L)).getArray(0)
scala> unsafeByte.equals(unsafeLong)
res12: Boolean = false

scala> unsafeByte.equals(unsafeInt)
res13: Boolean = false

scala> unsafeShort.equals(unsafeInt)
res14: Boolean = false

scala> unsafeShort.equals(unsafeLong)
res15: Boolean = false

Contributor

Without a schema, Spark can never compare a generic and an unsafe array. This fix is not a real bug fix, but it makes GenericArrayData's equals semantics clearer, which is good to have and might be useful when writing tests.

Member

@MaxGekk left a comment


Can these changes break users' apps if arrays of int and long (or other integer-like types) are no longer comparable?

        return false
      }
-     case _ => if (o1 != o2) {
+     case _ => if (!o1.equals(o2)) {
Member

Is there any need to handle Array[Byte] separately above?

Contributor

In Java, byte[] and other primitive arrays don't have a proper equals implementation; array equality falls back to reference comparison.

scala> Array(1) == Array(1)
res0: Boolean = false
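To expand on that with a plain Scala illustration (independent of Spark): primitive arrays only inherit Object.equals, which is reference equality, so comparing their contents needs java.util.Arrays.equals or Scala's sameElements. This is presumably why binary values get the separate handling mentioned above.

```scala
import java.util.Arrays

val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

println(a == b)              // false: == on arrays is reference equality
println(a.equals(b))         // false: Object.equals, same reason
println(Arrays.equals(a, b)) // true: element-wise content comparison
println(a.sameElements(b))   // true: Scala's element-wise comparison
```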

@SparkQA

SparkQA commented Jun 26, 2018

Test build #92335 has finished for PR 21643 at commit d91b44a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Contributor Author

#21643 (review)
For all practical purposes, no, this change shouldn't break any use cases that we support; there could be use cases that we didn't intend to support that might break.

In Spark SQL, SQL/DataFrame/Dataset operations are all type-checked, and the analyzer will strictly reject operations that involve incompatible types.
Array types in Spark SQL are neither covariant nor contravariant, so e.g. array<int> and array<long> would be considered incompatible, and the analyzer would have rejected equality comparison between them in the first place.

In all other cases, GenericArrayData is a Spark SQL internal type that shouldn't be used on the outside, and I don't think we ever intended to support direct access to this type from the outside.

@rednaxelafx
Contributor Author

Thanks for your comments, @MaxGekk @maropu @cloud-fan , I've tweaked the unit test case to address your comments. Please check it out again.

arraysShouldEqual(0.toByte, 123.toByte, (-123).toByte) // Byte
arraysShouldEqual(0.toShort, 123.toShort, (-256).toShort) // Short
arraysShouldEqual(0, 123, -65536) // Int
arraysShouldEqual(0L, 123L, -65536L) // Long
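A standalone sketch of what such a helper might look like (hypothetical names, written without the GenericArrayData dependency): box the values, build two arrays from the same contents, and assert that element-wise .equals holds.

```scala
// Hypothetical standalone analogue of the arraysShouldEqual helper above
// (the real test constructs GenericArrayData instances; this sketch only
// demonstrates the boxed element-wise comparison it relies on).
def boxedArraysEqual(a: Array[Any], b: Array[Any]): Boolean =
  a.length == b.length && a.indices.forall(i => a(i).equals(b(i)))

def arraysShouldEqualSketch(values: Any*): Unit = {
  // varargs of type Any* box each primitive, mirroring how
  // GenericArrayData stores its elements as objects
  val left  = values.toArray
  val right = values.toArray
  assert(boxedArraysEqual(left, right))
}

arraysShouldEqualSketch(0.toByte, 123.toByte, (-123).toByte)    // Byte
arraysShouldEqualSketch(0.toShort, 123.toShort, (-256).toShort) // Short
arraysShouldEqualSketch(0, 123, -65536)                         // Int
arraysShouldEqualSketch(0L, 123L, -65536L)                      // Long
```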
Member

It's not important, but if you are checking corner cases it probably makes sense to pass values like Long.MinValue and Double.MaxValue.

Contributor Author

That's a good one. I can do that (and NaNs/Infinity for floating-point types too).
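Worth noting for the NaN case (plain Scala, independent of this patch): primitive IEEE 754 comparison and boxed .equals disagree on NaN, so the corner-case tests exercise a real difference between == and .equals.

```scala
val nan1: Any = Double.box(Double.NaN)
val nan2: Any = Double.box(Double.NaN)

println(Double.NaN == Double.NaN) // false: IEEE 754 primitive comparison
println(nan1 == nan2)             // false: cooperative numeric equality also
                                  //        compares the primitive values
println(nan1.equals(nan2))        // true: java.lang.Double.equals compares
                                  //       the bit patterns, so NaN equals NaN
```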

@SparkQA

SparkQA commented Jun 26, 2018

Test build #92352 has finished for PR 21643 at commit 083673c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2018

Test build #92353 has finished for PR 21643 at commit a30de22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test("SPARK-24659: GenericArrayData.equals should respect element type differences") {
import scala.reflect.ClassTag
Member

nit: you can move this import to the head of this file.

Contributor Author

Thanks for your suggestion! I'm used to making one-off imports inside a function when an import is only used within that function, so that the scope is as narrow as possible without being disturbing.
Are there any Spark coding style guidelines that suggest otherwise? If so, I'll follow the guideline and always import at the beginning of the file.

@maropu
Member

maropu commented Jun 27, 2018

LGTM except for one minor comment.

Member

@viirya left a comment


LGTM

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 1b9368f Jun 27, 2018

6 participants