Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -282,6 +282,7 @@ abstract class HashExpression[E] extends Expression {
}

val hashResultType = CodeGenerator.javaType(dataType)
val typedSeed = if (dataType.sameType(LongType)) s"${seed}L" else s"$seed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR, @patrickcording . BTW, this seems to change the hash result, doesn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it change the result? it would just let it not fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't change the result. For xxHash, the generated code just becomes long varName = 123L instead of long varName = 123. When the seed is <= 2^31, theres no difference between the two statements. When it is > 2^31, compilation of the generated code would fail without the L, so it is now possible to use any 64 bit seed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think the concern was something like: a 4-byte int seed isn't the same as an 8-byte long seed when it comes to hashing, even if they have the same integer value. But here, this should only affect HashExpression[Long]s like XxHash64, which already would have promoted any int seed value to long in generated code.

val codes = ctx.splitExpressionsWithCurrentInputs(
expressions = childrenHash,
funcName = "computeHash",
Expand All @@ -296,7 +297,7 @@ abstract class HashExpression[E] extends Expression {

ev.copy(code =
code"""
|$hashResultType ${ev.value} = $seed;
|$hashResultType ${ev.value} = $typedSeed;
|$codes
""".stripMargin)
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -684,6 +684,33 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
assert(murmur3HashPlan(wideRow).getInt(0) == murmursHashEval)
}

test("SPARK-30633: xxHash with different type seeds") {
val literal = Literal.create(42L, LongType)

val longSeeds = Seq(
Long.MinValue,
Integer.MIN_VALUE.toLong - 1L,
0L,
Integer.MAX_VALUE.toLong + 1L,
Long.MaxValue
)
for (seed <- longSeeds) {
checkEvaluation(XxHash64(Seq(literal), seed), XxHash64(Seq(literal), seed).eval())
}

val intSeeds = Seq(
Integer.MIN_VALUE,
0,
Integer.MAX_VALUE
)
for (seed <- intSeeds) {
checkEvaluation(XxHash64(Seq(literal), seed), XxHash64(Seq(literal), seed).eval())
}

checkEvaluation(XxHash64(Seq(literal), 100), XxHash64(Seq(literal), 100L).eval())
checkEvaluation(XxHash64(Seq(literal), 100L), XxHash64(Seq(literal), 100).eval())
}

private def testHash(inputSchema: StructType): Unit = {
val inputGenerator = RandomDataGenerator.forType(inputSchema, nullable = false).get
val encoder = RowEncoder(inputSchema)
Expand All @@ -700,5 +727,17 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
checkEvaluation(HiveHash(literals), HiveHash(literals).eval())
}
}

val longSeed = Math.abs(seed).toLong + Integer.MAX_VALUE.toLong
test(s"SPARK-30633: xxHash64 with long seed: ${inputSchema.simpleString}") {
for (_ <- 1 to 10) {
val input = encoder.toRow(inputGenerator.apply().asInstanceOf[Row]).asInstanceOf[UnsafeRow]
val literals = input.toSeq(inputSchema).zip(inputSchema.map(_.dataType)).map {
case (value, dt) => Literal.create(value, dt)
}
// Only test the interpreted version has same result with codegen version.
checkEvaluation(XxHash64(literals, longSeed), XxHash64(literals, longSeed).eval())
}
}
}
}