Spark 3.3 - Support bucket in FunctionCatalog #5513
Conversation
Force-pushed from 24a8c1d to 18e5904
  public Integer produceResult(InternalRow input) {
    // return null for null input to match what Spark does in the code-generated versions.
    return input.isNullAt(NUM_BUCKETS_ORDINAL) || input.isNullAt(VALUE_ORDINAL)
        ? null
        : invoke(input.getInt(NUM_BUCKETS_ORDINAL), input.getUTF8String(VALUE_ORDINAL));
  }
}
For some reason, when produceResult was defined in the superclass, Spark complained that it was not defined and errored out, so I've moved all of that logic into each subclass.
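For illustration only, here is a minimal sketch of that pattern, not the PR's actual BucketFunction code: a concrete class defines both the static magic invoke method and its own produceResult, assuming Spark's ScalarFunction API. The class name, ordinals, and the placeholder hash are assumptions made for the sketch.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

public class BucketIntSketch implements ScalarFunction<Integer> {
  private static final int NUM_BUCKETS_ORDINAL = 0;
  private static final int VALUE_ORDINAL = 1;

  // Magic method: Spark's code-generated path calls this directly.
  public static int invoke(int numBuckets, int value) {
    // Placeholder hash; the real function uses Iceberg's bucket hash (Murmur3).
    return (Integer.hashCode(value) & Integer.MAX_VALUE) % numBuckets;
  }

  // Defined on the concrete class rather than a shared superclass, matching the
  // workaround described above.
  @Override
  public Integer produceResult(InternalRow input) {
    return input.isNullAt(NUM_BUCKETS_ORDINAL) || input.isNullAt(VALUE_ORDINAL)
        ? null
        : invoke(input.getInt(NUM_BUCKETS_ORDINAL), input.getInt(VALUE_ORDINAL));
  }

  @Override
  public DataType[] inputTypes() {
    return new DataType[] {DataTypes.IntegerType, DataTypes.IntegerType};
  }

  @Override
  public DataType resultType() {
    return DataTypes.IntegerType;
  }

  @Override
  public String name() {
    return "bucket";
  }
}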
Force-pushed from 06207b3 to 4634d0e
Force-pushed from 4eebf93 to bb29cc3
.hashBytes(
    value.array(),
    value.arrayOffset() + value.position(),
    value.arrayOffset() + value.remaining())
I just realized that this isn't correct. It has been wrong for years, evidently.
HashFunction.hashBytes accepts a length, and value.remaining() is that length.
Looks like this was only working because arrayOffset was never non-zero. As far as I can tell, the only way to get a ByteBuffer with a non-zero arrayOffset is ByteBuffer.slice(), which creates a new buffer whose arrayOffset reflects the slice's start position. Since slice doesn't allow setting the position and limit, everywhere I've been able to find uses duplicate() and then sets position and limit instead, because there's no need to restrict the start or capacity of the ByteBuffer when the backing array is not limited. Allocating or wrapping byte arrays always produces arrayOffset=0.
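For reference, a minimal sketch of the corrected call, assuming Guava's Murmur3 hash (Iceberg's BucketUtil uses a relocated copy of Guava); the class and method names are illustrative only:

import java.nio.ByteBuffer;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

class HashByteBufferSketch {
  // hashBytes(byte[] input, int off, int len) takes a length as its third argument,
  // so value.remaining() is passed as-is; arrayOffset() only adjusts the start offset.
  static int hash(ByteBuffer value) {
    // Assumes a heap buffer with an accessible backing array (value.hasArray()).
    HashFunction murmur3 = Hashing.murmur3_32_fixed();
    return murmur3
        .hashBytes(
            value.array(),
            value.arrayOffset() + value.position(),
            value.remaining())
        .asInt();
  }
}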
Here's a test that catches this. @kbendick, can you add this to TestBucketing along with the fix here (remove value.arrayOffset() from the length argument)?
@Test
public void testByteBufferOnHeapArrayOffset() {
  byte[] bytes = randomBytes(128);
  ByteBuffer raw = ByteBuffer.wrap(bytes, 5, 100);
  ByteBuffer buffer = raw.slice();
  Assert.assertEquals("Buffer arrayOffset should be 5", 5, buffer.arrayOffset());

  Bucket<ByteBuffer> bucketFunc = Bucket.get(Types.BinaryType.get(), 100);

  Assert.assertEquals(
      "HeapByteBuffer hash should match hash for correct slice",
      hashBytes(bytes, 5, 100),
      bucketFunc.hash(buffer));

  // verify that the buffer was not modified
  Assert.assertEquals("Buffer position should be 0", 0, buffer.position());
  Assert.assertEquals("Buffer limit should not change", 100, buffer.limit());
}
Yeah absolutely I'll add this test this evening.
I was potentially going to use a different hash function for the byte[] that Spark passes (or investigate it at least), but I'll make this change this evening!
Added the fix and the test.
I can put it in a separate PR if we think that would be better for people who cherry-pick.
return (BucketUtil.hashDecimal(value.toJavaBigDecimal()) & Integer.MAX_VALUE) % numBuckets;
}

public BucketDecimal(int precision, int scale)
The hash function is actually independent of precision and scale. It looks like the only reason to pass precision and scale (other than to preserve the input type) is for the canonical name. I think instead this can pass in the Spark type and pass that back through inputTypes. The canonical name should be iceberg.bucket(decimal).
The hash function is indeed independent of precision and scale. I've updated the canonical name.
But we do also need the precision and scale to get the Decimal value in produceResult, and it seems we need them for creating the correct inputType, as mentioned. Let me see about passing in the Spark type instead of passing those in.
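As a hypothetical sketch of that point: the hash itself ignores precision and scale, but InternalRow.getDecimal needs them to decode the value, and the matching Spark DecimalType still has to be reported through inputTypes. The class name, ordinals, and placeholder hash below are illustrative, not the PR's code.

import java.math.BigDecimal;
import org.apache.spark.sql.catalyst.InternalRow;

class BucketDecimalSketch {
  private static final int NUM_BUCKETS_ORDINAL = 0;
  private static final int VALUE_ORDINAL = 1;

  private final int precision;
  private final int scale;

  BucketDecimalSketch(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }

  Integer produceResult(InternalRow input) {
    if (input.isNullAt(NUM_BUCKETS_ORDINAL) || input.isNullAt(VALUE_ORDINAL)) {
      return null;
    }

    int numBuckets = input.getInt(NUM_BUCKETS_ORDINAL);
    // getDecimal requires precision and scale to reconstruct the Decimal value.
    BigDecimal value =
        input.getDecimal(VALUE_ORDINAL, precision, scale).toJavaBigDecimal();
    return (hashDecimal(value) & Integer.MAX_VALUE) % numBuckets;
  }

  private static int hashDecimal(BigDecimal value) {
    // Placeholder for BucketUtil.hashDecimal, which hashes the unscaled value.
    return value.unscaledValue().hashCode();
  }
}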
public void testBucketIntegers() {
  Assert.assertEquals(
      "Byte type should bucket similarly to integer",
      3,
I would probably validate the result against the transform result. That seems safer to me.
I added a test suite, testValuesFromSpec, that tests against the hashed values.
I could add more tests against just the transform result if we want, but we do at least have tests for the hashed transform!
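A rough sketch of the transform-result comparison suggested above; it is meant to sit inside TestSparkBucketFunction, reusing its scalarSql() helper, and the method name and values are illustrative. Transforms.bucket(type, numBuckets) is assumed as the way to get the reference transform.

// Illustrative only; assumes this lives in TestSparkBucketFunction (which provides
// scalarSql()) and that org.apache.iceberg.transforms.Transforms and
// org.apache.iceberg.types.Types are imported.
@Test
public void testBucketIntegerMatchesTransform() {
  int numBuckets = 16;
  int value = 34;
  Integer expected = Transforms.bucket(Types.IntegerType.get(), numBuckets).apply(value);
  Assert.assertEquals(
      "SQL bucket() should match the Iceberg bucket transform",
      expected,
      scalarSql("SELECT system.bucket(%s, %s)", numBuckets, value));
}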
Assert.assertEquals(10, scalarSql("SELECT system.bucket(12, %s)", asBytesLiteral("abcdefg")));
Assert.assertEquals(13, scalarSql("SELECT system.bucket(18, %s)", asBytesLiteral("abc\0\0")));
Assert.assertEquals(42, scalarSql("SELECT system.bucket(48, %s)", asBytesLiteral("abc")));
Assert.assertEquals(3, scalarSql("SELECT system.bucket(16, %s)", asBytesLiteral("测试_")));
Are these all of the tests from the bucket function tests?
There aren't that many test cases in the Bucket transform, IIRC.
I'll be sure to grab all of the test cases that are in the bucket transform to ensure consistency, in addition to checking the transform result instead of just the final bucketed result.
Correction: the bucket function test suite uses a seeded random to generate data.
I did, however, test the output of all of the values from the spec against their hashed output. Let me know if that's sufficient or if I should update these to be against the hash output (or both).
Force-pushed from bb29cc3 to f14c2e7
…wherever possible
…ortint for number of buckets - already covered by test that checks that those can be used and implicitly by the check that magic functions are called
…le only matter for the input type binding
…in all subclasses
…s precision and scale are not important to the hash function itself
Force-pushed from 009a27e to 1fb393d
…etInteger, where we explicitly want Spark to cast short and byte to integer type
… value (not the bucketed value)
@rdblue is there anything else you want me to update before this is ready to merge, or are we just giving others time to review? Also cc @huaxingao @aokolnychyi the Spark
}

// TODO - We can probably hash the bytes directly given they're already UTF-8 input.
return apply(numBuckets, hash(value.toString()));
I think since this is already a UTF-8 string, we can hash the bytes directly instead of converting to a Java String (UTF-16, for CharSequence), converting back to UTF-8 bytes, and then hashing.
But this might make the testing of the hash function harder. Let me check.
Yeah, if we change hash here to be public static int hash(ByteBuffer value), we could then return apply(numBuckets, hash(value.getByteBuffer())), which returns the same results.
I think this change is worth it, though it makes the tests slightly wonkier for the test suite that checks the hash function output directly, as we have to call ByteBuffer.wrap("iceberg".getBytes("UTF-8")).
But it would arguably make the bucket string function faster on the UTF8String input that will get passed in at runtime.
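Roughly, that idea could look like the sketch below: hash the UTF8String's UTF-8 bytes directly, with no UTF-16 round trip. It assumes a Murmur3 hash over the buffer contents (as BucketUtil uses) and an on-heap backing array; the class and method names are illustrative, not the PR's final code.

import java.nio.ByteBuffer;
import com.google.common.hash.Hashing;
import org.apache.spark.unsafe.types.UTF8String;

class BucketStringSketch {
  // Hash the raw UTF-8 bytes; assumes a heap buffer with an accessible array.
  public static int hash(ByteBuffer value) {
    return Hashing.murmur3_32_fixed()
        .hashBytes(
            value.array(),
            value.arrayOffset() + value.position(),
            value.remaining())
        .asInt();
  }

  public static Integer invoke(int numBuckets, UTF8String value) {
    if (value == null) {
      return null;
    }

    // getByteBuffer() exposes the string's UTF-8 bytes without decoding to a Java String.
    return (hash(value.getByteBuffer()) & Integer.MAX_VALUE) % numBuckets;
  }
}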
@kbendick, I think I was waiting for tests to pass. Looks good now.
(cherry picked from commit 69bcf05)
Adds a bucket function to the Spark FunctionCatalog that performs the Iceberg bucket partition transformation, for use from SQL and with storage partitioned joins.

This closes #5349
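As a quick usage sketch, assuming a SparkSession configured with an Iceberg catalog named "local" (the catalog name, warehouse path, and literal value are illustrative), the function resolves through the catalog's system namespace:

import org.apache.spark.sql.SparkSession;

public class BucketFunctionUsage {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bucket-function-example")
        .master("local[*]")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
        .getOrCreate();

    // Compute the Iceberg bucket for a literal value through the FunctionCatalog.
    spark.sql("SELECT local.system.bucket(16, 'iceberg') AS bucket").show();
  }
}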