
Conversation

@kbendick (Contributor) commented Aug 12, 2022

Adds a bucket function to the Spark FunctionCatalog that performs the Iceberg bucket partition transform, for use from SQL and with storage partitioned joins.

This closes #5349
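
For context, a minimal, hypothetical usage sketch of the SQL side (the catalog setup is an assumption, not part of this PR; the `system.bucket(...)` call form matches the tests later in this thread):

import org.apache.spark.sql.SparkSession;

// Assumes a SparkSession whose current catalog is an Iceberg Spark catalog, so the
// bucket function resolves under the `system` namespace exposed by the FunctionCatalog.
public class BucketFunctionUsageSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Scalar use from SQL: bucket the literal 42 into 16 buckets using Iceberg's transform.
    spark.sql("SELECT system.bucket(16, 42) AS bucket").show();
  }
}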

@kbendick kbendick force-pushed the kb-add-spark-bucket-function-squash branch 2 times, most recently from 24a8c1d to 18e5904 Compare August 12, 2022 18:18
@kbendick kbendick changed the title Spark - Add Buket Function for FunctionCatalog Spark - Add Bucket Function for FunctionCatalog Aug 12, 2022
@kbendick kbendick marked this pull request as ready for review August 12, 2022 18:20
@kbendick kbendick changed the title Spark - Add Bucket Function for FunctionCatalog Spark 3.3 - Support bucket in FunctionCatalog Aug 12, 2022
Comment on lines 234 to 238
  public Integer produceResult(InternalRow input) {
    // return null for null input to match what Spark does in the code-generated versions.
    return input.isNullAt(NUM_BUCKETS_ORDINAL) || input.isNullAt(VALUE_ORDINAL)
        ? null
        : invoke(input.getInt(NUM_BUCKETS_ORDINAL), input.getUTF8String(VALUE_ORDINAL));
  }
}
Contributor Author

For some reason, when produceResult was defined in the superclass, Spark complained that it was not defined and errored out.

So I've moved all of that logic into each subclass.

@kbendick kbendick force-pushed the kb-add-spark-bucket-function-squash branch from 06207b3 to 4634d0e Compare August 12, 2022 18:36
@kbendick kbendick force-pushed the kb-add-spark-bucket-function-squash branch from 4eebf93 to bb29cc3 Compare August 12, 2022 22:05
.hashBytes(
    value.array(),
    value.arrayOffset() + value.position(),
    value.arrayOffset() + value.remaining())
@rdblue (Contributor) Aug 13, 2022

I just realized that this isn't correct. It has been wrong for years, evidently.

HashFunction.hashBytes accepts a length. value.remaining() is that length.

Looks like this was working because arrayOffset was never non-zero. That's because the only way to get a ByteBuffer with a non-zero arrayOffset (as far as I can tell) is to use ByteBuffer.slice(), which creates a new buffer over the remaining content and sets the arrayOffset. Since slice doesn't allow setting the position and limit, everywhere I've been able to find uses duplicate() and then sets position and limit, because there's no need to limit the start or capacity of the ByteBuffer when the backing array is not limited. Allocating and wrapping byte arrays always produce arrayOffset=0.

Here's a test that catches this. @kbendick, can you add this to TestBucketing along with the fix here (remove value.arrayOffset() from the length argument)?

  @Test
  public void testByteBufferOnHeapArrayOffset() {
    byte[] bytes = randomBytes(128);
    ByteBuffer raw = ByteBuffer.wrap(bytes, 5, 100);
    ByteBuffer buffer = raw.slice();
    Assert.assertEquals("Buffer arrayOffset should be 5", 5, buffer.arrayOffset());

    Bucket<ByteBuffer> bucketFunc = Bucket.get(Types.BinaryType.get(), 100);

    Assert.assertEquals(
        "HeapByteBuffer hash should match hash for correct slice",
        hashBytes(bytes, 5, 100),
        bucketFunc.hash(buffer));

    // verify that the buffer was not modified
    Assert.assertEquals("Buffer position should be 0", 0, buffer.position());
    Assert.assertEquals("Buffer limit should not change", 100, buffer.limit());
  }
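
For reference, a hedged sketch of the corrected call: hashBytes takes (bytes, offset, length), so arrayOffset() belongs only in the offset, and the length is just remaining(). The hasher and class name below are illustrative assumptions, not the exact code in this PR.

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.ByteBuffer;

class ByteBufferHashSketch {
  private static final HashFunction MURMUR3 = Hashing.murmur3_32_fixed();

  // Hash the readable window of an array-backed ByteBuffer without consuming it.
  static int hash(ByteBuffer value) {
    return MURMUR3
        .hashBytes(
            value.array(),
            // start of the window: account for a sliced buffer's arrayOffset()
            value.arrayOffset() + value.position(),
            // length of the window: remaining() alone; adding arrayOffset() here was the bug
            value.remaining())
        .asInt();
  }
}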

Contributor Author

Yeah, absolutely, I'll add this test this evening.

I was potentially going to use a different hash function for the byte[] that Spark passes (or at least investigate it), but I'll make this change this evening!

Contributor Author

Added the fix and the test.

I can put it in a separate PR if we think that would be better for people who cherry-pick.

    return (BucketUtil.hashDecimal(value.toJavaBigDecimal()) & Integer.MAX_VALUE) % numBuckets;
  }

  public BucketDecimal(int precision, int scale) {
Contributor

The hash function is actually independent of precision and scale. It looks like the only reason to pass precision and scale (other than to preserve the input type) is for the canonical name. I think instead this can pass in the Spark type and pass that back through inputTypes. The canonical name should be iceberg.bucket(decimal).

Contributor Author

The hash function is indeed independent of precision and scale. I've updated the canonical name.

But we do also need the precision and scale to get the Decimal value in produceResult, and it seems we need them for creating the correct inputType as mentioned. Let me see about not passing those in and passing in the Spark type instead.
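
A hedged sketch of what this exchange describes, assuming Spark's ScalarFunction API and the BucketUtil helper quoted above (the class name, package locations, and field names here are illustrative, not the PR's exact code): bind against the Spark DecimalType, echo it back through inputTypes(), use the precision/scale-free canonical name, and still use precision and scale to read the Decimal in produceResult.

import org.apache.iceberg.util.BucketUtil; // hashing helper quoted earlier in this thread; package assumed
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Decimal;
import org.apache.spark.sql.types.DecimalType;

class BucketDecimalSketch implements ScalarFunction<Integer> {
  private static final int NUM_BUCKETS_ORDINAL = 0;
  private static final int VALUE_ORDINAL = 1;

  private final DecimalType sqlType;

  BucketDecimalSketch(DecimalType sqlType) {
    this.sqlType = sqlType;
  }

  public static int invoke(int numBuckets, Decimal value) {
    return (BucketUtil.hashDecimal(value.toJavaBigDecimal()) & Integer.MAX_VALUE) % numBuckets;
  }

  @Override
  public DataType[] inputTypes() {
    // pass the bound Spark type back through so Spark does not re-cast the input
    return new DataType[] {DataTypes.IntegerType, sqlType};
  }

  @Override
  public DataType resultType() {
    return DataTypes.IntegerType;
  }

  @Override
  public String name() {
    return "bucket";
  }

  @Override
  public String canonicalName() {
    // independent of precision and scale, as discussed above
    return "iceberg.bucket(decimal)";
  }

  @Override
  public Integer produceResult(InternalRow input) {
    if (input.isNullAt(NUM_BUCKETS_ORDINAL) || input.isNullAt(VALUE_ORDINAL)) {
      return null;
    }

    // precision and scale are still needed here to read the Decimal out of the row
    Decimal value = input.getDecimal(VALUE_ORDINAL, sqlType.precision(), sqlType.scale());
    return invoke(input.getInt(NUM_BUCKETS_ORDINAL), value);
  }
}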

  public void testBucketIntegers() {
    Assert.assertEquals(
        "Byte type should bucket similarly to integer",
        3,
Contributor

I would probably validate the result against the transform result. That seems safer to me.

Contributor Author

I added a test suite, testValuesFromSpec, that tests against the hashed values.

I could add more tests against just the transform result now if we want, but we do at least have tests for the hashed transform!
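
For illustration, a hedged sketch of such a transform-based check, validating the function against the Iceberg transform rather than a hardcoded bucket (the BucketFunction.BucketInt name and its static invoke are assumptions based on the snippets in this thread):

import org.apache.iceberg.spark.functions.BucketFunction; // assumed package and class name
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;
import org.junit.Assert;
import org.junit.Test;

public class TestBucketMatchesTransformSketch {
  @Test
  public void integerBucketMatchesTransform() {
    int numBuckets = 16;
    int value = 34;

    // bucket index computed by the Iceberg partition transform
    int fromTransform =
        Transforms.bucket(Types.IntegerType.get(), numBuckets).apply(value);

    // bucket index computed by the Spark function (assumed class/method name)
    int fromFunction = BucketFunction.BucketInt.invoke(numBuckets, value);

    Assert.assertEquals(
        "Spark bucket function should match the Iceberg bucket transform",
        fromTransform,
        fromFunction);
  }
}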

Assert.assertEquals(10, scalarSql("SELECT system.bucket(12, %s)", asBytesLiteral("abcdefg")));
Assert.assertEquals(13, scalarSql("SELECT system.bucket(18, %s)", asBytesLiteral("abc\0\0")));
Assert.assertEquals(42, scalarSql("SELECT system.bucket(48, %s)", asBytesLiteral("abc")));
Assert.assertEquals(3, scalarSql("SELECT system.bucket(16, %s)", asBytesLiteral("测试_")));
Contributor

Are these all of the tests from the bucket function tests?

Contributor Author

There aren't that many test cases for the Bucket transform, iirc.

I'll be sure to grab all of the test cases that are in the bucket transform to ensure consistency, in addition to checking the transform result instead of just the final bucketed result.

Contributor Author

Correction. The bucket function test suite uses a seeded random to generate data.

I did, however, test the output of all of the values from the spec against their hashed output. Let me know if that's sufficient or if I should update these to test against the hash output (or both).

@kbendick kbendick force-pushed the kb-add-spark-bucket-function-squash branch from bb29cc3 to f14c2e7 Compare August 15, 2022 18:09
@kbendick kbendick force-pushed the kb-add-spark-bucket-function-squash branch from 009a27e to 1fb393d Compare August 15, 2022 18:52
…etInteger, where we explicitly want Spark to cast short and byte to integer type
@kbendick (Contributor, Author)

@rdblue is there anything else you want me to update before this is ready to merge, or are we just giving other people time to review?

Also cc @huaxingao @aokolnychyi: this is the Spark bucket function that can be used for storage partitioned joins on bucketed columns. PTAL. 🙂

}

// TODO - We can probably hash the bytes directly given they're already UTF-8 input.
return apply(numBuckets, hash(value.toString()));
Contributor Author

I think since this is already a UTF-8 string, we can hash the bytes directly instead of converting to a Java String (UTF-16, for CharSequence), then converting back to UTF-8 bytes, and then hashing.

Contributor Author

But this might make the testing of the hash function harder. Let me check.

Contributor Author

Yeah, if we change hash here to be public static int hash(ByteBuffer value), we could then return apply(numBuckets, hash(value.getByteBuffer())), which returns the same results.

I think this change is worth it, though it makes the tests slightly wonkier for the test suite that checks the hash function output directly, as we have to call ByteBuffer.wrap("iceberg".getBytes("UTF-8")).

But it would arguably make the bucket string function faster on UTF8String input, which is what will get passed in at runtime.
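
A hedged sketch of the optimization described above (the helper names are placeholders for the bucket helpers in this PR, and the hashing detail mirrors the hashBytes discussion earlier in the thread):

import com.google.common.hash.Hashing;
import java.nio.ByteBuffer;
import org.apache.spark.unsafe.types.UTF8String;

class BucketStringSketch {
  // Bucket a UTF8String by hashing its UTF-8 bytes directly, skipping the
  // UTF8String -> java.lang.String (UTF-16) -> UTF-8 byte[] round trip.
  static int bucket(int numBuckets, UTF8String value) {
    return apply(numBuckets, hash(value.getByteBuffer()));
  }

  static int apply(int numBuckets, int hash) {
    // same (hash & Integer.MAX_VALUE) % numBuckets mapping used elsewhere in this PR
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }

  static int hash(ByteBuffer buffer) {
    // UTF8String.getByteBuffer() returns an array-backed buffer, so the heap path suffices here
    return Hashing.murmur3_32_fixed()
        .hashBytes(
            buffer.array(),
            buffer.arrayOffset() + buffer.position(),
            buffer.remaining())
        .asInt();
  }
}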

@rdblue (Contributor) commented Aug 18, 2022

@kbendick, I think I was waiting for tests to pass. Looks good now.

@rdblue rdblue merged commit 69bcf05 into apache:master Aug 18, 2022
@kbendick kbendick deleted the kb-add-spark-bucket-function-squash branch August 18, 2022 16:54
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025

Successfully merging this pull request may close these issues:

Implement Spark’s FunctionCatalog for Existing Transformations