Data, Parquet: Fix UUID ClassCastException when reading Parquet files with UUIDs #14027
Conversation
Thank you @huaxingao for the review, I made the requested changes.
huaxingao
left a comment
LGTM
In parquet/src/main/java/org/apache/iceberg/parquet/ParquetConversions.java:

```java
static Function<Object, Object> converterFromParquet(PrimitiveType type) {
  if (type.getLogicalTypeAnnotation() instanceof UUIDLogicalTypeAnnotation) {
```
This fix seems OK to me, but the part I don't quite understand yet (I haven't dug into it) is: is this issue PyIceberg-specific? What's different about, for instance, the dictionaries with UUIDs produced by Spark, and why doesn't that fail?
cc @Fokko, who may have some insights here too, since I know he was working on some UUID-related fixes in the past.
This change does not solve all the problems. I'm doing some experiments, playing around with PyIceberg and Spark, and I discovered some other things that I'm double-checking. I intend to add a more detailed analysis here over the weekend.
Quick update on this issue: I'm going to focus on solving this problem on the Java side first. Once Iceberg Java has the correct behavior, I'll come back to PyIceberg and make the necessary adjustments. Here's the minimal test that I'm running using PySpark (since I have more familiarity with it than with the Java environment). Tested with the following Iceberg runtimes:

Test Case (imports added for completeness; `_create_table` is a helper from the PyIceberg integration test suite):

```python
import pytest
from pyspark.sql import SparkSession

from pyiceberg.catalog import Catalog, load_catalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType


@pytest.mark.integration
def test_uuid_write_read_with_pyspark(session_catalog: Catalog, spark: SparkSession) -> None:
    identifier = "default.test_uuid_write_and_read_with_pyspark"
    catalog = load_catalog("default", type="in-memory")
    catalog.create_namespace("ns")
    schema = Schema(NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False))

    try:
        session_catalog.drop_table(identifier=identifier)
    except NoSuchTableError:
        pass

    table = _create_table(session_catalog, identifier, {"format-version": "2"}, schema=schema)

    spark.sql(
        f"""
        INSERT INTO {identifier} VALUES ("22222222-2222-2222-2222-222222222222")
        """
    )

    df = spark.table(identifier)
    assert df.count() == 1

    result = df.where("uuid_col = '22222222-2222-2222-2222-222222222222'")
    assert result.count() == 1
```

Error: `java.util.UUID cannot be cast to class java.nio.ByteBuffer`
@huaxingao @amogh-jahagirdar @Fokko With my latest commit, I was able to fix both cases. Since PyArrow (the version used by PyIceberg) does not add the logical type annotation, and since we are reverting to using binary(16) in the visitor to represent the type on the PyIceberg side, we will only have this information once PyArrow has full UUID support. Therefore, it's safer for us to check the Iceberg type instead of the Parquet logical type annotation. I have already tested the scenario of writing with PyIceberg using binary(16) and reading with this branch.
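For illustration, here is a minimal sketch of the dispatch change being described; the class name and method signature here are mine, not Iceberg's, and the UUID branch matches the diff quoted just below:

```java
import java.util.function.Function;

import org.apache.iceberg.types.Type;
import org.apache.iceberg.util.UUIDUtil;
import org.apache.parquet.io.api.Binary;

// Hypothetical sketch: key the conversion off the Iceberg schema type rather
// than the Parquet logical type annotation, so files written by PyArrow as a
// plain binary(16) with no UUID annotation still get converted correctly.
class UuidConverterSketch {
  static Function<Object, Object> converterFor(Type icebergType) {
    if (icebergType.typeId() == Type.TypeID.UUID) {
      // Same conversion as the branch quoted from the PR diff below.
      return binary -> UUIDUtil.convert(((Binary) binary).toByteBuffer());
    }
    return Function.identity();
  }
}
```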
```java
} else if (icebergType.typeId() == Type.TypeID.UUID) {
  return binary -> UUIDUtil.convert(((Binary) binary).toByteBuffer());
```
This seems like an odd place to apply this conversion, since the rows above are more about schema evolution. However, looking at it a bit closer, I think it makes sense. Other logical types, such as TimestampLiteral, store the primitive type internally (a long), while UUIDLiteral keeps a UUID rather than bytes.
This will just compare the bytes using an unsigned lexicographical binary comparator.
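For illustration, a minimal, hypothetical stand-in for such a comparator (not Iceberg's actual implementation): it compares the two byte sequences position by position as unsigned values, with the shorter sequence ordering first on a tie.

```java
import java.util.Comparator;

// Hypothetical sketch of an unsigned lexicographical byte comparator, the
// kind of comparison applied to the UUID bytes once the literal is binary.
class UnsignedBytes {
  static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC =
      (a, b) -> {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
          // Byte.toUnsignedInt maps -1 to 255, so 0xFF sorts above 0x00.
          int cmp = Integer.compare(Byte.toUnsignedInt(a[i]), Byte.toUnsignedInt(b[i]));
          if (cmp != 0) {
            return cmp;
          }
        }
        return Integer.compare(a.length, b.length);
      };
}
```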
The schema defines the column as optional. Can you add tests for null UUID values?
@Fokko @shangxinli I made the suggested changes.
```java
record.setField("_struct_not_null", structNotNull); // struct with int

record.setField(
    "_uuid_col", (i % 3 == 0) ? UUID_WITH_ZEROS : (i % 3 == 1) ? UUID_WITH_ONES : null);
```
minor: what's the reason for doing the modulo here? why not just write UUID_WITH_ZEROS?
The idea was to use different values to make sure the filtering works, but now I think just UUID_WITH_ZEROS and null are enough. WDYT?
I think it comes down to also testing other expressions. See my other comment on this.
```java
public void testUUIDEq() {
  assumeThat(format).as("Only valid for Parquet").isEqualTo(FileFormat.PARQUET);

  boolean shouldRead = shouldRead(equal("uuid_col", UUID_WITH_ZEROS));
```
what about testing other expressions?
LGTM
@nastra I made the suggested changes.
```java
private static final UUID UUID_WITH_ZEROS =
    UUID.fromString("00000000-0000-0000-0000-000000000000");
private static final UUID UUID_WITH_ONES =
```
when is this actually used?
My bad, the b51a298 commit solves this.
```java
UUID nonExistentUuid = UUID.fromString("99999999-9999-9999-9999-999999999999");

boolean shouldRead = shouldRead(notEqual("uuid_col", UUID_WITH_ZEROS));
```
The test is still missing equal/greaterThan/lessThan; please also update the other test.
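For illustration, a hypothetical fragment of that extra coverage, patterned on the testUUIDEq excerpt above. The expected boolean for each call depends on which UUID values the setup writes, so no assertions are shown; equal/notEqual/lessThan/greaterThan are the factory methods from org.apache.iceberg.expressions.Expressions.

```java
// Hypothetical additions inside the same test class as testUUIDEq above.
boolean eq = shouldRead(equal("uuid_col", UUID_WITH_ZEROS));
boolean ne = shouldRead(notEqual("uuid_col", UUID_WITH_ZEROS));
boolean lt = shouldRead(lessThan("uuid_col", UUID_WITH_ZEROS));
boolean gt = shouldRead(greaterThan("uuid_col", UUID_WITH_ZEROS));
```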
```java
structNotNull.setField("_int_field", INT_MIN_VALUE + i);
record.setField("_struct_not_null", structNotNull); // struct with int

record.setField("_uuid_col", (i % 2 == 0) ? UUID_WITH_ZEROS : null);
```
nit: newline right above this line can be removed
nastra
left a comment
LGTM, thanks @ndrluis
I'll leave this open for a bit in case @huaxingao wants to review this as well.
huaxingao
left a comment
LGTM
… with UUIDs (apache#14027) (cherry picked from commit ef40079)
… with UUIDs (#14027) (#14523) (cherry picked from commit ef40079) Co-authored-by: Andre Luis Anastacio <[email protected]>
I was working on this PyIceberg issue (apache/iceberg-python#2372) and wrote a new test where PyIceberg writes one Parquet file and PySpark writes another. I wanted to ensure that we are able to read from both Parquet files, but I started receiving this exception: java.util.UUID cannot be cast to class java.nio.ByteBuffer. So this PR focuses on solving that problem to maintain compatibility between the two implementations.
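To make the root cause concrete: Iceberg stores a UUID in Parquet as 16 bytes, and depending on the writer those bytes may or may not carry the UUID logical type annotation, so readers can surface the value either as java.util.UUID or as raw binary. Below is a minimal, self-contained sketch of the 16-byte big-endian round trip between the two representations (plain java.nio code, not Iceberg's UUIDUtil):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Round-trips a java.util.UUID through the 16-byte big-endian layout used
// for UUIDs in Parquet/Iceberg (most significant 8 bytes first).
public class UuidBytesRoundTrip {
  static ByteBuffer toByteBuffer(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    buf.flip();
    return buf;
  }

  static UUID fromByteBuffer(ByteBuffer buf) {
    return new UUID(buf.getLong(), buf.getLong());
  }

  public static void main(String[] args) {
    UUID uuid = UUID.fromString("22222222-2222-2222-2222-222222222222");
    System.out.println(fromByteBuffer(toByteBuffer(uuid)).equals(uuid)); // prints true
  }
}
```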