
Conversation

@ndrluis (Contributor) commented Sep 8, 2025

I was working on a PyIceberg issue (apache/iceberg-python#2372) and wrote a new test in which PyIceberg writes one Parquet file and PySpark writes another. I wanted to ensure that both Parquet files can be read, but I started receiving this exception: java.util.UUID cannot be cast to class java.nio.ByteBuffer. This PR solves that problem to maintain compatibility between the two implementations.

@ndrluis (Contributor, Author) commented Sep 9, 2025

Thank you @huaxingao for the review; I made the requested changes.

@ndrluis requested a review from huaxingao September 10, 2025 01:00
@huaxingao (Contributor) left a comment:

LGTM

}

static Function<Object, Object> converterFromParquet(PrimitiveType type) {
if (type.getLogicalTypeAnnotation() instanceof UUIDLogicalTypeAnnotation) {
Contributor:

This fix seems OK to me, but the part I don't quite understand yet (I haven't dug into it): is this issue PyIceberg-specific? What's different about, for instance, the dictionaries with UUIDs produced by Spark, and why don't those fail?

Contributor:

cc @Fokko, who may have some insights here too, since I know he was working on some UUID-related fixes in the past.

Contributor (Author):

This change does not solve all the problems. I'm doing some experiments, playing around with PyIceberg and Spark, and I've discovered some other things that I'm double-checking. I intend to add a more detailed analysis here over the weekend.

@Fokko self-requested a review September 11, 2025 16:05
@ndrluis (Contributor, Author) commented Sep 14, 2025

Quick update on this issue: I'm going to focus on solving this problem on the Java side first. Once Iceberg Java has the correct behavior, I'll come back to PyIceberg and make the necessary adjustments. Here's the minimal test that I'm running using PySpark (since I'm more familiar with it than with the Java environment).

Tested with the following Iceberg Runtimes:
org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.9.0
org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.10.0

Test Case

import pytest
from pyspark.sql import SparkSession

from pyiceberg.catalog import Catalog, load_catalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

# Note: `_create_table` and the `session_catalog`/`spark` fixtures come from
# the PyIceberg integration-test suite.


@pytest.mark.integration
def test_uuid_write_read_with_pyspark(session_catalog: Catalog, spark: SparkSession) -> None:
    identifier = "default.test_uuid_write_and_read_with_pyspark"

    catalog = load_catalog("default", type="in-memory")
    catalog.create_namespace("ns")

    schema = Schema(NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False))

    try:
        session_catalog.drop_table(identifier=identifier)
    except NoSuchTableError:
        pass

    table = _create_table(session_catalog, identifier, {"format-version": "2"}, schema=schema)

    spark.sql(
        f"""
        INSERT INTO {identifier} VALUES ("22222222-2222-2222-2222-222222222222")
        """
    )
    df = spark.table(identifier)

    assert df.count() == 1

    result = df.where("uuid_col = '22222222-2222-2222-2222-222222222222'")
    assert result.count() == 1

Error
The test passes for df.count() but fails when applying the WHERE condition: the row-group metrics filter ends up comparing the file's ByteBuffer min/max statistics with the java.util.UUID filter literal, producing the following error:

25/09/14 12:45:49 ERROR BaseReader: Error reading file(s): s3://warehouse/default/test_uuid_write_and_read_with_pyspark/data/00000-0-c8b11c46-5ef7-426e-a1d5-de8aa720af6d-0-00001.parquet
java.lang.ClassCastException: class java.util.UUID cannot be cast to class java.nio.ByteBuffer (java.util.UUID and java.nio.ByteBuffer are in module java.base of loader 'bootstrap')
        at java.base/java.nio.ByteBuffer.compareTo(ByteBuffer.java:267)
        at java.base/java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:52)
        at java.base/java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:47)
        at org.apache.iceberg.types.Comparators$NullSafeChainedComparator.compare(Comparators.java:253)
        at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:352)
        at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:79)
        at org.apache.iceberg.expressions.ExpressionVisitors$BoundExpressionVisitor.predicate(ExpressionVisitors.java:162)
        at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:390)
        at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:409)
        at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eval(ParquetMetricsRowGroupFilter.java:103)
        at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter.shouldRead(ParquetMetricsRowGroupFilter.java:73)
        at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:108)
        at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
        at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
        at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:116)
        at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:43)
        at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:134)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
        [... rest of stack trace ...]

@ndrluis (Contributor, Author) commented Sep 15, 2025

@huaxingao @amogh-jahagirdar @Fokko with my latest commit, I was able to fix both cases. Since PyArrow (the version used by PyIceberg) does not add the logical type annotation, and since we are reverting to binary(16) in the visitor to represent the type on the PyIceberg side, we will only have this information once PyArrow has full UUID support. It is therefore safer to check the Iceberg type instead of the Parquet logical type annotation.

I have already tested the scenario of writing with PyIceberg using binary(16) and reading with this branch.
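
To make the contrast concrete, here is a minimal sketch of the idea. The class and method names are illustrative, not the actual patch; the Type.TypeID.UUID check and the UUIDUtil.convert call are the pieces taken from the change itself:

import java.util.function.Function;

import org.apache.iceberg.types.Type;
import org.apache.iceberg.util.UUIDUtil;
import org.apache.parquet.io.api.Binary;

class UuidConverterSketch {
  // Before: the converter was chosen from the Parquet logical type annotation,
  // which files written by PyArrow/PyIceberg may not carry even when the
  // Iceberg schema declares the column as a UUID:
  //   if (type.getLogicalTypeAnnotation() instanceof UUIDLogicalTypeAnnotation) { ... }
  //
  // After: choose it from the Iceberg type, which always comes from the table schema.
  static Function<Object, Object> uuidConverter(Type icebergType) {
    if (icebergType.typeId() == Type.TypeID.UUID) {
      // Parquet hands back a 16-byte Binary; convert it to java.util.UUID.
      return binary -> UUIDUtil.convert(((Binary) binary).toByteBuffer());
    }
    return Function.identity();
  }
}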

@ndrluis changed the title from "Data, Parquet: Fix UUID ClassCastException when reading Parquet files with UUIDs written by PyIceberg" to "Data, Parquet: Fix UUID ClassCastException when reading Parquet files with UUIDs" Sep 23, 2025
Comment on lines +86 to +87
} else if (icebergType.typeId() == Type.TypeID.UUID) {
return binary -> UUIDUtil.convert(((Binary) binary).toByteBuffer());
Contributor:

This seems like an odd place to apply this conversion, since the rows above are more about schema evolution. However, looking at it a bit closer, I think it makes sense. Other logical types, such as TimestampLiteral, store the primitive type internally (a long), while UUIDLiteral keeps a UUID rather than bytes.

This will just compare the bytes using an unsigned lexicographical binary comparator.
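
For illustration, a plain-JDK sketch of that comparison (this is not the Iceberg comparator itself, and it assumes the 16-byte big-endian layout, most-significant bits first, used for UUID values):

import java.nio.ByteBuffer;
import java.util.UUID;

class UuidCompareSketch {
  // Lay out the UUID as 16 big-endian bytes.
  static ByteBuffer toBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(0, uuid.getMostSignificantBits());
    buf.putLong(8, uuid.getLeastSignificantBits());
    return buf;
  }

  // Unsigned lexicographic comparison, byte by byte.
  static int compareUnsigned(ByteBuffer a, ByteBuffer b) {
    for (int i = 0; i < 16; i++) {
      int cmp = Integer.compare(a.get(i) & 0xFF, b.get(i) & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }
}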

@shangxinli (Contributor) commented:

The schema defines the column as optional. Can you add tests for null UUID values?
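
For example, such a test could look roughly like this (an illustrative sketch following the shouldRead pattern used in the tests below; the expected outcomes assume the test data mixes UUID values and nulls):

import static org.apache.iceberg.expressions.Expressions.isNull;
import static org.apache.iceberg.expressions.Expressions.notNull;
import static org.assertj.core.api.Assertions.assertThat;

@Test
public void testUUIDNulls() {
  assumeThat(format).as("Only valid for Parquet").isEqualTo(FileFormat.PARQUET);

  // The column is optional and the data contains both UUIDs and nulls,
  // so both predicates should keep the row group.
  assertThat(shouldRead(isNull("uuid_col"))).isTrue();
  assertThat(shouldRead(notNull("uuid_col"))).isTrue();
}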

@ndrluis force-pushed the fix-uuid branch 3 times, most recently from 249d25d to 9c70716 on September 24, 2025 15:33
@ndrluis (Contributor, Author) commented Sep 24, 2025

@Fokko @shangxinli I made the suggested changes.

@Fokko added this to the Iceberg 1.10.1 milestone Sep 25, 2025
record.setField("_struct_not_null", structNotNull); // struct with int

record.setField(
"_uuid_col", (i % 3 == 0) ? UUID_WITH_ZEROS : (i % 3 == 1) ? UUID_WITH_ONES : null);
Contributor:

minor: what's the reason for doing the modulo here? why not just write UUID_WITH_ZEROS?

Contributor (Author):

The idea was to use different values to make sure the filtering works, but now I think just UUID_WITH_ZEROS and null are enough. WDYT?

Contributor:

I think it comes down to also testing other expressions. See my other comment on this.

public void testUUIDEq() {
assumeThat(format).as("Only valid for Parquet").isEqualTo(FileFormat.PARQUET);

boolean shouldRead = shouldRead(equal("uuid_col", UUID_WITH_ZEROS));
Contributor:

what about testing other expressions?

@shangxinli (Contributor) commented:

LGTM

@ndrluis requested a review from nastra October 3, 2025 18:40
@ndrluis (Contributor, Author) commented Oct 6, 2025

@nastra I made the suggested changes.


private static final UUID UUID_WITH_ZEROS =
UUID.fromString("00000000-0000-0000-0000-000000000000");
private static final UUID UUID_WITH_ONES =
    UUID.fromString("11111111-1111-1111-1111-111111111111");
Contributor:

when is this actually used?

Contributor (Author):

My bad, the b51a298 commit solves this.

@Fokko requested a review from nastra October 9, 2025 15:18

UUID nonExistentUuid = UUID.fromString("99999999-9999-9999-9999-999999999999");

boolean shouldRead = shouldRead(notEqual("uuid_col", UUID_WITH_ZEROS));
@nastra (Contributor) commented Oct 10, 2025:

The test is still missing equal/greaterThan/lessThan. Please also update the other test.
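
For instance, the missing comparisons could look roughly like this (an illustrative sketch reusing the test class's shouldRead helper; the expected outcomes assume all non-null values in the test data are UUID_WITH_ZEROS):

import static org.apache.iceberg.expressions.Expressions.greaterThan;
import static org.apache.iceberg.expressions.Expressions.lessThan;
import static org.assertj.core.api.Assertions.assertThat;

@Test
public void testUUIDComparisons() {
  assumeThat(format).as("Only valid for Parquet").isEqualTo(FileFormat.PARQUET);

  // min == max == the all-zero UUID, so no value can sort below it...
  assertThat(shouldRead(lessThan("uuid_col", UUID_WITH_ZEROS))).isFalse();
  // ...and no value can sort above it either.
  assertThat(shouldRead(greaterThan("uuid_col", UUID_WITH_ZEROS))).isFalse();
}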

structNotNull.setField("_int_field", INT_MIN_VALUE + i);
record.setField("_struct_not_null", structNotNull); // struct with int

record.setField("_uuid_col", (i % 2 == 0) ? UUID_WITH_ZEROS : null);
Contributor:

nit: newline right above this line can be removed

@nastra (Contributor) left a comment:

LGTM, thanks @ndrluis

@nastra (Contributor) commented Oct 15, 2025

I'll leave this open for a bit in case @huaxingao wants to review this as well

@huaxingao (Contributor) left a comment:

LGTM

@nastra merged commit ef40079 into apache:main Oct 15, 2025
42 checks passed
huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Nov 6, 2025
huaxingao added a commit that referenced this pull request Nov 6, 2025
… with UUIDs (#14027) (#14523)

(cherry picked from commit ef40079)

Co-authored-by: Andre Luis Anastacio <[email protected]>