Data, Parquet: Fix UUID ClassCastException when reading Parquet files with UUIDs #14027
Conversation
Thank you @huaxingao for the review, I made the requested changes.
huaxingao
left a comment
LGTM
In parquet/src/main/java/org/apache/iceberg/parquet/ParquetConversions.java:

```java
static Function<Object, Object> converterFromParquet(PrimitiveType type) {
  if (type.getLogicalTypeAnnotation() instanceof UUIDLogicalTypeAnnotation) {
```
This fix seems OK to me, but the part I don't quite understand yet (I haven't dug into it) is: is this issue PyIceberg-specific? What's different about, for instance, the dictionaries with UUIDs produced by Spark, and why doesn't that fail?
cc @Fokko, who may have some insights here too, since I know he was working on some UUID-related fixes in the past.
This change does not solve all the problems. I'm doing some experiments, playing around with PyIceberg and Spark, and I discovered some other things that I'm double-checking. I intend to add a more detailed analysis here over the weekend.
Quick update on this issue: I'm going to focus on solving this problem on the Java side first. Once Iceberg Java has the correct behavior, I'll come back to PyIceberg and make the necessary adjustments. Here's the minimal test that I'm running using PySpark (since I have more familiarity with it than with the Java environment). Tested with the following Iceberg runtimes:

Test Case (imports added for completeness; `_create_table` is a helper from the PyIceberg integration test suite):

```python
import pytest
from pyspark.sql import SparkSession

from pyiceberg.catalog import Catalog, load_catalog
from pyiceberg.exceptions import NoSuchTableError
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType


@pytest.mark.integration
def test_uuid_write_read_with_pyspark(session_catalog: Catalog, spark: SparkSession) -> None:
    identifier = "default.test_uuid_write_and_read_with_pyspark"
    catalog = load_catalog("default", type="in-memory")
    catalog.create_namespace("ns")
    schema = Schema(NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False))

    try:
        session_catalog.drop_table(identifier=identifier)
    except NoSuchTableError:
        pass

    table = _create_table(session_catalog, identifier, {"format-version": "2"}, schema=schema)

    spark.sql(
        f"""
        INSERT INTO {identifier} VALUES ("22222222-2222-2222-2222-222222222222")
        """
    )

    df = spark.table(identifier)
    assert df.count() == 1

    result = df.where("uuid_col = '22222222-2222-2222-2222-222222222222'")
    assert result.count() == 1
```

Error: `java.util.UUID cannot be cast to class java.nio.ByteBuffer`
@huaxingao @amogh-jahagirdar @Fokko With my latest commit, I was able to fix both cases. Since PyArrow (the version used by PyIceberg) does not add the logical type annotation, and since we are reverting to using binary(16) in the visitor to represent the type on the PyIceberg side, we will only have this information once PyArrow has full UUID support. Therefore, it's safer for us to check the Iceberg type instead of the Parquet logical type annotation. I have already tested the scenario of writing with PyIceberg using binary(16) and reading with this branch.
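For illustration, here is a minimal sketch of the dispatch change being described; the class name and method signature here are mine, not Iceberg's, and the UUID branch matches the diff quoted just below:

```java
import java.util.function.Function;

import org.apache.iceberg.types.Type;
import org.apache.iceberg.util.UUIDUtil;
import org.apache.parquet.io.api.Binary;

// Hypothetical sketch: key the conversion off the Iceberg schema type rather
// than the Parquet logical type annotation, so files written by PyArrow as a
// plain binary(16) with no UUID annotation still get converted correctly.
class UuidConverterSketch {
  static Function<Object, Object> converterFor(Type icebergType) {
    if (icebergType.typeId() == Type.TypeID.UUID) {
      // Same conversion as the branch quoted from the PR diff below.
      return binary -> UUIDUtil.convert(((Binary) binary).toByteBuffer());
    }
    return Function.identity();
  }
}
```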
```java
} else if (icebergType.typeId() == Type.TypeID.UUID) {
  return binary -> UUIDUtil.convert(((Binary) binary).toByteBuffer());
```
This seems like an odd place to apply this conversion, since the rows above are more about schema evolution. However, looking at it a bit closer, I think it makes sense. Other logical types, such as TimestampLiteral, store the primitive type internally (a long), while UUIDLiteral keeps a UUID rather than bytes.
This will just compare the bytes using an unsigned lexicographical binary comparator.
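For illustration, a minimal, hypothetical stand-in for such a comparator (not Iceberg's actual implementation): it compares the two byte sequences position by position as unsigned values, with the shorter sequence ordering first on a tie.

```java
import java.util.Comparator;

// Hypothetical sketch of an unsigned lexicographical byte comparator, the
// kind of comparison applied to the UUID bytes once the literal is binary.
class UnsignedBytes {
  static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC =
      (a, b) -> {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
          // Byte.toUnsignedInt maps -1 to 255, so 0xFF sorts above 0x00.
          int cmp = Integer.compare(Byte.toUnsignedInt(a[i]), Byte.toUnsignedInt(b[i]));
          if (cmp != 0) {
            return cmp;
          }
        }
        return Integer.compare(a.length, b.length);
      };
}
```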
The schema defines the column as optional. Can you add tests for null UUID values?
@Fokko @shangxinli I made the suggested changes.
```java
record.setField("_struct_not_null", structNotNull); // struct with int

record.setField(
    "_uuid_col", (i % 3 == 0) ? UUID_WITH_ZEROS : (i % 3 == 1) ? UUID_WITH_ONES : null);
```
minor: what's the reason for doing the modulo here? why not just write UUID_WITH_ZEROS?
The idea was to use different values to make sure the filtering works, but now I think just UUID_WITH_ZEROS and null are enough. WDYT?
I think it comes down to also testing other expressions. See my other comment on this.
```java
public void testUUIDEq() {
  assumeThat(format).as("Only valid for Parquet").isEqualTo(FileFormat.PARQUET);

  boolean shouldRead = shouldRead(equal("uuid_col", UUID_WITH_ZEROS));
```
what about testing other expressions?
LGTM
@nastra I made the suggested changes.
```java
private static final UUID UUID_WITH_ZEROS =
    UUID.fromString("00000000-0000-0000-0000-000000000000");
private static final UUID UUID_WITH_ONES =
```
when is this actually used?
My bad, the b51a298 commit solves this.
```java
UUID nonExistentUuid = UUID.fromString("99999999-9999-9999-9999-999999999999");

boolean shouldRead = shouldRead(notEqual("uuid_col", UUID_WITH_ZEROS));
```
The test is still missing equal/greaterThan/lessThan; please also update the other test.
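For illustration, a hypothetical fragment of that extra coverage, patterned on the testUUIDEq excerpt above. The expected boolean for each call depends on which UUID values the setup writes, so no assertions are shown; equal/notEqual/lessThan/greaterThan are the factory methods from org.apache.iceberg.expressions.Expressions.

```java
// Hypothetical additions inside the same test class as testUUIDEq above.
boolean eq = shouldRead(equal("uuid_col", UUID_WITH_ZEROS));
boolean ne = shouldRead(notEqual("uuid_col", UUID_WITH_ZEROS));
boolean lt = shouldRead(lessThan("uuid_col", UUID_WITH_ZEROS));
boolean gt = shouldRead(greaterThan("uuid_col", UUID_WITH_ZEROS));
```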
```java
structNotNull.setField("_int_field", INT_MIN_VALUE + i);
record.setField("_struct_not_null", structNotNull); // struct with int

record.setField("_uuid_col", (i % 2 == 0) ? UUID_WITH_ZEROS : null);
```
nit: newline right above this line can be removed
nastra
left a comment
LGTM, thanks @ndrluis
I'll leave this open for a bit in case @huaxingao wants to review this as well.
huaxingao
left a comment
LGTM
… with UUIDs (apache#14027) (cherry picked from commit ef40079)
… with UUIDs (#14027) (#14523) (cherry picked from commit ef40079) Co-authored-by: Andre Luis Anastacio <[email protected]>
I was working on this PyIceberg issue (apache/iceberg-python#2372) and wrote a new test where PyIceberg writes one Parquet file and PySpark writes another. I wanted to ensure that we are able to read from both Parquet files, but I started receiving this exception: java.util.UUID cannot be cast to class java.nio.ByteBuffer. So this PR focuses on solving that problem to maintain compatibility between the two implementations.
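To make the root cause concrete: Iceberg stores a UUID in Parquet as 16 bytes, and depending on the writer those bytes may or may not carry the UUID logical type annotation, so readers can surface the value either as java.util.UUID or as raw binary. Below is a minimal, self-contained sketch of the 16-byte big-endian round trip between the two representations (plain java.nio code, not Iceberg's UUIDUtil):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Round-trips a java.util.UUID through the 16-byte big-endian layout used
// for UUIDs in Parquet/Iceberg (most significant 8 bytes first).
public class UuidBytesRoundTrip {
  static ByteBuffer toByteBuffer(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    buf.flip();
    return buf;
  }

  static UUID fromByteBuffer(ByteBuffer buf) {
    return new UUID(buf.getLong(), buf.getLong());
  }

  public static void main(String[] args) {
    UUID uuid = UUID.fromString("22222222-2222-2222-2222-222222222222");
    System.out.println(fromByteBuffer(toByteBuffer(uuid)).equals(uuid)); // prints true
  }
}
```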