Fix reading variant null values in Delta Lake #26184
Conversation
In cases where a value chunk includes both null and non-null entries, the chunk must be treated as potentially nullable.
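The rule above can be sketched as follows. This is a minimal illustration with hypothetical helper names, not Trino's actual reader API: in Parquet, an entry whose definition level is below the maximum is null, so a chunk that mixes levels below and at the maximum contains both null and non-null entries and must take the nullable read path.

```java
// Sketch: deciding whether a Parquet value chunk must use the nullable
// read path. Definition levels below the maximum indicate null entries.
public class ChunkNullability
{
    // Returns true when the chunk mixes null and non-null entries,
    // i.e. it must be decoded as potentially nullable
    public static boolean mustReadAsNullable(int[] definitionLevels, int maxDefinitionLevel)
    {
        boolean sawNull = false;
        boolean sawValue = false;
        for (int level : definitionLevels) {
            if (level < maxDefinitionLevel) {
                sawNull = true;
            }
            else {
                sawValue = true;
            }
        }
        return sawNull && sawValue;
    }

    public static void main(String[] args)
    {
        // Chunk with only non-null entries: the non-null fast path is safe
        System.out.println(mustReadAsNullable(new int[] {2, 2, 2}, 2)); // false
        // Chunk mixing a null entry (level 1) with non-null entries (level 2)
        System.out.println(mustReadAsNullable(new int[] {2, 1, 2}, 2)); // true
    }
}
```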
Force-pushed from 3ceefed to b06be75 (Compare)
/test-with-secrets sha=b06be75c90801490ebc04fefaa3ce32f0ab7b6ae

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/16253030103
    definitionLevel,
    required,
    valueField,
    new PrimitiveField(valueField.getType(), false, valueField.getDescriptor(), valueField.getId()),
Does this mean the Parquet spec is wrong, or that Databricks violates the spec?
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#variant-in-parquet
The value field must be annotated as required for unshredded Variant values
Here is the metadata of the test Parquet file written by Databricks:
parquet meta part-00000-3dae12c4-61bc-4177-bd36-2c936db81e90-c000.snappy.parquet
File path: part-00000-3dae12c4-61bc-4177-bd36-2c936db81e90-c000.snappy.parquet
Created by: parquet-mr version 1.12.3-databricks-0002 (build 2484a95dbe16a0023e3eb29c201f99ff9ea771ee)
Properties:
org.apache.spark.version: 3.5.0
com.databricks.spark.jobGroupId: 1752418278132_8642402946176337271_3e73399aa411479296f9bc16c62a6681
org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"id","type":"integer","nullable":true,"metadata":{}},{"name":"x","type":"variant","nullable":true,"metadata":{}}]}
com.databricks.spark.clusterId: 1002-064054-nbosugsx
Schema:
message spark_schema {
optional int32 id;
optional group x {
required binary value;
required binary metadata;
}
}
I think the schema is correct, but I don't know why our NestedColumnReader#readNonNull can't read it correctly. cc @raunaqmorarka
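To illustrate why a non-null read path breaks on such data, here is a toy decoder (hypothetical, not Trino's actual NestedColumnReader): non-null values are stored densely, so a reader that assumes one stored value per output row would misalign every value after the first null, while a null-aware decode consumes a stored value only for non-null rows.

```java
public class NullableValueDecode
{
    // Toy nullable decode: stored values exist only for non-null rows,
    // so we advance through the dense value array only when the row
    // is non-null. A non-null fast path would instead read one stored
    // value per row and go out of alignment (or out of bounds).
    public static String[] decode(String[] storedValues, boolean[] isNull)
    {
        String[] out = new String[isNull.length];
        int next = 0; // index into the densely packed non-null values
        for (int i = 0; i < isNull.length; i++) {
            out[i] = isNull[i] ? null : storedValues[next++];
        }
        return out;
    }

    public static void main(String[] args)
    {
        // Three rows: "a", null, "b" — only two values are physically stored
        String[] decoded = decode(new String[] {"a", "b"}, new boolean[] {false, true, false});
        System.out.println(java.util.Arrays.toString(decoded)); // [a, null, b]
    }
}
```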
Description
In cases where a value chunk includes both null and non-null entries, the chunk must be treated as potentially nullable. This is another corner case that was missed in #26027.
Release notes