
Conversation

Contributor

@chenjian2664 chenjian2664 commented Jul 13, 2025

Description

In cases where a value chunk includes both null and non-null
entries, the chunk must be treated as potentially nullable.

This is another corner case that was missed in #26027.
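The rule above can be sketched as follows. This is an illustrative example, not Trino's actual reader API: a chunk must be treated as nullable whenever any entry's definition level falls below the field's maximal definition level.

```java
// Illustrative sketch (class and method names are hypothetical, not Trino's
// API): decide whether a Parquet value chunk must be treated as nullable
// by scanning its definition levels.
public class ValueChunkNullability {
    // An entry is null when its definition level is below the field's
    // maximal definition level; a single such entry makes the chunk nullable.
    static boolean mustTreatAsNullable(int[] definitionLevels, int maxDefinitionLevel) {
        for (int level : definitionLevels) {
            if (level < maxDefinitionLevel) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mixed chunk: entries at the max level (non-null) and below it (null).
        System.out.println(mustTreatAsNullable(new int[] {2, 1, 2}, 2)); // true
        // All entries fully defined: the chunk can be read as non-null.
        System.out.println(mustTreatAsNullable(new int[] {2, 2, 2}, 2)); // false
    }
}
```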

Release notes

## Delta Lake
* Fix failure when reading `null` values on `json` type columns. ({issue}`26184`)

@cla-bot cla-bot bot added the cla-signed label Jul 13, 2025
@github-actions github-actions bot added the delta-lake Delta Lake connector label Jul 13, 2025
@chenjian2664 chenjian2664 force-pushed the fix_variant_null_reading branch from 3ceefed to b06be75 Compare July 13, 2025 15:20
@ebyhr
Member

ebyhr commented Jul 13, 2025

/test-with-secrets sha=b06be75c90801490ebc04fefaa3ce32f0ab7b6ae

@github-actions

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/16253030103

definitionLevel,
required,
valueField,
new PrimitiveField(valueField.getType(), false, valueField.getDescriptor(), valueField.getId()),
Member

@ebyhr ebyhr Jul 13, 2025


Does this mean the Parquet spec is wrong, or that Databricks violates the spec?
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#variant-in-parquet

The value field must be annotated as required for unshredded Variant values

Contributor Author

@chenjian2664 chenjian2664 Jul 14, 2025


Here is the metadata of the test parquet file written by Databricks:

parquet meta part-00000-3dae12c4-61bc-4177-bd36-2c936db81e90-c000.snappy.parquet

File path:  part-00000-3dae12c4-61bc-4177-bd36-2c936db81e90-c000.snappy.parquet
Created by: parquet-mr version 1.12.3-databricks-0002 (build 2484a95dbe16a0023e3eb29c201f99ff9ea771ee)
Properties:
                   org.apache.spark.version: 3.5.0
            com.databricks.spark.jobGroupId: 1752418278132_8642402946176337271_3e73399aa411479296f9bc16c62a6681
  org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"id","type":"integer","nullable":true,"metadata":{}},{"name":"x","type":"variant","nullable":true,"metadata":{}}]}
             com.databricks.spark.clusterId: 1002-064054-nbosugsx
Schema:
message spark_schema {
  optional int32 id;
  optional group x {
    required binary value;
    required binary metadata;
  }
}

I think the schema is correct, but I don't know why our NestedColumnReader#readNonNull can't read it correctly. cc @raunaqmorarka
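One way to see why the chunk can still contain nulls, sketched under the standard Parquet definition-level rules (the helper below is illustrative, not Trino's code): each `optional` ancestor on a column's path adds one definition level, so even though `value` is declared `required`, it sits inside the optional group `x` and its maximal definition level is 1. A definition level of 0 in the chunk then encodes a null variant.

```java
// Illustrative sketch of Parquet's definition-level arithmetic for the
// schema in the dump above: each optional field on the path adds one level.
public class DefinitionLevels {
    static int maxDefinitionLevel(boolean... optionalFlagsOnPath) {
        int level = 0;
        for (boolean optional : optionalFlagsOnPath) {
            if (optional) {
                level++; // only optional fields contribute a level
            }
        }
        return level;
    }

    public static void main(String[] args) {
        // Path x.value: group x is optional, value is required.
        int valueMaxLevel = maxDefinitionLevel(true, false);
        System.out.println(valueMaxLevel); // 1
        // A definition level of 0 in the value chunk therefore means the
        // whole variant x is null, so the chunk can contain nulls even
        // though value itself is declared required.
    }
}
```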

@chenjian2664
Copy link
Contributor Author

#26194

@ebyhr ebyhr merged commit 2a390bf into trinodb:master Jul 14, 2025
121 of 124 checks passed
@github-actions github-actions bot added this to the 477 milestone Jul 14, 2025
@chenjian2664 chenjian2664 deleted the fix_variant_null_reading branch July 14, 2025 12:56
