Skip to content

Comments

Add Delta data type to Parquet physical type mappings in PROTOCOL.md#2048

Merged
vkorukanti merged 3 commits intodelta-io:masterfrom
vkorukanti:physicalTypes
Apr 11, 2024
Merged

Add Delta data type to Parquet physical type mappings in PROTOCOL.md#2048
vkorukanti merged 3 commits intodelta-io:masterfrom
vkorukanti:physicalTypes

Conversation

@vkorukanti
Copy link
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (Delta PROTOCOL.md updates/clarifications)

Description

Currently, Delta protocol doesn't specify how a Delta data type is stored physically in Parquet files. This PR is attempting to document the Delta data type to Parquet physical/logical type mappings.

How was this patch tested?

NA

Does this PR introduce any user-facing changes?

No

decimal| `int32`, `int64` or `fixed_length_binary` | `DECIMALe(scale, precision)`
string| `binary` | `string (UTF-8)`
binary| `binary` |
array| either as `2-level` or `3-level` representation. Refer to [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) for further details | `LIST`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2-level representation is based on the old format. It is possible that we have some old writers that wrote in this format.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the protocol express a preference for 3-level then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same response as the other comment here.

@vkorukanti
Copy link
Collaborator Author

cc @ryan-johnson-databricks , @tdas

Copy link
Collaborator

@ryan-johnson-databricks ryan-johnson-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend we get @zsxwing and/or @brkyvz review as well.

decimal| `int32`, `int64` or `fixed_length_binary` | `DECIMALe(scale, precision)`
string| `binary` | `string (UTF-8)`
binary| `binary` |
array| either as `2-level` or `3-level` representation. Refer to [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) for further details | `LIST`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the protocol express a preference for 3-level then?

@vkorukanti vkorukanti requested a review from zsxwing October 3, 2023 20:29
PROTOCOL.md Outdated
-|-
type| Always the string "array"
elementType| The type of element stored in this array represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition
elementType| The type of element stored in this array is represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of the is here, maybe we need , instead

Copy link
Collaborator

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unfortunately not an expert here. The vectorized reader/writer team may provide more details, but this looks correct to me

int| `int32` | `INT(bitwidth = 32, signed = true)`
long| `int64` | `INT(bitwidth = 64, signed = true)`
date| `int32` | `DATE`
timestamp| `int96` or `int64` | `TIMESTAMP(isAdjustedToUTC = true, units = microseconds)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we recommend int64 as preferred? int96 is deprecated: apache/parquet-format#86

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got. Lets merge this PR and then we can look at what should be the preferred one, because it affects the existing readers and need to see if all existing readers have support for int64 as timestamp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second this. I think we should at least mandate all timestamp cols written after some date to be int64.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lzlfred are you aware of any reader that doesn't support int64 yet?

Context: In Kernel the ParquetHandler for writes takes ColumnarBatch which has schema as StructType. There is no clear way to communicate with the current API whether to write the timestamp column as INT96 or INT64. There are no configuration options like we have in Delta-Spark. We need

  • update the API to indicate what type of physical Parquet format we want for a given column
    • could be in the metadata of StructField.
  • make the writer always write as INT64 (checking if any Delta clients have problem with or not support it yet)

@felipepessoto
Copy link
Contributor

@vkorukanti @zsxwing it seems this change was forgotten, but very important for Delta ecosystem. Do you have plans to merge it?

Copy link
Collaborator Author

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipepessoto Will merge this soon. Thanks for reminding.

int| `int32` | `INT(bitwidth = 32, signed = true)`
long| `int64` | `INT(bitwidth = 64, signed = true)`
date| `int32` | `DATE`
timestamp| `int96` or `int64` | `TIMESTAMP(isAdjustedToUTC = true, units = microseconds)`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got. Lets merge this PR and then we can look at what should be the preferred one, because it affects the existing readers and need to see if all existing readers have support for int64 as timestamp.

decimal| `int32`, `int64` or `fixed_length_binary` | `DECIMALe(scale, precision)`
string| `binary` | `string (UTF-8)`
binary| `binary` |
array| either as `2-level` or `3-level` representation. Refer to [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) for further details | `LIST`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same response as the other comment here.

@vkorukanti vkorukanti merged commit 45ad641 into delta-io:master Apr 11, 2024
andreaschat-db pushed a commit to andreaschat-db/delta that referenced this pull request Apr 16, 2024
…elta-io#2048)

## Description
Currently, Delta protocol doesn't specify how a Delta data type is
stored physically in Parquet files. This PR is attempting to document
the Delta data type to Parquet physical/logical type mappings.

## How was this patch tested?
NA

## Does this PR introduce _any_ user-facing changes?
No
@vkorukanti vkorukanti deleted the physicalTypes branch May 9, 2024 02:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants