Add Delta data type to Parquet physical type mappings in PROTOCOL.md#2048
vkorukanti merged 3 commits into delta-io:master
| decimal | `int32`, `int64` or `fixed_length_binary` | `DECIMAL(scale, precision)` |
| string | `binary` | `string (UTF-8)` |
| binary | `binary` | |
| array | either as `2-level` or `3-level` representation. Refer to [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) for further details | `LIST` |
2-level representation is based on the old format. It is possible that we have some old writers that wrote in this format.
Should the protocol express a preference for 3-level then?
Same response as the other comment here.
PROTOCOL.md
Outdated
| -|- |
| type | Always the string "array" |
| elementType | The type of element stored in this array represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition |
| elementType | The type of element stored in this array is represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition |
Instead of adding `is` here, maybe we need a comma (`,`) instead.
| int | `int32` | `INT(bitwidth = 32, signed = true)` |
| long | `int64` | `INT(bitwidth = 64, signed = true)` |
| date | `int32` | `DATE` |
| timestamp | `int96` or `int64` | `TIMESTAMP(isAdjustedToUTC = true, units = microseconds)` |
Should we recommend int64 as preferred? int96 is deprecated: apache/parquet-format#86
Got it. Let's merge this PR and then look at which one should be preferred, because it affects existing readers and we need to see whether all existing readers support int64 as timestamp.
I second this. I think we should at least mandate that all timestamp columns written after some date be int64.
@lzlfred are you aware of any reader that doesn't support int64 yet?
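For anyone weighing the reader-compatibility question in this thread, the two physical encodings differ as follows. An INT96 timestamp is conventionally 12 bytes: 8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day number. INT64 with `units = microseconds` stores microseconds since the Unix epoch. The following is a minimal sketch of the conversion under those assumptions; the function names are hypothetical and are not taken from any Delta reader.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 1_000_000


def int96_to_micros(raw: bytes) -> int:
    """Convert a 12-byte INT96 timestamp to microseconds since the Unix epoch."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian int64 (nanoseconds of day), then int32 (Julian day).
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Sub-microsecond precision is truncated; INT64 micros cannot represent it.
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1_000


def micros_to_int96(micros: int) -> bytes:
    """Inverse of int96_to_micros, for values with microsecond precision."""
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    return struct.pack("<qi", micros_of_day * 1_000, days + JULIAN_DAY_OF_UNIX_EPOCH)
```

A reader that handles only one of the two encodings would need a conversion like this somewhere in its decode path, which is why the choice of a preferred physical type affects existing readers.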
Context: In Kernel, the `ParquetHandler` write path takes a `ColumnarBatch` whose schema is a `StructType`. With the current API there is no clear way to communicate whether to write the timestamp column as INT96 or INT64, and there are no configuration options like the ones we have in Delta-Spark. We need to:
- update the API to indicate which physical Parquet type we want for a given column
  - this could be in the metadata of `StructField`
- make the writer always write INT64 (after checking whether any Delta clients have problems with it or do not support it yet)
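The metadata option could look roughly like the sketch below. Everything here is an assumption for illustration: `SchemaField` is a simplified stand-in rather than Kernel's `StructField`, and the metadata key `parquet.physicalType` is hypothetical and not defined by the Delta protocol or by Kernel. The sketch also defaults to `int64`, which mirrors the second option in the list.

```python
from dataclasses import dataclass, field

# Physical types the mapping table allows for a Delta `timestamp` column.
ALLOWED_TIMESTAMP_PHYSICAL_TYPES = {"int96", "int64"}

# Hypothetical metadata key, chosen only for this sketch.
PHYSICAL_TYPE_KEY = "parquet.physicalType"


@dataclass
class SchemaField:
    """Simplified, hypothetical stand-in for a schema field with metadata."""
    name: str
    data_type: str
    metadata: dict = field(default_factory=dict)


def timestamp_physical_type(f: SchemaField, default: str = "int64") -> str:
    """Pick the Parquet physical type for a timestamp field from its metadata."""
    if f.data_type != "timestamp":
        raise ValueError(f"{f.name} is not a timestamp column")
    chosen = str(f.metadata.get(PHYSICAL_TYPE_KEY, default)).lower()
    if chosen not in ALLOWED_TIMESTAMP_PHYSICAL_TYPES:
        raise ValueError(f"unsupported physical type for timestamp: {chosen}")
    return chosen
```

With this shape, a writer would consult the field metadata once per column and fall back to `int64` when no hint is present.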
@vkorukanti @zsxwing it seems this change was forgotten, but it is very important for the Delta ecosystem. Do you have plans to merge it?
vkorukanti
left a comment
@felipepessoto Will merge this soon. Thanks for reminding.
5ccc9d4 to 194c66b
Which Delta project/connector is this regarding?
Description
Currently, the Delta protocol doesn't specify how a Delta data type is stored physically in Parquet files. This PR documents the Delta data type to Parquet physical/logical type mappings.
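As an illustration of how such a mapping might be consumed, here is a minimal sketch of a lookup table and a validity check. It covers only the primitive rows quoted in the review comments on this PR, so it is not the full table from PROTOCOL.md, and the names `DELTA_TO_PARQUET_PHYSICAL` and `is_allowed_physical_type` are hypothetical.

```python
# Delta primitive type -> Parquet physical types accepted for it.
# Only the rows quoted in this PR's review threads are included.
DELTA_TO_PARQUET_PHYSICAL = {
    "decimal": {"int32", "int64", "fixed_length_binary"},
    "string": {"binary"},
    "binary": {"binary"},
    "int": {"int32"},
    "long": {"int64"},
    "date": {"int32"},
    "timestamp": {"int96", "int64"},
}


def is_allowed_physical_type(delta_type: str, physical_type: str) -> bool:
    """Return True if physical_type is an accepted encoding for delta_type."""
    allowed = DELTA_TO_PARQUET_PHYSICAL.get(delta_type)
    if allowed is None:
        raise KeyError(f"no mapping recorded for Delta type {delta_type!r}")
    return physical_type.lower() in allowed
```

A reader could run a check like this against the Parquet footer schema before decoding a column and report a clear error when a file uses an encoding outside the documented set.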
How was this patch tested?
NA
Does this PR introduce any user-facing changes?
No