Added toggle for Parquet V1 and V2 formats#9497
Added toggle for Parquet V1 and V2 formats#9497joshthoward wants to merge 2 commits intotrinodb:masterfrom
Conversation
This sounds like important piece to solve #6377
It seems like we didn't make a very deliberate decision to use file format version V2 and we later realized this is causing problems (#7953) We should switch to v1 for now, unless we can name actual benefits of having a toggle. |
|
per @anjalinorwood #7953 (comment) changing the |
|
+1000
|
Yes, I agree that we should switch to V1 since that's what's supported more broadly. I think we should keep V2 as an experimental flag. It's useful to be able to write data in the new format for others to be able to test compatibility as they develop that support in their engines. |
|
@anjalinorwood Based on your comments in #7953, did you have an explicit test case where something else was needed to support the V1 spec? Writing |
it would be awesome if Trino could become the reference implementation for Parquet v2 format, but i would assume parquet-mr is going to be that no matter what we do. Toggles are not something to add haphazardly. There is a cost to them in terms of code and cognitive complexity that affects future maintainability and ability to safely evolve the system.
Good question. While we look for the answer to that, i think we should also take a look what it would take to produce files that are readable by Hive. |
|
If the other PR gets merged first we'll have a small merge conflict with https://github.com/trinodb/trino/pull/9569/files/fd0c06389c6977de7ecc96a70ec6959086f90446#r725263578 But it should just be another case of switching a "v2" to a "v1" |
Cherry-pick of trinodb/trino#9497 and trinodb/trino#9611 co-authored by Josh Howard <joshthoward@gmail.com>
Cherry-pick of trinodb/trino#9497 and trinodb/trino#9611 co-authored by Josh Howard <joshthoward@gmail.com>
Cherry-pick of trinodb/trino#9497 and trinodb/trino#9611 co-authored by Josh Howard <joshthoward@gmail.com>
This PR creates a toggle to write data in the Parquet V1 format using the native writer. The default version remains the same to ensure that there are no compatibility issues here.
With this change:
org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.hive.serde2.io.HiveVarcharWritablefor all tested Hive versions (there are mixed errors for Parquet V2 files)Why does this seem to work? (TBH I wouldn't have expected it to)
ValuesWriteris injected intoPrimitiveColumnWriterhere.valuesWriterFactorycreates a newValuesWriterhere.DefaultValuesWriterFactoryhereDefaultValuesWriterFactoryrespects the Parquet V1 or V2 flag for constructingValuesWriters here.This PR is still a WIP, but I wanted to post this for discussion.