Conversation
|
Our parquet struct schema keeps evolving. Very frequently, we have parquet files of different versions co-existing. Instead of failing the query due to schema mismatching, it is better to return null for new fields if the data file is old. |
There was a problem hiding this comment.
Sometimes this check helps to catch problems with the data definition in the metastore, so I think silently returning nulls if the schemas do not match is not the right way to go.
There was a problem hiding this comment.
This check is good but too restrictive. It makes it hard to evolve the schema seamlessly. We return nulls and log some message below so that queries can still go on. If they pay attention to the logging, they should realize what's happened.
Even if the number matches, the schema still could be incompatible. Users still need to know what they are doing.
|
can you add unit tests? |
There was a problem hiding this comment.
I don't think we need the log here (especially in INFO level)
There was a problem hiding this comment.
It's good to log something since we don't fail the query any more. I can make this a warning.
|
Sure, will add some unit tests. |
306170b to
b316a2e
Compare
|
Added a unit test to the patch. |
5d449a5 to
706735e
Compare
|
Updated the patch: 1) now we support both adding and removing struct fields; 2) support adding fields at any place in the struct; 3) added more test cases for these scenarios. |
706735e to
cc17148
Compare
|
Added another patch that supports changing the order of existing fields in a struct. With these patches, now we fully support Parquet struct schema evolution.
Schema mismatch will be logged although query executes. |
aebcbe4 to
7753059
Compare
There was a problem hiding this comment.
Is it possible to update all BlockConverter into BlockFieldConverter? Then we just need on BlockConverter, with fieldIndex
There was a problem hiding this comment.
Most of converters can share the same interface, except ParquetListEntryConverter and ParquetMapEntryConverter. The field index doesn't apply to these two entry converters.
|
@zhenxiao now that the new parquet reader supports structs, we should add this same feature to it |
|
@dain yes, we are working / stress testing it |
|
@jxiang Is there anything that I can do to help out with this PR? I'm facing the same problem where we have mismatching schemas for structs due to schema evolution and it's not very feasible for us backfill the old Parquet files to match the new schemas. I can apply this change to my fork but I think other people may find this feature useful as well. That being said, is there a different approach to schema evolution involving Parquet files for cases similar to this, without applying this patch? |
|
@zhenxiao the code still doesn't allow for struct types to evolve https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetHiveRecordCursor.java#L743-L748 anything we can do to help in this pr ? |
|
we have a rebased version here: |
eee4f3b to
333ecb2
Compare
|
adding the updated pull request for reference #6675 |
2b018e0 to
5b14fbb
Compare
|
Thanks @zhenxiao, I pushed the rebased version to this branch. |
5b14fbb to
0f526c9
Compare
|
@dain could you take a look when you get a chance? Thanks. |
|
@jxiang yep. I see there are two (or three) PRs related to this. Can you help me understand which ones I should review in which order? |
|
Yeah, as @zhenxiao said, this is the first one. Thanks. |
|
continue with: |
No description provided.