
Flexible parquet struct converter#4714

Closed
jxiang wants to merge 1 commit into prestodb:master from
jxiang:extra_parquet_struct_fields

Conversation

@jxiang
Contributor

@jxiang jxiang commented Mar 3, 2016

No description provided.

@jxiang
Contributor Author

jxiang commented Mar 3, 2016

Our Parquet struct schema keeps evolving, and we very frequently have Parquet files of different versions co-existing. Instead of failing the query because of a schema mismatch, it is better to return null for new fields when the data file is old.

Contributor

Sometimes this check helps to catch problems with the data definition in the metastore, so I think silently returning nulls if the schemas do not match is not the right way to go.

Contributor Author

This check is good but too restrictive. It makes it hard to evolve the schema seamlessly. We return nulls and log a message below so that queries can still go on. If users pay attention to the log, they should realize what happened.

Even if the number matches, the schema still could be incompatible. Users still need to know what they are doing.

@nezihyigitbasi
Contributor

can you add unit tests?

Contributor

I don't think we need the log here (especially at INFO level).

Contributor Author

It's good to log something since we don't fail the query any more. I can make this a warning.

@jxiang
Contributor Author

jxiang commented Mar 4, 2016

Sure, will add some unit tests.

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch from 306170b to b316a2e Compare March 6, 2016 00:18
@jxiang
Contributor Author

jxiang commented Mar 6, 2016

Added a unit test to the patch.

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from 5d449a5 to 706735e Compare March 15, 2016 18:42
@jxiang
Contributor Author

jxiang commented Mar 15, 2016

Updated the patch: 1) we now support both adding and removing struct fields; 2) fields can be added at any position in the struct; 3) added more test cases for these scenarios.
However, changing the order of existing fields in a struct is not supported. In that case, a schema mismatch error is thrown.

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch from 706735e to cc17148 Compare March 16, 2016 19:59
@jxiang
Contributor Author

jxiang commented Mar 16, 2016

Added another patch that supports changing the order of existing fields in a struct. With these patches, we now fully support Parquet struct schema evolution.

  • The schema from the metastore is the source of truth.
  • Field order follows the metastore schema.
  • A field in the metastore schema but not in the Parquet schema has a null value.
  • A field in the Parquet schema but not in the metastore schema is ignored.

A schema mismatch is logged, but the query still executes.
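
The four rules above can be illustrated with a minimal Java sketch. It is not the code from this patch: the class name `StructEvolutionSketch`, the `project` method, and the sample field names are hypothetical, and it assumes fields are matched by exact name (real metastore names may need case-insensitive matching). Logging of the mismatch is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the converter classes touched by this PR.
public class StructEvolutionSketch {
    // Projects one struct row, as written in a file, onto the metastore schema.
    // tableFields: field names in metastore order (the source of truth).
    // fileFields/fileValues: field names and values in the order the file stores them.
    static List<Object> project(List<String> tableFields, List<String> fileFields, List<Object> fileValues) {
        // Index the file's fields by name (assumption: exact, case-sensitive match).
        Map<String, Integer> fileIndex = new HashMap<>();
        for (int i = 0; i < fileFields.size(); i++) {
            fileIndex.put(fileFields.get(i), i);
        }
        List<Object> row = new ArrayList<>();
        for (String field : tableFields) {
            Integer index = fileIndex.get(field);
            // A field missing from the file becomes null; file-only fields are never visited.
            row.add(index == null ? null : fileValues.get(index));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("id", "name", "email");
        List<String> file = Arrays.asList("name", "legacy_flag", "id");
        List<Object> values = Arrays.<Object>asList("alice", true, 42);
        System.out.println(project(table, file, values)); // prints [42, alice, null]
    }
}
```

Here `email` exists only in the metastore schema and comes back as null, `legacy_flag` exists only in the file and is dropped, and `id` and `name` are returned in metastore order even though the file stores them in a different order.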

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from aebcbe4 to 7753059 Compare March 23, 2016 16:40
Collaborator

Is it possible to update all BlockConverter into BlockFieldConverter? Then we just need one BlockConverter, with fieldIndex.

Contributor Author

Most of the converters can share the same interface, except ParquetListEntryConverter and ParquetMapEntryConverter. The field index doesn't apply to these two entry converters.

@dain
Contributor

dain commented Jan 11, 2017

@zhenxiao now that the new parquet reader supports structs, we should add this same feature to it

@zhenxiao
Collaborator

@dain yes, we are working / stress testing it
The new Parquet reader uses nested path names to look up ColumnDescriptors, and it now supports most schema evolution cases.
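
As a rough illustration of looking up columns by nested path name rather than by position, here is a small Java sketch. It does not use the actual reader or Parquet library classes: `NestedPathLookupSketch`, `register`, `lookup`, and the string stand-in for a column descriptor are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; a String stands in for a real column descriptor.
public class NestedPathLookupSketch {
    // Key: the full nested path, e.g. ["address", "city"]; value: descriptor stand-in.
    private final Map<List<String>, String> columnsByPath = new HashMap<>();

    // Records a leaf column found in the file under its nested path.
    void register(String descriptor, String... path) {
        columnsByPath.put(new ArrayList<>(Arrays.asList(path)), descriptor);
    }

    // Resolves a requested path by name; empty means the file has no such column,
    // which a reader could then treat as null values.
    Optional<String> lookup(String... path) {
        return Optional.ofNullable(columnsByPath.get(Arrays.asList(path)));
    }

    public static void main(String[] args) {
        NestedPathLookupSketch file = new NestedPathLookupSketch();
        file.register("column 0", "address", "city");
        file.register("column 1", "address", "street");
        System.out.println(file.lookup("address", "city").isPresent()); // prints true
        System.out.println(file.lookup("address", "zip").isPresent());  // prints false
    }
}
```

Because the lookup is keyed on names, adding, removing, or reordering fields inside a struct does not shift which column a requested path resolves to.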

@markcho

markcho commented Feb 10, 2017

@jxiang Is there anything that I can do to help out with this PR?

I'm facing the same problem: we have mismatching schemas for structs due to schema evolution, and it's not very feasible for us to backfill the old Parquet files to match the new schemas.

I can apply this change to my fork but I think other people may find this feature useful as well.

That being said, is there a different approach to schema evolution involving Parquet files for cases similar to this, without applying this patch?

@billonahill

+1 to @markcho's comment. We also need this patch. cc/ @Yaliang.

@Gauravshah

@zhenxiao
Collaborator

we have a rebased version here:
ba5a3e4

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from eee4f3b to 333ecb2 Compare February 14, 2017 19:27
@Gauravshah

adding the updated pull request for reference #6675

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch 2 times, most recently from 2b018e0 to 5b14fbb Compare February 14, 2017 19:41
@jxiang
Contributor Author

jxiang commented Feb 14, 2017

Thanks @zhenxiao, I pushed the rebased version to this branch.

@jxiang jxiang force-pushed the extra_parquet_struct_fields branch from 5b14fbb to 0f526c9 Compare February 14, 2017 19:58
@jxiang
Contributor Author

jxiang commented Feb 14, 2017

@dain could you take a look when you get a chance? Thanks.

@dain
Contributor

dain commented Feb 14, 2017

@jxiang yep. I see there are two (or three) PRs related to this. Can you help me understand which ones I should review in which order?

@dain dain self-requested a review February 14, 2017 23:14
@zhenxiao
Collaborator

@dain this PR is the very first one; @jxiang now has all the schema evolution changes in one commit. #6675 is built on top of this.

@jxiang
Contributor Author

jxiang commented Feb 15, 2017

Yeah, as @zhenxiao said, this is the first one. Thanks.

@dain dain assigned nezihyigitbasi and unassigned dain Mar 17, 2017
@dain dain requested review from nezihyigitbasi and removed request for dain March 17, 2017 19:10
@nezihyigitbasi
Contributor

@zhenxiao @jxiang AFAIU #6675 supersedes this one. If that's correct please close this PR and then we can work on the other one.

@zhenxiao
Collaborator

continue with:
#6675

@jxiang jxiang closed this Apr 17, 2017
@jxiang jxiang deleted the extra_parquet_struct_fields branch April 18, 2017 16:32

8 participants