Skip to content

Support adding fields to nested Parquet structs#4939

Closed
billonahill wants to merge 9 commits intoprestodb:masterfrom
billonahill:billg/oss_support_adding_to_structs
Closed

Support adding fields to nested Parquet structs#4939
billonahill wants to merge 9 commits intoprestodb:masterfrom
billonahill:billg/oss_support_adding_to_structs

Conversation

@billonahill
Copy link

Presto currently support adding columns to a table, but not adding fields to a nested struct. This adds support for struct field additions.

@zhenxiao
Copy link
Collaborator

zhenxiao commented Apr 5, 2016

@billonahill ,
@jxiang has a PR, also resolving adding/deleting/reordering fields in Parquet structs:
#4714

@billonahill billonahill closed this Apr 5, 2016
@billonahill billonahill reopened this Apr 5, 2016
* table struct fields are permitted.
*
* @throws PrestoException if the schema has not evolved in a supported way
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is used for all file formats, but it looks like only the Parquet record reader supports this. What will happen with the other formats?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to move forward with this PR or PR #4714 (or both)? If we want to merge this one, which could be an iteration towards a the larger #4714, then I can add integration tests to verify the other formats.

@jxiang
Copy link
Contributor

jxiang commented Apr 25, 2016

I think it is better to have PR #4714 at first. If needed, then we can get this one.

@nezihyigitbasi
Copy link
Contributor

@billonahill @zhenxiao just to confirm with you guys, does #4714 supersede this PR? If yes please close this one and then we can focus on the other.

@billonahill
Copy link
Author

@nezihyigitbasi I've confirmed that this issue still exists on master by re-submitting this PR with only the test changes included. I will check against #4714 as well. Is there a way to exercise the AbstractTestHiveClient.testTypesParquet test without using travis CI or running docker locally? I'd like to be able to just run the unit test locally but assertGetRecordsOptional doesn't execute.

@electrum
Copy link
Contributor

@billonahill The purpose of that test is to read data that was written by Hive. You can create the table in Hive, then run TestHiveClient with the parameters set to point at your Hive installation. The test table is created by running create-test.sql and create-test-hive13.sql in presto-hive/src/test/sql in Hive. Though for just that one test, you can modify the scripts to remove anything unnecessary (can take a long time if your Hive cluster is slow).

@electrum
Copy link
Contributor

For IntelliJ, you set the parameters in the "Parameters" tab of the TestNG run configuration. I run those Hive tests all the time in the IDE.

@billonahill
Copy link
Author

billonahill commented Apr 14, 2017

Thanks @electrum for jogging my memory re how to run these tests, it's been a while. :) Adding my own notes here, this is how I run from the command line:

mvn -pl presto-hive-hadoop2 -P test-hive-hadoop2 -Dhive.hadoop2.databaseName=default -Dtest=TestHiveClient -Dmaven.surefire.debug test

@nezihyigitbasi #4714 does not seem to address this issue. What's needed is for HiveCoercionPolicy to handle structs, as well as a StructCoercionPolicy in HiveCoercionRecordCursor. One issue is that #4714 is pretty old and a lot has changed on master regarding coercion since it was implemented. We could merge #4714 and I can follow up with a patch, or we can include the test changes from this patch in #4714 and implement a fix. Or I can recreate this patch off master and we can commit it first. Let me know what you and @jxiang prefer.

For my own notes, I've made a branch based on the patch in #4714 (with the current master merged in) at https://github.com/billonahill/presto/tree/jxiang_extra_parquet_struct_fields which includes the fix to HiveCoercionPolicy, but still needs a fix to HiveCoercionRecordCursor.

@billonahill
Copy link
Author

billonahill commented Apr 17, 2017

I've re-based the patch from master and updated it. This patch is ready to merge and can go in before #4714 (or #6675, which is much more extensive and with many merge conflicts).

@zhenxiao
Copy link
Collaborator

@billonahill not so clear about this PR, could you please add more explanation. Is there anything missing in #4714 for schema evolution in Parquet? We just use that and are working fine

@billonahill
Copy link
Author

@zhenxiao if you include the unit tests from this patch only in #4714 you can see that they fail. I didn't investigate the code in detail to see why though. The patch in #4714 is based on a pretty old version of the master and has conflicts so I might not have resolved correctly based on your intent. If you could rebase it and try running with the tests from this patch that would show if it works or not.

@nezihyigitbasi
Copy link
Contributor

@zhenxiao can you give @billonahill 's suggestion a try and let us know if that test works or not?

@zhenxiao
Copy link
Collaborator

@nezihyigitbasi @billonahill I rebased our Nested Schema Evolution work at:
#6675

It becomes an independent PR for Nested Schema Evolution. I will close all other pending PRs for that.

@billonahill could you please take a test on 6675 to see whether that could fix the problem? We've using that in production for a long time, to resolve all schema evolution issues. It is quite stable.

@billonahill
Copy link
Author

billonahill commented Apr 20, 2017

I've commented on #6675 that the problem this patch addresses still exists, even after applying the branch in #6675. We could either merge #6675 and then pursue this issue, or vice-versa. Or we could incorporate the tests from this patch into #6675 and expand it's scope.

@stale
Copy link

stale bot commented Apr 3, 2019

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the task, make sure you've addressed reviewer comments, and rebase on the latest master. Thank you for your contributions!

@stale stale bot added the stale label Apr 3, 2019
@stale stale bot closed this Apr 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants