Allow multiple or missing Hive bucket files by v-jizhang · Pull Request #16456 · prestodb/presto

v-jizhang · 2021-07-21T23:05:33Z

Cherry-pick of trinodb/trino#822,
trinodb/trino#848 and
trinodb/trino#1375

Co-authored-by: David Phillips <david@acz.org>
Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>

== RELEASE NOTES ==

Hive Changes
* Allow multiple or missing Hive bucket files
This can be configured with the Hive configuration property ``hive.create-empty-bucket-files``.
Bucket file names are now computed using the Hive naming convention.

aweisberg

Thank you for working on this!

Noticed some small things.

The commit messages don't follow our guidelines: https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#example-commit-message. Specifically, the later commits don't have the original commit message and don't link to the PR in the body.

The release note isn't very detailed; it should link to the documentation for the new configuration options. I think it's also worth mentioning that it changes the naming convention used by Presto for filenames, just so people know it occurred.

I am not 100% sure we want to default to omitting empty bucket files. Generally I want to make sure this isn't going to break compatibility with systems that are expecting the existing output from Presto.

aweisberg · 2021-07-22T16:53:02Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java

Unrelated to your change, but the temporary vs. non-temporary distinction could be handled entirely inside the location service, which would simplify this code.

aweisberg · 2021-07-22T17:32:53Z

presto-hive/src/test/java/com/facebook/presto/hive/AbstractTestHiveClient.java

No love for assertThat? :-)

Just want to keep it as it is. :)

aweisberg · 2021-07-22T18:34:04Z

presto-hive/src/main/java/com/facebook/presto/hive/StoragePartitionLoader.java

There is an opportunity to simplify here after #15629 because we can now handle missing/empty buckets.

@viczhang861 do you agree? Maybe we can converge on one code path here.

@viczhang861 Commented below, I'll refactor it after this PR.

aweisberg · 2021-07-22T18:59:59Z

presto-hive/src/main/java/com/facebook/presto/hive/StoragePartitionLoader.java

Banged my head against this for a while...

So you can have multiple files for each bucket, hence the array multimap, but only if the file names follow a format from which the bucket index can be extracted. Earlier changes (squashed into one commit in this PR) changed Presto's output format to match Hive while staying backwards compatible, in that bucket numbers can still be extracted from the older filenames.

If the bucket index can't be extracted, then the number of files must match the number of buckets. In that case there are two different file name formats where the number can't be extracted, and two potential ways to sort the files in order to generate an index for each one. I don't know why there are two ways, but that was pre-existing from #15536.

Missing files are also allowed as long as every file name contains the bucket index, because the file-count check was moved into the loop and the continue skips it for files whose index was extracted.

So in conclusion, I believe this does what it says in terms of allowing missing or multiple files, and I think it preserves all the behaviors that were there before.

Thank you for coming to my TED talk.
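
A minimal sketch of the file-to-bucket assignment described in the comment above, under stated assumptions. The class `BucketFileSketch`, both regular expressions, and the single name-based sort used in the fallback are hypothetical and are not taken from `StoragePartitionLoader`; the real code reportedly has two sort orders. It only illustrates the two cases: when every file name carries a bucket index, buckets may have several files or none; otherwise the file count must equal the bucket count.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the code in StoragePartitionLoader.
final class BucketFileSketch
{
    // Assumed patterns, tried in order. The legacy-style pattern goes first because a
    // name such as "20210721_230533_00001_abcde_bucket-00003" would also match the Hive-style one.
    private static final List<Pattern> BUCKET_PATTERNS = List.of(
            Pattern.compile(".*bucket-(\\d{1,9}).*"),
            Pattern.compile("(\\d{1,9})_\\d+.*"));

    private BucketFileSketch() {}

    // Returns the bucket index encoded in the file name, if any pattern matches.
    static OptionalInt bucketNumber(String fileName)
    {
        for (Pattern pattern : BUCKET_PATTERNS) {
            Matcher matcher = pattern.matcher(fileName);
            if (matcher.matches()) {
                return OptionalInt.of(Integer.parseInt(matcher.group(1)));
            }
        }
        return OptionalInt.empty();
    }

    // Maps bucket index -> files. Buckets with no files are simply absent from the result.
    static Map<Integer, List<String>> assign(List<String> fileNames, int bucketCount)
    {
        Map<Integer, List<String>> filesByBucket = new TreeMap<>();
        boolean allParsed = true;
        for (String name : fileNames) {
            OptionalInt bucket = bucketNumber(name);
            if (!bucket.isPresent()) {
                allParsed = false;
                break;
            }
            if (bucket.getAsInt() >= bucketCount) {
                throw new IllegalStateException(
                        "File " + name + " is for bucket " + bucket.getAsInt() + ", but bucket count is " + bucketCount);
            }
            filesByBucket.computeIfAbsent(bucket.getAsInt(), key -> new ArrayList<>()).add(name);
        }
        if (allParsed) {
            return filesByBucket;
        }

        // Fallback: at least one name has no index, so require exactly one file per
        // bucket and derive the index from sorted order (one assumed ordering).
        if (fileNames.size() != bucketCount) {
            throw new IllegalStateException(
                    "Found " + fileNames.size() + " files, but bucket count is " + bucketCount);
        }
        List<String> sorted = new ArrayList<>(fileNames);
        Collections.sort(sorted);
        Map<Integer, List<String>> result = new TreeMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            result.put(i, List.of(sorted.get(i)));
        }
        return result;
    }
}
```

For example, `BucketFileSketch.assign(List.of("000000_0_q1", "000002_0_q1", "000002_0_q2"), 4)` would place one file in bucket 0 and two files in bucket 2, leaving buckets 1 and 3 without files.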

aweisberg · 2021-07-22T21:24:44Z

presto-hive/src/test/java/com/facebook/presto/hive/TestBackgroundHiveSplitLoader.java

This belongs in TestHiveWriterFactory

Fixed. Thanks

aweisberg · 2021-07-22T21:26:24Z

presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java

This is the wrong test parameter here. It's supposed to be createEmpty, which wasn't added to the test method.

Added. Thanks

v-jizhang · 2021-07-23T00:14:53Z

I'll fix them and do another push once review is complete.

viczhang861

Thank you very much for working on this.
FYI, e3f7ede improves the handling of empty bucket files for temporary tables. Since the Hive version has already been updated to 3.0, we can combine the previous session property (for temporary tables) with the one you added here, but let's do that refactoring later, once the changes introduced in this PR are tested and stable.
Put "Cherry-pick of Trino#848" in the commit message and make the commit title informative.
You can cherry-pick multiple PRs from Trino into one commit, whatever you think makes the most sense.

viczhang861 · 2021-07-23T15:10:26Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

Nit: try to fix trailing space in the original commit

Fixed, thanks.

@v-jizhang, seems the change is still in the wrong commit?

Cherry-pick of trinodb/trino#822. The following commits are included:
* Move table name to end of error message - trinodb/trino@a78f930
* Add getSchemaTableName method - trinodb/trino@c761b47
* Make HiveWritableTableHandle field final - trinodb/trino@6a06017
* Remove unnecessary schemaTableName utility method - trinodb/trino@6323b9a
* Cleanup code in HiveMetadata - trinodb/trino@f0d5e52
* Remove explicit file prefix for Hive writer handles - trinodb/trino@3d2f977
* Allow query ID to be a file name prefix or suffix - trinodb/trino@7b6d37e
* Use Hive naming convention for bucket file names - trinodb/trino@b56b285
* Simplify code in getBucketedSplits - trinodb/trino@f814cd6
* Allow multiple or missing Hive bucket files - trinodb/trino@ebcbf22
* Allow disabling the creation of empty bucket files - trinodb/trino@dfaa70c
Co-authored-by: David Phillips <david@acz.org>

Cherry-pick of trinodb/trino#1375 Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>

Cherry-pick of trinodb/trino#848 Co-authored-by: David Phillips <david@acz.org>

prestodb#16456 broke creation of empty unpartitioned bucketed tables if they weren't using a temporary staging directory, as the target directory would not get created.

v-jizhang mentioned this pull request Jul 21, 2021

Allow multiple or missing Hive bucket files #16230

Closed

aweisberg requested review from aweisberg and highker July 22, 2021 15:35

v-jizhang force-pushed the hive_multiple_missing_bucket_files_2 branch from a55838d to 7f8def6 July 22, 2021 21:16

aweisberg suggested changes Jul 22, 2021

viczhang861 reviewed Jul 23, 2021

v-jizhang and others added 3 commits July 23, 2021 13:20

Skip empty bucket creation by default

6079159

Cherry-pick of trinodb/trino#1375 Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>

Document Hive create empty buckets config

8edb15b

Cherry-pick of trinodb/trino#848 Co-authored-by: David Phillips <david@acz.org>

v-jizhang force-pushed the hive_multiple_missing_bucket_files_2 branch from 8028e1e to 8edb15b July 23, 2021 20:23

v-jizhang requested a review from aweisberg July 23, 2021 20:30

aweisberg approved these changes Jul 26, 2021

aweisberg requested a review from viczhang861 July 26, 2021 19:16

highker approved these changes Aug 5, 2021

highker self-assigned this Aug 5, 2021

highker merged commit 69f63c5 into prestodb:master Aug 9, 2021

rschlussel mentioned this pull request Aug 10, 2021

Always create hive target directory if not present #16590

Merged

rschlussel added a commit that referenced this pull request Aug 11, 2021

Always create hive target directory if not present

a778192

#16456 broke creation of empty unpartitioned bucketed tables if they weren't using a temporary staging directory, as the target directory would not get created.

varungajjala mentioned this pull request Aug 16, 2021

Add release notes for 0.260 #16619

Merged

3 tasks
