Skip to content

Allow multiple or missing Hive bucket files#16456

Merged
highker merged 3 commits intoprestodb:masterfrom
v-jizhang:hive_multiple_missing_bucket_files_2
Aug 9, 2021
Merged

Allow multiple or missing Hive bucket files#16456
highker merged 3 commits intoprestodb:masterfrom
v-jizhang:hive_multiple_missing_bucket_files_2

Conversation

@v-jizhang
Copy link
Contributor

@v-jizhang v-jizhang commented Jul 21, 2021

Cherry-pick of trinodb/trino#822,
trinodb/trino#848 and
trinodb/trino#1375

Co-authored-by: David Phillips david@acz.org
Co-authored-by: Piotr Findeisen piotr.findeisen@gmail.com

== RELEASE NOTES ==

Hive Changes
* Allow multiple or missing Hive bucket files
This can be configured by using Hive Configuration Property ``hive.create-empty-bucket-files``.
Changes are also made to use Hive naming convention for bucket file names when computing bucket file name.

@aweisberg aweisberg requested review from aweisberg and highker July 22, 2021 15:35
@v-jizhang v-jizhang force-pushed the hive_multiple_missing_bucket_files_2 branch from a55838d to 7f8def6 Compare July 22, 2021 21:16
Copy link
Contributor

@aweisberg aweisberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this!

Noticed some small things.

The commit messages don't follow our guidelines https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#example-commit-message Specifically the later commits don't have the original commit message and don't link to the PR in the body.

The release note isn't very detailed should link to the documentation for the new configuration options. I think it's also worth mentioning it changes the naming convention used by Presto for filenames just so people know it occurred.

I am not 100% sure we want to default to omitting empty bucket files. Generally I want to make sure this isn't going to break compatibility with systems that are expecting the existing output from Presto.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to your change, but handling temporary vs non Temporary could be handled entirely inside location service simplifying this code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No love for assertThat? :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to keep it as it is. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an opportunity to simplify here after #15629 because we can now handle missing/empty buckets.

@viczhang861 do you agree? Maybe we can converge on one code path here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viczhang861 Commented below, I'll refactor it after this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Banged my head against this for a while...

So you can have multiple files for each bucket hence array multi-map, but only if it follows the format where the bucket index can be extracted. Earlier changes (squashed into one commit in this PR) changed the output format for Presto to match Hive and left it backwards compatible in terms of being able to extract bucket numbers from filenames.

If the bucket index can't be extracted then the number of files must match number of buckets. Then there are two different file name formats in this case where the number can't be extracted and we have two potential ways to sort in order to generate an index for each file. I don't know why there are two ways, but that was pre-existing from #15536

You are allowed to have missing files as long as all files have the bucket index as well because the check was moved into the loop with the continue that will skip the check.

So in conclusion I believe this does what it says in terms of allowing missing or multiples files, and I think it preserves all the existing behaviors that were there before.

Thank you for coming to my TED talk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This belongs in TestHiveWriterFactory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong test parameter here. It's supposed to be createEmpty which wasn't added to the test method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Thanks

@v-jizhang
Copy link
Contributor Author

Thank you for working on this!

Noticed some small things.

The commit messages don't follow our guidelines https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#example-commit-message Specifically the later commits don't have the original commit message and don't link to the PR in the body.

The release note isn't very detailed should link to the documentation for the new configuration options. I think it's also worth mentioning it changes the naming convention used by Presto for filenames just so people know it occurred.

I am not 100% sure we want to default to omitting empty bucket files. Generally I want to make sure this isn't going to break compatibility with systems that are expecting the existing output from Presto.

I'll fix them and do another push once review is complete.

Copy link
Contributor

@viczhang861 viczhang861 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Thank you very much for working on this.
  2. FYI, e3f7ede improves the case for empty bucket file of temporary table, with Hive version update to 3.0 already, we can combine the previous session property (for temporary table) and the one you added here, but let's do refactoring later when changes introduced in this PR are tested and stable.
  3. Put Cherry-pick of Trino#848 as commit message and make commit title informative
  4. You can cherry pick multiple PRs from Trino into one commit, whatever you think makes most sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: try to fix trailing space in the original commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v-jizhang, seems the change is still in the wrong commit?

v-jizhang and others added 3 commits July 23, 2021 13:20
Cherry-pick of trinodb/trino#822.
The following commits are included:
Move table name to end of error message -
    trinodb/trino@a78f930
Add getSchemaTableName method -
    trinodb/trino@c761b47
Make HiveWritableTableHandle field final -
    trinodb/trino@6a06017
Remove unnecessary schemaTableName utility method -
    trinodb/trino@6323b9a
Cleanup code in HiveMetadata -
    trinodb/trino@f0d5e52
Remove explicit file prefix for Hive writer handles -
    trinodb/trino@3d2f977
Allow query ID to be a file name prefix or suffix -
    trinodb/trino@7b6d37e
Use Hive naming convention for bucket file names -
    trinodb/trino@b56b285
Simplify code in getBucketedSplits -
    trinodb/trino@f814cd6
Allow multiple or missing Hive bucket files -
    trinodb/trino@ebcbf22
Allow disabling the creation of empty bucket files
    trinodb/trino@dfaa70c

Co-authored-by: David Phillips <david@acz.org>
Cherry-pick of trinodb/trino#1375

Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Cherry-pick of trinodb/trino#848

Co-authored-by: David Phillips <david@acz.org>
@v-jizhang v-jizhang force-pushed the hive_multiple_missing_bucket_files_2 branch from 8028e1e to 8edb15b Compare July 23, 2021 20:23
@v-jizhang v-jizhang requested a review from aweisberg July 23, 2021 20:30
@aweisberg aweisberg requested a review from viczhang861 July 26, 2021 19:16
@highker highker self-assigned this Aug 5, 2021
@highker highker merged commit 69f63c5 into prestodb:master Aug 9, 2021
rschlussel added a commit to rschlussel/presto that referenced this pull request Aug 11, 2021
prestodb#16456 broke creation of empty
unpartitioned bucketed tables if they weren't using a temporary staging
directory, as the target directory would not get created.
rschlussel added a commit that referenced this pull request Aug 11, 2021
#16456 broke creation of empty
unpartitioned bucketed tables if they weren't using a temporary staging
directory, as the target directory would not get created.
@varungajjala varungajjala mentioned this pull request Aug 16, 2021
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants