Allow multiple or missing Hive bucket files#16456
Conversation
a55838d to
7f8def6
Compare
aweisberg
left a comment
There was a problem hiding this comment.
Thank you for working on this!
Noticed some small things.
The commit messages don't follow our guidelines https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#example-commit-message Specifically the later commits don't have the original commit message and don't link to the PR in the body.
The release note isn't very detailed should link to the documentation for the new configuration options. I think it's also worth mentioning it changes the naming convention used by Presto for filenames just so people know it occurred.
I am not 100% sure we want to default to omitting empty bucket files. Generally I want to make sure this isn't going to break compatibility with systems that are expecting the existing output from Presto.
There was a problem hiding this comment.
Unrelated to your change, but handling temporary vs non Temporary could be handled entirely inside location service simplifying this code.
There was a problem hiding this comment.
No love for assertThat? :-)
There was a problem hiding this comment.
Just want to keep it as it is. :)
There was a problem hiding this comment.
There is an opportunity to simplify here after #15629 because we can now handle missing/empty buckets.
@viczhang861 do you agree? Maybe we can converge on one code path here.
There was a problem hiding this comment.
@viczhang861 Commented below, I'll refactor it after this PR.
There was a problem hiding this comment.
Banged my head against this for a while...
So you can have multiple files for each bucket hence array multi-map, but only if it follows the format where the bucket index can be extracted. Earlier changes (squashed into one commit in this PR) changed the output format for Presto to match Hive and left it backwards compatible in terms of being able to extract bucket numbers from filenames.
If the bucket index can't be extracted then the number of files must match number of buckets. Then there are two different file name formats in this case where the number can't be extracted and we have two potential ways to sort in order to generate an index for each file. I don't know why there are two ways, but that was pre-existing from #15536
You are allowed to have missing files as long as all files have the bucket index as well because the check was moved into the loop with the continue that will skip the check.
So in conclusion I believe this does what it says in terms of allowing missing or multiples files, and I think it preserves all the existing behaviors that were there before.
Thank you for coming to my TED talk.
There was a problem hiding this comment.
This belongs in TestHiveWriterFactory
There was a problem hiding this comment.
This is the wrong test parameter here. It's supposed to be createEmpty which wasn't added to the test method.
I'll fix them and do another push once review is complete. |
viczhang861
left a comment
There was a problem hiding this comment.
- Thank you very much for working on this.
- FYI, e3f7ede improves the case for empty bucket file of temporary table, with Hive version update to 3.0 already, we can combine the previous session property (for temporary table) and the one you added here, but let's do refactoring later when changes introduced in this PR are tested and stable.
- Put Cherry-pick of Trino#848 as commit message and make commit title informative
- You can cherry pick multiple PRs from Trino into one commit, whatever you think makes most sense.
There was a problem hiding this comment.
Nit: try to fix trailing space in the original commit
There was a problem hiding this comment.
@v-jizhang, seems the change is still in the wrong commit?
Cherry-pick of trinodb/trino#822. The following commits are included: Move table name to end of error message - trinodb/trino@a78f930 Add getSchemaTableName method - trinodb/trino@c761b47 Make HiveWritableTableHandle field final - trinodb/trino@6a06017 Remove unnecessary schemaTableName utility method - trinodb/trino@6323b9a Cleanup code in HiveMetadata - trinodb/trino@f0d5e52 Remove explicit file prefix for Hive writer handles - trinodb/trino@3d2f977 Allow query ID to be a file name prefix or suffix - trinodb/trino@7b6d37e Use Hive naming convention for bucket file names - trinodb/trino@b56b285 Simplify code in getBucketedSplits - trinodb/trino@f814cd6 Allow multiple or missing Hive bucket files - trinodb/trino@ebcbf22 Allow disabling the creation of empty bucket files trinodb/trino@dfaa70c Co-authored-by: David Phillips <david@acz.org>
Cherry-pick of trinodb/trino#1375 Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Cherry-pick of trinodb/trino#848 Co-authored-by: David Phillips <david@acz.org>
8028e1e to
8edb15b
Compare
prestodb#16456 broke creation of empty unpartitioned bucketed tables if they weren't using a temporary staging directory, as the target directory would not get created.
#16456 broke creation of empty unpartitioned bucketed tables if they weren't using a temporary staging directory, as the target directory would not get created.
Cherry-pick of trinodb/trino#822,
trinodb/trino#848 and
trinodb/trino#1375
Co-authored-by: David Phillips david@acz.org
Co-authored-by: Piotr Findeisen piotr.findeisen@gmail.com