Allow multiple or missing Hive bucket files #822

Merged
electrum merged 11 commits into trinodb:master from electrum:hive-bucket
May 28, 2019
Conversation

@electrum
Member

@electrum electrum commented May 27, 2019

No description provided.

@cla-bot cla-bot bot added the cla-signed label May 27, 2019
@electrum electrum requested a review from dain May 27, 2019 04:36
@electrum electrum requested a review from martint May 27, 2019 18:59
@findepi
Member

findepi commented May 28, 2019

@electrum do you happen to know if this functionality is similar to prestodb/presto#6282?

@electrum electrum added this to the 312 milestone May 28, 2019
@electrum
Member Author

@findepi It is similar but more flexible, as this allows any number of files per bucket, whereas that one seems to require the file count to be a multiple of the bucket count. This also changes Presto to write files using the Hive naming convention.
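
To make the "any number of files per bucket" idea concrete, here is a minimal sketch of grouping files by bucket. It is not the code from this PR: the file name pattern (a zero-padded bucket number, an underscore, an attempt number, then an optional suffix such as `_copy_1`) is an assumption about Hive-style names, and `group_files_by_bucket` is a hypothetical helper.

```python
import re

# Assumed Hive-style name: zero-padded bucket number, underscore, attempt
# number, then an optional suffix (for example "000003_0" or "000003_0_copy_1").
BUCKET_FILE = re.compile(r"^(\d+)_\d+.*$")


def group_files_by_bucket(file_names, bucket_count):
    """Map every bucket number to the (possibly empty) list of its files."""
    # Start with an empty list per bucket so missing buckets are allowed.
    buckets = {bucket: [] for bucket in range(bucket_count)}
    for name in file_names:
        match = BUCKET_FILE.match(name)
        if match is None:
            raise ValueError(f"not a bucket file name: {name}")
        bucket = int(match.group(1))
        if bucket >= bucket_count:
            raise ValueError(f"bucket {bucket} out of range for {name}")
        # Any number of files may land in the same bucket.
        buckets[bucket].append(name)
    return buckets


files = ["000000_0", "000000_0_copy_1", "000002_0"]
grouped = group_files_by_bucket(files, bucket_count=4)
print(grouped)
```

In this sketch, bucket 0 ends up with two files and buckets 1 and 3 with none, and neither case is treated as an error because the file count is never compared with the bucket count.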

@jiegzhan

jiegzhan commented Nov 7, 2019

With EMR 5-21 (Presto 0.215), still got this issue: Query 20191107_222140_00006_rf89j failed: Hive table 'dev.wifi_logs' is corrupt. The number of files in the directory (256) does not match the declared bucket count (64) for partition: date_key=2019-11-05

Are there any configurations to ignore this check?

@findepi
Member

findepi commented Nov 7, 2019

@jiegzhan
In 0.215 the check is unconditional and cannot be disabled. That version simply does not handle your case.
Please try Presto 324 (https://prestosql.io/download.html), where this will just work.

If you can't upgrade just yet for some reason, please ask for more advice in the #troubleshooting channel on our Slack (https://prestosql.io/slack.html).
See you there!
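
For illustration only, the unconditional check can be pictured roughly like this. The sketch is inferred solely from the error message quoted above and is not the actual Presto 0.215 code; `check_one_file_per_bucket` is a hypothetical name.

```python
def check_one_file_per_bucket(file_count, bucket_count, partition):
    """Reject a partition unless it holds exactly one file per bucket."""
    # Assumption: the strict rule is simply "file count equals bucket count".
    if file_count != bucket_count:
        raise ValueError(
            f"The number of files in the directory ({file_count}) does not "
            f"match the declared bucket count ({bucket_count}) "
            f"for partition: {partition}"
        )


# The case from the report: 256 files for 64 declared buckets is rejected.
try:
    check_one_file_per_bucket(256, 64, "date_key=2019-11-05")
    rejected = False
except ValueError as error:
    rejected = True
    message = str(error)
```

A reader that accepts multiple or missing bucket files has no such equality check, which is why upgrading avoids the error.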

@akhilnaidu

akhilnaidu commented Jun 11, 2020

> With EMR 5-21 (Presto 0.215), still got this issue: Query 20191107_222140_00006_rf89j failed: Hive table 'dev.wifi_logs' is corrupt. The number of files in the directory (256) does not match the declared bucket count (64) for partition: date_key=2019-11-05
>
> Are there any configurations to ignore this check?

@jiegzhan You are referring to the PrestoDB distribution (release 0.215) that ships with EMR, whereas this feature is part of PrestoSQL release 312.
Currently this is not supported in the EMR PrestoDB distribution.

v-jizhang added a commit to v-jizhang/presto that referenced this pull request Jun 9, 2021
Cherry-pick of trinodb/trino#822,
trinodb/trino#848 and
trinodb/trino#1375

Co-authored-by: David Phillips <david@acz.org>
Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
v-jizhang added a commit to v-jizhang/presto that referenced this pull request Jul 23, 2021
Cherry-pick of trinodb/trino#822.
The following commits are included:
Move table name to end of error message -
    trinodb/trino@a78f930
Add getSchemaTableName method -
    trinodb/trino@c761b47
Make HiveWritableTableHandle field final -
    trinodb/trino@6a06017
Remove unnecessary schemaTableName utility method -
    trinodb/trino@6323b9a
Cleanup code in HiveMetadata -
    trinodb/trino@f0d5e52
Remove explicit file prefix for Hive writer handles -
    trinodb/trino@3d2f977
Allow query ID to be a file name prefix or suffix -
    trinodb/trino@7b6d37e
Use Hive naming convention for bucket file names -
    trinodb/trino@b56b285
Simplify code in getBucketedSplits -
    trinodb/trino@f814cd6
Allow multiple or missing Hive bucket files -
    trinodb/trino@ebcbf22
Allow disabling the creation of empty bucket files
    trinodb/trino@dfaa70c

Co-authored-by: David Phillips <david@acz.org>
highker pushed a commit to prestodb/presto that referenced this pull request Aug 9, 2021
Cherry-pick of trinodb/trino#822.
The following commits are included:
Move table name to end of error message -
    trinodb/trino@a78f930
Add getSchemaTableName method -
    trinodb/trino@c761b47
Make HiveWritableTableHandle field final -
    trinodb/trino@6a06017
Remove unnecessary schemaTableName utility method -
    trinodb/trino@6323b9a
Cleanup code in HiveMetadata -
    trinodb/trino@f0d5e52
Remove explicit file prefix for Hive writer handles -
    trinodb/trino@3d2f977
Allow query ID to be a file name prefix or suffix -
    trinodb/trino@7b6d37e
Use Hive naming convention for bucket file names -
    trinodb/trino@b56b285
Simplify code in getBucketedSplits -
    trinodb/trino@f814cd6
Allow multiple or missing Hive bucket files -
    trinodb/trino@ebcbf22
Allow disabling the creation of empty bucket files
    trinodb/trino@dfaa70c

Co-authored-by: David Phillips <david@acz.org>