feat(native): Insert into bucketed but unpartitioned Hive table #25139
anandamideShakyan wants to merge 4 commits into prestodb:master
Conversation
@anandamideShakyan: Thanks for this PR. Have you tried this functionality with Prestissimo? You might need facebookincubator/velox#13283 as well for it.
@aditi-pandit Sure, I will add the support in Prestissimo after facebookincubator/velox#13283 is merged.
@anandamideShakyan: There are failures in the product tests. PTAL.
Force-pushed from 0e437dd to 38805a8.
I tried it on Prestissimo, with one coordinator and one worker. I created a table in the hive schema and tpcds catalog, inserted values, and was able to see the entries on running the select query (a sketch of such statements is below).
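The exact statements were not preserved in this thread. The following is a minimal sketch of that flow, assuming a Hive catalog named hive with a tpcds schema; the table and column names are hypothetical:

```sql
-- Hypothetical sketch; table and column names are invented.
-- Bucketed but unpartitioned: bucketed_by/bucket_count, no partitioned_by.
CREATE TABLE hive.tpcds.bucketed_items (
    item_id BIGINT,
    item_desc VARCHAR
)
WITH (bucketed_by = ARRAY['item_id'], bucket_count = 4);

INSERT INTO hive.tpcds.bucketed_items
VALUES (1, 'first'), (2, 'second'), (3, 'third');

SELECT * FROM hive.tpcds.bucketed_items;
```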
Force-pushed from 38805a8 to 817f7df.
The review comment below was left on the new test class:

```java
import static java.lang.Boolean.parseBoolean;
import static org.testng.Assert.assertEquals;

public class TestHivePartitionedInsertNative
```
Could we move these testcases to presto-tests or presto-product-tests? Ideally, we don't want to add new testcases to presto-native-tests; instead we should just extend the existing e2e tests (such as the ones added to presto-product-tests in this PR) to run with the native query runner.
Force-pushed from 817f7df to ed26ecc.
Consider adding to the documentation an example of how to use this new ability, or at least a mention that this is now possible for users and why it's useful (as you wrote in the Description).
Force-pushed from ed26ecc to e8591d3.
Force-pushed from e8591d3 to f68fe3d.
@anandamideShakyan: It will be good to complete this work, as it has been a long-pending item. Please take a look at the failures.
Inserts into bucketed Hive tables using the C++ (Velox) worker were failing during finishInsert with a verification error (Presto concluded that bucket files were missing).

Root cause: Presto's Hive metadata layer assumes exactly one file per bucket per partition. The Java worker never hits this path because it always creates one file per bucket, even when a bucket receives zero rows. The Velox (C++) HiveDataSink, however, only created writers for buckets that actually received rows. When a bucket was empty there was no writer, hence no file, so Presto thought the bucket was missing and failed verification. This is why inserts succeeded when the data happened to hit all buckets and failed otherwise; a repro sketch follows this comment.

Fix: The fix ensures that Velox creates one writer (and therefore one output file) per bucket, matching the Java worker's behavior and Presto's expectations. Specifically, during HiveDataSink::splitInputRowsAndEnsureWriters() we now pre-create writers for all buckets (for each partition, if partitioned). This guarantees that every bucket produces exactly one file, even if it contains zero rows. As a result, computeFileNamesForMissingBuckets() is never triggered and finishInsert succeeds.
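To make the failure mode concrete, here is a hypothetical repro (invented names): with more buckets than distinct bucket-key values, some buckets receive no rows, so the C++ worker previously produced no file for them.

```sql
-- Hypothetical repro; catalog, schema, and table names are invented.
CREATE TABLE hive.tpch.orders_bucketed (
    orderkey BIGINT,
    status VARCHAR
)
WITH (bucketed_by = ARRAY['orderkey'], bucket_count = 8);

-- Two rows cannot populate eight buckets: before the fix, the Velox
-- HiveDataSink wrote files only for the non-empty buckets, Presto's
-- one-file-per-bucket verification failed, and the INSERT errored out.
-- After the fix, all eight buckets get a file (most of them empty).
INSERT INTO hive.tpch.orders_bucketed VALUES (1, 'O'), (2, 'F');
```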
To Do: With this fix applied locally, I am able to insert into bucketed Hive tables with and without the sidecar. I am now looking at resolving the unit test failure that came after these changes: #25115
@anandamideShakyan: Presto has a property hive.create-empty-bucket-files to control whether to create empty bucket files. It seems like this should always be false for the native engine. But in any case, doesn't the Presto server create the missing buckets on the coordinator in the TableFinish logic, and not in the worker? I feel it should be on the coordinator in TableFinish, as it's only after seeing all the worker files that we know which buckets are empty. The individual worker cannot make this decision. This error seems like a local problem between Hive and Presto on the coordinator. Please recheck whether something else is missing.
Description
Addresses #25104
Currently, Presto does not support INSERT INTO operations on bucketed but unpartitioned Hive tables. This limitation originates from a hard check in HiveWriterFactory:
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveWriterFactory.java#L480
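For illustration, this is the kind of statement that check blocked (hypothetical names, using the Presto Hive connector's table properties):

```sql
-- Hypothetical example; names are invented.
CREATE TABLE hive.default.users_bucketed (
    user_id BIGINT,
    country VARCHAR
)
WITH (bucketed_by = ARRAY['user_id'], bucket_count = 16);

-- Bucketed but unpartitioned: before this change, the INSERT below was
-- rejected by the hard check in HiveWriterFactory linked above.
INSERT INTO hive.default.users_bucketed VALUES (101, 'US'), (102, 'IN');
```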
Motivation and Context
Supporting writes to bucketed unpartitioned Hive tables in Presto would improve compatibility and enhance Presto’s ability to handle modern Hive table layouts. It's a reasonable and useful feature for users who wish to leverage bucketing for performance optimizations even without partitioning.
Impact
This change would align Presto’s behavior with the broader SQL-on-Hadoop ecosystem and remove an artificial limitation that may block valid use cases — particularly in data warehousing environments where bucketing is used independently of partitioning.
Release Notes