
feat(native): Insert into bucketed but unpartitioned Hive table #25139

Open
anandamideShakyan wants to merge 4 commits into prestodb:master from anandamideShakyan:insert-bucketed-unpar-hive

Conversation

@anandamideShakyan
Contributor

@anandamideShakyan anandamideShakyan commented May 18, 2025

Description

Addresses #25104
Currently, Presto does not support INSERT INTO operations on bucketed but unpartitioned Hive tables. This limitation originates from a hard check in HiveWriterFactory:

https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveWriterFactory.java#L480
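For illustration, the guard behaves roughly like the sketch below: bucketed writes are rejected unless the table is also partitioned. This is a hypothetical simplification for readers, not the verbatim code at the linked line; the class and method names here are invented.

```java
import java.util.OptionalInt;

// Hypothetical simplification of the HiveWriterFactory guard described above:
// a write to a bucketed table is rejected unless the table is partitioned.
// Names are illustrative, not the actual Presto code.
public class BucketedWriteGuard {
    static void checkWriteSupported(OptionalInt bucketCount, boolean isPartitioned) {
        if (bucketCount.isPresent() && !isPartitioned) {
            throw new IllegalStateException(
                    "Inserting into bucketed unpartitioned Hive tables is not supported");
        }
    }

    public static void main(String[] args) {
        checkWriteSupported(OptionalInt.empty(), false); // unbucketed: allowed
        checkWriteSupported(OptionalInt.of(4), true);    // bucketed + partitioned: allowed
        try {
            checkWriteSupported(OptionalInt.of(4), false); // bucketed, unpartitioned: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This PR's goal is to relax that third case so the write proceeds instead of throwing.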

Motivation and Context

Supporting writes to bucketed unpartitioned Hive tables in Presto would improve compatibility and enhance Presto’s ability to handle modern Hive table layouts. It's a reasonable and useful feature for users who wish to leverage bucketing for performance optimizations even without partitioning.

Impact

This change would align Presto’s behavior with the broader SQL-on-Hadoop ecosystem and remove an artificial limitation that may block valid use cases — particularly in data warehousing environments where bucketing is used independently of partitioning.

Release Notes

== NO RELEASE NOTE ==

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label May 18, 2025
@anandamideShakyan anandamideShakyan marked this pull request as ready for review May 18, 2025 16:29
@anandamideShakyan anandamideShakyan requested a review from a team as a code owner May 18, 2025 16:29
@prestodb-ci prestodb-ci requested review from a team, namya28 and pramodsatya and removed request for a team May 18, 2025 16:29
@aditi-pandit
Contributor

@anandamideShakyan: Thanks for this PR.

Have you tried this functionality with Prestissimo? You might need facebookincubator/velox#13283 as well for it.

@anandamideShakyan
Contributor Author

@aditi-pandit Sure, I will add the support in Prestissimo after facebookincubator/velox#13283 is merged.

@aditi-pandit
Contributor

@anandamideShakyan: There are failures in the product tests. PTAL.

2025-05-18 19:49:10 INFO: [78 of 435] com.facebook.presto.tests.hive.TestHiveBucketedTables.testInsertIntoBucketedTables (Groups: )
2025-05-18 19:49:11 INFO: FAILURE     /    com.facebook.presto.tests.hive.TestHiveBucketedTables.testInsertIntoBucketedTables (Groups: ) took 1.1 seconds
2025-05-18 19:49:11 SEVERE: Failure cause:
java.lang.IllegalArgumentException: No mutable table instance found for name TableHandle{name=bucket_nation}
	at io.prestodb.tempto.fulfillment.table.TablesState.get(TablesState.java:64)
	at io.prestodb.tempto.fulfillment.table.TablesState.get(TablesState.java:48)
	at com.facebook.presto.tests.hive.TestHiveBucketedTables.testInsertIntoBucketedTables(TestHiveBucketedTables.java:173)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@anandamideShakyan anandamideShakyan force-pushed the insert-bucketed-unpar-hive branch 4 times, most recently from 0e437dd to 38805a8 Compare May 26, 2025 07:21
@anandamideShakyan
Contributor Author

anandamideShakyan commented Jun 18, 2025

@anandamideShakyan: Thanks for this PR.

Have you tried this functionality with Prestissimo? You might need facebookincubator/velox#13283 as well for it.

I tried it on Prestissimo, with one coordinator and one worker. I created a table in the hive schema of the tpcds catalog using:

CREATE TABLE cars (
    id BIGINT,
    name VARCHAR,
    brand VARCHAR
)
WITH (
    format = 'PARQUET',
    bucketed_by = ARRAY['id'],
    bucket_count = 4
);

Inserted values:

INSERT INTO cars (id, name, brand) VALUES
  (1, 'Model S', 'Tesla'),
  (2, 'Civic', 'Honda'),
  (3, 'Mustang', 'Ford'),
  (4, 'A4', 'Audi');

Was able to see the entries on running the select query:

(screenshot: results of the SELECT query)

@anandamideShakyan anandamideShakyan force-pushed the insert-bucketed-unpar-hive branch from 38805a8 to 817f7df Compare June 25, 2025 07:33
@anandamideShakyan anandamideShakyan requested a review from a team as a code owner June 25, 2025 07:33
import static java.lang.Boolean.parseBoolean;
import static org.testng.Assert.assertEquals;

public class TestHivePartitionedInsertNative
Contributor


Could we move these testcases to presto-tests or presto-product-tests? Ideally, we don't want to add new testcases to presto-native-tests; instead, we should just extend the existing e2e tests (such as the ones added to presto-product-tests in this PR) to run with the native query runner.

@anandamideShakyan anandamideShakyan force-pushed the insert-bucketed-unpar-hive branch from 817f7df to ed26ecc Compare June 27, 2025 22:41
@steveburnett
Contributor

Consider adding to the documentation an example of how to use this new ability, or at least a mention that this is now possible and why it's useful (as you wrote in the Description).

@anandamideShakyan anandamideShakyan force-pushed the insert-bucketed-unpar-hive branch from ed26ecc to e8591d3 Compare January 31, 2026 11:15
@anandamideShakyan anandamideShakyan changed the title Insert into bucketed but unpartitioned Hive table feat(native): Insert into bucketed but unpartitioned Hive table Jan 31, 2026
@anandamideShakyan anandamideShakyan force-pushed the insert-bucketed-unpar-hive branch from e8591d3 to f68fe3d Compare January 31, 2026 12:36
@aditi-pandit
Contributor

@anandamideShakyan: It would be good to complete this work, as it has been a long-pending item. Please take a look at the failures.

@anandamideShakyan
Contributor Author

Inserts into bucketed Hive tables using the C++ (Velox) worker were failing during finishInsert with:
(screenshot: error raised during finishInsert)

VerifyException: computeFileNamesForMissingBuckets

This happens because Presto’s Hive metadata layer assumes exactly one file per bucket per partition.
If any bucket does not produce a file, Presto attempts to synthesize “missing bucket” files during commit.

The Java worker never hits this path because it always creates one file per bucket, even when a bucket receives zero rows.

The Velox (C++) HiveDataSink, however, only created writers for buckets that actually received rows. When a bucket was empty, no writer → no file, causing Presto to think the bucket was missing and fail verification.

This is why inserts succeeded when data happened to hit all buckets, and failed otherwise.

Fix

The fix ensures that Velox creates one writer (and therefore one output file) per bucket, matching Java worker behavior and Presto’s expectations.

Specifically:

During HiveDataSink::splitInputRowsAndEnsureWriters(), we now pre-create writers for all buckets (for each partition, if partitioned).

This guarantees that every bucket produces exactly one file, even if it contains zero rows.

As a result, computeFileNamesForMissingBuckets() is never triggered and finishInsert succeeds.
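The logic above can be sketched as follows. The real change is in Velox's C++ HiveDataSink, but the idea transliterates to a small Java model for illustration: pre-create a writer for every bucket id so each bucket emits exactly one file, even when it receives zero rows. All names here are hypothetical, and the hash routing stands in for Hive's actual bucketing function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual model of the fix: one writer (hence one output file) per bucket,
// regardless of whether any rows hash into that bucket. Without the
// pre-creation loop, empty buckets would produce no file and the coordinator
// would later fail verification for the "missing" buckets.
public class BucketWriterSketch {
    static Map<Integer, List<Long>> splitRowsAndEnsureWriters(long[] rowIds, int bucketCount) {
        Map<Integer, List<Long>> writers = new TreeMap<>();
        // The fix: pre-create one writer per bucket, so empty buckets still get a file.
        for (int b = 0; b < bucketCount; b++) {
            writers.put(b, new ArrayList<>());
        }
        // Route each row to its bucket (stand-in for Hive's bucketing hash).
        for (long id : rowIds) {
            int bucket = Math.floorMod(Long.hashCode(id), bucketCount);
            writers.get(bucket).add(id);
        }
        return writers;
    }

    public static void main(String[] args) {
        // Four buckets, but only three rows: some buckets stay empty,
        // yet every bucket still has a writer.
        Map<Integer, List<Long>> writers = splitRowsAndEnsureWriters(new long[]{1, 2, 3}, 4);
        System.out.println("writers/files produced: " + writers.size());
    }
}
```

Because writers are created up front, the number of output files equals the bucket count by construction, which is exactly the invariant Presto's commit path expects.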

To Do

  • This is a Velox-side fix (C++ worker behavior).

  • The original PR is in Presto, but the correct fix belongs in Velox, so a separate Velox PR is required. Will create the Velox PR soon.

  • This change aligns C++ worker semantics with Java worker semantics and Hive’s bucketing contract.

With this fix locally, I am able to insert into bucketed Hive tables with and without the sidecar. I am now looking at resolving the unit test failure that came after these changes: #25115

@aditi-pandit
Contributor

@anandamideShakyan: Presto has a property, hive.create-empty-bucket-files, to control whether to create empty bucket files. Seems like this should always be false for the native engine.

But in any case, doesn't the Presto server create the missing buckets on the co-ordinator in the TableFinish logic, and not in the worker? I feel it should be on the co-ordinator in TableFinish, as it's only after seeing all the worker files that we know which buckets are empty. An individual worker cannot make this decision.

This error seems like a local problem between Hive and Presto on the co-ordinator.

Please recheck if something else is missing.
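For reference, the property mentioned above is a catalog-level setting in the Hive connector's properties file. The snippet below is a sketch of where it would be set (the catalog filename and the suggested value for the native engine are assumptions, not part of this PR):

```
# etc/catalog/hive.properties (hypothetical deployment)
connector.name=hive-hadoop2
# Controls whether empty bucket files are created for buckets that receive no rows.
hive.create-empty-bucket-files=false
```

Whether false is the right value when running with C++ workers is exactly the open question in this thread.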

