
feat: Add SQL Support for MERGE INTO in Presto (efficient workload distribution)#1

Open
acarpente-denodo wants to merge 8 commits into feature/20578_SQL_Support_for_MERGE_INTO from feature/20578_SQL_Support_for_MERGE_INTO_(efficient_workload_distribution)

Conversation

@acarpente-denodo (Owner) commented Nov 13, 2025

Description

Engine support for SQL MERGE INTO. The MERGE INTO command inserts or updates rows in a table based on specified conditions.

Syntax:

MERGE INTO target_table [ [ AS ]  target_alias ]
USING { source_table | query } [ [ AS ] source_alias ]
ON search_condition
WHEN MATCHED THEN
    UPDATE SET ( column = expression [, ...] )
WHEN NOT MATCHED THEN
    INSERT [ column_list ]
    VALUES (expression, ...)

Example: using MERGE INTO to update the sales information for existing products and insert the sales information for products newly on the market.

MERGE INTO product_sales AS s
    USING monthly_sales AS ms
    ON s.product_id = ms.product_id
WHEN MATCHED THEN
    UPDATE SET
        sales = sales + ms.sales
      , last_sale = ms.sale_date
      , current_price = ms.price
WHEN NOT MATCHED THEN
    INSERT (product_id, sales, last_sale, current_price)
    VALUES (ms.product_id, ms.sales, ms.sale_date, ms.price)

The Presto engine commit introduces an enum called RowChangeParadigm, which describes how a connector modifies rows. The Iceberg connector uses the DELETE_ROW_AND_INSERT_ROW paradigm, which represents an updated row as a deleted row followed by an inserted row. The CHANGE_ONLY_UPDATED_COLUMNS paradigm is intended for connectors that can update individual columns of existing rows in place.
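As a rough illustration of the distinction (the enum name and constants match the PR description; the record and helper below are simplified stand-ins, not the actual Presto operator code):

```java
import java.util.List;
import java.util.Map;

public class RowChangeParadigmSketch
{
    // Hypothetical change record: an operation plus the row values it applies to.
    public record Change(String op, Map<String, Object> row) {}

    // Under DELETE_ROW_AND_INSERT_ROW, an UPDATE of oldRow -> newRow is
    // emitted as a deletion of the old row followed by an insertion of the
    // new one; this is how the Iceberg connector applies MERGE updates.
    // Under CHANGE_ONLY_UPDATED_COLUMNS it would remain a single in-place
    // change touching only the modified columns.
    public static List<Change> updateAsDeleteInsert(Map<String, Object> oldRow, Map<String, Object> newRow)
    {
        return List.of(new Change("DELETE", oldRow), new Change("INSERT", newRow));
    }

    public static void main(String[] args)
    {
        List<Change> changes = updateAsDeleteInsert(
                Map.of("product_id", 1, "sales", 100),
                Map.of("product_id", 1, "sales", 150));
        changes.forEach(System.out::println);
    }
}
```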

Note: these changes were made after reviewing the following Trino PR: trinodb/trino#7126, so this commit is closely based on Trino's implementation.

Motivation and Context

The MERGE INTO statement is commonly used to integrate data from two tables with different contents but similar structures.
For example, the source table could be part of a production transactional system, while the target table might be located in a data warehouse for analytics.
MERGE operations are typically run on a regular basis to update the analytics warehouse with the latest production data.
You can also use MERGE with tables that have different structures, as long as you can define a condition to match the rows between them.

Test Plan

Automated tests developed in TestSqlParser, TestSqlParserErrorHandling, TestStatementBuilder, AbstractAnalyzerTest, TestAnalyzer, and TestClassLoaderSafeWrappers classes.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Optimize MERGE INTO command execution.

@acarpente-denodo (Owner, Author) commented:
@sourcery-ai review

sourcery-ai bot commented Nov 13, 2025

Reviewer's Guide

This PR implements SQL MERGE support with efficient workload distribution by:

  • introducing a dedicated MergePartitioningHandle to unify insert/update partition schemes,
  • adapting the planner to generate and propagate partitioning schemes for MERGE,
  • updating the engine's NodePartitioningManager and local exchange rules to handle merge-specific partition logic,
  • extending the Metadata SPI for merge update layouts, and
  • providing Iceberg-specific bucket functions and a node partitioning provider stub.

Class diagram for new and updated partitioning handle classes

classDiagram
    class PartitioningHandle {
        +Optional<ConnectorId> connectorId
        +Optional<ConnectorTransactionHandle> transactionHandle
        +ConnectorPartitioningHandle connectorHandle
    }
    class MergePartitioningHandle {
        +Optional<PartitioningScheme> insertPartitioning
        +Optional<PartitioningScheme> updatePartitioning
        +NodePartitionMap getNodePartitioningMap(Function)
        +PartitionFunction getPartitionFunction(PartitionFunctionLookup, List<Type>, int[])
    }
    PartitioningHandle --> MergePartitioningHandle : uses
    MergePartitioningHandle --|> ConnectorPartitioningHandle
    class IcebergPartitioningHandle {
        +List<IcebergColumnHandle> partitioningColumns
        +List<PartitionField> partitioning
    }
    IcebergPartitioningHandle --|> ConnectorPartitioningHandle
    class IcebergUpdateHandle {
    }
    IcebergUpdateHandle --|> ConnectorPartitioningHandle

Class diagram for MergeWriterNode and MergeProcessorNode changes

classDiagram
    class MergeWriterNode {
        +PlanNode source
        +MergeTarget target
        +List<VariableReferenceExpression> mergeProcessorProjectedVariables
        +Optional<PartitioningScheme> partitioningScheme
        +List<VariableReferenceExpression> outputs
    }
    class MergeProcessorNode {
        +PlanNode source
        +MergeTarget target
        +VariableReferenceExpression targetTableRowIdColumnVariable
        +VariableReferenceExpression mergeRowVariable
        +List<VariableReferenceExpression> targetColumnVariables
        +List<VariableReferenceExpression> targetRedistributionColumnVariables
        +List<VariableReferenceExpression> outputs
    }
    MergeWriterNode --> MergeProcessorNode : uses

Class diagram for new Iceberg partitioning provider and bucket functions

classDiagram
    class IcebergNodePartitioningProvider {
        +Optional<ConnectorBucketNodeMap> getBucketNodeMap(...)
        +BucketFunction getBucketFunction(...)
        +int getBucketCount(...)
        +ToIntFunction<ConnectorSplit> getSplitBucketFunction(...)
    }
    IcebergNodePartitioningProvider --|> ConnectorNodePartitioningProvider
    class IcebergBucketFunction {
        +int getBucket(Page, int)
    }
    IcebergBucketFunction --|> BucketFunction
    class IcebergUpdateBucketFunction {
        +int getBucket(Page, int)
    }
    IcebergUpdateBucketFunction --|> BucketFunction

Class diagram for updated ConnectorNodePartitioningProvider SPI

classDiagram
    class ConnectorNodePartitioningProvider {
        +Optional<ConnectorBucketNodeMap> getBucketNodeMap(...)
        +BucketFunction getBucketFunction(...)
        +int getBucketCount(...)
        +ToIntFunction<ConnectorSplit> getSplitBucketFunction(...)
    }

Class diagram for updated Metadata SPI for merge update layout

classDiagram
    class Metadata {
        +Optional<PartitioningHandle> getMergeUpdateLayout(Session, TableHandle)
    }
    class ConnectorMetadata {
        +Optional<ConnectorPartitioningHandle> getMergeUpdateLayout(ConnectorSession, ConnectorTableHandle)
    }
    Metadata --> ConnectorMetadata : delegates

Class diagram for updated MergeAnalysis structure

classDiagram
    class MergeAnalysis {
        +Table targetTable
        +List<ColumnMetadata> targetColumnsMetadata
        +List<ColumnHandle> targetColumnHandles
        +List<ColumnHandle> targetRedistributionColumnHandles
        +List<List<ColumnHandle>> mergeCaseColumnHandles
        +Set<ColumnHandle> nonNullableColumnHandles
        +Map<ColumnHandle, Integer> columnHandleFieldNumbers
        +List<Integer> insertPartitioningArgumentIndexes
        +Optional<NewTableLayout> insertLayout
        +Optional<PartitioningHandle> updateLayout
        +Scope targetTableScope
        +Scope joinScope
    }

File-Level Changes

Change Details Files
Introduce MergePartitioningHandle and integrate partition logic into NodePartitioningManager
  • Add MergePartitioningHandle to represent combined insert/update partitioning schemes
  • Extend NodePartitioningManager to handle MergePartitioningHandle in getPartitionFunction and getNodePartitioningMap
  • Remove duplicate SystemPartitioningHandle logic and delegate to a new systemNodePartitionMap helper
  • Adjust getConnectorBucketNodeMap to return Optional and fall back to system distribution
NodePartitioningManager.java
SystemPartitioningHandle.java
MergePartitioningHandle.java
BasePlanFragmenter.java
Extend QueryPlanner and analyzer to generate and carry MERGE partitioning schemes
  • Implement createMergePartitioningScheme in QueryPlanner.plan(Merge) to build PartitioningScheme for MERGE
  • Capture insert/update layouts and redistribution columns in StatementAnalyzer and MergeAnalysis
  • Add partitioningScheme field to MergeWriterNode and targetRedistributionColumnVariables to MergeProcessorNode
  • Wire partitioningScheme through SymbolMapper, PruneUnreferencedOutputs, UnaliasSymbolReferences, and PlanPrinter
QueryPlanner.java
StatementAnalyzer.java
MergeAnalysis in Analysis.java
MergeWriterNode.java
MergeProcessorNode.java
SymbolMapper.java
PruneUnreferencedOutputs.java
UnaliasSymbolReferences.java
PlanPrinter.java
PlanBuilder.java
Adjust local exchange and global exchange rules to enforce merge partitioning
  • Modify AddLocalExchanges.visitPartitionedWriter to accept optional PartitioningScheme and enforce partitioned exchanges
  • Update AddExchanges.visitMergeWriter to pass partitioningScheme and control single-writer-per-partition flag
AddLocalExchanges.java
AddExchanges.java
Update merge operators and row-change processors to handle redistribution columns
  • Propagate redistributionColumns into MergeProcessorOperatorFactory and MergeProcessorOperator
  • Add redistributionChannelCount and mapping logic in DeleteAndInsertMergeProcessor
  • Include writeRedistributionColumnCount in ChangeOnlyUpdatedColumnsMergeProcessor and enforce input channel count checks
MergeProcessorOperator.java
DeleteAndInsertMergeProcessor.java
ChangeOnlyUpdatedColumnsMergeProcessor.java
Extend Metadata SPI for merge update layouts and implement in Iceberg/MetadataManager
  • Add getMergeUpdateLayout to ConnectorMetadata, Metadata, MetadataManager, DelegatingMetadataManager, ClassLoaderSafeConnectorMetadata
  • Implement getMergeUpdateLayout and getInsertLayout in IcebergAbstractMetadata
  • Wire mergeUpdateLayout in StatementAnalyzer and MetadataManager
ConnectorMetadata.java
Metadata.java
MetadataManager.java
DelegatingMetadataManager.java
ClassLoaderSafeConnectorMetadata.java
IcebergAbstractMetadata.java
StatementAnalyzer.java
Unify connector NodePartitioningProvider API to return Optional maps and add Iceberg provider stub
  • Change ConnectorNodePartitioningProvider.getBucketNodeMap signature to return Optional
  • Update all existing providers (Hive, Pinot, BlackHole, Tpcds, Tpch, ClassLoaderSafe) to return Optional
  • Add IcebergNodePartitioningProvider stub for future integration
ConnectorNodePartitioningProvider.java
ClassLoaderSafeNodePartitioningProvider.java
HiveNodePartitioningProvider.java
PinotNodePartitioningProvider.java
BlackHoleNodePartitioningProvider.java
TpcdsNodePartitioningProvider.java
TpchNodePartitioningProvider.java
IcebergNodePartitioningProvider.java
Provide Iceberg-specific bucket functions for merge and updates
  • Add IcebergBucketFunction to compute hash buckets based on PartitionSpec and transforms
  • Add IcebergUpdateBucketFunction for update layout bucket assignment
  • Adjust IcebergPageSourceProvider and IcebergPageSinkProvider constructors for partitionData parameters
IcebergBucketFunction.java
IcebergUpdateBucketFunction.java
IcebergPageSourceProvider.java
IcebergPageSinkProvider.java

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

sourcery-ai bot left a comment:
Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/NodePartitioningManager.java:217` </location>
<code_context>
+        int bucketCount = getBucketCount(session, partitioningHandle, connectorBucketNodeMap, preferDynamic);
+
+        // TODO #20578: WIP - This method is under development. Unsafe ".get()" method call.
+        NodeSelectionStrategy nodeSelectionStrategy = connectorBucketNodeMap.get().getNodeSelectionStrategy();
         switch (nodeSelectionStrategy) {
             case HARD_AFFINITY:
</code_context>

<issue_to_address>
**issue (bug_risk):** Unsafe use of Optional.get() in getBucketNodeMap could lead to runtime exceptions.

If getConnectorBucketNodeMap returns Optional.empty(), calling get() will throw an exception. Please handle the empty case explicitly or use orElseThrow with a clear error message.
</issue_to_address>
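One way to address this is the orElseThrow pattern with a descriptive message. A generic sketch (ConnectorBucketNodeMap and the surrounding method are stood in by placeholder types here, not the actual Presto classes):

```java
import java.util.Optional;

public class OptionalGuardSketch
{
    // Stand-in for the connector-provided bucket/node map.
    public record BucketNodeMap(String nodeSelectionStrategy) {}

    // Instead of connectorBucketNodeMap.get(), fail fast with a clear
    // message when the connector did not provide a bucket/node map.
    public static String nodeSelectionStrategy(Optional<BucketNodeMap> connectorBucketNodeMap)
    {
        return connectorBucketNodeMap
                .orElseThrow(() -> new IllegalStateException(
                        "Connector did not return a bucket node map for partitioning handle"))
                .nodeSelectionStrategy();
    }

    public static void main(String[] args)
    {
        System.out.println(nodeSelectionStrategy(Optional.of(new BucketNodeMap("HARD_AFFINITY"))));
    }
}
```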

### Comment 2
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/AddLocalExchanges.java:746` </location>
<code_context>
+            // connector provided hash function
+            verify(!(partitioningScheme.getPartitioning().getHandle().getConnectorHandle() instanceof SystemPartitioningHandle));
+            // TODO #20578: Check if the following verification is correct.
+            verify(partitioningScheme.getPartitioning().getArguments().stream()
+                    .noneMatch(argument -> argument instanceof ConstantExpression),
+                    "Table writer partitioning has constant arguments");
</code_context>

<issue_to_address>
**question:** The verification for constant arguments may be too strict.

This restriction may prevent legitimate scenarios, such as partitioning by a constant. Please review if this check is essential or if it can be made less strict.
</issue_to_address>

### Comment 3
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergNodePartitioningProvider.java:77` </location>
<code_context>
+    @Override
+    public int getBucketCount(ConnectorTransactionHandle transactionHandle, ConnectorSession session, ConnectorPartitioningHandle partitioningHandle)
+    {
+        return 0;
+    }
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Returning 0 for getBucketCount may cause division by zero errors.

Consider returning a positive default value or throwing an exception to prevent downstream errors.
</issue_to_address>
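A minimal sketch of the fail-fast alternative (the guard helper is hypothetical, not part of the PR; it only illustrates why a stub should not return 0):

```java
public class BucketCountGuardSketch
{
    // A bucket count of 0 makes any "hash % bucketCount" computation
    // divide by zero downstream; for an unimplemented stub, throwing with
    // a clear message is safer than silently returning 0.
    public static int checkedBucketCount(int bucketCount)
    {
        if (bucketCount <= 0) {
            throw new UnsupportedOperationException(
                    "Iceberg node partitioning does not provide a bucket count yet");
        }
        return bucketCount;
    }

    public static void main(String[] args)
    {
        System.out.println(checkedBucketCount(8)); // 8
        System.out.println(Math.floorMod(42, checkedBucketCount(8))); // 2
    }
}
```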

### Comment 4
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergBucketFunction.java:85` </location>
<code_context>
+    @Override
+    public int getBucket(Page page, int position)
+    {
+        return HiveBucketing.getBucket(bucketCount, types, page, position);
+
+        // TODO #20578: Trino.
</code_context>

<issue_to_address>
**question:** Using HiveBucketing for Iceberg may not be semantically correct.

Verify that HiveBucketing.getBucket aligns with Iceberg's partitioning logic, and clearly document any differences or constraints.
</issue_to_address>
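For context, the Iceberg table spec defines its bucket transform as `(murmur3_x86_32(encode(v)) & Integer.MAX_VALUE) % N`, which is a different hash than Hive's bucketing, so reusing HiveBucketing.getBucket can place rows in different buckets than Iceberg's own writers would. A sketch of the masking/modulo shape only (the hash input below is a plain int placeholder, not Murmur3):

```java
public class IcebergBucketSketch
{
    // Shape of Iceberg's bucket transform per the table spec:
    //   bucket(v, N) = (murmur3_x86_32(encode(v)) & Integer.MAX_VALUE) % N
    // The caller is expected to supply the Murmur3 hash; masking with
    // Integer.MAX_VALUE clears the sign bit so negative hashes still map
    // into the valid bucket range [0, numBuckets).
    public static int icebergStyleBucket(int hash, int numBuckets)
    {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args)
    {
        // A negative hash still yields a non-negative bucket index.
        System.out.println(icebergStyleBucket(-7, 4)); // 1
    }
}
```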

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch 2 times, most recently from 7ad95ad to 0521ca4 on November 17, 2025 at 17:51
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO_(efficient_workload_distribution) branch from 31239ce to 834479e on November 18, 2025 at 11:44
github-actions bot commented Nov 18, 2025

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

acarpente-denodo changed the title from "Feature/20578 sql support for merge into (efficient workload distribution)" to "feat: Add SQL Support for MERGE INTO in Presto (efficient workload distribution)" on Nov 18, 2025
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO_(efficient_workload_distribution) branch from 834479e to 047dbbd on November 18, 2025 at 14:55
acarpente-denodo and others added 8 commits November 19, 2025 11:51
Cherry-pick of trinodb/trino@cee96c3

Co-authored-by: David Stryker <david.stryker@starburstdata.com>
Automated tests.

Cherry-pick of trinodb/trino@cee96c3

Co-authored-by: David Stryker <david.stryker@starburstdata.com>
Support SQL MERGE in the Iceberg connector

Cherry-pick of trinodb/trino@6cb188b

Co-authored-by: David Phillips <david@acz.org>
SQL MERGE automated tests for Iceberg connector

Cherry-pick of trinodb/trino@6cb188b

Co-authored-by: David Phillips <david@acz.org>
Add MERGE efficient workload partitioning support
Add MERGE efficient workload partitioning support
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch from 0521ca4 to cd53db1 on November 19, 2025 at 10:55
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO_(efficient_workload_distribution) branch from 047dbbd to 46e53c6 on November 19, 2025 at 11:03
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch 5 times, most recently from 00a2694 to f9e99cb on November 26, 2025 at 16:35
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch from f9e99cb to 994f813 on November 26, 2025 at 17:08
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch 4 times, most recently from 0b73712 to 298b708 on December 22, 2025 at 15:46
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch from 298b708 to 79a5b6e on December 29, 2025 at 18:01
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch 2 times, most recently from da4927d to e6c9526 on December 30, 2025 at 12:26
acarpente-denodo force-pushed the feature/20578_SQL_Support_for_MERGE_INTO branch 8 times, most recently from 96b82ae to 92c97a0 on January 14, 2026 at 08:15