[Improvement] Storage Partition Join #13390
Closed
Summary
Currently, Apache Flink does not support storage partition join, which can lead to unnecessary data shuffles in batch mode. We have already implemented the query planner changes in Flink in apache/flink#26715.
This feature is ONLY applied via a config,
`table.optimizer.storage-partition-join-enabled=true`; otherwise there is no impact on current jobs. This PR only supports batch execution mode. This PR consists of the relevant changes for the Flink Iceberg Source. Please note that these changes are ONLY included for FLIP27 (new Source API).
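As a rough illustration of the opt-in described above, a Flink configuration fragment could look like the following. The storage-partition-join key is the one quoted in this description and presumably only exists in a Flink build that contains apache/flink#26715; `execution.runtime-mode` is Flink's standard option for selecting batch execution. Treat this as a sketch, not as the documented way to enable the feature.

```yaml
# Sketch only: assumes a Flink build that includes apache/flink#26715.
# Storage partition join is described as batch-only, so run the job in batch mode.
execution.runtime-mode: BATCH
# Opt-in flag quoted in this PR description; jobs without it are unaffected.
table.optimizer.storage-partition-join-enabled: true
```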
NOTE: We migrated to using FLIP27 and have included that backport in this PR. Backport: #10832
This PR adds the following support.
- Backport of #10832 for Iceberg 1.5.x
- Enhances `IcebergTableSource` to implement the `SupportsPartitioning` interface, which we defined on the Flink side and which enables Iceberg to report partitioning metadata to the Flink query planner. This is done via `outputPartitioning()` returning a `KeyGroupedPartitioning` with the table's partition scheme. It supports various transform types, including bucket, identity, month, day, and year.
- Improvements to `IcebergSource` to support StoragePartitionJoin
  - Enhances `FlinkSplitPlanner` with a method to group ScanTasks by `groupingKey` (partition values), which ensures that all records within the same partition end up being processed by the same subtask.
  - PartitionAwareSplitAssignment capabilities, including a new `PartitionAwareSplitAssignerFactory` and `PartitionAwareSplitAssigner`, which is responsible for ensuring that records with the same partition are assigned to the same subtask via deterministic assignment.
- Includes a new `SpecTransformToFlinkTransform` to map the various `TransformExpressions` used to represent the partitions to the Flink system.
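The grouping and deterministic assignment described in the list above can be sketched in plain Java. This is only an illustration of the idea, not the code in this PR: the `Task` class, the string partition keys, and the hash-modulo rule are all assumptions made for the example, and the real `PartitionAwareSplitAssigner` may pick subtasks differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; none of these names are taken from the PR. */
final class PartitionAwareAssignmentSketch {

    /** Hypothetical stand-in for a scan task: a data file plus its partition values as a string key. */
    static final class Task {
        final String partitionKey;
        final String file;

        Task(String partitionKey, String file) {
            this.partitionKey = partitionKey;
            this.file = file;
        }
    }

    /** Groups files by partition key; a TreeMap keeps the iteration order stable. */
    static Map<String, List<String>> groupByPartition(List<Task> tasks) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Task task : tasks) {
            groups.computeIfAbsent(task.partitionKey, k -> new ArrayList<>()).add(task.file);
        }
        return groups;
    }

    /**
     * Maps a partition key to a subtask index. String.hashCode() is specified by the
     * Java language, so the result is the same on every JVM (assumed rule: hash modulo).
     */
    static int subtaskFor(String partitionKey, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        return Math.floorMod(partitionKey.hashCode(), parallelism);
    }

    /** Assigns every partition group, as a whole, to exactly one subtask. */
    static Map<Integer, List<String>> assign(List<Task> tasks, int parallelism) {
        Map<Integer, List<String>> bySubtask = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : groupByPartition(tasks).entrySet()) {
            int subtask = subtaskFor(group.getKey(), parallelism);
            bySubtask.computeIfAbsent(subtask, k -> new ArrayList<>()).addAll(group.getValue());
        }
        return bySubtask;
    }
}
```

Because a whole partition group always goes to one subtask, and the index depends only on the key and the parallelism, two tables partitioned the same way would place matching partitions on the same subtask index. That co-location is the property a storage partition join relies on to skip the shuffle.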
Testing
- `TestPartitionAwareSplitAssigner` to verify that splits were deterministically applied to the correct subtasks
- `TestFlinkSplitPlanner` to test improved functionality to get batchSplits based on `ScanGroup`
- `TestStoragePartitionedJoin` to verify that we correctly ensure that we get the correct metadata
  - Correctly use SPJ
  - Correctly cannot use SPJ