feat(plugin-iceberg): Add rewrite_data_files procedure #26374

hantangwangd merged 10 commits into prestodb:master

Conversation
Sorry @hantangwangd, your pull request is larger than the review limit of 150000 diff characters
Reviewer's Guide

This PR refactors the Iceberg connector to enable distributed procedure execution by introducing a procedure context framework, adding a new serialized handle type, extending metadata to manage procedure lifecycles, implementing the rewrite_data_files procedure and its split source, wiring ProcedureRegistry into the connector, and providing extensive tests.

Sequence diagram for the distributed procedure lifecycle in the Iceberg connector:

sequenceDiagram
participant "Coordinator (Presto Engine)"
participant "IcebergAbstractMetadata"
participant "ProcedureRegistry"
participant "DistributedProcedure (RewriteDataFilesProcedure)"
participant "IcebergProcedureContext"
participant "IcebergSplitManager"
participant "CallDistributedProcedureSplitSource"
participant "IcebergPageSinkProvider"
participant "IcebergPageSink"
participant "IcebergTable"
"Coordinator (Presto Engine)"->>"IcebergAbstractMetadata": beginCallDistributedProcedure(...)
"IcebergAbstractMetadata"->>"ProcedureRegistry": resolve(procedureName)
"ProcedureRegistry"-->>"IcebergAbstractMetadata": DistributedProcedure instance
"IcebergAbstractMetadata"->>"DistributedProcedure": createContext()
"DistributedProcedure"-->>"IcebergAbstractMetadata": IcebergProcedureContext
"IcebergAbstractMetadata"->>"IcebergProcedureContext": setTable(Table)
"IcebergAbstractMetadata"->>"IcebergProcedureContext": setTransaction(Transaction)
"IcebergAbstractMetadata"->>"DistributedProcedure": begin(...)
"DistributedProcedure"->>"IcebergProcedureContext": setConnectorSplitSource(CallDistributedProcedureSplitSource)
"IcebergAbstractMetadata"-->>"Coordinator (Presto Engine)": IcebergDistributedProcedureHandle
"Coordinator (Presto Engine)"->>"IcebergSplitManager": getSplits(...)
"IcebergSplitManager"->>"IcebergAbstractMetadata": getSplitSourceInCurrentCallProcedureTransaction()
"IcebergAbstractMetadata"-->>"IcebergSplitManager": CallDistributedProcedureSplitSource
"IcebergSplitManager"-->>"Coordinator (Presto Engine)": splits
"Coordinator (Presto Engine)"->>"IcebergPageSinkProvider": createPageSink(..., IcebergDistributedProcedureHandle)
"IcebergPageSinkProvider"->>"IcebergPageSink": createPageSink(...)
"Coordinator (Presto Engine)"->>"IcebergAbstractMetadata": finishCallDistributedProcedure(...)
"IcebergAbstractMetadata"->>"ProcedureRegistry": resolve(procedureName)
"ProcedureRegistry"-->>"IcebergAbstractMetadata": DistributedProcedure instance
"IcebergAbstractMetadata"->>"DistributedProcedure": finish(...)
"DistributedProcedure"->>"IcebergProcedureContext": collect scanned files, commit new files
"IcebergAbstractMetadata"->>"IcebergTable": commitTransaction()
"IcebergAbstractMetadata"->>"IcebergProcedureContext": destroy()
"IcebergAbstractMetadata"-->>"Coordinator (Presto Engine)": procedure finished
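The finish/commit step at the end of this sequence can be illustrated with a small, self-contained sketch of its semantics. Plain Java sets stand in for Iceberg's RewriteFiles transaction here; all names are illustrative, not the actual connector API:

```java
import java.util.HashSet;
import java.util.Set;

// Models the atomic swap a rewrite commit performs: the rewritten files
// are removed and the newly written files added in one step, so readers
// never observe a partial state.
class RewriteCommitSketch {
    private final Set<String> dataFiles = new HashSet<>();

    void addFile(String path) {
        dataFiles.add(path);
    }

    void commitRewrite(Set<String> rewrittenFiles, Set<String> newFiles) {
        // Guard against committing against stale scan results.
        if (!dataFiles.containsAll(rewrittenFiles)) {
            throw new IllegalStateException("rewritten files are no longer current");
        }
        dataFiles.removeAll(rewrittenFiles);
        dataFiles.addAll(newFiles);
    }

    Set<String> currentFiles() {
        return new HashSet<>(dataFiles);
    }
}
```

In the real connector this exchange is expressed through the Iceberg table's RewriteFiles operation inside the transaction committed by finishCallDistributedProcedure.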
ER diagram for the new IcebergDistributedProcedureHandle data type:

erDiagram
ICEBERG_DISTRIBUTED_PROCEDURE_HANDLE {
String schemaName
IcebergTableName tableName
PrestoIcebergSchema schema
PrestoIcebergPartitionSpec partitionSpec
IcebergColumnHandle inputColumns
String outputPath
FileFormat fileFormat
HiveCompressionCodec compressionCodec
Map storageProperties
}
ICEBERG_DISTRIBUTED_PROCEDURE_HANDLE ||--o| ICEBERG_TABLE_NAME : "tableName"
ICEBERG_DISTRIBUTED_PROCEDURE_HANDLE ||--o| PRESTO_ICEBERG_SCHEMA : "schema"
ICEBERG_DISTRIBUTED_PROCEDURE_HANDLE ||--o| PRESTO_ICEBERG_PARTITION_SPEC : "partitionSpec"
ICEBERG_DISTRIBUTED_PROCEDURE_HANDLE ||--|{ ICEBERG_COLUMN_HANDLE : "inputColumns"
ICEBERG_DISTRIBUTED_PROCEDURE_HANDLE ||--o| FILE_FORMAT : "fileFormat"
ICEBERG_DISTRIBUTED_PROCEDURE_HANDLE ||--o| HIVE_COMPRESSION_CODEC : "compressionCodec"
Class diagram for new and updated Iceberg distributed procedure types:

classDiagram
class IcebergProcedureContext {
+Set<DataFile> scannedDataFiles
+Set<DeleteFile> fullyAppliedDeleteFiles
+Map<String, Object> relevantData
+Optional<Table> table
+Transaction transaction
+Optional<ConnectorSplitSource> connectorSplitSource
+setTable(Table table)
+setTransaction(Transaction transaction)
+getTable()
+getTransaction()
+setConnectorSplitSource(ConnectorSplitSource splitSource)
+getConnectorSplitSource()
+getScannedDataFiles()
+getFullyAppliedDeleteFiles()
+getRelevantData()
+destroy()
}
class IcebergDistributedProcedureHandle {
+String schemaName
+IcebergTableName tableName
+PrestoIcebergSchema schema
+PrestoIcebergPartitionSpec partitionSpec
+List<IcebergColumnHandle> inputColumns
+String outputPath
+FileFormat fileFormat
+HiveCompressionCodec compressionCodec
+Map<String, String> storageProperties
+IcebergDistributedProcedureHandle(...)
}
class IcebergWritableTableHandle {
}
IcebergDistributedProcedureHandle --|> IcebergWritableTableHandle
IcebergDistributedProcedureHandle ..|> ConnectorDistributedProcedureHandle
class CallDistributedProcedureSplitSource {
-CloseableIterator<FileScanTask> fileScanTaskIterator
-Optional<Consumer<FileScanTask>> fileScanTaskConsumer
-TableScan tableScan
-Closer closer
-double minimumAssignedSplitWeight
-ConnectorSession session
+getNextBatch(...)
+isFinished()
+close()
-toIcebergSplit(FileScanTask task)
}
class RewriteDataFilesProcedure {
+TypeManager typeManager
+JsonCodec<CommitTaskData> commitTaskCodec
+RewriteDataFilesProcedure(...)
+get()
-beginCallDistributedProcedure(...)
-finishCallDistributedProcedure(...)
}
IcebergProcedureContext ..|> ConnectorProcedureContext
CallDistributedProcedureSplitSource ..|> ConnectorSplitSource
RewriteDataFilesProcedure ..|> Provider
RewriteDataFilesProcedure ..|> DistributedProcedure
Hey there - I've reviewed your changes - here's some feedback:
- This PR is very large and touches many layers (protocol, metadata, split manager, planner, C++ bindings, and extensive tests); consider splitting into smaller, focused PRs to simplify review and isolate potential regressions.
- There’s a lot of duplicated test setup/assertion code in TestRewriteDataFilesProcedure (and related tests); factor out common helpers for table creation, file‐count assertions, and cleanup to reduce maintenance overhead.
- In IcebergProcedureContext.destroy, you clear splits and file sets but don’t reset the 'table' or 'transaction' fields—consider clearing those as well to fully release resources after procedure completion.
## Individual Comments
### Comment 1
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergProcedureContext.java:92-93` </location>
<code_context>
+ this.relevantData.clear();
+ this.scannedDataFiles.clear();
+ this.fullyAppliedDeleteFiles.clear();
+ this.connectorSplitSource.ifPresent(ConnectorSplitSource::close);
+ this.connectorSplitSource = null;
+ }
+}
</code_context>
<issue_to_address>
**issue (bug_risk):** Setting connectorSplitSource to null may lead to NullPointerExceptions.
Assigning null to connectorSplitSource, which is an Optional, breaks expected usage and may cause runtime errors. Use Optional.empty() instead to prevent NullPointerExceptions.
</issue_to_address>
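A minimal, self-contained sketch of the suggested fix, using plain java.util.Optional. Runnable stands in for the split source's close behavior; these names are illustrative, not the actual Presto SPI types:

```java
import java.util.Optional;

class ProcedureContextSketch {
    // Stand-in for IcebergProcedureContext's split-source field;
    // Runnable::run stands in for ConnectorSplitSource::close.
    private Optional<Runnable> connectorSplitSource = Optional.empty();

    void setConnectorSplitSource(Runnable source) {
        connectorSplitSource = Optional.of(source);
    }

    boolean hasSplitSource() {
        return connectorSplitSource.isPresent();
    }

    // Close the source if present, then reset to Optional.empty(),
    // never to null, so later isPresent()/ifPresent() calls cannot NPE.
    void destroy() {
        connectorSplitSource.ifPresent(Runnable::run);
        connectorSplitSource = Optional.empty();
    }
}
```

With this shape, destroy() is also idempotent: calling it twice is harmless, whereas a null assignment would make the second call throw.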
### Comment 2
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/RewriteDataFilesProcedure.java:167-170` </location>
<code_context>
+ .map(slice -> commitTaskCodec.fromJson(slice.getBytes()))
+ .collect(toImmutableList());
+
+ org.apache.iceberg.types.Type[] partitionColumnTypes = icebergTable.spec().fields().stream()
+ .map(field -> field.transform().getResultType(
+ icebergTable.schema().findType(field.sourceId())))
+ .toArray(Type[]::new);
+
+ Set<DataFile> newFiles = new HashSet<>();
</code_context>
<issue_to_address>
**issue (bug_risk):** Potential mismatch between partition spec fields and schema types.
If findType returns null for a missing sourceId, this could lead to runtime errors. Please add validation or error handling for cases where the type is not found.
</issue_to_address>
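One way to handle the missing-type case is to fail fast with a descriptive error before building the partition-type array. In this sketch a plain Map stands in for Schema.findType(sourceId); the names are illustrative:

```java
import java.util.Map;

class PartitionTypeSketch {
    // Stand-in for the table schema: maps a field's source id to a type name.
    private final Map<Integer, String> schemaTypes;

    PartitionTypeSketch(Map<Integer, String> schemaTypes) {
        this.schemaTypes = schemaTypes;
    }

    // Throw a descriptive error when a partition field's source id has no
    // corresponding schema type, instead of letting a null propagate into
    // the partition-type array and fail later with an opaque NPE.
    String requireType(int sourceId) {
        String type = schemaTypes.get(sourceId);
        if (type == null) {
            throw new IllegalStateException(
                    "No schema type found for partition field source id " + sourceId);
        }
        return type;
    }
}
```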
### Comment 3
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergAbstractMetadata.java:1074-1083` </location>
<code_context>
+ throw new PrestoException(NOT_SUPPORTED, "This connector do not allow table execute at specified snapshot");
+ }
+
+ transaction = icebergTable.newTransaction();
+ BaseProcedure<?> procedure = procedureRegistry.resolve(
+ new ConnectorId(procedureName.getCatalogName()),
+ new SchemaTableName(
+ procedureName.getSchemaName(),
+ procedureName.getObjectName()));
+ verify(procedure instanceof DistributedProcedure, "procedure must be DistributedProcedure");
+ procedureContext = Optional.of((IcebergProcedureContext) ((DistributedProcedure) procedure).createContext());
+ procedureContext.get().setTable(icebergTable);
+ procedureContext.get().setTransaction(transaction);
+ return ((DistributedProcedure) procedure).begin(session, procedureContext.get(), tableLayoutHandle, arguments);
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Transaction is assigned to a field but not cleared after procedure completion.
Since the transaction field remains set after finishCallDistributedProcedure, running multiple procedures may result in stale or incorrect state. Please ensure the transaction field is cleared or properly scoped after each procedure completes.
</issue_to_address>
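The per-procedure scoping concern can be sketched as follows, with Strings standing in for the transaction and context objects (illustrative names, not the actual metadata fields). The key point is resetting the state in a finally block so even a failed commit cannot leak into the next invocation:

```java
import java.util.Optional;

class MetadataStateSketch {
    // Stand-ins for the metadata's per-procedure fields.
    private Optional<String> transaction = Optional.empty();
    private Optional<String> procedureContext = Optional.empty();

    void beginProcedure(String name) {
        if (transaction.isPresent()) {
            throw new IllegalStateException("a procedure is already in progress");
        }
        transaction = Optional.of("txn:" + name);
        procedureContext = Optional.of("ctx:" + name);
    }

    // Reset both fields in finally, so a failure during commit does not
    // leave stale state behind for the next procedure invocation.
    void finishProcedure(Runnable commit) {
        try {
            commit.run();
        }
        finally {
            transaction = Optional.empty();
            procedureContext = Optional.empty();
        }
    }

    boolean inProgress() {
        return transaction.isPresent();
    }
}
```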
### Comment 4
<location> `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergAbstractMetadata.java:1099` </location>
<code_context>
+ verify(procedureContext.isPresent(), "procedure context must be present");
+ ((DistributedProcedure) procedure).finish(procedureContext.get(), procedureHandle, fragments);
+ transaction.commitTransaction();
+ procedureContext.get().destroy();
+ }
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Destroying procedureContext does not reset the Optional field.
Reset procedureContext to Optional.empty() after destroy to prevent unexpected behavior if accessed post-completion.
</issue_to_address>
### Comment 5
<location> `presto-docs/src/main/sphinx/connector/iceberg.rst:1239-1242` </location>
<code_context>
+Rewrite Data Files
+^^^^^^^^^^^^^^^^^^
+
+Iceberg tracks all data files under different partition specs in a table. More data files requires
+more metadata to be stored in manifest files, and small data files can cause unnecessary amount metadata and
+less efficient queries from file open costs. Also, data files under different partition specs can
+prevent metadata level deletion or thorough predicate push down for Presto.
</code_context>
<issue_to_address>
**issue (typo):** Correct verb agreement and missing word in sentence.
The correct sentence is: 'More data files require more metadata to be stored in manifest files, and small data files can cause an unnecessary amount of metadata and less efficient queries due to file open costs.'
```suggestion
Iceberg tracks all data files under different partition specs in a table. More data files require
more metadata to be stored in manifest files, and small data files can cause an unnecessary amount of metadata and
less efficient queries due to file open costs. Also, data files under different partition specs can
prevent metadata level deletion or thorough predicate push down for Presto.
```
</issue_to_address>
### Comment 6
<location> `presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.cpp:745` </location>
<code_context>
} // namespace facebook::presto::protocol::iceberg
namespace facebook::presto::protocol::iceberg {
+IcebergDistributedProcedureHandle::
+ IcebergDistributedProcedureHandle() noexcept {
+ _type = "hive-iceberg";
+}
</code_context>
<issue_to_address>
**issue (review_instructions):** Member variable '_type' uses leading underscore, but should use camelCase_ for private/protected members.
The member variable '_type' does not follow the required camelCase_ convention for private/protected members. Please rename it to 'type_' to comply with the coding standard.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `presto-native-execution/**/*.hpp,presto-native-execution/**/*.hpp,presto-native-execution/**/*.cpp`
**Instructions:**
Use camelCase_ for private and protected members variables.
</details>
</issue_to_address>
### Comment 7
<location> `presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.cpp:746` </location>
<code_context>
namespace facebook::presto::protocol::iceberg {
+IcebergDistributedProcedureHandle::
+ IcebergDistributedProcedureHandle() noexcept {
+ _type = "hive-iceberg";
+}
+
</code_context>
<issue_to_address>
**issue (review_instructions):** Member variable '_type' should use camelCase_ (e.g., 'type_') for private/protected members.
Please update '_type' to 'type_' to match the required naming convention for private/protected member variables.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `presto-native-execution/**/*.hpp,presto-native-execution/**/*.hpp,presto-native-execution/**/*.cpp`
**Instructions:**
Use camelCase_ for private and protected members variables.
</details>
</issue_to_address>
### Comment 8
<location> `presto-native-execution/presto_cpp/presto_protocol/connector/iceberg/presto_protocol_iceberg.cpp:818` </location>
<code_context>
+}
+
+void from_json(const json& j, IcebergDistributedProcedureHandle& p) {
+ p._type = j["@type"];
+ from_json_key(
+ j,
</code_context>
<issue_to_address>
**issue (review_instructions):** Member variable '_type' should use camelCase_ (e.g., 'type_') for private/protected members.
Please update '_type' to 'type_' to match the required naming convention for private/protected member variables.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `presto-native-execution/**/*.hpp,presto-native-execution/**/*.hpp,presto-native-execution/**/*.cpp`
**Instructions:**
Use camelCase_ for private and protected members variables.
</details>
</issue_to_address>
steveburnett left a comment:

Thanks for the doc! Just a nit of formatting.
@steveburnett thanks for the review, fixed! Please take a look when you have a minute.
steveburnett left a comment:

LGTM! (docs)
Pull updated branch, new local doc build. Looks good, thanks!
    private final double minimumAssignedSplitWeight;
    private final ConnectorSession session;

    public CallDistributedProcedureSplitSource(
I think the most straightforward thing to do is to simply have a new method in ConnectorSplitSource that explicitly takes in ConnectorProcedureContext and ConnectorDistributedProcedureHandle. That way we can unify the implementation here and just use the existing IcebergSplitSource, and I also think it would be a lot more straightforward to understand compared to overloading the Iceberg Metadata object to also include this custom split source.
That's a great idea and sounds very reasonable. I've made the change following your suggestion, and removed CallDistributedProcedureSplitSource completely. Please take a look when you get a chance, thanks a lot!
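The unification agreed on above can be sketched roughly as follows, using simplified stand-in types (the real Presto SPI interfaces and Iceberg split planning carry far more information; every name here is an assumption for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the SPI types involved.
interface ProcedureContext {}
interface DistributedProcedureHandle {}

// One split source serves both regular table scans and distributed
// procedures: the procedure variant is just another factory method,
// so no separate CallDistributedProcedureSplitSource class is needed.
class UnifiedSplitSource {
    private final List<String> splits;
    private int cursor;

    private UnifiedSplitSource(List<String> splits) {
        this.splits = splits;
    }

    static UnifiedSplitSource forTableScan(List<String> splits) {
        return new UnifiedSplitSource(splits);
    }

    static UnifiedSplitSource forProcedure(
            ProcedureContext context, DistributedProcedureHandle handle, List<String> splits) {
        // In real code, context/handle would drive file-scan planning and
        // let the procedure record which files each split covers.
        return new UnifiedSplitSource(splits);
    }

    List<String> getNextBatch(int maxSize) {
        int end = Math.min(cursor + maxSize, splits.size());
        List<String> batch = new ArrayList<>(splits.subList(cursor, end));
        cursor = end;
        return batch;
    }

    boolean isFinished() {
        return cursor >= splits.size();
    }
}
```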
@tdcmeehan, it's a really great point for discussion. Thanks for bringing it up.

As far as I can see, there is an implicit constraint in row-level DELETE and UPDATE behaviors: they do not alter the existing data file structure. Following this, in our current DELETE implementation framework, a page sink writes delete files only for a specific, targeted data file (and UPDATE works similarly). This differs fundamentally from rewrite-style operations.

Given that difference, reusing the existing DELETE/UPDATE framework to track which data files and delete files have been rewritten is a little challenging; it would likely require fairly big changes.

Regarding your concern about the exposure of the procedure context, I agree that it is very reasonable. Does this approach sound reasonable to you? Any thoughts or alternatives would be greatly appreciated!
@hantangwangd my point regarding DELETE is not that the operation itself is similar. Of course, I agree, logically it's more like a write.

Can you elaborate on how the dedicated file planning step would work? How would it know what files were rewritten?
Hi @tdcmeehan, I've just appended a new commit to the PR. In the latest change, I avoided exposing the Iceberg procedure context to external components and moved the logic of identifying which data/delete files to rewrite.

Please take a look when you get a chance and let me know what you think. Very glad to discuss and open to any suggestions. Thanks a lot!
    private ConnectorDistributedProcedureHandle beginCallDistributedProcedure(ConnectorSession session, IcebergProcedureContext procedureContext, IcebergTableLayoutHandle layoutHandle, Object[] arguments)
    {
        try (ThreadContextClassLoader ignored = new ThreadContextClassLoader(getClass().getClassLoader())) {
            Table icebergTable = procedureContext.getTable().orElseThrow(() -> new VerifyException("Target table does not exist"));
Can't this be done during finishCallDistributedProcedure, since the Table stored here itself is versioned by a snapshot ID?
I think this is a great idea. Moving the file planning to after the write operation succeeds avoids unnecessary planning work in cases where the actual write might fail. We can simply pass the necessary information for file planning (e.g., the Iceberg table layout) to finishCallDistributedProcedure via the procedure context. Another benefit is that this approach eliminates the need to maintain scannedDataFiles and fullyAppliedDeleteFiles within the IcebergProcedureContext, which simplifies it significantly. Done!
                procedureName.getSchemaName(),
                procedureName.getObjectName()));
        verify(procedure instanceof DistributedProcedure, "procedure must be DistributedProcedure");
        procedureContext = Optional.of((IcebergProcedureContext) ((DistributedProcedure) procedure).createContext());
Could we now make all of these parameters immutable and store them in the constructor?
Sure, I've refactored it.
Thanks for looking into that. I agree that this re-scanning approach seems reasonable.

@tdcmeehan thank you so much for the review and feedback. I've addressed your comments. Please take a look when you get a chance.
tdcmeehan left a comment:
Thanks @hantangwangd. This looks really great. I just have one more question about the context object.
…ding a customized split source
And fix the error message thrown when target table does not exist
And transmit the necessary data between begin/finish through the procedure handle
@hantangwangd Sorry for the late response and thanks for the code. Thank you very much.
Hi @PingLiuPing. For data rewrite type distributed procedures, in Prestissimo we need to plan the
After this, yes, Prestissimo executes the same behavior as it does for an
Description

This PR is the second part of many PRs to support distributed procedures in Presto. It is a split of the original entire PR, which is located here: #22659.

The work in this PR includes the following parts:

- Refactor the Iceberg connector to support call distributed procedure. Introduce Iceberg's procedure context and expand IcebergSplitManager to support a split source planned by IcebergAbstractMetadata.beginCallDistributedProcedure(...). This split source is set on the procedure context, and the procedure context also holds all the files to be rewritten.
- Support the Iceberg rewrite_data_files procedure. It builds a customized split source and sets it on the procedure context so that it can be used in IcebergSplitManager, and registers a file scan task consumer to collect and hold all the scanned files in the procedure context. Finally, in the commit stage, it gathers all the data files and delete files that have been rewritten, and all the newly generated files, and changes and commits their metadata through the Iceberg table's RewriteFiles transaction.

Motivation and Context
prestodb/rfcs#12
Impact
N/A
Test Plan
rewrite_data_files

Contributor checklist
Release Notes