feat: Expand procedure architecture for distributed execution, and support iceberg procedure rewrite_data_files#22659
steveburnett
left a comment
Thanks for the draft doc! Some nits about punctuation, formatting, and some suggested rephrasing for readability and conciseness, but the content looks good.
@steveburnett Thanks a lot for your suggestions, they are all fixed. Please take a look when convenient!
steveburnett
left a comment
LGTM! (docs)
Pull updated branch, new local doc build, looks good. Thanks!
@tdcmeehan Thanks for the review. Sure, I'll add the relevant documentation as soon as possible.
tdcmeehan
left a comment
Very good work. Well done! I've left some feedback, but it's mostly minor.
I would split this PR into at least 3 parts:
- All of the code in core Presto to support distributed procedures
- The C++ counterpart for this code
- The Iceberg integration
private final BeginCallDistributedProcedure beginCallDistributedProcedure;
private final FinishCallDistributedProcedure finishCallDistributedProcedure;

protected DistributedProcedure(DistributedProcedureType type, String schema, String name, List<Argument> arguments, BeginCallDistributedProcedure beginCallDistributedProcedure, FinishCallDistributedProcedure finishCallDistributedProcedure)
I think it would be easier to read and more idiomatic to make DistributedProcedure abstract, and make beginCallDistributedProcedure and finishCallDistributedProcedure abstract methods.
Great idea, I completely agree! Initially, I didn't declare DistributedProcedure as abstract because Procedure itself wasn't declared as abstract. Now, I've made both Procedure and DistributedProcedure abstract, and introduced a LocalProcedure to represent the original coordinator-only procedures. This makes the overall procedure architecture much clearer and easier to understand.
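The hierarchy described in this reply can be pictured with a minimal sketch. Everything below is a simplified, hypothetical stand-in: the class names `Procedure`, `LocalProcedure`, and `DistributedProcedure` come from the thread, but the constructors, the `begin`/`finish` signatures, the `isDistributed` helper, and the `RewriteDataFilesSketch` subclass are assumptions made for illustration and are not the SPI's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins; the real SPI classes carry argument
// lists and other state that is omitted here.
final class ProcedureHierarchySketch {
    private ProcedureHierarchySketch() {}

    /** Shared identity; subclasses decide how the procedure executes. */
    abstract static class Procedure {
        private final String schema;
        private final String name;

        Procedure(String schema, String name) {
            this.schema = schema;
            this.name = name;
        }

        String qualifiedName() {
            return schema + "." + name;
        }

        abstract boolean isDistributed();
    }

    /** The former coordinator-only kind: one action, run in place. */
    static final class LocalProcedure extends Procedure {
        private final Runnable action;

        LocalProcedure(String schema, String name, Runnable action) {
            super(schema, name);
            this.action = action;
        }

        @Override
        boolean isDistributed() {
            return false;
        }

        void run() {
            action.run();
        }
    }

    /** begin/finish are abstract methods instead of constructor-injected callbacks. */
    abstract static class DistributedProcedure extends Procedure {
        DistributedProcedure(String schema, String name) {
            super(schema, name);
        }

        @Override
        final boolean isDistributed() {
            return true;
        }

        // Assumed signature: returns an opaque handle for the started call.
        abstract String begin(String tableName);

        // Assumed signature: receives the handle and the collected fragments.
        abstract void finish(String procedureHandle, List<String> fragments);
    }

    /** Illustrative subclass; it only records what it was asked to commit. */
    static final class RewriteDataFilesSketch extends DistributedProcedure {
        final List<String> committed = new ArrayList<>();

        RewriteDataFilesSketch() {
            super("system", "rewrite_data_files");
        }

        @Override
        String begin(String tableName) {
            return "handle:" + tableName;
        }

        @Override
        void finish(String procedureHandle, List<String> fragments) {
            committed.addAll(fragments);
        }
    }

    public static void main(String[] args) {
        RewriteDataFilesSketch procedure = new RewriteDataFilesSketch();
        String handle = procedure.begin("orders");
        procedure.finish(handle, List.of("fragment-1", "fragment-2"));
        System.out.println(procedure.qualifiedName() + " committed " + procedure.committed);
    }
}
```

With abstract methods, a concrete distributed procedure cannot be constructed without supplying both halves of the call, which is the readability gain the review comment was after.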
presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/StatementAnalyzer.java
Outdated
presto-main-base/src/main/java/com/facebook/presto/execution/CallTask.java
Outdated
presto-main-base/src/main/java/com/facebook/presto/sql/planner/optimizations/SymbolMapper.java
Outdated
  new DynamicFiltersChecker(),
- new WarnOnScanWithoutPartitionPredicate(featuresConfig));
+ new WarnOnScanWithoutPartitionPredicate(featuresConfig),
+ new CallDistributedProcedureValidator());
@hantangwangd it would be nice to have plan tests, like TestHashGenerationOptimizer, that show the type of plan that gets generated by a distributed procedure.
Thanks for the suggestion. Since the CALL DISTRIBUTED PROCEDURE statement requires a valid distributed procedure to invoke, and currently only one has been implemented (in the Iceberg connector), I've added the test case to TestIcebergLogicalPlanner. Please take a look when you have time, thanks a lot.
presto-main-base/src/main/java/com/facebook/presto/testing/TestProcedureRegistry.java
presto-spi/src/main/java/com/facebook/presto/spi/procedure/IProcedureRegistry.java
Outdated
return source;
}

@JsonIgnore
Is this intentionally ignored?
Yes, this is intentionally ignored. Subclasses of WriteTarget are only used during planning -- they will not be serialized.
presto-spi/src/main/java/com/facebook/presto/spi/procedure/DistributedProcedure.java
Outdated
@tdcmeehan thanks for your review and feedback. I've addressed all your comments except the one about adding documentation. Please take a look when you have time.
Are you suggesting that I split this into three separate PRs, or should I squash it into three commits within a single PR?
@hantangwangd since we now squash commits on merge, let's make 3 separate PRs.
@tdcmeehan Sure, I'll do it.
…ailable in analyzer
Use a subclass `TableDataRewriteDistributedProcedure` for table rewrite tasks, for example, merge small data files, sort table data, repartition table data etc.
Accordingly rename previous ProcedureRegistry to BuiltInProcedureRegistry
… abstract classes, and introduce a new class `LocalProcedure` to represent the former coordinator-only procedures
…#26373)

## Description
This PR is the first of several PRs that add distributed procedure support to Presto. It is split from the original PR, #22659. The work in this PR includes the following parts:

1. Refactor the `ProcedureRegistry`/`Procedure` data structures to support creating and registering `DistributedProcedure`. Make `ProcedureRegistry` available in the `presto-analyzer` module and in connectors, so that distributed procedures can be recognized in call statements during the prepare and analyze stages.
2. Handle call statements on distributed procedures in the preparer stage. In this stage, we determine the type of the procedure in the call statement, and define a new query type `CALL_DISTRIBUTED_PROCEDURE` for `call distributed procedure` in `BuiltInPreparedQuery`. This way, a `call distributed procedure` statement is handled by `SqlQueryExecutionFactory`, then created and handled as a `SqlQueryExecution`.
3. Analyze and plan the `call distributed procedure` statement based on the subtype of the distributed procedure. For the subtype `TableDataRewriteDistributedProcedure`, ultimately generate a logical plan as follows:
```
OutputNode <- TableFinishNode <- CallDistributedProcedureNode <- FilterNode <- TableScanNode
```
4. Handle optimization, segmentation, grouped tagging, and local planning for the logical plan generated above. The handling logic for `CallDistributedProcedureNode` is similar to that of `TableWriterNode`. In addition, a new optimizer, `RewriteWriterTarget`, is placed after all other optimization rules. Once the entire optimization is complete, it updates the `TableHandle` held in `TableFinishNode` and `CallDistributedProcedureNode` based on the underlying `TableScanNode`, to account for possible filter pushdown.

## Motivation and Context
prestodb/rfcs#12

## Impact
N/A

## Test Plan
- Add test cases for each phase involved in the procedure architecture expansion, including creating and registering distributed procedures, and preparing, analyzing, logical planning, and optimizing for call distributed procedure

## Contributor checklist
- [x] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [x] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes
```
== RELEASE NOTES ==
General Changes
* Upgrade the procedure architecture to support distributed executing procedures
```
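The plan shape and the role of `RewriteWriterTarget` described in this commit message can be illustrated with a small sketch. It assumes a linear plan made of a single hypothetical `Node` class with a string standing in for `TableHandle`; the node names come from the commit message, while `buildPlan`, `shape`, and `rewriteWriterTarget` are invented helpers and do not reflect the real planner classes or optimizer interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a single-source plan chain; not Presto's PlanNode API.
final class RewriteWriterTargetSketch {
    private RewriteWriterTargetSketch() {}

    static final class Node {
        final String kind;
        final Node source;   // null for the leaf
        String tableHandle;  // a string stands in for TableHandle; null where unused

        Node(String kind, Node source, String tableHandle) {
            this.kind = kind;
            this.source = source;
            this.tableHandle = tableHandle;
        }
    }

    /** Builds OutputNode <- TableFinishNode <- CallDistributedProcedureNode <- FilterNode <- TableScanNode. */
    static Node buildPlan(String plannedHandle, String scanHandleAfterOptimization) {
        Node scan = new Node("TableScanNode", null, scanHandleAfterOptimization);
        Node filter = new Node("FilterNode", scan, null);
        Node call = new Node("CallDistributedProcedureNode", filter, plannedHandle);
        Node finish = new Node("TableFinishNode", call, plannedHandle);
        return new Node("OutputNode", finish, null);
    }

    /** Lists node kinds from the root down to the leaf. */
    static List<String> shape(Node root) {
        List<String> kinds = new ArrayList<>();
        for (Node node = root; node != null; node = node.source) {
            kinds.add(node.kind);
        }
        return kinds;
    }

    /** Copies the scan's final handle onto the finish and call nodes. */
    static void rewriteWriterTarget(Node root) {
        Node leaf = root;
        while (leaf.source != null) {
            leaf = leaf.source;
        }
        if (!leaf.kind.equals("TableScanNode")) {
            throw new IllegalStateException("expected a table scan at the leaf, found " + leaf.kind);
        }
        for (Node node = root; node != null; node = node.source) {
            if (node.kind.equals("TableFinishNode") || node.kind.equals("CallDistributedProcedureNode")) {
                node.tableHandle = leaf.tableHandle;
            }
        }
    }

    public static void main(String[] args) {
        Node root = buildPlan("orders", "orders[filter pushed down]");
        rewriteWriterTarget(root);
        System.out.println(shape(root));
        System.out.println(root.source.tableHandle);
    }
}
```

Running the rewrite last matters in this model for the same reason the commit message gives: the scan's handle may change while filters are pushed down, so copying it any earlier could leave the finish and call nodes pointing at a stale handle.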
## Description
This PR is the second of several PRs that add distributed procedure support to Presto. It is split from the original PR, #22659. The work in this PR includes the following parts:

1. Refactor the Iceberg connector to support `call distributed procedure`. Introduce Iceberg's procedure context and expand `IcebergSplitManager` to support the split source planned by `IcebergAbstractMetadata.beginCallDistributedProcedure(...)`. This split source is set on the procedure context, which also holds all the files to be rewritten.
2. Support the Iceberg `rewrite_data_files` procedure. It builds a customized split source and sets it on the procedure context so that `IcebergSplitManager` can use it. It also registers a file scan task consumer to collect and hold all scanned files in the procedure context. Finally, in the commit stage, it gets all the data files and delete files that have been rewritten and all the files that have been newly generated, then changes and commits their metadata through the Iceberg table's `RewriteFiles` transaction.

## Motivation and Context
prestodb/rfcs#12

## Impact
N/A

## Test Plan
- Add test cases for validating the result and plan tree shape of the Iceberg-specific distributed procedure `rewrite_data_files`

## Contributor checklist
- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes
```
== NO RELEASE NOTE ==
```
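The bookkeeping this commit message describes (a consumer that records scanned files in a procedure context, followed by a commit that swaps old files for new ones) can be sketched as follows. This is a hypothetical model using plain strings for file paths: `ScanTask`, `Context`, `RewriteCommit`, and `planCommit` are invented names, and the sketch deliberately stops short of calling Iceberg's `RewriteFiles`, only computing the two file sets such a transaction would be given.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical model of the procedure context; file paths are plain strings.
final class ProcedureContextSketch {
    private ProcedureContextSketch() {}

    /** One scanned data file together with the delete files that apply to it. */
    static final class ScanTask {
        final String dataFile;
        final List<String> deleteFiles;

        ScanTask(String dataFile, List<String> deleteFiles) {
            this.dataFile = dataFile;
            this.deleteFiles = deleteFiles;
        }
    }

    /** Holds every file seen while splits are being produced. */
    static final class Context {
        final Set<String> scannedDataFiles = new LinkedHashSet<>();
        final Set<String> scannedDeleteFiles = new LinkedHashSet<>();

        /** The consumer that split generation would invoke once per scan task. */
        Consumer<ScanTask> scanTaskConsumer() {
            return task -> {
                scannedDataFiles.add(task.dataFile);
                scannedDeleteFiles.addAll(task.deleteFiles);
            };
        }
    }

    /** The two file sets a rewrite commit needs: what to drop and what to add. */
    static final class RewriteCommit {
        final Set<String> removedFiles;
        final Set<String> addedFiles;

        RewriteCommit(Set<String> removedFiles, Set<String> addedFiles) {
            this.removedFiles = removedFiles;
            this.addedFiles = addedFiles;
        }
    }

    /** Combines scanned data and delete files into the removal set. */
    static RewriteCommit planCommit(Context context, List<String> newlyWrittenFiles) {
        Set<String> removed = new LinkedHashSet<>(context.scannedDataFiles);
        removed.addAll(context.scannedDeleteFiles);
        return new RewriteCommit(removed, new LinkedHashSet<>(newlyWrittenFiles));
    }

    public static void main(String[] args) {
        Context context = new Context();
        Consumer<ScanTask> consumer = context.scanTaskConsumer();
        consumer.accept(new ScanTask("data-1.parquet", List.of("delete-1.parquet")));
        consumer.accept(new ScanTask("data-2.parquet", List.of("delete-1.parquet")));
        RewriteCommit commit = planCommit(context, List.of("merged-1.parquet"));
        System.out.println("remove " + commit.removedFiles + ", add " + commit.addedFiles);
    }
}
```

Using sets in the context means a delete file shared by several data files is recorded once, so the removal set handed to the commit contains no duplicates.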
Closing this PR in favor of #26788.
Description
This PR expands the current procedure architecture in Presto to support defining, registering, and calling procedures that need to be executed in a distributed way. It then adds distributed procedure support to the Iceberg connector and implements a specific procedure, `rewrite_data_files`, for it.
Referring to: prestodb/rfcs#12
The whole PR is separated into 6 parts:
1. Refactor the `ProcedureRegistry`/`Procedure` data structures to support creating and registering `DistributedProcedure`. Make `ProcedureRegistry` available in the presto-analyzer module, so that distributed procedures can be recognized in call statements during the prepare and analyze stages.
2. Handle call statements on distributed procedures in the preparer stage. In this stage, we determine the type of the procedure in the call statement, and define a new query type `CALL_DISTRIBUTED_PROCEDURE` for `call distributed procedure` in `BuiltInPreparedQuery`. This way, a `call distributed procedure` statement is handled by `SqlQueryExecutionFactory`, then created and handled as a `SqlQueryExecution`.
3. Analyze and plan the `call distributed procedure` statement based on the subtype of the distributed procedure. For the subtype `TableDataRewriteDistributedProcedure`, ultimately generate a logical plan as follows: `OutputNode <- TableFinishNode <- CallDistributedProcedureNode <- FilterNode <- TableScanNode`
4. Handle optimization, segmentation, grouped tagging, and local planning for the logical plan generated above. The handling logic for `CallDistributedProcedureNode` is similar to that of `TableWriterNode`. In addition, a new optimizer, `RewriteWriterTarget`, is placed after all other optimization rules. Once the entire optimization is complete, it updates the `TableHandle` held in `TableFinishNode` and `CallDistributedProcedureNode` based on the underlying `TableScanNode`, to account for possible filter pushdown.
5. Refactor the Iceberg connector to support `call distributed procedure`. Introduce Iceberg's procedure context and expand `IcebergSplitManager` to support the split source planned by `IcebergAbstractMetadata.beginCallDistributedProcedure(...)`. This split source is set on the procedure context, which also holds all the files to be rewritten.
6. Support the Iceberg `rewrite_data_files` procedure. It builds a customized split source and sets it on the procedure context so that `IcebergSplitManager` can use it. It also registers a file scan task consumer to collect and hold all scanned files in the procedure context. Finally, in the commit stage, it gets all the data files and delete files that have been rewritten and all the files that have been newly generated, then changes and commits their metadata through the Iceberg table's `RewriteFiles` transaction.
Motivation and Context
N/A
Impact
N/A
Test Plan
`rewrite_data_files`
Contributor checklist
Release Notes