Add connector specific partitioning support for remote exchanges #12373
arhimondr merged 7 commits into prestodb:master from
Conversation
wenleix
left a comment
"Tiny ExchangeNode refactorings" and "Rename query.initial-hash-partitions property" look good.
wenleix
left a comment
"Add max_tasks_per_stage session property"
Minor comment. Maybe explain the motivation in the commit message? (i.e., to control the number of tasks in a stage)
Curious: where is the name fixed distribution stage from? I guess it's from FixedSourcePartitionedScheduler? :)
I agree that the description is kind of confusing, but I couldn't come up with anything better.
Shall we name it stage.max-tasks-per-stage, given how the other properties are named?
You might need to change other places that use getHashPartitionCount(), especially the usages in AddExchange and RewriteSpatialPartitioningAggregation.
Maybe refactor this min operation as a static helper method in SystemPartitioningHandle?
In fact, do we want to apply min here? Today, we cannot execute when we have more system partitions than tasks.
However, in the future, we might want to materialize the system partitioning result (in contrast to Hive partitioning). At that point it would be legitimate to have 1000 partitions while the max tasks per stage is only 300?
Update: I think it's OK for now to have this behavior for SystemPartitioning.
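The min operation under discussion can be sketched as a tiny helper. This is a hypothetical sketch: PartitionCountSketch and effectivePartitionCount are made-up names, not the actual SystemPartitioningHandle code.

```java
import static java.lang.Math.min;

public class PartitionCountSketch
{
    // Cap the number of system partitions at the stage's task limit: since all
    // system partitions must execute at once, extra partitions beyond the task
    // count could not be scheduled anyway.
    static int effectivePartitionCount(int hashPartitionCount, int maxTasksPerStage)
    {
        return min(hashPartitionCount, maxTasksPerStage);
    }
}
```

With 1000 logical partitions but only 300 tasks allowed, the stage runs 300 partitions today; materialized exchanges could later lift this restriction.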
"Especially the usage in AddExchange"
This is used when creating connector partitioning. The number of hash partitions for connector partitioning shouldn't depend on the number of nodes (we want to have more buckets than we have nodes).
"RewriteSpatialPartitioningAggregation"
This is needed for the partitioned spatial join. We can work on that later.
"Maybe refactor this min operation as a static helper method in SystemPartitioningHandle"
It is used only in one place. Once there are more places, we can refactor it.
You are right, this is about getting the NodePartitioningMap, so we should enforce max-tasks-per-stage. In AddExchange and the other plan optimizers, we should only use hash_partition_count, since it's a logical partitioning. I am assuming today we will fail when hash_partition_count > max-tasks-per-stage.
There was a problem hiding this comment.
It won't fail, as we don't have to specify the number of partitions for the system partitioning upfront. System partitioning is still a special case for now.
wenleix
left a comment
"Use static imports in AddExchanges" and "Remove initial capacity hint when selecting nodes"
Looks good. I think in the previous use case, using an initial capacity hint is not a bad idea ;)
Alternatively, you can make NodeSelector.selectRandomNodes short-circuit to allNodes() when limit > totalNodeCount.
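The suggested short circuit can be sketched as follows. This is a hypothetical sketch, not the actual NodeSelector API: SelectRandomNodesSketch and selectRandom are made-up names, and the sketch uses >= so the boundary case also skips the shuffle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SelectRandomNodesSketch
{
    // If the limit covers every node, return them all without shuffling;
    // otherwise shuffle a copy and take the first 'limit' elements.
    static <T> List<T> selectRandom(List<T> allNodes, int limit)
    {
        if (limit >= allNodes.size()) {
            return allNodes;
        }
        List<T> copy = new ArrayList<>(allNodes);
        Collections.shuffle(copy);
        return copy.subList(0, limit);
    }
}
```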
shixuan-fan
left a comment
First two commits look good.
"Add max_tasks_per_stage session property"
presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java
shixuan-fan
left a comment
Code looks good. Would you mind adding motivation to the last commit message?
79a0e2d to ed91fd4 (force-pushed)
wenleix
left a comment
"Support connector partitioning provider"
Generally looks good to me. Maybe the commit message should be "Allow use connector-specific partitioning for remote exchange"?
And as @shixuan-fan suggests, we should explain the motivation in the commit message :).
There are two TODOs for me as a reviewer:
- I will do a more careful pass over AddExchanges. The refactors look reasonable, but might be worth a deeper look given the complexity of AddExchanges.
- I will contemplate/experiment with how this works together with compatible partitioning. I suggest @shixuan-fan and @arhimondr also think about this ;). For example:
Assume we are performing A JOIN B. A is bucketed into 512 buckets, and B is not bucketed. The session property tells us to use HivePartitioning with 1024 partitions for the remote exchange.
Do we consider the partitioning over A and B (after exchange) to be compatible partitioning? If so, with today's code we will try to re-adjust B's partitioning to 512 partitions -- will that work, given that B's partitioning is derived from a Remote Exchange rather than a Table Scan? We don't have to solve all the problems in this PR, but we should start thinking about these interesting synergies :)
A more fundamental problem is that today compatible partitioning is adapted in the connector's read path and is totally opaque to the engine. We might make the engine aware of it at some point :)
.../src/test/java/com/facebook/presto/hive/TestHiveDistributedAggregationsHivePartitioning.java
presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java
presto-main/src/main/java/com/facebook/presto/execution/QueryManagerConfig.java
...-hive/src/test/java/com/facebook/presto/hive/TestHiveDistributedQueriesHivePartitioning.java
presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/AddExchanges.java
presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java
presto-spi/src/main/java/com/facebook/presto/spi/connector/ConnectorMetadata.java
Also, Travis is failing :). I assume these are trivial fixes, such as some tests having to be disabled :)
That's a very good question. So yeah, now if one table is already bucketed, the code will try to bucket the other table with the same number of buckets. So the
ed91fd4 to 55a4228 (force-pushed)
Ah, I see. At least there are no compatible buckets coming in to complicate the issue :).
Yeah... the problem is we don't want to have 100 configs on this... tough question :)
Agree. We should keep this PR as simple as possible. Just keep in mind there might be interesting behaviors in some cases... Shall we add
I looked into the code a bit: a join consisting of all remote sources will use ... So, for connector-provided partitioning, ... Otherwise it goes into ... This is probably OK for now, but just keep in mind that at some point we might want more consistent behavior between ...
wenleix
left a comment
Some additional minor comments on "Support connector partitioning provider". Otherwise looks good.
presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/AddExchanges.java
This is part of the effort on Support Materialized Exchange (#12387).
Unfortunately, nodes are provided as an Iterator, and Iterator doesn't have a size method. We can try to change the interfaces, but honestly I don't think that expanding the list would be an issue. Most of the time we will expand it to 1000 elements max.
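A minimal sketch of what "expanding the list" means here (IteratorSizeSketch and materialize are made-up names; this is essentially what Guava's ImmutableList.copyOf does for an Iterator):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IteratorSizeSketch
{
    // Since Iterator exposes no size, drain it into a List once; the List then
    // provides size() at an O(n) cost, with n around 1000 elements at most here.
    static <T> List<T> materialize(Iterator<T> nodes)
    {
        List<T> list = new ArrayList<>();
        nodes.forEachRemaining(list::add);
        return list;
    }
}
```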
I added the
I'm not sure. We have a lot of properties that are "experimental" and that have been there for years, so I'm a little bit reluctant to add that prefix. Approaching from the other side: no matter if we prefix it with
It doesn't make sense to have more partitions than nodes for the system partitioning. We are not going to use the system partitioning for materialization or bucket-by-bucket execution for now. And if all the partitions have to be executed at once, there's no point in having more of them than there are nodes in the cluster.
55a4228 to 9003715 (force-pushed)
@wenleix @shixuan-fan Comments addressed. Please have another pass.
Correct for now. But I expect at some point we'll want to support materializing system partitioning as well, and it seems to me it makes sense to unify system partitioning and connector partitioning. I do agree we don't need to worry about it now. But just something to keep in mind :)
nezihyigitbasi
left a comment
"Add max_tasks_per_stage session property"
- The commit message has the config name incorrect.
Is this for non-source stages or for intermediate stages?
@nezihyigitbasi: Thanks for the comment! In our case it's more about whether the stage/fragment's distribution is SOURCE_DISTRIBUTION (
Maybe we can say "Maximum number of tasks for stage, unless its partitioning handle is SOURCE_DISTRIBUTION" to avoid confusion?
I'm not sure an average end user would know what a partitioning handle is.
"Maximum number of tasks for a non-source-distributed stage" - I think it should be pretty intuitive.
The user should understand that it is not possible (at least not in all cases) to limit the number of source-distributed tasks, as the source-distributed tasks might be bound to the data location. So, somehow it makes sense. However, I agree that it is still quite confusing.
nezihyigitbasi
left a comment
Allow use connector-specific partitioning for remote exchange
- The commit message title can be "Add connector specific partitioning support for remote exchanges", with a body like: "If connector-specified partitioning is used for exchanges, remote exchanges can be replaced with a bucketed table write followed by a bucketed table read."
Why not plural (getPartitioningHandleForExchanges)?
Or getExchangePartitioningHandle?
Because we are getting partitioning handles per exchange, as the handle contains the list of types for partitioning columns. So technically there will be one partitioning handle per exchange.
Looks like we need a better description here, as this repeats the property name. Maybe something like "Name of the catalog providing custom partitioning", or something along those lines.
partitioningProviderCatalogName
Maybe just partitioningProviderCatalog to be consistent with the property?
We have multiple partitionedExchange methods with complex argument lists; it would be nice to unify some of them to simplify the API surface here.
@nezihyigitbasi: Does the next commit (renaming some of them specifically to systemPartitionedExchange) help clarify the API?
wenleix
left a comment
"Improve selected nodes initial capacity hint"
Minor comments... I really feel we can just change it to List<Node> :). ResettableRandomizedIterator will copy it into a List anyway.
Wow. I really feel we just want to use a List<Node> (and do ImmutableList.copyOf if the original input is a Set).
Apparently we were trying to optimize randomization: instead of shuffling all nodes every time, only the number of nodes needed is shuffled.
By using ResettableRandomizedIterator, it already defeats the purpose, since it will copy into an ArrayList in the constructor :)
Maybe we should just use your previous solution and allow list resizing. Sorry about the back-and-forth :(.
"By using ResettableRandomizedIterator, it already defeats the purpose since it will do a copy"
We do a copy, but we don't call the random number generator unless needed. Not sure how big a problem that is. But since it was optimized that way, let's keep it as is.
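The optimization being preserved can be sketched as a lazy partial Fisher-Yates shuffle. This is a hypothetical sketch, not the actual ResettableRandomizedIterator (LazyRandomizedIterator is a made-up name): it still copies once up front, but only invokes the random number generator for elements the caller actually consumes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class LazyRandomizedIterator<T> implements Iterator<T>
{
    private final List<T> list;
    private int position;

    public LazyRandomizedIterator(Iterable<T> elements)
    {
        this.list = new ArrayList<>();
        elements.forEach(list::add); // one upfront copy, as discussed above
    }

    @Override
    public boolean hasNext()
    {
        return position < list.size();
    }

    @Override
    public T next()
    {
        // Swap a random remaining element into the current slot: a partial
        // Fisher-Yates shuffle that stops as soon as the caller stops iterating,
        // so the RNG is only called once per consumed element.
        int index = position + ThreadLocalRandom.current().nextInt(list.size() - position);
        T result = list.get(index);
        list.set(index, list.get(position));
        list.set(position, result);
        position++;
        return result;
    }
}
```

If a caller only takes the first k elements, only k RNG calls and k swaps are made, which is the point of the original optimization.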
It's kind of weird for an iterator to have a size. Plus, this really looks like a ListIterator :)
Yeah, not nice. But the class is pretty private, so I would simply keep it. The usages are pretty readable (e.g., candidates.size()).
Should this be an IllegalStateException? This method seems only applicable to DML, and I think if the catalog is missing, it should be caught earlier than here (not entirely sure about this).
We should keep the exception as a user error. IllegalStateException would be categorized as an internal error.
wenleix
left a comment
"Allow use connector-specific partitioning for remote exchange"
Looks good. I agree with @nezihyigitbasi's comment about the commit message, with some additions noting that this bucketed execution is future work. Feel free to revise it:
Add connector specific partitioning support for remote exchanges
This opens the opportunity to perform exchanges through connector tables (by replacing remote exchanges with a bucketed table write followed by a bucketed table read), which is required by the following use cases:
* Support recoverable exchanges
* Support arbitrarily large JOIN/AGGREGATE by first bucketing the input data and then using grouped execution.
wenleix
left a comment
"Rename static factory methods in ExchangeNode"
Looks good.
Make sure to also address @nezihyigitbasi's and @shixuan-fan's comments :)
9003715 to 397a776 (force-pushed)
@nezihyigitbasi, @shixuan-fan: comments addressed. Could you please have another look?
nezihyigitbasi
left a comment
Add max_tasks_per_stage session property
The commit message doesn't have the correct config name (task.max-tasks-per-stage is not correct).
nezihyigitbasi
left a comment
Allow use connector-specific partitioning for remote exchange
Please see my previous comment for the commit message.
nezihyigitbasi
left a comment
LGTM too % minor comments about the commit messages.
Sorry. I don't know why I was sure that I had changed it.
397a776 to 8ba0d6a (force-pushed)
Rename the query.initial-hash-partitions to query.hash-partition-count to match the session property name
And the corresponding stage.max-tasks-per-stage configuration property
If connector specified partitioning is used for exchanges, remote exchanges can be replaced with a bucketed table write followed by a bucketed table read
Rename partitionedExchange to systemPartitionedExchange to emphasize that the system partitioning is created within.
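For reference, the renamed configuration properties from the commits above would appear in a coordinator config roughly like this (the values shown are illustrative, not defaults):

```properties
# formerly query.initial-hash-partitions; renamed to match the session property name
query.hash-partition-count=100
# caps the number of tasks for non-source-distributed stages
stage.max-tasks-per-stage=300
```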
535e878 to 42b7a6c (force-pushed)
No description provided.