
Add connector specific partitioning support for remote exchanges#12373

Merged
arhimondr merged 7 commits into prestodb:master from arhimondr:connector-partitioning
Feb 28, 2019

Conversation

@arhimondr (Member):

No description provided.

@wenleix (Contributor):

"Tiny ExchangeNode refactorings" and "Rename query.initial-hash-partitions property" look good.

@wenleix (Contributor):

"Add max_tasks_per_stage session property"

Minor comment. Maybe explain the motivation in the commit message? (to control the number of tasks in a stage)

Contributor:

Curious: where is the name fixed distribution stage from? I guess it's from FixedSourcePartitionedScheduler? :)

Member Author:

I agree that the description is kind of confusing, but I couldn't come up with anything better.

Contributor:

Shall we name it stage.max-tasks-per-stage, given how the other properties are named?

Contributor:

You might need to change other places that use getHashPartitionCount(), especially the usages in AddExchanges and RewriteSpatialPartitioningAggregation.

Maybe refactor this min operation into a static helper method in SystemPartitioningHandle?
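A minimal sketch of the suggested helper; the class and method names below are illustrative, not the actual Presto API:

```java
// Hypothetical helper sketching the suggested refactoring; not actual Presto code.
public final class PartitionCounts
{
    private PartitionCounts() {}

    // Cap the system partition count by the per-stage task limit.
    public static int effectivePartitionCount(int hashPartitionCount, int maxTasksPerStage)
    {
        return Math.min(hashPartitionCount, maxTasksPerStage);
    }
}
```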

@wenleix (Contributor), Feb 23, 2019:

In fact, do we want to apply min here? Today, we cannot execute when we have more system partitions than the number of tasks.

However, in the future, we might want to materialize system partitioning results (in contrast to Hive partitioning). At that point it would be legitimate to have 1000 partitions while the max tasks per stage is only 300.

Update: I think it's OK for now to have this behavior for SystemPartitioning.

Member Author:

> Especially the usage in AddExchanges

This is used when creating connector partitioning. The number of hash partitions for connector partitioning shouldn't depend on the number of nodes (we want to have more buckets than we have nodes).

> RewriteSpatialPartitioningAggregation

This is needed for partitioned spatial join. We can work on that later.

> Maybe refactor this min operation as a static helper method in SystemPartitioningHandle

It is used only in one place. Once there are more places, we can refactor it.

Contributor:

You are right, this is about getting the NodePartitionMap, so we should enforce max-tasks-per-stage. In AddExchanges and the other plan optimizers, we should only use hash_partition_count, since it's a logical partitioning. I am assuming that today we will fail when hash_partition_count > max-tasks-per-stage.

Member Author:

It won't fail, as we don't have to specify the number of partitions for the system partitioning upfront. System partitioning is still a special case for now.

@wenleix (Contributor):

"Use static imports in AddExchanges" and "Remove initial capacity hint when selecting nodes"

Look good. I think in the previous use case, using the initial capacity hint is not a bad idea ;)

Alternatively, you can make NodeSelector.selectRandomNodes short cut to allNodes() when limit > totalNodeCount.
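A rough sketch of that short cut, using stand-in types rather than the real NodeSelector interface:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in sketch of the suggested short cut; not the actual NodeSelector API.
final class RandomNodeSelection
{
    static <T> List<T> selectRandomNodes(List<T> allNodes, int limit)
    {
        // When the limit covers every node, return them all and skip the shuffle.
        if (limit >= allNodes.size()) {
            return allNodes;
        }
        List<T> shuffled = new ArrayList<>(allNodes);
        Collections.shuffle(shuffled);
        return shuffled.subList(0, limit);
    }
}
```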

@shixuan-fan (Contributor):

First two commits look good.

"Add max_tasks_per_stage session property"

@shixuan-fan (Contributor):

Code looks good. Would you mind adding motivation to the last commit message?

@arhimondr force-pushed the connector-partitioning branch from 79a0e2d to ed91fd4 on February 22, 2019 18:57
@wenleix (Contributor):

"Support connector partitioning provider"

Generally looks good to me. Maybe the commit message should be "Allow use connector-specific partitioning for remote exchange"?

And as @shixuan-fan suggests, we should explain the motivation in commit message :) .

There are two TODOs for me as reviewer:

  1. I will do a more careful pass over AddExchanges. The refactors look reasonable, but might be worth a deeper look given the complexity of AddExchanges.

  2. I will contemplate/experiment with how this works together with compatible partitioning. I suggest @shixuan-fan and @arhimondr also think about this ;) . For example:

Assume we are performing A JOIN B. A is bucketed into 512 buckets, and B is not bucketed. The session property tells us to use HivePartitioning with 1024 partitions for remote exchange.

Do we think the partitioning over A and B (after exchange) are considered as compatible partitioning? If so, with today's code we will try to re-adjust B's partitioning to be 512 partitions -- will it work since B's partitioning is derived from Remote Exchange, rather than Table Scan. -- We don't have to solve all the problems in this PR, and we should start thinking about these interesting synergies :)

A more fundamental problem is that today, compatible partitioning is adapted in the connector's read path and is totally opaque to the engine. We might make the engine aware of it at some point :)

@wenleix (Contributor) commented Feb 23, 2019:

Also, Travis is failing :). I assume these are trivial fixes, such as some tests that have to be disabled :)

@arhimondr (Member, Author):

> Do we think the partitioning over A and B (after exchange) are considered as compatible partitioning? If so, with today's code we will try to re-adjust B's partitioning to be 512 partitions -- will it work since B's partitioning is derived from Remote Exchange, rather than Table Scan. -- We don't have to solve all the problems in this PR, and we should start thinking about these interesting synergies :)

That's a very good question. So yeah, right now if one table is already bucketed, the code will try to bucket the other table with the same number of buckets. So the hash_partitions property won't override the bucket number for now. That is something we should consider doing (maybe making it optional). If you don't mind, I would like to keep this PR simple for now, and contemplate the approach for overriding the number of buckets later.

@arhimondr force-pushed the connector-partitioning branch from ed91fd4 to 55a4228 on February 25, 2019 15:09
@wenleix (Contributor) commented Feb 25, 2019:

@arhimondr

> the code will try to bucket the other table with the same number of buckets.

Ah, I see. At least there are no compatible buckets coming in to complicate the issue :).

> That is something we should consider doing (maybe making it optional).

Yeah... the problem is we don't want to have 100 configs on this... tough question :)

> i would like to keep this PR simple for now

Agree. We should keep this PR as simple as possible. Just keep in mind there might be interesting behaviors in some cases...

Shall we add an experimental- prefix to this new session property? :)

@wenleix (Contributor) commented Feb 26, 2019:

I looked into the code a bit. A join that consists of all remote sources will use FixedCountScheduler:

    // remote source requires nodePartitionMap
    NodePartitionMap nodePartitionMap = partitioningCache.apply(plan.getFragment().getPartitioning());
    if (groupedExecutionForStage) {
        checkState(connectorPartitionHandles.size() == nodePartitionMap.getBucketToPartition().length);
    }
    stageNodeList = nodePartitionMap.getPartitionToNode();
    bucketNodeMap = nodePartitionMap.asBucketNodeMap();
    bucketToPartition = Optional.of(nodePartitionMap.getBucketToPartition());

The nodePartitionMap computation logic is in NodePartitioningManager.getNodePartitioningMap:

    public NodePartitionMap getNodePartitioningMap(Session session, PartitioningHandle partitioningHandle)

So, for connector-provided partitioning, #nodes = all-nodes-count, #partitions = hash-partition-count:

    bucketToNode = createArbitraryBucketToNode(
            nodeScheduler.createNodeSelector(connectorId).allNodes(),
            connectorBucketNodeMap.getBucketCount());

Otherwise it goes into SystemPartitioningHandle.getNodePartitionMap, and #nodes = #partitions = min(hash-partition-count, max-tasks-per-stage):

    if (partitioningHandle.getConnectorHandle() instanceof SystemPartitioningHandle) {
        return ((SystemPartitioningHandle) partitioningHandle.getConnectorHandle()).getNodePartitionMap(session, nodeScheduler);
    }

This is probably OK for now, but just keep in mind that at some point we might want more consistent behavior between SystemPartitioning and ConnectorPartitioning.

@wenleix (Contributor):

Some additional minor comments on "Support connector partitioning provider". Otherwise looks good.

@wenleix (Contributor) commented Feb 26, 2019:

This is part of the effort on Support Materialized Exchange (#12387)

@arhimondr (Member, Author):

@wenleix

> Alternatively, you can make NodeSelector.selectRandomNodes short cut to allNodes() when limit > totalNodeCount

Unfortunately, nodes are provided as an Iterator, and Iterator doesn't have a size method. We can try to change the interfaces, but honestly I don't think that expanding the list would be an issue. Most of the time we will expand it to 1000 elements max.

@arhimondr (Member, Author):

@wenleix

> Alternatively, you can make NodeSelector.selectRandomNodes short cut to allNodes() when limit > totalNodeCount

I added a size() method to the ResettableRandomizedIterator, so the initial capacity hint can take the total available number of nodes into account.

@arhimondr (Member, Author):

@wenleix

> Shall we add experimental- prefix to this new session property ? :)

I'm not sure. We've got a lot of properties that are "experimental" and have been there for years, so I'm a little reluctant to add that prefix.

Approaching from the other side: whether or not we prefix it with experimental, if we decide to get rid of the property, we will have to migrate all the clients anyway.

@arhimondr (Member, Author) commented Feb 26, 2019:

@wenleix

> This is probably OK for now, but just keep in mind at some point we might want to have more consistent behavior between SystemPartitioning and ConnectorPartitioning

It doesn't make sense to have more partitions than we have nodes for system partitioning. We are not going to use system partitioning for materialization or bucket-by-bucket execution for now. And if all the partitions have to be executed at once, there's no point in having more of them than there are nodes in the cluster.

@arhimondr (Member, Author):

@wenleix @shixuan-fan Comments addressed. Please have another pass.

@wenleix (Contributor) commented Feb 27, 2019:

@arhimondr

> It doesn't make sense to have more partitions than we have nodes for the system partitioning. We are not going to use the system partitioning for materialization and bucket-by-bucket for now. And if all the partitions have to be executed all at once - there's no point of having more of them than there are nodes in the cluster.

Correct for now. But I expect that at some point we will want to support materializing system partitioning as well, and it seems to me that it would make sense to unify system partitioning and connector partitioning.

I do agree we don't need to worry about it now. It's just something to keep in mind :)

@wenleix assigned wenleix and unassigned arhimondr on Feb 27, 2019
@nezihyigitbasi (Contributor):

Add max_tasks_per_stage session property

  • The commit details message has the config name incorrect.

Contributor:

for non-source stages or for intermediate stages?

Contributor:

@nezihyigitbasi: Thanks for the comment! In our case it's more about whether the stage/fragment's distribution is SOURCE_DISTRIBUTION:

    public static final PartitioningHandle SOURCE_DISTRIBUTION = createSystemPartitioning(SystemPartitioning.SOURCE, SystemPartitionFunction.UNKNOWN);

For example, a collocated join might be a leaf stage, but the distribution of that stage/fragment would follow the table partitioning.

Maybe we can say "Maximum number of tasks for a stage, unless its partitioning handle is SOURCE_DISTRIBUTION" to avoid confusion?

Member Author:

I'm not sure an average end user would know what a partitioning handle is.

Member Author:

"Maximum number of tasks for a non-source-distributed stage" - I think it should be pretty intuitive.

The user should understand that it is not possible (at least not in all cases) to limit the number of source-distributed tasks, as source-distributed tasks might be bound to the data location. So it somewhat makes sense. However, I agree that it is still quite confusing.

Contributor:

ditto

@nezihyigitbasi (Contributor):

Allow use connector-specific partitioning for remote exchange

  • Commit message title can be Add connector specific partitioning support for remote exchanges
  • If connector specified partitioning is used for exchanges, remote exchanges can be replaced with a bucketed table write followed by a bucketed table read.

Contributor:

. at the end.

Contributor:

Why not plural (getPartitioningHandleForExchanges)?
Or getExchangePartitioningHandle?

Member Author:

Because we are getting partitioning handles per exchange, as the handle contains the list of types for partitioning columns. So technically there will be one partitioning handle per exchange.

Contributor:

Looks like we need a better description here, as this repeats the property name. Maybe something like "Name of the catalog providing custom partitioning", or something along those lines.

Contributor:

. at the end.

Contributor:

partitioningProviderCatalogName

Member Author:

Maybe just partitioningProviderCatalog to be consistent with the property?

Contributor:

We have multiple partitionedExchange methods with complex lists of arguments; it would be nice to unify some of them to simplify the API surface here.

Contributor:

@nezihyigitbasi: Does the next commit (renaming some of them specifically to systemPartitionedExchange) help to clarify the API?

@wenleix (Contributor):

"Improve selected nodes initial capacity hint"

Minor comments... I really feel we can just change it to List<Node> :). ResettableRandomizedIterator will copy it into a List anyway.

Contributor:

Wow. I really feel we just want to use a List<Node> (and do ImmutableList.copyOf if the original input is a Set).

Member Author:

Apparently we were trying to optimize randomization. So instead of shuffling all nodes every time, only the number of nodes needed can be shuffled.
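That optimization is essentially a lazy Fisher-Yates shuffle: copy the input once, then call the random number generator only for elements actually consumed. A self-contained sketch (illustrative names, not the real ResettableRandomizedIterator):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Minimal sketch of the lazy-shuffle idea: copy the input once, then only
// draw from the RNG for elements that are actually consumed.
final class LazyShuffleIterator<T> implements Iterator<T>
{
    private final List<T> list = new ArrayList<>();
    private final Random random = new Random();
    private int position;

    LazyShuffleIterator(Iterable<T> elements)
    {
        elements.forEach(list::add);
    }

    public int size()
    {
        return list.size();
    }

    @Override
    public boolean hasNext()
    {
        return position < list.size();
    }

    @Override
    public T next()
    {
        // Swap a random remaining element into the current position:
        // one RNG call per element consumed, not per element stored.
        int index = position + random.nextInt(list.size() - position);
        T value = list.get(index);
        list.set(index, list.get(position));
        list.set(position, value);
        position++;
        return value;
    }
}
```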

@wenleix (Contributor), Feb 27, 2019:

By using ResettableRandomizedIterator, it already defeats the purpose, since it will do a copy into an ArrayList in the constructor :)

Maybe we should just use your previous solution and allow list resizing. Sorry about the back-and-forth :(.

Member Author:

> By using ResettableRandomizedIterator, it already defeat the purpose since it will do a

We do a copy, but we don't call the random number generator unless needed. Not sure how big a problem that is, but since it was optimized that way, let's keep it as is.

Contributor:

It's kind of weird for an iterator to have a size. Plus, this really looks like a ListIterator :)

Member Author:

Yeah, not nice. But the class is pretty private. I would simply keep it. The usages are pretty readable (e.g., candidates.size()).

Contributor:

Should this be an IllegalStateException? This method seems only applicable to DML, and if the catalog is missing, I think it should be caught earlier than here (not entirely sure about this).

Member Author:

We should keep the exception as a user error. IllegalStateException will be categorized as an internal error.

@wenleix (Contributor):

"Allow use connector-specific partitioning for remote exchange"

Looks good. I agree with @nezihyigitbasi's comment about the commit message, with some additions noting that this bucketed execution is future work. Feel free to revise it.

Add connector specific partitioning support for remote exchanges

This opens the opportunity to perform exchanges through connector tables (by replacing remote exchanges with a bucketed table write followed by a bucketed table read), which is required by the following use cases:
* Support recoverable exchange
* Support arbitrarily large JOIN/AGGREGATE by first bucketing the input data and using grouped execution.


@wenleix (Contributor):

"Rename static factory methods in ExchangeNode"

Looks good.

@wenleix (Contributor) commented Feb 27, 2019:

Make sure to also address @nezihyigitbasi's and @shixuan-fan's comments :)

@wenleix assigned arhimondr and unassigned wenleix on Feb 27, 2019
@wenleix changed the title from "Support arbitrary connector partitioning" to "Add connector specific partitioning support for remote exchanges" on Feb 27, 2019
@arhimondr force-pushed the connector-partitioning branch from 9003715 to 397a776 on February 27, 2019 20:47
@arhimondr (Member, Author):

@nezihyigitbasi, @shixuan-fan: comments addressed. Could you please have another look?

@shixuan-fan (Contributor):

Looks good.

@nezihyigitbasi (Contributor):

Add max_tasks_per_stage session property

The commit details don't have the correct config name (task.max-tasks-per-stage is not correct).

@nezihyigitbasi (Contributor):

Allow use connector-specific partitioning for remote exchange

Please see my previous comment for the commit message.

@nezihyigitbasi (Contributor):

LGTM too, modulo minor comments about the commit messages.

@arhimondr (Member, Author):

> Please see my previous comment for the commit message.

Sorry, I don't know why I was sure that I had changed it.

@arhimondr force-pushed the connector-partitioning branch from 397a776 to 8ba0d6a on February 27, 2019 22:34
Commit messages:
* Rename the query.initial-hash-partitions property to query.hash-partition-count to match the session property name
* And corresponding stage.max-tasks-per-stage configuration property
* If connector-specified partitioning is used for exchanges, remote exchanges can be replaced with a bucketed table write followed by a bucketed table read
* Rename partitionedExchange to systemPartitionedExchange to emphasize that the system partitioning is created within.
@arhimondr force-pushed the connector-partitioning branch from 535e878 to 42b7a6c on February 28, 2019 18:16
@arhimondr merged commit e170666 into prestodb:master on Feb 28, 2019
@arhimondr deleted the connector-partitioning branch on February 28, 2019 18:17

5 participants