
Support max_tasks_per_stage for scan #13477

Merged
arhimondr merged 1 commit into prestodb:master from cemcayiroglu:add_task_limit
Oct 24, 2019

Conversation


@cemcayiroglu cemcayiroglu commented Sep 29, 2019

The stage.max-tasks-per-stage configuration property can be used to limit the number of tasks for scan.

== RELEASE NOTES ==

General Changes

* Respect stage.max-tasks-per-stage to limit the number of tasks for scan.

@swapsmagic swapsmagic left a comment

LGTM

Member

Why doesn't the TopologyAwareNodeSelector respect the limit?

Contributor Author

I am not sure it makes sense to use the limit here. The topology-aware selector selects the best nodes for splits to maximize locality. I am not sure we can even apply a limit here.

Contributor

@cemcayiroglu : Note that even for SimpleNodeSelector, when the splits don't support remote access (e.g., Raptor), the limit is not applied. cc @highker

On the other hand, for TopologyAwareNodeSelector, the current algorithm makes best effort to co-locate data and compute nodes. Thus we should still be able to apply a limit over the candidate nodes and do best-effort topology-aware selection. Does that make sense?


Agree, let's hide limit from those non-applicable cases.

Contributor

A ResettableRandomizedIterator is basically just a List of the original data:

public class ResettableRandomizedIterator<T>
        implements Iterator<T>
{
    private final List<T> list;
    private int position;

Should we just keep a List here? In that case we can just use NodeSelector#selectRandomNodes. @arhimondr ?

I remember this discussion happened before; let me see if I can find the previous discussion.


Contributor

Instead of making limit a parameter to computeAssignments, another idea is to allow constructing a NodeSelector over a random set of nodes. For example, today we construct a NodeSelector through:

            NodeSelector nodeSelector = nodeScheduler.createNodeSelector(connectorId);

We can have a different API:

            NodeSelector nodeSelector = nodeScheduler.createNodeSelector(connectorId, maxNodeCount);

I am suggesting this because today, once NodeSelector#computeAssignments is called, it memoizes
randomCandidates based on the limit. That makes NodeSelector stateful, which makes the behavior a bit difficult to reason about (e.g., what happens when the NodeSelector is called elsewhere?).

What do you think? @highker , @arhimondr ?
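This alternative could be sketched roughly as follows. The types here are simplified stand-ins (nodes are plain Strings), not Presto's real interfaces, and the connectorId parameter is kept only for signature parity:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: the second factory fixes a random node subset at construction
// time, so computeAssignments would never need a limit parameter.
interface NodeSelector
{
    List<String> selectRandomNodes(int limit);
}

final class NodeScheduler
{
    private final List<String> allNodes;
    private final Random random;

    NodeScheduler(List<String> allNodes, Random random)
    {
        this.allNodes = new ArrayList<>(allNodes);
        this.random = random;
    }

    // Existing-style factory: the selector sees every node
    NodeSelector createNodeSelector(String connectorId)
    {
        return createNodeSelector(connectorId, Integer.MAX_VALUE);
    }

    // Proposed factory: the selector is constructed over a random subset
    NodeSelector createNodeSelector(String connectorId, int maxNodeCount)
    {
        List<String> shuffled = new ArrayList<>(allNodes);
        Collections.shuffle(shuffled, random);
        List<String> limited = shuffled.subList(0, Math.min(maxNodeCount, shuffled.size()));
        return limit -> limited.subList(0, Math.min(limit, limited.size()));
    }
}
```

The design trade-off debated below is visible in the sketch: the subset is frozen when the selector is created, so later selections cannot react to cluster changes.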


I think @wenleix's suggestion is much better. Having limit on the interface is not intuitive to understand

Contributor

I think the direction Cem is heading in is that node selection won't be random; it will be based on resources. So constructing the node selector with a subset of the nodes would mean it could no longer make resource-based decisions.

Even today we have a TopologyAwareNodeSelector, and if we passed it a random set of nodes at construction, it might be restricted in what topology-aware scheduling it can do.

Contributor Author

@wenleix @highker @aweisberg Thanks for the input. I agree with all of you. I had trouble deciding where to place the limit. Actually, there is a bigger problem here that creates confusion. DynamicSplitPlacementPolicy should define the node selection criteria (e.g., whether access is remote, whether a limit applies), but today it does almost nothing and delegates everything to the NodeSelector. BucketedSplitPlacementPolicy is very similar. I think the NodeSelector should select nodes based on the criteria provided by the SplitPlacementPolicy. As @aweisberg said, we are going to have different policies in the future to select nodes for memory-aware scheduling. I am thinking of following what @wenleix described and hiding the limit. I will create a follow-up PR to refactor these parts to get ready for memory-aware scheduling. Any thoughts?

Contributor Author

On second thought, I will pass session to createNodeSelector instead of the limit. The nodeScheduler will access the limit if it is needed. It is cleaner this way.

NodeSelector nodeSelector = nodeScheduler.createNodeSelector(connectorId, session);

Member

Let me know how that sounds.

Member

The other important question is that node selection currently happens continuously. By memoizing the selected nodes we are breaking the current semantics. I'm not sure how important that is. Maybe it is not.

But if we really want to keep the semantics, in the DynamicSplitPlacementPolicy the nodes have to be selected every time. It should also keep track of the nodes that already have splits assigned. This list should be extended every time a new node is selected, and once the size limit is reached, the extension can stop and only the already-selected nodes be scheduled.
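The incremental scheme described here could be sketched as below. Nodes are simplified to Strings and all names are illustrative, not Presto's actual API:

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: nodes are re-examined on every call, the policy remembers
// which nodes were already selected, and the set stops growing once the
// stage limit is reached.
final class SelectedNodeTracker
{
    private final Set<String> selectedNodes = new LinkedHashSet<>();
    private final int maxTasksPerStage;

    SelectedNodeTracker(int maxTasksPerStage)
    {
        this.maxTasksPerStage = maxTasksPerStage;
    }

    // Called on every scheduling iteration with the currently visible nodes
    Set<String> select(List<String> visibleNodes)
    {
        for (String node : visibleNodes) {
            if (selectedNodes.size() >= maxTasksPerStage) {
                break; // limit reached: stop extending, reuse already-selected nodes
            }
            selectedNodes.add(node);
        }
        return Collections.unmodifiableSet(selectedNodes);
    }
}
```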

Contributor Author

@arhimondr this sounds great! Do you think it is better to handle the refactoring part in another PR? You are also right about the semantics. The current approach can cause a problem if the initial split count is lower than the limit and the next call goes over the limit: not enough nodes will be selected. I am going to make this change or find another way to avoid this.

Contributor Author

@arhimondr Actually, we need to keep it as is if we want to keep the semantics the same. The current code does not support node addition: the NodeMap is created by createNodeSelector and stays the same throughout the query execution. If it is OK, I would like to merge it as is.

Member

I would like to suggest changing the second method signature to something like SplitPlacementResult computeAssignments(Set splits, List existingTasks, List selectedNodes)

upd: per offline discussion this is not possible, as it would break topology-aware node selection

With topology-aware node selection, nodes cannot be selected from the outside

@highker highker left a comment

limit on interface may need some cleaning



Contributor

It's not intuitive that this returns the same list, if it is already constructed, regardless of the parameters. Is this all done exactly once? If it's done multiple times, why is it guaranteed that parameters like limit haven't changed?

If this needs to be the case, should it error if the parameters have changed?

Contributor

What behavior are we testing here where the same set of nodes is returned each time? When would we want to get the list of nodes multiple times deterministically?

Member

I still don't like the idea of memoizing. The NodeMap right now is mutable (although with 5-second memoization). Fixing the selected nodes will change the existing semantics and cluster behaviour. Although I don't know how big a problem that is, I would still prefer to keep the behaviour the same.

The List<RemoteTask> existingTasks is the list of existing tasks.

RemoteTask#getNodeId returns the assigned node id of the task. To keep the existing semantics we should probably change the algorithm to something like:

  • Get all the nodes
  • Filter out all the nodes that already have tasks
  • Select random N nodes from the remaining nodes, where N = maxTasksPerStage - existingTasks.size()
  • Union nodes with existing tasks ...
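The steps above could be sketched as follows, with nodes simplified to plain Strings; all names are illustrative, not Presto's actual API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Sketch only: select up to maxTasksPerStage nodes while always keeping the
// nodes that already have tasks assigned.
final class LimitedNodeSelection
{
    private LimitedNodeSelection() {}

    static Set<String> selectNodes(Set<String> allNodes, Set<String> nodesWithTasks, int maxTasksPerStage, Random random)
    {
        // Nodes that already have tasks are always part of the result
        Set<String> selected = new LinkedHashSet<>(nodesWithTasks);
        if (selected.size() >= maxTasksPerStage) {
            return selected;
        }
        // Filter out the nodes that already have tasks
        List<String> remaining = new ArrayList<>();
        for (String node : allNodes) {
            if (!nodesWithTasks.contains(node)) {
                remaining.add(node);
            }
        }
        // Select random N of the remaining, N = maxTasksPerStage - existing count
        Collections.shuffle(remaining, random);
        remaining.stream()
                .limit(maxTasksPerStage - selected.size())
                .forEach(selected::add);
        return selected;
    }
}
```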

Member

The interface now is really weird.

nodeScheduler.createNodeSelector now accepts maxTasksPerStage, and then nodeSelector.selectRandomNodes also accepts maxTasksPerStage.

I wonder if passing the limit to NodeSelector#computeAssignments is a lesser evil?

CC: @wenleix

Contributor Author

I think changing NodeSelector#computeAssignments is more evil because it will require changing TopologyAwareNodeSelector too.

Member

I think changing NodeSelector#computeAssignments is more evil

Let me elaborate a little bit more on this:

Creating the NodeSelector with the limit (which is in fact the maxTasksPerStage) is quite confusing.

The NodeSelector has other methods that accept a limit as a parameter. Thus it is not clear what limit will be applied, say, in NodeSelector#selectRandomNodes: the maxTasksPerStage limit? The selectRandomNodes(limit..) one? Or their min() / max()?

it will require changing TopologyAwareNodeSelector too.

Currently it is simply ignored anyway, so I don't see a big difference.

Anyhow, I don't feel particularly strongly about this. Feel free to leave it as is if you do.

Contributor

@arhimondr , @cemcayiroglu : If we are OK having maxTasksPerStage when creating the NodeSelector, do we want to just pre-compute a limited set of nodes when the NodeSelector is created?

Then the only reason we need to consult the nodeManager during computeAssignments is when the query is submitted at cluster initialization time, which is not a common case.

Member

Then the only reason we need to consult the nodeManager during computeAssignments is when the query is submitted at cluster initialization time, which is not a common case.

I'm not sure how important that is in practice, but I think it is better to preserve the existing behaviour.

Contributor Author

Looks like this part of the code is super messy: NodeScheduler and NodeSelector call each other, and no one knows who is doing what. I am creating an issue to address this. For reference: https://en.wikipedia.org/wiki/Spaghetti_code


Member

nit: unrelated change, please revert

Member

nit: unrelated change, please revert

@cemcayiroglu cemcayiroglu force-pushed the add_task_limit branch 2 times, most recently from b91d528 to 84be3db on October 11, 2019 at 19:24
@arhimondr arhimondr left a comment

LGTM % minor comments on testing and small performance optimizations

@aweisberg @highker @wenleix Do you guys want to have another look?

Member

nit: new line (wrap it)

Member

Suggestion:

This code is going to be called for every batch of splits. We can make it more efficient by:

  1. Collecting it here to a List
  2. Creating a set only in the less likely branch (alreadySelectedNodeCount < limit), right before it has to be passed to selectNodes

Contributor

Agreed with @arhimondr . Calling this on each computeAssignments call might be expensive.

Also, we usually put new lines in stream API calls:

existingTasks.stream()
      .map(remoteTask -> nodeMap.getNodesByNodeId().get(remoteTask.getNodeId()))
      .collect(toSet());

Member

I wonder if it makes sense to exercise more corner cases here. First scenario:

  1. Select one node
  2. Call a second time; make sure a second node is selected, and it is different from the first one
  3. Call a third time; make sure no additional nodes have been selected

Second scenario:

  1. Add only a single node to the InMemoryNodeManager
  2. Call once; make sure it is selected
  3. Call a second time; make sure the output is the same
  4. Add one more node to the InMemoryNodeManager
  5. Call a third time; make sure a new node has been selected
  6. Add one more node to the InMemoryNodeManager
  7. Call a fourth time; make sure the latest added node is not selected, as the limit is reached

Contributor Author

@arhimondr I couldn't understand the first test case. Select one node twice?

Member

e.g., provide only a single split, so only a single node of the three is selected at the first step.

@wenleix wenleix left a comment

Generally looks good to me. @highker would you also like to take a look? (especially about the interface discussion) :)




wenleix commented Oct 14, 2019

@arhimondr : This is off-topic of this PR. But here is the ancient discussion about ResettableRandomizedIterator:
#12373 (comment)

I see your point... but anyhow, it's not a "common iterator" (the long name also suggests it 😉 )

@highker highker self-requested a review October 14, 2019 04:33
@cemcayiroglu (Contributor Author)

@wenleix thanks for your feedback! We recompute the nodes because the NodeMap gets refreshed every 5 seconds; new worker nodes can join while we are scheduling the splits. We could cache the nodes once they reach the limit, but that makes the code more complicated and stateful. I think we can keep it as is (after adding @arhimondr's suggestions). I would like to start refactoring the interfaces after this PR :) The interface names don't make sense.

@highker highker left a comment

Didn't read the test. The e2e logic/interface looks good to me


nit: toImmutableSet


Do we wanna do some sanity check on this value?

@cemcayiroglu cemcayiroglu force-pushed the add_task_limit branch 2 times, most recently from 2a0c382 to d659532 on October 15, 2019 at 00:55

wenleix commented Oct 15, 2019

@cemcayiroglu :

@wenleix thanks for your feedback! We are recomputing the nodes since the NodeMap gets refreshed every 5 seconds. New worker nodes can join while we are scheduling the splits. We can cache the nodes if they reach the limit, but this also makes the code more complicated and stateful. I think we can keep it as is (after adding @arhimondr's suggestions). I would like to start refactoring the interfaces after this PR :) The interface names don't make sense.

Correct; what I mean is we only recompute when we don't have enough nodes to schedule. I think @arhimondr suggested similar ideas. But I am open to waiting for the refactor/cleanup, given the current interface is a bit messy already.

Another idea: can we skip this recomputation when the limit is large enough? I remember max-tasks-per-stage is 2^31-1 by default, so in the default case this computation doesn't need to kick in :)
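The fast path for that default could look roughly like the following sketch; the class and method names are illustrative, not part of Presto's API:

```java
// Sketch only: max-tasks-per-stage defaults to Integer.MAX_VALUE (2^31 - 1),
// so in the default case the limit can never bind and the per-batch node
// recomputation can be skipped.
final class TaskLimits
{
    private TaskLimits() {}

    static boolean limitCanApply(int maxTasksPerStage, int totalNodeCount)
    {
        // The limit only matters when it is smaller than the node count
        return maxTasksPerStage < totalNodeCount;
    }
}
```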

@cemcayiroglu cemcayiroglu force-pushed the add_task_limit branch 2 times, most recently from 5f53d9d to 8b2800d on October 15, 2019 at 17:41
Member

Please use io.airlift.units.Duration instead.

See the 9th bullet point here: https://github.com/prestodb/presto/wiki/Presto-Development-Guidelines#development

Member

Accept Duration here

Member

The memoizeWithExpiration supplier doesn't allow 0 =\

I recommend having a condition here: if the duration is 0, create a regular supplier; otherwise, the memoizeWithExpiration one.
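A minimal sketch of that condition follows. The expiring branch is a hand-rolled stand-in for Guava's Suppliers.memoizeWithExpiration (which rejects a zero duration); the real code would delegate to the Guava version, and all names here are illustrative:

```java
import java.util.function.Supplier;

// Sketch only: zero refresh interval means no caching; otherwise the value
// is cached and reloaded once the interval has elapsed.
final class NodeMapSuppliers
{
    private NodeMapSuppliers() {}

    static <T> Supplier<T> createNodeMapSupplier(Supplier<T> loader, long refreshMillis)
    {
        if (refreshMillis == 0) {
            // No caching: every call reloads the node map
            return loader;
        }
        return new Supplier<T>()
        {
            private T value;
            private long loadedAtNanos;

            @Override
            public synchronized T get()
            {
                long now = System.nanoTime();
                if (value == null || now - loadedAtNanos >= refreshMillis * 1_000_000L) {
                    value = loader.get();
                    loadedAtNanos = now;
                }
                return value;
            }
        };
    }
}
```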

Member

We prefer immutable collections whenever possible. Please use ImmutableSet.copyOf().

See the second bullet point here: https://github.com/prestodb/presto/wiki/Presto-Development-Guidelines#development

@arhimondr arhimondr Oct 15, 2019

The memoizeWithExpiration supplier uses the System.nanoTime() to check if the memoize duration has passed.

From the System.nanoTime() javadoc:

This method provides nanosecond precision, but not necessarily nanosecond resolution

It means that System.nanoTime() may not return an updated value even if a non-zero number of nanoseconds has passed.

Although it is rather unlikely for the calls to take less than 1 ns, it is still possible for System.nanoTime() not to provide enough resolution on some platforms.

This behaviour can result in flakiness in this test. Thus I recommend having

... condition here. If the duration is 0, then create a regular supplier. Otherwise the memoizeWithExpiration one ...

@cemcayiroglu cemcayiroglu force-pushed the add_task_limit branch 2 times, most recently from 8eb63ba to 8cb1bdb on October 15, 2019 at 18:46
Member

nit: new Duration(5, SECONDS)

Member

requireNonNull

Member

nit: create*

Member

memoizeWithExpiration - static import

Member

nodeMapRefreshInterval

@cemcayiroglu cemcayiroglu force-pushed the add_task_limit branch 2 times, most recently from 93cf92d to 448a16f on October 15, 2019 at 23:46
@arhimondr (Member)

@wenleix @highker It looks like all the comments are addressed. I'm going to merge it soon. Let me know if you got any additional comments.

The stage.max-tasks-per-stage configuration property can
be used to limit the number of tasks for scan.
@cemcayiroglu (Contributor Author)

@highker, @wenleix, @arhimondr do you all have any additional comments?
