Implement sequential execution policy #10350
Conversation
cc @martint

Query example:

cc @losipiuk
eb28a42 to c140287
Not sure we want to switch the default. You mentioned in the commit message that overall sequential provides a perf improvement. But are there queries which degraded with the change?
> Not sure we want to switch the default. You mentioned in the commit message that overall sequential provides a perf improvement. But are there queries which degraded with the change?
I don't think there are. Some small things:
- there might be a slight CPU increase for some queries due to more concurrency at the source stage now. Source stage concurrency is another topic to look at.
- `isAnyTaskBlocked` is not propagated by `Future` yet, but it shouldn't happen in practice too much (because of broadcast join limits).
Overall this is much, much better than the current scheduler in high-concurrency workloads and complex queries.
Hence, IMO, the benefits greatly exceed the risks, which are small here.
I guess low latency queries can degrade. For example if the build side is extremely small it might be better to start scheduling probe tasks right away.
I haven't observed that. The current code starts the next stage when the parent stage enters the SCHEDULED state (all splits are scheduled, but they might still be running).
Note that with the current notification mechanism (via futures) we respond to task/stage state changes very fast.
...rino-main/src/main/java/io/trino/execution/scheduler/policy/SequentialExecutionSchedule.java
note: this allows scheduling of a dependent stage before the parent stage completes or enters the flushing state.
I've benchmarked code where we consider flushing and done stages to be completed, but the results were not better (there was even a slight duration regression for TPC-DS):
| label | TPCH wall time | TPC-DS wall time | TPCH CPU time | TPC-DS CPU time | TPCH peak mem | TPC-DS peak mem |
|---|---|---|---|---|---|---|
| before staged unpart | 769.708500 | 1925.400000 | 98571.0 | 237859.442667 | 2.093233e+09 | 1.147734e+09 |
| after staged unpart | 758.691167 | 1816.289500 | 99912.3 | 242806.413833 | 2.082893e+09 | 1.132478e+09 |
| after staged unpart compl | 757.288833 | 1848.024833 | 99515.5 | 242440.454833 | 2.082462e+09 | 1.101463e+09 |
| label | TPCH wall time | TPC-DS wall time | TPCH CPU time | TPC-DS CPU time | TPCH peak mem | TPC-DS peak mem |
|---|---|---|---|---|---|---|
| before staged part | 937.515833 | 1184.331667 | 111597.4 | 135037.311333 | 2.144486e+09 | 1.284780e+09 |
| after staged part | 829.780167 | 1133.327333 | 111747.4 | 132606.885667 | 2.141816e+09 | 1.267524e+09 |
| after staged part compl | 827.721667 | 1153.291667 | 110661.2 | 133056.365000 | 2.148532e+09 | 1.246323e+09 |
This is a useful comment - please put that in the code.
core/trino-main/src/main/java/io/trino/execution/scheduler/ExecutionSchedule.java
core/trino-main/src/main/java/io/trino/execution/scheduler/SqlQueryScheduler.java
Honestly, I never really understood this condition. In theory, as long as there's at least a single stage that is not blocked, the scheduler should keep scheduling. Currently the logic is "as long as there's at least a single stage that is blocked, wait for at least a single stage to get unblocked, or wait a second".
@dain Do you remember why it is implemented this way?
This is because not every signal is governed by futures. Examples are:
- `isAnyTaskBlocked` - is task memory full
- ...
We might improve that as we go (propagate more signals with futures), but for now we have this loop.
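The "wait for any unblock, or a second" loop described above can be sketched roughly like this. This is a hypothetical, simplified illustration (the `SchedulingLoopSketch` class and its shape are not Trino's actual scheduler code): the loop waits on any blocked stage's future with a one-second timeout, so even signals that are not propagated by futures (such as `isAnyTaskBlocked`) are still picked up on the next iteration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: while any stage is blocked, wait until at least one
// blocked future completes, or at most one second, then re-check all stages.
class SchedulingLoopSketch
{
    static void runLoop(List<CompletableFuture<Void>> blockedFutures)
    {
        while (!blockedFutures.isEmpty()) {
            // wait for ANY blocked future to complete, or 1 second, whichever comes first
            CompletableFuture<Object> anyUnblocked =
                    CompletableFuture.anyOf(blockedFutures.toArray(new CompletableFuture[0]));
            try {
                anyUnblocked.get(1, TimeUnit.SECONDS);
            }
            catch (TimeoutException ignored) {
                // timed out: fall through and re-check signals not covered by futures
            }
            catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
            // drop futures that have fired; the real loop would re-run scheduling here
            blockedFutures.removeIf(CompletableFuture::isDone);
        }
    }
}
```

The periodic timeout is what makes non-future signals eventually observable; once more signals are propagated via futures, the timeout could shrink or disappear.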
...rino-main/src/main/java/io/trino/execution/scheduler/policy/SequentialExecutionSchedule.java
> I guess low latency queries can degrade. For example if the build side is extremely small it might be better to start scheduling probe tasks right away.

It's not "sequential" as such, right? More like a topological sort for stage ordering.
c140287 to c7888b0
...rino-main/src/main/java/io/trino/execution/scheduler/policy/SequentialExecutionSchedule.java
That is the contract of the future. It can only be finished once, and then it is to be disposed of. Asking for isBlocked at a different time might yield different results (since the situation has changed).
We have a similar contract in places like:
`io.trino.operator.exchange.LocalExchangeSource#waitForReading`
`io.trino.operator.exchange.LocalExchangeMemoryManager#getNotFullFuture`
and possibly others.
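The one-shot contract can be illustrated with a hypothetical sketch loosely modeled on the idea behind `getNotFullFuture` (the class, names, and structure here are illustrative, not the actual Trino implementation): a completed future is never reused as a signal for a future event; once the blocking condition re-arises, a fresh future is handed out.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a one-shot "not full" future: callers must re-ask
// after each completion, because a finished future reflects a past state.
class NotFullFutureSketch
{
    private final long maxBytes;
    private long bufferedBytes;
    private CompletableFuture<Void> notFullFuture = CompletableFuture.completedFuture(null);

    NotFullFutureSketch(long maxBytes)
    {
        this.maxBytes = maxBytes;
    }

    // buffer grows or shrinks; when it drops below the limit, the
    // outstanding future is completed (exactly once)
    synchronized void updateBytes(long delta)
    {
        bufferedBytes += delta;
        if (bufferedBytes < maxBytes) {
            notFullFuture.complete(null); // no-op if already completed
        }
    }

    // returns a future that completes when the buffer is no longer full;
    // a fresh future is created once the old one has fired and we are full again
    synchronized CompletableFuture<Void> getNotFullFuture()
    {
        if (bufferedBytes >= maxBytes && notFullFuture.isDone()) {
            notFullFuture = new CompletableFuture<>();
        }
        return notFullFuture;
    }
}
```

A caller that holds on to an old completed future would see stale information, which is exactly why the scheduler has to re-ask rather than cache the answer.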
While visiting plan nodes, `Fragments` does not contain the currently visited stage. However, the `processFragment` result does contain the stage for which it was invoked. Maybe we could split `Fragments` into some intermediate structure (while visiting plan nodes) and a final one, but I don't think it would be simpler or easier to understand.
...rino-main/src/main/java/io/trino/execution/scheduler/policy/SequentialExecutionSchedule.java
8a554eb to e5b18b6
...rc/main/java/io/trino/execution/scheduler/policy/PrioritizeUtilizationExecutionSchedule.java
This makes it easier to write scheduler tests
Currently, Trino uses the AllAtOnceExecutionSchedule execution policy by default.
This makes all stages start at the same time, which means that join probe splits take
valuable driver slots even though they cannot make progress until the build side stage completes.
This can throttle cluster resource utilization for more complex queries
or in high-concurrency scenarios.
The new sequential execution policy follows these principles:
* schedules join build source stages before join probe source stages
* immediately schedules stages for which it is known that data is going to be consumed
(e.g. there is a final aggregation downstream)
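As a rough illustration of the build-before-probe ordering (a hypothetical sketch using plain strings for stages and a simple dependency map, not Trino's actual data structures), a depth-first topological sort over stage dependencies yields a schedule in which every build source precedes the probe that consumes it:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: order stages so that each stage's dependencies
// (e.g. a join's build side) are scheduled before the stage itself.
class StageOrderingSketch
{
    // dependencies.get(s) lists stages that must be scheduled before s
    static List<String> scheduleOrder(Map<String, List<String>> dependencies)
    {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String stage : dependencies.keySet()) {
            visit(stage, dependencies, visited, order);
        }
        return order;
    }

    private static void visit(String stage, Map<String, List<String>> dependencies, Set<String> visited, List<String> order)
    {
        if (!visited.add(stage)) {
            return; // already placed in the schedule
        }
        // schedule all dependencies (e.g. build sides) first
        for (String dependency : dependencies.getOrDefault(stage, List.of())) {
            visit(dependency, dependencies, visited, order);
        }
        order.add(stage);
    }
}
```

For a join where `probe` depends on `build`, the resulting order places `build` first, so probe splits never occupy driver slots while the build side is still running.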
Benchmark results:

| label | TPCH wall time | TPC-DS wall time | TPCH CPU time | TPC-DS CPU time | TPCH peak mem | TPC-DS peak mem |
|---|---|---|---|---|---|---|
| before sequential part | 937.515833 | 1184.331667 | 111597.4 | 135037.311333 | 2.144486e+09 | 1.284780e+09 |
| after sequential part | 829.780167 | 1133.327333 | 111747.4 | 132606.885667 | 2.141816e+09 | 1.267524e+09 |

| label | TPCH wall time | TPC-DS wall time | TPCH CPU time | TPC-DS CPU time | TPCH peak mem | TPC-DS peak mem |
|---|---|---|---|---|---|---|
| before seq unpart | 769.708500 | 1925.4000 | 98571.0 | 237859.442667 | 2.093233e+09 | 1.147734e+09 |
| after seq unpart | 758.691167 | 1816.2895 | 99912.3 | 242806.413833 | 2.082893e+09 | 1.132478e+09 |
Improvements:
* part tpch wall time 11.5%
* part tpcds wall time ~4.5%
* unpart tpcds wall time ~5.7%
e5b18b6 to 7ef035b
Will enable phased scheduler in a separate PR.

import static java.util.function.Function.identity;

/**
 * Schedules stages choosing the order to provide the best resource utilization.
> provide the best resource utilization.

This is misleading. Resource utilization is governed by many other factors unrelated to the order in which stages are scheduled.
What this strategy avoids is scheduling stages that will immediately block waiting for another stage to start producing data. In a way, if I'm understanding the code correctly, it schedules stages that form full non-blocking pipelines in the order in which they are able to make progress.
Yup. That's how it works. Do you want a suggestion for a better name? I would use "phased execution policy", but it's already taken.
Hmm... maybe we could just take the name "phased execution" and call the previous one "legacy phased execution". If this is better in every way, we'll eventually get rid of the old one anyway.
Incidentally, back when we added phased execution, the original intent was to eventually take advantage of knowledge of which operators were "blocking" vs which ones could make progress in a pipelined manner for the purpose of scheduling stages. We could think of this PR as finally realizing that, but implementing it in a way that provides an escape hatch for the old behavior in case of bugs.
Fixes #10269