Fix deadlock in join spilling #16293
Conversation
Force-pushed from 02df9f8 to aa7b2ee
Haven't reviewed yet. From the description it sounds like it fixes a real issue, but not the only issue, since the test still hangs: 2021-06-18T05:38:07.479-0500 WARN TestHangMonitor com.facebook.presto.testng.services.LogTestDurationListener No test started or completed in 8.00m. Running tests:
@rschlussel Yeah, looks like a Klondike =D I think I know what the second issue is. I just thought it should never happen in practice. Let me fix it and resubmit.
Force-pushed from b30a50b to 4987e55
I did a quick read.
You should feel free to land your patch first. I will rebase, no worries.
While you correctly pointed out that the issue is likely related to PartitionConsumption, it wasn't clear to me how the issue is triggered, why it is triggered only for Presto on Spark, and what makes Presto on Spark special. That was a little concerning. The failure of the non-spilling test was the biggest concern, though. At that time it wasn't clear to me that the non-spilling test failure was just a fluke of two join limit tests executing in parallel at the same time. That seemed rather unlikely to me (though it ended up being true). Thus I felt obliged to get to the bottom of it to avoid a potential regression in production. I had to reiterate the entire investigation end to end, as at the beginning it wasn't obvious why the problem in the PartitionConsumption is only triggered on Presto on Spark and not on Presto Classic.
I had to split the tests into 3 separate jobs and readjust the thread pools and the concurrency settings. Now the tests pass. |
Why did this fix the timeouts? Was it taking too long but doing real work and this fixed the fundamental issue, or did it just make some still-existing deadlock issue less likely to happen?
The deadlock is fixed by the first commit. I tried to rerun the tests locally several times, and locally they seem to finish much faster (timing out at 1h+ on CI vs. <30m on my desktop with NVMe flash). My guess is that it might be related to lower IOPS on the CI (maybe a spinning disk?). Decreasing task concurrency decreases IO pressure, as with lower task concurrency there are fewer tiny partitions to spill. I haven't been able to confirm whether increasing the number of spiller threads actually improves throughput, but I thought it would be a good idea in general to have enough threads for each partition to spill in parallel. However, even after the thread/concurrency adjustments the overall runtime was still over ~30 minutes, so I decided to also split the tests into 3 separate jobs to improve the build time.
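The thread-pool adjustment described above could be sketched as follows. This is a hedged, hypothetical illustration: the sizing rule and all names are assumptions for this sketch, not Presto's actual configuration logic.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: size the spiller pool so that every partition of
// every concurrent driver can spill in parallel instead of queueing
// behind a small shared pool. The rule and names are illustrative only.
public class SpillerThreadsSketch {
    static int spillerThreads(int taskConcurrency, int spillPartitions) {
        // One thread per (driver, partition) pair.
        return taskConcurrency * spillPartitions;
    }

    public static void main(String[] args) {
        ExecutorService spillExecutor =
                Executors.newFixedThreadPool(spillerThreads(2, 4));
        // ... spill tasks would be submitted here ...
        spillExecutor.shutdown();
    }
}
```

With task concurrency 2 and 4 spill partitions per driver, this rule yields 8 spiller threads; whether a pool that large actually improves throughput depends on the available IOPS, as the comment above notes.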
rschlussel
left a comment
Generally looks good, but I'm confused why the second lookup join operator doesn't also get an early termination signal if the limit was reached by the first one? Shouldn't all operators below the limit get an early termination signal?
I would add a comment here that this code is needed for the case of early termination where the operator is closed before being finished.
That's actually something that also bothered me. I had to take a closer look. So it looks like it actually depends on the plan and on the execution mode. Assuming the task concurrency is Driver 1:
Driver 2:
Now for the finish to get propagated to the
It looks like we have an understanding of why the first doesn't happen. But the question is why the second doesn't happen. The second doesn't happen due to some specifics of how we plan partial limits, and to execution specifics of Presto on Spark. Our planner doesn't try to plan the query in a way that partial limit outputs are merged locally. Basically, if N threads are running a partial limit, each thread is going to send its output to the downstream stage. This is in theory suboptimal, as ideally there should be a local (merge) exchange and a single partial limit operator on top to reduce the number of values sent over the network to the downstream stage. Now, in Presto Classic this is not a problem, since all stages are running in parallel, so the downstream stage can signal back immediately once the limit is reached. However, in Presto on Spark that is not the case, as an upstream stage has to write its full results before the downstream stage can start execution. So it looks like the assumption I made in the PR description is actually incorrect:
Let me know if it makes sense. If you think it does, I will update the PR description with the latest findings.
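The partial-limit planning point above can be illustrated with a small sketch. This is a hypothetical illustration of the row-count arithmetic only; the helper names are invented for this sketch and are not Presto planner code.

```java
// Sketch: with N parallel partial-limit drivers and no local merge
// exchange, each driver may independently emit up to `limit` rows to the
// downstream stage, so the wire carries up to N * limit rows. A local
// merge exchange with a single partial limit on top caps it at `limit`.
public class PartialLimitSketch {
    static int worstCaseRowsWithoutLocalMerge(int drivers, int limit) {
        return drivers * limit;
    }

    static int worstCaseRowsWithLocalMerge(int drivers, int limit) {
        return limit; // driver count no longer matters
    }
}
```

For example, with task concurrency 4 and LIMIT 10, up to 40 rows go downstream without the local merge, versus 10 with it. In Presto Classic this overhead is invisible because the downstream stage cancels the upstream one as soon as the limit is satisfied; in Presto on Spark the upstream stage must finish writing all of those rows first.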
That makes a lot of sense. Can you update the commit description also?
rschlussel
left a comment
Looks good pending update to pr/commit description and comment
Force-pushed from 072851e to 93fdd78
@rschlussel updated
viczhang861
left a comment
"Fix another deadlock in join spilling"
- "another" is not necessary, as people may wonder what the first one is.
- It feels like it is not a deadlock, but a future that is waiting for something that never happens; still looking.
Another deadlock was fixed by @rschlussel in #15975. This is the second one =D But yeah, I agree, the word
Force-pushed from 93fdd78 to 494ed24
The problem occurs when a spillable join is followed by a limit operator.

This is a very tricky bug to reproduce. In order to reproduce it, the task concurrency must be at least 2. The issue happens when the first probe operator (LookupJoinOperator) is able to produce enough rows to fill the limit operator, so early termination is triggered. It has to produce enough rows with only a single build-side partition unspilled. During early termination it doesn't release the currently open partition, and as long as the first partition is not released, the second one cannot be opened. Now the second instance of the LookupJoinOperator must not produce enough rows for early termination from joining with the first unspilled partition, and must request the second partition to unspill. But since the first operator never released the first partition, the second partition cannot be unspilled, so the second instance of the LookupJoinOperator cannot produce any additional output.

The issue is also specific to Presto on Spark; it doesn't reproduce on Classic Presto. In Classic Presto all stages run in parallel. When any partial limit operator produces enough rows for the downstream final limit to finish, the signal is propagated from the downstream task to the upstream task, and the upstream task is canceled, terminating all operators, including the LookupJoinOperators that are stuck. In Presto on Spark, for a stage to start running, all upstream stages have to finish first.

The fix is to always close open partitions on close, to make sure partitions are always released even when the LookupJoinOperator terminates early due to limit operator back pressure.
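The shape of the fix can be sketched as follows. This is a minimal, hypothetical illustration: the class and method names are invented for the sketch and are not Presto's actual LookupJoinOperator or PartitionedConsumption API.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of "always release the open partition on close".
public class CloseReleasesPartitionSketch {
    static class Partition {
        final AtomicBoolean released = new AtomicBoolean();
        void release() { released.set(true); }
    }

    static class Operator implements AutoCloseable {
        private Optional<Partition> openPartition = Optional.empty();

        void open(Partition partition) {
            openPartition = Optional.of(partition);
        }

        @Override
        public void close() {
            // Release the currently open partition even when the operator
            // is closed early (e.g. due to LIMIT back pressure) before
            // reaching its normal finish path. Otherwise the next partition
            // can never be unspilled, and a peer operator waiting on it
            // blocks forever.
            openPartition.ifPresent(Partition::release);
            openPartition = Optional.empty();
        }
    }
}
```

The key property is that release happens unconditionally in close, not only on the normal finish path, so early termination can no longer strand the remaining spilled partitions.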
The tests finish in less than 20m locally on a machine with NVMe flash. It looks like on the CI the bottleneck is IOPS. Decreasing task concurrency should result in less granular spills, decreasing overall IO pressure. Increasing the number of spiller threads is needed to avoid thread starvation when spilling is slow.
Force-pushed from 494ed24 to 3369d43
Test plan: