Improve memory for HashBuilderOperator unspill #16212
arhimondr merged 2 commits into prestodb:master
Conversation
force-pushed from 4487c91 to d63d1cf
arhimondr
left a comment
Let me try to summarize how I understand this change.
This PR moves the unspilling process from a background thread to the main thread. This is needed to be able to compact the page index iteratively to save memory. Please let me know if my understanding is incorrect.
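The memory effect at stake here can be sketched with a simplified model (hypothetical stand-in types; the real classes are Page, SingleStreamSpiller, and the pages index in presto-main). It only illustrates why keeping the background-loaded page list referenced while the index grows roughly doubles peak retained memory, whereas draining a queue on the operator thread keeps the peak near the total data size:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Simplified model, not the real Presto classes: a "page" is just its
// retained size in bytes, and the "index" is a running byte counter.
public class UnspillMemorySketch {
    // Old path: the future's List<Page> stays referenced while every page
    // is also copied into the index, so the peak is list + full index.
    static long peakRetainedWithFutureHeld(List<Long> pageSizes) {
        long listBytes = pageSizes.stream().mapToLong(Long::longValue).sum();
        long indexBytes = 0;
        long peak = 0;
        for (long page : pageSizes) {
            indexBytes += page;
            peak = Math.max(peak, listBytes + indexBytes);
        }
        return peak;
    }

    // New path: pages move from the queue into the index one at a time,
    // so each page is retained exactly once and the peak stays flat.
    static long peakRetainedWithQueueDrain(List<Long> pageSizes) {
        Queue<Long> queue = new ArrayDeque<>(pageSizes);
        long queueBytes = pageSizes.stream().mapToLong(Long::longValue).sum();
        long indexBytes = 0;
        long peak = queueBytes;
        while (!queue.isEmpty()) {
            long page = queue.remove();
            queueBytes -= page;   // the queue no longer retains the page
            indexBytes += page;   // the index now retains it
            peak = Math.max(peak, queueBytes + indexBytes);
        }
        return peak;
    }
}
```

For three pages of 10, 20, and 30 bytes, the first path peaks at the full list plus the full index, while the second never exceeds the total page size.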
LGTM % nits
Use log.debug or log.info (here and other places)
Done. I was using stdout for debugging purposes as it is easier to access on a Spark node.
Page compaction is a different issue; I will open a new PR to address it separately. In short, compaction makes a copy of the data and corrects a bug in the calculation of retained size for a deserialized page.
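To illustrate the idea (a generic sketch, not the actual Page#compact implementation): a deserialized page whose data is a slice of a large shared decode buffer retains the whole buffer; copying just the live region lets the large buffer become garbage-collectable, and the retained size then matches the data actually kept.

```java
import java.util.Arrays;

// Generic illustration of compaction; the real logic operates on
// com.facebook.presto.common.Page and its Blocks.
public class CompactionSketch {
    // A "page" holding a slice view into a (possibly much larger) buffer.
    static final class Slice {
        final byte[] buffer;
        final int offset;
        final int length;

        Slice(byte[] buffer, int offset, int length) {
            this.buffer = buffer;
            this.offset = offset;
            this.length = length;
        }

        // Retained size is the whole backing buffer, not just the slice.
        long retainedBytes() {
            return buffer.length;
        }

        // Compacting copies only the live region into a right-sized buffer.
        Slice compact() {
            return new Slice(Arrays.copyOfRange(buffer, offset, offset + length), 0, length);
        }
    }
}
```

A 16-byte slice over a 1 MB decode buffer retains 1 MB before compaction and 16 bytes after.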
It feels like we need to get rid of this codepath altogether. Loading pages in a background thread is generally a very bad idea, as the CPU used for page decoding is not accounted. Once we remove this codepath we will be able to get rid of the ListenableFuture&lt;List&lt;Page&gt;&gt; getAllSpilledPages interface and the thread pools used in Spiller internally. Do you think we can try to gradually enable this codepath in production and remove the old one eventually?
CC: @rschlussel @highker
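A hypothetical shape for such a replacement (interface and names assumed for illustration, not the actual Spiller API): instead of a ListenableFuture filled by a background thread, the spiller could hand back an iterator that decodes each page lazily on the calling operator thread, so the decode CPU is attributed to the driver that consumes it.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of an iterator-based spiller API; the real
// interfaces in com.facebook.presto.spiller differ.
public class LazySpillerSketch {
    interface PageSource {
        Iterator<String> getSpilledPages(); // decodes on the caller's thread
    }

    // Counts decodes so the lazy behavior is observable.
    static final AtomicInteger decodeCount = new AtomicInteger();

    static PageSource inMemorySpiller(List<String> encodedPages) {
        return () -> new Iterator<String>() {
            private final Iterator<String> raw = encodedPages.iterator();

            public boolean hasNext() {
                return raw.hasNext();
            }

            public String next() {
                decodeCount.incrementAndGet();   // "decode" happens here,
                return raw.next().toUpperCase(); // on the consuming thread
            }
        };
    }
}
```

Nothing is decoded until the operator pulls a page, so there is no unaccounted background CPU and no fully materialized list to retain.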
force-pushed from eaf85b4 to ac8648e
arhimondr
left a comment
Sorry for not mentioning it initially. Could you please add a TestPrestoSparkSpilledJoins test and enable the new code path there, so it remains tested?
presto-main/src/main/java/com/facebook/presto/operator/HashBuilderOperator.java (outdated; resolved)
force-pushed from ac8648e to 02b8cc6
Curious why you reordered these. Previously they were in the order in which you expect to encounter the different states.
The grouping was suggested by IntelliJ, maybe by some lint rule.
Could you undo this bit? I think it's easier to understand the different states when the options are listed in the order in which they happen.
What's the downside of this change? Why do we only do it for spill in a small memory pool?
Eventually I am planning to remove INPUT_UNSPILLING and related code completely, but I prefer to do it gradually just in case there is a regression. Another reason is to make testing and rollout easier: I can test and experiment with multiple changes without worrying about breaking a release.
force-pushed from 02b8cc6 to 9898c33
force-pushed from 9898c33 to 02254fe
rebase, trying to pass facebook-integration
presto-spark job timed out due to hanging join spill tests. Can you look into it?
@arhimondr @rschlussel com.facebook.presto.spark.TestPrestoSparkSpilledJoinQueries::testLimitWithJoin is a flaky test (it passed before the rebase but failed in recent runs). I was able to reproduce it locally but could not figure out the root cause after spending more than half a day; however, it is not related to this PR, so I disabled it for now. TestPrestoSparkJoinQueries takes 58 minutes to complete; have you seen such large wall time variance before?
2021-06-16T11:42:20.167-0500 WARN pool-2-thread-1 com.facebook.presto.testng.services.LogTestDurationListener Tests from com.facebook.presto.spark.TestPrestoSparkJoinQueries took 58.01m
We saw spill join tests hang for other modules before #15975, but that PR fixed it 100% for those cases.
@rschlussel It could be a similar issue; tuning task.concurrency and the hash partition count prevents the issue from happening, but I need more time to figure out the root cause.
force-pushed from 0f1c2d5 to 2c9ca64
Lower task concurrency improves presto-spark-base test suite run time to 41 minutes; will rerun a few more times to confirm.
Rerun:
Rerun 06-21-2021:
force-pushed from 2c9ca64 to d954918
Based on my understanding this PR addresses two problems:

However, in addition to fixing these 2 bugs, this PR also changes an important aspect of the spilling design. For some reason spilling was designed to do spills / unspills in the background when spilling / unspilling cannot be done iteratively (one or a few pages at a time); unspill is only done synchronously when it can be done iteratively.

I'm not exactly sure why the decision to spill in the background was made, but I guess it is related to a high level design principle.

It's an interesting tradeoff, and it feels like the right, long term solution would be to move the spill/unspill process from the background thread back to the main execution thread.

@rschlussel @highker Since you guys are the spilling experts we would love to hear your opinion. Do you think it makes sense to move the unspilling process in this single place from the background without making it iterative? Or do you think we should postpone this decision, leave it as is, fix the bugs, and come back to it later once we are ready to approach this problem more holistically?
Fix for the hanging test: #16293
That's a good point that I hadn't considered. I don't think it makes sense to move spill to the main process without making it iterative (certainly not without very rigorous testing). It might not be so bad for Presto on Spark because you'll only affect the query that's doing the spilling, but in a multitenant environment breaking that assumption can negatively affect other queries.
force-pushed from d954918 to b35b20a
It is important to allow cancelling the long running thread, thus it makes sense to keep using
This line looks like a trivial change, but it is not. The INPUT_UNSPILLING state requires it to be present; that's why it can only be destroyed after the state changes to the next state. In case there are other dependencies on unspillInProgress, the code path was controlled by this session property.
The Operator access is always single threaded and the finishLookupSourceUnspilling invocation is atomic. Thus in theory an intermittent state inconsistency should never be observed.
Could you please elaborate more on why the finishLookupSourceUnspilling reference cannot be set to Optional.empty() right after the Queue<Page> pages = new ArrayDeque<>(unspilledPages);?
What was being retained here without nulling it out? The comment earlier suggests we use a queue so that we don't retain the unspilled pages after they get added to the index. Were we still actually holding on to all of it because of the unspillInProgress future? Did that cause JVM OOMs in the small memory pool environment? It wasn't being accounted for in the query memory, so it wouldn't have caused query OOMs.
I'm having a hard time trying to understand how this codepath improves the memory footprint for small memory pools.
From what I understand there are two differences:
- The new code path updates the memory reservation in one shot
- The new code path nullifies the unspillInProgress reference
While the second difference makes sense to me, could you please elaborate more on the first?
- The new code path updates the memory reservation in one shot
This line can be moved before the if condition, but if so, refactoring of the old path is needed. I prefer not to refactor the old path.
I'm also confused about this change. It looks like the main difference between this and the else block is that we are adding all the pages to the index all at once, and no longer updating the memory after adding each page to the index (with the queue it's no longer part of retainedSizeOfUnspilledPages, but will be accounted for in index.getEstimatedSize()). However, this will not accurately reflect the memory used by the index in the meantime, and I wonder if that could cause its own problems.
If the problem is just that the unspilled pages are retained by the unspillInProgress future, why not clear it as soon as we add them to the queue? That way we can get more accurate accounting from the index as we add pages to it and still get the benefit of having a page only in the queue or the index, but not both.
@rschlussel Because if unspillInProgress is not cleared, removing a page from the queue won't really release memory. As @arhimondr pointed out, this function is atomic, so it should be safe to clear it and update memory while removing pages from the queue.
I think we might be saying the same thing here. I'm suggesting nulling out unspillInProgress as soon as we create the queue and leaving everything else as it was before this change. That way removing the page from the queue will release the memory.
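The shape being converged on can be sketched with stand-in types (the real code uses Guava's ListenableFuture, the operator's local memory context, and the pages index; none of those appear here): clear the unspillInProgress reference as soon as the queue is built, then update the reservation as each page moves from the queue into the index.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Optional;
import java.util.Queue;

// Stand-in sketch: a page is its retained size; the "memory context" is a
// plain counter; the "index" is a running total.
public class DrainSketch {
    long reservedBytes; // stand-in for localUserMemoryContext.setBytes(...)
    long indexBytes;    // stand-in for index.getEstimatedSize()
    Optional<List<Long>> unspillInProgress = Optional.empty();

    void finishUnspilling(List<Long> unspilledPages) {
        unspillInProgress = Optional.of(unspilledPages);
        Queue<Long> pages = new ArrayDeque<>(unspilledPages);
        // Release the future's reference immediately, so each page is
        // retained exactly once: first by the queue, then by the index.
        unspillInProgress = Optional.empty();
        long queueBytes = pages.stream().mapToLong(Long::longValue).sum();
        while (!pages.isEmpty()) {
            long page = pages.remove();
            queueBytes -= page;
            indexBytes += page; // index.addPage(page) in the real code
            reservedBytes = indexBytes + queueBytes; // per-page accounting
        }
    }
}
```

This keeps the per-page accounting of the old path while still getting the benefit of the queue, because no second reference to the page list survives the drain.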
presto-main/src/main/java/com/facebook/presto/operator/HashBuilderOperator.java (outdated; resolved)
The spilling / unspilling operations in the current implementation are non-cancellable. If the task is already scheduled on a thread pool this will be a no-op. It feels like if we want to improve cancellation we also need to make sure the background spilling tasks are cancellable.
@arhimondr can you explain why this isn't cancellable? Won't it get interrupted?
@rschlussel It will get an interrupt flag set, but the task is not checking it.
Ah, that's too bad. In that case, I think it could be misleading to cancel the task since it would be a no-op. It would be better to do this in conjunction with adding proper cancellation support.
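The point about cancellation can be shown with a small standalone sketch (not the actual spiller code): Future.cancel(true) only sets the thread's interrupt flag, so a spill loop that never checks the flag runs to completion anyway; adding a cooperative check is what makes cancellation effective.

```java
// Standalone sketch: a "spill" loop that honors the thread interrupt flag.
public class CancellableSpillSketch {
    static int spillPages(int pageCount) {
        int spilled = 0;
        for (int i = 0; i < pageCount; i++) {
            if (Thread.currentThread().isInterrupted()) {
                break; // cooperative cancellation point the current task lacks
            }
            spilled++; // writePage(...) in a real spiller
        }
        return spilled;
    }
}
```

Without the isInterrupted() check, interrupting the thread has no effect on the loop, which is exactly why cancelling the scheduled task today is a no-op.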
@rschlussel Removed the cancellation code; will handle that in a separate PR.
presto-main/src/main/java/com/facebook/presto/spiller/TempStorageSingleStreamSpiller.java (outdated; resolved)
I'm a little confused about this test. If the goal is to disable it, why is it necessary to change the implementation?
It narrows down which query caused the failure.
Do you think leaving a note in a comment section would suffice?
+1 I think a comment would be more helpful. Otherwise it's likely that someone will just switch it to enabled.
I would recommend calling super.testLimitWithJoin() or simply leaving the method body empty (maybe with only the comment explaining why the test is disabled) instead of copying the entire test implementation (similarly to what we do in other places: https://github.com/prestodb/presto/blob/master/presto-spark-base/src/test/java/com/facebook/presto/spark/TestPrestoSparkAbstractTestQueries.java#L74, https://github.com/prestodb/presto/blob/master/presto-hive/src/test/java/com/facebook/presto/hive/TestHiveDistributedQueries.java#L67)
An operator that is blocked is removed from scheduling until it is unblocked. The threads from the main thread pool can be reused for running other operators, tasks or queries.
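This scheduling contract can be sketched with a CompletableFuture standing in for the Guava ListenableFuture that the real Operator#isBlocked returns: the driver checks the future and skips the operator until it completes, freeing the thread for other drivers.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the blocked-operator contract; the real interface is
// com.facebook.presto.operator.Operator#isBlocked, which returns a
// Guava ListenableFuture (CompletableFuture used here to stay stdlib-only).
public class BlockedOperatorSketch {
    private final CompletableFuture<Void> unspillDone = new CompletableFuture<>();

    // The driver checks isBlocked().isDone() before scheduling the operator.
    CompletableFuture<Void> isBlocked() {
        return unspillDone;
    }

    // Called when the background unspill finishes; the operator becomes
    // runnable again on the next driver iteration.
    void onUnspillFinished() {
        unspillDone.complete(null);
    }
}
```

While the future is incomplete the operator consumes no thread; once it completes, the driver resumes scheduling it.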
It looks like the issue described here was fixed in Trino and backported to prestodb by @rschlussel: #15975. This looks like a different issue (#16293).
force-pushed from b35b20a to 1c63b75
arhimondr
left a comment
This PR now only addresses a single problem: the unspillInProgress future over-retaining pages. The second problem, related to pages being non-compact after unspill, is not addressed here. It should be fine to address it in a separate PR. Could you please update the commit messages accordingly?
Why is this extra test needed? There should be plenty of join-without-limit tests in the AbstractTestJoinQueries test suite.
- It was used for comparison with the same query when limit was added. This test was removed and described in a comment.
- The issue of "page not being compacted" will be in a different PR so that @rschlussel doesn't get confused, as she didn't have context about the other issue.
- Before the fix, all unspilled data was read into memory and held until the HashBuilderOperator was destroyed. Nullifying the unspilled pages allows the memory to be freed.
- Add Presto on Spark spill test for join queries.
- Add Presto spill test for join queries.
The test hangs when spilling is enabled for Presto on Spark.
force-pushed from 1c63b75 to 0a618be
TestMongoDistributedQueries fails; created #16326 to track.
What is improved: