Implement Spilling for TopNRowNumber Operator#18400
rschlussel merged 1 commit into prestodb:master from
Conversation
Thanks for the detailed algorithm description. Question, and please correct me if I'm not understanding this right: Without spill, we keep the top-n elements for k groups in memory at a time, so the maximum memory it would use is k * n. With this spill approach, we process one group at a time, but generate a page for all of the values in each group, so the maximum memory will be the size of the largest group. If some group is very large compared to k * n, you could actually end up using more memory with spill. Is that correct? Or is it the case that because it's pipelined, you don't actually keep the whole group in memory: you have a page of that group in memory, it goes to the heap which keeps only the top n, and then you process more pages from the group? I was basically wondering which of these it is.
This is a very good point.
Yes, your suspicion is right. So I can see a case, based on the current implementation, where this would still fail, because the output of the merge sort buffers full groups into pages and does not let groups span across pages.
Yes, because we now have a worst case of M (spill files) x N values for that group (instead of just N) to accommodate in memory; e.g., with N = 10 and M = 100 spill files, a single huge group could need up to 1,000 rows in memory at merge time. But note that, in this case, if we didn't use spill and tried to do it in memory, we would surely fail.
Are you suggesting that after we sort on write, we can skip on read to reduce memory pressure? IMO, a simpler approach is to switch from page pipelining to row pipelining, so the topNBuilder would do the sort and skip implicitly. I am still trying to work out the conditions under which this would theoretically happen (the input TopN heaps should eliminate a lot of values). I guess it takes a big N and input ordered such that this big group is spread across the maximum number of spill files? I didn't find such cases in the production sample of 200 queries. Maybe I should look at a bigger sample.
I think I missed that it was stored on disk as a heap, so we never keep more than n values per group per file. So it sounds like without spill we run out of memory if there are a lot of groups (because we are maintaining so many heaps), with n also playing a role in how many heaps we can support. With spill, if you have a small or medium number of groups but they are extremely large, would you end up with a lot of files for every group, so that you could use more memory with spill than without? It sounds like for a large number of groups with a lot of data in each group, it would fail with OOM both with and without spill.
Yes, that is correct. So this would be another case where the in-memory and spill versions would both fail.
A small number of large groups wouldn't necessarily fail without spill, because the amount of memory doesn't grow with the size of the group. However, if spill gets triggered for some reason, it could fail with OOM: because each group is large, it will use more memory.
Ah yes. I was under the impression that up until you exceeded local memory limits, the in-memory and spill paths are effectively the same. But when I looked at the operator revoke logic, I see what you mean. Essentially, when there is revocable memory pressure, regardless of which operator is causing it, all operators are spilled until the pressure is tolerable. This makes the spill and in-memory paths quite different, even if the operator is using very little memory. Interesting...
I wonder if other Operators that use a similar approach (example: HashAggregationOperator/SpillableHashAggregationBuilder) are also susceptible to the same behavior. For example, if we are doing a distinct operation, it would consume more memory if we kept spilling all the intermediate hash tables and then tried to read them back, because there too the merge sort output buffers until you have read all the values for the group - https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/MergeHashSort.java#L62 This could be a good opportunity to fix this in all such instances.
rschlussel
left a comment
looks good aside from adding the missing requireNonNull checks
```java
        DriverYieldSignal driverYieldSignal,
        SpillerFactory spillerFactory)
{
    this.inputInMemoryGroupedTopNBuilderSupplier = inputInMemoryGroupedTopNBuilderSupplier;
```
can you add requireNonNull checks for all the non-primitive fields
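For illustration only, a minimal sketch of the requested change; the class name and exact parameter list are assumed, with field names taken from the snippet above, and the null-message convention mirrors common Presto style:

```java
import java.util.function.Supplier;

import static java.util.Objects.requireNonNull;

// Hypothetical constructor, shown only to illustrate the requested checks
public SpillableGroupedTopNBuilder(
        DriverYieldSignal driverYieldSignal,
        SpillerFactory spillerFactory,
        Supplier<GroupedTopNBuilder> inputInMemoryGroupedTopNBuilderSupplier)
{
    this.driverYieldSignal = requireNonNull(driverYieldSignal, "driverYieldSignal is null");
    this.spillerFactory = requireNonNull(spillerFactory, "spillerFactory is null");
    this.inputInMemoryGroupedTopNBuilderSupplier = requireNonNull(
            inputInMemoryGroupedTopNBuilderSupplier,
            "inputInMemoryGroupedTopNBuilderSupplier is null");
}
```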
If you squash the commits together I can merge it.
@rschlussel Thanks! I am running another round of regression tests. Will comment here when it's done.
@rschlussel I have found the cause of the memory leak discovered during final testing. I have also fixed it. But I still see about 10% of the cases OOMing. Working on fixing that.
When you finish, I'd love to learn how you debugged it.
@rschlussel Fixed the issues and re-tested the following setups
Based on these test results, I believe we should be good for final review and merge.
@shrinidhijoshi did you look at the column and row count mismatches from this test: https://www.internalfb.com/intern/presto/verifier/results/?test_id=101281. I want to be sure we're not introducing a correctness bug.
@rschlussel When I looked into the errors in other suites, it looked like it's due to non-determinism, but the verifier isn't recognizing it as such. I will do the same validation for this suite (https://www.internalfb.com/intern/presto/verifier/results/?test_id=101281) as well. I have submitted a new suite of these
@rschlussel Looks like the column mismatch cases now succeed.
Any ideas?
Looking back at the failures, the determinism check failed, so it marked the tests as failed since it couldn't tell whether they were deterministic. I guess they were non-deterministic.
rschlussel
left a comment
can you explain what was the root cause of the crashes you were seeing? Was it the missing updateMemoryReservation calls or some combination of things?
when would the inMemoryGroupedTopNBuilder be not empty but revocable memory be > 0?
Checking localRevocableMemoryContext tells us whether we are inside the buildResult function and the migrateMemoryContext function, which means the current input is either going to be returned or spilled, so the startRevokingThread doesn't need to take any action on that same input (specifically, it must not trigger another spill).
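A hedged sketch of that guard, assuming the operator keeps a localRevocableMemoryContext field and a spillToDisk() helper (both names illustrative):

```java
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

// Sketch: if nothing is held in revocable memory, buildResult/migrateMemoryContext
// is already returning or spilling the current input, so the revoke path must not
// trigger a second spill of the same input.
@Override
public ListenableFuture<?> startMemoryRevoke()
{
    if (localRevocableMemoryContext.getBytes() == 0) {
        return Futures.immediateFuture(null);
    }
    return spillToDisk();
}
```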
Want to make sure I understand what's going on here - it looks like previously we would build the result and switch to user memory as long as inputInMemoryGroupedTopNBuilder fit in the user memory, and now we only do it if we haven't started spilling. Is that correct?
Context to understand this change: the current design is that, if we have previously spilled input, then we will spill the last accumulated input regardless of whether it fits into userMem or not.
This code change is just trying to cleanly implement accounting for that design, i.e. if we have previously spilled, then do not try to unnecessarily move to localMemory, because we are going to spill it anyway.
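A minimal sketch of that accounting rule, assuming a hasPreviousSpill flag, the memory-context trySetBytes call, and illustrative field/helper names:

```java
// Sketch of the decision when input is finished: once we have spilled before,
// do not migrate the last accumulated input to user memory; it will be spilled anyway.
private void finishInput()
{
    if (hasPreviousSpill) {
        spillInMemoryBuilder();
    }
    else if (localUserMemoryContext.trySetBytes(inMemoryBuilder.getEstimatedSizeInBytes())) {
        // fits in user memory: release the revocable reservation and build output directly
        localRevocableMemoryContext.setBytes(0);
        outputIterator = inMemoryBuilder.buildResult();
    }
    else {
        // does not fit in user memory either: spill and fall through to the unspill path
        spillInMemoryBuilder();
    }
}
```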
So this wasn't causing the OOM issues, right, because we would have failed otherwise? It was just an additional thing that needed to be fixed (though also related to memory management)?
Yes that's correct.
This was an interesting one. From what I understood, the way memory accounting works inside the Spiller is imperfect. For example, the
just squash the commits together and it'll be good to merge.
I further verified this by looking at other spilling Operators. Once a spill is started, the
Done
Implement Spilling for TopNRowNumber Operator.
The general idea for spill is: when the input does not fit in memory, we spill it, sorted by groupId, to disk. We keep doing this until all input is consumed. Then, when we want to generate output, we merge sort all the spill files on disk and process a few groupIds at a time to produce the final output.
What does Spill mean in the case of TopNRowNumber Operator?
TopN is calculated using a heap per groupId (i.e. if we are tracking the top 5 for K groups, then we maintain K heaps of max size 5 each). When we decide to spill, we extract all (K*5) rows across these K heaps, ordered by groupId, and serialize them to disk.
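As a rough sketch, assuming the builder can emit its heap contents ordered by groupId and a Spiller-style spill(Iterator&lt;Page&gt;) call (names illustrative):

```java
import java.util.Iterator;
import com.google.common.util.concurrent.ListenableFuture;

// Sketch: drain the K per-group heaps in groupId order and hand the resulting
// pages to the spiller, so every spill file on disk is sorted by groupId.
private ListenableFuture<?> spillInMemoryBuilder()
{
    // buildResult() is assumed here to emit rows ordered by groupId
    Iterator<Page> rowsSortedByGroupId = inMemoryBuilder.buildResult();
    return spiller.spill(rowsSortedByGroupId);
}
```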
What does Unspill mean?
Unspill has 2 parts/stages (pipelined).
1st part - merge sorts the entries across all the spill files and creates new Pages at the group boundaries. This means that each page generated by the merge sort will contain ALL the data for a subset of the groups. This helps us in the next stage.
2nd part - we create an inMemoryGroupedTopNBuilder and pass each page to it. The distinction here (vs. the inMemoryGroupedTopNBuilder we created during input) is that as soon as we process a page, we are sure that it contains all the data for the set of groups encapsulated in it. So we can safely generate the output, flush the builder, and then move on to process the next page.
This is what enables us to process data that initially did not fit in memory.
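A hedged sketch of the two stages, with mergeSortedByGroupId and emit as assumed helpers (the builder API details are simplified):

```java
import java.util.Iterator;

// Sketch of the pipelined unspill path
private void unspill()
{
    // Stage 1: merge sort all spill files by groupId, cutting pages only at group
    // boundaries, so each page holds ALL the rows of the groups it contains.
    Iterator<Page> groupAlignedPages = mergeSortedByGroupId(spiller.getSpills());

    // Stage 2: top-N each page with a fresh builder; since no more rows can arrive
    // for these groups, it is safe to build the output and flush immediately.
    while (groupAlignedPages.hasNext()) {
        GroupedTopNBuilder outputBuilder = outputInMemoryGroupedTopNBuilderSupplier.get();
        outputBuilder.processPage(groupAlignedPages.next()).process();
        emit(outputBuilder.buildResult());
    }
}
```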
When is the spill triggered?
The Spill of data can be triggered in 2 cases:
1. We are still accumulating input (addInput()) and we exceeded the revoke-memory-threshold. This triggers the revoke flow.
2. getOutput() is called and the operator attempts to move the input collected so far from revokableMemory to userMemory to create output pages, and this fails because the collected input is too large to fit in userMemory.
A picture is worth a thousand words :)
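A condensed sketch of the two trigger paths above; startMemoryRevoke and getOutput are the standard operator entry points, while the fields and helpers here are assumed:

```java
import com.google.common.util.concurrent.ListenableFuture;

// Case 1: the engine asks us to revoke memory while we are still consuming input
// (we crossed the revoke-memory-threshold).
@Override
public ListenableFuture<?> startMemoryRevoke()
{
    return spillInMemoryBuilder();
}

// Case 2: at output time, try to migrate the collected input from revocable to
// user memory; if it does not fit, spill it too and produce output via unspill.
@Override
public Page getOutput()
{
    if (!localUserMemoryContext.trySetBytes(inMemoryBuilder.getEstimatedSizeInBytes())) {
        spillInMemoryBuilder();
        return nextUnspilledOutputPage();  // assumed helper
    }
    localRevocableMemoryContext.setBytes(0);
    return nextInMemoryOutputPage();  // assumed helper
}
```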
Test Plan
[[ Feature test - Currently OOMing on TopNRowNumber ]] - Tested the current implementation on production queries that failed with LOCAL_MEMORY_LIMIT_EXCEEDED where the TopNRowNumber Operator was the one consuming the most memory.
Results:
Set of 20 - 100% now succeed (https://our.internmc.facebook.com/intern/presto/verifier/results/?test_id=95183)
Set of 200 - 99.5% now succeed (https://our.internmc.facebook.com/intern/presto/verifier/results/?test_id=95197)
[[ Regression test - on queries that currently pass ]] TBD.