Implement Spilling for TopNRowNumber Operator#18400
rschlussel merged 1 commit into prestodb:master from
Conversation
Thanks for the detailed algorithm description. Question, and please correct me if I'm not understanding this right: Without spill, we keep the top-n elements for k groups in memory at a time, so the maximum memory it would use is k * n. With this spill approach, we process one group at a time, but generate a page for all of the values in each group, so the maximum memory will be the size of the largest group. If some group is very large compared to k * n, you could actually end up using more memory with spill. Is that correct? Or is it the case that because it's pipelined, you don't actually keep the whole group in memory: you have a page of that group in memory, it goes to the heap which keeps only the top n, and then you process more pages from the group? I was basically wondering which of these it is.
This is a very good point.
Yes, your suspicion is right. So I can see a case, based on the current implementation, where this would still fail, because the output of the merge sort buffers full groups into pages and does not let groups span across pages.
Yes, because we now have a worst case of M (spill files) x N values for that group (instead of just N) to accommodate in memory; e.g., with N = 10 and M = 100 spill files, a single huge group could need up to 1,000 rows in memory at merge time. But note that, in this case, if we didn't use spill and tried to do it in memory, we would surely fail.
Are you suggesting that after we sort on write, we can skip on read to reduce memory pressure? IMO, a simpler approach is to switch from page pipelining to row pipelining, so the topNBuilder would do the sort and skip implicitly. I am still trying to work out the conditions under which this would theoretically happen (the input TopN heaps should eliminate a lot of values). I guess it takes a big N and input ordered such that this big group is spread across the maximum number of spill files? I didn't find such cases in the production sample of 200 queries. Maybe I should look at a bigger sample.
I think I missed that it was stored on disk as a heap, so we never keep more than n values per group per file. So it sounds like without spill we run out of memory if there are a lot of groups (because we are maintaining so many heaps), with n also playing a role in how many heaps we can support. With spill, if you have a small or medium number of groups but they are extremely large, would you end up with a lot of files for every group, so that you could use more memory with spill than without? It sounds like for a large number of groups with a lot of data in each group, it would fail with OOM both with and without spill.
Yes, that is correct. So this would be another case where the in-memory and spill versions would both fail.
A small number of large groups wouldn't necessarily fail without spill, because the amount of memory doesn't grow with the size of the group. However, if spill gets triggered for some reason, it could fail with OOM: because each group is large, it will use more memory.
Ah yes. I was under the impression that up until you exceeded local memory limits, the in-memory and spill paths are effectively the same. But when I looked at the operator revoke logic, I see what you mean. Essentially, when there is revocable memory pressure, regardless of which operator is causing it, all operators are spilled until the pressure is tolerable. This makes the spill and in-memory paths quite different, even if the operator is using very little memory. Interesting...
I wonder if other Operators that use a similar approach (example: HashAggregationOperator/SpillableHashAggregationBuilder) are also susceptible to the same behavior. For example, if we are doing a distinct operation, it would consume more memory if we kept spilling all the intermediate hash tables and then tried to read them back, because there too the merge sort output buffers until you have read all the values for the group - https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/MergeHashSort.java#L62 This could be a good opportunity to fix this in all such instances.
rschlussel
left a comment
looks good aside from adding the missing requireNonNull checks
```java
        DriverYieldSignal driverYieldSignal,
        SpillerFactory spillerFactory)
{
    this.inputInMemoryGroupedTopNBuilderSupplier = inputInMemoryGroupedTopNBuilderSupplier;
```
can you add requireNonNull checks for all the non-primitive fields
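For illustration only, a minimal sketch of the requested change; the class name and exact parameter list are assumed, with field names taken from the snippet above, and the null-message convention mirrors common Presto style:

```java
import java.util.function.Supplier;

import static java.util.Objects.requireNonNull;

// Hypothetical constructor, shown only to illustrate the requested checks
public SpillableGroupedTopNBuilder(
        DriverYieldSignal driverYieldSignal,
        SpillerFactory spillerFactory,
        Supplier<GroupedTopNBuilder> inputInMemoryGroupedTopNBuilderSupplier)
{
    this.driverYieldSignal = requireNonNull(driverYieldSignal, "driverYieldSignal is null");
    this.spillerFactory = requireNonNull(spillerFactory, "spillerFactory is null");
    this.inputInMemoryGroupedTopNBuilderSupplier = requireNonNull(
            inputInMemoryGroupedTopNBuilderSupplier,
            "inputInMemoryGroupedTopNBuilderSupplier is null");
}
```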
If you squash the commits together I can merge it.
@rschlussel Thanks! I am running another round of regression tests. Will comment here when it's done.
@rschlussel I have found the cause of the memory leak discovered during final testing. I have also fixed it. But I still see about 10% of the cases OOMing. Working on fixing that.
When you finish, I'd love to learn how you debugged it.
@rschlussel Fixed the issues and re-tested the following setups
Based on these test results, I believe we should be good for final review and merge.
@shrinidhijoshi did you look at the column and row count mismatches from this test: https://www.internalfb.com/intern/presto/verifier/results/?test_id=101281. I want to be sure we're not introducing a correctness bug.
@rschlussel When I looked into the errors in other suites, it looked like it's due to non-determinism, but the verifier isn't recognizing it as such. I will do the same validation for this suite (https://www.internalfb.com/intern/presto/verifier/results/?test_id=101281) as well. I have submitted a new suite of these
@rschlussel Looks like the column mismatch cases now succeed.
Any ideas?
Looking back at the failures, the determinism check failed, so it marked the tests as failed since it couldn't tell whether they were deterministic. I guess they were non-deterministic.
rschlussel
left a comment
can you explain what was the root cause of the crashes you were seeing? Was it the missing updateMemoryReservation calls or some combination of things?
when would the inMemoryGroupedTopNBuilder be not empty but revocable memory be > 0?
Checking localRevocableMemoryContext tells us whether we are inside the buildResult function and the migrateMemoryContext function, which means the current input is either going to be returned or spilled, so the startRevokingThread doesn't need to take any action on that same input (specifically, it must not trigger another spill).
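A hedged sketch of that guard, assuming the operator keeps a localRevocableMemoryContext field and a spillToDisk() helper (both names illustrative):

```java
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

// Sketch: if nothing is held in revocable memory, buildResult/migrateMemoryContext
// is already returning or spilling the current input, so the revoke path must not
// trigger a second spill of the same input.
@Override
public ListenableFuture<?> startMemoryRevoke()
{
    if (localRevocableMemoryContext.getBytes() == 0) {
        return Futures.immediateFuture(null);
    }
    return spillToDisk();
}
```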
Want to make sure I understand what's going on here - it looks like previously we would build the result and switch to user memory as long as inputInMemoryGroupedTopNBuilder fit in the user memory, and now we only do it if we haven't started spilling. Is that correct?
Context to understand this change: the current design is that, if we have previously spilled input, then we will spill the last accumulated input regardless of whether it fits into userMem or not.
This code change is just trying to cleanly implement accounting for that design, i.e. if we have previously spilled, then do not try to unnecessarily move to localMemory, because we are going to spill it anyway.
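A minimal sketch of that accounting rule, assuming a hasPreviousSpill flag, the memory-context trySetBytes call, and illustrative field/helper names:

```java
// Sketch of the decision when input is finished: once we have spilled before,
// do not migrate the last accumulated input to user memory; it will be spilled anyway.
private void finishInput()
{
    if (hasPreviousSpill) {
        spillInMemoryBuilder();
    }
    else if (localUserMemoryContext.trySetBytes(inMemoryBuilder.getEstimatedSizeInBytes())) {
        // fits in user memory: release the revocable reservation and build output directly
        localRevocableMemoryContext.setBytes(0);
        outputIterator = inMemoryBuilder.buildResult();
    }
    else {
        // does not fit in user memory either: spill and fall through to the unspill path
        spillInMemoryBuilder();
    }
}
```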
So this wasn't causing the OOM issues, right, because we would have failed otherwise? It was just an additional thing that needed to be fixed (though also related to memory management)?
Yes that's correct.
This was an interesting one. From what I understood, the way memory accounting works inside the Spiller is imperfect. For example, the
just squash the commits together and it'll be good to merge.
I further verified this by looking at other spilling Operators. Once a spill is started, the
Done
Implement Spilling for TopNRowNumber Operator.
The general idea for spill is: when the input does not fit in memory, we spill it, sorted by groupId, to disk. We keep doing this until all input is consumed. Then, when we want to generate output, we merge sort all the spill files on disk and process a few groupIds at a time to produce the final output.
What does Spill mean in the case of TopNRowNumber Operator?
TopN is calculated using a heap per groupId (i.e. if we are tracking the top 5 for K groups, then we maintain K heaps of max size 5 each). When we decide to spill, we extract all (K*5) rows across these K heaps, ordered by groupId, and serialize them to disk.
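As a rough sketch, assuming the builder can emit its heap contents ordered by groupId and a Spiller-style spill(Iterator&lt;Page&gt;) call (names illustrative):

```java
import java.util.Iterator;
import com.google.common.util.concurrent.ListenableFuture;

// Sketch: drain the K per-group heaps in groupId order and hand the resulting
// pages to the spiller, so every spill file on disk is sorted by groupId.
private ListenableFuture<?> spillInMemoryBuilder()
{
    // buildResult() is assumed here to emit rows ordered by groupId
    Iterator<Page> rowsSortedByGroupId = inMemoryBuilder.buildResult();
    return spiller.spill(rowsSortedByGroupId);
}
```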
What does Unspill mean?
Unspill has 2 parts/stages (pipelined).
1st part - merge sorts the entries across all the spill files and creates new Pages at the group boundaries. This means that each page generated by the merge sort will contain ALL the data for a subset of the groups. This helps us in the next stage.
2nd part - we create an inMemoryGroupedTopNBuilder and pass each page to it. The distinction here (vs. the inMemoryGroupedTopNBuilder we created during input) is that as soon as we process a page, we are sure that it contains all the data for the set of groups encapsulated in it. So we can safely generate the output, flush the builder, and then move on to process the next page.
This is what enables us to process data that initially did not fit in memory.
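A hedged sketch of the two stages, with mergeSortedByGroupId and emit as assumed helpers (the builder API details are simplified):

```java
import java.util.Iterator;

// Sketch of the pipelined unspill path
private void unspill()
{
    // Stage 1: merge sort all spill files by groupId, cutting pages only at group
    // boundaries, so each page holds ALL the rows of the groups it contains.
    Iterator<Page> groupAlignedPages = mergeSortedByGroupId(spiller.getSpills());

    // Stage 2: top-N each page with a fresh builder; since no more rows can arrive
    // for these groups, it is safe to build the output and flush immediately.
    while (groupAlignedPages.hasNext()) {
        GroupedTopNBuilder outputBuilder = outputInMemoryGroupedTopNBuilderSupplier.get();
        outputBuilder.processPage(groupAlignedPages.next()).process();
        emit(outputBuilder.buildResult());
    }
}
```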
When is the spill triggered?
The Spill of data can be triggered in 2 cases:
1. We are still accumulating input (addInput()) and we exceeded the revoke-memory-threshold. This triggers the revoke flow.
2. getOutput() is called and the operator attempts to move the input collected so far from revokableMemory to userMemory to create output pages, and this fails because the collected input is too large to fit in userMemory.
A picture is worth a thousand words :)
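A condensed sketch of the two trigger paths above; startMemoryRevoke and getOutput are the standard operator entry points, while the fields and helpers here are assumed:

```java
import com.google.common.util.concurrent.ListenableFuture;

// Case 1: the engine asks us to revoke memory while we are still consuming input
// (we crossed the revoke-memory-threshold).
@Override
public ListenableFuture<?> startMemoryRevoke()
{
    return spillInMemoryBuilder();
}

// Case 2: at output time, try to migrate the collected input from revocable to
// user memory; if it does not fit, spill it too and produce output via unspill.
@Override
public Page getOutput()
{
    if (!localUserMemoryContext.trySetBytes(inMemoryBuilder.getEstimatedSizeInBytes())) {
        spillInMemoryBuilder();
        return nextUnspilledOutputPage();  // assumed helper
    }
    localRevocableMemoryContext.setBytes(0);
    return nextInMemoryOutputPage();  // assumed helper
}
```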
Test Plan
[[ Feature test - Currently OOMing on TopNRowNumber ]] - Tested the current implementation on production queries that failed with LOCAL_MEMORY_LIMIT_EXCEEDED where the TopNRowNumber Operator was the one consuming the most memory.
Results:
Set of 20 - 100% now succeed (https://our.internmc.facebook.com/intern/presto/verifier/results/?test_id=95183)
Set of 200 - 99.5% now succeed (https://our.internmc.facebook.com/intern/presto/verifier/results/?test_id=95197)
[[ Regression test - on queries that currently pass ]] TBD.