[SPARK-4808] Remove Spillable minimum threshold and sampling #3656

drawlerr · 2014-12-10T01:52:37Z

No description provided.

andrewor14 · 2014-12-10T21:25:49Z

core/src/main/scala/org/apache/spark/util/collection/Spillable.scala

this is not correct. You skipped elementsRead % 32

This was actually intentional - see below

andrewor14 · 2014-12-10T21:26:45Z

Hey @lawlerd I don't think we want to get rid of this check because we want the thread to ramp up and not just spill all the time. Would it make sense to make the spill threshold configurable instead?

andrewor14 · 2014-12-10T21:26:55Z

ok to test

drawlerr · 2014-12-10T22:39:58Z

Thanks for the review, @andrewor14!

I figured that to be the case. I weighed this against the hazard of OOMs and figured it was the lesser evil, but I might have missed something.
In which situations does a configurable count help?

SparkQA · 2014-12-10T23:17:50Z

Test build #24320 has finished for PR 3656 at commit 2df2e2d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sryza · 2014-12-11T19:48:25Z

@lawlerd things are done this way because estimating the size for every record would be prohibitively expensive. Also, the trackMemoryThreshold is required at least until we figure out a solution for SPARK-4452. Without it, when there are multiple shuffle data structures in a thread and the first takes a bunch of memory, the second ends up spilling on every record (this was a blocker for 1.2).

Your concern of course is valid - that we're not tracking memory 100% accurately. One response to this is that we're conservative with. E.g. we only use up to spark.shuffle.safetyFraction (default 80%) of the available shuffle memory.

One improvement that might make sense would be to do the sampling based on memory size rather than number of records. So if we notice that records are larger we would sample more frequently and maybe adjust the trackMemoryThreshold.

mccheah · 2015-01-26T19:55:04Z

Seeing some problems in my usage of Spark that this PR could address, so reviving this thread.

@lawlerd the configurable count would help because if it is known that the individual objects would be large, the sampling could be set to be done more frequently. So if sampling every 32 times is too passive then a more aggressive option can be configured, say sampling every 5 times.

andrewor14 · 2015-02-19T20:47:44Z

Hey @lawlerd this seems to be superseded by #4420. Would you mind closing this PR? Please follow the latest discussion there.

In the general case, Spillable's heuristic of checking for memory stress on every 32nd item after 1000 items are read is good enough. In general, we do not want to be enacting the spilling checks until later on in the job; checking for disk-spilling too early can produce unacceptable performance impact in trivial cases. However, there are non-trivial cases, particularly if each serialized object is large, where checking for the necessity to spill too late would allow the memory to overflow. Consider if every item is 1.5 MB in size, and the heap size is 1000 MB. Then clearly if we only try to spill the in-memory contents to disk after 1000 items are read, we would have already accumulated 1500 MB of RAM and overflowed the heap. Patch #3656 attempted to circumvent this by checking the need to spill on every single item read, but that would cause unacceptable performance in the general case. However, the convoluted cases above should not be forced to be refactored to shrink the data items. Therefore it makes sense that the memory spilling thresholds be configurable. Author: mcheah <[email protected]> Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits: 6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check

In the general case, Spillable's heuristic of checking for memory stress on every 32nd item after 1000 items are read is good enough. In general, we do not want to be enacting the spilling checks until later on in the job; checking for disk-spilling too early can produce unacceptable performance impact in trivial cases. However, there are non-trivial cases, particularly if each serialized object is large, where checking for the necessity to spill too late would allow the memory to overflow. Consider if every item is 1.5 MB in size, and the heap size is 1000 MB. Then clearly if we only try to spill the in-memory contents to disk after 1000 items are read, we would have already accumulated 1500 MB of RAM and overflowed the heap. Patch #3656 attempted to circumvent this by checking the need to spill on every single item read, but that would cause unacceptable performance in the general case. However, the convoluted cases above should not be forced to be refactored to shrink the data items. Therefore it makes sense that the memory spilling thresholds be configurable. Author: mcheah <[email protected]> Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits: 6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check (cherry picked from commit 3be92cd) Signed-off-by: Andrew Or <[email protected]>

In the general case, Spillable's heuristic of checking for memory stress on every 32nd item after 1000 items are read is good enough. In general, we do not want to be enacting the spilling checks until later on in the job; checking for disk-spilling too early can produce unacceptable performance impact in trivial cases. However, there are non-trivial cases, particularly if each serialized object is large, where checking for the necessity to spill too late would allow the memory to overflow. Consider if every item is 1.5 MB in size, and the heap size is 1000 MB. Then clearly if we only try to spill the in-memory contents to disk after 1000 items are read, we would have already accumulated 1500 MB of RAM and overflowed the heap. Patch apache#3656 attempted to circumvent this by checking the need to spill on every single item read, but that would cause unacceptable performance in the general case. However, the convoluted cases above should not be forced to be refactored to shrink the data items. Therefore it makes sense that the memory spilling thresholds be configurable. Author: mcheah <[email protected]> Closes apache#4420 from mingyukim/memory-spill-configurable and squashes the following commits: 6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check

[SPARK-4808] Remove Spillable minimum threshold and sampling

2df2e2d

andrewor14 reviewed Dec 10, 2014
View reviewed changes

mingyukim mentioned this pull request Feb 6, 2015

[SPARK-4808] Removing minimum number of elements read before spill check #4420

Closed

asfgit closed this in 46462ff Feb 22, 2015

drawlerr deleted the branch-1.2 branch August 19, 2016 17:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-4808] Remove Spillable minimum threshold and sampling #3656

[SPARK-4808] Remove Spillable minimum threshold and sampling #3656

Uh oh!

drawlerr commented Dec 10, 2014

Uh oh!

andrewor14 Dec 10, 2014

Uh oh!

drawlerr Dec 10, 2014

Uh oh!

andrewor14 commented Dec 10, 2014

Uh oh!

andrewor14 commented Dec 10, 2014

Uh oh!

drawlerr commented Dec 10, 2014

Uh oh!

SparkQA commented Dec 10, 2014

Uh oh!

sryza commented Dec 11, 2014

Uh oh!

mccheah commented Jan 26, 2015

Uh oh!

andrewor14 commented Feb 19, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-4808] Remove Spillable minimum threshold and sampling #3656

[SPARK-4808] Remove Spillable minimum threshold and sampling #3656

Uh oh!

Conversation

drawlerr commented Dec 10, 2014

Uh oh!

andrewor14 Dec 10, 2014

Choose a reason for hiding this comment

Uh oh!

drawlerr Dec 10, 2014

Choose a reason for hiding this comment

Uh oh!

andrewor14 commented Dec 10, 2014

Uh oh!

andrewor14 commented Dec 10, 2014

Uh oh!

drawlerr commented Dec 10, 2014

Uh oh!

SparkQA commented Dec 10, 2014

Uh oh!

sryza commented Dec 11, 2014

Uh oh!

mccheah commented Jan 26, 2015

Uh oh!

andrewor14 commented Feb 19, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants