
Conversation

@drawlerr

No description provided.

Contributor

This is not correct; you skipped the elementsRead % 32 check.
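
For context, a minimal sketch of the heuristic being referred to (simplified; the names approximate the Spillable trait rather than quoting it): the collection only considers spilling on every 32nd record once more than trackMemoryThreshold records have been read.

```scala
object SpillCheckSketch {
  private val trackMemoryThreshold = 1000L          // records read before memory is tracked at all
  private var elementsRead = 0L                     // records read so far
  private var myMemoryThreshold = 5L * 1024 * 1024  // bytes currently granted to this collection

  /** Returns true if the caller should spill its in-memory data to disk. */
  def maybeSpill(currentMemory: Long, tryToAcquire: Long => Long): Boolean = {
    elementsRead += 1
    if (elementsRead > trackMemoryThreshold &&
        elementsRead % 32 == 0 &&                   // the sampling check this PR removes
        currentMemory >= myMemoryThreshold) {
      // Ask for more memory; spill only if we still don't have enough.
      val granted = tryToAcquire(2 * currentMemory - myMemoryThreshold)
      myMemoryThreshold += granted
      currentMemory >= myMemoryThreshold
    } else {
      false
    }
  }
}
```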

Author

This was actually intentional - see below

@andrewor14
Contributor

Hey @lawlerd, I don't think we want to get rid of this check, because we want the thread to ramp up rather than just spill all the time. Would it make sense to make the spill threshold configurable instead?
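
A configurable threshold might look roughly like the sketch below; the config key is hypothetical and only illustrates the suggestion.

```scala
import org.apache.spark.SparkConf

class ConfigurableSpillThreshold(conf: SparkConf) {
  // Hypothetical key, for illustration only; not an actual Spark setting.
  val trackMemoryThreshold: Long =
    conf.getLong("spark.shuffle.spill.trackMemoryThreshold", 1000L)

  // Start sampling memory usage only after the configured number of records has been read.
  def shouldCheckMemory(elementsRead: Long): Boolean =
    elementsRead > trackMemoryThreshold && elementsRead % 32 == 0
}
```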

@andrewor14
Contributor

ok to test

@drawlerr
Author

Thanks for the review, @andrewor14!

I figured that to be the case. I weighed this against the hazard of OOMs and figured it was the lesser evil, but I might have missed something.
In which situations does a configurable count help?

@SparkQA

SparkQA commented Dec 10, 2014

Test build #24320 has finished for PR 3656 at commit 2df2e2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sryza
Contributor

sryza commented Dec 11, 2014

@lawlerd things are done this way because estimating the size for every record would be prohibitively expensive. Also, the trackMemoryThreshold is required at least until we figure out a solution for SPARK-4452. Without it, when there are multiple shuffle data structures in a thread and the first takes a bunch of memory, the second ends up spilling on every record (this was a blocker for 1.2).

Your concern is of course valid: we're not tracking memory 100% accurately. One response to this is that we're conservative with how much memory we use, e.g. we only use up to spark.shuffle.safetyFraction (default 80%) of the available shuffle memory.
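
For reference, the cap itself is derived roughly as below (a sketch assuming the 1.x defaults of memoryFraction 0.2 and safetyFraction 0.8, not a verbatim quote of ShuffleMemoryManager):

```scala
import org.apache.spark.SparkConf

object ShuffleMemorySketch {
  /** Upper bound on memory available to all shuffle collections in this JVM. */
  def maxShuffleMemoryBytes(conf: SparkConf): Long = {
    val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
    val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8)
    (Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
  }
}
```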

One improvement that might make sense would be to do the sampling based on memory size rather than number of records. So if we notice that records are larger we would sample more frequently and maybe adjust the trackMemoryThreshold.
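
A rough sketch of that size-based sampling idea; the class name, constants, and smoothing are illustrative, not an actual Spark API:

```scala
class SizeBasedSampler(targetBytesBetweenChecks: Long = 8L * 1024 * 1024) {
  private var avgRecordSize: Long = 1024L  // running estimate of bytes per record
  private var sinceLastCheck: Long = 0L

  // Interval shrinks as records get bigger, bounded to [1, 32].
  private def interval: Long =
    math.max(1L, math.min(32L, targetBytesBetweenChecks / math.max(1L, avgRecordSize)))

  /** Call once per record read; returns true when a memory check should happen now. */
  def recordRead(estimateRecordSize: => Long): Boolean = {
    sinceLastCheck += 1
    if (sinceLastCheck >= interval) {
      avgRecordSize = (avgRecordSize + estimateRecordSize) / 2  // cheap smoothing
      sinceLastCheck = 0L
      true
    } else {
      false
    }
  }
}
```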

@mccheah
Contributor

mccheah commented Jan 26, 2015

Seeing some problems in my usage of Spark that this PR could address, so reviving this thread.

@lawlerd the configurable count would help because, if it is known that the individual objects will be large, the sampling can be configured to run more frequently. So if sampling every 32 records is too infrequent, a more aggressive option can be chosen, say sampling every 5 records.
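
In sketch form (the config key is a hypothetical name, used only to illustrate a tunable interval):

```scala
import org.apache.spark.SparkConf

class ConfigurableSamplingInterval(conf: SparkConf) {
  // Hypothetical key; a job with MB-sized records could lower this from 32 to, say, 5.
  val checkInterval: Long = conf.getLong("spark.shuffle.spill.checkInterval", 32L)

  def shouldCheck(elementsRead: Long): Boolean = elementsRead % checkInterval == 0
}
```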

@andrewor14
Contributor

Hey @lawlerd, this seems to be superseded by #4420. Would you mind closing this PR? Please follow the latest discussion there.

asfgit pushed a commit that referenced this pull request Feb 20, 2015
In the general case, Spillable's heuristic of checking for memory stress
on every 32nd item after 1000 items are read is good enough. In general,
we do not want to be enacting the spilling checks until later on in the
job; checking for disk-spilling too early can produce unacceptable
performance impact in trivial cases.

However, there are non-trivial cases, particularly if each serialized
object is large, where checking for the necessity to spill too late
would allow the memory to overflow. Consider if every item is 1.5 MB in
size, and the heap size is 1000 MB. Then clearly if we only try to spill
the in-memory contents to disk after 1000 items are read, we would have
already accumulated 1500 MB of RAM and overflowed the heap.

Patch #3656 attempted to circumvent this by checking the need to spill
on every single item read, but that would cause unacceptable performance
in the general case. However, the convoluted cases above should not be
forced to be refactored to shrink the data items. Therefore it makes
sense that the memory spilling thresholds be configurable.

Author: mcheah <[email protected]>

Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits:

6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check
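
Putting numbers on the 1.5 MB example in the commit message (nothing beyond the arithmetic already described there):

```scala
object SpillMathExample extends App {
  val itemSizeMB             = 1.5     // each deserialized record
  val heapMB                 = 1000.0  // available heap
  val minElementsBeforeCheck = 1000    // fixed lower bound before any spill check

  // Memory accumulated before the first spill check can even run:
  val accumulatedMB = itemSizeMB * minElementsBeforeCheck
  println(f"$accumulatedMB%.0f MB accumulated vs $heapMB%.0f MB heap")  // 1500 MB > 1000 MB
}
```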
asfgit pushed a commit that referenced this pull request Feb 20, 2015

(cherry picked from commit 3be92cd)
Signed-off-by: Andrew Or <[email protected]>
@asfgit closed this in 46462ff Feb 22, 2015
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Feb 24, 2015
mingyukim pushed a commit to palantir/spark that referenced this pull request Mar 6, 2015
mccheah added a commit to palantir/spark that referenced this pull request Mar 27, 2015
mccheah added a commit to palantir/spark that referenced this pull request May 15, 2015
@drawlerr deleted the branch-1.2 branch August 19, 2016 17:21