-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-3885] Provide mechanism to remove accumulators once they are no longer used #4021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… references to allow garbage collection of old accumulators
… WeakRef Accumulator storage
|
Test build #25475 has started for PR 4021 at commit
|
|
Test build #25475 has finished for PR 4021 at commit
|
|
Test PASSed. |
|
It looks like a number of file permissions changes got mixed into this PR; mind reverting those changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guava MapMaper supports weakValues; not sure if we want to use that here, since it's not super Scala-friendly (e.g. returns nulls, etc): http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/MapMaker.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Josh - are you suggesting to replace this snippet with a MapMaker just to simplify the initialization code? I believe the usage of either object would be the same - do you see a specific advantage to trying to use the MapMaker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just leave this as-is; I don't think MapMaker will buy us much now that I think about it.
|
Test build #25606 has started for PR 4021 at commit
|
|
I've updated the code to throw an exception in the error case you mentioned and I've reverted the file permission change. Thanks! |
|
Test build #25606 has finished for PR 4021 at commit
|
|
Test PASSed. |
|
This looks pretty nice. I wonder if there's a good way to add a unit test for this behavior (maybe in AccumulatorSuite or something) to ensure that we don't accidentally break the garbage-collection. This could be tricky, though, since we don't want to introduce a test that's flaky due to nondeterminism. Maybe a good test would be something like creating a job that uses an accumulator, storing its id, letting all references to it fall out of scope, then calling |
|
Thanks for the suggestion Josh - that sounds quite reasonable. I was having trouble coming up with an effective test for this. I'll add a test based on this idea to the accumulator suite. |
|
Hi @JoshRosen, I've added a test for this patch to the Accumulator Suite. Please let me know if you think it's sufficient. Thanks! |
|
Test build #27005 has started for PR 4021 at commit
|
|
Test build #27005 has finished for PR 4021 at commit
|
|
Test FAILed. |
|
Test FAILed. |
|
Test build #27801 has started for PR 4021 at commit
|
|
Test build #27801 has finished for PR 4021 at commit
|
|
Test FAILed. |
|
Test build #27803 has started for PR 4021 at commit
|
|
Test build #27803 has finished for PR 4021 at commit
|
|
Test PASSed. |
|
I just realized that ContextCleaner won't manage cleanup of accumulators in the thread-local accumulator registries in executors, but I don't think this is a big deal: we're more concerned about the driver OOMing in a long-running app than the executors, and it would be prohibitively expensive to add a bunch of driver -> executor RPCs just to reclaim one map entry. Therefore, this looks good to me. There's one minor import-ordering nit, but I'll just fix that up myself on merge. Thanks for working on this, esp. since this code is a bit tricky to test. |
|
Actually, we won't leak accumulators on executors because we clear the thread-local between tasks. So this is great! |
|
I've merged this into @ilganeli, it looks like your text editor might be leaving trailing whitespace on lines, which makes the diffs look kind of messy: I fixed this myself when committing, so not a huge deal, but you might consider changing your editor settings to prevent htis in the future (I have a special vim setting that complains about trailing whitespace, and I think there's a way to configure IntelliJ to do the same). |
|
Hi Josh - thanks for the review and suggestions. With regards to the local thread cleanup I didn't add anything since as you pointed out it already gets cleaned up. Cheers! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Accumulator is garbage collected, should we log and continue ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception thrown here is caught at higher levels of the stack. For example, DAGScheduler wraps calls to accumulator methods in a try block and logs any uncaught exceptions. Have you run into a case where the current behavior causes a problem?
|
Not yet. |
|
I think that this patch may have introduced a bug that may cause accumulator updates to be lost: I'm still trying to see if I can spot the problem, but my hunch is that maybe the |
|
Another thought: if |
|
I think I've figured it out: consider the lifecycle of an accumulator in a task, say ShuffleMapTask: on the executor, each task deserializes its own copy of the RDD inside of its The fix is to keep strong references in |
At this point, I should have spotted that it doesn't make sense to store WeakReferences if those references are going to be destroyed at the end of tasks anyways. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was dumb for me to overlook, too: if the data structure is invalid, then this would just silently ignore it; even if this was a rare error-condition, there should have been a warning here.
|
I have a fix at #4835. |
|
Thanks for the detailed write up and the fix. |
…store WeakReferences in localAccums map This fixes a non-deterministic bug introduced in #4021 that could cause tasks' accumulator updates to be lost. The problem is that `localAccums` should not hold weak references: after the task finishes running there won't be any strong references to these local accumulators, so they can get garbage-collected before the executor reads the `localAccums` map. We don't need weak references here anyways, since this map is cleared at the end of each task. Author: Josh Rosen <[email protected]> Closes #4835 from JoshRosen/SPARK-6075 and squashes the following commits: 4f4b5b2 [Josh Rosen] Remove defensive assertions that caused test failures in code unrelated to this change 120c7b0 [Josh Rosen] [SPARK-6075] Do not store WeakReferences in localAccums map

Instead of storing a strong reference to accumulators, I've replaced this with a weak reference and updated any code that uses these accumulators to check whether the reference resolves before using the accumulator. A weak reference will be cleared when there is no longer an existing copy of the variable versus using a soft reference in which case accumulators would only be cleared when the GC explicitly ran out of memory.