[SPARK-13988][Core] Make replaying event logs multi threaded in History server to ensure a single large log does not block other logs from being rendered. #11800
Can someone review this patch?
Jenkins, test this please
I would rather see this configurable. The history server often runs on the same machine as other things (like the YARN ResourceManager or the history server for MapReduce), and I wouldn't want it to starve out more important services.
I agree this should be configurable, and I have set the default so that it will only use 25% of the cores.
Test build #56361 has finished for PR 11800 at commit
We are now replaying a bunch of things in parallel. What does that do to the memory usage of the history server? Have you done any scale testing of this?
Even before this change we were getting OOM errors. The issue primarily seems to be the creation of a lot of young objects. In addition to this fix we also moved to G1 GC, and we are using -XX:NewRatio=1 to allocate half the heap to the young generation. This fix has been deployed in production for a week, and we have observed one OOM crash. The heap dump is 12GB and I am still analyzing it, but initial analysis again points at a lot of String and char[] instances being created. If you are interested I can share the heap dump. Overall, one of the big issues is that during startup the history server tries to load all the available logs (with the default 7-day retention), which in a large multi-tenant cluster like ours is a lot of files. Most users won't really click through to their application, but deleting the event log too early is also not a good option. Ideally I would propose that the history server create simple summary files (containing what is needed to show the application summary in the UI), so that the next time the history server starts it only needs to process a summary file rather than the entire event log. Only when a user clicks on the application do we need to process the entire event log.
Yeah, there are other JIRAs to improve the startup; I just haven't had time to get to them yet. Feel free to work on them if you have time. :) This change just makes it so you are actually reading X number of files in parallel, which could increase memory pressure, and I was wondering if you had looked to see by how much. We have very large files all the time, so if all threads are reading 10GB files, I was wondering how much that would increase memory usage versus only reading one at a time.
private val NOT_STARTED = "<Not Started>"
private val SPARK_HISTORY_FS_NUM_PROCESSING_THREADS = "spark.history.fs.num.processing.threads"
How about calling this spark.history.fs.numReplayThreads?
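To make the configuration discussion concrete, here is a minimal sketch of how a replay thread count could be read from configuration, falling back to roughly 25% of the cores as mentioned earlier in the thread. The key name is the one suggested in the comment above; the `ReplayThreadConfig` object, its helper names, and the plain `Map` standing in for the real configuration object are assumptions for illustration, not the code in the patch.

```scala
object ReplayThreadConfig {
  // Key name as suggested in the review comment; the final name may differ.
  val NumReplayThreadsKey = "spark.history.fs.numReplayThreads"

  // Hypothetical helper: a quarter of the given cores, rounded up, never below 1.
  def defaultThreads(cores: Int): Int =
    math.max(1, math.ceil(cores / 4.0).toInt)

  // `conf` is a plain Map standing in for the real configuration object (assumption).
  // An explicit setting wins; otherwise fall back to the core-based default.
  def numReplayThreads(
      conf: Map[String, String],
      cores: Int = Runtime.getRuntime.availableProcessors()): Int =
    conf.get(NumReplayThreadsKey).map(_.trim.toInt).getOrElse(defaultThreads(cores))
}
```

With this shape, an operator who shares the machine with other services could set the key to 1 to get back the single-threaded behavior.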
I can take a look at the other open JIRAs related to the history server. I haven't done an actual analysis of how fast the memory footprint increases. I can try to come up with an actual comparison; however, it is easy to go back to the single-threaded version if this really becomes a memory issue.
OK, I'm guessing you didn't push the changes for the rename, but this looks good other than that. I was trying to test it out on one of our clusters but ran out of time. I'll be offline until next Tuesday, so if I don't get to it later I'll recheck then.
@tgravescs Pushed the changes, let me know if I can help test this in any way.
Jenkins, test this please
Test build #56450 has finished for PR 11800 at commit
Jenkins, test this please
Test build #56474 has finished for PR 11800 at commit
test failure is unrelated, +1.
…ry server to ensure a single large log does not block other logs from being rendered.
## What changes were proposed in this pull request?
The patch makes event log processing multi-threaded.
## How was this patch tested?
Existing tests pass; no new tests are needed because this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and another big event log (big2). Without this patch the UI does not render any app for almost 30 seconds, and then big2 and small1 appear. After another 30-second delay, big1 also shows up in the UI. With this change small1 shows up immediately, and big1 and big2 come up in 30 seconds. Locally it also displays them in the correct order in the UI.
Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes apache#11800 from Parth-Brahmbhatt/SPARK-13988.
What changes were proposed in this pull request?
The patch makes event log processing multi-threaded.
How was this patch tested?
Existing tests pass; no new tests are needed because this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and another big event log (big2). Without this patch the UI does not render any app for almost 30 seconds, and then big2 and small1 appear. After another 30-second delay, big1 also shows up in the UI. With this change small1 shows up immediately, and big1 and big2 come up in 30 seconds. Locally it also displays them in the correct order in the UI.
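The effect described in the test notes can be illustrated with a small, self-contained sketch: logs are replayed on a fixed-size thread pool, so a small log no longer has to wait behind a large one. The `ParallelReplaySketch` object, the `EventLog` case class, and the sleep that stands in for parsing are all hypothetical; this is not the history server's actual replay code.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

object ParallelReplaySketch {
  // Hypothetical stand-in for an event log: a name plus a simulated replay cost.
  final case class EventLog(name: String, replayMillis: Long)

  // Replays every log on a fixed-size pool and returns the names in completion order.
  def replayAll(logs: Seq[EventLog], numThreads: Int): List[String] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val finished = new ConcurrentLinkedQueue[String]()
    try {
      logs.foreach { log =>
        pool.execute(new Runnable {
          override def run(): Unit = {
            Thread.sleep(log.replayMillis) // simulate parsing the log
            finished.add(log.name)
          }
        })
      }
    } finally {
      pool.shutdown() // no new tasks; already submitted ones still run
    }
    pool.awaitTermination(1, TimeUnit.MINUTES)

    // Copy the queue into an immutable list, preserving completion order.
    val out = List.newBuilder[String]
    val it = finished.iterator()
    while (it.hasNext) out += it.next()
    out.result()
  }
}
```

In this sketch, calling `replayAll` with one thread completes the logs strictly in submission order, so a small log submitted after a big one waits for it. With three threads and three logs, the small log finishes first. The trade-off raised in the review still applies: more threads means more logs held in memory at once.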