[SPARK-13988][Core] Make replaying event logs multi threaded in History server to ensure a single large log does not block other logs from being rendered. by Parth-Brahmbhatt · Pull Request #11800 · apache/spark

Parth-Brahmbhatt · 2016-03-17T22:37:25Z

What changes were proposed in this pull request?

The patch makes event log processing multi threaded.

How was this patch tested?

Existing tests pass. No new tests are needed because this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and then another big event log (big2). Without this patch, the UI does not render any app for almost 30 seconds, and then big2 and small1 appear. After another 30-second delay, big1 finally shows up in the UI. With this change, small1 shows up immediately, and big1 and big2 appear after about 30 seconds. Locally, the apps are also displayed in the correct order in the UI.
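To make the big1/small1/big2 scenario concrete, here is a minimal sketch of replaying logs on a fixed-size thread pool. It is not the code from this patch: `EventLog`, `replay`, and `replayAll` are hypothetical names, and the replay cost is simulated with `Thread.sleep` instead of parsing a real event log.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

object ParallelReplaySketch {
  // Hypothetical stand-in for an event log: a name and a simulated replay cost.
  final case class EventLog(name: String, replayMillis: Long)

  // Simulates replaying one log; a real replay would parse the event log file.
  def replay(log: EventLog): String = {
    Thread.sleep(log.replayMillis)
    log.name
  }

  // Replays all logs on a fixed-size pool and returns the names in completion order.
  def replayAll(logs: Seq[EventLog], numThreads: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val finished = new ConcurrentLinkedQueue[String]()
    try {
      logs.foreach { log =>
        pool.submit(new Runnable {
          override def run(): Unit = finished.add(replay(log))
        })
      }
    } finally {
      pool.shutdown() // stop accepting new tasks; submitted tasks still run
    }
    pool.awaitTermination(1, TimeUnit.MINUTES)
    finished.toArray(new Array[String](0)).toSeq
  }
}
```

With three threads, a cheap log such as small1 is expected to complete before the two expensive ones; with a single thread, the logs complete strictly in submission order, which mirrors the blocking behavior described above.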

Parth-Brahmbhatt · 2016-04-11T22:16:03Z

Can someone review this patch?

tgravescs · 2016-04-20T14:15:24Z

Jenkins, test this please

tgravescs · 2016-04-20T14:20:06Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

I would rather see this configurable. The history server often runs on the same machine as other services (like the YARN ResourceManager or the MapReduce history server), and I wouldn't want the history server to starve out more important things.

I agree this should be configurable and I have set the default so it will only use 25% of cores.

SparkQA · 2016-04-20T16:10:45Z

Test build #56361 has finished for PR 11800 at commit c19e919.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2016-04-20T17:19:09Z

We are now replaying a bunch of things in parallel. What does that do to the memory usage of the history server?

Have you done any scale testing of this?

Parth-Brahmbhatt · 2016-04-20T18:11:39Z

Even before this change we were getting OOM errors. The issue primarily seems to be the creation of a lot of short-lived objects. In addition to this fix we also moved to the G1 GC, and we are using -XX:NewRatio=1 to allocate half the heap to the young generation.

We have had this fix deployed in production for a week and have observed one OOM crash. The heap dump is 12GB and I am still analyzing it, but initial analysis again points to a lot of String and char[] instances being created. If you are interested I can share the heap dump.

Overall, one of the big issues is that during startup the history server tries to load all the available logs (with the default 7-day retention), which in a large multi-tenant cluster like ours is a lot of files. Most users won't really click through to their applications, but deleting the event logs too early is also not a good option. Ideally, I would propose that the history server create simple summary files (containing what is needed to show the application summary in the UI), so that the next time the history server starts it only needs to process a summary file rather than the entire event log. We would need to process the entire event log only when a user clicks on the application.
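The summary-file idea could look roughly like the sketch below. This is only an illustration of the proposal, not something that exists in this patch: `AppSummary`, `EventLog`, and `SummaryStore` are hypothetical names, and an in-memory map stands in for summary files persisted on disk.

```scala
object SummaryCacheSketch {
  // Hypothetical minimal summary: just what an application listing would need.
  final case class AppSummary(appId: String, appName: String, eventCount: Int)

  // Hypothetical stand-in for a full event log held in memory.
  final case class EventLog(appId: String, appName: String, events: Seq[String])

  final class SummaryStore {
    // Assumption: this map plays the role of the persisted summary files.
    private val summaries = scala.collection.mutable.Map.empty[String, AppSummary]

    // Counts how often a full event log had to be processed.
    var fullReplays: Int = 0

    // Returns a stored summary if present; otherwise processes the log once and stores the result.
    def summaryFor(log: EventLog): AppSummary =
      summaries.getOrElseUpdate(log.appId, {
        fullReplays += 1
        AppSummary(log.appId, log.appName, log.events.size)
      })
  }
}
```

Under these assumptions, listing the same application twice processes its full log only once; a real implementation would also need to invalidate a summary when the underlying log changes.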

tgravescs · 2016-04-20T19:56:29Z

Yeah, there are other JIRAs to improve the startup; I just haven't had time to get to them yet. Feel free to work on them if you have time. :)

This change just makes it so you are actually reading X files in parallel, which could increase memory pressure, and I was wondering if you had looked to see by how much. We have very large files all the time, so if all threads are reading 10GB files, I was wondering how much that would increase memory usage versus reading only one at a time.

tgravescs · 2016-04-20T21:33:37Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala


  private val NOT_STARTED = "<Not Started>"

+  private val SPARK_HISTORY_FS_NUM_PROCESSING_THREADS = "spark.history.fs.num.processing.threads"


How about calling this spark.history.fs.numReplayThreads?
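For illustration, here is one way the suggested key could be resolved together with the 25%-of-cores default mentioned earlier in this review. This is a sketch under assumptions: a plain `Map[String, String]` stands in for the server's configuration object, and rounding up to at least one thread is an assumed policy, not something confirmed by the diff shown here.

```scala
object ReplayThreadsConfigSketch {
  // Key name suggested in the review comment above.
  val NumReplayThreadsKey = "spark.history.fs.numReplayThreads"

  // Assumed default: a quarter of the available cores, rounded up, never below one.
  def defaultThreads(availableCores: Int): Int =
    math.max(1, math.ceil(availableCores / 4.0).toInt)

  // `settings` is a hypothetical stand-in for the history server configuration.
  def numReplayThreads(settings: Map[String, String], availableCores: Int): Int =
    settings.get(NumReplayThreadsKey)
      .map(_.trim.toInt)
      .getOrElse(defaultThreads(availableCores))
}
```

A caller could pass `Runtime.getRuntime.availableProcessors()` as `availableCores`. An explicit setting always wins over the computed default, which lets operators on shared machines cap the history server's CPU use.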

Parth-Brahmbhatt · 2016-04-20T22:07:25Z

I can take a look at the other open jiras related to History server.

I haven't done an actual analysis of how fast the memory footprint increases. I can try to come up with an actual comparison; however, it's easy to go back to the single-threaded version if this really becomes a memory issue.

tgravescs · 2016-04-20T23:58:34Z

OK, I'm guessing you didn't push the rename changes, but this looks good other than that. I was trying to test it out on one of our clusters but ran out of time. I'll be offline until next Tuesday, so if I don't get to it later I'll recheck then.

Parth-Brahmbhatt · 2016-04-21T00:05:06Z

@tgravescs Pushed the changes, let me know if I can help test this in any way.

tgravescs · 2016-04-21T00:52:58Z

Jenkins, test this please

SparkQA · 2016-04-21T02:34:58Z

Test build #56450 has finished for PR 11800 at commit 858e8ff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2016-04-21T03:55:30Z

Jenkins, test this please

SparkQA · 2016-04-21T05:22:30Z

Test build #56474 has finished for PR 11800 at commit 858e8ff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2016-04-21T11:55:41Z

test failure is unrelated, +1.

[SPARK-13988][Core] Make replaying event logs multi threaded in History server to ensure a single large log does not block other logs from being rendered.

## What changes were proposed in this pull request?

The patch makes event log processing multi threaded.

## How was this patch tested?

Existing tests pass. No new tests are needed because this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and then another big event log (big2). Without this patch, the UI does not render any app for almost 30 seconds, and then big2 and small1 appear. After another 30-second delay, big1 finally shows up in the UI. With this change, small1 shows up immediately, and big1 and big2 appear after about 30 seconds. Locally, the apps are also displayed in the correct order in the UI.

Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>

Closes apache#11800 from Parth-Brahmbhatt/SPARK-13988.

Parth-Brahmbhatt force-pushed the SPARK-13988 branch from 5bac606 to c19e919 Compare March 17, 2016 22:39

tgravescs reviewed Apr 20, 2016

Parth-Brahmbhatt added 2 commits April 20, 2016 10:47

[SPARK-13988][Core] Make replaying event logs multi threaded in History server to ensure a single large log does not block other logs from being rendered.

7c3921c

Added a config to control number of processing threads for history server replay log processing.

704a619

Parth-Brahmbhatt force-pushed the SPARK-13988 branch from c19e919 to 704a619 Compare April 20, 2016 17:52

tgravescs reviewed Apr 20, 2016

Renaming the config to spark.history.fs.numReplayThreads.

858e8ff

asfgit closed this in 6fdd0e3 Apr 21, 2016

