Skip to content

[SPARK-13988][Core] Make replaying event logs multi threaded in Histo…ry server to ensure a single large log does not block other logs from being rendered.#11800

Closed
Parth-Brahmbhatt wants to merge 3 commits intoapache:masterfrom
Parth-Brahmbhatt:SPARK-13988

Conversation

@Parth-Brahmbhatt
Copy link
Contributor

What changes were proposed in this pull request?

The patch makes event log processing multi threaded.

How was this patch tested?

Existing tests pass, there is no new tests needed to test the functionality as this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log(small1) and again a big event log(big2). Without this patch UI does not render any app for almost 30 seconds and then big2 and small1 appears. another 30 second delay and finally big1 also shows up in UI. With this change small1 shows up immediately and big1 and big2 comes up in 30 seconds. Locally it also displays them in the correct order in the UI.

@Parth-Brahmbhatt
Copy link
Contributor Author

Can someone review this patch?

@tgravescs
Copy link
Contributor

Jenkins, test this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather see this configurable. Many times history server runs on same machine as other things (like Yarn ResourceManager or history server for MapReduce, etc) and I wouldn't want the history server to starve out more important things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be configurable and I have set the default so it will only use 25% of cores.

@SparkQA
Copy link

SparkQA commented Apr 20, 2016

Test build #56361 has finished for PR 11800 at commit c19e919.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

we are now replaying a bunch of things in parallel, what does that do to the memory usage or the history server?

Have you done any scale testing of this?

…ry server to ensure a single large log does not block other logs from being rendered.
@Parth-Brahmbhatt
Copy link
Contributor Author

Even before this change we were getting OOM errors. The issue primarily seems to be creation of lot of young objects. In addition to this fix we also moved to G1 gc and we are using -XX:NewRatio=1 to allocate half the space to Eden.

We have deployed this fix in production since a week and we have observed one OOM crash. The heap dump is 12GB and I am still analyzing it but initial analysis again points at lot of string,char[] instances being created. If you are interested I can share the heap dump.

Overall one of the big issue is during startup history server tries to load all the logs available ( with default 7 day retention) which in a large multi tenant cluster like ours is a lot of files. Most users won't really click through their application but deleting the event log too early is also not a good option. Ideally I would propose that history server creates simple summary files (needed to actually show the application summary on UI) so the next time history server starts it does not need to process entire event log but only a summary file. Only when a user clicks on the application we need to process the entire event log.

@tgravescs
Copy link
Contributor

Yeah there are other jira to improve the startup, I just haven't had time to get to them yet. Feel free to work on if you have time. :)

this just makes it so you are actually reading X number of files in parallel which could increase memory pressure and I was wondering if you had look to see by how much that is. We have very large files all the time so if all threads are reading 10GB files I was wondering how much that would increase memory usage vs only reading one at a time.


private val NOT_STARTED = "<Not Started>"

private val SPARK_HISTORY_FS_NUM_PROCESSING_THREADS = "spark.history.fs.num.processing.threads"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about calling this spark.history.fs.numReplayThreads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed.

@Parth-Brahmbhatt
Copy link
Contributor Author

I can take a look at the other open jiras related to History server.

I haven't done actual analysis on how fast the memory footprint increases. I can try and come up with the actual comparison however its easy to go back to single threaded version if this really becomes a memory issue.

@tgravescs
Copy link
Contributor

ok, I'm guessing you didn't push the changes to rename, but this looks good other then that. I was trying to test out on one of our clusters but ran out of time. I'll be offline til next tuesday so if I don't get to it later I'll recheck then.

@Parth-Brahmbhatt
Copy link
Contributor Author

@tgravescs Pushed the changes, let me know if I can help test this in any way.

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56450 has finished for PR 11800 at commit 858e8ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56474 has finished for PR 11800 at commit 858e8ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

test failure is unrelated, +1.

@asfgit asfgit closed this in 6fdd0e3 Apr 21, 2016
superbobry pushed a commit to criteo-forks/spark that referenced this pull request Feb 15, 2017
……ry server to ensure a single large log does not block other logs from being rendered.

## What changes were proposed in this pull request?
The patch makes event log processing multi threaded.

## How was this patch tested?
Existing tests pass, there is no new tests needed to test the functionality as this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log(small1) and again a big event log(big2). Without this patch UI does not render any app for almost 30 seconds and then big2 and small1 appears. another 30 second delay and finally big1 also shows up in UI. With this change small1 shows up immediately and big1 and big2 comes up in 30 seconds. Locally it also displays them in the correct order in the UI.

Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>

Closes apache#11800 from Parth-Brahmbhatt/SPARK-13988.
superbobry pushed a commit to criteo-forks/spark that referenced this pull request Mar 20, 2017
……ry server to ensure a single large log does not block other logs from being rendered.

The patch makes event log processing multi threaded.

Existing tests pass, there is no new tests needed to test the functionality as this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log(small1) and again a big event log(big2). Without this patch UI does not render any app for almost 30 seconds and then big2 and small1 appears. another 30 second delay and finally big1 also shows up in UI. With this change small1 shows up immediately and big1 and big2 comes up in 30 seconds. Locally it also displays them in the correct order in the UI.

Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>

Closes apache#11800 from Parth-Brahmbhatt/SPARK-13988.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants