Skip to content

[SPARK-29003][CORE] Add start method to ApplicationHistoryProvider to avoid deadlock on startup#25705

Closed
shanyu wants to merge 7 commits intoapache:masterfrom
shanyu:shanyu-29003
Closed

[SPARK-29003][CORE] Add start method to ApplicationHistoryProvider to avoid deadlock on startup#25705
shanyu wants to merge 7 commits intoapache:masterfrom
shanyu:shanyu-29003

Conversation

@shanyu
Copy link
Contributor

@shanyu shanyu commented Sep 6, 2019

What changes were proposed in this pull request?

During Spark History Server startup, there are two things happening simultaneously that call into java.nio.file.FileSystems.getDefault() and we sometime hit JDK-8194653.

  1. start jetty server
  2. start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.
We introduce a start() method in ApplicationHistoryProvider (and its subclass FsHistoryProvider), and we do initialize inside the start() method instead of the constructor.
In HistoryServer, we explicitly call provider.start() after we call bind() which starts the Jetty server.

Why are the changes needed?

It is a bug that occasionally starting Spark History Server results in process hang due to deadlock among threads.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I stress tested this PR with a bash script to stop and start Spark History Server more than 1000 times, it worked fine. Previously I can only do the stop/start loop less than 10 times before I hit the deadlock issue.

merge from apache master
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks OK to me, though I'd call provider.start() to different place - commented.

Left some comments. There's some simpler workaround (like pre-calling FileSystem.getDefault() in main method of HistoryServer) so reviewer can weigh which approach makes more sense.

Before that, could you change the title of PR to respect the format, like below?

[SPARK-29003][CORE] Fix deadlock in startup of Spark History Server

Most of titles in Spark PRs are explaining what the patch is addressing, not the symptom of the issue.

}

val initThread = initialize()
var initThread:Thread = null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space between : and Thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/** Bind to the HTTP server behind this web interface. */
override def bind() {
super.bind()
provider.start()
Copy link
Contributor

@HeartSaVioR HeartSaVioR Sep 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather place provider.start() before or after server.bind() in main(), not here. It feels really odd to start provider when calling bind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Done.

@felixcheung
Copy link
Member

what he says ^^

@HeartSaVioR
Copy link
Contributor

As I commented in JIRA issue, it seems to be a known issue in JDK 8, though the fixed version is too high (8u231) which Spark cannot always expect on this. So good to have workaround.

@dongjoon-hyun dongjoon-hyun changed the title SPARK-29003: Spark history server startup hang due to deadlock [SPARK-29003][CORE] Spark history server startup hang due to deadlock Sep 7, 2019
@dongjoon-hyun
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Sep 7, 2019

Test build #110274 has finished for PR 25705 at commit c27e35e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 7, 2019

Hi, @shanyu .

  1. There is a Scala style issue. Could you run dev/scalastyle in your PR directory and fix?
Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:203:16: No space after token :
  1. Also, please don't forget the above comments from @HeartSaVioR and @felixcheung .

@shanyu
Copy link
Contributor Author

shanyu commented Sep 10, 2019

Will do. Thanks.

Sorry for the delay, for some reason, I don't get email update when my PR is reviewed and comments left here...

Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@shanyu shanyu changed the title [SPARK-29003][CORE] Spark history server startup hang due to deadlock [SPARK-29003][CORE] Fix deadlock in startup of Spark History Server Sep 10, 2019
@shanyu
Copy link
Contributor Author

shanyu commented Sep 10, 2019

Addressed all feedback. Regarding to the point of simpler workaround raised by @HeartSaVioR, I feel that adding start() to provider seems more appropriate. There should be a way to control when to start provider as it is odd today that there is a stop() but not a start() method for provider. And calling FileSystems.getDefault() in main function seems strange and hacky.

}

val initThread = initialize()
var initThread : Thread = null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry nit again: no space between d and :. Please refer replayExecutor for how to space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except nit.

Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@SparkQA
Copy link

SparkQA commented Sep 10, 2019

Test build #110381 has finished for PR 25705 at commit 6fcf84c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I forgot the test suites - test failures look relevant.

There's FsHistoryProvider in HistoryServerSuite and we haven't modify the test to call start().
Could you fix the case, as well as search all cases initializing ApplicationHistoryProvider and its descendant to call start() properly?

@SparkQA
Copy link

SparkQA commented Sep 10, 2019

Test build #110386 has finished for PR 25705 at commit c3b4411.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@HeartSaVioR
Copy link
Contributor

Hmm... you may need to search and fix manually instead of just fixing the case in failed suite. (Or maybe you may need to run tests in your dev. environment at least for core module.)

I can easily find more spots to fix. Mostly you may need to call start() wherever instances of descendants of ApplicationHistoryProvider class are initialized.

@SparkQA
Copy link

SparkQA commented Sep 10, 2019

Test build #110398 has finished for PR 25705 at commit 046dd76.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shanyu
Copy link
Contributor Author

shanyu commented Sep 10, 2019

I searched the code base and I didn't see other descendants of ApplicationHistoryProvider. The latest build passed all tests in "org.apache.spark.deploy.history", but it says "fails due to an unknown error code, -9", I have no idea what this means. Could it be an infra issue?

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Sep 10, 2019

You can find all implementations for ApplicationHistoryProvider in IDE, for me, IntelliJ. Other than FsHistoryProvider, they're in test suites.

Seems like you missed to add start() for all places initializing FsHistoryProvider, which means calling startPolling() is missing: for testing, background threads will not be launched so that's OK, but initializing disk manager and validating log directory are missing. Test suites don't concern about it so test passes (I'd say it's luck), but given it's unintentional to not add start(), we would be better to fix that.

@HeartSaVioR
Copy link
Contributor

Just submitted a PR shanyu#2 which addresses missing spots.

@HeartSaVioR
Copy link
Contributor

it says "fails due to an unknown error code, -9", I have no idea what this means. Could it be an infra issue?

That might be intermittent failure and we can run the build again. If you merge my PR and push to the branch the build will rerun, so no worries.

@shanyu
Copy link
Contributor Author

shanyu commented Sep 11, 2019

Thank you @HeartSaVioR!
I have added the provider.start() to HistoryServerSuite, it is in my PR.
I didn't add them to FsHistoryProvider because I see no need to, as you mentioned already. The test suite knows the implementation so it does not have to follow the start/stop pattern. For one thing the tests didn't call stop(). If we are to modify the implementation of FsHistoryProvider.start() we can add start() at that point.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Sep 11, 2019

Hmm... yes I agree it might be weird if we don't follow start/stop pattern. What I'm concerned about is the change would also make tests skip initializing. If the test calls updateAndCheck(), it misses initialization. It doesn't let existing tests be broken, but brings some unintentional side-effect.

I think we even need to call initialize instead if it's weird for us to call start without calling stop, but given tests are passing, let's hear more voice on committers.

@HeartSaVioR
Copy link
Contributor

retest this, please

@shanyu
Copy link
Contributor Author

shanyu commented Sep 11, 2019

What I'm concerned about is the change would also make tests skip initializing. If the test calls updateAndCheck(), it misses initialization. It doesn't let existing tests be broken, but brings some unintentional side-effect.

Calling updateAndCheck() without start() in FsHistoryProviderSuite is fine because start() just waits for safe mode exit and for this suite of tests it is irrelevant.

@SparkQA
Copy link

SparkQA commented Sep 11, 2019

Test build #110452 has finished for PR 25705 at commit 046dd76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Sep 11, 2019

start() just waits for safe mode exit and for this suite of tests it is irrelevant.

No. There're some initializations happening in startPolling(), and it's only called once we call initialize(). Even IS_TESTING is set, remaining initializations are executed.

private def startPolling(): Unit = {
diskManager.foreach(_.initialize())
// Validate the log directory.
val path = new Path(logDir)
try {
if (!fs.getFileStatus(path).isDirectory) {
throw new IllegalArgumentException(
"Logging directory specified is not a directory: %s".format(logDir))
}
} catch {
case f: FileNotFoundException =>
var msg = s"Log directory specified does not exist: $logDir"
if (logDir == DEFAULT_LOG_DIR) {
msg += " Did you configure the correct one through spark.history.fs.logDirectory?"
}
throw new FileNotFoundException(msg).initCause(f)
}
// Disable the background thread during tests.
if (!conf.contains(IS_TESTING)) {
// A task that periodically checks for event log updates on disk.
logDebug(s"Scheduling update thread every $UPDATE_INTERVAL_S seconds")
pool.scheduleWithFixedDelay(
getRunner(() => checkForLogs()), 0, UPDATE_INTERVAL_S, TimeUnit.SECONDS)
if (conf.get(CLEANER_ENABLED)) {
// A task that periodically cleans event logs on disk.
pool.scheduleWithFixedDelay(
getRunner(() => cleanLogs()), 0, CLEAN_INTERVAL_S, TimeUnit.SECONDS)
}
if (conf.contains(DRIVER_LOG_DFS_DIR) && conf.get(DRIVER_LOG_CLEANER_ENABLED)) {
pool.scheduleWithFixedDelay(getRunner(() => cleanDriverLogs()),
0,
conf.get(DRIVER_LOG_CLEANER_INTERVAL),
TimeUnit.SECONDS)
}
} else {
logDebug("Background update thread disabled for testing")
}
}

Safe mode exit is just one of functionalities in initialize(). So while we might be able to say it doesn't break the existing tests so it's kind of safe, it still brings side-effect.

@shanyu
Copy link
Contributor Author

shanyu commented Sep 11, 2019

@HeartSaVioR Yeah you are right, it also calls startPolling() after exit from safe mode. That's why we need it in HistoryServerSuite tests. For FsHistoryProviderSuite, we always force update so it should be OK.

@HeartSaVioR
Copy link
Contributor

I'm thinking out logically, we're just missing to initialize FsHistoryProvider and its descendants for some of tests, regardless of the fact the initialization doesn't break actual tests. (diskManager.initialize is not a "must" thing to call so tests are passing, but what if it's a "must"?)

It doesn't seem to be logically correct to not initialize just because tests are passing, but that's only me, and remaining code change itself is good so it's not a blocker (maybe minor?) and I'll leave a decision to committers. Let's wait and hear feedbacks.

@HeartSaVioR
Copy link
Contributor

cc. @felixcheung @dongjoon-hyun Could you revisit this PR? Thanks in advance!

*/
def stop(): Unit = { }

def start(): Unit = { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Could you add a function description like def stop()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM with one nit comment. This is fixed since 8u212 but we had better have this as @HeartSaVioR commented before. Since both classes are private, I think the change will be okay. Thank you, @shanyu and @HeartSaVioR !

cc @vanzin

Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110525 has finished for PR 25705 at commit e8d8024.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def stop(): Unit = { }

/**
* Called when the server is starting up.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, this seems to be copied blindly~
This is not called. We need to call this, doesn't we?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe the expected behavior and the expected exceptions of this function, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun This is the same as stop() that is called by the server (HistoryServer), start() should be called during server starting up. I'll add a few more sentence to describe that it should do and why we added it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
@SparkQA
Copy link

SparkQA commented Sep 14, 2019

Test build #110576 has finished for PR 25705 at commit 75dc86d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29003][CORE] Fix deadlock in startup of Spark History Server [SPARK-29003][CORE] Add start method to ApplicationHistoryProvider to avoid deadlock on startup Sep 14, 2019
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM again. Thank you, @shanyu and @HeartSaVioR .
Since it's difficult to add a deadlock test case here, this look okay to me.
Merged to master.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
…to avoid deadlock on startup

### What changes were proposed in this pull request?

During Spark History Server startup, there are two things happening simultaneously that call into `java.nio.file.FileSystems.getDefault()` and we sometime hit [JDK-8194653](https://bugs.openjdk.java.net/browse/JDK-8194653).
1) start jetty server
2) start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.
We introduce a start() method in ApplicationHistoryProvider (and its subclass FsHistoryProvider), and we do initialize inside the start() method instead of the constructor.
In HistoryServer, we explicitly call provider.start() after we call bind() which starts the Jetty server.

### Why are the changes needed?
It is a bug that occasionally starting Spark History Server results in process hang due to deadlock among threads.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I stress tested this PR with a bash script to stop and start Spark History Server more than 1000 times, it worked fine. Previously I can only do the stop/start loop less than 10 times before I hit the deadlock issue.

Closes apache#25705 from shanyu/shanyu-29003.

Authored-by: Shanyu Zhao <shzhao@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants