[SPARK-29003][CORE] Add start method to ApplicationHistoryProvider to avoid deadlock on startup#25705
[SPARK-29003][CORE] Add start method to ApplicationHistoryProvider to avoid deadlock on startup#25705shanyu wants to merge 7 commits intoapache:masterfrom
start method to ApplicationHistoryProvider to avoid deadlock on startup#25705Conversation
merge from apache master
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
There was a problem hiding this comment.
The approach looks OK to me, though I'd call provider.start() to different place - commented.
Left some comments. There's some simpler workaround (like pre-calling FileSystem.getDefault() in main method of HistoryServer) so reviewer can weigh which approach makes more sense.
Before that, could you change the title of PR to respect the format, like below?
[SPARK-29003][CORE] Fix deadlock in startup of Spark History Server
Most of titles in Spark PRs are explaining what the patch is addressing, not the symptom of the issue.
| } | ||
|
|
||
| val initThread = initialize() | ||
| var initThread:Thread = null |
There was a problem hiding this comment.
nit: space between : and Thread.
| /** Bind to the HTTP server behind this web interface. */ | ||
| override def bind() { | ||
| super.bind() | ||
| provider.start() |
There was a problem hiding this comment.
I'd rather place provider.start() before or after server.bind() in main(), not here. It feels really odd to start provider when calling bind.
There was a problem hiding this comment.
That makes sense. Done.
|
what he says ^^ |
|
As I commented in JIRA issue, it seems to be a known issue in JDK 8, though the fixed version is too high (8u231) which Spark cannot always expect on this. So good to have workaround. |
|
ok to test |
|
Test build #110274 has finished for PR 25705 at commit
|
|
Hi, @shanyu .
|
|
Will do. Thanks. Sorry for the delay, for some reason, I don't get email update when my PR is reviewed and comments left here... |
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
|
Addressed all feedback. Regarding to the point of simpler workaround raised by @HeartSaVioR, I feel that adding start() to provider seems more appropriate. There should be a way to control when to start provider as it is odd today that there is a stop() but not a start() method for provider. And calling FileSystems.getDefault() in main function seems strange and hacky. |
| } | ||
|
|
||
| val initThread = initialize() | ||
| var initThread : Thread = null |
There was a problem hiding this comment.
sorry nit again: no space between d and :. Please refer replayExecutor for how to space.
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
|
Test build #110381 has finished for PR 25705 at commit
|
HeartSaVioR
left a comment
There was a problem hiding this comment.
Ah I forgot the test suites - test failures look relevant.
There's FsHistoryProvider in HistoryServerSuite and we haven't modify the test to call start().
Could you fix the case, as well as search all cases initializing ApplicationHistoryProvider and its descendant to call start() properly?
|
Test build #110386 has finished for PR 25705 at commit
|
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
|
Hmm... you may need to search and fix manually instead of just fixing the case in failed suite. (Or maybe you may need to run tests in your dev. environment at least for I can easily find more spots to fix. Mostly you may need to call |
|
Test build #110398 has finished for PR 25705 at commit
|
|
I searched the code base and I didn't see other descendants of ApplicationHistoryProvider. The latest build passed all tests in "org.apache.spark.deploy.history", but it says "fails due to an unknown error code, -9", I have no idea what this means. Could it be an infra issue? |
|
You can find all implementations for ApplicationHistoryProvider in IDE, for me, IntelliJ. Other than FsHistoryProvider, they're in test suites. Seems like you missed to add start() for all places initializing FsHistoryProvider, which means calling startPolling() is missing: for testing, background threads will not be launched so that's OK, but initializing disk manager and validating log directory are missing. Test suites don't concern about it so test passes (I'd say it's luck), but given it's unintentional to not add start(), we would be better to fix that. |
|
Just submitted a PR shanyu#2 which addresses missing spots. |
That might be intermittent failure and we can run the build again. If you merge my PR and push to the branch the build will rerun, so no worries. |
|
Thank you @HeartSaVioR! |
|
Hmm... yes I agree it might be weird if we don't follow start/stop pattern. What I'm concerned about is the change would also make tests skip initializing. If the test calls I think we even need to call |
|
retest this, please |
Calling updateAndCheck() without start() in FsHistoryProviderSuite is fine because start() just waits for safe mode exit and for this suite of tests it is irrelevant. |
|
Test build #110452 has finished for PR 25705 at commit
|
No. There're some initializations happening in Safe mode exit is just one of functionalities in |
|
@HeartSaVioR Yeah you are right, it also calls startPolling() after exit from safe mode. That's why we need it in HistoryServerSuite tests. For FsHistoryProviderSuite, we always force update so it should be OK. |
|
I'm thinking out logically, we're just missing to initialize FsHistoryProvider and its descendants for some of tests, regardless of the fact the initialization doesn't break actual tests. ( It doesn't seem to be logically correct to not initialize just because tests are passing, but that's only me, and remaining code change itself is good so it's not a blocker (maybe minor?) and I'll leave a decision to committers. Let's wait and hear feedbacks. |
|
cc. @felixcheung @dongjoon-hyun Could you revisit this PR? Thanks in advance! |
| */ | ||
| def stop(): Unit = { } | ||
|
|
||
| def start(): Unit = { } |
There was a problem hiding this comment.
nit. Could you add a function description like def stop()?
There was a problem hiding this comment.
+1, LGTM with one nit comment. This is fixed since 8u212 but we had better have this as @HeartSaVioR commented before. Since both classes are private, I think the change will be okay. Thank you, @shanyu and @HeartSaVioR !
cc @vanzin
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
|
Test build #110525 has finished for PR 25705 at commit
|
| def stop(): Unit = { } | ||
|
|
||
| /** | ||
| * Called when the server is starting up. |
There was a problem hiding this comment.
Ur, this seems to be copied blindly~
This is not called. We need to call this, doesn't we?
There was a problem hiding this comment.
Could you describe the expected behavior and the expected exceptions of this function, please?
There was a problem hiding this comment.
@dongjoon-hyun This is the same as stop() that is called by the server (HistoryServer), start() should be called during server starting up. I'll add a few more sentence to describe that it should do and why we added it.
Signed-off-by: Shanyu Zhao <shzhao@microsoft.com>
|
Test build #110576 has finished for PR 25705 at commit
|
start method to ApplicationHistoryProvider to avoid deadlock on startup
There was a problem hiding this comment.
+1, LGTM again. Thank you, @shanyu and @HeartSaVioR .
Since it's difficult to add a deadlock test case here, this look okay to me.
Merged to master.
…to avoid deadlock on startup ### What changes were proposed in this pull request? During Spark History Server startup, there are two things happening simultaneously that call into `java.nio.file.FileSystems.getDefault()` and we sometime hit [JDK-8194653](https://bugs.openjdk.java.net/browse/JDK-8194653). 1) start jetty server 2) start ApplicationHistoryProvider (which reads files from HDFS) We should do these two things sequentially instead of in parallel. We introduce a start() method in ApplicationHistoryProvider (and its subclass FsHistoryProvider), and we do initialize inside the start() method instead of the constructor. In HistoryServer, we explicitly call provider.start() after we call bind() which starts the Jetty server. ### Why are the changes needed? It is a bug that occasionally starting Spark History Server results in process hang due to deadlock among threads. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I stress tested this PR with a bash script to stop and start Spark History Server more than 1000 times, it worked fine. Previously I can only do the stop/start loop less than 10 times before I hit the deadlock issue. Closes apache#25705 from shanyu/shanyu-29003. Authored-by: Shanyu Zhao <shzhao@microsoft.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
During Spark History Server startup, there are two things happening simultaneously that call into
java.nio.file.FileSystems.getDefault()and we sometime hit JDK-8194653.We should do these two things sequentially instead of in parallel.
We introduce a start() method in ApplicationHistoryProvider (and its subclass FsHistoryProvider), and we do initialize inside the start() method instead of the constructor.
In HistoryServer, we explicitly call provider.start() after we call bind() which starts the Jetty server.
Why are the changes needed?
It is a bug that occasionally starting Spark History Server results in process hang due to deadlock among threads.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
I stress tested this PR with a bash script to stop and start Spark History Server more than 1000 times, it worked fine. Previously I can only do the stop/start loop less than 10 times before I hit the deadlock issue.