-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31632][CORE][WEBUI] Enrich the exception message when application information is unavailable #28444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Just 2 cents as I'm not committer as well as not an expert of web UI area. I see what's happening, but I'm not sure the fix is on the right way. I feel it's probably better to wait from other place (instead of AppStatusStore which is simply serving the information), even better if we can serve web UI once such precondition is fulfilled. Having empty ApplicationInfo (and showing properly) might be also one of valid options. |
|
Hi @HeartSaVioR, thanks for your suggestion. Yes, I'm also not quite sure if doing a while-loop-check is suitable here. I also considered the solutions you mentioned, e.g., making the |
|
Hi @Ngone51 and @HyukjinKwon , do you have any suggestions on fixing this? |
|
I've took a look at those invokers. It seems we still need some retry logic where the end user can not help. If that's the truth, I'd prefer to keep retry with a timeout in |
|
As there're no more other comments, I just updated the PR as @Ngone51 suggested. |
| Thread.sleep(200) | ||
| iterator = store.view(classOf[ApplicationInfoWrapper]).max(1).iterator() | ||
| } | ||
| // The next() call may still throw a NoSuchElement exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should explicitly throw a SparkException here if the iterator is still empty to say something like "No available ApplicationInfo after waiting 5 seconds..."
Ngone51
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/org/apache/spark/status/AppStatusStore.scala
Outdated
Show resolved
Hide resolved
| * This method contains an automatic retry logic and tries to get a valid [[v1.ApplicationInfo]]. | ||
| * See [SPARK-31632] The ApplicationInfo in KVStore may be accessed before it's prepared | ||
| */ | ||
| def applicationInfo(): v1.ApplicationInfo = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of make this function blocking, how about capture the Exception outside and display the proper message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If blocking is supposed to be harmful, I'd prefer an optional result instead of throwing an exception here. In fact, due to the previous design and implementation, all the invokers of the method are expecting a valid ApplicationInfo, which means they are not capable of dealing with a None value or an exception now. If we change the returned type to Option, all the logic of the invokers (and invokers of these invokers maybe) should be somehow adjusted. There will be more design choices then.
Adding the retry logic solves the problem from the root. As I said, however, I can definitely make another choice if we reach a consensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think capturing the exception outside is actually a good compromise considering the issue is rather a corner case. Let's scope narrow here since the issue is rather minor, and consider a better fix with a standard approach when any bigger issue is found in the design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback, @HyukjinKwon. However, I didn't get your point (sorry for that). Did you mean changing the implementation to capturing the exception outside, or let's just keep the current approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I wasn't clear. I was thinking a way, for example, as below:
- store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info
+ try {
+ store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info
+ } catch {
+ case _: NoSuchElementException =>
+ throw new NoSuchElementException(
+ "Application information was not found. If your Spark application is " +
+ "starting up, try again when your Spark application is ready.")
+ }There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. I'll update the PR as suggested.
|
ok to test |
|
Test build #122472 has finished for PR 28444 at commit
|
|
Hi @HyukjinKwon, I've updated this PR. Now the exception looks like below. |
|
Looks good, please also update the PR description. |
|
Test build #122566 has finished for PR 28444 at commit
|
|
Thanks for your reminder @jiangxb1987. Done. |
HeartSaVioR
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I think this is good compromise.
|
Thanks @xccui and all. Merged to master, branch-3.0 and branch-2.4. |
…ion information is unavailable ### What changes were proposed in this pull request? This PR caught the `NoSuchElementException` and enriched the error message for `AppStatusStore.applicationInfo()` when Spark is starting up and the application information is unavailable. ### Why are the changes needed? During the initialization of `SparkContext`, it first starts the Web UI and then set up the `LiveListenerBus` thread for dispatching the `SparkListenerApplicationStart` event (which will trigger writing the requested `ApplicationInfo` to `InMemoryStore`). If the Web UI is accessed before this info's being written to `InMemoryStore`, the following `NoSuchElementException` will be thrown. ``` WARN org.eclipse.jetty.server.HttpChannel: /jobs/ java.util.NoSuchElementException at java.util.Collections$EmptyIterator.next(Collections.java:4191) at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:505) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested. This can be reproduced: 1. `./bin/spark-shell` 2. at the same time, open `http://localhost:4040/jobs/` in your browser with quickly refreshing. Closes #28444 from xccui/SPARK-31632. Authored-by: Xingcan Cui <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 42951e6) Signed-off-by: HyukjinKwon <[email protected]>
…ion information is unavailable ### What changes were proposed in this pull request? This PR caught the `NoSuchElementException` and enriched the error message for `AppStatusStore.applicationInfo()` when Spark is starting up and the application information is unavailable. ### Why are the changes needed? During the initialization of `SparkContext`, it first starts the Web UI and then set up the `LiveListenerBus` thread for dispatching the `SparkListenerApplicationStart` event (which will trigger writing the requested `ApplicationInfo` to `InMemoryStore`). If the Web UI is accessed before this info's being written to `InMemoryStore`, the following `NoSuchElementException` will be thrown. ``` WARN org.eclipse.jetty.server.HttpChannel: /jobs/ java.util.NoSuchElementException at java.util.Collections$EmptyIterator.next(Collections.java:4191) at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:505) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested. This can be reproduced: 1. `./bin/spark-shell` 2. at the same time, open `http://localhost:4040/jobs/` in your browser with quickly refreshing. Closes #28444 from xccui/SPARK-31632. Authored-by: Xingcan Cui <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 42951e6) Signed-off-by: HyukjinKwon <[email protected]>
|
Thanks all for your suggestions and feedbacks. |
…n application summary is unavailable ### What changes were proposed in this pull request? <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below. 1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers. 2. If you fix some SQL features, you can provide some references of other DBMSes. 3. If there is design documentation, please add the link. 4. If there is a discussion in the mailing list, please add the link. --> This PR enriches the exception message when application summary is not available. #28444 covers the case when application information is not available but the case application summary is not available is not covered. ### Why are the changes needed? <!-- Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. --> To complement #28444 . ### Does this PR introduce _any_ user-facing change? <!-- Note that it means *any* user-facing change including all aspects such as the documentation fix. If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master. If no, write 'No'. --> Yes. Before this change, we can get the following error message when we access to `/jobs` if application summary is not available. <img width="707" alt="no-such-element-exception-error-message" src="https://user-images.githubusercontent.com/4736016/84562182-6aadf200-ad8d-11ea-8980-d63edde6fad6.png"> After this change, we can get the following error message. It's like #28444 does. <img width="1349" alt="enriched-errorm-message" src="https://user-images.githubusercontent.com/4736016/84562189-85806680-ad8d-11ea-8346-4da2ec11df2b.png"> ### How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> I checked with the following procedure. 1. Set breakpoint in the line of `kvstore.write(appSummary)` in `AppStatusListener#onStartApplicatin`. Only the thread reaching this line should be suspended. 2. Start spark-shell and wait few seconds. 3. Access to `/jobs` Closes #28820 from sarutak/fix-no-such-element. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…n application summary is unavailable ### What changes were proposed in this pull request? <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below. 1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers. 2. If you fix some SQL features, you can provide some references of other DBMSes. 3. If there is design documentation, please add the link. 4. If there is a discussion in the mailing list, please add the link. --> This PR enriches the exception message when application summary is not available. #28444 covers the case when application information is not available but the case application summary is not available is not covered. ### Why are the changes needed? <!-- Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. --> To complement #28444 . ### Does this PR introduce _any_ user-facing change? <!-- Note that it means *any* user-facing change including all aspects such as the documentation fix. If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master. If no, write 'No'. --> Yes. Before this change, we can get the following error message when we access to `/jobs` if application summary is not available. <img width="707" alt="no-such-element-exception-error-message" src="https://user-images.githubusercontent.com/4736016/84562182-6aadf200-ad8d-11ea-8980-d63edde6fad6.png"> After this change, we can get the following error message. It's like #28444 does. <img width="1349" alt="enriched-errorm-message" src="https://user-images.githubusercontent.com/4736016/84562189-85806680-ad8d-11ea-8346-4da2ec11df2b.png"> ### How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> I checked with the following procedure. 1. Set breakpoint in the line of `kvstore.write(appSummary)` in `AppStatusListener#onStartApplicatin`. Only the thread reaching this line should be suspended. 2. Start spark-shell and wait few seconds. 3. Access to `/jobs` Closes #28820 from sarutak/fix-no-such-element. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit c2e5012) Signed-off-by: HyukjinKwon <[email protected]>
…n application summary is unavailable ### What changes were proposed in this pull request? <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below. 1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers. 2. If you fix some SQL features, you can provide some references of other DBMSes. 3. If there is design documentation, please add the link. 4. If there is a discussion in the mailing list, please add the link. --> This PR enriches the exception message when application summary is not available. #28444 covers the case when application information is not available but the case application summary is not available is not covered. ### Why are the changes needed? <!-- Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. --> To complement #28444 . ### Does this PR introduce _any_ user-facing change? <!-- Note that it means *any* user-facing change including all aspects such as the documentation fix. If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master. If no, write 'No'. --> Yes. Before this change, we can get the following error message when we access to `/jobs` if application summary is not available. <img width="707" alt="no-such-element-exception-error-message" src="https://user-images.githubusercontent.com/4736016/84562182-6aadf200-ad8d-11ea-8980-d63edde6fad6.png"> After this change, we can get the following error message. It's like #28444 does. <img width="1349" alt="enriched-errorm-message" src="https://user-images.githubusercontent.com/4736016/84562189-85806680-ad8d-11ea-8346-4da2ec11df2b.png"> ### How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> I checked with the following procedure. 1. Set breakpoint in the line of `kvstore.write(appSummary)` in `AppStatusListener#onStartApplicatin`. Only the thread reaching this line should be suspended. 2. Start spark-shell and wait few seconds. 3. Access to `/jobs` Closes #28820 from sarutak/fix-no-such-element. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit c2e5012) Signed-off-by: HyukjinKwon <[email protected]>
…n application summary is unavailable ### What changes were proposed in this pull request? <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below. 1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers. 2. If you fix some SQL features, you can provide some references of other DBMSes. 3. If there is design documentation, please add the link. 4. If there is a discussion in the mailing list, please add the link. --> This PR enriches the exception message when application summary is not available. apache#28444 covers the case when application information is not available but the case application summary is not available is not covered. ### Why are the changes needed? <!-- Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. --> To complement apache#28444 . ### Does this PR introduce _any_ user-facing change? <!-- Note that it means *any* user-facing change including all aspects such as the documentation fix. If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master. If no, write 'No'. --> Yes. Before this change, we can get the following error message when we access to `/jobs` if application summary is not available. <img width="707" alt="no-such-element-exception-error-message" src="https://user-images.githubusercontent.com/4736016/84562182-6aadf200-ad8d-11ea-8980-d63edde6fad6.png"> After this change, we can get the following error message. It's like apache#28444 does. <img width="1349" alt="enriched-errorm-message" src="https://user-images.githubusercontent.com/4736016/84562189-85806680-ad8d-11ea-8346-4da2ec11df2b.png"> ### How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> I checked with the following procedure. 1. Set breakpoint in the line of `kvstore.write(appSummary)` in `AppStatusListener#onStartApplicatin`. Only the thread reaching this line should be suspended. 2. Start spark-shell and wait few seconds. 3. Access to `/jobs` Closes apache#28820 from sarutak/fix-no-such-element. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit c2e5012) Signed-off-by: HyukjinKwon <[email protected]>

What changes were proposed in this pull request?
This PR caught the
NoSuchElementExceptionand enriched the error message forAppStatusStore.applicationInfo()when Spark is starting up and the application information is unavailable.Why are the changes needed?
During the initialization of
SparkContext, it first starts the Web UI and then set up theLiveListenerBusthread for dispatching theSparkListenerApplicationStartevent (which will trigger writing the requestedApplicationInfotoInMemoryStore). If the Web UI is accessed before this info's being written toInMemoryStore, the followingNoSuchElementExceptionwill be thrown.Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually tested.
This can be reproduced:
./bin/spark-shellhttp://localhost:4040/jobs/in your browser with quickly refreshing.