[SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down #36991
Conversation
Can one of the admins verify this patch?

@tgravescs would you please take a look? thx.
```diff
   // ache/hadoop/yarn/util/Apps.java#L273 for details)
-  if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
+  if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) ||
+      SparkContext.getActive.forall(_.isStopped)) {
```
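One property of this check worth calling out (and relevant to the review comments below): `Option.forall` is vacuously true for `None`. A minimal, self-contained Scala sketch of that semantics, using plain `Option[Boolean]` stand-ins rather than a real `SparkContext`:

```scala
object ForallSemanticsDemo extends App {
  // Option.forall is vacuously true for None. That means a guard like
  // `SparkContext.getActive.forall(_.isStopped)` is true both when the active
  // context has been stopped and when no SparkContext exists in this JVM at all.
  val noContext: Option[Boolean]      = None        // e.g. a process that never created a SparkContext
  val runningContext: Option[Boolean] = Some(false) // isStopped == false
  val stoppedContext: Option[Boolean] = Some(true)  // isStopped == true

  assert(noContext.forall(identity))       // true, vacuously
  assert(!runningContext.forall(identity)) // false: context still running
  assert(stoppedContext.forall(identity))  // true: context stopped
  println("Option.forall semantics check passed")
}
```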
I'd prefer to know we are in shutdown vs. checking whether the SparkContext is active, but I need to refresh my memory on that shutdown sequence.
The SparkContext sets `stopped` to true at the beginning of the stop procedure; do you have other suggestions for checking whether the SparkContext is stopped?
spark/core/src/main/scala/org/apache/spark/SparkContext.scala (lines 2056 to 2065 in c1d1ec5):

```scala
  def stop(): Unit = {
    if (LiveListenerBus.withinListenerThread.value) {
      throw new SparkException(s"Cannot stop SparkContext within listener bus thread.")
    }
    // Use the stopping variable to ensure no contention for the stop scenario.
    // Still track the stopped variable for use elsewhere in the code.
    if (!stopped.compareAndSet(false, true)) {
      logInfo("SparkContext already stopped.")
      return
    }
```
This isn't going to work in cluster mode on YARN, where the ApplicationMaster and the YarnAllocator are not in the same process as the SparkContext. I assume in that case `getActive` returns `None`, and we would do this all the time when we really shouldn't.
Can we tell the allocator we are shutting down when the ApplicationMaster is told to shut down, and do a similar check to prevent this?
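This is the direction the follow-up change ultimately took (see the merged commit message below): the ApplicationMaster signals the allocator when shutdown begins, and the completed-container path consults that flag before counting a failure against the application. A minimal sketch of that signaling pattern, with hypothetical names (`Allocator`, `markShuttingDown`, `exitCausedByApp`) rather than the actual `YarnAllocator` API:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-ins for the real YARN/Spark classes, only to show the
// signaling shape: the AM marks the allocator as shutting down, and the
// completed-container path consults that flag before blaming the app.
final case class ContainerStatus(exitStatus: Int)

class Allocator {
  private val shuttingDown = new AtomicBoolean(false)

  // Called by the ApplicationMaster once it has been told to shut down.
  def markShuttingDown(): Unit = shuttingDown.set(true)

  // Returns whether the completed container should count as an app-caused failure.
  def exitCausedByApp(status: ContainerStatus): Boolean =
    !shuttingDown.get() && status.exitStatus != 0
}

object AllocatorShutdownDemo extends App {
  val allocator = new Allocator
  println(allocator.exitCausedByApp(ContainerStatus(1))) // true: counts toward executor failures
  allocator.markShuttingDown()
  println(allocator.exitCausedByApp(ContainerStatus(1))) // false: ignored while shutting down
}
```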
Thanks for your suggestion, let me try.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
[SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

### What changes were proposed in this pull request?
Treat container `AllocationFailure` as not "exitCausedByApp" when the driver is shutting down. The approach is suggested at #36991 (comment).

### Why are the changes needed?
I observed some Spark applications that successfully completed all jobs but failed during the shutting-down phase with reason: Max number of executor failures (16) reached. The timeline is:

Driver - Job success, Spark starts the shutting-down procedure.
```
2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark74e9431b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.xxx.org:28446
2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors
```
Driver - A container is allocated successfully during the shutting-down phase.
```
2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_000025 on host hadoop4388.xxx.org for executor with ID 24 for ResourceProfile Id 0
```
Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
    at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
    at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413)
    at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
    ... 4 more
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@hadoop2627.xxx.org:21956
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
    at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
    at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
```
Driver - YarnAllocator received the container launch error message and treated it as `exitCausedByApp`.
```
2022-06-23 19:52:27 CST YarnAllocator INFO - Completed container container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org (state: COMPLETE, exit status: 1)
2022-06-23 19:52:27 CST YarnAllocator WARN - Container from a bad node: container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org. Exit status: 1. Diagnostics: [2022-06-23 19:52:24.932]Exception from container-launch.
Container id: container_e94_1649986670278_7743380_02_000025
Exit code: 1
Shell output: main : command provided 1
main : run as user is bdms_pm
main : requested yarn user is bdms_pm
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /mnt/dfs/2/yarn/local/nmPrivate/application_1649986670278_7743380/container_e94_1649986670278_7743380_02_000025/container_e94_1649986670278_7743380_02_000025.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
[2022-06-23 19:52:24.938]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
    at scala.concurrent.Promise.trySuccess(Promise.scala:94)
    at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
    at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
    at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
    at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
    at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
    at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
```
Driver - Eventually the application failed because the "failed" executor count reached the threshold.
```
2022-06-23 19:52:30 CST ApplicationMaster INFO - Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (16) reached)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Update UT.

Closes #38622 from pan3793/SPARK-39601.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
### What changes were proposed in this pull request?
Treat container `AllocationFailure` as not "exitCausedByApp" when the driver is shutting down.

### Why are the changes needed?
I observed some Spark applications that successfully completed all jobs but failed during the shutting-down phase with reason: Max number of executor failures (16) reached. The timeline is:

- Driver - Job success, Spark starts the shutting-down procedure.
- Driver - A container is allocated successfully during the shutting-down phase.
- Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.
- Driver - YarnAllocator received the container launch error message and treated it as `exitCausedByApp`.
- Driver - Eventually the application failed because the "failed" executor count reached the threshold.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Update UT.
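For reference, the threshold behind the "Max number of executor failures (16) reached" message in the timeline above is governed by `spark.yarn.max.executor.failures`. A small sketch of setting it explicitly via `SparkConf`; the app name is a placeholder and the value simply mirrors the log line:

```scala
import org.apache.spark.SparkConf

object MaxExecutorFailuresExample extends App {
  // spark.yarn.max.executor.failures is the YARN-mode setting that caps how many
  // executor failures the ApplicationMaster tolerates before failing the app.
  val conf = new SparkConf()
    .setAppName("placeholder-app")
    .set("spark.yarn.max.executor.failures", "16")

  println(conf.get("spark.yarn.max.executor.failures")) // "16"
}
```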