Conversation

@pan3793 (Member) commented Jun 25, 2022

What changes were proposed in this pull request?

Treat container AllocationFailure as not "exitCausedByApp" when the driver is shutting down.

Why are the changes needed?

I observed some Spark applications that successfully completed all jobs but failed during the shutdown phase with the reason: Max number of executor failures (16) reached. The timeline is:

Driver - Jobs succeed and Spark starts the shutdown procedure.

2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark@74e9431b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.xxx.org:28446
2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors

Driver - A container is allocated successfully during the shutdown phase.

2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_000025 on host hadoop4388.xxx.org for executor with ID 24 for ResourceProfile Id 0

Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
  at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
  at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
  at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
  at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413)
  at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
  at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
  at scala.collection.immutable.Range.foreach(Range.scala:158)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
  ... 4 more
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@hadoop2627.xxx.org:21956
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
  at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
  at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
  at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)

Driver - YarnAllocator receives the container launch error message and treats it as exitCausedByApp.

2022-06-23 19:52:27 CST YarnAllocator INFO - Completed container container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org (state: COMPLETE, exit status: 1)
2022-06-23 19:52:27 CST YarnAllocator WARN - Container from a bad node: container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org. Exit status: 1. Diagnostics: [2022-06-23 19:52:24.932]Exception from container-launch.
Container id: container_e94_1649986670278_7743380_02_000025
Exit code: 1
Shell output: main : command provided 1
main : run as user is bdms_pm
main : requested yarn user is bdms_pm
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /mnt/dfs/2/yarn/local/nmPrivate/application_1649986670278_7743380/container_e94_1649986670278_7743380_02_000025/container_e94_1649986670278_7743380_02_000025.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...


[2022-06-23 19:52:24.938]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
  at scala.concurrent.Promise.trySuccess(Promise.scala:94)
  at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
  at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
  at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
  at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
  at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
  at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
  at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)

Driver - Eventually the application fails because the number of "failed" executors reaches the threshold.

2022-06-23 19:52:30 CST ApplicationMaster INFO - Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (16) reached)
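
To make the failure accounting above concrete, here is a rough, self-contained sketch of the path from an unrecognized non-zero exit status to the max-failures abort. It is not the actual Spark source; the object name, the counter, the threshold handling, and the contents of the not-app-fault set are simplified placeholders for illustration.

```scala
// Simplified, illustrative model of the failure path described above
// (placeholder names and values, not the real YarnAllocator/ApplicationMaster code).
object FailurePathSketch {
  // Exit statuses considered "not the app's fault" (illustrative placeholder values).
  val NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS: Set[Int] = Set(-100, -102)

  val maxExecutorFailures = 16
  var numExecutorsFailed = 0

  def onCompletedContainer(exitStatus: Int): Unit = {
    // Anything not on the allow-list is charged to the application, even if the
    // container only failed because the driver was already shutting down.
    val exitCausedByApp = !NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(exitStatus)
    if (exitCausedByApp) numExecutorsFailed += 1
    if (numExecutorsFailed >= maxExecutorFailures) {
      println(s"Final app status: FAILED, exitCode: 11, " +
        s"(reason: Max number of executor failures ($maxExecutorFailures) reached)")
    }
  }

  def main(args: Array[String]): Unit = {
    // Sixteen containers exiting with status 1 during shutdown are enough to fail the app.
    (1 to 16).foreach(_ => onCompletedContainer(1))
  }
}
```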

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Update UT.

@github-actions github-actions bot added the YARN label Jun 25, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@pan3793 (Member, Author) commented Jun 27, 2022

cc @srowen @yaooqinn

@pan3793 (Member, Author) commented Jun 28, 2022

@tgravescs would you please take a look? thx.

// ache/hadoop/yarn/util/Apps.java#L273 for details)
-        if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
+        if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) ||
+          SparkContext.getActive.forall(_.isStopped)) {
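
As a side note on the guard's semantics (a stubbed illustration, not Spark code): Option.forall is vacuously true for None, so the proposed condition holds both when the active SparkContext is stopped and when no SparkContext is registered in the current process at all, which is what the review discussion below turns on.

```scala
// Stubbed demonstration of Option.forall semantics; Ctx stands in for SparkContext.
object OptionForallDemo extends App {
  case class Ctx(isStopped: Boolean)

  val stoppedCtx: Option[Ctx] = Some(Ctx(isStopped = true))
  val runningCtx: Option[Ctx] = Some(Ctx(isStopped = false))
  val noCtx: Option[Ctx] = None

  assert(stoppedCtx.forall(_.isStopped))  // true: driver is shutting down
  assert(!runningCtx.forall(_.isStopped)) // false: driver still running
  assert(noCtx.forall(_.isStopped))       // also true: no SparkContext in this process
}
```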
Contributor:

I'd prefer to know we are in shutdown rather than checking the SparkContext active state, but I need to refresh my memory on that shutdown sequence.

Member Author:

SparkContext sets stopped to true at the beginning of the stop procedure; do you have other suggestions for checking whether the SparkContext is stopped?

def stop(): Unit = {
  if (LiveListenerBus.withinListenerThread.value) {
    throw new SparkException(s"Cannot stop SparkContext within listener bus thread.")
  }
  // Use the stopping variable to ensure no contention for the stop scenario.
  // Still track the stopped variable for use elsewhere in the code.
  if (!stopped.compareAndSet(false, true)) {
    logInfo("SparkContext already stopped.")
    return
  }

Contributor:

This isn't going to work in cluster mode on YARN, where the ApplicationMaster and YarnAllocator are not in the same process as the SparkContext. I assume in that case getActive returns None and we would do this all the time when we really shouldn't.

Can we tell the allocator we are shutting down when the ApplicationMaster is told to shut down, and do a similar check to prevent this?
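
A minimal sketch of the direction suggested above, assuming a dedicated shutdown flag on the allocator (the class, method, and field names here are hypothetical; the actual change may differ):

```scala
// Hypothetical sketch: the ApplicationMaster marks the allocator as shutting down,
// and completed-container handling consults that flag instead of SparkContext.getActive.
class YarnAllocatorSketch(notAppFaultStatuses: Set[Int]) {
  @volatile private var shutdown = false

  // Called by the ApplicationMaster when it starts its shutdown sequence.
  def setShutdown(value: Boolean): Unit = shutdown = value

  // A completed container is not charged to the app if its exit status is on the
  // allow-list or if the driver/AM is already shutting down.
  def exitCausedByApp(exitStatus: Int): Boolean =
    !(notAppFaultStatuses.contains(exitStatus) || shutdown)
}
```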

Member Author:

Thanks for your suggestion, let me try.

@github-actions (bot) commented:
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 16, 2022
@github-actions github-actions bot closed this Oct 17, 2022
asfgit pushed a commit that referenced this pull request Dec 13, 2022
…usedByApp when driver is shutting down

### What changes were proposed in this pull request?

Treating container `AllocationFailure` as not "exitCausedByApp" when driver is shutting down.

The approach is suggested at #36991 (comment)

### Why are the changes needed?

I observed some Spark applications that successfully completed all jobs but failed during the shutdown phase with the reason: Max number of executor failures (16) reached. The timeline is:

Driver - Jobs succeed and Spark starts the shutdown procedure.
```
2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark@74e9431b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.xxx.org:28446
2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors
```

Driver - A container is allocated successfully during the shutdown phase.
```
2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_000025 on host hadoop4388.xxx.org for executor with ID 24 for ResourceProfile Id 0
```

Executor - The executor cannot connect to the driver endpoint because the driver has already stopped the endpoint.
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
  at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
  at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
  at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
  at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413)
  at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
  at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
  at scala.collection.immutable.Range.foreach(Range.scala:158)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
  ... 4 more
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@hadoop2627.xxx.org:21956
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
  at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
  at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
  at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
```

Driver - YarnAllocator receives the container launch error message and treats it as `exitCausedByApp`
```
2022-06-23 19:52:27 CST YarnAllocator INFO - Completed container container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org (state: COMPLETE, exit status: 1)
2022-06-23 19:52:27 CST YarnAllocator WARN - Container from a bad node: container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.xxx.org. Exit status: 1. Diagnostics: [2022-06-23 19:52:24.932]Exception from container-launch.
Container id: container_e94_1649986670278_7743380_02_000025
Exit code: 1
Shell output: main : command provided 1
main : run as user is bdms_pm
main : requested yarn user is bdms_pm
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /mnt/dfs/2/yarn/local/nmPrivate/application_1649986670278_7743380/container_e94_1649986670278_7743380_02_000025/container_e94_1649986670278_7743380_02_000025.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...

[2022-06-23 19:52:24.938]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
  at scala.concurrent.Promise.trySuccess(Promise.scala:94)
  at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
  at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
  at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
  at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
  at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
  at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
  at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
```

Driver - Eventually the application fails because the number of "failed" executors reaches the threshold.
```
2022-06-23 19:52:30 CST ApplicationMaster INFO - Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (16) reached)
```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Update UT.

Closes #38622 from pan3793/SPARK-39601.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…usedByApp when driver is shutting down
