[SPARK-30059][CORE]Stop AsyncEventQueue when interrupted in dispatch #26674

wangshuo128 · 2019-11-26T07:23:16Z

What changes were proposed in this pull request?

PR #21356 stops the AsyncEventQueue when it is interrupted in postToAll.
However, if the thread is interrupted in AsyncEventQueue#dispatch, the SparkContext is stopped.
This PR proposes stopping the AsyncEventQueue when interrupted in dispatch, rather than stopping the SparkContext.
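
A minimal sketch of the proposed behavior, written against plain JDK classes rather than Spark's. The class `QueueDispatchSketch`, its fields, and `handle` are hypothetical stand-ins and are not the PR's actual diff; the point is only that an `InterruptedException` raised while waiting on the queue stops that one queue, while other failures still escalate.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for an async listener queue; none of these names are Spark APIs.
public class QueueDispatchSketch {
    public final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    public final AtomicBoolean queueStopped = new AtomicBoolean(false);
    public final AtomicBoolean contextStopped = new AtomicBoolean(false);

    // Blocks on the queue and delivers events until interrupted or a handler fails.
    public void dispatch() {
        try {
            while (!queueStopped.get()) {
                String event = events.take(); // throws InterruptedException if interrupted
                handle(event);
            }
        } catch (InterruptedException e) {
            // Interrupted while waiting: stop only this queue, leave the context running.
            queueStopped.set(true);
        } catch (RuntimeException e) {
            // Any other failure still escalates, standing in for stopping the SparkContext.
            contextStopped.set(true);
        }
    }

    private void handle(String event) {
        // Listener callbacks would run here.
    }

    public static void main(String[] args) throws InterruptedException {
        QueueDispatchSketch queue = new QueueDispatchSketch();
        Thread dispatcher = new Thread(queue::dispatch, "dispatch-sketch");
        dispatcher.start();
        dispatcher.interrupt();
        dispatcher.join();
        System.out.println("queue stopped: " + queue.queueStopped.get());
        System.out.println("context stopped: " + queue.contextStopped.get());
    }
}
```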

Why are the changes needed?

Avoid stopping the SparkContext when interrupted in AsyncEventQueue#dispatch.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT.

wangshuo128 · 2019-11-26T07:38:39Z

I applied patch #21356 in my cluster and found that the AsyncEventQueue thread was sometimes interrupted in queue.take(). I guess it is interrupted asynchronously by some other thread. Unfortunately, I couldn't find which thread (in Spark or HDFS) did this.

Here is the log:

java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.132.165.35:46887 remote=/10.132.78.10:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2319)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1087)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1056)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1197)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:942)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:453)
19/11/24 03:58:01 ERROR spark-listener-group-eventLog Utils: uncaught error in thread spark-listener-group-eventLog, stopping SparkContext
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:97)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1303)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
19/11/24 03:58:01 ERROR spark-listener-group-eventLog Utils: throw uncaught fatal error in thread spark-listener-group-eventLog
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:97)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
        at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1303)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)

Stopping the entire queue when interrupted in dispatch may not be the best choice. If it's an important queue (e.g. dynamic resource allocation), I think it's better to stop the SparkContext.
However, in this case the interrupt happens in the event log queue, so I think we could keep the job running rather than stop the SparkContext. Stopping the AsyncEventQueue for the event log and stopping the SparkContext for other queues may be an option.

Do you have any advice? cc @squito @cloud-fan :)

squito · 2019-11-26T18:17:47Z

Can you please open another jira for this, since SPARK-24309 is already in shipped releases?

I haven't thought about this a lot, but I don't know if I really like this idea. You would be able to stop the running job if there were only one job, but what about with concurrent jobs? I wonder if we should just have some special case handling in the EventLoggingListener to retry once after interrupt?

wangshuo128 · 2019-11-27T07:15:01Z

Thanks for your reply. @squito

> You would be able to stop the running job if there were only one job, but what about with concurrent jobs?

I didn't get the point. What would happen if we just stopped the event log queue while concurrent jobs are running? Could you explain this in detail?

> I wonder if we should just have some special case handling in the EventLoggingListener to retry once after interrupt?

I agree with this.
AFAIK, the interruption issue only appears in the event log queue. However, it seems that the current approach can't cover all the cases, e.g. being interrupted in queue.take().
I came up with the idea of wrapping EventLoggingListener#logEvent in an isolated thread and handling InterruptedException in that thread, so that the AsyncEventQueue thread wouldn't be affected.
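
A rough sketch of the isolated-thread idea, assuming a single-threaded `ExecutorService` as the writer. `IsolatedWriterSketch` and its `logEvent(Runnable)` signature are hypothetical and do not match `EventLoggingListener`; the sketch only shows that interrupt status set during a write stays on the writer thread and never reaches the caller.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical wrapper: runs each write on a dedicated thread so that any
// interrupt status the write sets is confined to that thread.
public class IsolatedWriterSketch implements AutoCloseable {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    // Returns true if the write completed, false if it threw an exception.
    public boolean logEvent(Runnable write) throws InterruptedException {
        Future<?> result = writer.submit(() -> {
            try {
                write.run();
            } finally {
                // Clear any interrupt status the write left behind on the writer thread.
                Thread.interrupted();
            }
        });
        try {
            // Only an interrupt aimed at the calling thread itself can surface here.
            result.get();
            return true;
        } catch (ExecutionException e) {
            return false;
        }
    }

    @Override
    public void close() {
        writer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        try (IsolatedWriterSketch sketch = new IsolatedWriterSketch()) {
            // Simulate a write that leaves the interrupt flag set on its own thread.
            sketch.logEvent(() -> Thread.currentThread().interrupt());
            System.out.println("caller interrupted: " + Thread.currentThread().isInterrupted());
        }
    }
}
```

The trade-off is an extra thread hop and a blocking wait per event, so this would need the careful walk-through a change to the listener path deserves.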

squito · 2019-11-27T13:02:46Z

> > You would be able to stop the running job if there were only one job, but what about with concurrent jobs?
>
> I didn't get the point. What would happen if we just stopped the event log queue while concurrent jobs are running? Could you explain this in detail?

Sorry, please ignore that. I misread your earlier comments and thought you were discussing stopping running jobs.

> AFAIK, the interruption issue only appears in the event log queue. However, it seems that the current approach can't cover all the cases, e.g. being interrupted in queue.take().
> I came up with the idea of wrapping EventLoggingListener#logEvent in an isolated thread and handling InterruptedException in that thread, so that the AsyncEventQueue thread wouldn't be affected.

Yes, good point. I'd need to walk through this very carefully, but that sounds reasonable to me.

squito · 2019-11-27T14:14:24Z

Do you know what version of Hadoop you are on? I am trying to compare with the code, and it's clearly not trunk (the DataStreamer class has been moved and plenty of other refactoring has happened).

Still, even looking at trunk, I have a guess at what is happening. The first part of your log shows the interrupt is coming from the DataStreamer, though that is running in a separate thread and isn't directly interrupting the event log queue thread. But my guess is that calls to flush() in the event log thread check the status of that DataStreamer, and will set the event log thread's interrupt status. (e.g. something like this, though probably not these lines: https://github.com/apache/hadoop/blob/7f2ea2ac46596883fb8f110f754a0eadeb69205e/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L730-L734)

So it's possible that we're actually leaving the previous call to flush() with the interrupt status set, but we're ignoring it. Then, when we get to the next call to queue.take(), it immediately throws an InterruptedException because the interrupt status has already been set.

That would probably be an HDFS bug, but it seems to at least fit the pattern of what we see here, and it is something we could at least check for.
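
To illustrate the suspected mechanism with plain JDK classes only (no HDFS or Spark code involved): if a thread's interrupt status is already set when it calls `LinkedBlockingQueue.take()`, the call throws `InterruptedException` even when an element is available. The class and method names below are hypothetical, and setting the flag directly stands in for whatever an earlier `flush()` might have done.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo of a stale interrupt flag surfacing in a later take() call.
public class StaleInterruptSketch {

    // Returns true if take() throws although the queue is non-empty.
    public static boolean takeThrowsWithStaleFlag() {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("event");
        // Stand-in for an earlier call that returned normally but left the flag set.
        Thread.currentThread().interrupt();
        try {
            queue.take();
            return false;
        } catch (InterruptedException e) {
            return true;
        } finally {
            // Make sure no interrupt status leaks out of this demo.
            Thread.interrupted();
        }
    }

    // Checking for (and clearing) the flag before take() lets the element through.
    public static String takeAfterClearingFlag() throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("event");
        Thread.currentThread().interrupt();
        boolean wasInterrupted = Thread.interrupted(); // reads and clears the flag
        if (wasInterrupted) {
            System.out.println("stale interrupt status detected before take()");
        }
        return queue.take();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("take() threw: " + takeThrowsWithStaleFlag());
        System.out.println("took: " + takeAfterClearingFlag());
    }
}
```

Whether clearing the flag is the right response is a separate question; the second method only shows where such a check could sit.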

@steveloughran do you have any idea how this interrupt from DataStreamer is getting back to the spark event log writer?

wangshuo128 · 2019-11-28T06:42:54Z

> Do you know what version of Hadoop you are on?

My Hadoop version is 2.7.1.

steveloughran · 2019-11-29T14:30:16Z

> @steveloughran do you have any idea how this interrupt from DataStreamer is getting back to the spark event log writer?

Errors terminating the DataStreamer thread are caught and then escalated to whichever API call next uses the instance. Usually they are IO problems considered non-recoverable.

I don't know HDFS internals, but a look at the code hints that this happened due to failures to talk to any datanode. Or at least that's what the code is assuming: that any interrupt is a timeout in connections.

If you can show there's a problem happening on the 3.2.x libraries, you should be able to persuade someone (else!) to have a look at this.

Ngone51 · 2020-02-04T03:25:07Z

core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala

+      val listenerThread = Thread.currentThread()
+      new Thread(new Runnable {
+        override def run(): Unit = {
+          while (sleep) {
+            Thread.sleep(10)
+          }
+          listenerThread.interrupt()
+        }
+      }).start()


Does this mean that EventLoggingListener does something similar internally?

AmplabJenkins · 2020-04-02T13:27:22Z

Can one of the admins verify this patch?

github-actions · 2020-07-12T00:31:28Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Stop AsyncEventQueue when interrupted in dispatch

8df4c88

wangshuo128 force-pushed the event-queue branch from affa555 to 8df4c88 Compare November 26, 2019 07:45

dongjoon-hyun added SCHEDULER SPARK CORE labels Nov 27, 2019

wangshuo128 changed the title ~~[SPARK-24309][CORE][FOLLOWUP]Stop AsyncEventQueue when interrupted in dispatch~~ [SPARK-30059][CORE]Stop AsyncEventQueue when interrupted in dispatch Nov 27, 2019

Ngone51 reviewed Feb 4, 2020

github-actions bot added the Stale label Jul 12, 2020

github-actions bot closed this Jul 13, 2020

6 participants
