Conversation

@falaki (Contributor) commented Oct 13, 2016

What changes were proposed in this pull request?

This patch makes the RBackend connection timeout configurable by the user.

How was this patch tested?

N/A

SparkQA commented Oct 13, 2016

Test build #66915 has finished for PR 15471 at commit 9dde457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2016

Test build #66928 has finished for PR 15471 at commit d27233c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

would it be possible to pull "6000" into a constant so it could be found/changed in one shot?
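A minimal sketch of what pulling the magic number into one place could look like (the object and constant names below are illustrative, not necessarily those in the final patch, though the diff later does reference a SparkRDefaults object for similar defaults):

```scala
// Hypothetical defaults object so the magic number lives in exactly one place.
object SparkRDefaults {
  // Illustrative constant name; both the JVM side and the code that launches
  // the R side would refer to this instead of a hard-coded 6000.
  val DEFAULT_CONNECTION_TIMEOUT: Int = 6000
}
```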

@falaki falaki changed the title [SPARK-17919] Make timeout to RBackend configurable in SparkR [WIP][SPARK-17919] Make timeout to RBackend configurable in SparkR Oct 14, 2016
@falaki (Contributor, Author) commented Oct 14, 2016

@felixcheung that is a good suggestion. I will try to use a single constant.
I changed the label to WIP because something is still timing out the connection in my tests. Maybe very long timeouts are not properly implemented in R? As you may have noticed, I am setting the timeout on all the connections that are ever opened.

@shivaram I am going to experiment with a helper thread inside the main netty handler that sends heartbeats (-1 as the result); invokeJava() would then try reading again when it sees the heartbeat value. What do you think? I noticed we establish and keep a monitor connection as monitorConn but don't use it anywhere!? Am I right?

@shivaram (Contributor)

@falaki That's an interesting approach to try -- the other thing to try might be to use a separate connection to send back results, i.e. as soon as the JVM gets the request it returns OK or some such status, and then whenever the result is ready the JVM pushes data on the reader socket that is created in R. We might need to try a couple of options here to see which looks best.

BTW the monitorConn is used - it's used in cases where the JVM comes up first, say in YARN / spark-submit, and the JVM uses it to detect whether the R process has crashed.

@falaki (Contributor, Author) commented Oct 14, 2016

@shivaram thanks for the clarification.
I realized we were not setting a socket timeout on the Netty socket, so I added that as well.
I also introduced the heartbeat mechanism and tested it locally. Next I am going to test whether this works on a real workload.

SparkQA commented Oct 14, 2016

Test build #66997 has finished for PR 15471 at commit 630467e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 15, 2016

Test build #67001 has finished for PR 15471 at commit 3744623.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 15, 2016

Test build #67005 has finished for PR 15471 at commit 6f15a15.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki falaki changed the title [WIP][SPARK-17919] Make timeout to RBackend configurable in SparkR [SPARK-17919] Make timeout to RBackend configurable in SparkR Oct 17, 2016
@falaki (Contributor, Author) commented Oct 17, 2016

@shivaram this worked in my stress tests. The question is how to unit test it.

@yhuai (Contributor) commented Oct 17, 2016

test this please

@shivaram (Contributor)

@falaki Do we know if the test timeouts are due to this change? Or are they unrelated?

SparkQA commented Oct 17, 2016

Test build #67079 has finished for PR 15471 at commit 6f15a15.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki (Contributor, Author) commented Oct 17, 2016

@shivaram I think they are unrelated. Can you trigger another test?

@shivaram (Contributor)

Jenkins, retest this please

SparkQA commented Oct 18, 2016

Test build #67091 has finished for PR 15471 at commit 6f15a15.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

@falaki looks like the SparkR MLlib unit tests are timing out on Jenkins. Do they pass on your machine?

@falaki (Contributor, Author) commented Oct 19, 2016

@shivaram it was indeed my fault. I did not run the local tests after I added the heartbeat. I am now using +1 for the heartbeat.

SparkQA commented Oct 19, 2016

Test build #67153 has finished for PR 15471 at commit fcc3376.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment

Looks good aside from minor comments.
This goes to 2.1.0?

}
val conf = new SparkConf()
val heartBeatInterval = conf.getInt(
"spark.r.heartBeatInterval", SparkRDefaults.DEFAULT_HEARTBEAT_INTERVAL)
Member:

should this be documented too?

Contributor Author:

Done.

}
} else {
// To avoid timeouts when reading results in SparkR driver, we will be regularly sending
// heartbeat responses. We use special character -1 to signal the client that backend is
Member:

-1 -> +1?

returnStatus <- readInt(conn)
handleErrors(returnStatus, conn)

# Backend will send -1 as keep alive value to prevent various connection timeouts
Member:

-1 -> +1?

cause match {
case timeout: ReadTimeoutException =>
// Do nothing. We don't want to timeout on read
logInfo("Ignoring read timeout in RBackendHandler")
Member:

logWarning?

@falaki (Contributor, Author) commented Oct 20, 2016

Thanks @felixcheung, addressed your comments.

@felixcheung (Member)

LGTM

SparkQA commented Oct 20, 2016

Test build #67287 has finished for PR 15471 at commit 749c8b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

@falaki @felixcheung Since this is a big change I'd like to also take a look at this once - Will try to get to it tonight.

@felixcheung (Member)

sure. I think the target is 2.0.2 - it will be good to review this more closely.

@falaki (Contributor, Author) commented Oct 20, 2016

Thanks @shivaram. I ran a real workload consisting of long running parallel simulations that took about 3.5 hours. I also tested it by calling Sys.sleep() inside workers with dapply and spark.lapply. Would be good to torture it in new ways.

Also would be interesting to think about ways of unit-testing this.

@shivaram (Contributor) left a comment

Thanks @falaki -- This is a very useful change. I just did a pass of comments.

# Worker daemon

rLibDir <- Sys.getenv("SPARKR_RLIBDIR")
connectionTimeout <- Sys.getenv("SPARKR_BACKEND_CONNECTION_TIMEOUT")
Contributor:

We should take a default value here and in worker.R as well - 6000 is fine to use as default everywhere

bootElap <- elapsedSecs()

rLibDir <- Sys.getenv("SPARKR_RLIBDIR")
connectionTimeout <- Sys.getenv("SPARKR_BACKEND_CONNECTION_TIMEOUT")
Contributor:

Default value here as well

// Connection timeout is set by socket client. To make it configurable we will pass the
// timeout value to client inside the temp file
val conf = new SparkConf()
val backendConnectionTimeout = conf.getInt(
Contributor:

Are we sure SparkConf has been initialized successfully at this point? Or, to put it another way, in which cases does this code path get called? Is this in the spark-submit case, the shell, etc.?

Contributor Author:

This is for spark-submit. Basically the JVM starts before the R process, so the only way for the R process to get these configuration parameters is from the JVM; in this case, RBackend sets environment variables based on the configs.

For the other mode, where the JVM is started after the R process, we send this timeout value through the TCP connection.

At least that is my current understanding of how the deploy modes work. In our production environment we launch the R process from the JVM.
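As a rough illustration of the spark-submit direction (not the actual RRunner code; the script name and literal timeout value are placeholders, while the SPARKR_BACKEND_CONNECTION_TIMEOUT variable name matches the one used in this diff), the JVM can only hand the value to R through the child process environment:

```scala
import java.io.File
import scala.sys.process.Process

// Illustrative sketch: in the spark-submit case the JVM launches the R process,
// so configuration such as the connection timeout travels through the child's
// environment rather than over the socket.
object LaunchRSketch {
  def main(args: Array[String]): Unit = {
    val backendConnectionTimeout = 6000  // placeholder; would come from SparkConf
    val builder = Process(
      Seq("Rscript", "sparkr_script.R"),  // placeholder script name
      Option.empty[File],                 // inherit the working directory
      "SPARKR_BACKEND_CONNECTION_TIMEOUT" -> backendConnectionTimeout.toString)
    builder.run()  // the R side reads the value back with Sys.getenv()
  }
}
```

On the R side the value is then read back with Sys.getenv("SPARKR_BACKEND_CONNECTION_TIMEOUT"), as the worker and daemon scripts in this diff do.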

var rCommand = sparkConf.get("spark.sparkr.r.command", "Rscript")
rCommand = sparkConf.get("spark.r.command", rCommand)

val rConnectionTimeout = SparkEnv.get.conf.getInt(
Contributor:

Can we just use sparkConf.get, similar to the line above?

// To avoid timeouts when reading results in SparkR driver, we will be regularly sending
// heartbeat responses. We use special code +1 to signal the client that backend is
// alive and it should continue blocking for result.
val execService = ThreadUtils.newDaemonSingleThreadScheduledExecutor("SparkRKeepAliveThread")
Contributor:

I'm not sure how expensive it is to create and destroy an executor service each time. Can we just schedule at a fixed rate when we get the request and then cancel the scheduled task at the end of the request?

Contributor Author:

I was not sure about this either. I used this method based on advice from @zsxwing.

Contributor:

Hmm - my question on whether we can reuse this still stands. @zsxwing do you think that's possible?

Member:

@shivaram we can reuse it. scheduleAtFixedRate returns a ScheduledFuture, which can be used to cancel the task. However, there is no awaitTermination for a ScheduledFuture after cancelling it, so we would need some extra work.

Contributor:

I took a closer look at this and it looks like the executor just calls cancel on the tasks during shutdown, so that part of the behavior is the same as calling cancel on the task we have [1]. But you are right that if we wanted to wait for termination we'd need to do some extra work. We could use the get call, but it's unclear what the semantics of that are. It might be easier to just set up a semaphore or mutex that is shared by the runnable and the outside thread.

But overall it looks like thread pool creation is only around 100 microseconds [2], and I also benchmarked this locally.

[1] http://www.docjar.com/html/api/java/util/concurrent/ScheduledThreadPoolExecutor.java.html line 367
[2] http://stackoverflow.com/a/5483467/4577954
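For concreteness, here is a rough sketch of the "reuse the scheduler, cancel the task per request" idea discussed above. This is not the actual RBackendHandler code: the names are made up, and a shared lock stands in for the missing awaitTermination so that a heartbeat write cannot interleave with the final result write.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.locks.ReentrantLock

// Illustrative only: schedule a heartbeat task when a request arrives, cancel
// it once the result is ready, and use a shared lock so the result write waits
// for any heartbeat write that is still in flight. (Error handling elided.)
object KeepAliveSketch {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private val writeLock = new ReentrantLock()

  def runWithHeartbeats[T](intervalMs: Long)(sendHeartbeat: () => Unit)
                          (compute: () => T)(writeResult: T => Unit): Unit = {
    val task = scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        writeLock.lock()
        try sendHeartbeat() finally writeLock.unlock()
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS)

    val result = compute()   // the potentially long-running method invocation
    task.cancel(false)       // no further heartbeat runs will be started
    writeLock.lock()         // wait out a heartbeat that is mid-write, if any
    try writeResult(result) finally writeLock.unlock()
  }
}
```

Whether this buys anything over recreating the executor per request comes down to the roughly 100-microsecond thread pool creation cost measured above.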

@falaki (Contributor, Author) commented Oct 26, 2016

@shivaram sorry for the delay getting back to this. Please take another look.

SparkQA commented Oct 27, 2016

Test build #67602 has finished for PR 15471 at commit 666b609.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Oct 27, 2016

retest this please

SparkQA commented Oct 27, 2016

Test build #67661 has finished for PR 15471 at commit 666b609.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki (Contributor, Author) commented Oct 27, 2016

retest this please

SparkQA commented Oct 27, 2016

Test build #67667 has finished for PR 15471 at commit 666b609.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki (Contributor, Author) commented Oct 29, 2016

@shivaram is there a chance this makes it to the 2.0.2 release?

@shivaram (Contributor)

Taking another look now

@shivaram (Contributor)

@falaki The code change looks pretty good to me, but I'm still a bit worried about introducing a big change in a minor release. Can we have this disabled by default and flip the flag only in the master branch?

@HyukjinKwon Is there any way to retrigger the AppVeyor build?

@HyukjinKwon (Member) commented Oct 29, 2016

Build started: [SparkR] ALL PR-15471
Diff: master...spark-test:1175779D-A053-45AF-BC6C-EA34931CFC37

@shivaram I first thought committers could access Apache's AppVeyor account, but it seems not - we have to go to the Web UI and click the rebuild button. So I made (locally) a bunch of scripts to launch a build via the @spark-test account for such cases. Please cc me (treat me like a bot) and I will leave comments like the one above until we find a better way to do it.

@falaki (Contributor, Author) commented Oct 29, 2016

@shivaram that is fine. We can merge it to 2.1 (or whatever the next major release is going to be).

@shivaram (Contributor)

Thanks @HyukjinKwon - the AppVeyor tests seem to pass as well.

The change LGTM to merge to master. @felixcheung any other comments?

@felixcheung (Member)

LGTM.

@felixcheung (Member)

merged to master.

@asfgit closed this in 2881a2d Oct 30, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
## What changes were proposed in this pull request?

This patch makes RBackend connection timeout configurable by user.

## How was this patch tested?
N/A

Author: Hossein <[email protected]>

Closes apache#15471 from falaki/SPARK-17919.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

This patch makes RBackend connection timeout configurable by user.

## How was this patch tested?
N/A

Author: Hossein <[email protected]>

Closes apache#15471 from falaki/SPARK-17919.

# Backend will send -1 as keep alive value to prevent various connection timeouts
# on very long running jobs. See spark.r.heartBeatInterval
while (returnStatus == 1) {

Shouldn't there be a retry limit on the returnStatus check to avoid an infinite loop?

I hit an infinite loop when this is called by Toree's sparkr_runner.R, with the error message "Failed to connect JVM: Error in socketConnection(host = hostname, port = port, server = FALSE, : argument "timeout" is missing, with no default"

Contributor:

@falaki @felixcheung any thoughts on this?

Member:

+1, I think it's a good idea to avoid an infinite loop in general.
How is Toree calling this?
Could you open a JIRA?
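As a sketch of the suggested retry limit (illustrative only; the real loop lives in SparkR's invokeJava, and the cap chosen here is an assumption, while the keep-alive marker value 1 matches the loop condition quoted above), bounding the number of consecutive keep-alive markers turns a hung backend into an error instead of an infinite loop:

```scala
import java.io.{DataInputStream, IOException}

object BoundedKeepAliveSketch {
  // Reads status codes until a non-keep-alive value arrives, but gives up
  // after maxKeepAlives consecutive keep-alive markers instead of spinning forever.
  def readReturnStatus(in: DataInputStream, maxKeepAlives: Int = 10000): Int = {
    var status = in.readInt()
    var seen = 0
    while (status == 1) {          // 1 is the keep-alive marker in the quoted loop
      seen += 1
      if (seen > maxKeepAlives) {
        throw new IOException(s"No result after $maxKeepAlives keep-alive messages")
      }
      status = in.readInt()
    }
    status
  }
}
```

Whatever the exact cap, failing loudly seems preferable to the hang described above.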
