Conversation

@mattf

@mattf mattf commented Jun 24, 2014

The fix to SPARK-1466 (sha 3870248) opens a buffer for stderr, but
does not drain it under normal operation. The result is an eventual
hang during IPC.

The fix here is to close stderr after it is no longer used.

Related, but not addressed here, SPARK-1466 also removes stderr from
the console in the pyspark shell. It should be reintroduced with a
-verbose option.
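
For illustration, a minimal self-contained sketch of the failure mode (a simulated child process and a made-up port number, not the actual java_gateway.py code):

import sys
from subprocess import Popen, PIPE

# The child announces a "port" on stdout and then floods stderr, standing in
# for the JVM gateway and its log output.
child = Popen([sys.executable, "-u", "-c",
               "import sys; print(12345); "             # the port handshake on stdout
               "sys.stderr.write('x' * (1 << 20))"],    # ~1 MB of unread 'log' output
              stdout=PIPE, stderr=PIPE)

port = int(child.stdout.readline())   # works: reads 12345
# The child is now blocked writing to the full, unread stderr pipe, so anything
# that waits on it from here on (the py4j IPC in pyspark's case) never returns.

# The fix proposed here: close stderr once it is no longer used, so the child's
# log writes can no longer back up behind a pipe that nobody reads.
child.stderr.close()
child.wait()   # the simulated child's blocked write now fails and it exits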

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16098/

@rxin
Contributor

rxin commented Jun 25, 2014

@andrewor14 @mattf Did you guys figure out which PR is the better way to solve this problem? (this one or #1178)

@mattf
Author

mattf commented Jun 25, 2014

@rxin not yet -

my current position is that the hang should be resolved independently of other changes (i.e. not in conjunction w/ a masked output change - keep the change simple and single purpose). for that reason i still prefer the simple close() solution.

however, there is a case that @andrewor14 has mentioned that close() does not cover. i'd like to reproduce that case as well before making a final recommendation on approach.

@andrewor14
Contributor

@mattf, whether or not close() works out in the end, we still need to redirect all of Spark's logging to the console output. As long as we pass stderr=PIPE to subprocess, it will swallow all of this. Part of my PR is to fix that.
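
As a sketch of that point (simulated child and log line, not the actual change in #1178): if stderr is simply not piped, the child inherits the parent's stderr and its log output still reaches the console.

import sys
from subprocess import Popen, PIPE

child = Popen([sys.executable, "-u", "-c",
               "import sys; print(12345); "
               "sys.stderr.write('INFO this line still reaches the console\\n')"],
              stdout=PIPE)               # note: no stderr=PIPE, so stderr is inherited
port = int(child.stdout.readline())      # the stdout port handshake still works
child.wait()
print("gateway port: %d" % port)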

@andrewor14
Contributor

My PR is intended to be a hot fix anyway. The whole issue with reading the py4j port through stdout is hacky and prone to interference from output of other scripts. If you would like to, you are welcome to submit a patch for the longer term solution.
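
A self-contained illustration of that fragility (hypothetical child, not Spark code): the parent assumes the first line on stdout is the port, so any other output that gets there first breaks the handshake.

import sys
from subprocess import Popen, PIPE

child = Popen([sys.executable, "-u", "-c",
               "print('unrelated output from some other script'); print(12345)"],
              stdout=PIPE)
first_line = child.stdout.readline()
try:
    port = int(first_line)
    print("gateway port: %d" % port)
except ValueError:
    print("could not parse a port from %r" % first_line)   # what happens here
child.wait()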

@mattf
Author

mattf commented Jun 25, 2014

@rxin & @andrewor14

from what i can tell there are three issues here -

a. hang on simple job; reported as SPARK-2244 and SPARK-2242; root cause is stderr buffer deadlock
b. masked output from shell subprocess; introduced by SPARK-1466; root cause is lack of pass through for stderr
c. fragile port passing between child and parent in pyspark

all should be addressed in isolation (andrewor14, the fact that your patch tries to address multiple concerns at the same time is why i'd prefer an alternative).

i recommend -
. first, fix (a) w/ close() and resolve both SPARK-2242 and SPARK-2244
. second, file a bug for (b) and address it w/ enhanced exception handling based on the current SPARK-2242 patch
. third, file a new bug for (c) with a solution that is yet to be determined

@andrewor14
Contributor

The thing is, pyspark is still broken even if we fix (a) but not (b). For example, if your driver cannot communicate with the master somehow, it normally prints warning messages like "Cannot connect to master". If Spark logging is masked, then running sc.parallelize in this case still hangs without any output. This is actually the case I personally ran into in the first place.

Since issues (a) and (b) are related and have a common simple fix, I think it makes sense to fix them both at once. I agree that (c) should be a new issue and is outside the scope of this one. For now, I just want to make sure pyspark is not broken on master.

@mattf
Author

mattf commented Jun 25, 2014

i can simulate "Cannot connect to master" w/ pyspark --master spark://localhost:12345 (where 12345 is a bogus port)

i see what you're seeing: no error and an apparent hang on actions. however, it's not truly a hang - the action will eventually error out with a visible exception, "Job aborted due to stage failure: All masters are unresponsive! Giving up." i'll agree that's not an ideal user experience, but it's different from the functional hang described in SPARK-2244 and SPARK-2242.

imho - (a) is an urgent functional issue, (b) is not.

@andrewor14
Contributor

I'd argue that (b) is also an urgent issue.

Yes, for the particular case of not being able to talk to the master, an exception is thrown (though after a long timeout). Now consider the case in which you can talk to the master, but there are no workers. Then in the normal case, a simple Spark job leads to the following:

>>> sc.parallelize(range(100)).count()
14/06/25 11:26:12 INFO SparkContext: Starting job: count at <stdin>:1
14/06/25 11:26:12 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
14/06/25 11:26:12 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
14/06/25 11:26:12 INFO DAGScheduler: Parents of final stage: List()
14/06/25 11:26:12 INFO DAGScheduler: Missing parents: List()
14/06/25 11:26:12 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:40), which has no missing parents
14/06/25 11:26:12 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:40)
14/06/25 11:26:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/06/25 11:26:27 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/06/25 11:26:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/06/25 11:26:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...

Now imagine all the WARN messages go away. This will continue indefinitely and is basically a hang. I think this is more serious than a user experience issue and deserves a hot fix.

@mattf
Author

mattf commented Jun 25, 2014

that is pretty nasty indeed.

imho, changes should be single purpose and urgent changes should be doubly so.

i'm not arguing that (a) or (b) shouldn't be fixed, just that they should be handled separately, and if they're urgent (or HOT FIX) they should have their own jiras and commits.

@mattf
Author

mattf commented Jun 26, 2014

@rxin how would you like to proceed?

@rxin
Contributor

rxin commented Jun 26, 2014

@mattf,

@andrewor14 has already committed the other fix (that fixes the two separate issues) based on your feedback. I agree with you that the two issues deserve separate tickets, so we created the following two JIRA tickets:
https://issues.apache.org/jira/browse/SPARK-2242 (this is the original one - which duplicates https://issues.apache.org/jira/browse/SPARK-2244 too)
and
https://issues.apache.org/jira/browse/SPARK-2300

Can you verify that the hang you ran into in SPARK-2244 no longer applies in the master branch? If it works, do you mind closing this one? Thanks a lot for identifying this and helping out with the fix.

@mattf
Author

mattf commented Jun 27, 2014

@rxin & @andrewor14 - i've confirmed that the patch for SPARK-2242 resolves SPARK-2244. thanks for working with me on this!

@mattf mattf closed this Jun 27, 2014
@rxin
Contributor

rxin commented Jun 27, 2014

Thanks for confirming!

@mattf mattf deleted the SPARK-2244 branch September 6, 2014 19:01