Conversation

@mattf

@mattf mattf commented Jun 24, 2014

The fix to SPARK-1466 (sha 3870248) opens a buffer for stderr, but
does not drain it under normal operation. The result is an eventual
hang during IPC.

The fix here is to close stderr after it is no longer used.

Related, but not addressed here, SPARK-1466 also removes stderr from
the console in the pyspark shell. It should be reintroduced with a
-verbose option.
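
For illustration, a minimal self-contained sketch of the failure mode (a simulated child process and a made-up port number, not the actual java_gateway.py code):

import sys
from subprocess import Popen, PIPE

# The child announces a "port" on stdout and then floods stderr, standing in
# for the JVM gateway and its log output.
child = Popen([sys.executable, "-u", "-c",
               "import sys; print(12345); "             # the port handshake on stdout
               "sys.stderr.write('x' * (1 << 20))"],    # ~1 MB of unread 'log' output
              stdout=PIPE, stderr=PIPE)

port = int(child.stdout.readline())   # works: reads 12345
# The child is now blocked writing to the full, unread stderr pipe, so anything
# that waits on it from here on (the py4j IPC in pyspark's case) never returns.

# The fix proposed here: close stderr once it is no longer used, so the child's
# log writes can no longer back up behind a pipe that nobody reads.
child.stderr.close()
child.wait()   # the simulated child's blocked write now fails and it exits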

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16098/

@rxin
Contributor

rxin commented Jun 25, 2014

@andrewor14 @mattf Did you guys figure out which PR is the better way to solve this problem? (this one or #1178)

@mattf
Author

mattf commented Jun 25, 2014

@rxin not yet -

my current position is that the hang should be resolved independently of other changes (i.e. not in conjunction w/ a masked output change - keep the change simple and single purpose). for that reason i still prefer the simple close() solution.

however, there is a case that @andrewor14 has mentioned that close() does not cover. i'd like to reproduce that case as well before making a final recommendation on approach.

@andrewor14
Contributor

@mattf, whether or not close() works out in the end, we still need to redirect all of Spark's logging to the console output. As long as we pass stderr=PIPE to subprocess, it will swallow all of this. Part of my PR is to fix that.
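
As a sketch of that point (simulated child and log line, not the actual change in #1178): if stderr is simply not piped, the child inherits the parent's stderr and its log output still reaches the console.

import sys
from subprocess import Popen, PIPE

child = Popen([sys.executable, "-u", "-c",
               "import sys; print(12345); "
               "sys.stderr.write('INFO this line still reaches the console\\n')"],
              stdout=PIPE)               # note: no stderr=PIPE, so stderr is inherited
port = int(child.stdout.readline())      # the stdout port handshake still works
child.wait()
print("gateway port: %d" % port)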

@andrewor14
Contributor

My PR is intended to be a hot fix anyway. The whole issue with reading the py4j port through stdout is hacky and prone to interference from output of other scripts. If you would like to, you are welcome to submit a patch for the longer term solution.
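
A self-contained illustration of that fragility (hypothetical child, not Spark code): the parent assumes the first line on stdout is the port, so any other output that gets there first breaks the handshake.

import sys
from subprocess import Popen, PIPE

child = Popen([sys.executable, "-u", "-c",
               "print('unrelated output from some other script'); print(12345)"],
              stdout=PIPE)
first_line = child.stdout.readline()
try:
    port = int(first_line)
    print("gateway port: %d" % port)
except ValueError:
    print("could not parse a port from %r" % first_line)   # what happens here
child.wait()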

@mattf
Author

mattf commented Jun 25, 2014

@rxin & @andrewor14

from what i can tell there are three issues here -

a. hang on simple job; reported as SPARK-2244 and SPARK-2242; root cause is stderr buffer deadlock
b. masked output from shell subprocess; introduced by SPARK-1466; root cause is lack of pass through for stderr
c. fragile port passing between child and parent in pyspark

all should be addressed in isolation (andrewor14, the fact that your patch tries to address multiple concerns at the same time is why i'd prefer an alternative).

i recommend -
. first, fix (a) w/ close() and resolve both SPARK-2242 and SPARK-2244
. second, file a bug for (b) and address it w/ enhanced exception handling based on the current SPARK-2242 patch
. third, file a new bug for (c) with a solution that is yet to be determined

@andrewor14
Contributor

The thing is, pyspark is still broken even if we fix (a) but not (b). For example, if your driver cannot communicate with the master somehow, it normally prints warning messages like "Cannot connect to master". If Spark logging is masked, then running sc.parallelize in this case still hangs without any output. This is actually the case I personally ran into in the first place.

Since issues (a) and (b) are related and have a common simple fix, I think it makes sense to fix them both at once. I agree that (c) should be a new issue and is outside the scope of this one. For now, I just want to make sure pyspark is not broken on master.

@mattf
Author

mattf commented Jun 25, 2014

i can simulate "Cannot connect to master" w/ pyspark --master spark://localhost:12345 (where 12345 is a bogus port)

i see what you're seeing: no error and an apparent hang on actions. however, it's not truly a hang - the action will eventually error out with a visible exception, "Job aborted due to stage failure: All masters are unresponsive! Giving up." i'll agree that's not an ideal user experience, but it's different from the functional hang described in SPARK-2244 and SPARK-2242.

imho - (a) is an urgent functional issue, (b) is not.

@andrewor14
Contributor

I'd argue that (b) is also an urgent issue.

Yes, for the particular case of not being able to talk to the master, an exception is thrown (though after a long timeout). Now consider the case in which you can talk to the master, but there are no workers. Then in the normal case, a simple Spark job leads to the following:

>>> sc.parallelize(range(100)).count()
14/06/25 11:26:12 INFO SparkContext: Starting job: count at <stdin>:1
14/06/25 11:26:12 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
14/06/25 11:26:12 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
14/06/25 11:26:12 INFO DAGScheduler: Parents of final stage: List()
14/06/25 11:26:12 INFO DAGScheduler: Missing parents: List()
14/06/25 11:26:12 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:40), which has no missing parents
14/06/25 11:26:12 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:40)
14/06/25 11:26:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/06/25 11:26:27 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/06/25 11:26:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/06/25 11:26:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...

Now imagine all the WARN messages go away. This will continue indefinitely and is basically a hang. I think this is more serious than a user experience issue and deserves a hot fix.

@mattf
Author

mattf commented Jun 25, 2014

that is pretty nasty indeed.

imho, changes should be single purpose and urgent changes should be doubly so.

i'm not arguing that (a) or (b) shouldn't be fixed, just that they should be handled separately, and if they're urgent (or HOT FIX) they should have their own jiras and commits.

@mattf
Author

mattf commented Jun 26, 2014

@rxin how would you like to proceed?

@rxin
Contributor

rxin commented Jun 26, 2014

@mattf,

@andrewor14 has already committed the other fix (that fixes the two separate issues) based on your feedback. I agree with you that the two issues deserve separate tickets, so we created the following two JIRA tickets:
https://issues.apache.org/jira/browse/SPARK-2242 (this is the original one - which duplicates https://issues.apache.org/jira/browse/SPARK-2244 too)
and
https://issues.apache.org/jira/browse/SPARK-2300

Can you verify that the hang you ran into in SPARK-2244 no longer applies in the master branch? If it works, do you mind closing this one? Thanks a lot for identifying this and helping out with the fix.

@mattf
Author

mattf commented Jun 27, 2014

@rxin & @andrewor14 - i've confirmed that the patch for SPARK-2242 resolves SPARK-2244. thanks for working with me on this!

@mattf mattf closed this Jun 27, 2014
@rxin
Contributor

rxin commented Jun 27, 2014

Thanks for confirming!

@mattf mattf deleted the SPARK-2244 branch September 6, 2014 19:01