[SPARK-2244] Fix hang introduced by SPARK-1466 #1197
Conversation
The fix to SPARK-1466 (sha 3870248) opens a buffer for stderr, but does not drain it under normal operation. The result is an eventual hang during IPC. The fix here is to close stderr after it is no longer used. Related, but not addressed here, SPARK-1466 also removes stderr from the console in the pyspark shell. It should be reintroduced with a -verbose option.
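For illustration, the failure mode is the classic never-drained pipe. A minimal Python sketch of the pattern, with a made-up launcher command standing in for the real gateway startup (this is not Spark's actual code), and the close() call showing the shape of the fix proposed in this PR:

```python
import subprocess
from subprocess import PIPE

# Buggy shape (roughly what SPARK-1466 introduced): stderr is piped but never
# read. Once the child has written enough to stderr to fill the OS pipe
# buffer, its next write blocks, and the parent eventually hangs during IPC.
proc = subprocess.Popen(
    ["spark-class", "py4j.GatewayServer"],  # illustrative command, not Spark's real invocation
    stdout=PIPE,
    stderr=PIPE,
)

# Read what the parent actually needs from stdout (e.g. the gateway port).
port = int(proc.stdout.readline())

# The fix here: close stderr once it is no longer used. Further writes by the
# child then fail (EPIPE) instead of blocking on a full, never-drained buffer.
proc.stderr.close()
```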
Can one of the admins verify this patch?

Jenkins, test this please.

Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16098/
@andrewor14 @mattf Did you guys figure out which PR is the better way to solve this problem? (this one or #1178)
@rxin not yet - my current position is that the hang should be resolved independently of other changes (i.e. not in conjunction w/ a masked output change - keep the change simple and single purpose). for that reason i still prefer the simple close() solution. however, there is a case that @andrewor14 has mentioned that close() does not cover. i'd like to reproduce that case as well before making a final recommendation on approach.
@mattf, whether or not close() works out in the end, we still need to redirect all of Spark's logging to the console output. As long as we pass in
My PR is intended to be a hot fix anyway. The whole issue with reading the py4j port through
from what i can tell there are three issues here -

a. hang on simple job; reported as SPARK-2244 and SPARK-2242; root cause is stderr buffer deadlock

all should be addressed in isolation (andrewor14, the fact that your patch tries to address multiple concerns at the same time is why i'd prefer an alternative). i recommend -
The thing is pyspark is still broken even if we fix (a) but not (b). For example, if your driver cannot communicate with the master somehow, it normally prints warning messages like "Cannot connect to master". If Spark logging is masked, then running a simple job appears to hang with no indication of what went wrong. Since issues (a) and (b) are related and have a common simple fix, I think it makes sense to fix them both at once. I agree that (c) should be a new issue and is outside of the scope of this one. For now, I just want to make sure pyspark is not broken on master.
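As a rough sketch of the "redirect Spark's logging to the console" alternative being discussed (simplified, hypothetical launcher command; not the actual patch): if the child's stderr is simply inherited rather than piped, the warnings stay visible and there is no un-drained buffer to fill up:

```python
import subprocess
from subprocess import PIPE

# Sketch of the alternative: do not pipe stderr at all, so the gateway's
# stderr is inherited from the parent process and flows straight to the
# console. Spark's WARN/ERROR logging stays visible (issue b), and there is
# no un-drained pipe buffer left to cause the hang (issue a).
proc = subprocess.Popen(
    ["spark-class", "py4j.GatewayServer"],  # illustrative command, not Spark's real invocation
    stdout=PIPE,  # stdout is still piped so the gateway port can be read
)

port = int(proc.stdout.readline())
```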
i can simulate "Cannot connect to master" w/ pyspark --master spark://localhost:12345 (where 12345 is a bogus port). i see what you're seeing: no error, and an apparent hang on actions. however, it's not truly a hang - the action will error out with a visible exception, "Job aborted due to stage failure: All masters are unresponsive! Giving up." i'll agree that's not an ideal user experience, but it's different from the functional hang that is described in SPARK-2244 and SPARK-2242. imho - (a) is an urgent functional issue, (b) is not.
I'd argue that (b) is also an urgent issue. Yes, for the particular case of not being able to talk to the master, an exception is thrown (though after a long timeout). Now consider the case in which you can talk to the master, but there are no workers. In the normal case, a simple Spark job then repeatedly prints WARN messages while it waits. Now imagine all the WARN messages go away. This will continue indefinitely and is basically a hang. I think this is more serious than a user experience issue and deserves a hot fix.
that is pretty nasty indeed. imho, changes should be single purpose and urgent changes should be doubly so. i'm not arguing that (a) or (b) shouldn't be fixed, just that they should be handled separately, and if they're urgent (or HOT FIX) they should have their own jiras and commits.
@rxin how would you like to proceed?
@andrewor14 has already committed the other fix (that fixes the two separate issues) based on your feedback. I agree with you that the two issues deserve separate tickets, so we created two separate JIRA tickets. Can you verify that the hang you ran into in 2244 no longer applies in the master branch? If it works, do you mind closing this one? Thanks a lot for identifying this and helping out with the fix.
@rxin & @andrewor14 - i've confirmed that the patch for SPARK-2242 resolves SPARK-2244. thanks for working with me on this!
Thanks for confirming!