
Conversation

@davies
Contributor

@davies davies commented Jul 30, 2014

Kill only the python worker related to cancelled tasks.

The daemon will start a background thread to monitor the open sockets of all workers. If a socket is closed by the JVM, this thread will kill the corresponding worker.

When a task is cancelled, the socket to its worker will be closed, and the worker will then be killed by the daemon.
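A minimal sketch of this monitoring approach (hypothetical names like `worker_socks` are illustrative, not Spark's actual daemon.py code):

```python
import os
import select
import signal
import threading
import time

# Hypothetical registry: maps each worker's socket to the worker's PID.
worker_socks = {}

def monitor_worker_sockets():
    while True:
        for sock, pid in list(worker_socks.items()):
            # A closed peer makes the socket readable with 0 bytes pending.
            readable, _, _ = select.select([sock], [], [], 0)
            if not readable:
                continue
            try:
                data = sock.recv(1)
            except OSError:
                data = b''
            if data == b'':  # the JVM closed its end of the socket
                os.kill(pid, signal.SIGKILL)
                del worker_socks[sock]
        time.sleep(0.1)  # poll at a low rate to keep overhead negligible

threading.Thread(target=monitor_worker_sockets, daemon=True).start()
```

Polling with a short sleep keeps the thread cheap while still killing orphaned workers within a fraction of a second of the JVM closing the socket.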

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17395/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17395/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull

Contributor

Does sock.recv return 0 itself?

Contributor

Are there any concerns (performance or otherwise) related to two different processes polling on the same file descriptor?

Contributor Author

If an exception is raised, n is not defined.

This thread sleeps 0.1 seconds after every poll, so the overhead will be low; it should not affect another process reading the socket.

Contributor

I was wondering if the kernel does anything special to tie sockets to processes (similar to Java lock biasing).

For the n thing, I was asking why we bother reading n: why not just set something like socketClosed = False, which we set to True in the except clause, rather than reusing this variable n?

Contributor Author

In most cases, there is no exception; recv() will return 0.
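For reference, a minimal illustration of this behavior: when the peer closes its end of a connected socket, recv() returns an empty byte string (0 bytes) rather than raising an exception; exceptions occur mainly on an abortive reset.

```python
import socket

# When the peer closes cleanly, recv() signals EOF with b'' (0 bytes).
a, b = socket.socketpair()
b.close()
assert a.recv(1024) == b''  # EOF, not an exception
a.close()
```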

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17457/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17457/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17493/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17493/consoleFull

@JoshRosen
Contributor

Would it be possible to store the PIDs of the workers inside of PythonWorkerFactory and directly kill the workers via SIGTERM? Or send a command to the Python daemon and have it kill the workers? That seems like it would be less complex than this socket-polling approach.

@davies
Contributor Author

davies commented Jul 31, 2014

I had tried these two approaches. We have no easy way to get the PIDs of workers, because it's platform dependent. Sending a command to the Python daemon would also need the identity of the worker, so it would need another channel to send commands.

If there are exceptions in the Java reading/writing threads, they may lead to closing the socket, so we still need this kind of method to kill the worker when the socket is closed.

@JoshRosen
Contributor

I don't think we have platform-dependent problems using PIDs, since Windows doesn't use the daemon to launch its PySpark workers; we only have to support Unix platforms in this code.

@JoshRosen
Contributor

I think that some of the existing daemon.py code is unnecessarily complicated and confusing (see some of my comments at mesos/spark#563), so I'd like to clean things up before we add any more code. I'll have a patch for this cleanup by tonight or tomorrow, so it would probably be best if you held off from working on this for a day or so.

One note: I'm a bit wary of the mixture of fork() and Thread in the same daemon's forked process; is this safe?

@davies
Contributor Author

davies commented Jul 31, 2014

Good question: it's dangerous to mix threads and fork(), since it may cause a deadlock in the child process. But in this case, because of the GIL, when fork() happens the monitor thread is blocked, sleeping, or polling, which are all safe points, so it should not be a problem.
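A small experiment supporting this: in CPython, fork() copies only the calling thread into the child, so a monitor thread like the daemon's keeps running in the parent only (a sketch with illustrative names, not Spark's code):

```python
import os
import threading
import time

# Background thread standing in for the daemon's monitor thread.
def monitor():
    while True:
        time.sleep(0.1)

threading.Thread(target=monitor, daemon=True).start()

pid = os.fork()
if pid == 0:
    # Child: only the forking (main) thread survives the fork.
    os._exit(0 if threading.active_count() == 1 else 1)
_, status = os.waitpid(pid, 0)
```

The remaining danger is locks held by other threads at fork time, which stay locked forever in the child; a thread that is merely sleeping or polling holds no such lock.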

@davies
Contributor Author

davies commented Jul 31, 2014

I will wait for your patch, and think about using PIDs.

@davies
Contributor Author

davies commented Aug 2, 2014

@JoshRosen I have redone this PR based on your cleanup, please review again.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17751/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17751/consoleFull

Contributor

I think that os.fork() already handles negative return values by throwing OSError, so I think this else block is dead code: https://docs.python.org/2/library/os.html#os.fork
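For reference: os.fork() either returns a child PID (in the parent), 0 (in the child), or raises OSError on failure; it never returns a negative value, so a negative-PID branch is unreachable.

```python
import os

# fork() never returns a negative value; a failed fork raises OSError,
# so there is no negative return value to check for.
try:
    pid = os.fork()
except OSError:
    pid = None  # fork failed
else:
    if pid == 0:
        os._exit(0)      # child exits immediately
    os.waitpid(pid, 0)   # parent reaps the child
```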

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17775/consoleFull

Contributor

The other accesses of daemonWorkers are guarded by synchronized blocks; does this access also need synchronization? It looks like calls to stopWorker() only occur from destroyPythonWorker(), which is synchronized using the SparkEnv object, but that's a different lock. To be on the safe side, we should probably add synchronized here unless there's a good reason not to.

Contributor

Actually, I think the current synchronization is fine: every call of PythonWorkerFactory's public methods is guarded by SparkEnv's lock.

@JoshRosen
Contributor

This looks good overall and I'd say it's ready to merge once we address my last two comments.

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17775/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17787/consoleFull

@davies
Contributor Author

davies commented Aug 2, 2014

I have fixed several bugs and improved the kill approach; it has unit tests now.

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17787/consoleFull

@JoshRosen
Contributor

Thanks for updating this! In my earlier review, I had overlooked that the kill involved forking the JVM; your new approach of having the daemon kill the workers is much better.

The test case looks good, too (clever use of Python's for ... else construct; I hadn't seen that before).

In #1680, there was some discussion over whether to use SIGKILL vs. SIGHUP to kill the Python workers. Now that I've had more time to think about it, I think SIGKILL is a fine approach:

  • Spark doesn't provide any documented, user-facing mechanisms for allowing tasks to perform cleanup work when they're cancelled.
  • The only case where it might make sense to have a cleanup mechanism is when performing side-effects, such as writing to an external database. A machine could immediately lose power or otherwise fail without executing the cleanup mechanism, so users already have to guard against the effects of immediate failures (e.g. by using transactions).
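The for ... else construct mentioned above runs its else block only when the loop completes without hitting break, which makes it a tidy pattern for search-style checks in tests (an illustrative sketch, not the actual test code):

```python
def find_worker(pids, target):
    # The else clause runs only if the loop finished without `break`.
    for pid in pids:
        if pid == target:
            break
    else:
        return None  # target never found
    return pid
```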

@JoshRosen
Contributor

I've merged this into master and branch-1.1. Thanks!

asfgit pushed a commit that referenced this pull request Aug 3, 2014
Kill only the python worker related to cancelled tasks.

The daemon will start a background thread to monitor the open sockets of all workers. If a socket is closed by the JVM, this thread will kill the corresponding worker.

When a task is cancelled, the socket to its worker will be closed, and the worker will then be killed by the daemon.

Author: Davies Liu <[email protected]>

Closes #1643 from davies/kill and squashes the following commits:

8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
46ca150 [Davies Liu] address comment
acd751c [Davies Liu] kill the worker when task is canceled

(cherry picked from commit 55349f9)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 55349f9 Aug 3, 2014
@davies
Contributor Author

davies commented Aug 4, 2014

@JoshRosen Thanks for reviewing this; your comments helped me a lot.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@davies davies deleted the kill branch September 15, 2014 22:18
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…metric (apache#1643)

### What changes were proposed in this pull request?

This patch updates how `SQLMetric` merges two invalid instances whose values are both -1.

### Why are the changes needed?

We use -1 as the initial value of `SQLMetric`, and change it to 0 while merging with other `SQLMetric` instances. A `SQLMetric` with value -1 will be treated as invalid and filtered out later.

While developing with Spark, it is troublesome that two invalid `SQLMetric` instances merge into a valid `SQLMetric`, because merging sets the value to 0.
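The intended merge semantics can be sketched as follows (a hypothetical Python model of the behavior described here, not the actual Scala `SQLMetric` code):

```python
INVALID = -1  # initial/invalid value of a metric

def merge(a, b):
    # Two invalid metrics stay invalid; otherwise an invalid side
    # contributes 0 and the values are summed.
    if a == INVALID and b == INVALID:
        return INVALID
    return max(a, 0) + max(b, 0)
```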

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes apache#38969 from viirya/minor_sql_metrics.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>