
Conversation

@davies
Contributor

@davies davies commented Jul 30, 2014

Kill only the python worker related to cancelled tasks.

The daemon will start a background thread to monitor the open sockets of all workers. If a socket is closed by the JVM, this thread will kill the corresponding worker.

When a task is cancelled, the socket to its worker will be closed, and the worker will then be killed by the daemon.
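A minimal sketch of this monitoring approach (hypothetical names like `worker_socks` are illustrative, not Spark's actual daemon.py code):

```python
import os
import select
import signal
import threading
import time

# Hypothetical registry: maps each worker's socket to the worker's PID.
worker_socks = {}

def monitor_worker_sockets():
    while True:
        for sock, pid in list(worker_socks.items()):
            # A closed peer makes the socket readable with 0 bytes pending.
            readable, _, _ = select.select([sock], [], [], 0)
            if not readable:
                continue
            try:
                data = sock.recv(1)
            except OSError:
                data = b''
            if data == b'':  # the JVM closed its end of the socket
                os.kill(pid, signal.SIGKILL)
                del worker_socks[sock]
        time.sleep(0.1)  # poll at a low rate to keep overhead negligible

threading.Thread(target=monitor_worker_sockets, daemon=True).start()
```

Polling with a short sleep keeps the thread cheap while still killing orphaned workers within a fraction of a second of the JVM closing the socket.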

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17395/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17395/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull

Contributor

Does sock.recv return 0 itself?

Contributor

Are there any concerns (performance or otherwise) related to two different processes polling on the same file descriptor?

Contributor Author

If an exception is raised, n is not defined.

This thread sleeps 0.1 seconds after every poll, so the overhead will be low; it should not affect another process reading the socket.

Contributor

I was wondering if the kernel does anything special to tie sockets to processes (similar to Java lock biasing).

For the n thing, I was asking why we bother reading n: why not just set something like socketClosed = False, which we set to True in the except clause, rather than reusing this variable n?

Contributor Author

In most cases, there is no exception; recv() will return 0.
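For reference, a minimal illustration of this behavior: when the peer closes its end of a connected socket, recv() returns an empty byte string (0 bytes) rather than raising an exception; exceptions occur mainly on an abortive reset.

```python
import socket

# When the peer closes cleanly, recv() signals EOF with b'' (0 bytes).
a, b = socket.socketpair()
b.close()
assert a.recv(1024) == b''  # EOF, not an exception
a.close()
```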

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17457/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17457/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17493/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17493/consoleFull

@JoshRosen
Contributor

Would it be possible to store the PIDs of the workers inside of PythonWorkerFactory and directly kill the workers via SIGTERM? Or send a command to the Python daemon and have it kill the workers? That seems like it would be less complex than this socket-polling approach.

@davies
Contributor Author

davies commented Jul 31, 2014

I had tried these two approaches. We have no easy way to get the PIDs of workers, because it's platform dependent. Sending a command to the Python daemon would also need the identity of the worker, so it would need another channel to send commands.

If there are exceptions in the Java reading/writing threads, they may lead to closing the socket, so we still need this kind of method to kill the worker when the socket is closed.

@JoshRosen
Contributor

I don't think we have platform-dependent problems using PIDs, since Windows doesn't use the daemon to launch its PySpark workers; we only have to support Unix platforms in this code.

@JoshRosen
Contributor

I think that some of the existing daemon.py code is unnecessarily complicated and confusing (see some of my comments at mesos/spark#563), so I'd like to clean things up before we add any more code. I'll have a patch for this cleanup by tonight or tomorrow, so it would probably be best if you held off from working on this for a day or so.

One note: I'm a bit wary of the mixture of fork() and Thread in the same daemon's forked process; is this safe?

@davies
Contributor Author

davies commented Jul 31, 2014

Good question: it's dangerous to mix threads and fork(), since it may cause a deadlock in the child process. But in this case, because of the GIL, when fork() happens the monitor thread is blocked, sleeping, or polling, which are all safe points, so it should not be a problem.
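A small experiment supporting this: in CPython, fork() copies only the calling thread into the child, so a monitor thread like the daemon's keeps running in the parent only (a sketch with illustrative names, not Spark's code):

```python
import os
import threading
import time

# Background thread standing in for the daemon's monitor thread.
def monitor():
    while True:
        time.sleep(0.1)

threading.Thread(target=monitor, daemon=True).start()

pid = os.fork()
if pid == 0:
    # Child: only the forking (main) thread survives the fork.
    os._exit(0 if threading.active_count() == 1 else 1)
_, status = os.waitpid(pid, 0)
```

The remaining danger is locks held by other threads at fork time, which stay locked forever in the child; a thread that is merely sleeping or polling holds no such lock.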

@davies
Contributor Author

davies commented Jul 31, 2014

I will wait for your patch, and think about using PIDs.

@davies
Contributor Author

davies commented Aug 2, 2014

@JoshRosen I have redone this PR based on your cleanup, please review again.

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17751/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17751/consoleFull

Contributor

I think that os.fork() already handles negative return values by throwing OSError, so I think this else block is dead code: https://docs.python.org/2/library/os.html#os.fork
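For reference: os.fork() either returns a child PID (in the parent), 0 (in the child), or raises OSError on failure; it never returns a negative value, so a negative-PID branch is unreachable.

```python
import os

# fork() never returns a negative value; a failed fork raises OSError,
# so there is no negative return value to check for.
try:
    pid = os.fork()
except OSError:
    pid = None  # fork failed
else:
    if pid == 0:
        os._exit(0)      # child exits immediately
    os.waitpid(pid, 0)   # parent reaps the child
```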

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17775/consoleFull

Contributor

The other accesses of daemonWorkers are guarded by synchronized blocks; does this access also need synchronization? It looks like calls to stopWorker() only occur from destroyPythonWorker(), which is synchronized using the SparkEnv object, but that's a different lock. To be on the safe side, we should probably add synchronized here unless there's a good reason not to.

Contributor

Actually, I think the current synchronization is fine: every call of PythonWorkerFactory's public methods is guarded by SparkEnv's lock.

@JoshRosen
Contributor

This looks good overall and I'd say it's ready to merge once we address my last two comments.

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17775/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1643. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17787/consoleFull

@davies
Contributor Author

davies commented Aug 2, 2014

I have fixed several bugs and improved the kill approach; it has unit tests now.

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17787/consoleFull

@JoshRosen
Contributor

Thanks for updating this! In my earlier review, I had overlooked that the kill involved forking the JVM; your new approach of having the daemon kill the workers is much better.

The test case looks good, too (clever use of Python's for ... else construct; I hadn't seen that before).

In #1680, there was some discussion over whether to use SIGKILL vs. SIGHUP to kill the Python workers. Now that I've had more time to think about it, I think SIGKILL is a fine approach:

  • Spark doesn't provide any documented, user-facing mechanisms for allowing tasks to perform cleanup work when they're cancelled.
  • The only case where it might make sense to have a cleanup mechanism is when performing side-effects, such as writing to an external database. A machine could immediately lose power or otherwise fail without executing the cleanup mechanism, so users already have to guard against the effects of immediate failures (e.g. by using transactions).
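The for ... else construct mentioned above runs its else block only when the loop completes without hitting break, which makes it a tidy pattern for search-style checks in tests (an illustrative sketch, not the actual test code):

```python
def find_worker(pids, target):
    # The else clause runs only if the loop finished without `break`.
    for pid in pids:
        if pid == target:
            break
    else:
        return None  # target never found
    return pid
```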

@JoshRosen
Contributor

I've merged this into master and branch-1.1. Thanks!

asfgit pushed a commit that referenced this pull request Aug 3, 2014
Kill only the python worker related to cancelled tasks.

The daemon will start a background thread to monitor the open sockets of all workers. If a socket is closed by the JVM, this thread will kill the corresponding worker.

When a task is cancelled, the socket to its worker will be closed, and the worker will then be killed by the daemon.

Author: Davies Liu <[email protected]>

Closes #1643 from davies/kill and squashes the following commits:

8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
46ca150 [Davies Liu] address comment
acd751c [Davies Liu] kill the worker when task is canceled

(cherry picked from commit 55349f9)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 55349f9 Aug 3, 2014
@davies
Contributor Author

davies commented Aug 4, 2014

@JoshRosen Thanks for reviewing this; your comments helped me a lot.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@davies davies deleted the kill branch September 15, 2014 22:18
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…metric (apache#1643)

### What changes were proposed in this pull request?

This patch updates how `SQLMetric` merges two invalid instances whose values are both -1.

### Why are the changes needed?

We use -1 as the initial value of `SQLMetric`, and change it to 0 while merging with other `SQLMetric` instances. A `SQLMetric` with value -1 will be treated as invalid and filtered out later.

While developing with Spark, it is troublesome that two invalid `SQLMetric` instances merge into a valid `SQLMetric`, because merging sets the value to 0.
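The intended merge semantics can be sketched as follows (a hypothetical Python model of the behavior described here, not the actual Scala `SQLMetric` code):

```python
INVALID = -1  # initial/invalid value of a metric

def merge(a, b):
    # Two invalid metrics stay invalid; otherwise an invalid side
    # contributes 0 and the values are summed.
    if a == INVALID and b == INVALID:
        return INVALID
    return max(a, 0) + max(b, 0)
```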

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes apache#38969 from viirya/minor_sql_metrics.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>