WIP: Backports for 0.6.2 #24519
c75018a to eaf7368
Looks like the builds are freezing, not sure why yet. |
The following testsets do not have any output on any of the Travis bots: |
Has 943c8e5 been considered for the list? (Perhaps 0.6.3?) This commit fixes an issue that requires some unfortunate workarounds in JuliaCall and rjulia. |
@phaverty, according to Yichao that commit is irrelevant. Did you mean a different one? |
eaf7368 to c767cdc
Looks like |
3b9ab8d to 0bcb332
0bcb332 to b8bb2b0
Now all tests are passing on AppVeyor and Travis OS X, but the socket tests are failing on Travis Linux. Oddly enough, the socket tests pass on Nanosoldier (x86-64 Linux). |
Could #24503 be included, as it fixes a severe regression on Windows? |
Yes it will be. |
Does anyone have any ideas about the socket tests on Travis Linux? |
Seems like it should be easy to pull in #24540. |
That one isn't relevant for backporting since FFTW isn't deprecated on 0.6. |
There are a couple of commits marked for backport that I've requested the original authors backport since I wasn't able to determine how to do it correctly, but the current state of the PR contains nearly all of the relevant commits for 0.6.2. |
#24530 would be good to have, in my opinion; I think it should be good to merge. |
Sure. Just needs to be merged and I can backport it. 😉 |
Waiting on a couple more commits but might as well check performance in the meantime. @nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
Hmmm. @nanosoldier |
The SIGABRT on Travis Linux when building the documentation is very concerning but I can't reproduce it. |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
d9be6af to 29b6dad
Should #24112 be included? |
The latest state of this branch now results in additional test failures for CuArrays, CUFFT and CUBLAS. The last two seem slightly unreliable, so @MikeInnes could you look into the CuArrays test failure? |
@amitmurthy Should I revert #21818 in 0.6.2 or is there a better solution? |
Here it is:
diff --git a/NEWS.md b/NEWS.md
index 62e2983c8e..13d26cdb4d 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -259,7 +259,8 @@ This section lists changes that do not have deprecation warnings.
rather than from environment variables ([#19636]).
* Workers now listen on an ephemeral port assigned by the OS. Previously workers would
- listen on the first free port available from 9009 ([#21818]).
+ listen on the first free port available from 9009 ([#21818]). Version 0.6.1 only.
+ Reverted in 0.6.2
Library improvements
diff --git a/base/distributed/cluster.jl b/base/distributed/cluster.jl
index c12f8ed847..1d1f74d7f2 100644
--- a/base/distributed/cluster.jl
+++ b/base/distributed/cluster.jl
@@ -153,7 +153,7 @@ function start_worker(out::IO, cookie::AbstractString)
init_worker(cookie)
interface = IPv4(LPROC.bind_addr)
if LPROC.bind_port == 0
- (port, sock) = listenany(interface, UInt16(0))
+ (port, sock) = listenany(interface, UInt16(9009))
LPROC.bind_port = port
else
sock = listen(interface, LPROC.bind_port)
diff --git a/doc/src/manual/parallel-computing.md b/doc/src/manual/parallel-computing.md
index 8b3c15dc37..4d2fb1e97d 100644
--- a/doc/src/manual/parallel-computing.md
+++ b/doc/src/manual/parallel-computing.md
@@ -1231,8 +1231,8 @@ as local laptops, departmental clusters, or even the cloud. This section covers
requirements for the inbuilt `LocalManager` and `SSHManager`:
* The master process does not listen on any port. It only connects out to the workers.
- * Each worker binds to only one of the local interfaces and listens on an ephemeral port number
- assigned by the OS.
+ * Each worker binds to only one of the local interfaces and listens on the first free port starting
+ from `9009`.
* `LocalManager`, used by `addprocs(N)`, by default binds only to the loopback interface. This means
that workers started later on remote hosts (or by anyone with malicious intentions) are unable
to connect to the cluster. An `addprocs(4)` followed by an `addprocs(["remote_host"])` will fail.
@@ -1250,9 +1250,8 @@ requirements for the inbuilt `LocalManager` and `SSHManager`:
authenticated via public key infrastructure (PKI). Authentication credentials can be supplied
via `sshflags`, for example ```sshflags=`-e <keyfile>` ```.
- In an all-to-all topology (the default), all workers connect to each other via plain TCP sockets.
- The security policy on the cluster nodes must thus ensure free connectivity between workers for
- the ephemeral port range (varies by OS).
+ Note that worker-worker connections are still plain TCP and the local security policy on the remote
+ cluster must allow for free connections between worker nodes, at least for ports 9009 and above.
Securing and encrypting all worker-worker traffic (via SSH) or encrypting individual messages
can be done via a custom ClusterManager. |
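For readers unfamiliar with the two behaviours the patch above switches between, here is a minimal sketch in Python. It is not Julia and not the actual `listenany` implementation. The helper name `listen_any` and the `max_tries` parameter are hypothetical. The sketch assumes that a port hint of 0 means "let the OS assign an ephemeral port" and that any other hint means "scan upward from that port until a bind succeeds", which is how the NEWS entry describes the two modes.

```python
import socket


def listen_any(host: str, port_hint: int, max_tries: int = 1000):
    """Hypothetical helper: return (port, listening_socket).

    port_hint == 0  -> the OS picks an ephemeral port.
    port_hint  > 0  -> the first free port at or above port_hint is used.
    """
    if port_hint == 0:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind((host, 0))  # port 0 asks the OS for any free port
        sock.listen()
        return sock.getsockname()[1], sock

    # Scan upward from the hint, skipping ports that are already bound.
    for port in range(port_hint, min(port_hint + max_tries, 65536)):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()
            continue
        sock.listen()
        return port, sock

    raise OSError("no free port found at or above %d" % port_hint)


if __name__ == "__main__":
    port_a, sock_a = listen_any("127.0.0.1", 9009)  # scan from 9009
    port_b, sock_b = listen_any("127.0.0.1", 0)     # ephemeral port
    print(port_a, port_b)
    sock_a.close()
    sock_b.close()
```

With the scanning variant, a firewall only needs to allow ports from 9009 upward. With the ephemeral variant it would have to allow the whole OS-specific ephemeral range, which is the trade-off the documentation hunk above describes.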
Great, thanks @amitmurthy! @MikeInnes, any word on CuArrays? I don't have the right system to be able to test myself so I'll need some help ensuring that this PR isn't breaking for the GPU ecosystem. |
@amitmurthy It looks like the tests may also have to be adjusted to accommodate that change? |
86a0e9c to 0ad7336
I think I figured out how to fix the test: revert the changes made to test/socket.jl in #21818. |
Hmm, no, that should not be a problem. What is the error you are seeing? |
I think it timed out listening for a connection on port 0 rather than 9009. It was in the previous Travis macOS log, which has since been overwritten. |
Looks like my fix did not fix it: https://travis-ci.org/JuliaLang/julia/jobs/306991362#L5678 |
The socket tests need to be left as is. They test the bugfix in #21818 for listenany() with port 0. Let me build and test this branch locally. It will take a while to build all dependencies. |
0ad7336 to eda0b7c
Sounds good, thanks. In the meantime I've reverted the change to the tests. |
eda0b7c to ed763c5
Changing Pushed directly to this branch. |
I'll also test it out later on anubis. Just realized that Travis Linux setups do not support client socket reuse. |
CuArrays passes tests for me, although with quite a few warnings about invalid debug info being generated. I don't know if it's just a local issue, but on my usual GPU machine I can't compile this branch (it's asking me what files I want to apply some LLVM patches to). |
That's a change to CUDAnative which'll disappear when the version string matches.
Probably a dirty LLVM source tree from moving between e.g. |
ed763c5 to 6736f45
@ararslan, the run was successful on Anubis. However, taking a conservative approach, I reverted to both 1) listening on 9009 onwards and 2) disabling client socket reuse on all platforms. |
Might as well @nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
Latest PkgEval results: https://gist.github.com/ararslan/0b1112ea62d9f37ac10aac6c4b624572
Looks like this branch is now non-breaking. |
Unless there are any objections or someone beats me to it, I'm going to merge this tomorrow. |
That was quite some patient work! |