Fix reuse of client port on Linux. Implement for OSX. #21818

Merged
amitmurthy merged 3 commits into master from amitm/reuseportfix on May 22, 2017

Conversation

amitmurthy
Contributor

Closes #19893.

ci skip till #21799 is merged.

This commit should backport cleanly. Will add another commit that uses getsockname everywhere once #21801 is closed.

amitmurthy changed the title from "WIP : Fix reuse of client port on Linux. Implement for OSX. [ci skip]" to "Fix reuse of client port on Linux. Implement for OSX." on May 13, 2017
amitmurthy requested a review from vtjnash on May 13, 2017 07:50
base/socket.jl Outdated
-function TCPSocket()
+function TCPSocket(; delay=true) # kw arg "delay": if true, libuv delays creation of the socket
+                                 # fd till the first bind call

Contributor

could we avoid starting the function with an empty line? it's not the usual formatting style

amitmurthy changed the title from "Fix reuse of client port on Linux. Implement for OSX." to "WIP: Fix reuse of client port on Linux. Implement for OSX." on May 14, 2017
@@ -153,8 +153,16 @@ function start_worker(out::IO, cookie::AbstractString)
     init_worker(cookie)
     interface = IPv4(LPROC.bind_addr)
     if LPROC.bind_port == 0
-        (actual_port,sock) = listenany(interface, UInt16(9009))
-        LPROC.bind_port = actual_port
+        # (actual_port,sock) = listenany(interface, UInt16(9009))
Contributor Author

While testing this PR (that is, with client port numbers being reused), our current strategy of having each worker listen on the first free port at or above 9009 caused a major slowdown in addprocs when it ran inside a repeated addprocs(x); run tests...; rmprocs(workers) cycle, which is exactly what the test script does. I suspect the cause is that the new workers listened on the same ports as before (starting from 9009) while TCP connections to the previous workers, now in the TIME_WAIT state, still existed. A new connection with the same (clienthostport, serverhostport) tuple therefore had to wait until the system (OSX in my case) cleaned up the old one.

I changed the worker listening port selection logic to select a free ephemeral port and the slowdown disappeared. While that works, I am uncomfortable using ephemeral ports for workers to listen on as firewalling policies may disallow incoming connections to the ephemeral port range.

However, the current strategy of listening on a free port starting from 9009 is also not ideal when we have many workers on a node. For example, if the first 99 workers already hold ports 9009 through 9107, the 100th worker in an addprocs(100) will try and fail to listen on each of those ports before succeeding on 9108.

Looking for suggestions to work around this issue.
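
For reference, the two port-selection strategies weighed in this comment can be sketched with the current `Sockets` standard library (Julia 1.x, not the 0.6-era `Base` API this PR targets). The loopback address and the variable names are illustrative assumptions, and this is only a sketch of the idea, not the code in the PR.

```julia
using Sockets

# Strategy 1: scan upward from a hint port, as start_worker did before this PR.
# listenany tries the hint first, then successive ports, until a bind succeeds.
scan_port, scan_server = listenany(ip"127.0.0.1", UInt16(9009))

# Strategy 2: bind to port 0 so the OS assigns a free ephemeral port,
# then read the assigned port back with getsockname.
eph_server = listen(ip"127.0.0.1", 0)
_, eph_port = getsockname(eph_server)

println("scanned port: ", Int(scan_port), ", ephemeral port: ", Int(eph_port))

close(scan_server)
close(eph_server)
```

The first strategy keeps workers in a predictable range, which is friendlier to firewall rules. The second avoids both the linear scan and the reuse of recently closed server ports.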

Member

Maybe we can suggest that anyone who needs that write a UPnP package that automates the firewall configuration step? That said, I think the typical use of this cluster code is in a closed environment where all of the devices are on the same side of the firewall (whether virtually or physically).

        rc = ccall(:jl_tcp_reuseport, Int32, (Ptr{Void},), s.handle)
        if rc > 0  # SO_REUSEPORT is unsupported, just return the ephemerally bound socket
            return s
        elseif rc < 0
            throw(SystemError("setsockopt() SO_REUSEPORT : "))
        end
        getsockname(s)
        is_apple() && bind_client_port(s)
    catch e
        # This is an issue only on systems with lots of client connections, hence delay the warning
        nworkers() > 128 && warn_once("Error trying to reuse client port number, falling back to plain socket : ", e)
Member

vtjnash May 15, 2017


why is this done as a try/catch rather than printing this message where we threw it 6 lines above and then returning from there
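
One way to read this suggestion is to report the failure where the return code is inspected and fall back from there, with no throw and catch in between. The following is a minimal sketch in Julia 1.x syntax. `handle_reuseport_rc` and its return symbols are hypothetical names introduced here, the return-code meanings are taken from the snippet above, and the `nworkers() > 128` guard from the original is left out for brevity.

```julia
# Hypothetical stand-in for the handling of the :jl_tcp_reuseport result.
# rc > 0: SO_REUSEPORT unsupported; rc < 0: setsockopt failed; rc == 0: success.
function handle_reuseport_rc(rc::Integer)
    if rc > 0
        # Nothing to report; keep the ephemerally bound socket as-is.
        return :unsupported
    elseif rc < 0
        # Warn at the point of failure and fall back, instead of throwing
        # an exception only to catch it a few lines later.
        @warn "Error trying to reuse client port number, falling back to plain socket" rc=rc maxlog=1
        return :fallback
    end
    # Option was set; the caller continues with the port-reuse path.
    return :reuse
end

status = handle_reuseport_rc(-1)
```

`maxlog=1` plays the role of `warn_once` from the 0.6-era code by limiting the message to a single emission.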

base/socket.jl Outdated
    else
        err = ccall(:uv_tcp_init_ex, Cint, (Ptr{Void}, Ptr{Void}, Cuint),
                    eventloop(), tcp.handle, 2) # AF_INET is 2
    end
Member

I think you can call the _ex version unconditionally, and just select between AF_INET = 2 and AF_UNSPEC = 0 (that's what libuv will do internally anyways)
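
A small sketch of what that could look like: the branch disappears and only the flags argument varies. `tcp_init_flags` is a hypothetical helper name, and the constant values are the ones quoted in this thread. The result would be passed as the last argument of the `uv_tcp_init_ex` ccall shown in the diff.

```julia
const AF_UNSPEC = 0  # libuv defers creating the socket fd (the delay=true case)
const AF_INET   = 2  # libuv creates an IPv4 socket immediately

# Hypothetical helper: map the `delay` keyword to the flags for uv_tcp_init_ex.
tcp_init_flags(delay::Bool) = Cuint(delay ? AF_UNSPEC : AF_INET)

flags_delayed   = tcp_init_flags(true)
flags_immediate = tcp_init_flags(false)
```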

Member

vtjnash commented May 15, 2017

Looks like this will be a bit better after #21801 is merged too.

amitmurthy force-pushed the amitm/reuseportfix branch 2 times, most recently from f5ad929 to c354dfd, on May 19, 2017 05:26
        LPROC.bind_port = port
    else
        close(sock)
        error("no ports available")
Contributor

should throw a more specific exception type if possible
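
As an illustration of the request, a dedicated exception type lets callers catch this failure without matching on an error string. This is a sketch in Julia 1.x syntax. `NoPortsAvailableError` and its field are hypothetical and are not what the PR ended up using.

```julia
# Hypothetical exception type; the name and field are illustrative only.
struct NoPortsAvailableError <: Exception
    start_port::UInt16   # first port that was tried
end

function Base.showerror(io::IO, e::NoPortsAvailableError)
    print(io, "NoPortsAvailableError: no ports available starting from ", Int(e.start_port))
end

# Callers can now handle this failure specifically and rethrow anything else.
try
    throw(NoPortsAvailableError(9009))
catch e
    e isa NoPortsAvailableError || rethrow()
    println(sprint(showerror, e))
end
```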

amitmurthy changed the title from "WIP: Fix reuse of client port on Linux. Implement for OSX." to "Fix reuse of client port on Linux. Implement for OSX." on May 19, 2017
@amitmurthy
Contributor Author

Added a NEWS entry. Will this fix make 0.6? Should it be in the 0.6 section?

@amitmurthy
Contributor Author

AV failure is unrelated.

@amitmurthy
Contributor Author

Will merge this in a day.

amitmurthy merged commit e5fb87d into master on May 22, 2017
amitmurthy deleted the amitm/reuseportfix branch on May 22, 2017 03:55
Contributor

tkelman commented May 22, 2017

amitmurthy added a commit that referenced this pull request May 22, 2017
Should fix buildbot failure #21818 (comment)

A bit of a mystery as to how Linux CI passed correctly...
ararslan pushed a commit that referenced this pull request Sep 11, 2017
Fix reuse of client port on Linux. Implement for OSX.

Ref #21818
(cherry picked from commit e5fb87d)
ararslan pushed a commit that referenced this pull request Sep 13, 2017
ararslan pushed a commit that referenced this pull request Sep 13, 2017
ararslan pushed a commit that referenced this pull request Sep 13, 2017
Ref #21818
(cherry picked from commit fa8c4d2)
vtjnash pushed a commit that referenced this pull request Sep 14, 2017
vtjnash pushed a commit that referenced this pull request Sep 14, 2017
vtjnash pushed a commit that referenced this pull request Sep 14, 2017
Ref #21818
(cherry picked from commit fa8c4d2)
ararslan pushed a commit that referenced this pull request Sep 15, 2017
ararslan pushed a commit that referenced this pull request Sep 15, 2017
ararslan pushed a commit that referenced this pull request Sep 15, 2017
Ref #21818
(cherry picked from commit fa8c4d2)
ararslan pushed a commit that referenced this pull request Nov 24, 2017
ararslan pushed a commit that referenced this pull request Nov 25, 2017
ararslan pushed a commit that referenced this pull request Nov 25, 2017
Partially reverts the backport of #21818 in 0.6.1. Fixes #24722.
amitmurthy added a commit that referenced this pull request Nov 25, 2017
Partially reverts the backport of #21818 in 0.6.1. Fixes #24722.

Remove support for OSX client port reuse.
amitmurthy added a commit that referenced this pull request Nov 27, 2017
Partially reverts the backport of #21818 in 0.6.1. Fixes #24722.

Revert client_socket_reuse
Development

Successfully merging this pull request may close these issues.

Enable SO_REUSEPORT for OSX.
5 participants