reuse client port number if possible #12158

Merged: 1 commit on Jul 17, 2015

amitmurthy
Contributor

This fixes a couple of issues that we ran into while launching 1000 workers, with 64 workers per node:

  • reuse the client-side port number, so that we do not run out of the ephemeral port numbers used on the client side of TCP connections.
  • prevent runaway error messages from being logged when an accept call fails.
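
As an illustration of the second point, one way to keep a failing accept call from flooding the log is to log a repeated error only once and to back off between retries. This is a minimal sketch and is not taken from this PR; `accept_with_backoff`, `accept_fn`, `log_fn`, `sleep_fn`, and the retry parameters are hypothetical names chosen for the example.

```python
import time


def accept_with_backoff(accept_fn, log_fn, max_attempts=5, base_delay=0.01,
                        sleep_fn=time.sleep):
    """Retry accept_fn, logging each distinct error once and backing off.

    All names here are hypothetical; this is not the code from the PR.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return accept_fn()
        except OSError as err:
            # Log only when the error differs from the previous one,
            # so a persistent failure produces a single log line.
            if repr(err) != last_error:
                log_fn(f"accept failed: {err!r}")
                last_error = repr(err)
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep_fn(base_delay * 2 ** attempt)
    raise OSError(f"accept failed {max_attempts} times in a row")
```

Passing the sleep and log functions in as arguments keeps the retry policy easy to exercise without real sockets.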

@amitmurthy amitmurthy changed the title from "reuse port if possible" to "reuse client port number if possible" on Jul 15, 2015
@vtjnash
Member

vtjnash commented Jul 15, 2015

Probably worth taking a look at libuv/libuv@a385ae4, where libuv explicitly disabled this, and at the bug referenced there: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=174087

@amitmurthy
Contributor Author

Thanks.

Those links are about issues with SO_REUSEADDR. The option being set in this PR is SO_REUSEPORT which was introduced first in BSD and fairly recently in Linux (kernel 3.9 onwards).

This PR gets an ephemeral port number the first time around and then reuses the same port number for all subsequent connect requests. This works because we only set up a single connection to each remote ip:port.

The funny thing, though, is that on the client side, setting SO_REUSEPORT before bind throws an error, while setting it after bind results in the desired behaviour.

ccall(:jl_tcp_reuseport, Int32, (Ptr{Void}, ), s.handle) < 0 && throw("SO_REUSEPORT error")
ccall(:jl_tcp_getsockname_v4, Int32,
      (Ptr{Void}, Ref{Cuint}, Ref{Cushort}),
      s.handle, client_host, client_port) < 0 && throw("getsockname() error")
Member

you are throwing strings here

@jakebolewski
Member

👍

amitmurthy added a commit that referenced this pull request Jul 17, 2015
reuse client port number if possible
@amitmurthy amitmurthy merged commit a8536c3 into master Jul 17, 2015
@amitmurthy amitmurthy deleted the amitm/slurm1000 branch July 17, 2015 03:36
@yuyichao
Contributor

Is this warning during Travis CI related to this PR and is it expected?

@yuyichao
Contributor

It's in the CI run for this PR as well. @amitmurthy

@amitmurthy
Contributor Author

I forgot Travis runs on Ubuntu 12.04, so it is expected.

I'll modify the warning to show up only when we are adding a large number of workers, say > 128, since it is not really an issue below that.

@yuyichao
Contributor

Is that because the kernel is not new enough? And can it be detected?

@amitmurthy
Contributor Author

Yes. And yes.
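
One possible way to detect this, sketched here as an assumption rather than as what the PR ended up doing, is to parse the kernel release string and compare it against 3.9, the Linux version cited earlier in this thread. `parse_kernel_release` and `linux_kernel_supports_reuseport` are hypothetical helper names. A version check can be wrong on kernels with backported features, so attempting the setsockopt call and handling the error is a reasonable alternative.

```python
import platform
import re

MIN_REUSEPORT_KERNEL = (3, 9)  # Linux version cited earlier in this thread


def parse_kernel_release(release):
    """Return (major, minor) from a string such as '3.13.0-24-generic', or None."""
    match = re.match(r"(\d+)\.(\d+)", release)
    return (int(match.group(1)), int(match.group(2))) if match else None


def linux_kernel_supports_reuseport(system=None, release=None):
    """Guess from the version alone whether a Linux kernel has SO_REUSEPORT.

    Returns False on non-Linux systems: this helper only answers the
    Linux question and says nothing about BSD or OS X support.
    """
    system = platform.system() if system is None else system
    release = platform.release() if release is None else release
    if system != "Linux":
        return False
    version = parse_kernel_release(release)
    return version is not None and version >= MIN_REUSEPORT_KERNEL
```

Called without arguments, the function inspects the running system; the optional arguments exist so the logic can be checked against known release strings.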

@StefanKarpinski
Member

I'm seeing this during tests:

    JULIA test/all
WARNING: Unable to reuse port : SystemError("getsockname() : ",48)
WARNING: Unable to reuse port : SystemError("getsockname() : ",48)WARNING: Unable to reuse port : SystemError("getsockname() : ",48)WARNING: Unable to reuse port : SystemError("getsockname() : ",48)
WARNING: Unable to reuse port : SystemError("getsockname() : ",48)WARNING: Unable to reuse port : SystemError("getsockname() : ",48)
WARNING: Unable to reuse port : SystemError("getsockname() : ",48)
Julia Version 0.4.0-dev+6022
Commit f4eb92b (2015-07-17 13:41 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.3.0)
  CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@amitmurthy
Contributor Author

Working on this here - #12192

It looks like on OSX we need to call setsockopt before bind (which sounds logical), but on Linux that did not work, while bind followed by setsockopt did.

Maybe I should limit this to Linux, since it is unlikely that folks on OSX will run with hundreds of workers?

@StefanKarpinski
Member

Nah, let's fix it right. Better to have things just work. Someone may want hundreds of workers on OS X.

@ScottPJones
Contributor

Maybe this is just because in my use cases things are generally I/O bound, so I'm used to needing lots and lots of workers, even on an old 8-core Mac Pro, as opposed to the probably more CPU-bound work that is more common in numerical/scientific computing.
Please make it work right on OS X also.

@StefanKarpinski
Member

Let me know if you need help testing, @amitmurthy.

@amitmurthy
Contributor Author

Thanks @StefanKarpinski. I have access to a Mac and was trying to make it work there. The bad news is that I don't think I can make it work on OSX, because we don't have a way to access the socket fd before a bind - see libuv/libuv#386 (comment).

On Linux kernels >= 3.9, setsockopt after bind just happens to work.

The ability to get access to a socket before bind landed fairly recently in mainstream libuv - libuv/libuv#400 .

Also SO_REUSEPORT is not supported on Windows.

I'll go ahead and make this Linux-only for now, and will revisit OSX when our libuv fork catches up with the above-mentioned patch.
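
The per-platform outcome of this thread can be summarized as a small lookup. This is only a sketch of the decision, with a hypothetical function name (`port_reuse_plan`) and hypothetical return labels; it is not code from the PR or from #12192.

```python
import sys


def port_reuse_plan(platform_name=None):
    """Map a sys.platform-style name to the port reuse decision from this thread."""
    name = sys.platform if platform_name is None else platform_name
    if name.startswith("linux"):
        # setsockopt after bind happened to work on recent kernels.
        return "reuse"
    if name == "darwin":
        # Needs access to the socket before bind (see libuv/libuv#400).
        return "deferred"
    if name.startswith("win"):
        # SO_REUSEPORT is not supported on Windows.
        return "unsupported"
    # Platforms not discussed in this thread.
    return "unknown"
```

Keeping the decision in one place would make it simple to switch OSX from "deferred" to "reuse" once the libuv fork allows setting the option before bind.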

@StefanKarpinski
Member

Sounds good.
