quic: disable SO_REUSEPORT for integration tests#19188
quic: disable SO_REUSEPORT for integration tests#19188wrowe merged 4 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Dan Zhang <danzh@google.com>
|
For my own understanding, shouldn't binding to port 0 prevent this issue? Or does binding to 0 not work correctly with REUSE_PORT? |
Signed-off-by: Dan Zhang <danzh@google.com>
19fbd08 to
bcad94f
Compare
I believe we are already binding listener to port 0. Does this mean Envoy will pick up random port? Somehow, REUSE_PORT still allows port sharing among parallel jobs. |
Yes, binding to port 0 should pick a random unused port. I would be very surprised if reuse_port effects this. Are we positive it's not some other issue? |
|
Perhaps it works differently for UDP somehow? |
In #9495, I noticed different runs of tests picked up the same port in the overlapped period. TBH I'm not sure if TCP has the same issue or not as I haven't run any TCP tests with large concurrency. And I suspect the issue will be less disruptive in TCP because the listen socket only accept connections. |
Signed-off-by: Dan Zhang <danzh@google.com>
|
Can you check proliferation of closed sockets in any _WAIT state? What we might be seeing here is exhaustion of the pool of ports due to lingering close wait / time wait. I'm wondering if it isn't worth tweaking the MTU for tests, such that the tuples are released more quickly for port reuse. |
This is UDP socket, so I don't think they will be in any _WAIT state. |
|
As long as the UDP responders are entirely released on close, we are fine. The issue is that a TIME_WAIT for all our TCP tests can end up in a state where it's hard to pull up an available port from the ephemeral pool, if they fire quickly enough. It's possible to tweak the MTU timing to reduce the time these sockets are held apart from port 0. But glancing around at entries like https://stackoverflow.com/questions/775638/using-so-reuseaddr-what-happens-to-previously-open-socket I'm beginning to wonder if SO_REUSEADDR isn't confounding the assignments of port 0. This patch seems harmless enough in the first place. |
|
(It also suggests reuseport may be unwise in production, if these are port 0 based.) |
This is very likely the case. |
Probably never, though never say never. (I could imagine some scenario in which someone is binding to 0 because they have multiple processes running and then figuring out the bound port somehow.) If port 0 combined with reuse port is broken then so be it. My main concern is that we are going to lose a bunch of coverage for our default setting. Any thoughts on how to balance this? |
|
cc @alyssawilk for thoughts on this. |
|
hmmm, @danzh2010 @RyanTheOptimist can you remind me what we do internally for this? |
Internally we have a port pool utilities which provides unused ports, does book keeping and manages port recycling. |
The original default setting (SO_REUSEPORT with concurrency = 1) probably is not common in production anyway. I think having some TCP tests explicitly test concurrency > 1 like what we do to QuicHttpIntegrationTest would be a more realistic approach. |
alyssawilk
left a comment
There was a problem hiding this comment.
Yeah given we have a few tests making sure it works I'm Ok with this as an interim solution.
@danzh2010 would you file an issue to continue to track this unexpected behavior and link it to the dialog here? That way we keep a placeholder to continue to research why we've observed the unexpected behavior |
Done in #19283 |
SO_REUSEPORT socket option can lead bazel test to use the same port across parallel jobs. This causes test flakiness in http_quic_integration_test with --runs_per_test > 1. To mitigate this issue, the integration test should run with enable_reuse_port explicitly false unless it needs concurrency_ > 1. For those tests with concurrency_ > 1, running with --jobs=1 would eliminate the flakiness. Additional Description: We used to disable SO_REUSEPORT for tests whose concurrency_ = 1 in envoyproxy#9495. However this was undo in envoyproxy#17259. Signed-off-by: Dan Zhang <danzh@google.com> Co-authored-by: Dan Zhang <danzh@google.com> Signed-off-by: Josh Perry <josh.perry@mx.com>
Commit Message: SO_REUSEPORT socket option can lead bazel test to use the same port across parallel jobs. This causes test flakiness in http_quic_integration_test with --runs_per_test > 1. To mitigate this issue, the integration test should run with enable_reuse_port explicitly false unless it needs concurrency_ > 1. For those tests with concurrency_ > 1, running with --jobs=1 would eliminate the flakiness.
Additional Description: We used to disable SO_REUSEPORT for tests whose concurrency_ = 1 in #9495. However this was undo in #17259.
Risk Level: low, test only
Testing: existing tests passed