
quic: disable SO_REUSEPORT for integration tests #19188

Merged
wrowe merged 4 commits into envoyproxy:main from danzh2010:fixkeepalive
Dec 14, 2021

Conversation

@danzh2010
Contributor

Commit Message: The SO_REUSEPORT socket option can lead bazel tests to use the same port across parallel jobs. This causes test flakiness in http_quic_integration_test with --runs_per_test > 1. To mitigate this, the integration test should run with enable_reuse_port explicitly set to false unless it needs concurrency_ > 1. For tests with concurrency_ > 1, running with --jobs=1 eliminates the flakiness.

Additional Description: We used to disable SO_REUSEPORT for tests with concurrency_ == 1 in #9495. However, this was undone in #17259.

Risk Level: low, test only
Testing: existing tests passed

Signed-off-by: Dan Zhang <danzh@google.com>
@mattklein123
Member

For my own understanding, shouldn't binding to port 0 prevent this issue? Or does binding to 0 not work correctly with REUSE_PORT?

Signed-off-by: Dan Zhang <danzh@google.com>
@danzh2010
Contributor Author

> For my own understanding, shouldn't binding to port 0 prevent this issue? Or does binding to 0 not work correctly with REUSE_PORT?

I believe we are already binding the listener to port 0. Does this mean Envoy will pick a random port? Somehow, REUSE_PORT still allows port sharing among parallel jobs.

Signed-off-by: Dan Zhang <danzh@google.com>
@mattklein123
Member

> I believe we are already binding the listener to port 0. Does this mean Envoy will pick a random port? Somehow, REUSE_PORT still allows port sharing among parallel jobs.

Yes, binding to port 0 should pick a random unused port. I would be very surprised if reuse_port affects this. Are we positive it's not some other issue?

@mattklein123
Member

Perhaps it works differently for UDP somehow?

@danzh2010
Contributor Author

> I believe we are already binding the listener to port 0. Does this mean Envoy will pick a random port? Somehow, REUSE_PORT still allows port sharing among parallel jobs.
>
> Yes, binding to port 0 should pick a random unused port. I would be very surprised if reuse_port affects this. Are we positive it's not some other issue?

In #9495, I noticed that different runs of tests picked up the same port during the overlapping period.

TBH I'm not sure whether TCP has the same issue, as I haven't run any TCP tests with large concurrency. And I suspect the issue would be less disruptive for TCP because the listen socket only accepts connections.
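The sharing behavior described above is easy to reproduce outside the PR. A minimal sketch (not from the PR; assumes Linux, where `SO_REUSEPORT` is available, and uses Python's socket module for brevity): once one UDP socket with `SO_REUSEPORT` has been bound to a kernel-assigned port via a port-0 bind, a second `SO_REUSEPORT` socket can bind the exact same port without `EADDRINUSE` — which is how two parallel test jobs can end up sharing one port.

```python
import socket

# First socket: mimic a test listener binding port 0 with SO_REUSEPORT set.
s1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s1.bind(("127.0.0.1", 0))
port = s1.getsockname()[1]  # kernel-chosen ephemeral port

# Second socket (a "parallel job"): binding the very same port succeeds
# because both sockets opted into SO_REUSEPORT.
s2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s2.bind(("127.0.0.1", port))  # no EADDRINUSE raised

print(s2.getsockname()[1] == port)  # True: two sockets share one UDP port
s1.close()
s2.close()
```

With `SO_REUSEPORT`, incoming datagrams are load-balanced across the sharing sockets, so a test expecting to own its port exclusively can silently lose packets to another process.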

@wrowe wrowe self-assigned this Dec 6, 2021
Signed-off-by: Dan Zhang <danzh@google.com>
@wrowe
Contributor

wrowe commented Dec 7, 2021

Can you check proliferation of closed sockets in any _WAIT state? What we might be seeing here is exhaustion of the pool of ports due to lingering close wait / time wait. I'm wondering if it isn't worth tweaking the MTU for tests, such that the tuples are released more quickly for port reuse.

@danzh2010
Contributor Author

> Can you check proliferation of closed sockets in any _WAIT state? What we might be seeing here is exhaustion of the pool of ports due to lingering close wait / time wait. I'm wondering if it isn't worth tweaking the MTU for tests, such that the tuples are released more quickly for port reuse.

These are UDP sockets, so I don't think they will be in any _WAIT state.
What is the test MTU?

@wrowe
Contributor

wrowe commented Dec 7, 2021

As long as the UDP responders are entirely released on close, we are fine. The issue is that TIME_WAIT across all our TCP tests can end up in a state where it's hard to pull an available port from the ephemeral pool, if they fire quickly enough. It's possible to tweak the MTU timing to reduce the time these sockets are held, apart from port 0. But glancing around at entries like https://stackoverflow.com/questions/775638/using-so-reuseaddr-what-happens-to-previously-open-socket I'm beginning to wonder if SO_REUSEADDR isn't confounding the assignments of port 0.

This patch seems harmless enough in the first place.

@wrowe
Contributor

wrowe commented Dec 7, 2021

(It also suggests reuseport may be unwise in production, if these are port 0 based.)

@danzh2010
Contributor Author

> (It also suggests reuseport may be unwise in production, if these are port 0 based.)

This is very likely the case.
I'm curious: in what case do users bind Envoy's port to 0 in production?

@mattklein123
Member

> I'm curious: in what case do users bind Envoy's port to 0 in production?

Probably never, though never say never. (I could imagine some scenario in which someone is binding to 0 because they have multiple processes running and then figuring out the bound port somehow.)

If port 0 combined with reuse port is broken then so be it. My main concern is that we are going to lose a bunch of coverage for our default setting. Any thoughts on how to balance this?

@mattklein123
Member

cc @alyssawilk for thoughts on this.

@alyssawilk
Contributor

hmmm, @danzh2010 @RyanTheOptimist can you remind me what we do internally for this?

@danzh2010
Contributor Author

> hmmm, @danzh2010 @RyanTheOptimist can you remind me what we do internally for this?

Internally we have a port pool utility which provides unused ports, does bookkeeping, and manages port recycling.
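The internal utility itself isn't shown in this thread. A hypothetical minimal sketch of the idea (all names invented for illustration; not the actual Google-internal code) is to have each allocated port held open by a bound socket, so no other process can claim it until the test releases it:

```python
import socket

class PortPool:
    """Hypothetical sketch of a test port pool: each allocated port is
    held by a bound socket so it stays reserved until released."""

    def __init__(self):
        self._held = {}  # port -> socket keeping it reserved

    def acquire(self):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("127.0.0.1", 0))     # kernel picks an unused port
        port = s.getsockname()[1]
        self._held[port] = s         # bookkeeping: keep it reserved
        return port

    def release(self, port):
        self._held.pop(port).close() # recycling: port returns to the OS

pool = PortPool()
a, b = pool.acquire(), pool.acquire()
print(a != b)  # True: two live reservations never collide
pool.release(a)
pool.release(b)
```

One caveat with this simple version: holding a TCP socket does not reserve the same UDP port, so a real utility for QUIC tests would need to hold both protocols or hand the bound socket itself to the test.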

@danzh2010
Contributor Author

danzh2010 commented Dec 13, 2021

> If port 0 combined with reuse port is broken then so be it. My main concern is that we are going to lose a bunch of coverage for our default setting. Any thoughts on how to balance this?

The original default setting (SO_REUSEPORT with concurrency = 1) probably is not common in production anyway. I think having some TCP tests explicitly test concurrency > 1, like what we do in QuicHttpIntegrationTest, would be a more realistic approach.

Contributor

@alyssawilk alyssawilk left a comment


Yeah given we have a few tests making sure it works I'm Ok with this as an interim solution.

@wrowe wrowe merged commit eb39110 into envoyproxy:main Dec 14, 2021
@wrowe
Contributor

wrowe commented Dec 14, 2021

> Yeah given we have a few tests making sure it works I'm Ok with this as an interim solution.

@danzh2010 would you file an issue to continue tracking this unexpected behavior and link it to the dialog here? That way we keep a placeholder to continue researching why we've observed it.

@danzh2010
Contributor Author

> @danzh2010 would you file an issue to continue tracking this unexpected behavior and link it to the dialog here? That way we keep a placeholder to continue researching why we've observed it.

Done in #19283

joshperry pushed a commit to joshperry/envoy that referenced this pull request Feb 13, 2022
The SO_REUSEPORT socket option can lead bazel tests to use the same port across parallel jobs.

This causes test flakiness in http_quic_integration_test with --runs_per_test > 1. 

To mitigate this, the integration test should run with enable_reuse_port explicitly set to false unless it needs concurrency_ > 1.

For tests with concurrency_ > 1, running with --jobs=1 eliminates the flakiness.

Additional Description: We used to disable SO_REUSEPORT for tests with concurrency_ == 1 in envoyproxy#9495. However, this was undone in envoyproxy#17259.

Signed-off-by: Dan Zhang <danzh@google.com>

Co-authored-by: Dan Zhang <danzh@google.com>
Signed-off-by: Josh Perry <josh.perry@mx.com>
