Skip to content

Remove direct access to SO_REUSEPORT#5965

Closed
alexpyattaev wants to merge 11 commits intoanza-xyz:masterfrom
alexpyattaev:deprecate_explicit_reuseport
Closed

Remove direct access to SO_REUSEPORT#5965
alexpyattaev wants to merge 11 commits intoanza-xyz:masterfrom
alexpyattaev:deprecate_explicit_reuseport

Conversation

@alexpyattaev
Copy link
Copy Markdown

@alexpyattaev alexpyattaev commented Apr 24, 2025

Problem

  • SO_REUSEPORT flag is tricky to use correctly
  • API in net-utils encouraged incorrect usage of the flag
  • Multiple devs tripped this particular trap, API needs to improve

This is a split-off from #5832

Summary of Changes

  • Remove the need to explicitly manage SO_REUSEPORT in SocketConfig, thus removing the burden of thinking about it from the caller side (which, as practice shows, was super easy to get wrong).
  • Manage SO_REUSEPORT for sockets from high-level API, preventing misuse
  • Extra tests to make sure it works correctly.

@alexpyattaev alexpyattaev marked this pull request as draft April 24, 2025 12:10
Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs
@alexpyattaev alexpyattaev marked this pull request as ready for review April 24, 2025 13:10
@alexpyattaev alexpyattaev changed the title Deprecate explicit reuseport Deprecate direct access to SO_REUSEPORT Apr 24, 2025
Comment thread net-utils/src/lib.rs
Comment thread net-utils/src/lib.rs Outdated
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 24, 2025

Codecov Report

❌ Patch coverage is 87.11656% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.8%. Comparing base (3281f61) to head (bbc1184).
⚠️ Report is 3082 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5965   +/-   ##
=======================================
  Coverage    82.8%    82.8%           
=======================================
  Files         849      849           
  Lines      379373   379434   +61     
=======================================
+ Hits       314300   314374   +74     
+ Misses      65073    65060   -13     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. I'm still not a fan of the silent rugging of config. I can get onboard if we at least thrown a warn to say "hey, you called me with an invalid config and I fixed it for you"

let (_port, read_sockets) = solana_net_utils::multi_bind_in_range_with_config(
ip_addr,
(port, port + num_sockets as u16),
SocketConfig::default().reuseport(true),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why were we explicitly setting reuse port to true? Was this working around some problem that no longer exists? Were we just foolish?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. We used to return error if reuse port was not set on the config. Now we just force reuse port to true within the function

Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs
Comment thread net-utils/src/lib.rs Outdated
};
#[cfg(not(any(windows, target_os = "ios")))]
set_reuse_port(&socket)?;
config.reuseport = true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably warn here as well

Copy link
Copy Markdown
Author

@alexpyattaev alexpyattaev Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, since legitimate use assumes reuseport is not set externally, so no point warning here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we let someone pass in a config if we don't respect it?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this. never change a config behind the callers back. if it's wrong, return error

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole idea is that reuseport will not be controllable by the caller anymore. So we can change it without caller knowing if we did or not. Or we can break public API and just delete it from config completely. Why would we want to make it possible to use this incorrectly and get runtime panics when it is trivial to make it impossible?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards either:

  1. Deprecate all of these APIs and create a new SocketConfig type that doesn't allow setting reuseport + APIs that use the new config type.
  2. Setup the configurations properly everywhere and throw errors when they're not.

While I agree overwriting the config is probably a better user experience in 99% of cases, it is a behavioral breaking change, and I would rather not set the precedent that it's something we want to do

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point that is still missed is that all the functions that are "rugging" the config are impossible to use correctly with the old API and old behavior. For example, when we bind TPU ports, we need to call pub fn bind_two_in_range_with_offset_and_config which will yield 2 ports with correct QUIC offset. For that function to make sense, it needs to have reuseport=false (else it will bind on top of existing ports). After that, we want to bind additional sockets on top of the stuff bound by bind_two_in_range_with_offset_and_config, which means the original socket needs to have reuseport=true (else binding on top of it will fail). We could change the logic in bind_two_in_range_with_offset_and_config such that it first binds without SO_REUSEPORT, then sets the flag on both sockets AFTER they are bound to reflect what is in the config (but that is changing what the function does, which is something you want to avoid). bind_in_range_with_config is more tricky as it can concievably be used with reuseport=true (assuming the user WANTS it to bind on top of existing stuff) even though it makes no sense. So, we end up either changing the API towards a sane option (which is what I'm proposing) or changing the implementation (which would be necesary for option 2).

As for option 1, I do not think we should maintain code that is objectively hard to use correctly, we are not making C++ standard library here =)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that function to make sense, it needs to have reuseport=false (else it will bind on top of existing ports)

Can you break this down for me? We call it with reuseport set to true today, right? Which existing port is it binding on top of? Wouldn't this imply that the port range being passed in is the problem?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you break this down for me?

Gladly :)

We call it with reuseport set to true today, right?

Yes, noone calls this with false, but that is technically allowed (just does not do anything).

Which existing port is it binding on top of?

The first one in range, + a random offset some of the functions pick. Of course, they can only bind on top of ports with reuseport=true (but this does not save us)

Wouldn't this imply that the port range being passed in is the problem?

Well yes, functions that operate on specific port number are not problematic. if we were to "consume" ports from the port range as we bind sockets we would not have as much of an issue here. But that would call for agave to keep stateful mutable port ranges rather than a tuple of ints, not sure it is a better way.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not rugging config anymore. If user sets reuseport, we will respect his choice. We will only crash in totally pathological case where user asks for extra sockets on top of existing one, and explicitly sets SO_REUSEPORT = false (which would get IO error from the OS anyway).

@alexpyattaev alexpyattaev requested a review from bw-solana April 24, 2025 18:24
Comment thread net-utils/src/lib.rs Outdated
Comment on lines 255 to 242
#[deprecated(since = "2.3.0", note = "SO_REUSEPORT is now managed automatically")]
pub fn reuseport(mut self, reuseport: bool) -> Self {
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change - we are making it such that users know that manually controlling reuseport flag is not supported anymore. This allows us to manipulate field in the SocketConfig as we deem appropriate.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would typically deprecate and then wait a version before removing.
I view overwriting the field as almost the same thing as just removing the API because the API becomes completely meaningless

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone keeps using this API, there are principally 2 outcomes: they are setting this field correctly for each of the functions they are calling, and our "rugging" is a noop, or they are setting it incorrectly, and "rugging" fixes a bug in their code. We can of course preserve the exact semantics here at the cost of complexity inside agave.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. "Fixing" their code is a behavioral breaking API change that probably makes life better in 99% of cases

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we are tracking whether user manually set the flag, and if they did we fall back to blindly obeying what they have specified.

@alexpyattaev alexpyattaev force-pushed the deprecate_explicit_reuseport branch from 0a3a3e8 to 76bd9cd Compare April 24, 2025 18:55
@alexpyattaev
Copy link
Copy Markdown
Author

Rebased to pull in fix for new_with_external_ip_test_random which flaked in CI

@alexpyattaev
Copy link
Copy Markdown
Author

Left a few comments. I'm still not a fan of the silent rugging of config.

Neither am I, but since external callers are not supposed to be setting up that part of config by hand they should never notice the rugging. We have discussed a full rework of API with @KirillLykov today and deemed it to be not warranted here since correct usage will never trigger any issues with rugging, and incorrect use would be a bug anyway.

I can get onboard if we at least thrown a warn to say "hey, you called me with an invalid config and I fixed it for you"

Added warn! calls in 76bd9cd

@alexpyattaev alexpyattaev force-pushed the deprecate_explicit_reuseport branch from 76bd9cd to 41e501e Compare April 24, 2025 19:10
@alexpyattaev
Copy link
Copy Markdown
Author

I think I've found a way to have the cake and eat it too in 6e16683. We can check if the user has set the reuseport manually, and if they did fall back to old behavior. Old behavior has obviously problematic usage patterns patched, i.e. when calling for mulitple ports to be scanned/bound to, we will bind to free ports first, then set whatever reuseport flag the user actually asked for.

TL;DR: if a user manually sets reuseport through (now deprecated) setter method, bind_in_range_with_config will bind to a FREE port no matter which reuseport flag was chosen by the user, then set the reuseport as requested (which is what it was expected to do in the first place). By default the new system will manage reuseport automagically (thus enabling us to eventually deprecate manual configuration of that flag as originally intended) & simplifying agave code.

@alexpyattaev alexpyattaev force-pushed the deprecate_explicit_reuseport branch 2 times, most recently from 9ea8a8c to 1d67467 Compare April 25, 2025 13:39
Comment thread net-utils/src/lib.rs Outdated
let orig_reuseport2 = sock2_config.reuseport;
// clear flags to be able to find actually free ports
sock1_config.reuseport = false;
sock2_config.reuseport = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't like these config shenanigans. We should respect what gets passed in even if it seems undesirable. If we want to take away some optionality and force the "right" behavior, we should make new APIs and deprecate the broken ones (or fix them and roll the major version)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I will bump major version then :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided against version bumps just yet, got rid of the shenanigans in 21c8316 (now our behavior is same as original if reuseport is set manually, including pathological cases).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the monorepo crates are versioned synchronously. we have a release schedule, versioning cadence and backward compatibility policy. there is zero chance we are bumping major for this

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I've figured as much by looking at other PRs. Makes sense to keep everything coherent.

@gregcusack
Copy link
Copy Markdown

SO_REUSEPORT flag is tricky to use correctly
API in net-utils encouraged incorrect usage of the flag

can you explain what the proper usage of SO_REUSEPORT is, please? I don't fully understand.

My minimal understanding is if we let a user explicitly set SO_REUSEPORT, then they could bind on top of a port that also set SO_REUSEPORT. Therefore, we need to try to initially bind a socket without SO_REUSEPORT to avoid this issue. Once we are able to bind to a port w/o SO_REUSEPORT, then we need to explicitly set `SO_REUSEPORT.

For this PR you're trying to prevent a user for ever explicitly setting SO_REUSEPORT (unless they use deprecated function). so now if you want to use SO_REUSEPORT, you have to call bind_more_with_config()?

@alexpyattaev
Copy link
Copy Markdown
Author

SO_REUSEPORT flag is tricky to use correctly
API in net-utils encouraged incorrect usage of the flag

can you explain what the proper usage of SO_REUSEPORT is, please? I don't fully understand.

Gladly!

My minimal understanding is if we let a user explicitly set SO_REUSEPORT, then they could bind on top of a port that also set SO_REUSEPORT. Therefore, we need to try to initially bind a socket without SO_REUSEPORT to avoid this issue. Once we are able to bind to a port w/o SO_REUSEPORT, then we need to explicitly set `SO_REUSEPORT.

Yes, pretty much. Since they had to bind with SO_REUSEPORT set, they set this option in SocketConfig, which allowed them to bind on top of already bound sockets, causing tests to sometimes fail. We had a bunch of hacks to work around this flakiness such as doing find_available_port_in_range first and then binding with SO_REUSEPORT set hoping noone else did it on the same port at the same time.

For this PR you're trying to prevent a user for ever explicitly setting SO_REUSEPORT (unless they use deprecated function). so now if you want to use SO_REUSEPORT, you have to call bind_more_with_config()?

Yeah, exactly. With user calling bind_more_with_config() we can be sure that the intention is to do a multiple bind to the same port, and can manipulate the socket flags accordingly. User needs not concern themselves with SO_REUSEPORT semantics, and API always does the right thing for them. Following the ethos of "make invalid states unrepresentable" here=).

Copy link
Copy Markdown

@gregcusack gregcusack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time understanding the logic flow of this code with reuseport_set_by_user. It's hard to tell when a port is actually getting set with SO_REUSEPORT or not. I kinda agree with Brennan on this one. I think we should almost just start over. Deprecate all the methods that are currently allowing users to misuse SO_REUSEPORT and then create new APIs that force the proper use. It sucks because it's going to mean more code in this library until we can completely remove remove the old API misusing SO_REUSEPORT. But i think it is a lot more clear and less convoluted/hacky than using reuseport_set_by_user.

let port_range = localhost_port_range_for_tests();
let udp_config = SocketConfig::default();
let quic_config = SocketConfig::default().reuseport(true);
let quic_config = SocketConfig::default();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we find to tpu_quic socket without SO_REUSEPORT and then rebind using SO_REUSEPORT below here:

let tpu_quic = bind_more_with_config(tpu_quic, num_quic_endpoints, quic_config).unwrap();

Comment thread net-utils/src/lib.rs
Comment on lines +388 to +359
if orig_reuseport & !config.reuseport {
set_reuse_port(&udp_socket)?;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can this ever be true? orig_reuseport cannot be true without config.reuseport_set_by_user also being true, right?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite, functions can call each other, so it could be set by us within this module.

@gregcusack
Copy link
Copy Markdown

Yes, pretty much. Since they had to bind with SO_REUSEPORT set, they set this option in SocketConfig, which allowed them to bind on top of already bound sockets, causing tests to sometimes fail. We had a bunch of hacks to work around this flakiness such as doing find_available_port_in_range first and then binding with SO_REUSEPORT set hoping noone else did it on the same port at the same time.

thank you for the info on this!

@alexpyattaev
Copy link
Copy Markdown
Author

I'm having a hard time understanding the logic flow of this code with reuseport_set_by_user. It's hard to tell when a port is actually getting set with SO_REUSEPORT or not. I kinda agree with Brennan on this one.

Yeah it was much simpler when we were just rugged the config.

I think we should almost just start over. Deprecate all the methods that are currently allowing users to misuse SO_REUSEPORT and then create new APIs that force the proper use. It sucks because it's going to mean more code in this library until we can completely remove remove the old API misusing SO_REUSEPORT. But i think it is a lot more clear and less convoluted/hacky than using reuseport_set_by_user.

The existing API is mostly fine. There are a few pathological functions allowing SO_REUSEPORT to be specified. We will gradually eradicate them. No need to uproot half the codebase for this.

@t-nelson
Copy link
Copy Markdown

you've been misled by a symptom. the true bug is here

pub const VALIDATOR_PORT_RANGE: PortRange = (8000, 10_000);

every "default" that's not a param to a clap arg is a bug. the port range is particularly insidious

@alexpyattaev
Copy link
Copy Markdown
Author

you've been misled by a symptom. the true bug is here

pub const VALIDATOR_PORT_RANGE: PortRange = (8000, 10_000);

every "default" that's not a param to a clap arg is a bug. the port range is particularly insidious

Yes, looks like best thing we can do to this crate is deprecate the whole thing and just make better ones instead.

@alexpyattaev alexpyattaev changed the title Deprecate direct access to SO_REUSEPORT Remove direct access to SO_REUSEPORT Jun 9, 2025
@alexpyattaev alexpyattaev force-pushed the deprecate_explicit_reuseport branch from dc049c8 to 2ff4456 Compare June 9, 2025 18:59
@alexpyattaev
Copy link
Copy Markdown
Author

Since we are at 3.0 I have made the breaking changes to implement the originally intended changes.

@alexpyattaev alexpyattaev requested a review from gregcusack June 9, 2025 19:54
@t-nelson
Copy link
Copy Markdown

t-nelson commented Jun 9, 2025

Since we are at 3.0 I have made the breaking changes to implement the originally intended changes.

are the symbols deprecated in 2.3? we typically commit to deprecations being published for at least one minor release cycle

@alexpyattaev
Copy link
Copy Markdown
Author

Since we are at 3.0 I have made the breaking changes to implement the originally intended changes.

are the symbols deprecated in 2.3? we typically commit to deprecations being published for at least one minor release cycle

nope, this PR was supposed to be a preparation to completely remove the broken API, but it never got merged because reasons.
If we need to wait till 4.0 to get this into manageable shape, I'd prefer just making a new crate from scratch and abandon this crate completely. Not the most elegant solution but definitely easier on everyone.

@alexpyattaev alexpyattaev requested a review from bw-solana June 12, 2025 18:08
@gregcusack
Copy link
Copy Markdown

Since we are at 3.0 I have made the breaking changes to implement the originally intended changes.

are the symbols deprecated in 2.3? we typically commit to deprecations being published for at least one minor release cycle

ya we probably should have deprecated these in v2.3 but it slipped through. Chatted with a couple of the devrel folks at foundation and they basically said as long as we are following semver, we should be ok to break the public API. They also mentioned they would help with comms when we roll out the breaking changes in v3.0. So, I'd suggest we break it when we have the chance

@t-nelson
Copy link
Copy Markdown

guys you can deprecate the symbols in a separate pr and bp that. we're still early

@alexpyattaev alexpyattaev force-pushed the deprecate_explicit_reuseport branch 2 times, most recently from 2a3988f to 1ee0377 Compare June 17, 2025 20:18
@alexpyattaev alexpyattaev force-pushed the deprecate_explicit_reuseport branch from 1ee0377 to 7b2ae82 Compare June 17, 2025 21:07
@mergify
Copy link
Copy Markdown

mergify Bot commented Jun 17, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

Comment thread net-utils/src/lib.rs Outdated
Comment on lines +407 to +422
// binds many sockets to the same port in a range
// Note: The `mut` modifier for `num` is unused but kept for compatibility with the public API.
#[deprecated(
since = "2.2.0",
note = "use `multi_bind_in_range_with_config` instead"
)]
#[allow(unused_mut)]
pub fn multi_bind_in_range(
ip_addr: IpAddr,
range: PortRange,
mut num: usize,
) -> io::Result<(u16, Vec<UdpSocket>)> {
let config = SocketConfig::default();
multi_bind_in_range_with_config(ip_addr, range, config, num)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a little confused here. are you creating a new function and deprecating it at the same time?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bad rebase. Fixed in bbc1184

@alexpyattaev
Copy link
Copy Markdown
Author

Basically the situation is that we can not backport just the deprecation warnings without providing a new implementation to actually use. So this PR is supposed to provide a stopgap implementation that people can actually use, and in 3.0 we remove the backwards compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants