Multihoming: Hotswap gossip socketaddr #6474

Merged
gregcusack merged 9 commits into anza-xyz:master from gregcusack:hot-swap-gossip-socketaddr
Jun 26, 2025

@gregcusack gregcusack commented Jun 9, 2025

Problem

There is currently no way to swap the gossip IP address at runtime, which multihoming support requires.

Summary of Changes

  1. Add an AdminRPC command to rebind the gossip socket and refresh the gossip ContactInfo.
  2. Add a channel to read the command and execute the hot swap of the gossip socket.
  3. Update streamer to take an AtomicUdpSocket, which is hot-swappable.

HOW TO:

  • Say you want to switch your gossip port from 8000 to 9998. Run:
echo '{"jsonrpc":"2.0","id":1,"method":"setGossipSocket","params":["<your-nodes-ip>",9998]}' | socat - ./<path-to-admin-rpc>

example:

echo '{"jsonrpc":"2.0","id":1,"method":"setGossipSocket","params":["xxx.xxx.xx.x",9998]}' | socat - ./admin.rpc
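
For anyone who would rather script this than use socat, here is a minimal sketch of the same request in Rust using only the standard library. The function names are made up for illustration, and the sketch assumes the admin RPC endpoint is a Unix domain socket that answers once the client finishes writing; that matches how the socat example behaves but is not verified against the server code.

```rust
use std::io::{self, Read, Write};
use std::net::{IpAddr, Shutdown};
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Builds the JSON-RPC 2.0 payload for `setGossipSocket`.
/// The parameter order (IP string, then port) mirrors the socat example above.
fn set_gossip_socket_request(ip: IpAddr, port: u16) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":1,"method":"setGossipSocket","params":["{}",{}]}}"#,
        ip, port
    )
}

/// Hypothetical helper: writes one request to the admin RPC socket and
/// returns whatever the server sends back before closing the connection.
fn send_admin_rpc(socket_path: &Path, request: &str) -> io::Result<String> {
    let mut stream = UnixStream::connect(socket_path)?;
    stream.write_all(request.as_bytes())?;
    stream.write_all(b"\n")?;
    // Half-close so the server sees EOF on its read side (assumed behavior).
    stream.shutdown(Shutdown::Write)?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

Calling `send_admin_rpc(Path::new("./admin.rpc"), &set_gossip_socket_request(ip, 9998))` would then correspond to the socat one-liner.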

@gregcusack gregcusack changed the title Multihoming: Hot swap gossip socketaddr Multihoming: Hotswap gossip socketaddr Jun 9, 2025
@gregcusack gregcusack marked this pull request as draft June 9, 2025 19:48
Comment thread streamer/src/atomic_udp_socket.rs Outdated
Comment thread streamer/src/atomic_udp_socket.rs

codecov-commenter commented Jun 9, 2025

Codecov Report

❌ Patch coverage is 72.22222% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.4%. Comparing base (bbea86a) to head (1809a7a).
⚠️ Report is 3084 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #6474     +/-   ##
=========================================
- Coverage    83.4%    83.4%   -0.1%     
=========================================
  Files         850      851      +1     
  Lines      377710   377863    +153     
=========================================
+ Hits       315086   315166     +80     
- Misses      62624    62697     +73     

@alexpyattaev

Overall, I think this is a good approach for now, we should follow through. I'll clean up the socket binding logic to make it easier to integrate this for other services with more esoteric binding patterns.

@gregcusack gregcusack marked this pull request as ready for review June 10, 2025 15:58
@gregcusack
Author

In my last commit I took out last_id, which was needed to determine whether the socket had been set, even for CurrentSocket::Same(). We needed this to set the read timeout in recv_loop() when it is first called. I ended up putting that logic in recv_loop() instead of putting it back in SocketProvider.
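
To make the read-timeout point concrete, here is a rough sketch of a receive loop noticing that it holds a socket it has not configured yet, either on the first call or after a swap. `SwappableSocket`, `RecvState`, and `current_socket` are hypothetical names, and the generation counter behind a `RwLock` is an assumption chosen to keep the example std-only; it is not the PR's AtomicUdpSocket or SocketProvider code.

```rust
use std::io;
use std::net::UdpSocket;
use std::sync::{Arc, RwLock};
use std::time::Duration;

/// Hypothetical stand-in for a hot-swappable socket: a generation
/// counter plus the socket that is currently in use.
struct SwappableSocket {
    inner: RwLock<(u64, Arc<UdpSocket>)>,
}

impl SwappableSocket {
    fn new(socket: UdpSocket) -> Self {
        Self { inner: RwLock::new((0, Arc::new(socket))) }
    }

    /// Installs a new socket and bumps the generation.
    fn swap(&self, socket: UdpSocket) {
        let mut guard = self.inner.write().unwrap();
        let next = guard.0 + 1;
        *guard = (next, Arc::new(socket));
    }

    fn load(&self) -> (u64, Arc<UdpSocket>) {
        let guard = self.inner.read().unwrap();
        (guard.0, Arc::clone(&guard.1))
    }
}

/// Per-loop state: the generation that last had its timeout applied.
struct RecvState {
    configured: Option<u64>,
}

/// Returns the current socket and whether it had to be (re)configured.
/// The timeout is applied on the first call and after every swap.
fn current_socket(
    provider: &SwappableSocket,
    state: &mut RecvState,
    timeout: Duration,
) -> io::Result<(Arc<UdpSocket>, bool)> {
    let (generation, socket) = provider.load();
    let needs_setup = state.configured != Some(generation);
    if needs_setup {
        socket.set_read_timeout(Some(timeout))?;
        state.configured = Some(generation);
    }
    Ok((socket, needs_setup))
}
```

Keeping the "have I configured this socket yet" state in the loop rather than in the provider mirrors the choice described above.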

@gregcusack gregcusack force-pushed the hot-swap-gossip-socketaddr branch from 3d3eb14 to 2e4203a Compare June 10, 2025 16:20
@gregcusack gregcusack requested a review from alexpyattaev June 10, 2025 17:00
Comment thread gossip/src/gossip_service.rs Outdated
should_check_duplicate_instance: bool,
stats_reporter_sender: Option<Sender<Box<dyn FnOnce() + Send>>>,
exit: Arc<AtomicBool>,
gossip_rebind_rx: Option<Receiver<SocketAddr>>,

I wonder, since we are sending a socket and a rebinder to the service, would it not be better if we sent both in the same argument? They are logically tied to each other fairly hard... Would it make sense to send AtomicUdpSocket rather than a UdpSocket and a notification channel separately?

Author

We also create a GossipService in bootstrap, but we don't necessarily want that one to be rebindable: https://github.com/gregcusack/solana/blob/2e4203abd39239a8fce1ff7894bdb28f354f9c1b/validator/src/bootstrap.rs#L161. We could make it rebindable, but it would never be used.

we could create some GossipSocket enum like:

pub enum GossipSocket {
    Static(UdpSocket),
    Rebindable(RebindableSocket), // maybe not the best name lol
}

where

pub struct RebindableSocket {
    pub socket: Arc<AtomicUdpSocket>,
    pub rebind_rx: Receiver<SocketAddr>,
}

Maybe use an equivalent of RebindableSocket always and just fill rebind_rx with https://docs.rs/crossbeam/latest/crossbeam/channel/fn.never.html ? Then you can trivially create one from UdpSocket for cases when rebinding is not necessary.

Author

ended up just removing the crossbeam channel altogether

Comment thread gossip/src/gossip_service.rs Outdated
socket_addr_space,
stats_reporter_sender,
);
let t_rebind = gossip_rebind_rx.map(|rebind_rx| {

Given that binding a socket takes literally microseconds, do we need a thread waiting on a channel to do it? Can we maybe rebind immediately and just swap the thing in place immediately in the admin RPC handling code?

Comment thread validator/src/admin_rpc_service.rs Outdated

meta.with_post_init(|post_init| {
if let Some(gossip_rebinder) = &post_init.gossip_rebinder {
gossip_rebinder.rebind(new_addr).map_err(|e| {

Do you think it would be possible to perform the actual rebinding here and avoid having the channel and Rebinder abstraction on top of it? We are not calling this every second, so even if this RPC blocks for a few milliseconds it is not a big deal.

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better idea. yes i can do that
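
As a sketch of what rebinding directly in the handler could look like: bind the new address first, and only replace the shared socket if the bind succeeds. `rebind_in_place` is a hypothetical name, and the `RwLock<Arc<UdpSocket>>` slot is an assumption used to keep the example std-only; the real hot-swappable socket type may be built differently.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::{Arc, RwLock};

/// Binds `new_addr` and swaps the result into `slot`.
/// If the bind fails, the `?` returns early and the old socket stays in place.
fn rebind_in_place(
    slot: &RwLock<Arc<UdpSocket>>,
    new_addr: SocketAddr,
) -> io::Result<SocketAddr> {
    let fresh = UdpSocket::bind(new_addr)?;
    let bound = fresh.local_addr()?;
    // The write lock is held only for the pointer swap itself.
    *slot.write().unwrap() = Arc::new(fresh);
    Ok(bound)
}
```

Readers that already cloned the old `Arc` keep using the old socket until they reload from the slot, so a loop that reads the slot on each iteration picks up the new socket on its next pass.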

@gregcusack gregcusack requested a review from alexpyattaev June 10, 2025 23:13
alexpyattaev previously approved these changes Jun 13, 2025
@alexpyattaev alexpyattaev left a comment

LGTM, this does not need to be perfect at the moment.

@gregcusack
Author

> LGTM, this does not need to be perfect at the moment.

I'd like to wait on merging this until I can test on a box w/ two routable IPs just to make sure we're good to go here

@gregcusack gregcusack force-pushed the hot-swap-gossip-socketaddr branch from f1ee213 to 1809a7a Compare June 24, 2025 03:18
@gregcusack gregcusack requested a review from alexpyattaev June 24, 2025 19:53
@gregcusack
Author

pushed another commit: 1809a7a

Removes a double Arc around the hot-swappable socket and changes the Sockets struct to hold an AtomicUdpSocket for the gossip socket.

@gregcusack gregcusack requested review from alexpyattaev and removed request for alexpyattaev June 24, 2025 20:26
@gregcusack
Author

@alexpyattaev UPDATE: this has been tested on a multihomed machine and is working! Now that we have support for passing in --bind-address, I wonder if it makes sense to close up the AdminRPC API on this?

Probably doesn't really matter. But currently you can pass in any IP/port into AdminRPC. Of course, if you pass in an IP that doesn't exist as an interface on your machine the swap will fail. Should we instead check if the IP exists in BindIpAddrs first? Run the swap if it does. If not, log and return an error?
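
Roughly what that check could look like, as a sketch only. `check_rebind_target` is a hypothetical name, and the `allowed` slice stands in for whatever BindIpAddrs actually exposes, which is not shown in this thread.

```rust
use std::net::{IpAddr, SocketAddr};

/// Accepts the requested address only if its IP is one of the configured
/// bind addresses. `allowed` is an assumed stand-in for BindIpAddrs.
fn check_rebind_target(
    allowed: &[IpAddr],
    requested: SocketAddr,
) -> Result<SocketAddr, String> {
    if allowed.contains(&requested.ip()) {
        Ok(requested)
    } else {
        Err(format!(
            "{} is not one of the configured bind addresses",
            requested.ip()
        ))
    }
}
```

The handler would call this before attempting the swap and return the error string to the caller on rejection.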

@alexpyattaev alexpyattaev left a comment

Let us not add this coupling just yet. Keeping the changeset small has benefits too. For now we can just assume the adminRPC does not get garbage in (which should be a reasonable assumption for what is an experimental feature at the moment). Once we have the logic mostly sorted we can add idiotproofing where appropriate.

@gregcusack gregcusack merged commit 4e33b78 into anza-xyz:master Jun 26, 2025
39 checks passed
@gregcusack gregcusack deleted the hot-swap-gossip-socketaddr branch June 26, 2025 20:01

3 participants