
Clean up socket allocation logic in cluster_info.rs #5832

Merged
alexpyattaev merged 7 commits into anza-xyz:master from alexpyattaev:cleanup_cluster_info
Jul 6, 2025

Conversation


@alexpyattaev alexpyattaev commented Apr 15, 2025

Problem

#5824

  • Socket binding logic is a disaster of code duplication (because we have had nearly identical logic for localhost and public IP binding) with sudden flakiness mixed in (since multiple tests can be trying to book the same port ranges at the same time).
  • Tests would also flake because we were relying on port binding logic to never try to bind an already bound port while actually not blocking this properly
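
As a rough illustration of the port-range contention described above, here is a minimal sketch of one way tests in a single process could be handed non-overlapping port ranges. The helper name `next_test_port_range` and the constants `BASE_PORT` and `SLICE` are hypothetical and are not taken from net-utils; the sketch does not cover tests running in separate processes.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

// Hypothetical values chosen for illustration only.
const BASE_PORT: u16 = 20_000;
const SLICE: u16 = 25;

// Process-wide cursor: each caller advances it by one slice.
static NEXT_START: AtomicU16 = AtomicU16::new(BASE_PORT);

/// Hypothetical helper: returns a (start, end) port range, end exclusive,
/// that no other caller in this process has been given.
/// Note: this sketch does not guard against running out of port space.
fn next_test_port_range() -> (u16, u16) {
    let start = NEXT_START.fetch_add(SLICE, Ordering::Relaxed);
    (start, start + SLICE)
}
```

Because every call advances the shared cursor atomically, two tests running concurrently in the same process never receive the same range, so they cannot race to bind the same port.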

Summary of Changes

  • Merge the socket binding code into one codepath (down from 3) and move some of the logic into net-utils for clarity (-200 LOC of boilerplate)

@alexpyattaev alexpyattaev force-pushed the cleanup_cluster_info branch from e8ce52e to a34fa4a Compare April 15, 2025 18:43
@alexpyattaev alexpyattaev marked this pull request as ready for review April 15, 2025 18:45
@alexpyattaev alexpyattaev requested a review from gregcusack April 15, 2025 19:26
@alexpyattaev alexpyattaev force-pushed the cleanup_cluster_info branch 2 times, most recently from 957f9a4 to 7e8b292 Compare April 15, 2025 21:54
@alexpyattaev alexpyattaev changed the title Cleanup cluster info Cleanup socket allocation logic in cluster info (and in tests in general) Apr 16, 2025
@alexpyattaev alexpyattaev force-pushed the cleanup_cluster_info branch from 0b78585 to b34f81c Compare April 16, 2025 10:35
@alexpyattaev alexpyattaev marked this pull request as draft April 17, 2025 09:33
@alexpyattaev alexpyattaev force-pushed the cleanup_cluster_info branch from b34f81c to 47861e7 Compare April 17, 2025 11:18
@alexpyattaev alexpyattaev force-pushed the cleanup_cluster_info branch 2 times, most recently from f436b15 to 1745647 Compare April 22, 2025 15:46
@alexpyattaev alexpyattaev changed the title Cleanup socket allocation logic in cluster info (and in tests in general) Clean up socket allocation logic in cluster_info.rs Apr 22, 2025
@alexpyattaev alexpyattaev force-pushed the cleanup_cluster_info branch 2 times, most recently from e4fbce3 to 44f1034 Compare April 22, 2025 19:01
@alexpyattaev alexpyattaev marked this pull request as ready for review April 22, 2025 19:27
@alexpyattaev alexpyattaev requested a review from bw-solana April 22, 2025 19:27

codecov-commenter commented Apr 22, 2025

Codecov Report

Attention: Patch coverage is 81.89655% with 21 lines in your changes missing coverage. Please review.

Project coverage is 83.3%. Comparing base (4e33b78) to head (909c6f4).
Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #5832     +/-   ##
=========================================
- Coverage    83.3%    83.3%   -0.1%     
=========================================
  Files         853      853             
  Lines      378158   377982    -176     
=========================================
- Hits       315300   315110    -190     
- Misses      62858    62872     +14     

@alexpyattaev alexpyattaev requested a review from lijunwangs April 23, 2025 06:28

@bw-solana bw-solana left a comment


I'm having a tough time reviewing thoroughly. It might help to break this up into separate PRs for some of the major changes:

  • Reorganizing the socket binding APIs
  • Changing the reuse port usage
  • Switching VALIDATOR_PORT_RANGE --> localhost_port_range_for_tests
  • Moving get_gossip_port

All but the socket binding reorganization should be trivial to review. For that one, it might be nice to understand the current hierarchy of functions and the new hierarchy w/ your changes to help guide the review

Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs Outdated
Comment thread net-utils/src/lib.rs Outdated
@alexpyattaev
Author

> * Reorganizing the socket binding APIs
> * Changing the reuse port usage
> * Switching `VALIDATOR_PORT_RANGE` --> `localhost_port_range_for_tests`
> * Moving `get_gossip_port`

This plan cannot work because stuff will fail CI and we will be committing broken code, so the first two steps would essentially have to be merged into one.

We can definitely move get_gossip_port and switch to localhost_port_range in separate PRs though.

@alexpyattaev alexpyattaev force-pushed the cleanup_cluster_info branch from e6e16d6 to f47ea7f Compare June 26, 2025 20:31
@alexpyattaev alexpyattaev marked this pull request as ready for review June 26, 2025 21:26

@gregcusack gregcusack left a comment


mostly looking good, but it appears new_single_bind() is no longer doing single bind

Comment thread gossip/src/cluster_info.rs Outdated
}

/// create localhost node for tests with provided pubkey
/// unlike the public IP version, this will also bind RPC sockets.

what is the "public IP version"? are you referring to new_with_external_ip()? Can we refer to it directly by name in this comment?

Author


yes, good idea!

Comment thread gossip/src/cluster_info.rs Outdated
) -> (u16, UdpSocket) {
bind_in_range_with_config(bind_ip_addr, port_range, config).expect("Failed to bind")
let addr = IpAddr::V4(Ipv4Addr::LOCALHOST);
let gossip_addr = SocketAddr::new(addr, 0);

before we were binding to gossip localhost within the port range provided by localhost_port_range_for_tests(). now we are just binding to a random port, possibly outside the port range. this is probably fine since it's just for tests and stuff but does change behavior

Author


Thank you for catching this one, it is a bug! It would not "break" anything but would just cause flaky tests again.
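
To make the behavior change discussed in this thread concrete, here is a minimal sketch using only the standard library. Binding to port 0 lets the OS pick any free port, while an in-range bind only ever returns a port inside the requested range. Both function names are hypothetical and are not the net-utils API; `bind_in_range` assumes the range end is exclusive.

```rust
use std::io;
use std::net::{IpAddr, Ipv4Addr, SocketAddr, UdpSocket};

/// Bind to port 0 on localhost: the OS chooses any free port,
/// with no regard for a test-specific port range.
fn bind_os_assigned() -> io::Result<UdpSocket> {
    UdpSocket::bind(SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), 0))
}

/// Hypothetical in-range bind: try each port in `range` (end exclusive)
/// on localhost and keep the first socket that binds successfully.
fn bind_in_range(range: (u16, u16)) -> io::Result<UdpSocket> {
    let ip = IpAddr::V4(Ipv4Addr::LOCALHOST);
    (range.0..range.1)
        .find_map(|port| UdpSocket::bind(SocketAddr::new(ip, port)).ok())
        .ok_or_else(|| io::Error::new(io::ErrorKind::AddrInUse, "no free port in range"))
}
```

With the first function the resulting port can land anywhere the OS allows, which is why it can collide with a range another test believes it owns; the second keeps the socket inside the caller's range or fails.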

Author


Inlined & deprecated in 909c6f4

Comment thread gossip/src/cluster_info.rs Outdated
Comment thread gossip/src/cluster_info.rs Outdated
let (tvu_port, tvu) = Self::bind_with_config(bind_ip_addr, port_range, socket_config);
let (tvu_quic_port, tvu_quic) =
Self::bind_with_config(bind_ip_addr, port_range, socket_config);
let ((tpu_port, tpu), (_tpu_quic_port, tpu_quic)) =

maybe i'm reading this wrong, but in this old code, tpu gets just a single UdpSocket. but then in the new code we call:

let ((tpu_port, tpu_sockets), (_tpu_port_quic, tpu_quic)) =
    bind_two_in_range_with_offset_and_config(
        bind_ip_addr,
        port_range,
        QUIC_PORT_OFFSET,
        socket_config,
        socket_config,
    )
    .expect("tou_socket primary bind");
let tpu_sockets =
    bind_more_with_config(tpu_sockets, 32, socket_config).expect("tpu_sockets multi_bind");

so now for new_single_bind() we are binding 32 sockets for tpu even though we only want one


same goes for tvu, tpu_forwards, tpu_vote, broadcast, retransmit_sockets

Author


Yes, it is different, but it is a distinction without a difference. new_single_bind is only used in test_validator, which does not derive any meaningful benefit from multiple binds, but is not hurt by them either. This does remind me that we need to finally deprecate TPU UDP for good though!


ok so we are good binding multiple sockets to the same ip/port for test validator? that works for me. but then probably would change name from new_single_bind() to match the fact that it is no longer a single bind

Author


it's public API =( I guess we can go ahead and inline it into its sole caller instead.

alexpyattaev and others added 2 commits June 27, 2025 23:31
Co-authored-by: Greg Cusack <greg.cusack@anza.xyz>
@alexpyattaev alexpyattaev requested a review from gregcusack June 27, 2025 20:58
Comment thread gossip/src/cluster_info.rs Outdated
)
.expect("tou_socket primary bind");
let tpu_sockets =
bind_more_with_config(tpu_sockets, 32, socket_config).expect("tpu_sockets multi_bind");

Why is the count hardcoded? It should be based on the request.

Author


it has been hardcoded before. Also, do we even use a single one of those 32? I thought TPU UDP is not used anymore.

Author


@sakridge confirmed it is going away so we do not care at the moment.

@alexpyattaev alexpyattaev requested a review from lijunwangs June 28, 2025 07:26
.expect("Number of QUIC endpoints can not be zero"),
vortexor_receiver_addr: None,
};
let mut node = Self::new_with_external_ip(pubkey, config);

This section of code is almost identical to the one in new_single_bind, except that the gossip port is set. Maybe put it in common code?

Author


I believe new_single_bind should get deprecated. Just axing it is also an option but in this case we can follow procedure. I think it has no practical purpose, just more code duplication. Am I missing something?
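
For readers unfamiliar with the deprecation route mentioned here, a minimal sketch of how a public constructor can be kept but marked deprecated while delegating to its replacement. The `Node` struct below is a stand-in with a single illustrative field, and `new_with_ports` is a hypothetical replacement name; neither reflects the real definitions in cluster_info.rs.

```rust
/// Stand-in for the real node type; the field is illustrative only.
pub struct Node {
    pub gossip_port: u16,
}

impl Node {
    /// Hypothetical replacement constructor.
    pub fn new_with_ports(gossip_port: u16) -> Self {
        Node { gossip_port }
    }

    /// Old public entry point kept for compatibility. Callers get a
    /// compiler warning pointing at the replacement, and the body just
    /// delegates so no logic is duplicated.
    #[deprecated(note = "use `Node::new_with_ports` instead")]
    pub fn new_single_bind(gossip_port: u16) -> Self {
        Self::new_with_ports(gossip_port)
    }
}
```

This keeps existing downstream callers compiling while signalling that the function is on its way out, which is the trade-off against simply removing it.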


@lijunwangs lijunwangs left a comment


LGTM. It greatly reduces code duplication.

@alexpyattaev alexpyattaev merged commit ea9f28e into anza-xyz:master Jul 6, 2025
28 checks passed
@alexpyattaev alexpyattaev deleted the cleanup_cluster_info branch July 6, 2025 18:16
5 participants