
reduce buffer size of "unused" side of socket (#9218)

Merged: gregcusack merged 2 commits into anza-xyz:master from gregcusack:reduce-net-buffer-sizes-v2, Jan 29, 2026

Conversation


@gregcusack gregcusack commented Nov 21, 2025

Follow Up to PR: #3929 and redo of: #4313

Problem

According to the agave validator docs, the max buffer sizes should be set to 128 MB. See:

```shell
sudo bash -c "cat >/etc/sysctl.d/21-agave-validator.conf <<EOF
# Increase max UDP buffer sizes
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
```

But we do not make a recommendation for the default size; the default is host dependent. The default size is the limit the buffer can grow to until the operator calls setsockopt on the socket and increases it (bounded by net.core.rmem_max and net.core.wmem_max).
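The host-dependent default and the rmem_max clamp are easy to observe directly. Below is a minimal sketch, assuming Linux; the exact numbers printed depend on the host's net.core.rmem_default and net.core.rmem_max sysctls:

```python
import socket

# A fresh UDP socket starts at the host's default receive-buffer size
# (net.core.rmem_default); this is what it keeps until setsockopt is called.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
default_rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# Ask for 1 GiB: Linux silently clamps the request to net.core.rmem_max
# rather than failing, so the effective size never exceeds the sysctl cap.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 30)
clamped_rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print(default_rcvbuf, clamped_rcvbuf)
sock.close()
```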

While the kernel doesn't pre-allocate memory for these buffers, we want to prevent it from allocating memory for both directions on protocols that don't both read AND write on their socket(s). For example, retransmit_sockets are strictly write-only sockets: a peer can send to the retransmit_sockets and get the node to allocate memory that will never be used (except to hold the garbage data sent by the peer).

The "unused" side of a socket is either the read or write side of the socket that is not used. So for retransmit_sockets the "unused" side is the read side. For tvu, the unused side is the write side.

Summary of Changes

  1. QUIC: Reduce "unused" side of socket to 4 MB (need enough for control traffic)
  2. UDP: Reduce "unused" side of socket to 0 MB (the kernel will actually floor these at 2048 bytes and 256 bytes for the send and receive buffers, respectively): https://man7.org/linux/man-pages/man7/socket.7.html
  3. I purposely did not touch Alpenglow related sockets here
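The UDP floor in (2) can be checked directly. A short sketch (Linux-specific; the exact floor values differ slightly across kernel versions, but man 7 socket documents doubled minimums of 2048 bytes for SO_SNDBUF and 256 bytes for SO_RCVBUF):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Request zero-byte buffers on both sides; Linux accepts the call but
# refuses to shrink below its per-socket minimums.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 0)

sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(sndbuf, rcvbuf)  # small kernel floors, not 0
sock.close()
```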

Services/Sockets as defined in Sockets:

```rust
pub struct Sockets {
    pub gossip: Arc<[UdpSocket]>,
    pub ip_echo: Option<TcpListener>,
    pub tvu: Vec<UdpSocket>,
    pub tvu_quic: UdpSocket,
    pub tpu: Vec<UdpSocket>,
    pub tpu_forwards: Vec<UdpSocket>,
    pub tpu_vote: Vec<UdpSocket>,
    pub broadcast: Vec<UdpSocket>,
    // Socket sending out local repair requests,
    // and receiving repair responses from the cluster.
    pub repair: UdpSocket,
    pub repair_quic: UdpSocket,
    pub retransmit_sockets: Vec<UdpSocket>,
    // Socket receiving remote repair requests from the cluster,
    // and sending back repair responses.
    pub serve_repair: UdpSocket,
    pub serve_repair_quic: UdpSocket,
    // Socket sending out local RepairProtocol::AncestorHashes,
    // and receiving AncestorHashesResponse from the cluster.
    pub ancestor_hashes_requests: UdpSocket,
    pub ancestor_hashes_requests_quic: UdpSocket,
    pub tpu_quic: Vec<UdpSocket>,
    pub tpu_forwards_quic: Vec<UdpSocket>,
    pub tpu_vote_quic: Vec<UdpSocket>,
    /// Client-side socket for ForwardingStage vote transactions
    pub tpu_vote_forwarding_client: UdpSocket,
    /// Client-side socket for ForwardingStage non-vote transactions
    pub tpu_transaction_forwarding_clients: Box<[UdpSocket]>,
    /// Socket for alpenglow consensus logic
    pub alpenglow: Option<UdpSocket>,
    /// Connection cache endpoint for QUIC-based Vote
    pub quic_vote_client: UdpSocket,
    /// Connection cache endpoint for QUIC-based Alpenglow messages
    pub quic_alpenglow_client: UdpSocket,
    /// Client-side socket for RPC/SendTransactionService.
    pub rpc_sts_client: UdpSocket,
    pub vortexor_receivers: Option<Vec<UdpSocket>>,
}
```

Read/Write

gossip
ip_echo
repair
repair_quic
serve_repair
serve_repair_quic
ancestor_hashes_requests_quic
ancestor_hashes_requests
alpenglow

Read Only: <send buffer size>

tvu: 0 MB
tpu: 0 MB
tpu_forwards: 0 MB
tpu_vote: 0 MB
vortexor_receivers: 0 MB
tvu_quic: 4 MB
tpu_quic: 4 MB
tpu_forwards_quic: 4 MB
tpu_vote_quic: 4 MB

Write Only: <read buffer size>

retransmit_sockets: 0 MB
broadcast: 0 MB
tpu_vote_forwarding_client: 0 MB
quic_vote_client: 4 MB
tpu_transaction_forwarding_clients: 4 MB
rpc_sts_client (quic): 4 MB
quic_alpenglow_client: <unchanged>

gregcusack force-pushed the reduce-net-buffer-sizes-v2 branch from 04e155e to 8738b08 on November 21, 2025 20:29
Comment thread on gossip/src/node.rs (outdated), lines +113 to +119:
```rust
let control_traffic_buffer_size = 4 * 1024 * 1024;

let read_write_socket_config = SocketConfig::default();
let primarily_read_socket_config =
    SocketConfig::default().send_buffer_size(control_traffic_buffer_size);
let primarily_write_socket_config =
    SocketConfig::default().recv_buffer_size(control_traffic_buffer_size);
```
gregcusack (author) commented Nov 21, 2025:

This is a number based on what Alessandro said a year ago: https://discord.com/channels/428295358100013066/478692221441409024/1309346044517023794. Also not sure if we want to set the value even smaller for UDP traffic (retransmit, broadcast, tvu, etc.).


I'll add suggested values in PR description since you have a nice list going there.

gregcusack marked this pull request as ready for review November 21, 2025 20:36
gregcusack requested a review from a team as a code owner November 21, 2025 20:36

codecov-commenter commented Nov 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.9%. Comparing base (01f1def) to head (8bbe8c7).

Additional details and impacted files
```
@@            Coverage Diff            @@
##           master    #9218     +/-   ##
=========================================
- Coverage    82.9%    82.9%   -0.1%
=========================================
  Files         847      847
  Lines      320563   320618     +55
=========================================
+ Hits       265789   265806     +17
- Misses      54774    54812     +38
```

@alexpyattaev

To consider - for something like TVU retransmit sockets, what would be the downside of setting RX buffer to zero (since it is never used)?


stablebits commented Nov 22, 2025

@gregcusack @alexpyattaev a couple nuances :-) note that these buffers aren’t pre-allocated on Linux. The value simply limits how much the receive buffer can hold as data arrives.

Internally, the Linux kernel also doubles this value:
https://github.com/torvalds/linux/blob/master/net/core/sock.c#L982
This is also mentioned in man socket under SO_RCVBUF.

Incoming data is kept in sk_buffs (this is the overhead). For UDP, for example, see:
https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L2069
https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L1955

e.g. here the kernel checks if more data can be enqueued for UDP:
https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L1720
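The doubling behavior described above is easy to observe from userspace. A minimal sketch, assuming Linux and a request well below net.core.rmem_max (which defaults to roughly 208 KiB):

```python
import socket

requested = 8 * 1024  # well below typical net.core.rmem_max

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# The kernel stores double the requested value to account for sk_buff
# bookkeeping overhead, and getsockopt reports that doubled figure.
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(requested, effective)
sock.close()
```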

@alexpyattaev

> @gregcusack @alexpyattaev a couple nuances :-) note that these buffers aren’t pre-allocated on Linux. The value simply limits how much the receive buffer can hold as data arrives.

Yes, this is understood. Nevertheless, no point allowing bloated buffers when we clearly have no productive use for them. If we are under serious flood it is better to shed the packets rather than buffer them for seconds and then shed anyway.

gregcusack force-pushed the reduce-net-buffer-sizes-v2 branch from 8738b08 to a8fd1e0 on December 3, 2025 21:14

gregcusack commented Dec 3, 2025

For this PR, I'd like to focus specifically on sockets that are not bidirectional, i.e. read-only or write-only. Adjusting the buffer sizes for read/write sockets is going to take a decent amount of profiling on mnb nodes.

According to the agave validator docs, the max buffer sizes should be set to 128 MB. See:

```shell
sudo bash -c "cat >/etc/sysctl.d/21-agave-validator.conf <<EOF
# Increase max UDP buffer sizes
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
```

But we do not make a recommendation for the default size; the default is host dependent. The default size is the limit the buffer can grow to until the operator calls setsockopt on the socket and increases it (bounded by net.core.rmem_max and net.core.wmem_max).

EDIT: I'm not convinced we want to resize the "used side" of the sockets even in a follow-up. Feels like that would fall on the operator to adjust if needed.

@alexpyattaev

> For this PR, I'd like to focus specifically on sockets that are not bidirectional, i.e. read-only or write-only. Adjusting the buffer sizes for read/write sockets is going to take a decent amount of profiling on mnb nodes.
>
> According to agave validator docs, the max buffer sizes should be set to 128 MB. See:
>
> ```shell
> sudo bash -c "cat >/etc/sysctl.d/21-agave-validator.conf <<EOF
> # Increase max UDP buffer sizes
> net.core.rmem_max = 134217728
> net.core.wmem_max = 134217728
> ```
>
> But we do not make a recommendation for the default size; the default is host dependent. The default size is the limit the buffer can grow to until the operator calls setsockopt on the socket and increases it (bounded by net.core.rmem_max and net.core.wmem_max).
>
> EDIT: I'm not convinced we want to resize the "used side" of the sockets even in a follow-up. Feels like that would fall on the operator to adjust if needed.

Agreed. Let us narrow this one to setting the unused side to zero buffers. This will also allow us to backport this easily.


gregcusack commented Dec 3, 2025

I want to resize the unused side of these sockets here. Below I am proposing what to set the unused side to:
UDP -> 0 MB
QUIC -> 4 MB (for control traffic)

Read Only Sockets: <send buffer size>
tvu: 0 MB
tpu (udp): 0 MB
tpu_forwards (udp): 0 MB
tpu_vote (udp): 0 MB
vortexor_receivers (udp): 0 MB
tvu_quic: 4 MB
tpu_quic: 4 MB
tpu_forwards_quic: 4 MB
tpu_vote_quic: 4 MB

Write Only: <read buffer size>
tvu retransmit: 0 MB
broadcast: 0 MB
tpu_vote_forwarding_client (udp): 0 MB
quic_vote_client: 4 MB
tpu_transaction_forwarding_clients: 4 MB
rpc_sts_client (quic): 4 MB

Note that for Linux sockets we can try to set the buffer size to 0, but the kernel will floor it at 2048 bytes (send buffer) or 256 bytes (receive buffer). See: https://man7.org/linux/man-pages/man7/socket.7.html

gregcusack marked this pull request as draft December 4, 2025 04:40
gregcusack force-pushed the reduce-net-buffer-sizes-v2 branch 3 times, most recently from 2673641 to 2e7844e on December 5, 2025 04:37
gregcusack marked this pull request as ready for review December 5, 2025 04:39
gregcusack force-pushed the reduce-net-buffer-sizes-v2 branch from 2e7844e to bc9a742 on January 6, 2026 16:41
alexpyattaev left a review comment:

LGTM conceptually, left a few small nits to address.

Comment thread gossip/src/node.rs
Comment thread gossip/src/node.rs Outdated
Comment thread gossip/src/node.rs Outdated
gregcusack force-pushed the reduce-net-buffer-sizes-v2 branch from bc9a742 to 8bbe8c7 on January 29, 2026 18:01
alexpyattaev left a review comment:

LGTM

gregcusack added this pull request to the merge queue Jan 29, 2026
Merged via the queue into anza-xyz:master with commit 0f0d4e5, Jan 29, 2026; 50 checks passed
gregcusack deleted the reduce-net-buffer-sizes-v2 branch January 29, 2026 19:54