
Updates static assertions for CLUSTER_NODES_CACHES_NUM_EPOCH_CAP #8599

Merged

brooksprumo merged 1 commit into anza-xyz:master from brooksprumo:leader-schedule-stakes/network on Oct 22, 2025

Conversation

@brooksprumo

Problem

We'd like to reduce the number of epochs in MAX_LEADER_SCHEDULE_STAKES, but this constant is required to be a specific value by CLUSTER_NODES_CACHES_NUM_EPOCH_CAP in retransmit/broadcast.

(Why do we want to reduce the number of epochs in MAX_LEADER_SCHEDULE_STAKES? Good question! This is because the epoch stakes cache is serialized into the snapshot, and it is by far the largest component, both in size and in serialization time. We also don't need all of the epochs.

Note that MAX_LEADER_SCHEDULE_STAKES = 5 was arbitrarily chosen. Here's its origin: solana-labs#7668 (comment).

Also note that we used to actually store 6 epochs in the cache. That was changed to 5 in #8584.)

Summary of Changes

Update the static assertions for CLUSTER_NODES_CACHES_NUM_EPOCH_CAP so they no longer require a specific value for MAX_LEADER_SCHEDULE_STAKES; instead, assert the range in which it is valid.
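For illustration, a minimal sketch of what such a range-based compile-time assertion can look like in Rust; the constant names mirror the ones discussed here, but the values and placement are placeholders rather than the actual agave code:

// Illustrative placeholders; the real constants live in the agave runtime and
// turbine crates and may differ in value and type.
const MAX_LEADER_SCHEDULE_STAKES: u64 = 5;
const CLUSTER_NODES_CACHE_NUM_EPOCH_CAP: u64 = 2;

// Compile-time range check instead of pinning MAX_LEADER_SCHEDULE_STAKES to one value:
// the cache must hold at least 2 epochs (shreds can arrive from either side of an epoch
// boundary) and no more than the number of epoch-stakes the Bank retains.
const _: () = {
    assert!(CLUSTER_NODES_CACHE_NUM_EPOCH_CAP >= 2);
    assert!(CLUSTER_NODES_CACHE_NUM_EPOCH_CAP <= MAX_LEADER_SCHEDULE_STAKES);
};

fn main() {}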

@brooksprumo brooksprumo self-assigned this Oct 21, 2025
// 1. There must be at least two epochs because near an epoch boundary you might receive
// shreds from the other side of the epoch boundary.
// 2. It does not make sense to have capacity more than the number of epoch-stakes in Bank.
assert!(CLUSTER_NODES_CACHE_NUM_EPOCH_CAP >= 2);
@brooksprumo (Author)

Is 2 sufficient here? The reasons cited in the linked discussion include warping, which shouldn't be an issue since all other nodes would have to warp to the same slot as well (and thus the same epoch); warping is really only used in ledger-tool and test-validator anyway.

The other reason was unknown unknowns. Should that suggest that we use a different value?

@alexpyattaev (Oct 21, 2025)

Unknown unknowns are not ideal.

Broadcast stage only operates on cluster nodes at the tip of the chain (±32 slots), so 2 epochs is 100% enough there.

From what we have in ClusterSlotsService, anything more than 50,000 slots in the past is effectively impossible to repair, so 2 epochs should always be enough for repair. (Yes, during cluster startup epochs are shorter, but then we are also unlikely to be catching up from more than 1 epoch in the past anyway.)

Gossip only operates with the latest available stake weights, and can tolerate massive divergence in stake amounts.

Overall, from a networking PoV we should never need more than 2 epochs' worth of information. If we do, it's a bug we need to fix.
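
(As a rough sanity check of those windows, here is a small illustrative Rust snippet comparing them to mainnet-beta's standard epoch length of 432,000 slots; the constants are assumptions for illustration, not imports from agave.)

// Rough sanity check of the windows mentioned above against mainnet-beta's standard
// epoch length; the numbers are illustrative assumptions, not values read from agave.
const SLOTS_PER_EPOCH: u64 = 432_000; // mainnet-beta epoch length
const REPAIR_HORIZON_SLOTS: u64 = 50_000; // practical repair horizon cited above
const BROADCAST_WINDOW_SLOTS: u64 = 32; // broadcast works at the tip, roughly +/- 32 slots

fn main() {
    // Both windows fit well inside a single epoch, so any slot broadcast or repaired is at
    // most one epoch boundary away from the tip, i.e. within 2 cached epochs.
    assert!(REPAIR_HORIZON_SLOTS < SLOTS_PER_EPOCH);
    assert!(BROADCAST_WINDOW_SLOTS < SLOTS_PER_EPOCH);
    let fraction = REPAIR_HORIZON_SLOTS as f64 / SLOTS_PER_EPOCH as f64;
    println!("repair horizon is {:.1}% of one epoch", fraction * 100.0);
}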


Generally speaking, 2 seems sufficient.

I could see some corner cases when we shrink epochs down to the minimum number of slots, especially during the bootstrap period. But my understanding is that this shouldn't fail; it would just thrash the LRU cache.

Might be nice to add some metrics in this area to get visibility into thrashing (doesn't have to be this PR).
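
(One possible shape for such a metric, sketched as a plain atomic eviction counter; the names and the reporting hook are hypothetical, not agave's actual metrics plumbing.)

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counter for LRU evictions in the cluster-nodes cache; bump it wherever an
// entry is evicted and report it periodically so thrashing shows up in dashboards.
static CLUSTER_NODES_CACHE_EVICTIONS: AtomicU64 = AtomicU64::new(0);

fn record_eviction() {
    CLUSTER_NODES_CACHE_EVICTIONS.fetch_add(1, Ordering::Relaxed);
}

fn report_and_reset() -> u64 {
    // Read-and-reset so each report covers one interval.
    CLUSTER_NODES_CACHE_EVICTIONS.swap(0, Ordering::Relaxed)
}

fn main() {
    record_eviction();
    println!("evictions this interval: {}", report_and_reset());
}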

@brooksprumo brooksprumo marked this pull request as ready for review October 21, 2025 13:26
@brooksprumo brooksprumo requested a review from a team as a code owner October 21, 2025 13:26
@brooksprumo (Author)

@bw-solana - requesting your review since you reviewed the PR that I grabbed the comments/values from: #1735
@gregcusack - requesting your review as SME for networking

Please request reviews from other people as needed. Thanks!

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.1%. Comparing base (e6661f4) to head (950473a).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #8599   +/-   ##
=======================================
  Coverage    83.1%    83.1%           
=======================================
  Files         850      850           
  Lines      369043   369043           
=======================================
+ Hits       306926   306959   +33     
+ Misses      62117    62084   -33     

@alexpyattaev

From the network protocol PoV we should never need more than 2 epochs of ClusterNodes (current + upcoming).
Can you elaborate how the epoch stakes cache ends up occupying a lot of space? It seems like we are storing 1 hashmap of ~2000 entries per epoch, no?

@brooksprumo (Author)

> Can you elaborate how the epoch stakes cache ends up occupying a lot of space?

My understanding is that each epoch of the epoch stakes cache contains the pubkey and delegation from every single stake account.

There's about 1.175 million stake accounts on mnb, and the epoch stakes cache holds 6 epochs in v2.3 and v3.0. So over 7 million entries in each snapshot.
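
(A back-of-envelope size estimate for that, as a small Rust snippet; the per-entry size is an assumption of a 32-byte pubkey plus an 8-byte delegation amount, not the exact serialized layout.)

fn main() {
    let stake_accounts: u64 = 1_175_000; // ~1.175M stake accounts on mainnet-beta
    let epochs: u64 = 6; // epochs kept in v2.3 / v3.0 snapshots
    let bytes_per_entry: u64 = 32 + 8; // assumed: pubkey + delegation amount

    let entries = stake_accounts * epochs;
    let approx_mib = entries * bytes_per_entry / (1024 * 1024);
    // ~7.05 million entries and on the order of ~270 MiB, before any other per-entry overhead.
    println!("{} entries, roughly {} MiB", entries, approx_mib);
}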

@alexpyattaev

> Can you elaborate how the epoch stakes cache ends up occupying a lot of space?
>
> My understanding is that each epoch of the epoch stakes cache contains the pubkey and delegation from every single stake account.
>
> There's about 1.175 million stake accounts on mnb, and the epoch stakes cache holds 6 epochs in v2.3 and v3.0. So over 7 million entries in each snapshot.

This sounds very excessive. Either way, storing 6 epochs should be overkill; I'd be surprised if we need more than 2, but we can be conservative and go down to 3. Either way, this needs to be merged first =)

@alexpyattaev left a comment

LGTM

@brooksprumo (Author)

> This sounds very excessive.

Oh yes 😸

> Either way, storing 6 epochs should be overkill; I'd be surprised if we need more than 2, but we can be conservative and go down to 3. Either way, this needs to be merged first =)

🤝

@brooksprumo brooksprumo added this pull request to the merge queue Oct 22, 2025
Merged via the queue into anza-xyz:master with commit 6f91f67 Oct 22, 2025
43 checks passed
@brooksprumo brooksprumo deleted the leader-schedule-stakes/network branch October 22, 2025 13:55