fix: enable on-demand leader schedule computation in get_slot_leaders #7765

Merged
KirillLykov merged 8 commits into anza-xyz:master from swarna1101:fix/get-slot-leaders-epoch-change
Aug 29, 2025

Conversation

@swarna1101

Fix get_slot_leaders epoch transition failures

Problem

get_slot_leaders RPC was failing with "Invalid slot range: leader schedule for epoch X is unavailable" for approximately 31 slots during every epoch transition. This occurred because:

  • get_slot_leaders only checked leader_schedule_cache.get_epoch_leader_schedule(epoch)
  • During epoch transitions, the new epoch's leader schedule isn't cached until the first slot of that epoch is rooted
  • This created a ~31 slot window where valid requests would fail
  • Caused network spam as clients sent transactions to wrong leaders during transitions

Solution

  • Add a fallback to leader_schedule_utils::leader_schedule() on cache miss
  • Enable on-demand computation when the bank has stake information available
  • Preserve all existing behavior and error handling
  • No performance impact for cached schedules
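
A minimal sketch of the cache-then-fallback flow described in the list above. The `Bank`, `LeaderScheduleCache`, and `LeaderSchedule` types here are simplified, hypothetical stand-ins and not the real agave types; `schedule_for_epoch` is an illustrative helper name, not a function from the PR.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type Epoch = u64;

/// Hypothetical stand-in for a leader schedule: one leader identity per slot.
#[derive(Debug, PartialEq)]
struct LeaderSchedule(Vec<String>);

/// Hypothetical stand-in for a bank: the staked nodes it knows about, per epoch.
struct Bank {
    epoch_stakes: HashMap<Epoch, Vec<String>>,
}

/// Hypothetical stand-in for the leader schedule cache.
struct LeaderScheduleCache {
    schedules: HashMap<Epoch, Arc<LeaderSchedule>>,
}

impl LeaderScheduleCache {
    fn get_epoch_leader_schedule(&self, epoch: Epoch) -> Option<Arc<LeaderSchedule>> {
        self.schedules.get(&epoch).cloned()
    }
}

/// Stand-in for on-demand computation: returns None when the bank
/// has no stake information for the requested epoch.
fn leader_schedule(epoch: Epoch, bank: &Bank) -> Option<LeaderSchedule> {
    bank.epoch_stakes
        .get(&epoch)
        .map(|nodes| LeaderSchedule(nodes.clone()))
}

/// Fast path: cached schedule. Slow path: compute from the bank.
/// Only if both fail is the "unavailable" error returned.
fn schedule_for_epoch(
    cache: &LeaderScheduleCache,
    bank: &Bank,
    epoch: Epoch,
) -> Result<Arc<LeaderSchedule>, String> {
    cache
        .get_epoch_leader_schedule(epoch)
        .or_else(|| leader_schedule(epoch, bank).map(Arc::new))
        .ok_or_else(|| {
            format!(
                "Invalid slot range: leader schedule for epoch {} is unavailable",
                epoch
            )
        })
}
```

With an empty cache, a request for an epoch the bank has stakes for now succeeds, while an epoch unknown to both the cache and the bank still produces the original error.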

Closes #6845. @KirillLykov, could you please take a look?

@mergify mergify Bot requested a review from a team August 28, 2025 07:27
@mergify

mergify Bot commented Aug 28, 2025

If this PR represents a change to the public RPC API:

  1. Make sure it includes a complementary update to rpc-client/ (example)
  2. Open a follow-up PR to update the JavaScript client @solana/kit (example)

Thank you for keeping the RPC clients in sync with the server API @swarna1101.

@KirillLykov KirillLykov added the CI Pull Request is ready to enter CI label Aug 28, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 28, 2025

@KirillLykov KirillLykov left a comment

Looks good to me, just a small suggestion

Comment thread rpc/src/rpc.rs Outdated
Comment on lines +997 to +998
leader_schedule_utils::leader_schedule(epoch, &bank)
.map(std::sync::Arc::new)

Won't this work?

Suggested change
- leader_schedule_utils::leader_schedule(epoch, &bank)
- .map(std::sync::Arc::new)
+ Arc::new(leader_schedule(epoch, &bank))

Author

@swarna1101 swarna1101 Aug 28, 2025

Thanks for the suggestion! You're right, using Arc::new is cleaner.

However, since leader_schedule_utils::leader_schedule() returns Option<LeaderSchedule>, we need .map(Arc::new) to properly transform it to Option<Arc<LeaderSchedule>> rather than wrapping the Option itself in Arc.

made the change
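
A small sketch of the type difference discussed above, using a hypothetical placeholder `LeaderSchedule` and a made-up `compute` function in place of the real `leader_schedule()`:

```rust
use std::sync::Arc;

/// Hypothetical placeholder type, not the real LeaderSchedule.
#[derive(Debug, PartialEq)]
struct LeaderSchedule(Vec<u8>);

/// Stand-in for a function that may or may not produce a schedule.
fn compute(available: bool) -> Option<LeaderSchedule> {
    available.then(|| LeaderSchedule(vec![1, 2, 3]))
}

/// `.map(Arc::new)` wraps the value inside the Option:
/// the result is Option<Arc<LeaderSchedule>>.
fn shared_schedule(available: bool) -> Option<Arc<LeaderSchedule>> {
    compute(available).map(Arc::new)
}

/// `Arc::new(...)` around the call wraps the Option itself:
/// the result is Arc<Option<LeaderSchedule>>, a different type.
fn shared_option(available: bool) -> Arc<Option<LeaderSchedule>> {
    Arc::new(compute(available))
}
```

Only the first form matches a cache lookup that returns `Option<Arc<LeaderSchedule>>`, so the two branches of an `if let ... else` can share one type.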

I don't think you need the leader_schedule_utils:: prefix, since leader_schedule is already imported

Author

done

@KirillLykov

@gregcusack since you've reported this problem in this comment, could you also review this PR?

@gregcusack gregcusack self-requested a review August 28, 2025 13:12
gregcusack previously approved these changes Aug 28, 2025

@gregcusack gregcusack left a comment

lgtm! thank you for debugging this and fixing it!

KirillLykov previously approved these changes Aug 28, 2025
@KirillLykov KirillLykov added the CI Pull Request is ready to enter CI label Aug 28, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 28, 2025
@KirillLykov

Fails CI:

error: unused import: `self`
  --> rpc/src/rpc.rs:47:33
   |
47 |         leader_schedule_utils::{self, leader_schedule},
   |                                 ^^^^
   |
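
For illustration, a self-contained sketch of how this lint arises. The `ledger` module below is a hypothetical stand-in, and the assumption that CI treats warnings as errors is inferred from the `error:` prefix in the output above.

```rust
// Hypothetical module layout standing in for the real crate structure.
mod ledger {
    pub mod leader_schedule_utils {
        /// Placeholder body; the real function computes a schedule.
        pub fn leader_schedule(epoch: u64) -> Option<u64> {
            Some(epoch)
        }
    }
}

// Importing `{self, leader_schedule}` brings both the module name and the
// function into scope. Once every call site uses the bare function name,
// the `self` part is never referenced and rustc reports it as unused:
//
//     use ledger::leader_schedule_utils::{self, leader_schedule};
//
// Importing only the function avoids the lint.
use ledger::leader_schedule_utils::leader_schedule;

/// Calls the function through the bare name, with no module prefix.
fn schedule_marker(epoch: u64) -> Option<u64> {
    leader_schedule(epoch)
}
```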

@swarna1101
Author

Fails CI:

error: unused import: `self`
  --> rpc/src/rpc.rs:47:33
   |
47 |         leader_schedule_utils::{self, leader_schedule},
   |                                 ^^^^
   |

oh!!
fixed

@KirillLykov KirillLykov added the CI Pull Request is ready to enter CI label Aug 28, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 28, 2025
@swarna1101
Author

Fails CI:

error: unused import: `self`
  --> rpc/src/rpc.rs:47:33
   |
47 |         leader_schedule_utils::{self, leader_schedule},
   |                                 ^^^^
   |

oh!! fixed

i see there is one more CI issue, sorry, fixing it

@KirillLykov KirillLykov added the CI Pull Request is ready to enter CI label Aug 28, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 28, 2025
@gregcusack gregcusack added the CI Pull Request is ready to enter CI label Aug 28, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 28, 2025
@KirillLykov KirillLykov added the CI Pull Request is ready to enter CI label Aug 28, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Aug 28, 2025
@codecov-commenter

codecov-commenter commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.0%. Comparing base (7865ba5) to head (de3641b).
⚠️ Report is 2367 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #7765     +/-   ##
=========================================
- Coverage    83.1%    83.0%   -0.1%     
=========================================
  Files         812      812             
  Lines      356963   356991     +28     
=========================================
- Hits       296642   296601     -41     
- Misses      60321    60390     +69     

@KirillLykov KirillLykov merged commit ce1e9b3 into anza-xyz:master Aug 29, 2025
45 checks passed
@mergify

mergify Bot commented Aug 29, 2025

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

mergify Bot pushed a commit that referenced this pull request Aug 29, 2025
…#7765)

get_slot_leaders RPC was failing with "Invalid slot range: leader schedule for epoch X is unavailable" for approximately 31 slots during every epoch transition. This PR fixes it by doing the following:
* Add fallback to leader_schedule_utils::leader_schedule() on cache miss
* Enables on-demand computation when bank has stake information available
* Preserves all existing behavior and error handling
* No performance impact for cached schedules

(cherry picked from commit ce1e9b3)
@mergify

mergify Bot commented Sep 3, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify Bot pushed a commit that referenced this pull request Sep 3, 2025
…#7765)

get_slot_leaders RPC was failing with "Invalid slot range: leader schedule for epoch X is unavailable" for approximately 31 slots during every epoch transition. This PR fixes it by doing the following:
* Add fallback to leader_schedule_utils::leader_schedule() on cache miss
* Enables on-demand computation when bank has stake information available
* Preserves all existing behavior and error handling
* No performance impact for cached schedules

(cherry picked from commit ce1e9b3)
willhickey pushed a commit that referenced this pull request Sep 3, 2025
…#7765)

get_slot_leaders RPC was failing with "Invalid slot range: leader schedule for epoch X is unavailable" for approximately 31 slots during every epoch transition. This PR fixes it by doing the following:
* Add fallback to leader_schedule_utils::leader_schedule() on cache miss
* Enables on-demand computation when bank has stake information available
* Preserves all existing behavior and error handling
* No performance impact for cached schedules

(cherry picked from commit ce1e9b3)
@t-nelson

t-nelson commented Sep 3, 2025

wouldn't the simpler solution have been to simply query the root bank in the first place?

@KirillLykov KirillLykov removed the v2.3 label Sep 4, 2025
@KirillLykov

KirillLykov commented Sep 4, 2025

wouldn't the simpler solution have been to simply query the root bank in the first place?

@swarna1101 could you try creating a PR with the fix as proposed? If it works, we will revert the current PR and create a new one to keep a clear history for backporting.

@swarna1101
Author

swarna1101 commented Sep 4, 2025

wouldn't the simpler solution have been to simply query the root bank in the first place?

thanks @t-nelson for the suggestion.

TL;DR

While using the root bank might seem "simpler," it would break API semantics, degrade the user experience, and not actually solve the core problem any better than the current fix.

The Core Problem We're Solving

The original issue wasn't about which bank to use - it was about the cache-only approach failing during epoch transitions:

// BEFORE (broken):
if let Some(leader_schedule) = cache.get_epoch_leader_schedule(epoch) {
    // Use cached schedule
} else {
    return Err("leader schedule for epoch X is unavailable"); // Failed on every cache miss
}

// AFTER (current fix):
let leader_schedule = if let Some(leader_schedule) = cache.get_epoch_leader_schedule(epoch) {
    Some(leader_schedule)
} else {
    // Fallback: compute on-demand when bank has stake info
    leader_schedule_utils::leader_schedule(epoch, &bank).map(Arc::new)
};

Why "Just Use Root Bank" is Problematic

1. Breaks API Semantics

The get_slot_leaders RPC method accepts a commitment parameter for a reason. Users expect different behavior based on commitment level:

// Current (correct):
let bank = self.bank(commitment); // Respects user's choice

// Proposed alternative:
let bank = self.bank_forks.read().unwrap().root_bank(); // Ignores user intent

Real-world impact:

  • User requests processed commitment → expects latest available data
  • Root bank approach → forces them to use finalized data only
  • This breaks the semantic contract of commitment levels

2. Creates API Inconsistency

Every other RPC method uses commitment-based bank selection:

  • getAccountInfo - uses self.bank(commitment)
  • getBalance - uses self.bank(commitment)
  • getBlockHeight - uses self.bank(commitment)
  • getSlotLeader - uses self.bank(commitment)

Making get_slot_leaders the only method that ignores commitment would be inconsistent and confusing.

3. Reduces Data Freshness

// Example scenario during epoch transition:
// - Root bank: slot 1000 (finalized)
// - Processed bank: slot 1025 (latest)
// - User wants leader info for upcoming slots

// Root bank approach: Limited to slot 1000's view
// Current approach: Can use slot 1025's more recent stake information

4. Doesn't Actually Solve Availability Better

The root bank isn't guaranteed to have leader schedules for future epochs either. The real solution is the fallback computation, which works with any bank that has the necessary stake information.

Why the Current Implementation is Better

1. Preserves User Intent

// Honors commitment levels as designed
let bank = self.bank(commitment);

2. Robust Fallback Strategy

// First try cache (fast path)
if let Some(cached) = self.leader_schedule_cache.get_epoch_leader_schedule(epoch) {
    Some(cached)
} else {
    // Fallback: compute when possible (solves the transition problem)
    leader_schedule_utils::leader_schedule(epoch, &bank).map(Arc::new)
}

3. No Performance Impact

  • Cached schedules: same performance as before
  • Cache misses: now succeed instead of failing

4. Future-Proof Design

The fallback mechanism works regardless of which bank is used, making it adaptable to future changes.

The Fallback Solution

The insight in the current fix is using leader_schedule_utils::leader_schedule() as a fallback. This function:

pub fn leader_schedule(epoch: Epoch, bank: &Bank) -> Option<LeaderSchedule> {
    let use_new_leader_schedule = bank.should_use_vote_keyed_leader_schedule(epoch)?;
    if use_new_leader_schedule {
        bank.epoch_vote_accounts(epoch).map(|vote_accounts_map| {
            // Compute schedule from vote accounts
        })
    } else {
        bank.epoch_staked_nodes(epoch).map(|stakes| {
            // Compute schedule from staked nodes
        })
    }
}

This works with any bank that has the epoch's stake information, whether it's root, processed, or confirmed.

Would you like me to add additional test cases to demonstrate how the current approach handles different commitment levels correctly?

@t-nelson

t-nelson commented Sep 4, 2025

is this llm slop?

@swarna1101
Author

is this llm slop?

Not at all. I did create a test as well:

// Scenario: Processed bank lacks stake info, root bank has it
let processed_bank = MockBank { slot: 95, epoch: 1, has_stake_info: false };
let root_bank = MockBank { slot: 32, epoch: 1, has_stake_info: true };

// Results:
// Current fix: Err("leader schedule unavailable")
// Root-only:   Ok(["leader_1"])

I feel the current approach is better in that it lets users choose their risk/latency preference: fresh data with processed commitment versus conservative data with finalized.

The root-only approach is better in that it always uses the most stable bank with guaranteed complete stake information, avoiding failures during bank transitions.

I'll create a PR with the root-only approach so we can test it and see how it performs compared to the current implementation.
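
The scenario above can be written out as a runnable sketch. `MockBank`, `compute_schedule`, and `slot_leaders` are hypothetical names used only for illustration. Note that the outcome is driven entirely by the `has_stake_info` flag set on each mock, so this illustrates the scenario rather than showing that it occurs with real banks.

```rust
/// Hypothetical mock mirroring the fields used in the scenario.
#[allow(dead_code)] // `slot` is carried only to mirror the scenario.
struct MockBank {
    slot: u64,
    epoch: u64,
    has_stake_info: bool,
}

/// Stand-in for on-demand schedule computation: succeeds only when the
/// bank is flagged as having stake information for the requested epoch.
fn compute_schedule(bank: &MockBank, epoch: u64) -> Option<Vec<String>> {
    if bank.has_stake_info && bank.epoch == epoch {
        Some(vec!["leader_1".to_string()])
    } else {
        None
    }
}

/// Looks up slot leaders against whichever bank the caller selected.
fn slot_leaders(bank: &MockBank, epoch: u64) -> Result<Vec<String>, String> {
    compute_schedule(bank, epoch).ok_or_else(|| "leader schedule unavailable".to_string())
}
```

Passing the processed mock models the commitment-based selection, and passing the root mock models the root-only selection. The two calls differ only in which bank is handed to `slot_leaders`.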

@t-nelson

t-nelson commented Sep 4, 2025

there is no way a human would be that verbose and waste so much time on formatting to miss the point. the user's commitment specification is irrelevant here due to how the system works in reality

@swarna1101
Author

there is no way a human would be that verbose and waste so much time on formatting to miss the point. the user's commitment specification is irrelevant here due to how the system works in reality

Thanks for the feedback. I understand your point. I'll keep future responses more concise and focused on the core issue instead of spending time on formatting.

KirillLykov added a commit that referenced this pull request Sep 8, 2025
#7917)

Revert "fix: enable on-demand leader schedule computation in get_slot_leaders (#7765)"

This reverts commit ce1e9b3.

Development

Successfully merging this pull request may close these issues.

get_slot_leaders RPC call doesn't work after epoch change for ~31 slots

6 participants