
greedy_scheduler: cache Batches #7193

Merged
alessandrod merged 3 commits into anza-xyz:master from alessandrod:greedy-cache-batches
Aug 5, 2025

Conversation

@alessandrod

This avoids a bunch of allocations/deallocations in the hot path.

This must be the 4th time I've done this change. Finally committing and PR'ing it so I don't have to do it again next month.
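To make the pattern concrete, here is a minimal, self-contained sketch of the idea (simplified stand-in types, not the actual agave `Batches`/scheduler code): the per-thread batch buffers live on the scheduler and are cleared after each pass, so their capacity is reused instead of being allocated and freed on every schedule() call.

struct Batches {
    ids: Vec<Vec<u64>>, // one Vec of transaction ids per worker thread (simplified)
}

impl Batches {
    fn new(num_threads: usize) -> Self {
        Self { ids: vec![Vec::new(); num_threads] }
    }

    // clear() keeps each inner Vec's capacity, so the next pass reuses the memory.
    fn clear(&mut self) {
        for ids in &mut self.ids {
            ids.clear();
        }
    }
}

struct Scheduler {
    // Cached across schedule() calls instead of being rebuilt (and reallocated)
    // inside every call.
    batches: Batches,
}

impl Scheduler {
    fn schedule(&mut self, work: &[u64]) {
        let num_threads = self.batches.ids.len();
        for (i, id) in work.iter().enumerate() {
            self.batches.ids[i % num_threads].push(*id);
        }
        // ... send the filled batches to the workers here ...
        self.batches.clear(); // ready for the next pass, no reallocation
    }
}

fn main() {
    let mut scheduler = Scheduler { batches: Batches::new(4) };
    scheduler.schedule(&[1, 2, 3, 4, 5]);
    scheduler.schedule(&[6, 7, 8]);
}

Clearing instead of dropping trades a small amount of retained memory for less allocator traffic in the hot path.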

@codecov-commenter commented Jul 27, 2025

Codecov Report

❌ Patch coverage is 97.87234% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.8%. Comparing base (d3038f3) to head (e9d3efa).
⚠️ Report is 2659 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7193   +/-   ##
=======================================
  Coverage    82.8%    82.8%           
=======================================
  Files         801      801           
  Lines      363284   363292    +8     
=======================================
+ Hits       300802   300830   +28     
+ Misses      62482    62462   -20     

@alessandrod alessandrod requested a review from Copilot July 27, 2025 18:40


@apfitzge

I'd avoid doing anything like this, because these are just freed by the same thread that allocated them, i.e. the scheduler. jemalloc should (I think) just push them back into its thread-local cache.
Is this not the case?

@apfitzge apfitzge self-requested a review July 28, 2025 13:35
    self.working_account_set.clear();
    // Use zero here to avoid allocating since we are done with `Batches`.
-   num_sent += self.common.send_batches(&mut batches, 0)?;
+   num_sent += self.common.send_batches(&mut self.batches, 0)?;

note to self: except in case of an early exit (there shouldn't be any), this guarantees the batches are empty by the end of each schedule call.

common: SchedulingCommon<Tx>,
working_account_set: ReadWriteAccountSet,
unschedulables: Vec<TransactionPriorityId>,
batches: Batches<Tx>,

If allocation/deallocation of batches is an issue, we could put them in the common so that the prio_graph variant also gets the benefit.
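For illustration, a rough sketch of the layout this suggests (field names follow the PR, everything else is simplified): the cached `Batches` moves into `SchedulingCommon`, so both scheduler variants reuse the same buffers.

struct Batches<Tx> {
    transactions: Vec<Vec<Tx>>, // one batch per worker thread (simplified)
}

struct SchedulingCommon<Tx> {
    batches: Batches<Tx>, // allocated once, reused on every scheduling pass
}

struct GreedyScheduler<Tx> {
    common: SchedulingCommon<Tx>,
}

struct PrioGraphScheduler<Tx> {
    common: SchedulingCommon<Tx>, // gets the same cached batches for free
}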

@alessandrod (Author)

yeah happy to move it there

@alessandrod (Author)

I've done this now. There are a couple of early returns in schedule(), but SchedulingError is unrecoverable (the banking thread dies), so it's ok

@alessandrod (Author)

I can't read, it doesn't die

@alessandrod (Author)

Code-wise this looks bad, but in practice we only error if the workers get disconnected, which never happens unless the validator is in some kind of hosed state anyway.

}
}

pub fn clear(&mut self) {

I'm hesitant to have a `clear` on this.

We should never have batches that last longer than a schedule call. If we do, then it's a bug.

Because at the end of each schedule call we send_batches, which sends out all non-empty batches, right?

@alessandrod (Author)

I don't know, I did a mechanical change: I saw it in profiles, and before this change it was dropping/reallocating; this is functionally equivalent to doing that, modulo the churn.

I see your point, in which case we should assert that it's cleared by the time we return.

Comment on lines +48 to +57
for ids in &mut self.ids {
    ids.clear();
}
for transactions in &mut self.transactions {
    transactions.clear();
}
for max_ages in &mut self.max_ages {
    max_ages.clear();
}
for total_cus in &mut self.total_cus {
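The hunk above is cut off by the diff view. As a hedged sketch, the complete method plausibly looks like the following; the `total_cus` reset is an assumption inferred from the field name, since that loop body is not shown here.

pub fn clear(&mut self) {
    for ids in &mut self.ids {
        ids.clear();
    }
    for transactions in &mut self.transactions {
        transactions.clear();
    }
    for max_ages in &mut self.max_ages {
        max_ages.clear();
    }
    // Assumed: total_cus is a per-thread counter, so it is reset rather than cleared.
    for total_cus in &mut self.total_cus {
        *total_cus = 0;
    }
}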

please ignore co-pilot here 😆

@alessandrod

> I'd avoid doing anything like this, because these are just freed by the same thread that allocated them, i.e. the scheduler. jemalloc should (I think) just push them back into its thread-local cache. Is this not the case?

Yeah, but the problem is doing all these allocations thousands of times per slot; this currently takes 1/3 of process transactions:

[Screenshot: profile taken 2025-07-28 showing the allocation overhead]

@alessandrod alessandrod force-pushed the greedy-cache-batches branch from eab774e to 76100be Compare August 3, 2025 14:18
@alessandrod alessandrod requested a review from Copilot August 3, 2025 14:43

Copilot AI left a comment

Pull Request Overview

This PR caches Batches instances within the SchedulingCommon struct to avoid repeated allocations and deallocations in the hot path of transaction scheduling. The change moves batch creation from being instantiated per scheduling pass to being reused across scheduling operations.

  • Moves Batches instance from local variable to SchedulingCommon field
  • Updates constructor to accept target_num_transactions_per_batch parameter
  • Removes batches parameter from batch sending methods since it's now internal state

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • scheduler_common.rs: Adds batches field to SchedulingCommon, updates constructor and batch methods
  • prio_graph_scheduler.rs: Removes local Batches creation, uses cached instance from SchedulingCommon
  • greedy_scheduler.rs: Removes local Batches creation, uses cached instance from SchedulingCommon

total_cus: vec![0; num_threads],
}
}


Copilot AI Aug 3, 2025


The is_empty method lacks documentation explaining its purpose and when it should be called. Consider adding a docstring to explain that this method is used for debug assertions to verify batches are properly cleared after scheduling.

Suggested change
/// Returns true if all batches are empty and no compute units are allocated.
///
/// This method is intended for use in debug assertions to verify that
/// batches are properly cleared after scheduling. It should be called
/// after batch processing to ensure no residual state remains.
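For context, a sketch of what the `is_empty` being documented plausibly checks (the actual body isn't shown in this thread, so the exact fields and conditions are assumptions):

/// Returns true if all batches are empty and no compute units are allocated.
pub fn is_empty(&self) -> bool {
    self.ids.iter().all(|ids| ids.is_empty())
        && self.transactions.iter().all(|txs| txs.is_empty())
        && self.max_ages.iter().all(|ages| ages.is_empty())
        && self.total_cus.iter().all(|cus| *cus == 0)
}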

Outdated comment thread on core/src/banking_stage/transaction_scheduler/greedy_scheduler.rs
@apfitzge previously approved these changes Aug 4, 2025

@apfitzge left a comment

LGTM. Left a small nit and a potential for more clean-up.


debug_assert!(
    self.common.batches.is_empty(),
    "batches must be empty after scheduling"

nit: "batches must start empty for scheduling".
It's weird to say "after scheduling" when this check is before it?

@alessandrod (Author)

heh, I initially put the check after scheduling, but then saw the early returns and moved it. Fixed!

@@ -27,17 +27,31 @@ pub struct Batches<Tx> {

impl<Tx> Batches<Tx> {
    pub fn new(num_threads: usize, target_num_transactions_per_batch: usize) -> Self {

Since we now store Batches in common, we don't need the 0 target size to avoid allocation on the final send, so target_num_transactions_per_batch is constant.

We could store it in Batches itself now, making all the calls to send_* a bit cleaner.
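A sketch of the follow-up being suggested (simplified, with a hypothetical field layout, not the exact PR code): `Batches` carries the constant target size itself, so the `send_*` helpers no longer need it passed in.

pub struct Batches<Tx> {
    target_num_transactions_per_batch: usize, // now owned by Batches itself
    transactions: Vec<Vec<Tx>>,               // one batch per worker thread (simplified)
}

impl<Tx> Batches<Tx> {
    pub fn new(num_threads: usize, target_num_transactions_per_batch: usize) -> Self {
        Self {
            target_num_transactions_per_batch,
            transactions: (0..num_threads)
                .map(|_| Vec::with_capacity(target_num_transactions_per_batch))
                .collect(),
        }
    }

    // Callers (e.g. the send_* helpers) can read the target size from here
    // instead of threading it through every call.
    pub fn target_num_transactions_per_batch(&self) -> usize {
        self.target_num_transactions_per_batch
    }
}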

@alessandrod (Author)

done!

@alessandrod alessandrod merged commit e2ca749 into anza-xyz:master Aug 5, 2025
41 checks passed