feat: efficient node catch-up on subnets with long DSM rounds #8232
Conversation
```rust
    },
)
(None, Some(hash)) => Some((height, hash)),
(None, None) => None,
```
We always end up here for heights between the current DSM height and the latest certified height of the subnet, i.e., certifications for those heights are not available in the pool.
By returning `None` here we are filtering out all heights that don't have a hash, which means they will not be part of the `state_hashes_to_certify` vector. Later on in the function we only validate artifacts for heights that are part of this vector.
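For illustration, a minimal self-contained sketch of that filtering (all types and names here are stand-ins, not the real `ic_types` ones): heights with neither a certification nor a hash fall into the `(None, None)` arm and never enter the vector.

```rust
// Stand-ins for Height and CryptoHashOfPartialState, for illustration only.
type Height = u64;
type Hash = &'static str;

/// Hypothetical sketch: pair each height with (certification?, hash?) and
/// keep only the heights we can actually certify.
fn state_hashes_to_certify(
    heights: Vec<(Height, Option<Hash>, Option<Hash>)>, // (height, certification, hash)
) -> Vec<(Height, Hash)> {
    heights
        .into_iter()
        .filter_map(|(height, certification, hash)| match (certification, hash) {
            (Some(_), _) => None,                       // already certified, nothing to do
            (None, Some(hash)) => Some((height, hash)), // certify this height
            (None, None) => None,                       // no hash available: filtered out
        })
        .collect()
}
```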
Since we don't have a hash for those heights, we could only easily execute this part:
```rust
let certifications = state_hashes_to_certify
    .iter()
    .flat_map(|(height, _)| self.aggregate(certification_pool, *height))
    .collect::<Vec<_>>();
if !certifications.is_empty() {
    self.metrics
        .certifications_aggregated
        .inc_by(certifications.len() as u64);
    trace!(
        &self.log,
        "Aggregated {} threshold-signatures in {:?}",
        certifications.len(),
        start.elapsed()
    );
    return certifications
        .into_iter()
        .map(ChangeAction::AddToValidated)
        .collect();
}
```
but not
```rust
let change_set = self.validate(certification_pool, &state_hashes_to_certify);
```
Is it enough to execute the former, or do we also need to execute the latter? If we also need the latter, I'd refactor the functions `validate_share` and `validate_certification` to take `hash: &Option<CryptoHashOfPartialState>` and skip the hash check if there's no hash available, right?
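A sketch of the suggested change, under the assumption that the hash comparison is the only part that depends on the hash being present (the stand-in type and function name are made up for illustration):

```rust
// Stand-in for ic_types::crypto::CryptoHashOfPartialState.
type CryptoHashOfPartialState = Vec<u8>;

/// Hypothetical core of the refactored check: if no hash is available for
/// the height, skip the comparison instead of rejecting the share.
fn share_hash_acceptable(
    share_hash: &CryptoHashOfPartialState,
    expected_hash: &Option<CryptoHashOfPartialState>,
) -> bool {
    expected_hash
        .as_ref()
        .map_or(true, |expected| expected == share_hash)
}
```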
It looks to me like you have to make the `validate` changes. Otherwise, you would never validate the certification shares, so the other code would also never have enough validated shares to aggregate.
Yes, exactly. I think we briefly touched on that in this comment
It works now that I call `validate` for the corresponding state heights, but then I wonder how `pool.get_finalized_tip().context.certified_height` can advance to begin with. Is it because the consensus pool stores unvalidated blocks (i.e., `pool.get_finalized_tip().context.certified_height` is only an approximation, and not necessarily trustworthy), and only after calling `validate` does the certification pool return the certification as validated?
I'm not sure I 100% understand: as we finalize and execute new blocks, the state heights and hashes are given to the certifier via `list_state_hashes_to_certify` above. For each height/hash, each node then creates a certification share. These are validated until there are enough to aggregate a full certification. The full certification is delivered back to the DSM, which increases `latest_certified_height()`. Whenever a node constructs a new block proposal, it includes the current `latest_certified_height()` in `block.context.certified_height`. So as long as the latest certified height increases, `pool.get_finalized_tip().context.certified_height` will increase accordingly.
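A toy sketch of that last step (the structs are simplified stand-ins for the real consensus types): the block maker stamps the DSM's current latest certified height into the proposal's validation context, which is what later surfaces at the finalized tip.

```rust
// Simplified stand-ins for the consensus block structures.
struct ValidationContext {
    certified_height: u64,
}

struct BlockProposal {
    context: ValidationContext,
}

/// Hypothetical block maker: read the DSM's latest certified height and
/// record it in the new proposal's context.
fn make_proposal(latest_certified_height: u64) -> BlockProposal {
    BlockProposal {
        context: ValidationContext {
            certified_height: latest_certified_height,
        },
    }
}
```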
rs/state_manager/src/lib.rs (Outdated)
```rust
}

let tip_height = self.tip_height.load(Ordering::Relaxed);
let last_certification_height_to_keep = min(last_height_to_keep, Height::new(tip_height));
```
I think this should not be the tip height, but the `latest_certified_height`. We always want to keep the latest height where we have everything (hash tree / certification / state), so that the height where we answer queries doesn't go backwards. Or even worse, we wouldn't want to fall back into a state where we have no certified states at all. Specifically, we want to protect whatever height is returned from `latest_certified_state`.
Furthermore, also in this function, I believe that `self.latest_certified_height` is not updated correctly. It should always be the height of `latest_certified_state`, but here we only check for the presence of a certification (instead of certification + hash tree).
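A sketch of the suggested bound (field and function names here are assumptions, not the actual ones):

```rust
use std::cmp::min;

/// Hypothetical fix: clamp certification pruning to the latest *certified*
/// height instead of the tip height, so the height answering certified
/// queries can never move backwards.
fn last_certification_height_to_keep(
    last_height_to_keep: u64,
    latest_certified_height: u64,
) -> u64 {
    min(last_height_to_keep, latest_certified_height)
}
```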
rs/state_manager/src/lib.rs (Outdated)
```rust
.certifications_metadata
.entry(height)
.or_insert_with(|| {
    Self::compute_certification_metadata(&self.metrics, &self.log, &state)
```
This is what is tripping up `latest_certified_state`. Here we populate `metadata.hash`, and then later consensus might call `deliver_state_certification`. We then have a hash tree and a certification, but no corresponding state in `self.snapshots`.
I think there are two fixes:
1. Here, only populate `metadata.certified_state_hash`, but not `metadata.hash_tree` (sketched below).
2. Populate both, but rewrite `latest_certified_state` so that it doesn't assume that if we have a hash tree and a certification, we also have a state.
Either way, you then also need to fix how we update `self.latest_certified_height` in `deliver_state_certification`. It should only be updated to heights where we have hash tree + certification + state. Whether you pick (1) or (2) also affects how you need to update `self.latest_certified_height` in `remove_states_below`.
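A simplified sketch of option (1) (the struct is a trimmed-down stand-in; the real `CertificationMetadata` has more fields): leaving the hash tree empty keeps the invariant that hash tree + certification implies a state in `self.snapshots`.

```rust
// Hypothetical, trimmed-down version of the real CertificationMetadata.
struct CertificationMetadata {
    certified_state_hash: Vec<u8>,
    hash_tree: Option<Vec<u8>>, // None: no corresponding state in snapshots
}

/// Sketch of fix (1): when we learn a hash without having the state, record
/// the hash but deliberately do not populate the hash tree.
fn metadata_without_state(certified_state_hash: Vec<u8>) -> CertificationMetadata {
    CertificationMetadata {
        certified_state_hash,
        hash_tree: None,
    }
}
```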
rs/state_manager/src/lib.rs (Outdated)
```rust
let latest_subnet_certified_height =
    self.latest_subnet_certified_height.load(Ordering::Relaxed);
if matches!(scope, CertificationScope::Metadata)
```
Nice to have would be some metrics about how often we skip steps due to this new logic: specifically, how often we skip both cloning and hashing, how often we do them anyway due to the `is_multiple_of(10)` rule, and how often we hash due to missing certifications.
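For example, three counters along these lines could cover the cases mentioned (metric and struct names are made up for the sketch, using the `prometheus` crate):

```rust
use prometheus::{IntCounter, Registry};

/// Hypothetical counters for the three cases described above.
struct CertificationSkipMetrics {
    work_skipped: IntCounter,
    forced_by_height_rule: IntCounter,
    hashed_missing_certification: IntCounter,
}

impl CertificationSkipMetrics {
    fn new(registry: &Registry) -> prometheus::Result<Self> {
        let new_counter = |name: &str, help: &str| -> prometheus::Result<IntCounter> {
            let counter = IntCounter::new(name, help)?;
            registry.register(Box::new(counter.clone()))?;
            Ok(counter)
        };
        Ok(Self {
            work_skipped: new_counter(
                "state_manager_certification_work_skipped_total",
                "Times both cloning and hashing were skipped",
            )?,
            forced_by_height_rule: new_counter(
                "state_manager_certification_forced_by_height_rule_total",
                "Times the is_multiple_of(10) rule forced the work anyway",
            )?,
            hashed_missing_certification: new_counter(
                "state_manager_hashed_missing_certification_total",
                "Times we hashed because a certification was missing",
            )?,
        })
    }
}
```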
```rust
states.tip = Some((height, state));
self.tip_height.store(height.get(), Ordering::Relaxed);
return;
}
```
The `is_multiple_of(10)` case is not quite correctly handled:
- If we already have a `CertificationMetadata`, we shouldn't just overwrite it. Instead, we want to preserve the `certification` if it has one; otherwise we rely on consensus being able to serve it to us again (see the sketch below).
- We already check whether we have a `CertificationMetadata`, and assert that the hash is the same. Until now, this could only trigger if you have a state sync ending around the same time as execution finishes. But now, this is how we detect divergences in the `is_multiple_of(10)` case. So we should strengthen it a bit and do the same as we do in `deliver_state_certification` on divergence; namely, there we call `create_diverged_state_marker` to log the divergence on disk.
> Until now, this could only trigger if you have a state sync ending around the same time as execution finishes.
Why don't we log this divergence on disk already?
I think it's just an oversight. But because it was basically dead code it didn't matter.
rs/state_manager/src/lib.rs (Outdated)
```rust
    .sum::<u64>()
}

fn state_manager_for_tests(log: ReplicaLogger) -> (MetricsRegistry, StateManagerImpl) {
```
This should be a `state_manager_test` like this one, and also be in `tests/state_manager.rs` where all these other tests are, instead of here.
> and also be in `tests/state_manager.rs`
If the test is in a different crate, I can't access private fields, though.
> This should be a `state_manager_test`
Unless I'm missing something, my helpers should do exactly the same as the helpers in the integration test crate.
I am talking about the file `rs/state_manager/tests/state_manager.rs`, which is not a different crate. I agree your helpers do the same as the existing ones, which is why the duplication is unnecessary.
WORK IN PROGRESS!