
Conversation

@mraszyk (Contributor) commented Jan 6, 2026

WORK IN PROGRESS!

[Screenshot from 2026-01-09 23-09-51]

@github-actions bot added the feat label on Jan 6, 2026

Review thread on:

    },
    )
    (None, Some(hash)) => Some((height, hash)),
    (None, None) => None,

@mraszyk (Contributor, Author):

We always end up here for heights between the current DSM height and the latest certified height of the subnet, i.e., certifications for those heights are not available in the pool.

Contributor:

By returning None here we filter out all heights that don't have a hash, which means they will not be part of the state_hashes_to_certify vector. Later on in the function we only validate artifacts for heights that are part of this vector.
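
A minimal sketch of that filtering (the helpers already_certified and hash_for, the Option<()> stand-in for an existing certification, and the first match arm are invented for illustration; only the two arms quoted above come from the actual code):

    use ic_types::{crypto::CryptoHashOfPartialState, Height};

    fn state_hashes_to_certify(
        heights: impl IntoIterator<Item = Height>,
        already_certified: impl Fn(Height) -> Option<()>,
        hash_for: impl Fn(Height) -> Option<CryptoHashOfPartialState>,
    ) -> Vec<(Height, CryptoHashOfPartialState)> {
        heights
            .into_iter()
            .filter_map(|height| match (already_certified(height), hash_for(height)) {
                // Already certified: nothing left to do for this height.
                (Some(_), _) => None,
                // No certification yet and the hash is known: certify it.
                (None, Some(hash)) => Some((height, hash)),
                // No hash available yet (the case discussed above): the height
                // is dropped and is neither validated nor aggregated later.
                (None, None) => None,
            })
            .collect()
    }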

@mraszyk (Contributor, Author):

Since we don't have a hash for those heights, we could only easily execute this part:

        let certifications = state_hashes_to_certify
            .iter()
            .flat_map(|(height, _)| self.aggregate(certification_pool, *height))
            .collect::<Vec<_>>();
    
        if !certifications.is_empty() {
            self.metrics
                .certifications_aggregated
                .inc_by(certifications.len() as u64);
            trace!(
                &self.log,
                "Aggregated {} threshold-signatures in {:?}",
                certifications.len(),
                start.elapsed()
            );
            return certifications
                .into_iter()
                .map(ChangeAction::AddToValidated)
                .collect();
        }

but not

        let change_set = self.validate(certification_pool, &state_hashes_to_certify);

Is it enough to execute the former or do we also need to execute the latter? If we also need to execute the latter, I'd refactor the functions validate_share and validate_certification to take hash: &Option<CryptoHashOfPartialState> and skip the hash check if there's no hash available, right?
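
A minimal sketch of the hash handling such a refactor would need (the real validate_share has more parameters and a different return type; share_hash_acceptable is a hypothetical helper, and only the Option handling is the point):

    use ic_types::crypto::CryptoHashOfPartialState;

    /// Returns true if a share's hash is acceptable given our (possibly
    /// missing) locally computed hash.
    fn share_hash_acceptable(
        share_hash: &CryptoHashOfPartialState,
        local_hash: &Option<CryptoHashOfPartialState>,
    ) -> bool {
        match local_hash {
            // We know our own hash for this height: the share must match it.
            Some(expected) => share_hash == expected,
            // No local hash available (height between the DSM height and the
            // subnet's certified height): skip the comparison and rely on the
            // signature check alone.
            None => true,
        }
    }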

Contributor:

It looks to me like you have to do the validate changes. Otherwise you would never validate the certification shares, so the other code would also never have enough validated shares to aggregate.

Contributor:

Yes, exactly; I think we briefly touched on that in this comment.

@mraszyk (Contributor, Author):

It works now that I call validate for the corresponding state heights, but then I wonder how pool.get_finalized_tip().context.certified_height can advance in the first place. Is it because the consensus pool stores unvalidated blocks (i.e., pool.get_finalized_tip().context.certified_height is only an approximation and not necessarily trustworthy), and only after calling validate does the certification pool return the certification as validated?

Contributor:

I'm not sure I 100% understand. As we finalize and execute new blocks, the state heights and hashes are given to the certifier via list_state_hashes_to_certify above. For each height/hash, each node then creates a certification share. These are validated until there are enough to aggregate a full certification. The full certification is delivered back to the DSM, which increases latest_certified_height(). Whenever a node constructs a new block proposal, it includes the current latest_certified_height() in block.context.certified_height. So as long as the latest certified height increases, pool.get_finalized_tip().context.certified_height will increase accordingly.
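
Sketched as a rough pseudo-flow (a restatement of the above; the arrows are illustrative only, not real code):

    // execution      --(height, hash)-->           list_state_hashes_to_certify()
    // each node      --signs-->                    a CertificationShare per (height, hash)
    // certifier      --validate + aggregate-->     a full Certification
    // certifier      --deliver_state_certification()--> state manager (DSM)
    // state manager: latest_certified_height() advances
    // consensus:     the next block proposal sets block.context.certified_height
    //                to the current latest_certified_height()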

@mraszyk: (this comment was marked as resolved)

Review thread on:

    }

    let tip_height = self.tip_height.load(Ordering::Relaxed);
    let last_certification_height_to_keep = min(last_height_to_keep, Height::new(tip_height));

Contributor:

I think this should not be the tip height but the latest_certified_height. We always want to keep the latest height where we have everything (hash tree/certification/state), so that the height at which we answer queries doesn't go backwards; even worse, we could end up with no certified states at all. Specifically, we want to protect whatever height is returned from latest_certified_state.

Furthermore, also in this function, I believe that self.latest_certified_height is not updated correctly. It should always be the height of latest_certified_state, but here we only check for the presence of a certification (instead of certification + hash tree).
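
A minimal sketch of the first point, i.e., using the latest fully certified height as the pruning floor (assuming a latest_certified_height: AtomicU64 field alongside tip_height; the exact field layout is an assumption):

    // Prune no higher than the latest height for which we have
    // hash tree + certification + state, so latest_certified_state
    // never goes backwards (and never disappears entirely).
    let latest_certified_height = self.latest_certified_height.load(Ordering::Relaxed);
    let last_certification_height_to_keep = min(
        last_height_to_keep,
        Height::new(latest_certified_height),
    );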

Review thread on:

    .certifications_metadata
    .entry(height)
    .or_insert_with(|| {
        Self::compute_certification_metadata(&self.metrics, &self.log, &state)

Contributor:

This is what is tripping up latest_certified_state. Here we populate metadata.hash, and then later consensus might call deliver_state_certification. We then have a hash tree and a certification, but no corresponding state in self.snapshots.

I think there are two fixes:

  1. Here, only populate metadata.certified_state_hash, but not metadata.hash_tree.
  2. Populate both, but rewrite latest_certified_state so that it doesn't assume that if we have hash tree and certification, we also have a state.

Either way, you then also need to fix how we update self.latest_certified_height in deliver_state_certification. It should only be updated to heights where we have hash tree+certification+state. Whether you pick (1) or (2) also affects how you need to update self.latest_certified_height in remove_states_below.
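
A minimal sketch of option (2), i.e., only treating a height as certified when hash tree, certification, and state snapshot are all present (the SharedState field names and types here are assumptions based on the discussion above):

    fn height_of_latest_certified_state(states: &SharedState) -> Option<Height> {
        states
            .certifications_metadata
            .iter()
            .rev()
            .find(|(height, metadata)| {
                // A height only counts as certified if all three pieces exist.
                metadata.hash_tree.is_some()
                    && metadata.certification.is_some()
                    && states.snapshots.iter().any(|snapshot| snapshot.height == **height)
            })
            .map(|(height, _)| *height)
    }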


Review thread on:

    let latest_subnet_certified_height =
        self.latest_subnet_certified_height.load(Ordering::Relaxed);
    if matches!(scope, CertificationScope::Metadata)

Contributor:

Nice to have would be some metrics about how often we skip steps due to this new logic. Specifically: how often we skip both cloning and hashing, how often we do it anyway due to the is_multiple_of(10) rule, and how often we hash due to missing certifications.
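
A sketch of what those counters could look like (the metric names are invented, and int_counter as the MetricsRegistry helper is an assumption):

    let commits_skipped_clone_and_hash = metrics_registry.int_counter(
        "state_manager_commits_skipped_clone_and_hash_total",
        "Commits where both cloning and hashing of the state were skipped.",
    );
    let commits_hashed_multiple_of_10 = metrics_registry.int_counter(
        "state_manager_commits_hashed_multiple_of_10_total",
        "Commits hashed anyway because of the is_multiple_of(10) rule.",
    );
    let commits_hashed_missing_certification = metrics_registry.int_counter(
        "state_manager_commits_hashed_missing_certification_total",
        "Commits hashed because certifications were still missing.",
    );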

Review thread on:

    states.tip = Some((height, state));
    self.tip_height.store(height.get(), Ordering::Relaxed);
    return;
    }

Contributor:

The is_multiple_of(10) case is not quite correctly handled:

  1. If we already have a CertificationMetadata we shouldn't just overwrite it. Instead, we want to preserve the certification if it has one. Otherwise we rely on consensus to be able to serve it to us again.

  2. We already check whether a CertificationMetadata exists and assert that its hash is the same. Until now, this could only trigger if you have a state sync ending around the same time as execution finishes. But now, this is how we detect divergences in the is_multiple_of(10) case, so we should strengthen it a bit and do the same as we do in deliver_state_certification on divergence; namely, there we call create_diverged_state_marker to log the divergence on disk (see the sketch below).
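
A minimal sketch of both points combined (the entry handling, field names, and the fatal! call are assumptions; create_diverged_state_marker is the helper mentioned above):

    use std::collections::btree_map::Entry;

    match states.certifications_metadata.entry(height) {
        Entry::Occupied(existing) => {
            // (2) Same hash check as before, but on divergence also leave a
            // marker on disk, like deliver_state_certification does.
            if existing.get().certified_state_hash != metadata.certified_state_hash {
                self.create_diverged_state_marker(height);
                fatal!(self.log, "Diverged state detected at height {}", height);
            }
            // (1) Keep the existing entry so an already-delivered
            // certification is not thrown away.
        }
        Entry::Vacant(slot) => {
            slot.insert(metadata);
        }
    }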

@mraszyk (Contributor, Author):

> Until now, this could only trigger if you have a state sync ending around the same time as execution finishes.

Why don't we log this divergence on disk already?

Contributor:

I think it's just an oversight, but because it was basically dead code it didn't matter.

Review thread on:

        .sum::<u64>()
    }

    fn state_manager_for_tests(log: ReplicaLogger) -> (MetricsRegistry, StateManagerImpl) {

Contributor:

This should be a state_manager_test like this one, and it should also live in tests/state_manager.rs, where all these other tests are, instead of here.
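
For reference, a sketch of what that could look like in tests/state_manager.rs (the state_manager_test helper signature and the test body are assumptions):

    #[test]
    fn certified_height_does_not_regress_when_pruning() {
        state_manager_test(|_metrics, state_manager| {
            // Exercise commit_and_certify / remove_states_below here and assert
            // that the latest certified height never goes backwards.
        });
    }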

@mraszyk (Contributor, Author):

> and also be in tests/state_manager.rs

If the test is in a different crate, then I can't access private fields, though.

@mraszyk (Contributor, Author):

> This should be a state_manager_test

Unless I'm missing something, my helpers should do exactly the same as the helpers in the integration test crate.

Contributor:

I am talking about the file rs/state_manager/tests/state_manager.rs, which is not a different crate. I agree your helpers do the same as the existing ones, which is why the duplication is unnecessary.
