Skip to content

Use events for adding and removing committees.#3432

Merged
afck merged 9 commits intolinera-io:mainfrom
afck:epoch-events
Mar 10, 2025
Merged

Use events for adding and removing committees.#3432
afck merged 9 commits intolinera-io:mainfrom
afck:epoch-events

Conversation

@afck
Copy link
Contributor

@afck afck commented Feb 27, 2025

Motivation

To make epoch changes scalable, we want to use events instead of messages (#365).

Proposal

Remove the committee messages and add add ProcessNewEpoch and ProcessRemovedEpoch system operations instead. The committee change operations on the admin change create events now, and the two new operations consume those events, i.e. verify that they exist, change the epoch and committees on the local chain, and create an OracleResponse::Event entry in the execution outcome.

Clients get notified about new epochs because they always follow/synchronize the admin chain and thus store the epoch change events locally. They will then run process_inbox which, in addition to incoming messages, looks for epoch change events, and includes the new operations in the proposal accordingly.

Test Plan

Several existing tests verify that reconfigurations are processed correctly. These check the new mechanism now instead of the message-based one.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

Links

@afck afck requested review from bart-linera, deuszx and ma2bd February 27, 2025 19:40
@afck afck marked this pull request as ready for review February 27, 2025 19:40
Copy link
Contributor

@deuszx deuszx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass through the code. It looks good overall but I had some more general questions. I think the general approach is slightly hacky (i.e. looping through the storage in search for interesting events). It could be improved if we had an option to subscribe for specific event streams in the client.

I am not giving an approval yet - I want to see the answers and then will re-review it.

@afck
Copy link
Contributor Author

afck commented Mar 3, 2025

Closing this in favor of #3453.
(The implementation will probably use some of this PR's code, but also change a lot.)

@afck afck closed this Mar 3, 2025
@afck
Copy link
Contributor Author

afck commented Mar 6, 2025

Reopening as a draft. I will mark it ready for review when #3453 is fully implemented.

@afck afck reopened this Mar 6, 2025
@afck afck marked this pull request as draft March 6, 2025 10:35
@afck afck force-pushed the epoch-events branch 2 times, most recently from aa3299f to 016c387 Compare March 6, 2025 10:41
afck added a commit that referenced this pull request Mar 6, 2025
…nsistencies. (#3489)

## Motivation

Working on #3432 I
found a few issues that can be fixed separately.

## Proposal

* Move the operation inspection code from the two `published_blob_ids`
implementations to `Operation` and deduplicate it.
* Use the `select!` macro in the client listener. This will make it
easier to add another stream, to listen for admin chain events.
* Fix a typo where a blob ID was called `chain_id`.
* Add a check that events don't get overwritten. Testing this will be
complicated, and possibly require more SDK features:
#3488
* Treat `stage_new_committee` like other client operations. The issue
with conflicting blocks needs to be addressed for all operations anyway,
and admin operations are always on a single-user chain in our case.
#3172
* Use `stage_new_committee` in the node service` instead of manually
creating the operation. There is no reason to have different semantics
here in CLI vs. the node service API, and to have the user input the
current epoch manually.

## Test Plan

CI should catch regressions. I created #3488 for the events check.

## Release Plan

- Nothing to do / These changes follow the usual release cycle.

## Links

- Found while working on
#3432.
- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
/// The block's validation round.
Round(Option<u32>),
/// An event was read.
Event(EventId, Vec<u8>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so if we don't include the hash of the certificate that created the event, this means that a user event (stream) can become deprecated -- which may be a good thing after all (except for fast blocks).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure: This oracle response does re-certify the event itself (including its content), just not the certificate that created it.

let chain_ids = {
let guard = context.lock().await;
let mut chain_ids = BTreeSet::from_iter(guard.wallet().chain_ids());
chain_ids.insert(guard.wallet().genesis_admin_chain());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this code belong here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could add a getter to the wallet that returns the wallet's chains plus the admin chain.

{
let mut guard = listening.lock().await;
if guard.contains(&chain_id) {
if !listening.lock().await.insert(chain_id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

let admin_listener: NotificationStream = if client.admin_id() == chain_id {
Box::pin(stream::pending())
} else {
Box::pin(client.subscribe_to(client.admin_id()).await?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this creating a distinct admin chain listener for all mon-admin chains?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just a local entry in the notifier.

#[cfg(with_metrics)]
let _latency = metrics::PROCESS_INBOX_WITHOUT_PREPARE_LATENCY.measure_latency();

self.synchronize_chain_state(self.admin_id).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a bit expensive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I need to revisit whether it's actually necessary, once we've got all the tests working.

Copy link
Contributor Author

@afck afck Mar 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the only way to learn about admin chain events now. It doesn't send us a message anymore.

We could make it a separate command, but I'd rather not: It means that more users won't end up migrating to new epochs.

(Edit: Actually, maybe it is redundant, since we're already doing this in synchronize_from_validators. Will remove if the tests pass without it.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 34d744e.

@afck afck force-pushed the epoch-events branch 3 times, most recently from dcf3eee to fbc6f93 Compare March 9, 2025 12:36
@afck afck marked this pull request as ready for review March 9, 2025 13:11
#[error(transparent)]
BcsError(#[from] bcs::Error),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BcsError could be put in the ViewError. Non blocker of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure: I've also heard the argument that we should create errors only in their own module.

bcs::to_bytes(&blob_hash)?,
);
}
AdminOperation::RemoveCommittee { epoch } => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little strange to have such a formulation.
I would expect RemoveCommittee { committee }.
Then maybe rename RemoveFromCommittee.
Not blocker of course.

Copy link
Contributor Author

@afck afck Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it should probably be called RevokeEpoch? But then I'd rename the other one also from CreateCommittee to CreateEpoch.

Not sure what the best names are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #3516 for this.

}

/// Creates a new committee [`BlobContent`] from the provided serialized committee.
pub fn new_committee(committee: impl Into<Box<[u8]>>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API should be more type-safe. Right now it accepts any bytes blindly a s Committee which can led to problems. Why not pass the actual committee instead and serialise it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #3520 for this.

/// A blob containing an application description.
ApplicationDescription,
/// A blob containing a committee of validators.
Committee,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're adding more and more here I think we should have a discussion and some design doc (agreement) what can/should become a BlobType.

Comment on lines +3056 to +3087
let (mut min_epoch, mut next_epoch) = {
let query = ChainInfoQuery::new(self.chain_id).with_committees();
let info = *self
.client
.local_node
.handle_chain_info_query(query)
.await?
.info;
let committees = info
.requested_committees
.ok_or(LocalNodeError::InvalidChainInfoResponse)?;
let min_epoch = *committees.keys().next().unwrap_or(&Epoch::ZERO);
let epoch = info.epoch.ok_or(LocalNodeError::InvalidChainInfoResponse)?;
(min_epoch, epoch.try_add_one()?)
};
let mut epoch_change_ops = Vec::new();
while self.has_admin_event(EPOCH_STREAM_NAME, next_epoch).await? {
epoch_change_ops.push(Operation::System(SystemOperation::ProcessNewEpoch(
next_epoch,
)));
next_epoch.try_add_assign_one()?;
}
while self
.has_admin_event(REMOVED_EPOCH_STREAM_NAME, min_epoch)
.await?
{
epoch_change_ops.push(Operation::System(SystemOperation::ProcessRemovedEpoch(
min_epoch,
)));
min_epoch.try_add_assign_one()?;
}
let mut epoch_change_ops = epoch_change_ops.into_iter();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract this to a helper method? Sth like collect_committee_changes ?

));
let blob_hash = committee_blob.id().hash;
self.storage
.write_blob(&committee_blob)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think changing the resource policy should not require re-publishing a new committee. There should be a separate mechanism for that. If we change the cost of a Wasm multiplication will we also publish a new committee? It should be stored somewhere in the "system storage" and its updates should require quorum of validators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it also needs to be applied separately to each chain, just like committee changes. It makes sense to me to treat all changes with that property as one thing. Anyway, this wasn't changed in this PR, so it's probably a separate discussion.

stream_id: StreamId::system(EPOCH_STREAM_NAME),
key: bcs::to_bytes(&epoch)?,
};
let bytes = match txn_tracker.next_replayed_oracle_response()? {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern of checking for oracle response mismatches is now repeated in a few different places, maybe it could somehow be extracted into a method on TransactionTracker? (Although it might not be so simple, because one of the parameters would have to be a pattern... it would need some thought.)

Anyway, not a blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good point: We could pass in the closure that extracts the data by matching the pattern.

What's probably less easy to do is to also pass in the closure that uses the actual oracle, since that's going to be async and often borrow a lot of references.

But yes, I'd prefer to do that in a separate PR.

I created #3519.

Copy link
Contributor

@deuszx deuszx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@afck afck merged commit edc6de0 into linera-io:main Mar 10, 2025
25 checks passed
@afck afck deleted the epoch-events branch March 10, 2025 12:18
jvff added a commit that referenced this pull request Mar 11, 2025
…3540)

## Motivation

Epoch changes were recently changed to use events (#3432), but the
`TestValidator::change_resource_control_policy` was still trying to
receive epoch change system messages instead of creating an operation to
process the new epoch.

## Proposal

Use the new `SystemOperation::ProcessNewEpoch` to update the chains to
use the new epoch, and skip updating the admin chain because it fails to
try to process the same epoch twice.

## Test Plan

Tested with a rebased version of the tests in #3509. The tests stop
working after #3432 but pass again after this PR.

## Release Plan

- These changes follow the usual release cycle, because it fixes an
issue with SDK that is not present in any released version.

## Links

- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
afck added a commit that referenced this pull request Mar 24, 2025
## Motivation

As of #3432 we are not
using system channels anymore.

## Proposal

Remove the `SystemChannel` type and any system channel-specific logic,
in particular the "off-chain subscription handling".

## Test Plan

The `social` example tests make sure that _user_ pub-sub channels still
work.

## Release Plan

- Nothing to do / These changes follow the usual release cycle.

## Links

- [reviewer
checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants