Make sure that blobs downloaded by the client are validated #4082

Merged
afck merged 15 commits into linera-io:main from afck:issue-2351
Jun 11, 2025
Conversation

@afck
Contributor

@afck afck commented Jun 9, 2025

This is a completion of @bart-linera's #4075:

Motivation

#3787 left a potential security issue: when a client was downloading a missing ChainDescription blob, it would just download a blob, but wouldn't make sure that it was legitimately created on another chain.

Proposal

Whenever a ChainDescription is fetched, fetch the certificate for the block that created it and validate it against committees known from the admin chain.
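The key invariant is the ordering: the certificate is checked against a known committee before the blob is persisted. A minimal, self-contained sketch of that ordering is below; all type and function names here are hypothetical simplifications for illustration, not the actual linera-protocol APIs.

```rust
// Hypothetical sketch: a blob is only stored after the certificate for the
// block that created it has been checked against a committee known from the
// admin chain. Types are simplified stand-ins, not linera's real ones.

#[derive(Debug)]
struct Committee(Vec<&'static str>); // names of trusted validators

struct Certificate {
    signer: &'static str, // who signed the creating block (simplified)
    blob_payload: Vec<u8>,
}

fn validate_and_store(
    cert: &Certificate,
    committee: &Committee,
    storage: &mut Vec<Vec<u8>>,
) -> Result<(), String> {
    // 1. Check the certificate against the committee *before* storing.
    if !committee.0.contains(&cert.signer) {
        return Err(format!("unknown signer {}", cert.signer));
    }
    // 2. Only now persist the blob locally.
    storage.push(cert.blob_payload.clone());
    Ok(())
}

fn main() {
    let committee = Committee(vec!["v1", "v2"]);
    let mut storage = Vec::new();
    let good = Certificate { signer: "v1", blob_payload: b"desc".to_vec() };
    let bad = Certificate { signer: "mallory", blob_payload: b"evil".to_vec() };
    assert!(validate_and_store(&good, &committee, &mut storage).is_ok());
    assert!(validate_and_store(&bad, &committee, &mut storage).is_err());
    assert_eq!(storage.len(), 1); // only the validated blob was stored
    println!("ok");
}
```

This is the property the PR restores: an unvalidated blob never reaches local storage.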

Test Plan

The only occurrences of RemoteNode::download_blob[s] are now after process_certificate returned BlobsNotFound, i.e. after checking a certificate's signatures.

CI should catch regressions.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

Links

Notes

The TODO in attempted_changes.rs has been removed after verifying that the safety issue is no longer present there: the committees are taken from a blob, but that blob can only exist on the validator if it has been legitimately created on another chain.

Contributor

@christos-h christos-h left a comment


The code looks good but there are some test failures. Happy to approve once fixed :)

```rust
self.local_node.storage_client().write_blobs(&blobs).await?;
// This should be a single blob: the ChainDescription of the chain we're
// fetching the info for.
assert_eq!(blob_ids.len(), 1);
```
Contributor

Do we want to use ensure! here instead of assert_eq?

Contributor Author

Maybe. It should be unreachable, but I can make it an InternalError.

Contributor Author

Or I'll just remove it. Not worth adding another error variant.
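For context, the difference under discussion is that `assert_eq!` aborts the process on failure, while an `ensure!`-style macro returns an error instead. A generic sketch of that pattern (not linera's actual macro, which would return its own error type):

```rust
// Generic sketch of the ensure!-style pattern discussed above: surface a
// violated invariant as a recoverable error rather than a panic.
macro_rules! ensure {
    ($cond:expr, $err:expr) => {
        if !$cond {
            return Err($err);
        }
    };
}

fn check_single_blob(blob_ids: &[u32]) -> Result<(), String> {
    // assert_eq!(blob_ids.len(), 1) would abort the process on failure;
    // ensure! lets the caller handle the problem instead.
    ensure!(
        blob_ids.len() == 1,
        format!("expected 1 blob, got {}", blob_ids.len())
    );
    Ok(())
}

fn main() {
    assert!(check_single_blob(&[42]).is_ok());
    assert!(check_single_blob(&[1, 2]).is_err());
    println!("ok");
}
```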

```rust
// TODO(#2351): Don't store downloaded blobs without certificate.
let _ = self.local_node.store_blobs(&blobs).await;
result = self.handle_certificate(certificate.clone()).await;
for blob_id in blob_ids {
```
Contributor

This looks trivially parallelizable?
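The suggestion is to issue the per-blob requests concurrently instead of awaiting them one at a time in the loop. The real code is async (the futures would be joined concurrently, e.g. with `try_join_all`); the self-contained sketch below uses threads only to stay dependency-free, and `fetch_blob` is a hypothetical stand-in for the network call.

```rust
use std::thread;

// Stand-in for a per-blob network fetch.
fn fetch_blob(blob_id: u32) -> u32 {
    blob_id * 2
}

fn main() {
    let blob_ids = vec![1, 2, 3];
    // Instead of `for blob_id in blob_ids { fetch(...).await }`, start all
    // fetches at once and then collect the results.
    let handles: Vec<_> = blob_ids
        .into_iter()
        .map(|id| thread::spawn(move || fetch_blob(id)))
        .collect();
    let results: Vec<u32> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(results, vec![2, 4, 6]);
    println!("ok");
}
```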

```rust
) -> Option<BlockHeight> {
let info = self.local_chain_info(chain_id, local_node).await?;
Some(info.next_block_height)
) -> BlockHeight {
```
Contributor

Nice catch.

@afck afck marked this pull request as draft June 9, 2025 14:55
@afck afck marked this pull request as ready for review June 10, 2025 13:09
```rust
Ok(info) => Ok(info),
Err(LocalNodeError::BlobsNotFound(blob_ids)) => {
// Make sure the admin chain is up to date.
self.synchronize_chain_state(self.admin_id).await?;
```
Contributor

This is potentially a performance regression. Is there a "quick" way to verify if we're up-to-date? Can we optimistically try to skip synchronization with admin chain and somehow detect it?

```rust
};
// Recover history from the current validators, according to the admin chain.
// TODO(#2351): make sure that the blob is legitimately created!
self.synchronize_chain_state(self.admin_id).await?;
```
Contributor

And again we're synchronizing the admin chain - is this one of the reasons you want to move everything to a single place so that we have a better control over what has been done or what needs to be done?

Contributor Author

Yes, exactly!
I don't know how to get around these otherwise.

E.g. we shouldn't do this at all if we're in "connected" mode, i.e. a chain listener is running. I'm thinking about writing a design doc for a "local node synchronizer" that does all this properly. I'd rather not try to add more ad-hoc case distinctions here for now.

Contributor

Definitely.

```rust
.await?;

Ok(())
.await
```
Contributor

@deuszx deuszx Jun 11, 2025

Do I read this correctly - we try to download all blobs in parallel and within a "single blob context" we will try to download it from different validators sequentially, by scheduling it to start after an ever-growing timeout?

Would it be more readable if you extract the async move { } closure to a function that tries a new node only if the previous one failed? That way the retries can be "tighter" - i.e. we won't wait a timeout * i * i but can start immediately (if previous node didn't have the blob).

Contributor Author

> only if the previous one failed

But that's what we had before, and what is causing the client performance issues: if the first validator we ask takes a very long time to respond or fail, we just keep waiting.

But you're right that we could try the next one immediately after the earlier ones failed, and skip the timeout.

Anyway, this affects not only this particular place in the code, but all of them. Again: This is why I think we need some sort of local node synchronizer that has all that logic in one place.
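The staggered strategy being discussed can be sketched as follows: every validator's request runs concurrently, but validator i only starts after a delay proportional to i, so a slow first validator cannot block the others, and the first success wins. This is a hypothetical, dependency-free illustration using threads and a channel (the real code is async), with `try_download` as an invented stand-in for the network call.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

const STAGGER: Duration = Duration::from_millis(10);

// Stand-in for a network call: pretend only validator 1 has the blob.
fn try_download(validator: usize, blob_id: u32) -> Option<u32> {
    (validator == 1).then(|| blob_id * 100)
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for validator in 0..3 {
        let tx = tx.clone();
        thread::spawn(move || {
            // Validator i starts after i * STAGGER, but all run concurrently,
            // so one slow validator does not serialize the whole search.
            thread::sleep(STAGGER * validator as u32);
            if let Some(blob) = try_download(validator, 7) {
                let _ = tx.send(blob);
            }
        });
    }
    drop(tx);
    // recv() returns as soon as the first validator succeeds.
    let blob = rx.recv().expect("no validator had the blob");
    assert_eq!(blob, 700);
    println!("ok");
}
```

The alternative raised in the thread, trying the next validator immediately on failure instead of waiting out the stagger, would tighten this further.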

Contributor

> If the first validator we ask takes a very long time to respond or fail, we just keep waiting.

So we have a timeout. We try the next one if either of the two happens:

  1. Validator responds with BlobsNotFound
  2. Request times out

Contributor Author

Let's postpone this. I have so many questions and comments about this, it'll end up more complex than this whole PR.

Contributor

@deuszx deuszx left a comment

The last comment is more like an idea for improvement - not a blocker. You can decide what to do with it.

@afck afck added this pull request to the merge queue Jun 11, 2025
Merged via the queue into linera-io:main with commit c7c08d1 Jun 11, 2025
26 checks passed
@afck afck deleted the issue-2351 branch June 11, 2025 10:22

Development

Successfully merging this pull request may close these issues.

Don't set the committee before checking it in process_confirmed_block.

4 participants