Make sure that blobs downloaded by the client are validated #4082
afck merged 15 commits into linera-io:main from
Conversation
christos-h
left a comment
The code looks good but there are some test failures. Happy to approve once fixed :)
linera-core/src/client/mod.rs
Outdated
```rust
self.local_node.storage_client().write_blobs(&blobs).await?;
// This should be a single blob: the ChainDescription of the chain we're
// fetching the info for.
assert_eq!(blob_ids.len(), 1);
```
Do we want to use `ensure!` here instead of `assert_eq!`?
Maybe. It should be unreachable, but I can make it an InternalError.
Or I'll just remove it. Not worth adding another error variant.
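The trade-off under discussion can be shown in a minimal, self-contained sketch. The `NodeError` type and its `InternalError` variant are illustrative stand-ins here, not the crate's actual error types:

```rust
// Hedged sketch: `NodeError::InternalError` is a hypothetical stand-in for
// the kind of error variant discussed above, not the crate's real type.
#[derive(Debug, PartialEq)]
enum NodeError {
    InternalError(String),
}

// A minimal `ensure!` in the spirit of the one mentioned in the review:
// it returns an error the caller can handle instead of panicking.
macro_rules! ensure {
    ($cond:expr, $err:expr) => {
        if !$cond {
            return Err($err);
        }
    };
}

// With `ensure!`, the "single ChainDescription blob" invariant becomes a
// recoverable error rather than an `assert_eq!` panic.
fn check_single_blob(blob_ids: &[u64]) -> Result<(), NodeError> {
    ensure!(
        blob_ids.len() == 1,
        NodeError::InternalError(format!("expected 1 blob, got {}", blob_ids.len()))
    );
    Ok(())
}
```

The cost of the `ensure!` route, as noted above, is an extra error variant for a case that should be unreachable.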
linera-core/src/client/mod.rs
Outdated
```rust
// TODO(#2351): Don't store downloaded blobs without certificate.
let _ = self.local_node.store_blobs(&blobs).await;
result = self.handle_certificate(certificate.clone()).await;
for blob_id in blob_ids {
```
This looks trivially parallelizable?
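The parallelization suggested here can be sketched with a thread-based stand-in (the real code would await futures, e.g. via `futures::future::try_join_all`; `download_blob` is a hypothetical placeholder):

```rust
use std::thread;

// Hypothetical stand-in for fetching one blob from a validator; the real
// code awaits async downloads instead of spawning threads.
fn download_blob(blob_id: u64) -> Result<Vec<u8>, String> {
    Ok(vec![blob_id as u8])
}

// Run the per-blob downloads concurrently instead of one after another,
// collecting results in the original order.
fn download_all(blob_ids: &[u64]) -> Result<Vec<Vec<u8>>, String> {
    let handles: Vec<_> = blob_ids
        .iter()
        .copied()
        .map(|id| thread::spawn(move || download_blob(id)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("download thread panicked"))
        .collect()
}
```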
```rust
) -> Option<BlockHeight> {
    let info = self.local_chain_info(chain_id, local_node).await?;
    Some(info.next_block_height)
) -> BlockHeight {
```
```rust
Ok(info) => Ok(info),
Err(LocalNodeError::BlobsNotFound(blob_ids)) => {
    // Make sure the admin chain is up to date.
    self.synchronize_chain_state(self.admin_id).await?;
```
This is potentially a performance regression. Is there a "quick" way to verify whether we're up to date? Can we optimistically try to skip synchronization with the admin chain and somehow detect it?
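One possible "quick" check along these lines (entirely hypothetical, not what the PR implements): remember when the admin chain was last synchronized and skip a fresh sync if a recent one already happened:

```rust
use std::time::{Duration, Instant};

// Hypothetical optimization sketch: skip a full admin-chain sync if we
// synchronized recently. Names and the age-based policy are invented here.
struct AdminSyncState {
    last_sync: Option<Instant>,
    max_age: Duration,
}

impl AdminSyncState {
    // Sync only if we never synced, or the last sync is too old.
    fn needs_sync(&self) -> bool {
        match self.last_sync {
            Some(t) => t.elapsed() > self.max_age,
            None => true,
        }
    }

    fn mark_synced(&mut self) {
        self.last_sync = Some(Instant::now());
    }
}
```

A staleness window like this trades freshness for latency, which is exactly the kind of policy decision a central synchronizer (discussed below in this thread) could own.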
```rust
};
// Recover history from the current validators, according to the admin chain.
// TODO(#2351): make sure that the blob is legitimately created!
self.synchronize_chain_state(self.admin_id).await?;
```
And again we're synchronizing the admin chain - is this one of the reasons you want to move everything to a single place, so that we have better control over what has been done and what still needs to be done?
Yes, exactly!
I don't know how to get around these otherwise.
E.g. we shouldn't do this at all if we're in "connected" mode, i.e. a chain listener is running. I'm thinking about writing a design doc for a "local node synchronizer" that does all this properly. I'd rather not try to add more ad-hoc case distinctions here for now.
```rust
    .await?;

Ok(())
    .await
```
Do I read this correctly - we try to download all blobs in parallel, and within a "single blob" context we try to download it from different validators sequentially, by scheduling each attempt to start after an ever-growing timeout?
Would it be more readable if you extracted the `async move { }` closure into a function that tries a new node only if the previous one failed? That way the retries can be "tighter" - i.e. we won't wait `timeout * i * i` but can start immediately (if the previous node didn't have the blob).
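The staggered fan-out being described can be sketched in a self-contained, thread-based form (a stand-in for the PR's async code; `query_validator` and its success/failure pattern are invented for illustration):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical validator query: validators 0 and 1 fail, validator 2 succeeds.
fn query_validator(index: usize) -> Result<&'static str, &'static str> {
    if index == 2 { Ok("blob-bytes") } else { Err("BlobsNotFound") }
}

// Staggered fan-out: start validator `i` after `i * stagger`, but return as
// soon as any request succeeds, so one slow validator cannot block the rest.
fn download_staggered(num_validators: usize, stagger: Duration) -> Option<&'static str> {
    let (tx, rx) = mpsc::channel();
    for i in 0..num_validators {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(stagger * i as u32);
            if let Ok(blob) = query_validator(i) {
                let _ = tx.send(blob);
            }
        });
    }
    drop(tx);
    rx.recv().ok() // first success, or None if every validator failed
}
```

The purely sequential alternative discussed next in the thread would instead try validator `i + 1` only after validator `i` has definitively failed.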
> only if the previous one failed
But that's what we had before, and what is causing the client performance issues: If the first validator we ask takes a very long time to respond or fail, we just keep waiting.
But you're right that we could try the next one immediately after the earlier ones failed, and skip the timeout.
Anyway, this affects not only this particular place in the code, but all of them. Again: This is why I think we need some sort of local node synchronizer that has all that logic in one place.
> If the first validator we ask takes a very long time to respond or fail, we just keep waiting.
So we have a timeout. We try the next one if either of the two happens:
- Validator responds with `BlobsNotFound`
- Request times out
Let's postpone this. I have so many questions and comments about this, it'll end up more complex than this whole PR.
deuszx
left a comment
The last comment is more like an idea for improvement - not a blocker. You can decide what to do with it.
This is a completion of @bart-linera's #4075:
Motivation
#3787 left a potential security issue: when a client was downloading a missing `ChainDescription` blob, it would just download the blob, but wouldn't make sure that it was legitimately created on another chain.
Proposal
Whenever a `ChainDescription` is fetched, fetch the certificate for the block that created it and validate it against committees known from the admin chain.
Test Plan
The only occurrences of `RemoteNode::download_blob[s]` are now after `process_certificate` returned `BlobsNotFound`, i.e. after checking a certificate's signatures.
CI should catch regressions.
Release Plan
Links
- `process_confirmed_block`. #2351
Notes
The `TODO` in `attempted_changes.rs` has been removed after verifying that the safety issue is no longer present there: the committees are taken from a blob, but that blob can only exist on the validator if it has been legitimately created on another chain.
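The proposal above - accept a fetched `ChainDescription` only after validating the certificate of the block that created it against committees known from the admin chain - can be sketched with hypothetical types (`Certificate`, `Blob`, and the boolean quorum flag are invented stand-ins, not the protocol's real structures):

```rust
// Hypothetical sketch of the validation flow; real certificates carry
// validator signatures that are checked against the committee's keys.
#[derive(Debug, PartialEq)]
struct Blob(&'static str);

struct Certificate {
    epoch: u64,
    signed_by_quorum: bool, // stand-in for real signature verification
    created_blob: &'static str,
}

// Accept the blob only if (a) the certificate's committee epoch is known
// from the admin chain, and (b) the certificate validates the blob.
fn validate_blob(
    cert: &Certificate,
    known_epochs: &[u64],
    blob: Blob,
) -> Result<Blob, &'static str> {
    if !known_epochs.contains(&cert.epoch) {
        return Err("unknown committee epoch");
    }
    if !cert.signed_by_quorum || cert.created_blob != blob.0 {
        return Err("certificate does not validate blob");
    }
    Ok(blob)
}
```

This mirrors why the remaining `RemoteNode::download_blob[s]` call sites are safe: they only run after `process_certificate` has already checked signatures.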