[fastx client] Adding robust client <-> authorities primitives #336

Merged
merged 2 commits into from
Feb 6, 2022

gdanezis
Collaborator

@gdanezis gdanezis commented Feb 1, 2022

In this PR we build a library of primitives for robust interactions between the client and the set of authorities, covering failure cases in the general model of orders beyond transfers.

Specifically:

[Client logic]

  • We include a generic map/reduce pattern for interacting with authorities: we provide an async function to query each authority, and a reduce function that aggregates the responses and adaptively ends, continues, or times out the queries.
  • We use the abstraction to retrieve all versions of the objects that the authorities know to be owned by an address.
  • We use the abstraction to retrieve all versions of an object known by the authorities, along with the associated certificates.
  • We use the abstraction to update authorities to the latest state of the objects owned by an address, using the latest certs.
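
The map/reduce pattern described in this list can be sketched roughly as follows. This is a simplified, thread-based stand-in for the async version in the PR, using only the standard library; the names `ReduceOutput` and `quorum_map_then_reduce`, the stake-as-`usize` representation, and the exact reducer signature are assumptions made for illustration, not the PR's actual API.

```rust
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::{Duration, Instant};

/// What the reducer asks for after seeing one more response (hypothetical type).
pub enum ReduceOutput<S> {
    /// Keep waiting for responses under the current deadline.
    Continue(S),
    /// Keep waiting, but for at most this long from now.
    ContinueWithTimeout(S, Duration),
    /// Stop immediately and return the accumulated state.
    End(S),
}

/// Queries every authority on its own thread (`map`), then folds the responses
/// in arrival order (`reduce`). The reducer also receives the responder's stake.
pub fn quorum_map_then_reduce<A, V, S, M, R>(
    authorities: Vec<(A, usize)>, // (authority, stake)
    map: M,
    mut reduce: R,
    initial_state: S,
    initial_timeout: Duration,
) -> S
where
    A: Send + 'static,
    V: Send + 'static,
    M: Fn(A) -> V + Send + Sync + 'static,
    R: FnMut(S, usize, V) -> ReduceOutput<S>,
{
    let map = Arc::new(map);
    let (sender, receiver) = mpsc::channel();
    for (authority, stake) in authorities {
        let (map, sender) = (Arc::clone(&map), sender.clone());
        thread::spawn(move || {
            // The receiver may already be gone if the reducer ended early.
            let _ = sender.send((stake, (*map)(authority)));
        });
    }
    drop(sender); // the channel closes once every worker has reported

    let mut state = initial_state;
    let mut deadline = Instant::now() + initial_timeout;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match receiver.recv_timeout(remaining) {
            Ok((stake, value)) => match reduce(state, stake, value) {
                ReduceOutput::Continue(next) => state = next,
                ReduceOutput::ContinueWithTimeout(next, timeout) => {
                    state = next;
                    deadline = Instant::now() + timeout;
                }
                ReduceOutput::End(done) => return done,
            },
            // Either the deadline passed or every authority has answered.
            Err(_) => return state,
        }
    }
}
```

A reducer would typically return `Continue` until it has accumulated a quorum of stake, then `ContinueWithTimeout` to give slower authorities a bounded chance to catch up, or `End` when nothing more is needed.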

[Authority logic]

  • Augmented the authority with an object version in parent_sync when the object is deleted, and a link back to the cert that deleted it.
  • Changed the semantics of ObjectInfoRequest to return the latest ObjectRef for an object id, and the certificate that leads to it, including for deleted objects.
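
As a rough illustration of the two authority changes above, the sketch below keeps a `parent_sync`-like index keyed by (object id, version) and records a deletion as one more version pointing at the deleting transaction. The stand-in types, the `DELETED_DIGEST` sentinel, and the method names `record_write` and `record_delete` are assumptions for this example; only `get_latest_parent_entry` is a name that appears in the PR's commit list, and its real signature may differ.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real types (assumptions for this sketch).
type ObjectId = u64;
type SequenceNumber = u64;
type ObjectDigest = [u8; 4];
type TransactionDigest = u64;
type ObjectRef = (ObjectId, SequenceNumber, ObjectDigest);

/// Hypothetical sentinel digest marking "deleted at this version".
pub const DELETED_DIGEST: ObjectDigest = [0xff; 4];

#[derive(Default)]
pub struct ParentSync {
    entries: BTreeMap<(ObjectId, SequenceNumber), (ObjectDigest, TransactionDigest)>,
}

impl ParentSync {
    /// Records that `tx` produced the object at version `seq`.
    pub fn record_write(
        &mut self,
        id: ObjectId,
        seq: SequenceNumber,
        digest: ObjectDigest,
        tx: TransactionDigest,
    ) {
        self.entries.insert((id, seq), (digest, tx));
    }

    /// Records a deletion as a new version that links back to the deleting transaction.
    pub fn record_delete(&mut self, id: ObjectId, last_seq: SequenceNumber, tx: TransactionDigest) {
        self.entries.insert((id, last_seq + 1), (DELETED_DIGEST, tx));
    }

    /// Returns the latest reference for `id` and the transaction that led to it,
    /// including the deletion marker if the object was deleted.
    pub fn get_latest_parent_entry(&self, id: ObjectId) -> Option<(ObjectRef, TransactionDigest)> {
        self.entries
            .range((id, SequenceNumber::MIN)..=(id, SequenceNumber::MAX))
            .next_back()
            .map(|(&(id, seq), &(digest, tx))| ((id, seq, digest), tx))
    }
}
```

Because the map is ordered by (id, version), the latest entry for an object is the last key in that object's range, so the lookup does not need to scan other objects.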

For another PR:

  • Record and propagate errors in a suitable way, probably by sending them back to the authority client logic.
  • Use shared cached state instead of queries where possible -- right now we are very chatty.

@gdanezis gdanezis marked this pull request as draft February 1, 2022 21:55
@gdanezis gdanezis force-pushed the update-client-authority-interactions branch from 9bb93c5 to 14bf67a Compare February 2, 2022 12:50
@gdanezis
Collaborator Author

gdanezis commented Feb 2, 2022

Hey @patrickkuo -- I promised to make this available for review, but I am still trying to finalise sync_all_owned_objects so that it is correct if objects are either deleted or transferred to other users. It's taking longer than I expected, sorry about the delay.

@gdanezis gdanezis force-pushed the update-client-authority-interactions branch from 662edd5 to 3643466 Compare February 2, 2022 20:02
@gdanezis gdanezis changed the title [fastx client] Adding robust client <-> authorities primitives (draft) [fastx client] Adding robust client <-> authorities primitives Feb 4, 2022
@gdanezis gdanezis force-pushed the update-client-authority-interactions branch from 5d8ca5c to dbdf6b2 Compare February 4, 2022 11:17
@gdanezis gdanezis marked this pull request as ready for review February 4, 2022 11:22
Contributor

@patrickkuo patrickkuo left a comment

just some nits, otherwise LGTM

Contributor

@huitseeker huitseeker left a comment

Thanks! This looks like an interesting direction. A couple of comments from this first pass of review:

  • unit tests for get_object_by_id, get_all_owned_objects, sync_all_owned_objects would help, or at least smaller tests. The tests that are here (thank you!) are integration tests. They're a journey; I would love something more bite-sized for daily reading.
  • anything that lowers the complexity of quorum_map_then_reduce_with_timeout is good. I left a few suggestions, but I'll try to help more on the second pass, especially with the reduce.

)
.await;

// TODO: log or report the errors if there is one.
Contributor

tracing allows you to de-couple the logging from the publishing of the logs, which means you can leave it to a further PR to actually figure out how to display things, and actually get on with figuring out interesting messages to communicate.

I wonder: is there something we could do to lower the bar to entry and make sure logging / tracing is not left as a TODO in general?

Collaborator Author

@gdanezis gdanezis Feb 5, 2022

Tracing communicates things to the devops person babysitting a server or a user process (actually, logs on the user side are usually for debugging; no user should look there). Here, what needs doing is to collect errors and send them to the part of the client that can do something about them, i.e. deprioritize interactions with this authority, etc.

I changed the comment to:

// TODO: collect errors and propagate them to the right place

Your reflection on how to systematically log is still very valid.
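
One possible shape for "collect errors and send them to the part of the client that can do something about them" is sketched below: a small per-authority error record that the client consults when choosing which authorities to contact first. The `AuthorityHealth` type, its methods, and the use of plain strings for authority names and errors are all hypothetical; the PR leaves this design open.

```rust
use std::collections::HashMap;

/// Hypothetical record of the errors observed per authority.
#[derive(Default)]
pub struct AuthorityHealth {
    errors: HashMap<String, Vec<String>>,
}

impl AuthorityHealth {
    /// Stores an error reported while talking to `authority`.
    pub fn record_error(&mut self, authority: &str, error: impl Into<String>) {
        self.errors
            .entry(authority.to_string())
            .or_default()
            .push(error.into());
    }

    /// Number of errors recorded so far for `authority`.
    pub fn error_count(&self, authority: &str) -> usize {
        self.errors.get(authority).map_or(0, Vec::len)
    }

    /// Orders authorities so that those with the fewest recorded errors come first.
    /// The sort is stable, so authorities with equal counts keep their input order.
    pub fn by_priority<'a>(&self, authorities: &[&'a str]) -> Vec<&'a str> {
        let mut ordered = authorities.to_vec();
        ordered.sort_by_key(|name| self.error_count(name));
        ordered
    }
}
```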

Comment on lines 683 to 686
/// This function provides a flexible way to communicate with a quorum of authorities,
/// processing their results into a safe overall result, and also safely allowing operations to continue
/// past the quorum to ensure all authorities are up to date (up to a timeout).
Contributor

A doc example here would increase the usability of the function, especially if it shows how to make use of timeouts.

Collaborator Author

I now have quite a few examples in the unit tests, including using timeouts. We can try to adapt them to a doc example (I do not know how to set up the scaffolding as we do in tests for docs).

Contributor

The scaffolding is there already: it's a Rust built-in, and cargo test will run your doc tests at the end.
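
For reference, a doc test is just a fenced example inside a doc comment, which cargo test compiles and runs for library crates. The sketch below uses a made-up helper, `per_attempt_timeout`, and a placeholder crate name, `my_crate`; neither comes from the PR. The example inside the comment is fenced with tildes here only so that it nests cleanly in this snippet; the usual triple-backtick fence should behave the same way.

```rust
use std::time::Duration;

/// Splits an overall timeout evenly across `attempts` tries.
///
/// # Examples
///
/// ~~~
/// use std::time::Duration;
/// // `my_crate` is a placeholder for the real crate name.
/// use my_crate::per_attempt_timeout;
///
/// let each = per_attempt_timeout(Duration::from_secs(6), 3);
/// assert_eq!(each, Duration::from_secs(2));
/// ~~~
pub fn per_attempt_timeout(total: Duration, attempts: u32) -> Duration {
    // Treat zero attempts as one so that the division is always defined.
    total / attempts.max(1)
}
```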

@gdanezis
Collaborator Author

gdanezis commented Feb 4, 2022

Many thanks for the reviews @patrickkuo & @huitseeker -- you have identified some key issues to fix. Will do and respond by tomorrow.

(
BTreeMap<
(ObjectRef, TransactionDigest),
(Option<Object>, Vec<(AuthorityName, Option<SignedOrder>)>),
Contributor

We might want to pass around the reference to the Object here instead of the whole object since they could be arbitrarily huge.
(in the meantime, Sam and I are trying to define a reasonable upper bound for object sizes)

Collaborator Author

@gdanezis gdanezis Feb 5, 2022

The objects can be huge because they contain large allocations on the heap. However, moving an Object around (unless we clone it, or it implements the Copy trait) does not actually copy the allocation, just the fixed-size, stack-allocated part, I think.
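
The point about moves can be checked with a small experiment. The `Object` struct below is a simplified stand-in (an id plus a byte buffer), not the real type from the codebase: moving it keeps the same heap buffer, while cloning it allocates a new one.

```rust
/// Simplified stand-in for an object whose payload lives on the heap.
#[derive(Clone)]
pub struct Object {
    pub id: u64,
    pub contents: Vec<u8>,
}

/// Address of the heap buffer that backs `contents`.
pub fn heap_address(object: &Object) -> *const u8 {
    object.contents.as_ptr()
}

/// Takes the object by value and hands it back; both steps are moves.
pub fn pass_by_value(object: Object) -> Object {
    object
}
```

Only the struct itself (an integer plus the pointer, length, and capacity of the `Vec`) is copied on a move, regardless of how large `contents` is.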

@gdanezis gdanezis force-pushed the update-client-authority-interactions branch from 3e36713 to 3f9362e Compare February 5, 2022 10:30
@gdanezis
Collaborator Author

gdanezis commented Feb 5, 2022

@huitseeker I have added unit tests as you suggested, but for quorum_map_then_reduce_with_timeout. To have effective unit (rather than integration) tests for get_object_by_id, get_all_owned_objects, sync_all_owned_objects, we will need to create a more expressive mocking framework around LocalAuthorityClient to induce specific responses, errors, long delays, etc. I want this, but it's a bigger piece of work than this PR.

Do we have experience, or should we use something like: https://docs.rs/mockall/latest/mockall/ ?
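
Independently of which mocking crate is chosen, the kind of scripted behaviour described above (specific responses, errors, long delays) could look roughly like the hand-rolled sketch below. The `AuthorityApi` trait, the `Scripted` enum, and `MockAuthority` are invented for illustration and do not mirror the real `LocalAuthorityClient` interface.

```rust
use std::collections::VecDeque;
use std::thread;
use std::time::Duration;

/// One scripted outcome for the next call (hypothetical).
pub enum Scripted {
    Reply(u64),
    Fail(String),
    DelayedReply(Duration, u64),
}

/// Minimal stand-in for the part of the authority interface under test.
pub trait AuthorityApi {
    fn latest_version(&mut self, object_id: u64) -> Result<u64, String>;
}

/// Plays back a fixed script and records which object ids were requested.
pub struct MockAuthority {
    script: VecDeque<Scripted>,
    pub calls: Vec<u64>,
}

impl MockAuthority {
    pub fn new(script: Vec<Scripted>) -> Self {
        Self { script: script.into(), calls: Vec::new() }
    }
}

impl AuthorityApi for MockAuthority {
    fn latest_version(&mut self, object_id: u64) -> Result<u64, String> {
        self.calls.push(object_id);
        match self.script.pop_front() {
            Some(Scripted::Reply(version)) => Ok(version),
            Some(Scripted::Fail(error)) => Err(error),
            Some(Scripted::DelayedReply(delay, version)) => {
                thread::sleep(delay); // simulate a slow authority
                Ok(version)
            }
            None => Err("script exhausted".to_string()),
        }
    }
}
```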

Contributor

@lxfind lxfind left a comment

There may be an issue here:
When we delete an object by wrapping it to another object, we insert (obj_id, seq+1, ..) to parent_sync, but we don't actually mutate the sequence number of the object (it stays at seq). When that same object gets unwrapped, its sequence number becomes seq+1 and now parent_sync has two entries for the same object_id + seq.

Contributor

@huitseeker huitseeker left a comment

To @lxfind's comment: do we not clobber the original deleted object recording when unwrapping?
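
The scenario under discussion can be replayed against a toy index. The sketch assumes `parent_sync` behaves like an ordered map keyed by (object id, version), which is a simplification and may not match the real store; under that assumption, the unwrap's insert lands on the same key as the wrap's deletion marker and replaces it.

```rust
use std::collections::BTreeMap;

type ObjectId = u64;
type SequenceNumber = u64;

/// Simplified entry: which transaction produced this version, and how.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParentEntry {
    Live { tx: u64 },
    Deleted { tx: u64 },
}

/// Toy stand-in for parent_sync (an assumption for this sketch).
pub type ParentSync = BTreeMap<(ObjectId, SequenceNumber), ParentEntry>;

/// Replays "wrap, then unwrap" for an object currently at version `seq`
/// and returns the entry displaced by the unwrap, if any.
pub fn wrap_then_unwrap(
    parent_sync: &mut ParentSync,
    id: ObjectId,
    seq: SequenceNumber,
    wrap_tx: u64,
    unwrap_tx: u64,
) -> Option<ParentEntry> {
    // Wrapping records a deletion at seq + 1, but the wrapped object stays at seq.
    parent_sync.insert((id, seq + 1), ParentEntry::Deleted { tx: wrap_tx });
    // Unwrapping bumps the object to seq + 1, which is the same key again.
    parent_sync.insert((id, seq + 1), ParentEntry::Live { tx: unwrap_tx })
}
```

With a map, the second insert returns the deletion marker it overwrote; a store that allows duplicate keys would instead end up with two entries for the same (id, version), as described in the comment above.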

Comment on lines 309 to 315
  if let Ok(response) = result {
-     let certificate = response
-         .parent_certificate
-         .expect("Unable to get certificate");
+     let certificate = if let Some(certificate) = response.parent_certificate {
+         certificate
+     } else {
+         continue;
+     };

Contributor

            if let Ok(ObjectInfoResponse {
                parent_certificate: Some(certificate),
                ..
            }) = result
            {
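
A self-contained version of the suggested pattern is sketched below. The `ObjectInfoResponse` here is a cut-down stand-in with string certificates and string errors, chosen only so that the example compiles on its own.

```rust
/// Cut-down stand-in for the real response type.
pub struct ObjectInfoResponse {
    pub object_id: u64,
    pub parent_certificate: Option<String>,
}

/// Keeps only the certificates from successful responses that carry one.
pub fn collect_certificates(results: Vec<Result<ObjectInfoResponse, String>>) -> Vec<String> {
    let mut certificates = Vec::new();
    for result in results {
        // One nested pattern replaces the `if let Ok(..)` plus `if let Some(..) else continue` pair.
        if let Ok(ObjectInfoResponse {
            parent_certificate: Some(certificate),
            ..
        }) = result
        {
            certificates.push(certificate);
        }
    }
    certificates
}
```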

Comment on lines 1000 to 1002
// NOTE: This implies this is a genesis object. We should check that it is.
// We can do this by looking into the genesis, or the object_refs of the genesis.
// Otherwise report the authority as potentially faulty.
Contributor

This is probably calling for a specific is_genesis function somewhere?

Collaborator Author

Yes, it does. It also calls for checking all the information the client downloads from an authority, something that is currently done in an ad hoc way at best, if at all.
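
A minimal sketch of such a check, assuming the client holds the set of genesis object references locally. The `Genesis` type, the tuple-shaped `ObjectRef`, and `check_uncertified_object` are hypothetical names for this example; how the real client would obtain or store the genesis state is not specified in this thread.

```rust
use std::collections::HashSet;

/// Simplified object reference: (object id, version, digest).
pub type ObjectRef = (u64, u64, [u8; 4]);

/// Hypothetical holder of the object references created at genesis.
pub struct Genesis {
    object_refs: HashSet<ObjectRef>,
}

impl Genesis {
    pub fn new(object_refs: impl IntoIterator<Item = ObjectRef>) -> Self {
        Self { object_refs: object_refs.into_iter().collect() }
    }

    /// True if `object_ref` is one of the references created at genesis.
    pub fn is_genesis(&self, object_ref: &ObjectRef) -> bool {
        self.object_refs.contains(object_ref)
    }
}

/// An object returned without a parent certificate must be a genesis object;
/// anything else is reported so that the authority can be treated as suspect.
pub fn check_uncertified_object(
    genesis: &Genesis,
    authority: &str,
    object_ref: &ObjectRef,
) -> Result<(), String> {
    if genesis.is_genesis(object_ref) {
        Ok(())
    } else {
        Err(format!(
            "authority {} returned non-genesis object {:?} without a certificate",
            authority, object_ref
        ))
    }
}
```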

Comment on lines 790 to 791
if total_stake.1 > validity {
return Err(FastPayError::TooManyIncorrectAuthorities);
Contributor

This is very underwhelming. 😢

Collaborator Author

Sorry, I do not know what to do about this. If it's the name that is underwhelming, we can change it to something else. If the amount of reporting is underwhelming, then it will have to be a separate PR. We cannot block all progress on the known issue that error reporting on the client is lacking.

Contributor

Yes, for sure, there's no blocking issue here.
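
For context, the check under discussion compares the stake of authorities that returned errors against a validity threshold. The sketch below assumes the common BFT-style thresholds (quorum `2n/3 + 1`, validity `(n + 2) / 3`) and, as one possible way to make the error more informative, a variant that carries the numbers involved. Both the formulas and the fields on the error are assumptions, not what the PR implements.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FastPayError {
    /// Hypothetical richer variant; the variant in the PR carries no data.
    TooManyIncorrectAuthorities { bad_stake: usize, validity_threshold: usize },
}

pub struct Committee {
    pub total_stake: usize,
}

impl Committee {
    /// Stake needed for a quorum (assumed formula).
    pub fn quorum_threshold(&self) -> usize {
        2 * self.total_stake / 3 + 1
    }

    /// Stake guaranteed to include at least one honest authority (assumed formula).
    pub fn validity_threshold(&self) -> usize {
        (self.total_stake + 2) / 3
    }
}

/// Mirrors the `> validity` comparison from the snippet under review.
pub fn check_bad_stake(committee: &Committee, bad_stake: usize) -> Result<(), FastPayError> {
    let validity_threshold = committee.validity_threshold();
    if bad_stake > validity_threshold {
        return Err(FastPayError::TooManyIncorrectAuthorities { bad_stake, validity_threshold });
    }
    Ok(())
}
```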

Remove annotation

Implemented get_object_by_ref

Download objects at version

Added tests and fix corner cases

Return certificate with retrieved object

Added docs

Finished sync_all_owned_objects

Docs

Return available objects

Add an assert that we only side insert genesis objects

Added a an authority test with a deleted object

Add deleted objects to the parent index

Added get_latest_parent_entry & tests

ObjectInfoResponse sends the latest certificate

Added requested_object_reference to ObjectInfoResponse

Fixed bug in query()

Fix sync_all_owned_objects

Added tests for the client full sync

Make fmt and clippy happy

Clean up comments

Removed unused function

Make fmt happy

Add docs

Clean up tests

Removed wrapped

Remove s.truncate

Fixed rebase errors

Make get_object_by_id return pending orders

clippy + fmt

Changes followin review

Added unit tests for map/reducer
@gdanezis gdanezis force-pushed the update-client-authority-interactions branch from 7b56a00 to ee0d7f7 Compare February 6, 2022 15:36
@gdanezis
Collaborator Author

gdanezis commented Feb 6, 2022

There may be an issue here: When we delete an object by wrapping it to another object, we insert (obj_id, seq+1, ..) to parent_sync, but we don't actually mutate the sequence number of the object (it stays at seq). When that same object gets unwrapped, its sequence number becomes seq+1 and now parent_sync has two entries for the same object_id + seq.

Yes, this issue is very real. We have to deal with it either by incrementing the version correctly or by implementing Lamport timestamps. Either should work.
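
One reading of the Lamport-timestamp option is that every object written or deleted by a transaction gets the version max(input versions) + 1. The helper below is a sketch of that rule only; the name `lamport_version` and the convention of using 0 when there are no inputs are assumptions, and the thread does not settle which of the two fixes is adopted.

```rust
/// Version assigned to every output of a transaction under a Lamport-style rule:
/// one more than the highest version among the transaction's inputs.
pub fn lamport_version(input_versions: &[u64]) -> u64 {
    input_versions.iter().copied().max().unwrap_or(0) + 1
}
```

Under this rule, a wrap that consumes a child at version 3 and a parent at version 5 records everything at version 6, and a later unwrap that takes the parent at version 6 produces version 7, so the deletion marker and the unwrapped object no longer share a version.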

@gdanezis gdanezis deleted the update-client-authority-interactions branch February 6, 2022 16:21