Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Conversation

@tomaka
Copy link
Contributor

@tomaka tomaka commented Apr 30, 2019

Rewrites the peerset manager (PSM) to include reputations.

I started adjusting the PSM code to add reputations, but realized it would be way too complex to handle.

The PSM is now composed of a PeersState data structure (in peersstate.rs) that just holds the state of each peer (their reputation, whether they are connected, whether they are reserved), plus a few utility functions.

This makes the main code (in lib.rs) way easier to read and much more straight-forward.

I had to remove several tests because they were too reliant on implementation details.
The new code uses a HashMap<PeerId, ...> and iterates over the entries. Since PeerIds in tests and the SipHasher are both random, the order in which we iterate is also random, and thus we can't write precise-enough tests.
In my opinion, these tests were too precise anyway and would make modifying the logic of the peerset too annoying, and the solution is instead to do #2243.

At the moment, reputations increase automatically by 1 per second of being connected. Every time we disconnect from a node, we reduce its reputation of 20% minus 10. Keep in mind that the peerset doesn't differentiate between "connected" and "trying to connect" (which can lead to the node being unreachable or on the wrong chain), which means that we have to drop some reputation when disconnecting to counter-balance the one gained when connected.
Additionally, if a node's reputation is i32::min_value(), we simply don't try. This makes banning nodes work again.

@tomaka tomaka added the A0-please_review Pull request needs code review. label Apr 30, 2019
@Demi-Marie Demi-Marie self-requested a review April 30, 2019 19:51
@gavofyork
Copy link
Member

gavofyork commented May 1, 2019

there's a grave error here which is in increasing reputation monotonically over time from the point of "connection" (which as you correctly state is not really being connected).

reputation should in fact be decreased over time, with increments only coming from useful information gleaned from the peer.

@tomaka
Copy link
Contributor Author

tomaka commented May 1, 2019

reputation should in fact be decreased over time, with increments only coming from useful information gleaned from the peer.

I can remove that, but then we should probably delay this PR until sync performs reputation adjustments. Otherwise everybody would stay at a reputation of 0 at the moment.

Copy link
Contributor

@Demi-Marie Demi-Marie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the code should be factored into utility functions.


/// Returns the list of all the peers we know of.
// Note: this method could theoretically return a `Peer`, but implementing that
// isn't simple.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because this code does not uses futures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because Peer mutably borrows the PeersState and we return multiple peers at once.
We can change Peer to only borrow the specific entry, but this has some other implications, or we can use some sort of Rc<RefCell<&mut PeersState>> instead of just &mut PeersState.


/// Returns the first reserved peer that we are not connected to.
///
/// If multiple nodes are reserved, which one is returned is unspecified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// If multiple nodes are reserved, which one is returned is unspecified.
/// If multiple nodes are reserved, which one is returned is unspecified
/// and nondeterministic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might become deterministic in the future.


/// Returns the peer with the highest reputation and that we are not connected to.
///
/// If multiple nodes have the same reputation, which one is returned is unspecified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// If multiple nodes have the same reputation, which one is returned is unspecified.
/// If multiple nodes have the same reputation, which one is returned is unspecified
/// and nondeterministic.

/// Returns the peer with the highest reputation and that we are not connected to.
///
/// If multiple nodes have the same reputation, which one is returned is unspecified.
pub fn highest_not_connected_peer(&mut self) -> Option<NotConnectedPeer> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this function will be called often, we should use a data structure such as a priority queue or Fibonacci heap.

@tomaka
Copy link
Contributor Author

tomaka commented May 3, 2019

reputation should in fact be decreased over time, with increments only coming from useful information gleaned from the peer.

I changed that to regress reputations over time towards the mean (0).
In the latest commit, every second we multiply all the reputations by 0.98, which means it takes ~34 seconds to reduce reputations by half.

match self.data.peer(&peer) {
peersstate::Peer::Connected(mut peer) => {
let cur_reput = peer.reputation();
peer.set_reputation(cur_reput.saturating_sub(cur_reput / 50));
Copy link
Contributor

@marcio-diaz marcio-diaz May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be increasing negative reputations. Also, I'm not sure we really need the secs_diff loop (or both).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be increasing negative reputations.

That's on purpose!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. But then is it possible for someone to misbehave, wait a few minutes and do it again? and what if this someone has many nodes doing it concurrently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then is it possible for someone to misbehave, wait a few minutes and do it again?

Yes, but one can also generate a brand new key at any time, so decreasing the reputation of a node cannot be a protection mechanism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm having trouble understanding why it should be an exception to the principle we are following (increasing reputation only when peer provides useful information).

Copy link
Contributor

@marcio-diaz marcio-diaz May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work the mechanism after they are taken to 0? So they are not blocked forever by high reputation nodes. I still feel that 0 shouldn't be special.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there must be a default reputation for unknown nodes. Zero is a reasonable choice for that. In my mind the only question is whether to allow a reputation to go below zero. But I think it's actually fine; it means "choose this guy only if there are no newcomers to try out".

Copy link
Contributor

@marcio-diaz marcio-diaz May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok, I was just thinking it more like:

  • The default reputation is the mean between nodes.
  • Non negative reputation (or doesn't matters).
  • Choosing to connect just by ranking. Or ranking + last time updated, if we want to forgive.

Basically, separating more the estimation of the utility from the part that chooses the next node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the mean between nodes"

the mean between which nodes? all nodes we've ever seen? at which point? right now? even if that made sense (which i don't think it does), then that would suggest there be no difference between a newcomer (which we have precisely no evidence is useful and which could be trivially sybil attacked) and a pre-existing node that has behaved "ok" for a long time.

Copy link
Contributor

@marcio-diaz marcio-diaz May 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it could be the median between the last reputations of N recently connected nodes (also counting the now disconnected). I guess in a sybil attack it will go down fast.


fn state(&self) -> &Node {
self.parent.nodes.get(&self.peer_id)
.expect("We only ever build a ConnectedPeer if the node's in the list; QED")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only ever build a ConnectedPeer if the node's in the list; QED

This is not a strong proof.

If the node is in what list? At what point would it need to be? What stops the list mutating after the ConnectedPeer has been built but before state is called?

Copy link
Contributor Author

@tomaka tomaka May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PeersState struct is essentially a glorified HashMap (it doesn't handle any logic by itself), and a ConnectedPeer is the equivalent of OccupiedEntry (https://doc.rust-lang.org/nightly/std/collections/hash_map/struct.OccupiedEntry.html).

The contract of ConnectedPeer is "as long as this object is alive, we guarantee that the peer is in list and in the connected state". It holds a mutable borrows the PeersState, meaning that nothing else can change it.

I don't really know how to provide a brief explanation of that, though.

Copy link
Contributor Author

@tomaka tomaka May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll change it to not have any possible panic instead.
Can't really do that without adding an overhead, so I'll expand the proof.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't sound like a reasonable proof.

proofs cannot be of the form "the program state is valid; this thing always succeeds when the program state is valid; qed".

this is because they cannot presume continuous-correct-state of the program (at least not unless there's a completely trivial means of auditing it).

Copy link
Contributor Author

@tomaka tomaka May 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a new commit that expands a bit.

at least not unless there's a completely trivial means of auditing it

It is extremely simple to prove that the code here is always in a valid state. Nothing is racy, everything is single-threaded, everything mutably borrows, and there's no logic in that module, it's a pure data structure.

The two states that have to be in sync (the presence of a node in a list and the fact that ConnectedPeer/NotConnectedPeer is alive) only concerns 2 locations in the code, and it is done in a very straight-forward way.

If this still isn't ok, I'll do the change that removes any possible panic but that has a small overhead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a small overhead is fine. proofs should not extend beyond the scope that they're in.


fn state_mut(&mut self) -> &mut Node {
self.parent.nodes.get_mut(&self.peer_id)
.expect("We only ever build a ConnectedPeer if the node's in the list; QED")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.


fn state(&self) -> &Node {
self.parent.nodes.get(&self.peer_id)
.expect("We only ever build a NotConnectedPeer if the node's in the list; QED")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.


fn state_mut(&mut self) -> &mut Node {
self.parent.nodes.get_mut(&self.peer_id)
.expect("We only ever build a NotConnectedPeer if the node's in the list; QED")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor

@gterzian gterzian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general suggestion on the overall design

SlotState::MaxCapacity(peer_id) => {
self.data.discovered.add_peer(peer_id, SlotType::Reserved);
return;
let mut entry = match self.data.peer(&peer_id) {
Copy link
Contributor

@gterzian gterzian May 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general suggestion on the PeerState design:

Have you considered just keeping the entire data hidden inside of it? For example here, why not just do a let connected = self.data.add_reserved_peer(peer_id);, followed by if connected {// queue the message}.

It appears to me that almost every operation made on self.data in this file, which currently involves making the internal data of the PeerState available to the client for manipulation, could be replaced by a method call on the PeerState, with a return value if necessary.

The current design seems to introduce a new data structure, the PeerState, and then make it's internal available to the client. My suggestion would be to hide the internal state of PeerState, and remove what is currently done with Peer.

Just to drive the point home, currently inside the loop at alloc_slots, we see the following:

  1. self.data.reserved_not_connected_peer()
  2. self.data.highest_not_connected_peer()
  3. next.reputation() (with next being a variant of Peer)
  4. match next.try_outgoing(),
  5. Ok(conn) => self.message_queue.push_back(Message::Connect(conn.into_peer_id())),

What's the usefulness of PeerState as an abstraction if the client needs to see all this?

I would suggest something more like:

match self.data.next_peer_to_connect_to() {
    Some(peer_id) => // enqueue the Connect message
    None => // No slot available
}

It's sharing this Peer back with the peerset that makes PeerState complicated, and I'm wondering if it's necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is to separate the data structure from the logic. The logic right now is extremely simple, but once it becomes more complex it's IMO better to not have to deal with the increased complexity of both deciding who to connect to and how to store data in memory.

Copy link
Contributor

@gterzian gterzian May 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see, in that case it does make sense to share Peer as a handle to the data as a way to keep the business logic in the peerset part and out of the data structure itself.

/// We are not connected to this node.
NotConnected(NotConnectedPeer<'a>),
/// We have never heard of this node.
Unknown(UnknownPeer<'a>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this state needed? Is it somehow different from NotConnected?

}
}

/// A peer that we have never heard of.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the Id come from then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is given by the user. This object represents an entry not-inserted-yet, just like the VacantEntry object in the hash map API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peer_id: Cow<'a, PeerId>,
num_in: &'a mut u32,
num_out: &'a mut u32,
max_in: u32,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these? Do they need to be stored for each peer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added them in the latest commit to address Gav's review about the panic proof not being strong enough. We now store a &mut Node instead of the &mut Peerstate, but that means we copy over some fields of the Peerstate.

@gavofyork
Copy link
Member

@demimarie-parity any further showstoppers?

@gavofyork gavofyork merged commit 9448965 into paritytech:master May 10, 2019
@tomaka tomaka deleted the peerset-rewrite branch May 10, 2019 12:12
MTDK1 pushed a commit to bdevux/substrate that referenced this pull request Jul 10, 2019
* Rewrite the PSM

* Fix disconnecting from reserved peers

* Minor adjustements

* Address review

* Reputation changes adjustements

* More adjustements

* Adjust all reputations

* More fixes and adjustments

* Improve proof

* Remove the possible panic

* Make sure reputation reaches 0
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

A0-please_review Pull request needs code review.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants