Save bandwidth by avoiding superfluous Nodes Requests to peers already on the Close List #511

Merged
merged 1 commit into from
Apr 29, 2017

Conversation

zugz

@zugz zugz commented Mar 12, 2017

Currently, when we receive a response to a getnodes request, for each node in
the response we check if it could be added to the close list, and if so we send
the node a getnodes request; if it replies, we add it to the close list.

We do the same for each friends list, but for them we first check whether the
node is already on the friends list, and don't send a getnodes request if it
is.

This change adds a corresponding check for the close list, such that we only
send a getnodes request if the node can be added to the close list and is not
already on the close list.

The original behaviour could, and as far as I can see should typically be expected
to, lead to getnodes loops resulting in lots of unnecessary traffic.
This can happen as follows: suppose A, B, and C are all close to each other in
the DHT, and all know each other, and A sends B a getnodes request searching
for A (as happens at least every 60 seconds). B's response will likely include
C. It's quite likely that the part of A's close list (the "k-bucket") into
which C falls will not be full, i.e. that A does not know 8 nodes at that
distance. Then even though A already knows C, when A receives B's response A
will send a getnodes to C. Then C's response is likely to include B, in which
case (if that bucket is also not full) A will send another getnodes to B. So we
obtain a tight loop, with the only delays being network delays and the delay
between calls to do_DHT().

I don't see that there can be any downside to adding this check, but I'm not
entirely confident that I couldn't be missing something subtle, so please do
review carefully.
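As a rough sketch of the shape of the added check (not the literal toxcore code): the helpers add_to_close() and is_pk_in_close_list() are the ones named elsewhere in this PR, but the signatures, the "simulate" flag, and the return conventions used here are assumptions for illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include "DHT.h" /* assumes toxcore's internal DHT declarations */

    static bool want_to_ping_for_close_list(DHT *dht, const uint8_t *public_key,
                                            IP_Port ip_port)
    {
        /* Assumed convention for this sketch: add_to_close() in "simulate" mode
         * reports whether the node would fit into one of our k-buckets. */
        if (!add_to_close(dht, public_key, ip_port, /* simulate */ true)) {
            return false;
        }

        /* New in this change: if we already track the node, don't send it another
         * getnodes request; this is what breaks the A/B/C loop described above. */
        if (is_pk_in_close_list(dht, public_key, ip_port)) {
            return false;
        }

        return true;
    }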



@zugz
Author

zugz commented Mar 12, 2017

I did some primitive testing, and I'm sad to say that this patch doesn't seem to lead to any immediately discernible reduction in bandwidth. Whether this is because the scenario I sketched isn't actually occurring, or just because it's being drowned out by all the other sources of bandwidth use, I don't know.

@robinlinden
Member

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


toxcore/DHT.c, line 1025 at r1 (raw file):

    if (index > LCLIENT_LENGTH) {
        index = LCLIENT_LENGTH - 1;

So LCLIENT_LENGTH is okay, but you set index to LCLIENT_LENGTH - 1? Did you mean to use >= in the if-check or is this all intentional?
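For what it's worth, the fix applied in the later revision (per the "fix index bounds check" commit note further down) presumably amounts to clamping with >=, along these lines:

    /* Sketch of the corrected clamp: with LCLIENT_LENGTH buckets, valid indices
     * run from 0 to LCLIENT_LENGTH - 1, so the comparison must be >= rather than >. */
    if (index >= LCLIENT_LENGTH) {
        index = LCLIENT_LENGTH - 1;
    }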



@nurupo
Member

nurupo commented Mar 23, 2017

Is the "close list" equivalent of the k-bucket list in Kademlia? If so, the behaviour you are complaining about might actually be the desired behaviour and you fixing it would make DHT less reliable. In Kademlia, each k-bucket contains at most k nodes and is sorted by the most recently seen time. You update the k-bucket list every time you receive any DHT reply from any node, putting that node on the top (or bottom, depends on how you implemented it) of the list, since it's the most recently seen node in that bucket. Of course, if the node is already somewhere in the k-bucket, you remove it before adoing it to the top, as you don't want duplicates in the k-bucket. The property of k-buckets being always sorted by the most recently seen time is key to Kademlia DHT's reliability, because once a k-bucket is full and you add a new node to it, the least recently seen node is removed from the k-bucket, i.e. the bottom most node is removed. This makes it so that k-bucket contains k nodes which are very likely to be online. If you change the code in such a way that you don't put a recently seen node at the top of the k-bucket if it's already in it, you violate the property of the k-bucket being sorted by the most recently seen time, which allows for the exact opposite behaviour to happen -- the bottom most node might in fact be one of the most recently seen nodes, which should be on the top of the k-bucket instead, and since the bottom most node is removed when you try to insert a new node in a full k-bucket, because the algorithm assumes that the bottom most node is the least recently seen node, it would unknowingly remove the most recently node.

That said, I'm only familiar with Kademlia DHT, not Tox DHT, but Tox DHT is supposedly based on Kademlia DHT so this is likely to still apply.
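For readers unfamiliar with the rule being described, a self-contained illustrative sketch of the textbook Kademlia k-bucket update follows; this is not toxcore code, and the names, bucket size, and key size are made up for the example.

    #include <stdint.h>
    #include <string.h>

    #define K 8        /* bucket capacity ("k" in Kademlia) */
    #define PK_SIZE 32 /* size of a node's public key */

    typedef struct {
        uint8_t entries[K][PK_SIZE]; /* entries[0] is the least recently seen */
        unsigned int count;
    } KBucket;

    /* Called whenever any DHT reply is received from a node in this bucket. */
    static void kbucket_update(KBucket *bucket, const uint8_t *public_key)
    {
        /* If the node is already present, remove it so it can be re-inserted at
         * the most-recently-seen end (no duplicates in the bucket). */
        for (unsigned int i = 0; i < bucket->count; ++i) {
            if (memcmp(bucket->entries[i], public_key, PK_SIZE) == 0) {
                if (i + 1 < bucket->count) {
                    memmove(bucket->entries[i], bucket->entries[i + 1],
                            (size_t)(bucket->count - i - 1) * PK_SIZE);
                }
                --bucket->count;
                break;
            }
        }

        /* If the bucket is full, evict the least recently seen node (index 0). */
        if (bucket->count == K) {
            memmove(bucket->entries[0], bucket->entries[1],
                    (size_t)(K - 1) * PK_SIZE);
            --bucket->count;
        }

        /* Append the node as the most recently seen entry. */
        memcpy(bucket->entries[bucket->count], public_key, PK_SIZE);
        ++bucket->count;
    }

(In the original Kademlia paper the least recently seen node is pinged first and only evicted if it fails to respond; the simplified version above follows the description in this comment.)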

@zugz
Author

zugz commented Mar 23, 2017 via email

@zugz
Author

zugz commented Mar 23, 2017 via email

@robinlinden
Member

:lgtm_strong:


Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.



@iphydf iphydf modified the milestone: v0.1.8 Mar 26, 2017
@iphydf
Member

iphydf commented Mar 26, 2017

Please enable the checkbox "Allow edits from maintainers." on the bottom right.

@zugz zugz force-pushed the checkInCloselist branch 2 times, most recently from e7c2f48 to b8ac109 Compare April 1, 2017 12:30
@zugz zugz changed the title check if already in close list in ping_node_from_getnodes_ok() Save bandwith by avoiding sending superfluous Nodes Requests to peers already on the Close List Apr 1, 2017
@iphydf iphydf changed the title Save bandwith by avoiding sending superfluous Nodes Requests to peers already on the Close List Save bandwidth by avoiding superfluous Nodes Requests to peers already on the Close List Apr 1, 2017
@iphydf
Member

iphydf commented Apr 1, 2017

@nurupo are you happy with this PR?

@GrayHatter

let me take a look too

@GrayHatter

I did some primitive testing, and I'm sad to say that this patch doesn't seem to lead to any immediately discernible reduction in bandwidth. Whether this is because the scenario I sketched isn't actually occurring, or just because it's being drowned out by all the other sources of bandwidth use, I don't know.

@JFreegman has a feature branch with packet count/size tracking; did you, or can you, use that to see if it's changing the packet counts in other ways?


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful.



@GrayHatter

That said, I'm only familiar with Kademlia DHT, not Tox DHT, but Tox DHT is supposedly based on Kademlia DHT so this is likely to still apply.

From my understanding, that's not how Tox sorts its close list (unless I misunderstand something): Tox ONLY replaces nodes that have timed out.
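To make the contrast concrete, a rough sketch of a "replace only timed-out nodes" policy might look like the following; this is not the actual toxcore code, and the struct, field names, and timeout value are assumptions for illustration.

    #include <stdint.h>
    #include <string.h>

    #define PK_SIZE 32
    #define STALE_TIMEOUT 122 /* seconds; placeholder, not the real toxcore constant */

    typedef struct {
        uint8_t public_key[PK_SIZE];
        uint64_t last_seen; /* time of the last packet heard from this node */
    } CloseEntry;

    /* Reuse a slot only if its current occupant has timed out; a list full of
     * fresh nodes is left untouched, unlike the LRU eviction sketched above. */
    static int replace_timed_out(CloseEntry *list, unsigned int length,
                                 const uint8_t *public_key, uint64_t now)
    {
        for (unsigned int i = 0; i < length; ++i) {
            if (now - list[i].last_seen > STALE_TIMEOUT) {
                memcpy(list[i].public_key, public_key, PK_SIZE);
                list[i].last_seen = now;
                return 0; /* replaced a stale entry */
            }
        }

        return -1; /* every slot is still fresh; the new node is not added */
    }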



@GrayHatter

:lgtm: afaict, @zugz is correct about everything. All this pull does is prevent the DHT from adding nodes that already exist in the close list to the to_bootstrap list, which IIRC would only then attempt to add them to the close list again.

If they're already in the close list, we honestly don't need to interact with them at all*; assuming other, closer nodes send us getnodes requests, we don't need to ask them for close nodes anymore*.

*many restrictions apply, and I'm not advocating we actually do this.


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful.



@iphydf
Member

iphydf commented Apr 9, 2017

@irungentoo would you like to take a look at this to make sure it makes sense?

@irungentoo

This looks fine.

@iphydf iphydf modified the milestones: v0.1.8, v0.1.9 Apr 12, 2017
@zugz
Author

zugz commented Apr 28, 2017

I did a little imprecise testing, and I'm observing roughly a quartering of
the number of getnodes requests sent during normal DHT operation once this
patch is applied (from about 8 per second to about 2). So this should indeed
help with our bandwidth problem.

fix index bounds check in add_to_close() and is_pk_in_close_list()

add TODO to write test for bug which is fixed by this commit
@robinlinden robinlinden merged commit 2474f43 into TokTok:master Apr 29, 2017