
PutValue appears to not put to K peers #139

Closed
whyrusleeping opened this issue Apr 23, 2018 · 54 comments
Labels
kind/bug A bug in existing code (including security flaws) · P1 High: Likely tackled by core team if no one steps up · status/ready Ready to be worked

Comments

@whyrusleeping
Contributor

In order for queries to function properly, PutValue should ensure it puts records to K (20) peers. Currently this does not seem to be happening: a much smaller number of records (fewer than 16, our GetValue threshold) is actually being put.

This was brought up in ipfs/kubo#3860

We should investigate and fix this.

@inetic

inetic commented Apr 26, 2018

I've been testing go-ipfs as it currently is in master (sorry, I'm still quite new to Go/gx so I don't know which commit that translates to here, but the ipfs version is QmY1y2M1aCcVh...).

I've added a few fmt.Printf(...) lines to routing.go/PutValue and routing.go/GetValues. Basically I added a line

fmt.Printf("routing.go/GetValues query reply from %v %q hasval:%v err:%v\n", p, key, rec.GetValue() != nil, err)

Right after the call to dht.getValueOrPeers in routing.go/GetValues. And I added these lines:

if err != nil {
    fmt.Printf("routing.go/PutValue failed putting value to peer %v %q\n", p, err)
} else {
    fmt.Printf("routing.go/PutValue success putting value to peer %v\n", p)
}

Right after the call to dht.putValueToPeer(ctx, p, key, rec) in routing.go/PutValue.

Then I started a program that does the publishing in a loop (with a few seconds between publishes) and got this output. That is, the publisher never actually succeeded in publishing the value to any node (am I reading it correctly?).

Once the first publish in the loop was done, I fired up another program which started resolving that IPNS name (in a loop as well). The output is here. That is, for the first 4 attempts the only peer from which the resolver was able to get the value was the publisher itself (look for the hasval:true string). Later, one more node (ID atonFX) came to know that value as well, although the publisher's PutValue function never seems to have noticed that it succeeded with that peer.

@whyrusleeping
Contributor Author

@inetic that's disconcerting... If your code is right then I'd say your observation is correct. There is a weirdly large number of 'default failure' errors there that really shouldn't be popping up like that.

@whyrusleeping
Contributor Author

In your case, it looks like you are unable to dial to any of the DHT nodes you are supposed to be connecting to.

@whyrusleeping
Contributor Author

@inetic question, how many peers show up in ipfs swarm peers ?

@inetic

inetic commented Apr 26, 2018

There is a weirdly large number of 'default failure' errors there that really shouldn't be popping up like that.

It seems the default failure errors happen for nodes that previously failed with a different error.

@inetic

inetic commented Apr 26, 2018

In your case, it looks like you are unable to dial to any of the DHT nodes you are supposed to be connecting to.

I'll try to modify the ipfs executable directly tomorrow to make sure it's not my misuse of the Go API, as I'm doing direct calls to some of the ipfs functions (1, 2). But in the meantime, do you guys see better connectivity in the PutValue function?

@whyrusleeping
Contributor Author

Currently, my only internet connection is a tethered LTE connection on a 1GB sim card. I'm trying to conserve data.

Maybe @Kubuxu @vyzo or @magik6k could take a look?

@vyzo
Contributor

vyzo commented Apr 26, 2018

@inetic were you testing go-libp2p-kad-dht on its own?

@inetic

inetic commented Apr 27, 2018

@vyzo I did just now; nothing out of the ordinary seems to be there. With these modifications to QmY1y2M1aCcV.../go-libp2p-kad-dht/routing.go:

+++ /tmp/go-libp2p-kad-dht.orig/routing.go	2018-04-27 13:56:08.998398626 +0200
@@ -83,11 +83,6 @@
 
 			err := dht.putValueToPeer(ctx, p, key, rec)
 			if err != nil {
-				fmt.Printf("PutValue failure %q %v %q\n", p, err, key)
-			} else {
-				fmt.Printf("PutValue success %q %q\n", p, key)
-			}
-			if err != nil {
 				log.Debugf("failed putting value to peer: %s", err)
 			}
 		}(p)
@@ -211,7 +206,6 @@
 		})
 
 		rec, peers, err := dht.getValueOrPeers(ctx, p, key)
-		fmt.Printf("getValueOrPeers peercount:%d hasval:%v %v\n", len(peers), rec.GetValue() != nil, err)
 		switch err {
 		case routing.ErrNotFound:
 			// in this case, they responded with nothing,

I'm seeing this output.

@inetic

inetic commented Apr 27, 2018

So I've done the same test with only go-ipfs sources. Basically what I've done is this:

  1. Install a fresh Go 1.9.5
  2. Compile and install the newest go-ipfs from GitHub
  3. Add the fmt.Printf debug printing to routing.go/PutValue
  4. Run the daemon, wait for it to get ready
  5. Run ipfs name publish Qm...

Here is the script in detail (and here is the printf.patch file which the script uses).

Unfortunately, my node still doesn't connect to any other peer when doing PutValue. The log can be found here.

EDIT: Improved the script

@inetic

inetic commented Apr 27, 2018

@whyrusleeping sorry, missed your question yesterday

question, how many peers show up in ipfs swarm peers ?

In the above tests, I started publishing very soon after the daemon printed "Daemon is ready".

Now I modified the script again so that it prints the swarm peers. I waited about a minute to get ~100 peers in the swarm and then started publishing. This time PutValue did succeed with 7 peers (out of 80). Here is the log.

@inetic

inetic commented Apr 27, 2018

One more test: I did the same as above, but this time on a DigitalOcean instance, and I waited for ~200 peers. Unfortunately, it did not succeed in connecting to any peer. Here is the log.

@whyrusleeping
Contributor Author

So something is up here. Either your node isn't able to dial properly for some reason (could be bad addresses in the network, or something) or there is a huge number of undialable nodes in the network.

@vyzo
Contributor

vyzo commented Apr 27, 2018

I tried a test IPNS publish from my node -- it took 2m49s and I am seeing lots of failed putting value to peer errors.

Note I am running master (no printf patches) on my laptop node.

@inetic

inetic commented Apr 27, 2018

Either your node isn't able to dial properly for some reason (could be bad addresses in the network, or something)

That's what I was thinking as well, and why I did the test on the DigitalOcean instance, which has a public IP on the internet. All the other tests I did were on my home PC, which is behind a router.

whyrusleeping added the kind/bug and P1 labels Apr 28, 2018
@whyrusleeping
Contributor Author

Escalating this...

@vyzo
Contributor

vyzo commented Apr 30, 2018

I wrote a debug crawler that crawls the dht and tries to connect to each discovered peer.

The results after an overnight run:

  • 2197 peers discovered
  • 1336 of them are undialable (1155 fail with context deadline exceeded)

In short, 61% of the dht is undialable!

Edit: Up to 2268 peers now, 1391 undialable.
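
For anyone who wants to reproduce this, the dial check can be sketched roughly as follows. This is not the actual crawler: the host construction and the discovered list are stand-ins, and the 60s timeout mirrors the default dial timeout mentioned later in the thread.

package main

import (
	"context"
	"fmt"
	"time"

	libp2p "github.com/libp2p/go-libp2p"
	pstore "github.com/libp2p/go-libp2p-peerstore"
)

func main() {
	ctx := context.Background()

	// One local host to dial from.
	h, err := libp2p.New(ctx)
	if err != nil {
		panic(err)
	}

	// Peers found by walking the DHT; how they are discovered is elided here.
	var discovered []pstore.PeerInfo

	var undialable int
	for _, pi := range discovered {
		cctx, cancel := context.WithTimeout(ctx, 60*time.Second)
		if err := h.Connect(cctx, pi); err != nil {
			undialable++
			fmt.Printf("ERROR %s: %v\n", pi.ID.Pretty(), err)
		} else {
			fmt.Printf("OK %s\n", pi.ID.Pretty())
		}
		cancel()
	}
	fmt.Printf("%d/%d undialable\n", undialable, len(discovered))
}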

@MichaelMure
Contributor

FWIW, I get a similar result from a home network:

799 "ERROR"
503 "OK"

61% undialable.

@whyrusleeping
Contributor Author

Thanks @MichaelMure!

We're working on adding more data collection to this crawler. Current approaches we're thinking about are:

  • Use dial failure data to build a better heuristic for dialable addresses
  • Enable relay advertisement, put up relay infra nodes, and encourage others to do so as well
    • Going this route, we should make relay smarter
  • Have NAT detection libp2p services to aid nodes in discovering their own NAT situation

@florianlenz

florianlenz commented Jun 3, 2018

When I put a value to the DHT, it's only put to 3 out of 20 peers (sometimes more, sometimes less). I think the error I get, panic: routing: not found, is related to this issue. The project I am working on is blocked by it. Is there something I can help with? Wouldn't fetching peers and putting the key to them until we are sure it was put to 20 peers fix the problem?

@whyrusleeping
Contributor Author

@florianlenz yeah, making sure we put it to 20 peers would help, but we need to make sure those are roughly the same 20 peers that someone doing a lookup of the value will contact. I think changing the code slightly to find the '20 closest peers we can connect to' might help dramatically. I'm doing some experiments right now with DHT resolution; let me know if you try anything and figure out something interesting.

@whyrusleeping
Contributor Author

Random thing I tried:

Reducing the dial timeout in go-libp2p-conn from 60s to 5s. It seems to have made things quite a bit faster overall, without reducing the number of records I get back. (I'm testing DHT 'Gets' for IPNS records.)

@whyrusleeping
Contributor Author

Another interesting thing:

I changed the GetClosestPeers logic to only return peers we 'successfully connected' to (by calling host.Connect on each peer before adding it to our 'good set'). For some reason, I'm still getting puts failing because of 'connection reset'. This is really fishy...

cc @Stebalien

@whyrusleeping
Contributor Author

yamux returns ErrConnectionReset when the stream gets reset. Very misleading error message.
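
(For anyone hitting the same thing, a tiny sketch of telling the two failure modes apart. This assumes the whyrusleeping/yamux fork that libp2p used at the time and that the error is not wrapped on its way up; verify both before relying on it.)

package sketch

import (
	yamux "github.com/whyrusleeping/yamux"
)

// isStreamReset reports whether a put failed because the remote reset
// our stream (yamux's misleadingly named "connection reset"), as
// opposed to the underlying connection actually failing.
func isStreamReset(err error) bool {
	return err == yamux.ErrConnectionReset
}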

@whyrusleeping
Contributor Author

Two things.

First off: it turns out that if a peer doesn't have the public key to validate an IPNS record it receives, it just tosses the record and resets the stream. That stream reset was producing the 'connection reset' errors I was seeing, which led me to this bug.
Basically, using an IPNS key other than your peer's key will result in most peers not accepting the record you are putting (aside from the peers that overlap in the keyspace between the closest peers to your record and the closest peers to your public key). This is why both @florianlenz and I were seeing puts only going to about three peers successfully. It's not that we weren't sending the record to them; it's that they were just dropping it. This is something we need to address ASAP.

The other thing: this patch makes things work reasonably well:

diff --git a/lookup.go b/lookup.go
index dd47516..d6f04a5 100644
--- a/lookup.go
+++ b/lookup.go
@@ -10,6 +10,7 @@ import (
 	pb "github.com/libp2p/go-libp2p-kad-dht/pb"
 	kb "github.com/libp2p/go-libp2p-kbucket"
 	peer "github.com/libp2p/go-libp2p-peer"
+	pstore "github.com/libp2p/go-libp2p-peerstore"
 	notif "github.com/libp2p/go-libp2p-routing/notifications"
 )

@@ -100,13 +101,19 @@ func (dht *IpfsDHT) GetClosestPeers(ctx context.Context, key string) (<-chan pee

 		if res != nil && res.finalSet != nil {
 			sorted := kb.SortClosestPeers(res.finalSet.Peers(), kb.ConvertKey(key))
-			if len(sorted) > KValue {
-				sorted = sorted[:KValue]
-			}

+			var found int
 			for _, p := range sorted {
-				out <- p
+				if err := dht.host.Connect(ctx, pstore.PeerInfo{ID: p}); err == nil {
+					found++
+					out <- p
+				}
+				if found >= KValue {
+					break
+				}
 			}
 		}
 	}()

Basically "Get me the K closest peers that I can actually connect to".
There's definitely room for some parallelism here (see the sketch below), and we should also do some analysis on this, but in general, using this, my IPNS resolves get all 16 expected values back, and they get them back pretty fast (especially on repeated resolves).
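
For the parallelism, one possible shape is to dial the sorted candidates in small windows and emit successes in closeness order until we have KValue of them. This is only a sketch: it slots into the same GetClosestPeers goroutine as the patch above (reusing ctx, sorted, out, dht and KValue), needs "sync" imported, and dialWindow is a made-up knob.

const dialWindow = 8

var found []peer.ID
for i := 0; i < len(sorted) && len(found) < KValue; i += dialWindow {
	end := i + dialWindow
	if end > len(sorted) {
		end = len(sorted)
	}
	window := sorted[i:end]
	ok := make([]bool, len(window))

	// Dial this window of candidates concurrently.
	var wg sync.WaitGroup
	for j, p := range window {
		wg.Add(1)
		go func(j int, p peer.ID) {
			defer wg.Done()
			if err := dht.host.Connect(ctx, pstore.PeerInfo{ID: p}); err == nil {
				ok[j] = true
			}
		}(j, p)
	}
	wg.Wait()

	// Emit the reachable ones, preserving closeness order.
	for j, p := range window {
		if ok[j] && len(found) < KValue {
			found = append(found, p)
			out <- p
		}
	}
}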

@whyrusleeping
Contributor Author

(brb, getting some dinner)

@whyrusleeping
Contributor Author

So it appears that the lack of access to public keys is a pretty significant issue here... If we had already switched over to embeddable ed25519 keys we would be okay, but with RSA it's a problem.

One option would be to augment the DHT codebase in a pretty ugly way so that PutValue for IPNS records first puts the public key to the peer and then puts the IPNS record. This is the easiest change to make from a protocol-design standpoint, but actually writing that code will be... rather unpleasant.

Another option would be to add the public key to the IPNS record itself so that we guarantee the peer has it. This would require no changes to the DHT codebase, but would require some changes in the IPNS validator to handle that. This change is easier from the implementation perspective, but requires pretty much everyone to update their nodes for it to take effect.

I'm leaning towards the first choice (sketched below), but it will require some code mangling. I'll have a chat with @Stebalien when he wakes up; we need to get this fixed ASAP.
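
Roughly, the first option amounts to something like the sketch below. None of this is the actual implementation: putPublicKeyFirst and valuePutter are made up for illustration, and the only assumption taken from the protocol is that public keys live under the "/pk/<peer ID>" DHT key.

package sketch

import (
	"context"

	ci "github.com/libp2p/go-libp2p-crypto"
	peer "github.com/libp2p/go-libp2p-peer"
)

// valuePutter is the minimal slice of the routing interface this sketch
// needs; it mirrors PutValue from go-libp2p-routing.
type valuePutter interface {
	PutValue(ctx context.Context, key string, val []byte) error
}

// putPublicKeyFirst pushes the signer's public key into the DHT before
// pushing the IPNS record itself, so that validators which don't already
// have the key can verify the record instead of dropping it.
func putPublicKeyFirst(ctx context.Context, r valuePutter, priv ci.PrivKey, pid peer.ID, ipnsKey string, rec []byte) error {
	pkBytes, err := priv.GetPublic().Bytes()
	if err != nil {
		return err
	}

	// Public keys are stored under "/pk/<peer ID>".
	if err := r.PutValue(ctx, "/pk/"+string(pid), pkBytes); err != nil {
		return err
	}

	// Now the record itself, under its "/ipns/..." key.
	return r.PutValue(ctx, ipnsKey, rec)
}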

@florianlenz

@whyrusleeping really nice work :) Would it make sense to drop the connection after connecting? A host could end up having a lot of connections when doing a lot of lookups.

@whyrusleeping
Contributor Author

I think it's probably best to leave that up to the connection manager. It will be able to make decisions about connections informed by a broader picture than just the DHT.

@whyrusleeping
Contributor Author

@florianlenz question, which version of the DHT code are you testing with?

@florianlenz

I tested with version 4.0.4 (QmSBxn1eLMdViZRDGW9rRHRYwtqq5bqUgipqTMPuTim616)

@whyrusleeping
Contributor Author

Ah, interesting. So you weren't running with @Stebalien's latest changes (which resulted in gx version 4.1.0). There are some changes there that make this process work a little differently. Thanks for the data point :)

@whyrusleeping
Contributor Author

This patch should help for ipns publishing in general: ipfs/kubo#5079
It will at least stop certain peers from throwing away the record you are trying to put (once peers update to that version of the code :/ )

The other, bigger problem I see is that (as discussed previously) 'GetClosestPeers' currently returns the 20 closest peers it has heard of. I have a patch locally that gathers all the 'close' peers, sorts them, then tries connecting to them in order of closeness. Once it succeeds in connecting to 20 of them, it returns those 20. This results in a very well working IPNS (but my patch is pretty ugly). Most runs go through 40-70 peers before they find 20 connectable ones (implying between 50% and 30% connectivity, which roughly matches what our earlier experiments showed).

I'm going to try out a patch where we try to relay every connection and see how that affects those ratios.

@whyrusleeping
Contributor Author

Automatically trying to relay every connection definitely lowers the 'miss rate' on connecting to the closest peers. The last few tests I ran show it about 30% lower (we can connect to 30% more peers than before). But for some currently unknown reason, overall performance is pretty degraded compared to not trying relays. I also end up connected to a lot more peers at the end of a run than before, >300 as opposed to ~100... very odd.

@Stebalien
Member

@whyrusleeping what's the status of your closest peers patch?

@whyrusleeping
Contributor Author

@Stebalien should be much better: #159 (merged)

I haven't had great internet since then, but I do now, and can start debugging things a bit more. In any case though, we should start pushing these changes out and cut a new ipfs release. Let's get it all deployed!

@Stebalien
Member

Got it. I asked because I tried publishing an IPNS entry on two different temporary nodes and:

  1. On the first node, multiple publishes all succeeded but my other node only ever saw the first value (even after, e.g., restarting).
  2. On the second node, I was never able to complete a single publish.

@whyrusleeping
Contributor Author

@Stebalien was this on a testnet of nodes all using the proper value, or was this on the main network?

Getting back an old value seems a bit odd though; we should investigate that.

@Stebalien
Member

Main network.

@whyrusleeping
Contributor Author

whyrusleeping commented Jun 21, 2018 via email

@Stebalien
Member

Oh. Yeah, I forgot about that.

stebalien slinks off into a corner to quietly continue breaking everything

anacrolix self-assigned this Jan 17, 2019
anacrolix added the status/ready label Jan 22, 2019
anacrolix removed their assignment Dec 10, 2019
@markg85

markg85 commented Dec 17, 2019

Hi,

A couple of hours ago I made issue ipfs/kubo#6800 (name resolve takes exactly 1 minute and loses the cache after 30 seconds). That issue is closed, which is fine. But reading through this whole thread, I'm led to believe this was (or is?) the actual core issue behind the observed behavior.

Now this does reference issue filecoin-project/venus#437 (filecoin? what does that have to do with this very issue? anyhow), where it's said to no longer be an issue (that was on Feb. 27, 2019) and where it was subsequently closed. This issue, however, remained open.

And then I read much of ipfs/kubo#3860, where it eventually turns out to be a feature in its current form (if I'm reading it correctly).

So now I'm totally lost as to whether the behavior I see in ipfs name resolve (which takes exactly 1 minute for new publishes) is by design and working properly (which would mean the issue in this thread is resolved too), or whether there is still an issue, as I can't imagine 1 minute is by design and "proper".

Please note, I'm trying to access a newly published website, so I'm using a browser to access that IPNS link. Command-line trickery to speed it up won't work in the browser. I'm guessing I have to modify IPFS Companion if I want to try the same command-line trickery in the browser?

I'd happily try out a go-ipfs 0.5.0 alpha + an IPFS Companion to go along with it, to see if the actual issue is resolved (i.e., a fast working IPNS), but I have no clue what I'm even supposed to install now. So just a friendly question: is this issue resolved in the current master that is going to become 0.5.0? Can I help (by testing stuff out, and how)?

@aschmahmann
Contributor

@markg85 your question about the current state of IPNS is better asked on discuss.ipfs.io than here. If you start a thread there, I'll happily go through why things work the way they currently do, some of the ongoing work to improve performance, and some areas where having help would be great (e.g. testing).

@markg85

markg85 commented Dec 17, 2019

@aschmahmann that's a weird and concerning request. I'm only wondering how it's supposed to work because of the said "perceived bugs" and the "hints" I'm catching elsewhere that this is actually a feature. This very bug report should be the place, as it's the root cause. If I move to discuss, it will also add yet another place where "ipns performance" is discussed, when it's already been discussed in so many different places (multiple threads on discuss, multiple GitHub issues, and even branches). Are you sure you want that?

If you still say yes to this, then I'll happily open a new topic there; I'm just trying to prevent even more thinning out of information in this regard.

@markg85

markg85 commented Feb 17, 2020

Hi,

Any update on this one? I tried using IPFS again on a test site, which now works awesomely with an ipfs link in the DNS record. An ipns link still seemed way too ambitious (i.e., so slow that it times out). That being said, the latest IPFS release really works loads faster for me for ipfs data. While off-topic here, thanks a lot for that :)

But on topic.
The initial issue here is that the PutValue used for IPNS is not putting the value at all, or not to enough nodes, which subsequently causes an IPNS lookup to be as slow as the timeout set for ipfs publish (1 minute by default), as it can't find enough nodes to get the latest link.

Now, this issue is a prio 1 (it is tagged as such), so I'm really curious whether this is on the radar for 0.5.0?

@Stebalien
Member

This specific issue will be fixed in the next go-ipfs release, but the relevant changes haven't been merged to master yet (currently in testing).

Changes:

  1. We put to the K closest peers we can actually connect to.
  2. Undialable (e.g., behind NATs) peers won't join the DHT.
  3. We search until we find the K closest peers and then stop instead of backtracking indefinitely. Then, we return whatever results we get regardless of how many results we see.

Unfortunately, a large portion of this depends on (2), which means the new code won't actually perform any better than the existing code until it's widely deployed on the network. The current plan is:

  • Run two DHTs, the old kind of crappy one and a shiny new fast one.
  • New nodes query the fast DHT first, falling back on the old DHT.
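
On the client side the fallback might look roughly like this; dhtV2/dhtV1 and the 10-second budget are purely illustrative, not the actual go-ipfs wiring:

package sketch

import (
	"context"
	"time"
)

// valueGetter mirrors the GetValue shape of the routing interface; both
// DHT handles below are assumed to implement it.
type valueGetter interface {
	GetValue(ctx context.Context, key string) ([]byte, error)
}

// getWithFallback queries the new (fast) DHT first with a short budget
// and falls back on the old DHT if that fails.
func getWithFallback(ctx context.Context, dhtV2, dhtV1 valueGetter, key string) ([]byte, error) {
	fastCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	if val, err := dhtV2.GetValue(fastCtx, key); err == nil {
		return val, nil
	}

	// Fall back on the old DHT with the caller's full deadline.
	return dhtV1.GetValue(ctx, key)
}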

@markg85

markg85 commented Feb 17, 2020

@Stebalien that sounds like a lot of good news! Thank you very much!

But I do wonder what you mean by that last part.

I'm assuming that once enough nodes switch to 0.5.0, IPNS, in this regard, just slowly gets better as the new code path is used by more and more nodes, right?

By running 2 DHT nodes, do you mean one IPFS instance before 0.5.0 and one at (or above) 0.5.0? Or will there be a config option to pick different DHT versions?

@aschmahmann
Contributor

I'm assuming that once enough nodes switch to 0.5.0, IPNS, in this regard, just slowly gets better as the new code path is used by more and more nodes, right?

Yes, although if a provider and consumer of an IPNS record both upgrade to go-ipfs 0.5.0 they will immediately get benefits without waiting for the rest of the network to upgrade.

By running 2 DHT nodes, do you mean one IPFS instance before 0.5.0 and one at (or above) 0.5.0? Or will there be a config option to pick different DHT versions?

@markg85 the plan is that your go-ipfs 0.5.0 instance will be able to participate in a DHT with nodes running the current protocol (kad v1) while simultaneously participating in a DHT with the new protocol (kad v2). This will allow users of go-ipfs 0.5.0 to continue to find peers and information that are only in the v1 DHT while still benefiting from the fast-path of using the v2 DHT.
