PutValue appears to not put to K peers #139
I've been testing go-ipfs as it currently is in master (sorry, I'm still quite new to go/gx so I don't know which commit that translates to here, but the ipfs version is QmY1y2M1aCcVh...). I've added a few
Right after the call to
Then I started a program that does the publishing in a loop (with a few seconds between publishes) and got this output. That is, the publisher never actually succeeded in publishing the value to any node (am I reading it correctly?). Once the first publish in the loop had completed, I fired up another program which started resolving that IPNS name (in a loop as well). The output is here. That is, the only peer from which the resolver was able to get the value the first 4 times was the publisher itself (look for the
@inetic that's disconcerting... If your code is right then I'd say your observation is correct. There are a weirdly large number of 'default failure' errors there that really shouldn't be popping up like that.
In your case, it looks like you are unable to dial to any of the DHT nodes you are supposed to be connecting to.
@inetic question, how many peers show up in
It seems the
I'll try to modify the
@inetic were you testing
@vyzo I did just now; nothing out of the ordinary seems to be there. With these modifications to
I'm seeing this output.
So I've done the same test with only go-ipfs sources. Basically what I've done is this:
Unfortunately, my node still doesn't connect to any other peer when publishing.
EDIT: Improved the script.
@whyrusleeping sorry, missed your question yesterday
In the above tests, I started publishing very soon after the daemon printed "Daemon is ready". Now I modified the script again so that it prints the swarm peers; I waited about a minute to get ~100 peers in the swarm and then started publishing. This time the
One more test: I did the same as above, but this time I did it on a DigitalOcean instance and waited for ~200 peers. Unfortunately, it did not succeed in connecting to any peer. Here is the log.
So something is up here. Either your node isn't able to dial properly for some reason (could be bad addresses in the network, or something), or there are a huge number of undialable nodes in the network.
I tried a test IPNS publish from my node -- it took 2m49s and I am seeing lots of errors. Note that I am running master (no printf patches) on my laptop node.
That's what I was thinking as well, and why I did the test on the DigitalOcean instance, which has an IP open to the internet. All the other tests I did were on my home PC, which is behind a router.
Escalating this...
I wrote a debug crawler that crawls the dht and tries to connect to each discovered peer. The results after an overnight run:
In short, 61% of the dht is undialable! Edit: up to 2268 peers now, 1391 undialable.
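For reference, the dial-check part of such a crawler can be approximated with a few lines of go-libp2p code. This is only a sketch under the assumption of the libp2p packages of that era (go-libp2p-host / go-libp2p-peerstore); the actual crawler may be structured differently, and checkDialable is a made-up helper name.

```go
package dialcheck

import (
	"context"
	"time"

	host "github.com/libp2p/go-libp2p-host"
	pstore "github.com/libp2p/go-libp2p-peerstore"
)

// checkDialable attempts a bounded dial to every peer discovered while walking
// the DHT and tallies how many turn out to be unreachable.
func checkDialable(ctx context.Context, h host.Host, peers []pstore.PeerInfo) (ok, failed int) {
	for _, pi := range peers {
		cctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		err := h.Connect(cctx, pi)
		cancel()
		if err != nil {
			failed++ // counts toward the "undialable" percentage reported above
			continue
		}
		ok++
	}
	return ok, failed
}
```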
FWIW, I get a similar result from a home network:
61% undialable.
Thanks @MichaelMure! We're working on adding more data collection to this crawler. Current approaches we're thinking about are:
When I put a value to the DHT it's only put to 3 out of 20 peers (sometimes more, sometimes less). I think the error I get
@florianlenz yeah, making sure we put it to 20 peers would help, but we need to make sure those are roughly the same 20 peers that someone doing a lookup of the value will contact. I think changing the code slightly to find the '20 closest peers we can connect to' might help dramatically. I'm doing some experiments right now with DHT resolution; let me know if you try anything and figure something interesting out.
Random thing I tried: reducing the dial timeout in go-libp2p-conn from 60s to 5s. It seems to have made things quite a bit faster overall, without reducing the number of records I get back. (I'm testing DHT 'Gets' for ipns records.)
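The change above was to the timeout inside go-libp2p-conn itself; a similar effect can be approximated per dial with a context deadline, as in this hedged sketch (connectWithTimeout is a made-up helper name, and the imports assume the go-libp2p packages of that era):

```go
package dialcheck

import (
	"context"
	"time"

	host "github.com/libp2p/go-libp2p-host"
	pstore "github.com/libp2p/go-libp2p-peerstore"
)

// connectWithTimeout gives a single dial a 5s budget instead of the 60s
// transport default, so unreachable peers fail fast during a DHT walk.
func connectWithTimeout(ctx context.Context, h host.Host, pi pstore.PeerInfo) error {
	cctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return h.Connect(cctx, pi)
}
```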
Another interesting thing: I changed the
cc @Stebalien
yamux returns
Two things. First off: it turns out that if a peer doesn't have the public key to validate an ipns record it receives, it just tosses the record and resets the stream. That stream reset was producing the 'connection reset' errors I was seeing, which led me to this bug. The other thing:

```diff
diff --git a/lookup.go b/lookup.go
index dd47516..d6f04a5 100644
--- a/lookup.go
+++ b/lookup.go
@@ -10,6 +10,7 @@ import (
 	pb "github.com/libp2p/go-libp2p-kad-dht/pb"
 	kb "github.com/libp2p/go-libp2p-kbucket"
 	peer "github.com/libp2p/go-libp2p-peer"
+	pstore "github.com/libp2p/go-libp2p-peerstore"
 	notif "github.com/libp2p/go-libp2p-routing/notifications"
 )
@@ -100,13 +101,19 @@ func (dht *IpfsDHT) GetClosestPeers(ctx context.Context, key string) (<-chan pee
 		if res != nil && res.finalSet != nil {
 			sorted := kb.SortClosestPeers(res.finalSet.Peers(), kb.ConvertKey(key))
-			if len(sorted) > KValue {
-				sorted = sorted[:KValue]
-			}
+			var found int
 			for _, p := range sorted {
-				out <- p
+				if err := dht.host.Connect(ctx, pstore.PeerInfo{ID: p}); err == nil {
+					found++
+					out <- p
+				}
+				if found >= KValue {
+					break
+				}
 			}
 		}
 	}()
```

Basically: "Get me the K closest peers that I can actually connect to."
(brb, getting some dinner)
So it appears that the lack of access to public keys is a pretty significant issue here... If we had already started switching over to the embeddable ed25519 keys, we would be okay, but with RSA it's a problem. One option would be to augment the DHT codebase in a pretty ugly way to make
Another option would be to add the public key to the ipns record itself so that we guarantee the peer has it. This would require no changes to the DHT codebase, but would require some changes in the IPNS validator to handle that. This change is easier from the implementation perspective, but requires pretty much everyone to update their nodes for it to take effect. I'm leaning towards the first choice, but it will require some code mangling. I'll have a chat with @Stebalien when he wakes up; we need to get this fixed ASAP.
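For illustration only, the shape of that first option might look something like the sketch below (pubKeyFor is a hypothetical helper; the imports assume the newer go-libp2p-core layout). Embeddable ed25519 keys can be recovered straight from the peer ID, while RSA keys force a network lookup via the DHT's GetPublicKey:

```go
package pubkeys

import (
	"context"

	crypto "github.com/libp2p/go-libp2p-core/crypto"
	peer "github.com/libp2p/go-libp2p-core/peer"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// pubKeyFor returns the public key needed to validate an ipns record from p.
// Embeddable (ed25519) keys come straight out of the peer ID; anything else
// (e.g. RSA) falls back to a DHT lookup, which is the part that can fail today.
func pubKeyFor(ctx context.Context, d *dht.IpfsDHT, p peer.ID) (crypto.PubKey, error) {
	if pk, err := p.ExtractPublicKey(); err == nil {
		return pk, nil // key was embedded in the peer ID
	}
	return d.GetPublicKey(ctx, p) // otherwise fetch it from the network
}
```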
@whyrusleeping really nice work :) Would it make sense to drop the connection after connecting? A host could end up having a lot of connections when doing a lot of lookups.
I think it's probably best to leave that up to the connection manager. It will be able to make decisions about connections informed by a broader picture than just the DHT.
@florianlenz question: which version of the DHT code are you testing with?
I tested with version 4.0.4 (QmSBxn1eLMdViZRDGW9rRHRYwtqq5bqUgipqTMPuTim616)
Ah, interesting. So you weren't running with @Stebalien's latest changes (which resulted in gx version 4.1.0). There are some changes there that make this process work a little differently; thanks for the data point :)
This patch should help for ipns publishing in general: ipfs/kubo#5079. The other, bigger problem I see is that (as discussed previously) 'GetClosestPeers' currently returns the 20 closest peers it has heard of. I have a patch locally that gathers all the 'close' peers, sorts them, then tries connecting to them in order of closeness. Once it succeeds in connecting to 20 of them, it returns those 20. This results in very well working ipns (but my patch is pretty ugly). Most runs go through 40-70 peers before they find 20 connectable ones (implying between 50% and 30% connectivity, which roughly matches what our earlier experiments showed). I'm going to try out a patch where we try to relay every connection and see how that affects those ratios.
Automatically trying to relay every connection definitely lowers the 'miss rate' on connecting to closest peers. The last few tests I ran show it about 30% lower (we can connect to 30% more peers than before). But for some currently-unknown reason, overall performance is pretty degraded compared to not trying relays. I also end up connected to a lot more peers at the end of a run than before, >300 as opposed to ~100 before... very odd.
@whyrusleeping what's the status of your closest peers patch?
@Stebalien It should be much better: #159 (merged). I haven't had great internet since then, but I do now, and can start debugging things a bit more. In any case, we should start pushing these changes out and cut a new ipfs release. Let's get it all deployed!
Got it. I asked because I tried publishing an IPNS entry on two different temporary nodes and:
@Stebalien was this on a testnet of nodes all using the proper value, or was this on the main network? Getting back an old value seems a bit odd, though; we should investigate that.
Main network.
Yeah, you're unlikely to succeed much at all trying to publish to the main network right now.
Oh. Yeah, I forgot about that. *stebalien slinks off into a corner to quietly continue breaking everything*
Hi,
A couple of hours ago I made issue ipfs/kubo#6800 (name resolve takes exactly 1 minute and loses cache after 30 seconds). That issue is closed, which is fine. But reading through this whole thread, I'm led to believe this was (or is?) the actual core issue behind the observed behavior.
Now this does reference issue filecoin-project/venus#437 (filecoin?.. what does that have to do with this very issue?.. anyhow), where it was said to no longer be an issue (that was on Feb. 27, 2019) and was subsequently closed. This issue, however, remained open. And then I read much of ipfs/kubo#3860, where it eventually turns out to be a feature in its current form (if I'm reading it correctly).
So now I'm totally lost as to whether the behavior I see in ipfs name resolve (which takes exactly 1 minute for new publishes) is by design and working properly (thus meaning the issue in this thread is resolved too), or whether there is still an issue, as I can't imagine 1 minute is by design and "proper".
Please note, I'm trying to access a newly published website, so I'm using a browser to access that ipns link. Command line trickery to speed it up won't work in the browser. I'm guessing I have to modify IPFS Companion if I want to try the same command line trickery in the browser?
I'd happily try out a go-ipfs 0.50.0 alpha + IPFS Companion to go along with it to see if the actual issue is resolved (aka, have a fast working IPNS), but I have no clue what I'm even supposed to install now. So just a friendly question: Is this issue resolved in the current master that is going to be 0.50.0? Can I help? (By testing stuff out, and how?)
@markg85 your question about the current state of IPNS is better asked on discuss.ipfs.io than here. If you start a topic there I'll happily go through why things work the way they currently do, some of the ongoing work to improve performance, and some areas where having some help would be great (e.g. testing).
@aschmahmann that's a weird and concerning request. I'm only wondering how it's supposed to work because of said "perceived bugs" and the "hints" I'm catching elsewhere that this is actually a feature. This very bug report should be the place, as it's the root cause. If I move to discuss, it will also add yet another place where "ipns performance" is discussed, when it's already been discussed in so many different places (multiple threads on discuss, multiple github issues, and even branches). Are you sure you want that? If you still say yes to this then I'll happily open a new topic there; I'm just trying to prevent even more thinning out of information in this regard.
Hi,
Any update on this one? I tried using IPFS again on some test site, which now works awesomely with an
But on topic: this issue is a prio 1 (it is tagged as such), so I'm really curious if it is on the radar for 0.50.0?
This specific issue will be fixed in the next go-ipfs release, but the relevant changes haven't been merged to master yet (currently in testing). Changes:
Unfortunately, a large portion of this depends on item 2, which means the new code won't actually perform any better than the existing code until it's widely deployed on the network. The current plan is:
@Stebalien that sounds like a lot of good news! Thank you very much! But I do wonder what you mean by that last part. I'm assuming that once enough nodes switch to 0.50.0, IPNS in this regard just slowly gets better as the new code path is used by more and more nodes, right? By running 2 DHT nodes, do you mean one IPFS instance before 0.50.0 and one at (or above) 0.50.0? Or will there be a config option to pick different DHT versions?
Yes, although if a provider and consumer of an IPNS record both upgrade to go-ipfs 0.5.0 they will immediately get benefits without waiting for the rest of the network to upgrade.
@markg85 the plan is that your go-ipfs 0.5.0 instance will be able to participate in a DHT with nodes running the current protocol (kad v1) while simultaneously participating in a DHT with the new protocol (kad v2). This will allow users of go-ipfs 0.5.0 to continue to find peers and information that are only in the v1 DHT while still benefiting from the fast-path of using the v2 DHT.
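Purely as an illustration of that idea, not the actual go-ipfs 0.5.0 wiring: with the ProtocolPrefix option in current go-libp2p-kad-dht, one could instantiate two DHTs on the same host under different protocol IDs, so the node answers both the old and the new protocol. The "/ipfs2" prefix below is hypothetical.

```go
package dualdht

import (
	"context"

	"github.com/libp2p/go-libp2p-core/host"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// newDualDHTs sketches running a legacy (v1) and a new (v2) DHT side by side
// on one host; real go-ipfs wires this differently, this only shows the idea.
func newDualDHTs(ctx context.Context, h host.Host) (*dht.IpfsDHT, *dht.IpfsDHT, error) {
	v1, err := dht.New(ctx, h, dht.ProtocolPrefix("/ipfs")) // existing kad v1 namespace
	if err != nil {
		return nil, nil, err
	}
	v2, err := dht.New(ctx, h, dht.ProtocolPrefix("/ipfs2")) // hypothetical kad v2 namespace
	if err != nil {
		return nil, nil, err
	}
	return v1, v2, nil
}
```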
In order for queries to function properly, PutValue should ensure it puts out to K (20) peers. It currently seems that this is not happening, and that a much smaller number of records (less than 16, our GetValue threshold) is actually being put.
This was brought up in ipfs/kubo#3860
We should investigate and fix this.
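As a rough sketch of the desired behaviour (putRecordTo is a hypothetical stand-in for the DHT's internal single-peer put; GetClosestPeers is real API, using the channel-returning signature shown in the diff earlier in this thread): a put should only be counted when it actually lands on a peer, and at least K of them are needed so that a later GetValue quorum of 16 can be met.

```go
package putsketch

import (
	"context"

	dht "github.com/libp2p/go-libp2p-kad-dht"
	peer "github.com/libp2p/go-libp2p-peer"
)

// putToKClosest fans a record out to the K closest peers and reports how many
// puts actually succeeded; anything well below K=20 explains failed resolves.
func putToKClosest(ctx context.Context, d *dht.IpfsDHT, key string, rec []byte) (int, error) {
	peers, err := d.GetClosestPeers(ctx, key) // up to K candidate peer IDs
	if err != nil {
		return 0, err
	}
	succeeded := 0
	for p := range peers {
		if err := putRecordTo(ctx, d, p, key, rec); err == nil {
			succeeded++
		}
	}
	return succeeded, nil
}

// putRecordTo stands in for the DHT's unexported single-peer PUT_VALUE RPC.
func putRecordTo(ctx context.Context, d *dht.IpfsDHT, p peer.ID, key string, rec []byte) error {
	// ... send the record to p and wait for the response ...
	return nil
}
```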