Start server fails with "error validating peerURLs" (race?) #10626
How do you start the 3 members? (As members of a new cluster, or as new members joining an existing cluster?) Can you share the related parameters you use when starting each member?
@jingyih I start H1 as a new cluster. H2 and H3 join the existing cluster. Happy to share any parameters you want.
Is it possible for you to configure all 3 members as a new cluster? I think this is the recommended way of starting a cluster with multiple members. Here is an example: lines 2 to 4 in c7c6894.
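As a rough illustration of that recommendation using the embed package the reporter is working with, a minimal sketch follows. Every name, port, and data directory here is illustrative and not taken from the linked example:

```go
package main

import (
	"log"
	"net/url"

	"github.com/coreos/etcd/embed" // import path for the v3.3.x series
)

// startNewClusterMember boots one member whose InitialCluster already lists
// all three peers, so no runtime member-add (and no join race) is involved.
func startNewClusterMember(name, peerURL, clientURL, initialCluster string) (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Name = name
	cfg.Dir = name + ".etcd"
	pu, err := url.Parse(peerURL)
	if err != nil {
		return nil, err
	}
	cu, err := url.Parse(clientURL)
	if err != nil {
		return nil, err
	}
	cfg.LPUrls, cfg.APUrls = []url.URL{*pu}, []url.URL{*pu}
	cfg.LCUrls, cfg.ACUrls = []url.URL{*cu}, []url.URL{*cu}
	cfg.InitialCluster = initialCluster
	cfg.ClusterState = embed.ClusterStateFlagNew // "new": bootstrap, don't join
	return embed.StartEtcd(cfg)
}

func main() {
	ic := "h1=http://127.0.0.1:2380,h2=http://127.0.0.1:2480,h3=http://127.0.0.1:2580"
	for _, m := range []struct{ name, peer, client string }{
		{"h1", "http://127.0.0.1:2380", "http://127.0.0.1:2379"},
		{"h2", "http://127.0.0.1:2480", "http://127.0.0.1:2479"},
		{"h3", "http://127.0.0.1:2580", "http://127.0.0.1:2579"},
	} {
		if _, err := startNewClusterMember(m.name, m.peer, m.client, ic); err != nil {
			log.Fatal(err)
		}
	}
	select {} // keep the members running
}
```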
You are right about the racing part. If you create the cluster your way, the second member will take some time to catch up on applying the raft log entries replicated from member 1. Therefore there is a short period of time during which member 2 does not have the most recent view of the cluster.
@jingyih (Please let me know if this understanding is correct...) I can't create H2 and H3 as a new cluster, because if I did, I couldn't use the cluster until they all joined: H1 would be expecting two more members to be part of its quorum. Rephrasing the problem: imagine that I start up H1, use it for ten minutes, and THEN decide to add H2 and H3 rapidly. My understanding is that H2 and H3 would need to use the existing-cluster flag. (If there is a signal I can wait for to know when the server is caught up to the cluster state, that would be fine.) I'm using the golang embed package. Currently, after I run (code link elided), I wait on (code link elided) and (code link elided). Is there another signal I should wait for to know the cluster is caught up? Thanks!
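For context, the usual wait-for-ready shape with the embed package looks like the sketch below. Since the linked snippets were not preserved, this is an assumption about the code's shape, built from the channels embed actually exposes (ReadyNotify, Err):

```go
import (
	"errors"
	"time"

	"github.com/coreos/etcd/embed"
)

// startAndWait starts an embedded etcd and blocks until it is ready to
// serve client requests, it fails, or a timeout elapses.
func startAndWait(cfg *embed.Config) (*embed.Etcd, error) {
	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, err
	}
	select {
	case <-e.Server.ReadyNotify():
		return e, nil // closed when the server can serve client requests
	case err := <-e.Err():
		return nil, err // fatal errors while serving are reported here
	case <-time.After(60 * time.Second):
		e.Server.Stop() // trigger a shutdown rather than waiting forever
		return nil, errors.New("server took too long to become ready")
	}
}
```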
Yes, in your use case, H2 and H3 need to use the existing-cluster flag. Could you share the member add commands and the etcd startup parameters, so that other contributors and I can try to reproduce and investigate?
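For readers following along, a hedged sketch of the add-then-join flow being discussed, using clientv3 and embed. Every URL, name, and directory is illustrative, not taken from the reporter's setup:

```go
import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/embed"
)

// addAndJoin registers h2 with the running cluster through h1, then starts
// h2 with the existing-cluster state. This is the racy window under
// discussion: h2 boots shortly after the MemberAdd conf change.
func addAndJoin() (*embed.Etcd, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // h1's client URL
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return nil, err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	_, err = cli.MemberAdd(ctx, []string{"http://127.0.0.1:2480"}) // h2's peer URL
	cancel()
	if err != nil {
		return nil, err
	}

	cfg := embed.NewConfig()
	cfg.Name = "h2"
	cfg.Dir = "h2.etcd"
	// InitialCluster must list the existing member(s) plus the new one.
	cfg.InitialCluster = "h1=http://127.0.0.1:2380,h2=http://127.0.0.1:2480"
	cfg.ClusterState = embed.ClusterStateFlagExisting // join, don't bootstrap
	// LPUrls/APUrls/LCUrls/ACUrls setup omitted for brevity.
	return embed.StartEtcd(cfg)
}
```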
@jingyih Yes, you can reproduce (modulo races) very easily. I'm using: (command elided). Run: (command elided). Run: (command elided). To test: (command elided). That's how you start up H1, H2, and H3. Do H3 very quickly after you run H2.
Could you provide a script that does the same procedure but uses etcd directly? From the script you provided, it is not clear how etcd is used.
@jingyih A big golang program is such a script :) It would not be easy to write a standalone version. But the basic algorithm is described at the top here: https://github.com/purpleidea/mgmt/blob/a87853508a12b478a8825c4f16ae33ddec0765e6/etcd/etcd.go#L31 and the actual server startup happens here: https://github.com/purpleidea/mgmt/blob/a87853508a12b478a8825c4f16ae33ddec0765e6/etcd/server.go#L161 Let me know if you need anything else.
PS: In #10626 (comment): (quoted snippet elided). Thanks!
I think you are looking at the correct signal. However, if I understand correctly, it is not directly related to when the config change entry gets applied. But first I should probably verify whether what I described earlier is what actually happened in your case.
@jingyih I appreciate that you're looking into this; please let me know if I can help in any way. Ultimately I want clustering of etcd to be automatic, so hopefully this helps get us one step closer :)
Possibly related: I was able to trigger a panic (just once so far) by adding a member fairly quickly. I'm using v3.3.12. (panic trace elided)
I am still trying to understand more about the server booting process. So far, this is my understanding of the issue. (Not sure if this is by design due to bootstrapping needs.) In the current implementation, serving reads directly from this local member structure could include stale information: etcd/etcdserver/api/membership/cluster.go, line 56 in c7c6894.
For example, in a two-node cluster where node 1 is the raft leader, a (rest of the example elided). I did not actually reproduce the issue using your software, so this is my speculation based on code reading.
We can get this fixed if my speculation is confirmed by reproducing the issue. Currently my bandwidth is very limited; I may revisit this issue later. For now, adding some delay will help mitigate the issue.
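One hedged way to implement that delay-based mitigation is to retry the start when this specific validation error comes back. Matching on the error string is a fragile assumption of mine, not a supported API:

```go
import (
	"strings"
	"time"

	"github.com/coreos/etcd/embed"
)

// startWithRetry retries StartEtcd a few times with a delay, on the
// assumption that the remote members only need a moment to finish applying
// the membership conf-change entry.
func startWithRetry(cfg *embed.Config) (*embed.Etcd, error) {
	var lastErr error
	for i := 0; i < 5; i++ {
		e, err := embed.StartEtcd(cfg)
		if err == nil {
			return e, nil
		}
		// Fragile: keyed off the error text seen in this issue's title.
		if !strings.Contains(err.Error(), "error validating peerURLs") {
			return nil, err // unrelated failure; don't retry
		}
		lastErr = err
		time.Sleep(2 * time.Second) // crude stand-in for "wait until caught up"
	}
	return nil, lastErr
}
```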
@jingyih If I understand this correctly, you are describing a possible bug in etcd itself, correct? If so, perhaps this could be easily fixed by having a signal that (rest of the sentence elided). This would be helpful in many situations too, I think. With such a signal, I could wait for it and test whether it removes the race.
I did some more extensive testing today. I think that if I run member add and then wait long enough before starting the new server, this is not an issue. It's possible that it's a race in my code as well. In any case, I think this could also be solved by #10537 (comment), specifically: (quoted text elided). That way, the PUT operation that a new client is watching (to tell it to start up) can be done at the same time as the MemberAdd operation. I will close this for now.
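To make the proposal concrete, the idea seems to be something like the following hypothetical call. MemberAddWithPut does not exist in etcd; this is purely a sketch of the suggestion:

```go
// Hypothetical, NOT a real etcd API: add a member and write a key as one
// atomic step, so a watcher on the key can never race the membership change.
resp, err := cli.MemberAddWithPut(ctx,
	[]string{"http://127.0.0.1:2480"}, // new member's peer URL
	"/ready/h2", "added",              // key/value signaling the new member
)
```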
Member operations are configuration changes, whereas PUT is a mutating operation on the key space. For example, you cannot add or delete an etcd member by simply updating a key value. They are very, very different kinds of operations. Sorry, but what you suggested does not make sense to me.
I understand; I was just proposing an idea for a race-free API that would help solve the need for (rest of the sentence elided).
I'm building some etcd clusters with the embed package. I start up three hosts: h1, then h2, then h3. Occasionally, when I add the third host, the server fails to start with an "error validating peerURLs" error (full message elided). The error comes from here: etcd/etcdserver/server.go, line 318 in ad5e169.
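For readers without the permalink handy, the failing check in the 3.3-era tree looks roughly like this (reproduced from memory, so treat as approximate):

```go
// etcdserver/server.go, inside NewServer, on the join-existing-cluster path
// (no WAL yet): fetch the cluster view from remote peers and validate the
// local configuration against it.
existingCluster, gerr := GetClusterFromRemotePeers(getRemotePeerURLs(cl, cfg.Name), prt)
if gerr != nil {
	return nil, fmt.Errorf("cannot fetch cluster info from peer urls: %v", gerr)
}
if err := membership.ValidateClusterAndAssignIDs(cl, existingCluster); err != nil {
	return nil, fmt.Errorf("error validating peerURLs %+v: %v", existingCluster, err)
}
```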
I think this is a race/timing bug. I believe it happens more often if I start up H3 very quickly after H2. If I wait, say, 5 seconds before doing so, then I don't think it ever happens.
My suspicion is that after a new member (e.g. H2) is added, some async goroutine runs to update some cluster state. If this doesn't happen before H3 joins, then we hit this issue.
If I were to guess, I'd think that perhaps the client AutoSyncInterval is related, but I'm really out of my depth here, in that I don't know all the etcd internals. Thanks for reading!
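For reference, AutoSyncInterval is a clientv3.Config option that periodically refreshes the client's endpoint list from the cluster; as far as I understand it affects only the client side, not server-side peer validation. A minimal sketch, with illustrative endpoint and intervals:

```go
import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func newClient() *clientv3.Client {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:        []string{"http://127.0.0.1:2379"},
		DialTimeout:      5 * time.Second,
		AutoSyncInterval: 30 * time.Second, // refresh the endpoint list periodically; 0 disables
	})
	if err != nil {
		log.Fatal(err)
	}
	return cli
}
```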