Possible split-brain & loss of committed write with loss of un-fsynced data #14143
Oof, here's an even worse run. Here we appear to have gotten into some kind of full split-brain, where values on different replicas diverged at some point and the cluster proceeded in two independent timelines. Here you can see that reads on node n3 (processes 382, 392, and 397; modulo 5 nodes these are all 2, i.e. the third node in the cluster) saw a version of key 150 which went

Here are the full logs from this run, including etcd data files. At 12:39:28 we killed every node, and at 12:39:49 we restarted them all:

2022-06-22 12:39:28,541{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :kill :all
2022-06-22 12:39:30,591{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :kill {"n1" :done, "n2" :done, "n3" :done, "n4" :done, "n5" :done}
2022-06-22 12:39:49,206{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start :all
2022-06-22 12:39:49,289{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}

Immediately on startup, n3 began repairing its WAL:

{"level":"info","ts":"2022-06-22T12:39:49.279-0400","caller":"wal/repair.go:40","msg":"repairing","path":"n3.etcd/member/wal/0000000000000000-0000000000000000.wal"}
{"level":"info","ts":"2022-06-22T12:39:50.386-0400","caller":"wal/repair.go:96","msg":"repaired","path":"n3.etcd/member/wal/0000000000000000-0000000000000000.wal","error":"unexpected EOF"}
{"level":"info","ts":"2022-06-22T12:39:50.386-0400","caller":"etcdserver/storage.go:109","msg":"repaired WAL","error":"unexpected EOF"}

n3 clearly knew about the other members in the cluster, because it recovered them from store at 12:39:50:

{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"4824313a421b2502","recovered-remote-peer-urls":["http://192.168.122.105:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"4d6e27d122507e9c","recovered-remote-peer-urls":["http://192.168.122.104:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"a1ffd5acd6a88a6a","recovered-remote-peer-urls":["http://192.168.122.102:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"afa39e55dee6dc2e","recovered-remote-peer-urls":["http://192.168.122.101:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"1153c9690d2b2284","recovered-remote-peer-urls":["http://192.168.122.103:2380"]}

n3 recorded that its peers were active around 12:40:17:

{"level":"info","ts":"2022-06-22T12:40:17.830-0400","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"a1ffd5acd6a88a6a"}

And noted that it had a conflict in the log at index 906, between a local entry with term 4 and a conflicting entry at term 5:

{"level":"info","ts":"2022-06-22T12:40:18.043-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"1153c9690d2b2284 became follower at term 5"}
{"level":"info","ts":"2022-06-22T12:40:18.043-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"1153c9690d2b2284 [logterm: 4, index: 907, vote: 0] rejected MsgVote from 4d6e27d122507e9c [logterm: 4, index: 905] at term 5"}
{"level":"info","ts":"2022-06-22T12:40:18.044-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"found conflict at index 906 [existing term: 4, conflicting term: 5]"}
{"level":"info","ts":"2022-06-22T12:40:18.044-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"replace the unstable entries from index 906"}

n3 went on to serve client requests at 12:40:18:

{"level":"info","ts":"2022-06-22T12:40:18.047-0400","caller":"embed/serve.go:98","msg":"ready to serve client requests"}

The very first requests to execute after this point show that n3 diverged from the rest of the cluster. Inconsistencies appeared in hundreds of keys. n1 and n4 believed key 78 was
Here's another case, where key 87 diverged:

2022-06-22 12:40:23,335{GMT} INFO [jepsen worker 12] jepsen.util: 392 :ok :txn [[:r 87 [1 2]] [:r 87 [1 2]]]
2022-06-22 12:40:23,338{GMT} INFO [jepsen worker 10] jepsen.util: 410 :ok :txn [[:append 87 4] [:r 87 [1 3 4]] [:r 87 [1 3 4]] [:r 87 [1 3 4]]]

At 12:40:32 we killed all nodes, and at 12:41:01 we restarted them:

2022-06-22 12:40:32,662{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :kill :all
2022-06-22 12:40:34,759{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :kill {"n1" :done, "n2" :done, "n3" :done, "n4" :done, "n5" :done}
2022-06-22 12:41:01,673{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start :all
2022-06-22 12:41:01,756{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}

At 12:41:30.322, node n3 logged that it had detected an inconsistency with a peer, and appears to have shut down:

{"level":"error","ts":"2022-06-22T12:41:30.320-0400","caller":"embed/etcd.go:259","msg":"checkInitialHashKV failed","error":"1153c9690d2b2284 found data inconsistency with peers","stacktrace":"go.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:259\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
...
{"level":"info","ts":"2022-06-22T12:41:30.322-0400","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"n3","data-dir":"n3.etcd","advertise-peer-urls":["http://192.168.122.103:2380"],"advertise-client-urls":["http://192.168.122.103:2379"]}

At 12:41:35 Jepsen restarted the crashed n3:

2022-06-22 12:41:35,166{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start :all
2022-06-22 12:41:35,248{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start {"n1" :already-running, "n2" :already-running, "n3" :started, "n4" :already-running, "n5" :already-running}

And n3 crashed again, complaining of data inconsistency:

{"level":"error","ts":"2022-06-22T12:41:35.283-0400","caller":"embed/etcd.go:259","msg":"checkInitialHashKV failed","error":"1153c9690d2b2284 found data inconsistency with peers","stacktrace":"go.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:259\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

At 12:41:38 Jepsen killed every node, and restarted them all at 12:42:01:

2022-06-22 12:41:38,199{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :kill :all
2022-06-22 12:41:40,254{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :kill {"n1" :done, "n2" :done, "n3" :done, "n4" :done, "n5" :done}
2022-06-22 12:42:01,589{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start :all
2022-06-22 12:42:01,674{GMT} INFO [jepsen worker nemesis] jepsen.util: :nemesis :info :start {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}

However, this time etcd did not crash, and did not detect data inconsistency for the rest of the test.
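A note on the "found conflict at index 906 [existing term: 4, conflicting term: 5]" message above: a Raft follower that receives entries whose terms disagree with its own log discards the conflicting suffix and adopts the leader's entries. Here is a minimal sketch of that log-matching rule in Go, using simplified types rather than etcd's actual raft code:

```go
package main

import "fmt"

// Entry is a simplified raft log entry: just an index and a term.
type Entry struct {
	Index, Term uint64
}

// findConflict returns the first index at which the incoming entries disagree
// (same index, different term) with the local log, or 0 if there is no
// conflict. On a conflict the follower drops its local entries from that index
// onward and appends the leader's entries instead, which is what the
// "replace the unstable entries from index 906" message reports.
func findConflict(local, incoming []Entry) uint64 {
	if len(local) == 0 {
		return 0
	}
	first := local[0].Index
	for _, e := range incoming {
		if e.Index < first {
			continue
		}
		i := int(e.Index - first)
		if i < len(local) && local[i].Term != e.Term {
			return e.Index
		}
	}
	return 0
}

func main() {
	// Local log ends ...905@4, 906@4, 907@4; the term-5 leader sends 906@5, 907@5.
	local := []Entry{{905, 4}, {906, 4}, {907, 4}}
	incoming := []Entry{{906, 5}, {907, 5}}
	fmt.Println(findConflict(local, incoming)) // 906: replace local entries from index 906 onward
}
```

In a correct run only uncommitted entries can be replaced this way; the suspicion in this issue is that losing un-fsynced data allowed entries that had already been acknowledged as committed to end up on the losing side of the truncation.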
I've got a good dozen of these cases now. Also seeing a couple new behaviors! In this case, clients started returning errors like
This was accompanied by basically every single type of anomaly we know how to detect, including lost updates, split-brain, aborted read, write cycles, cyclic information flow, etc etc.
@aphyr can you still reproduce this issue on 3.5.4? @endocrimes did you ever see this issue on 3.5.4? cc @serathius
Hi @ahrtr! We've been tracking down bugs in lazyfs itself--still don't have all of them sanded off yet, so it's possible this may not be etcd's fault. I'm afraid my own contract working on etcd analysis ended last week, so I won't be able to devote much time to this, but I'll try to come back and do a second pass once we have lazyfs in a better spot.
I'm trying to see if this still replicates on 3.5.4 and master - I think I have it repro-ing but it also depends on lazyfs rn
Based on my private discussion with @endocrimes this is reproducible on v3.4
Hi @aphyr, just to be sure: what's your LazyFS .toml config for these tests?
Hey there! Sorry I don't have this available off the top of my head, and I'm completely swamped right now with other work stuff--no longer working on etcd testing. I did include repro instructions in the issue though, and that'll spit out a file in... I think /opt/etcd/data.lazyfs or something like that!
I think I've managed to accidentally confirm the existence of this bug. When starting to investigate I short circuited the

This somewhat implies that there is a path that is writing data and failing to fsync it. Next on my list is properly tracing through the data path for kv writes to get a deeper understanding of where that might be. Probably some time next week with my current backlog in k/k.
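For what it's worth, the invariant being hunted for fits in a few lines of Go. This is an illustrative sketch only, not etcd's actual WAL code: a write must be fsynced before it is acknowledged or counted toward a quorum.

```go
package main

import (
	"log"
	"os"
)

// appendDurably sketches the ordering requirement under discussion: write,
// then fsync, and only then acknowledge. A code path that skips or reorders
// the Sync call can lose "committed" data when the process is killed before
// the kernel flushes its page cache, which is exactly the failure lazyfs
// simulates by discarding un-fsynced writes.
func appendDurably(f *os.File, entry []byte) error {
	if _, err := f.Write(entry); err != nil {
		return err
	}
	if err := f.Sync(); err != nil { // fsync(2): must happen before any ack
		return err
	}
	return nil // only now is it safe to acknowledge / count toward a quorum
}

func main() {
	f, err := os.OpenFile("wal.tmp", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := appendDurably(f, []byte("entry\n")); err != nil {
		log.Fatal(err)
	}
}
```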
I was looking into paths to
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
My guess:
Example:
The scenario is quite tricky and I am not sure how to reproduce it. We may need an integration of LazyFS, the interaction test suite in the raft library, and the e2e test suite. Could anyone please evaluate whether the above case makes sense?
@CaojiamingAlan, what you are describing sounds like an issue with the raft implementation. Let's cc the experts. I have some old PoC of using LazyFS for robustness tests. However, the scenario you described requires multiple nodes to go down, which robustness testing doesn't support yet. I don't have time this week, but if you want, I think you should be able to create an e2e test that simulates this.
This should not be true. In your example, node 3 isn't able to be elected as the leader, as its local entries lag behind those of the other nodes. Note that when a node campaigns, it uses the lastIndex instead of commitIndex.
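To illustrate the point, the election restriction a voter applies looks roughly like this (per the Raft paper; a simplified sketch, not etcd's raft code). The comparison is on the candidate's last log term and index, not its commit index:

```go
package main

import "fmt"

// isUpToDate reports whether a candidate's log (lastTerm, lastIndex) is at
// least as up-to-date as the voter's, per the Raft election restriction.
// Because the check uses the last entry rather than the commit index, a node
// whose log lags behind its peers cannot win an election against them.
func isUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex uint64) bool {
	return candLastTerm > voterLastTerm ||
		(candLastTerm == voterLastTerm && candLastIndex >= voterLastIndex)
}

func main() {
	// From the log above: voter n3 is at [logterm: 4, index: 907] and the
	// candidate is at [logterm: 4, index: 905], so the vote is rejected.
	fmt.Println(isUpToDate(4, 905, 4, 907)) // false
}
```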
@CaojiamingAlan if you are interested in digging deeper, I encourage you to rerun the Jepsen tests. @endocrimes managed to fix the jepsen docker setup and reproduce this issue. There have been a lot of changes to etcd since then, including addressing multiple data inconsistencies, so we need to confirm the reproduction. Not sure she contributed the docker setup code back. Ofc the main blocker is knowing/learning Clojure.
Agreed with @ahrtr. It's useful to make a mental distinction between "known committed" and "actually committed". An entry is committed if it is durable on a quorum of nodes, but it may not be known as committed to everyone for some time.

The

Sadly, due to the way

As for reproducing things like this, I can recommend trying this via
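To make "actually committed" concrete, here is a small sketch of the quorum rule (not etcd's implementation): an index is committed once it is durably replicated on a majority, even if some nodes have not yet learned that.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index that is durable on a quorum,
// given each node's highest fsynced index. Entries up to this index are
// "actually committed"; "known committed" lags behind on nodes that have not
// yet been told the new commit index.
func commitIndex(durable []uint64) uint64 {
	sorted := append([]uint64(nil), durable...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	return sorted[len(sorted)/2] // the highest index held by a majority
}

func main() {
	// Five nodes: three of them have durably persisted index 905 or higher,
	// so everything up to 905 is committed.
	fmt.Println(commitIndex([]uint64{907, 906, 905, 905, 903})) // 905
}
```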
What happened?
With a five-node etcd 3.5.3 cluster running on Debian stable, with a data directory mounted on the lazyfs filesystem, killing nodes and losing their un-fsynced writes appears to infrequently cause the loss of committed writes. Here's an example test run where we appended 4 to key 233 via a kv transaction guarded by a revision comparison:
This transaction completed successfully, and its write of 4 was visible to reads for at least 690 milliseconds. Then we killed every node in the cluster and restarted them:
Upon restarting, 4 was no longer visible. Instead readers observed that key 233 did not exist, and writers began appending new values starting with 18:
Here are all reads of key 233, arranged in chronological order from top to bottom:
This history cannot be serializable: every append transaction involves a read of every key it intends to write, followed by a write transaction which compares those keys to make sure their revisions are the same as the revisions originally read. In such a history no writes should be lost--and yet here, etcd appears to have lost the write of 4.
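For concreteness, each guarded append in the test corresponds to a compare-and-set transaction along these lines. This is a Go clientv3 sketch; the actual test drives etcd from Jepsen's Clojure client, and the endpoint, key, and values here are illustrative:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://192.168.122.101:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Read the current value and remember the revision we observed.
	get, err := cli.Get(ctx, "233")
	if err != nil {
		log.Fatal(err)
	}
	var rev int64 // stays 0 if the key does not exist yet
	if len(get.Kvs) > 0 {
		rev = get.Kvs[0].ModRevision
	}

	// Guarded write: commit only if nobody changed the key since our read
	// (the value here stands in for the appended list of elements).
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("233"), "=", rev)).
		Then(clientv3.OpPut("233", "1 2 3 4")).
		Commit()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("txn succeeded: %v", resp.Succeeded)
}
```

If the compare fails the append is retried, so a successfully acknowledged append should never disappear; that is the guarantee the reads of key 233 above appear to violate.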
As with previous lazyfs issues, this may be due to a bug in lazyfs, or it might represent a failure of etcd to fsync values on a majority of nodes before committing them.
What did you expect to happen?
Even with the loss of un-fsynced writes, etcd should not lose committed transactions.
How can we reproduce it (as minimally and precisely as possible)?
With https://github.com/jepsen-io/etcd 181656bb551bbc10cdc3d959866637574cdc9e17 and Jepsen 9e40a61d89de1343e06b9e8d1f77fe2c0be2e6ec, run:
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response