Possible data loss -- fsync parent directories #6368
@heyitsanthony This seems like a bug. I marked it as P2 since it is unlikely to happen in practice, but we should fix it soon. @ramanala Can you say more about your testing framework? We are VERY interested in automated testing to ensure etcd works reliably.
@xiang90 -- we have a testing framework that can test distributed storage systems for problems like data loss and unavailability in the presence of correlated crashes (i.e., all servers crashing together, with file-system-level problems like the one described above on some of them). We have a couple more issues in etcd where the cluster can become unavailable; I will file those soon. Thanks for your interest! This is a research tool, and the related paper will appear at OSDI this year. We will also make the tool publicly available, and I will update you with more information about it in a few days.
Thank you @heyitsanthony and @xiang90!
This fixes bug with wal handling etcd-io/etcd#6368 Signed-off-by: Alexander Morozov <[email protected]>
(#3960) In #3959, the bulk loader crashes when trying to move a directory into itself under a new name: /dgraph/tmp/shards/shard_0 into /dgraph/tmp/shards/shard_0/shard_0. The bulk loader logic is:
1. the mapper produces output as .../tmp/shards/000, .../tmp/shards/001
2. read the list of shards under .../tmp/shards/
3. create the reducer shards as .../tmp/shards/shard_0, .../tmp/shards/shard_1
4. move the list read in step 2 into the reducer shards created in step 3

Though I cannot reproduce the problem, it seems the creation of the reducer shard directory .../tmp/shards/shard_0 and the listing of the mapper shards in step 2 were re-ordered. Something similar is mentioned in etcd-io/etcd#6368. This PR avoids such possibilities by putting the mapper output into an independent directory .../tmp/map_output, so that the program works correctly even if the reordering happens.
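A minimal sketch of the independent-directory layout described above (paths, names, and the shard mapping are illustrative assumptions, not the actual bulk loader code):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	tmp := "/dgraph/tmp"                       // illustrative root
	mapDir := filepath.Join(tmp, "map_output") // mapper output lives outside shards/
	shardDir := filepath.Join(tmp, "shards")   // reducer shards live here

	// List mapper output from a directory that never contains reducer shards,
	// so a later rename cannot target a subdirectory of its own source tree.
	entries, err := os.ReadDir(mapDir)
	if err != nil {
		log.Fatal(err)
	}
	for i, e := range entries {
		dst := filepath.Join(shardDir, fmt.Sprintf("shard_%d", i))
		if err := os.MkdirAll(dst, 0o755); err != nil {
			log.Fatal(err)
		}
		// Move the mapper output into the reducer shard directory.
		src := filepath.Join(mapDir, e.Name())
		if err := os.Rename(src, filepath.Join(dst, e.Name())); err != nil {
			log.Fatal(err)
		}
	}
}
```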
Hi @ramanala, I am interested in knowing more about your testing framework. Is it available publicly now?
I am running a three-node etcd cluster. When I insert a new key-value pair into the store, I see the following sequence of system calls on the server (a Go sketch of this write pattern is given after the trace).
1 creat("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
2 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
3 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
4 rename("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp", "/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
5 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
6 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
7 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
8 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
9 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
==========Data is durable here -- Ack the user====================
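For reference, a minimal Go sketch of the create-temp/sync/rename pattern observed in steps 1-4 of the trace (this is not etcd's actual WAL code; paths and contents are illustrative):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	dir := "member/wal" // illustrative; the trace uses /data/etcd/infra0.etcd/member/wal
	tmp := filepath.Join(dir, "0000000000000001-000000000000001b.wal.tmp")
	final := filepath.Join(dir, "0000000000000001-000000000000001b.wal")

	// 1. creat: create the temporary segment file.
	f, err := os.Create(tmp)
	if err != nil {
		log.Fatal(err)
	}
	// 2. append: write records into the temporary file.
	if _, err := f.Write([]byte("wal records")); err != nil {
		log.Fatal(err)
	}
	// 3. fdatasync: flush file data (f.Sync issues fsync; the trace shows fdatasync).
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
	// 4. rename: atomically give the segment its final name. Without an fsync
	// of the parent directory, this rename may not yet be durable on disk.
	if err := os.Rename(tmp, final); err != nil {
		log.Fatal(err)
	}
	// Steps 5-9 in the trace then append further records to the renamed file
	// and fdatasync it before acknowledging the client.
}
```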
For the 4th operation (rename of wal.tmp to wal) to be persisted to disk, the parent directory has to be fsync'd. If it is not, a crash just after acknowledging the user can result in data loss. Specifically, the rename can be reordered on some file systems, i.e., it is not necessarily issued to disk immediately. In such a case, on recovery, the server would see the file wal.tmp but not wal. On seeing this, I believe etcd just unlinks the tmp file and can therefore lose the user's data. If this happens on two nodes of a three-node cluster, a global data loss is possible. We have reproduced this particular data loss issue using our testing framework. As a fix, it would be safe to fsync the parent directory on creat or rename of files.
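As a sketch of the suggested fix: after the rename, open the parent directory and fsync it so the new directory entry is durable before the client is acknowledged. A minimal sketch, assuming a Linux filesystem where fsync on a directory file descriptor persists its entries; this is not etcd's actual code, and the paths are illustrative:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// fsyncDir fsyncs a directory so that recently created or renamed entries
// inside it become durable. Minimal sketch, not etcd's implementation.
func fsyncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	tmp := "member/wal/0000000000000001-000000000000001b.wal.tmp" // illustrative paths
	final := "member/wal/0000000000000001-000000000000001b.wal"

	if err := os.Rename(tmp, final); err != nil {
		log.Fatal(err)
	}
	// Make the rename itself durable before acknowledging the user.
	if err := fsyncDir(filepath.Dir(final)); err != nil {
		log.Fatal(err)
	}
}
```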