Possible data loss -- fsync parent directories #6368
@heyitsanthony This seems like a bug. I marked it as P2 since it is unlikely to happen in practice, but we should fix it soon. @ramanala Can you say more about your testing framework? We are VERY interested in automated testing to ensure etcd works reliably.
@xiang90 -- we have a testing framework that can test distributed storage systems for problems like data loss and unavailability in the presence of correlated crashes (i.e., all servers crashing together, with file-system-level problems like the one described above on some of them). We have a couple more issues in etcd where the cluster can become unavailable; I will file those soon. Thanks for your interest! This is a research tool, and the related paper will appear at OSDI this year. We will also make the tool publicly available, and I will update you with more information about it in a few days.
Thank you @heyitsanthony and @xiang90!
This fixes bug with wal handling etcd-io/etcd#6368 Signed-off-by: Alexander Morozov <[email protected]>
(#3960) In #3959, the bulk loader crashes when trying to move a directory into itself under a new name: /dgraph/tmp/shards/shard_0 into /dgraph/tmp/shards/shard_0/shard_0. The bulk loader logic is:
1. the mapper produces output as .../tmp/shards/000, .../tmp/shards/001
2. read the list of shards under .../tmp/shards/
3. create the reducer shards as .../tmp/shards/shard_0, .../tmp/shards/shard_1
4. move the list read in step 2 into the reducer shards created in step 3

Though I cannot reproduce the problem, it seems the creation of the reducer shard directory .../tmp/shards/shard_0 and the listing of the mapper shards in step 2 were re-ordered. Something similar is mentioned in etcd-io/etcd#6368. This PR avoids such possibilities by putting the mapper output into an independent directory .../tmp/map_output, so that the program works correctly even if the reordering happens.
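A minimal sketch of the independent-directory layout described above (paths, names, and the shard mapping are illustrative assumptions, not the actual bulk loader code):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	tmp := "/dgraph/tmp"                       // illustrative root
	mapDir := filepath.Join(tmp, "map_output") // mapper output lives outside shards/
	shardDir := filepath.Join(tmp, "shards")   // reducer shards live here

	// List mapper output from a directory that never contains reducer shards,
	// so a later rename cannot target a subdirectory of its own source tree.
	entries, err := os.ReadDir(mapDir)
	if err != nil {
		log.Fatal(err)
	}
	for i, e := range entries {
		dst := filepath.Join(shardDir, fmt.Sprintf("shard_%d", i))
		if err := os.MkdirAll(dst, 0o755); err != nil {
			log.Fatal(err)
		}
		// Move the mapper output into the reducer shard directory.
		src := filepath.Join(mapDir, e.Name())
		if err := os.Rename(src, filepath.Join(dst, e.Name())); err != nil {
			log.Fatal(err)
		}
	}
}
```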
Hi @ramanala, I am interested in knowing more about your testing framework. Is it available publicly now?
I am running a three-node etcd cluster. When I insert a new key-value pair into the store, I see the following sequence of system calls on the server (a Go sketch of this write pattern is given after the trace).
1 creat("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
2 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
3 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp")
4 rename("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal.tmp", "/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
5 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
6 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
7 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
8 append("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
9 fdatasync("/data/etcd/infra0.etcd/member/wal/0000000000000001-000000000000001b.wal")
==========Data is durable here -- Ack the user====================
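For reference, a minimal Go sketch of the create-temp/sync/rename pattern observed in steps 1-4 of the trace (this is not etcd's actual WAL code; paths and contents are illustrative):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	dir := "member/wal" // illustrative; the trace uses /data/etcd/infra0.etcd/member/wal
	tmp := filepath.Join(dir, "0000000000000001-000000000000001b.wal.tmp")
	final := filepath.Join(dir, "0000000000000001-000000000000001b.wal")

	// 1. creat: create the temporary segment file.
	f, err := os.Create(tmp)
	if err != nil {
		log.Fatal(err)
	}
	// 2. append: write records into the temporary file.
	if _, err := f.Write([]byte("wal records")); err != nil {
		log.Fatal(err)
	}
	// 3. fdatasync: flush file data (f.Sync issues fsync; the trace shows fdatasync).
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
	// 4. rename: atomically give the segment its final name. Without an fsync
	// of the parent directory, this rename may not yet be durable on disk.
	if err := os.Rename(tmp, final); err != nil {
		log.Fatal(err)
	}
	// Steps 5-9 in the trace then append further records to the renamed file
	// and fdatasync it before acknowledging the client.
}
```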
For the 4th operation (rename of wal.tmp to wal) to be persisted to disk, the parent directory has to be fsync'd. If it is not, a crash just after acknowledging the user can result in data loss. Specifically, the rename can be reordered on some file systems, i.e., it is not necessarily issued to disk immediately. In such a case, on recovery, the server would see the file wal.tmp but not wal. On seeing this, I believe etcd just unlinks the tmp file and can therefore lose the user's data. If this happens on two nodes of a three-node cluster, a global data loss is possible. We have reproduced this particular data loss issue using our testing framework. As a fix, it would be safe to fsync the parent directory on creat or rename of files.
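As a sketch of the suggested fix: after the rename, open the parent directory and fsync it so the new directory entry is durable before the client is acknowledged. A minimal sketch, assuming a Linux filesystem where fsync on a directory file descriptor persists its entries; this is not etcd's actual code, and the paths are illustrative:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// fsyncDir fsyncs a directory so that recently created or renamed entries
// inside it become durable. Minimal sketch, not etcd's implementation.
func fsyncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	tmp := "member/wal/0000000000000001-000000000000001b.wal.tmp" // illustrative paths
	final := "member/wal/0000000000000001-000000000000001b.wal"

	if err := os.Rename(tmp, final); err != nil {
		log.Fatal(err)
	}
	// Make the rename itself durable before acknowledging the user.
	if err := fsyncDir(filepath.Dir(final)); err != nil {
		log.Fatal(err)
	}
}
```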