Improve crash consistency on Linux #8815
Looks good to me. Thanks for the clear description of the tested cases and the areas of concern.
Oh, this is nice. Thanks for including the technical details!
/label ready-for-merge
This PR is now ready for merge. After ~24 hours, we will merge it if there is no negative feedback.
Thanks!
Problem
See this talk (transcript):
Solution
While the optimal solution would involve implementing an undo log / rollback journal as described in the talk or in SQLite, this would be a drastic redesign of this core subsystem.
Rather, without breaking compatibility or changing the existing design, we can `fsync` the parent directory after `fsync`ing the new file. This ensures the parent directory's inode is flushed to disk with the new directory entry for the new file. This reduces the impact of a crash. Currently, we `fsync` the new file itself, but the directory entry in the parent directory is not necessarily persisted to disk until some arbitrary amount of time passes, and if we crash during this window then we could (in some filesystem/mount option combinations) get the old version of the file when we come back up, though we may have taken other actions in the meantime that assume the new file has been persisted, thus leading to inconsistent state. This PR `fsync`s the parent directory right away, reducing the window of time where a crash could have negative consequences to the period of time between the `fsync` of the file and the `fsync` of its parent directory, which is now a much smaller and bounded amount of time.
This is also the approach taken by default in Sendmail on Linux in https://github.com/Distrotech/sendmail/blob/547129475fc1db35ae9b893a4782884c68b182fb/sendmail/deliver.c#L933-L986 whose `README` contains the following:
This correctly notes that this second `fsync` isn't required in all combinations of filesystems and mount options but that it is (or was) required in some of them, and to be safe from a data correctness perspective it is enabled by default. In this PR we do the same in Jenkins. In general, filesystem behavior in this regard is poorly documented and changes frequently over time, so it is best to take the safest approach by default and only allow users to pick a more dangerous approach if they are confident their filesystem configuration can tolerate it.
More to the point, the author of `ext3` explicitly states the need to `fsync` the parent directory on Linux in this post:
The performance impact should be negligible for local disk users. Users of pathologically slow NFS servers might have a problem, in which case an escape hatch has been added to restore the old behavior at the price of decreased crash consistency. If accepted, I will update this page of the documentation.
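The file-then-directory `fsync` sequence described above can be sketched in Java as follows. This is a minimal illustration, not the code changed by this PR: the class `DurableWrite`, the method `writeDurably`, and the system property `example.fsyncParentDir` are hypothetical names, and the write-to-temporary-file-then-rename step is an assumption about how the new file replaces the old one. Opening a directory with `FileChannel.open` and calling `force(true)` on it is expected to work on Linux; other platforms may reject it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class DurableWrite {
    // Hypothetical escape hatch: the property name is illustrative only,
    // not the one introduced by this PR. Defaults to the safer behavior.
    static final boolean FSYNC_PARENT_DIR =
            Boolean.parseBoolean(System.getProperty("example.fsyncParentDir", "true"));

    static void writeDurably(Path target, byte[] data) throws IOException {
        writeDurably(target, data, FSYNC_PARENT_DIR);
    }

    static void writeDurably(Path target, byte[] data, boolean fsyncParentDir)
            throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Assumption: the new content is staged in a temporary file in the same directory.
        Path tmp = Files.createTempFile(dir, "atomic", ".tmp");

        try (FileChannel file = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                file.write(buffer);
            }
            // First fsync: persist the file's contents and metadata.
            file.force(true);
        }

        // On Linux this maps to rename(2), which replaces an existing target.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);

        if (fsyncParentDir) {
            // Second fsync: persist the directory entry that now points at the new file.
            try (FileChannel parent = FileChannel.open(dir, StandardOpenOption.READ)) {
                parent.force(true);
            }
        }
    }
}
```

Passing `false` for `fsyncParentDir` corresponds to the old behavior: the file contents are flushed, but the directory entry is left for the kernel to write back at some later time.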
Testing done
Confirmed after this PR with `bpftrace`'s `syncsnoop` and `opensnoop` that the parent directory was being `fsync`ed as expected. Confirmed that the old behavior was restored when setting the escape hatch to false.
Proposed changelog entries
Proposed upgrade guidelines
N/A
Submitter checklist
Desired reviewers
@mention
Before the changes are marked as `ready-for-merge`:
Maintainer checklist