
[RFC (against master branch)] etcdserver: when using --unsafe-no-fsync write data #12752

Merged
merged 1 commit into etcd-io:master on Mar 7, 2021

Conversation

cwedgwood
Contributor

(this is a master/3.5 variant of the previous PR I made; feel free to ignore for now)

There are situations where we don't wish to fsync but we do want to
write the data.

Typically this occurs in clusters where fsync latency (often the
result of firmware) transiently spikes. For Kubernetes clusters this
causes (many) leader elections, which have knock-on effects: the API
server will transiently fail, causing other components to fail in turn.

By writing the data (buffered and asynchronously flushed, so in most
situations the write is fast) while avoiding the fsync, we no longer
trigger this situation and still opportunistically write out the data.
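
For illustration, here is a minimal, self-contained sketch of that write-without-fsync behaviour. The types and names (e.g. sketchWAL) are hypothetical stand-ins, not etcd's WAL API; the actual change appears to land in server/wal/wal.go (per the coverage report below). The point is only that the entry is still written and flushed to the OS, while the fsync durability barrier is skipped when the unsafe flag is set.

```go
// Sketch only: toy types, not etcd's real WAL implementation.
package main

import (
	"bufio"
	"fmt"
	"os"
)

// sketchWAL is a hypothetical stand-in for a write-ahead log.
type sketchWAL struct {
	f            *os.File
	buf          *bufio.Writer
	unsafeNoSync bool // mirrors the spirit of --unsafe-no-fsync
}

// save appends an entry. The data is always flushed to the OS, but the
// fsync barrier is only paid when unsafeNoSync is false.
func (w *sketchWAL) save(entry []byte) error {
	if _, err := w.buf.Write(entry); err != nil {
		return err
	}
	// Hand the data to the kernel (page cache); it survives a process
	// crash, and the kernel writes it back asynchronously.
	if err := w.buf.Flush(); err != nil {
		return err
	}
	if w.unsafeNoSync {
		// Skip the durability barrier; slow storage can no longer
		// stall this path.
		return nil
	}
	return w.f.Sync()
}

func main() {
	f, err := os.CreateTemp("", "wal-sketch-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	w := &sketchWAL{f: f, buf: bufio.NewWriter(f), unsafeNoSync: true}
	if err := w.save([]byte("raft entry\n")); err != nil {
		panic(err)
	}
	fmt.Println("entry written, fsync skipped:", f.Name())
}
```

The trade-off is the one discussed below: data handed to the kernel survives a process crash, but a sudden whole-node failure can drop entries the kernel has not yet written back.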

Anecdotally:
Because the fsync is missing, there is an argument that certain
types of failure events could cause data corruption or loss; in
testing this wasn't seen. If it were to occur, the expectation is that
the member can be re-added to the cluster or, worst case, restored
from a robust persisted snapshot.

The etcd members are deployed across isolated racks with different
power feeds. A simultaneous failure of all of them is unlikely.
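
As a concrete (hypothetical) illustration of that recovery path, a member whose data was lost could be dropped and re-added with the clientv3 cluster API so it resynchronizes from the remaining members. The endpoint URLs and member ID below are placeholders, not values from this PR.

```go
// Sketch of the remove-and-re-add recovery path mentioned above.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379", "https://etcd-1:2379"}, // placeholders
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Remove the damaged member (ID would come from MemberList in practice).
	const damagedID = 0x1234 // placeholder
	if _, err := cli.MemberRemove(ctx, damagedID); err != nil {
		log.Fatal(err)
	}

	// Re-add it; the restarted etcd process (with a clean data dir)
	// then catches up from the leader's snapshot and log.
	if _, err := cli.MemberAdd(ctx, []string{"https://etcd-2:2380"}); err != nil {
		log.Fatal(err)
	}
}
```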

Testing was usually of the form:

  • create (Kubernetes) etcd write-churn by creating ReplicaSets of
    some thousands of pods
  • break/fail the leader

Failure testing included:

  • hard node power-off events
  • disk removal
  • orderly reboots/shutdown

In all cases, when the node recovered, it was able to rejoin the
cluster and synchronize.

@cwedgwood changed the title from "[RFC] etcdserver: when using --unsafe-no-fsync write data" to "[RFC (against master branch)] etcdserver: when using --unsafe-no-fsync write data" on Mar 5, 2021
@codecov-io

Codecov Report

Merging #12752 (b63d31e) into master (f400163) will decrease coverage by 9.78%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master   #12752      +/-   ##
==========================================
- Coverage   66.85%   57.06%   -9.79%     
==========================================
  Files         416      406      -10     
  Lines       32849    31994     -855     
==========================================
- Hits        21961    18258    -3703     
- Misses       8795    12022    +3227     
+ Partials     2093     1714     -379     
Flag | Coverage Δ
all  | 57.06% <0.00%> (-9.79%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files | Coverage Δ
server/wal/wal.go | 55.28% <0.00%> (+20.52%) ⬆️
client/v3/utils.go | 0.00% <0.00%> (-100.00%) ⬇️
client/v3/compact_op.go | 0.00% <0.00%> (-100.00%) ⬇️
client/v3/ordering/util.go | 0.00% <0.00%> (-100.00%) ⬇️
pkg/tlsutil/cipher_suites.go | 0.00% <0.00%> (-100.00%) ⬇️
client/v3/naming/endpoints/endpoints.go | 0.00% <0.00%> (-100.00%) ⬇️
...ver/proxy/grpcproxy/adapter/lock_client_adapter.go | 0.00% <0.00%> (-100.00%) ⬇️
client/v3/leasing/util.go | 0.00% <0.00%> (-98.04%) ⬇️
pkg/report/report.go | 0.00% <0.00%> (-95.58%) ⬇️
client/v3/namespace/watch.go | 0.00% <0.00%> (-92.86%) ⬇️
... and 233 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ptabor merged commit 7556b9a into etcd-io:master on Mar 7, 2021