etcd detects data corruption by default #14039
Overall looks good, and I agree it's probably the best thing we can do.
Just noticing that the design below is based on compaction and would not work if there were no compactions... so it is not truly 'independent'. Just nitpicking...
Just noticing that if you look at the compaction logic:
It only scans bbolt up to the 'compaction horizon', and the data in bbolt is ordered by 'revision' (a hedged sketch of hashing during that scan follows at the end of this comment).
There is a challenge about when to raise an alert when we don't observe matching checksums. The compaction request goes through raft, but the compaction is computed within the backend and finishes asynchronously. The only signal that it completed is updating bbolt here:
but it doesn't seem to be exposed by any public API. It's hard to distinguish between:
Partial mitigation for 2. is to compute the checksum of the last Do you have thoughts on this aspect?
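To make that concrete, here is a minimal sketch of what hashing during the compaction scan could look like, assuming etcd's bbolt layout (the "key" bucket, revisions encoded as big-endian ordered keys); the function name and crc32 choice are illustrative, not the actual etcd code:

```go
// Illustrative sketch only: checksum everything the compaction scan already
// visits, i.e. all revisions in the "key" bucket up to the compaction horizon.
package sketch

import (
	"bytes"
	"hash/crc32"

	bolt "go.etcd.io/bbolt"
)

// hashUpToCompactRev walks the "key" bucket in revision order (bbolt keys are
// byte-ordered and etcd encodes revisions big-endian) and hashes every
// key/value pair whose revision is <= compactRev.
func hashUpToCompactRev(db *bolt.DB, compactRev []byte) (uint32, error) {
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	err := db.View(func(tx *bolt.Tx) error {
		c := tx.Bucket([]byte("key")).Cursor()
		for k, v := c.First(); k != nil && bytes.Compare(k, compactRev) <= 0; k, v = c.Next() {
			h.Write(k)
			h.Write(v)
		}
		return nil
	})
	return h.Sum32(), err
}
```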
The existing data corruption detection mechanism indeed needs improvement. It seems we can NOT achieve both " One proposal is to add a new API and sub-command (i.e.
You are right, what I meant is that compaction during the corruption check should not cause it to skip validating a peer. Still, the proposed design needs compaction to be run.
Scanning the whole bbolt was not my intention; I meant to calculate the hash on the compacted data, i.e. between the oldest revision and the compaction horizon.
This is why I want the leader to store and ask about multiple different revisions during one cycle. This way it asks about the same revision multiple times, giving followers a chance to compact (see the sketch after this comment). There are still 2 things I need to confirm:
I don't think we can handle this one unless hashes are stored in raft. Not sure how your proposed mitigation would work. Still, I think we are OK: if the cluster crashes every couple of minutes (the expected compaction frequency), it means there is already a problem that the admin should fix.
Not sure if we can treat clusters that have members with broken compaction as healthy. Their storage would grow infinitely and they would OOM. This assumes the more typical K8s use case, where clusters have a lot of activity and require periodic compaction to function.
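A minimal sketch of the leader-side cycle described above; the peer interface, HashByRev, and the stored-hash window are assumptions for illustration, not etcd's actual API:

```go
// Hypothetical leader-side check: keep hashes for the last few compaction
// revisions and, each cycle, ask peers for those same revisions, so a
// follower that has not compacted yet simply gets retried next cycle.
package sketch

type compactionHash struct {
	Revision int64
	Hash     uint32
}

type peer interface {
	// HashByRev is assumed to return the peer's hash at a compacted revision,
	// or an error if the peer has not compacted that far yet.
	HashByRev(rev int64) (uint32, error)
}

// checkPeers returns false only when a peer reports a different hash for a
// revision both sides have compacted; that is the corruption signal.
func checkPeers(stored []compactionHash, peers []peer) bool {
	for _, ch := range stored { // several recent compaction revisions
		for _, p := range peers {
			h, err := p.HashByRev(ch.Revision)
			if err != nil {
				continue // peer not compacted to this revision yet; retry next cycle
			}
			if h != ch.Hash {
				return false // mismatch on a shared compacted revision
			}
		}
	}
	return true
}
```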
Sent implementation of hash calculation in #14049
Backporting this feature depends on it having no impact on users; however, there is still potential impact if there is a bug in the implementation. So I propose that we introduce a prerelease with this feature. We will publish v3.5.5-rc.0, with a note that this is not a full release, and not provide installation instructions. This way we can do additional testing to prevent bugs, like running the K8s CI. Only when we are sure that everything is OK do we make a full release. WDYT @ptabor @ahrtr
Basically it looks good to me. Although usually it's not proper to backport a non-trivial feature to a stable branch, and it's also the first time for me to see an RC version for a patch release, it can improve the maintainability of 3.5, and we need to support & maintain 3.5 for a long time. So it's OK!
Implemented
P0 action item proposed in https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md
Problems:
Proposal
Develop a dedicated v3.5 improvement to the data consistency check that will be enabled by default and backported to the v3.5 release. The scope of this patch will be heavily limited to avoid introducing any breaking changes or performance regressions.
I think it's worth investing in this change now to make sure v3.5 is in an acceptable state, instead of trying to rush the v3.6 release in a half-baked state. Due to the heavy restrictions it might turn out that there is no reasonable improvement we can make; still, I think it's worth considering.
Current state of consistency check
Design
Goals
Requirements:
Implementation
Evaluate consistency on compacted revisions by calculating hashes during compaction. Compactions are negotiated via raft, so they are executed by all members at the same revision, meaning slow members don't matter. During compaction we already need to touch bbolt, so calculating a hash at the same time should add only a minimal performance cost. No API changes: the existing function that returns HashKV will just be extended to serve hashes from selected compacted revisions (a rough sketch follows below).
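A rough sketch of what serving hashes from selected compacted revisions could look like; the type and method names below are illustrative, not the actual etcd backend types:

```go
// Illustrative only: remember hashes computed during recent compactions so a
// HashKV-style lookup can serve them for matching compacted revisions.
package sketch

import "sync"

type compactHash struct {
	rev  int64
	hash uint32
}

type hashStore struct {
	mu     sync.Mutex
	hashes []compactHash // small window of recent compaction hashes
}

// store is called at the end of a compaction with the hash computed while
// the compaction scan was touching bbolt anyway.
func (s *hashStore) store(rev int64, hash uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.hashes = append(s.hashes, compactHash{rev: rev, hash: hash})
	if len(s.hashes) > 10 { // keep only the most recent compactions
		s.hashes = s.hashes[1:]
	}
}

// hashByRev returns the stored hash for a compacted revision, if available,
// so the existing HashKV path can answer without rescanning the keyspace.
func (s *hashStore) hashByRev(rev int64) (uint32, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.hashes {
		if ch.rev == rev {
			return ch.hash, true
		}
	}
	return 0, false
}
```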
The only issue is that compactions are not run automatically by default; however, Kubernetes already runs them every 5 minutes, which should be enough to provide a significant improvement for the majority of etcd users.
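For clusters not managed by Kubernetes, enabling etcd's built-in periodic auto-compaction would give the check compacted revisions to compare; the flags below exist in current etcd, but the 5m value is just an example:

```sh
# Example only: run periodic compaction every 5 minutes so compaction-based
# hash checks regularly get fresh compacted revisions to compare.
etcd --auto-compaction-mode=periodic --auto-compaction-retention=5m
```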
Algorithm:
cc @ptabor @ahrtr @spzala