
etcd detects data corruption by default #14039

Closed
Tracked by #14138
serathius opened this issue May 13, 2022 · 7 comments
Labels: priority/important-soon, stage/tracked

Comments

serathius (Member) commented May 13, 2022

P0 action item proposed in https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md

Problems:

  • Lack of adoption prevents early detection, hampering the etcd community's ability to fix issues.
  • There are no hard guarantees about detecting inconsistency, which undermines the purpose of this feature.
  • Supportability of the v3.5 release is very limited in case of new data inconsistency issues.
  • Properly fixing the data inconsistency check can potentially take multiple releases, as it requires a new WAL entry and larger code changes.

Proposal

Develop a dedicated v3.5 improvement to the data consistency check that will be enabled by default and backported to the v3.5 release. The scope of this patch will be heavily limited to avoid introducing any breaking changes or performance regressions.

I think it's worth investing in this change now to make sure v3.5 is in an acceptable state, instead of trying to rush the v3.6 release in a half-baked state. Due to the heavy restrictions it might turn out that there is no reasonable improvement we can make, but I still think it's worth considering.

Current state of consistency check

  • Initial check: during bootstrap, each member validates that the hash for its latest revision matches the hashes of all peers. It doesn't check peers that haven't seen the revision or that have run compaction during the check.
  • Periodic check: the leader validates that the hash for its latest revision matches the hashes of all peers and confirms that no follower has a revision newer than the leader's. It doesn't check slow peers that haven't seen the revision or that have run compaction during the check.

Design

Goals

  • The consistency check should work even when there are slow cluster members, and independently of ongoing compactions.

Requirements:

  • No/minimal performance impact when the check is enabled. The check needs to be enabled by default, so there should be no performance impact compared to the original check that was disabled by default.
  • No WAL or API changes. This will be a patch fix, so we cannot make such changes if we want to maintain compatibility.

Implementation

Evaluate consistency on compacted revisions by calculating hashes during compaction. Compactions are negotiated via raft, so they are executed by all members at the same revision, meaning slow members don't matter. During compaction we already need to touch bbolt, so calculating a hash at the same time should have only minimal performance cost. There are no API changes: the existing function that returns HashKV will just be extended to serve hashes for selected compacted revisions.
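Roughly, the idea looks like the sketch below; the function name, package and exact bucket handling are illustrative assumptions, not the actual patch.

```go
// A rough sketch only: names (hashCompacted, the corruptcheck package) are
// illustrative, not etcd's actual code. It assumes etcd's revision-keyed
// "key" bucket, where byte order equals revision order.
package corruptcheck

import (
	"bytes"
	"encoding/binary"
	"hash/crc32"

	bolt "go.etcd.io/bbolt"
)

// hashCompacted folds every key/value pair below the compaction horizon into
// a CRC32 checksum while the compaction is already iterating that range, so
// no extra full-keyspace scan is needed.
func hashCompacted(tx *bolt.Tx, compactMainRev int64) uint32 {
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))

	// Exclusive end marker: every key with main revision <= compactMainRev
	// sorts strictly below it.
	end := make([]byte, 8)
	binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))

	b := tx.Bucket([]byte("key"))
	if b == nil {
		return 0
	}
	c := b.Cursor()
	for k, v := c.First(); k != nil && bytes.Compare(k, end) < 0; k, v = c.Next() {
		h.Write(k)
		h.Write(v)
	}
	return h.Sum32()
}
```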

The only issue is that compactions are not run automatically by default; however, Kubernetes already runs them every 5 minutes, which should be enough to provide a significant improvement for the majority of etcd users.

Algorithm:

  • During each compaction, a member calculates a hash over the data being compacted and stores the last N hashes to be served by the HashKV method.
  • Every X minutes, the leader looks through the stored hashes for compacted revisions (starting from the newest one) and asks members for them. If all members respond with the correct hash, the leader removes that hash and all older ones. (A sketch of this cycle follows below.)
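A rough sketch of the leader-side cycle, with hypothetical types and names rather than etcd's real API:

```go
// Hypothetical leader-side check cycle; revisionHash, peerHashGetter and
// checkCycle are illustrative names, not etcd's real API.
package corruptcheck

type revisionHash struct {
	CompactRev int64  // revision the compaction ran at
	Hash       uint32 // hash computed over the compacted range
}

type peerHashGetter interface {
	// HashKV returns the peer's hash for a given compacted revision, or an
	// error if the peer has not computed it (yet).
	HashKV(rev int64) (uint32, error)
}

// checkCycle walks the locally stored hashes from newest to oldest and asks
// every peer for the same revision. It returns the newest revision confirmed
// by all peers (-1 if none), and whether any peer reported a mismatching hash.
func checkCycle(local []revisionHash, peers []peerHashGetter) (confirmed int64, corrupt bool) {
	for i := len(local) - 1; i >= 0; i-- { // local is ordered oldest -> newest
		rh := local[i]
		allMatched := true
		for _, p := range peers {
			h, err := p.HashKV(rh.CompactRev)
			if err != nil {
				// The peer may simply not have finished this compaction yet;
				// retry the revision on a later cycle instead of alerting.
				allMatched = false
				continue
			}
			if h != rh.Hash {
				return -1, true
			}
		}
		if allMatched {
			// This revision and everything older can be pruned from checking.
			return rh.CompactRev, false
		}
	}
	return -1, false
}
```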

cc @ptabor @ahrtr @spzala

ptabor (Contributor) commented May 13, 2022

Overall looks good, and I agree it's probably the best thing we can do.

The consistency check should work even when there are slow cluster members, and independently of ongoing compactions.

Just noticing that the design below is based on compaction and would not work if there were no compactions... so it is not truly 'independent'. Just nitpicking...

calculating a hash at the same time should have only minimal performance cost.

Just noticing that if you look at the compaction logic:

binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))

It only scans bbolt up to the 'compaction horizon', and the data in bbolt is ordered by revision.
So we are not traversing the fresh data at all. With this change we would need to iterate in the background across the whole bbolt.
I don't think it's a fundamental problem, as 'fresh' data is usually a minority of the storage.
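For context, a small illustrative sketch of the revision-keyed layout that makes this range scan stop at the horizon; the helper names are made up, and the encoding follows mvcc's revision keys as I understand them:

```go
// Illustrative helpers, not etcd's actual functions. Revision keys are
// big-endian encoded, so plain byte comparison in bbolt equals ordering by
// revision, and a cursor walk can stop exactly at the compaction horizon.
package corruptcheck

import "encoding/binary"

// revToKey mirrors the main-rev + '_' + sub-rev key layout used by the mvcc
// "key" bucket (17 bytes in total).
func revToKey(main, sub int64) []byte {
	k := make([]byte, 17)
	binary.BigEndian.PutUint64(k[0:8], uint64(main))
	k[8] = '_'
	binary.BigEndian.PutUint64(k[9:17], uint64(sub))
	return k
}

// compactionEnd builds the exclusive end marker used in the compaction scan:
// every key whose main revision is <= compactMainRev sorts strictly below it,
// so the fresh data above the horizon is never touched.
func compactionEnd(compactMainRev int64) []byte {
	end := make([]byte, 8)
	binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))
	return end
}
```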

Every X minutes, the leader looks through the stored hashes for compacted revisions (starting from the newest one) and asks members for them. If all members respond with the correct hash, the leader removes that hash and all older ones.

There is a challenge in deciding when to raise an alert when we don't observe matching checksums. The compaction request goes through raft, but the compaction is computed within the backend and finishes asynchronously. The only signal that it has completed is this update to bbolt:

UnsafeSetFinishedCompact(tx, compactMainRev)

but it doesn't seem to be exposed by any public API. It's hard to distinguish between:

  1. the remote node has not finished compaction yet, thus does not know the checksum for the revision
  2. the remote node has already restarted after the compaction, thus does not have this revision (and might be crashlooping more frequently than the compactions/checks happen)
  3. the remote node is really badly broken and not able to compact when requested.

A partial mitigation for 2 is to compute the checksum of the last compactedRev on boot, in addition to the head checksum.
A partial mitigation for 3 is to have some monitoring that compactions do finish... but that's not something we can backport to 3.5; rather, it's something customers should track themselves using Prometheus.

Do you have thoughts on this aspect?

ahrtr (Member) commented May 14, 2022

The existing data corruption detection mechanism indeed needs improvement.

It seems we can NOT achieve both "independent of ongoing compactions" and "no/minimal performance impact"; based on the proposal, we have to sacrifice one of them in the interest of the other. But I don't have a better idea right now.

One proposal is to add a new API and sub-command (i.e. data-corruption) under the existing check command, and let users decide when to check for data corruption.

serathius (Member Author) commented

Overall looks good, and I agree it's probably the best thing we can do.

The consistency check should work even when there are slow cluster members, and independently of ongoing compactions.

Just noticing that the design below is based on compaction and would not work if there were no compactions... so it is not truly 'independent'. Just nitpicking...

You are right. What I meant is that a compaction running during the corruption check should not cause it to skip validating a peer. Still, the proposed design needs compaction to be run.

calculating a hash at the same time should have only minimal performance cost.

Just noticing that if you look at the compaction logic:

binary.BigEndian.PutUint64(end, uint64(compactMainRev+1))

It only scans bbolt up to the 'compaction horizon', and the data in bbolt is ordered by revision. So we are not traversing the fresh data at all. With this change we would need to iterate in the background across the whole bbolt. I don't think it's a fundamental problem, as 'fresh' data is usually a minority of the storage.

Scanning the whole bbolt was not my intention; I meant to calculate the hash only on the compacted data, i.e. between the oldest revision and the compaction horizon.

  1. the remote node has not finished compaction yet, thus does not know the checksum for the revision

This is why I want the leader to store and ask about multiple different revisions during one cycle. This way it asks about the same revision multiple times, giving followers a chance to compact. There are still 2 things I need to confirm:

  • For a revision to be considered not corrupted, all followers need to respond with a hash within one check cycle. This is because I don't want to store any state about followers. However, this should not be a problem: each member will store a short history of revision->hash entries (see the sketch after this list), and the time needed to execute an async compaction should be much shorter than the compaction frequency (~minutes).
  • Is checking corruption for the latest revision enough to confirm that there is no corruption in older ones? Basically, the leader would check starting from the latest revision, and if all hashes match it can remove all other revisions from checking. To confirm this is correct I will need to read more of the compaction code, as I'm not sure whether compaction removes all changes that happened before the compaction horizon or just the history of changes (which is what I expect).
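A minimal sketch of that member-side history; the type and method names are hypothetical, not etcd's real code:

```go
// Hypothetical member-side store for the last N compaction hashes, so HashKV
// can serve them to the leader without the leader tracking follower state.
package corruptcheck

import "sync"

type compactionHash struct {
	CompactRev int64
	Hash       uint32
}

type hashHistory struct {
	mu     sync.Mutex
	hashes []compactionHash // ordered oldest -> newest
	limit  int              // N: how many compaction hashes to retain
}

// add records the hash computed during a compaction and evicts the oldest
// entries once more than `limit` are stored.
func (s *hashHistory) add(rev int64, hash uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.hashes = append(s.hashes, compactionHash{CompactRev: rev, Hash: hash})
	if len(s.hashes) > s.limit {
		s.hashes = s.hashes[len(s.hashes)-s.limit:]
	}
}

// get returns the stored hash for a compacted revision, if the member still
// has it; otherwise the leader retries on a later cycle.
func (s *hashHistory) get(rev int64) (uint32, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.hashes {
		if ch.CompactRev == rev {
			return ch.Hash, true
		}
	}
	return 0, false
}
```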
  2. the remote node has already restarted after the compaction, thus does not have this revision (and might be crashlooping more frequently than the compactions/checks happen)
    A partial mitigation for 2 is to compute the checksum of the last compactedRev on boot, in addition to the head checksum.

I don't think we can handle this one until hashes are stored in raft. I'm not sure how your proposed mitigation would work. Still, I think we are OK: if a cluster crashes every couple of minutes (the expected compaction frequency), it means there is already a problem that an admin should fix.

  3. the remote node is really badly broken and not able to compact when requested.

Not sure we can treat clusters that have members with broken compaction as healthy. Their storage would grow indefinitely and they would OOM. This is assuming a more K8s-like use case, where clusters have a lot of activity and require periodic compaction to function.

serathius added the priority/important-soon label on May 16, 2022
serathius (Member Author) commented

Sent an implementation of the hash calculation in #14049.
Please take a look to confirm that there is no/minimal performance impact.

serathius (Member Author) commented

Backporting this feature depends on it having no impact on users; however, there is still potential impact if there is a bug in the implementation. So I propose that we introduce a prerelease with this feature: we publish v3.5.5-rc.0, with a note that this is not a full release, and don't provide installation instructions.

This way we can do additional testing to catch bugs, like running the K8s CI. Only when we are sure that everything is OK do we make a full release. WDYT @ptabor @ahrtr

ahrtr (Member) commented May 19, 2022

Backporting this feature depends on it having no impact on users; however, there is still potential impact if there is a bug in the implementation. So I propose that we introduce a prerelease with this feature: we publish v3.5.5-rc.0, with a note that this is not a full release, and don't provide installation instructions.

This way we can do additional testing to catch bugs, like running the K8s CI. Only when we are sure that everything is OK do we make a full release. WDYT @ptabor @ahrtr

Basically it looks good to me. Although it's usually not proper to backport a non-trivial feature to a stable branch, and it's also the first time I've seen an RC version for a patch release, it can improve the maintainability of 3.5, and we need to support & maintain 3.5 for a long time. So it's OK!

serathius (Member Author) commented

Implemented
