etcd detects data corruption by default #14039
Overall looks good, and I agree it's probably the best thing we can do.
Just noticing that the design below is based on compaction and would not work if there were no compactions... so it is not truly 'independent'. Just nitpicking...
Just noticing that if you look at the compaction logic:
It only scans bbolt up to the 'compaction horizon', and the data in bbolt is ordered by 'revision' (a hedged sketch of hashing during that scan follows at the end of this comment).
There is a challenge about when to raise an alert when we don't observe matching checksums. The compaction request goes through raft, but the compaction is computed within the backend and finishes asynchronously. The only signal that it completed is updating bbolt here:
but it doesn't seem to be exposed by any public API. It's hard to distinguish between:
Partial mitigation for 2. is to compute the checksum of the last Do you have thoughts on this aspect?
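To make that concrete, here is a minimal sketch of what hashing during the compaction scan could look like, assuming etcd's bbolt layout (the "key" bucket, revisions encoded as big-endian ordered keys); the function name and crc32 choice are illustrative, not the actual etcd code:

```go
// Illustrative sketch only: checksum everything the compaction scan already
// visits, i.e. all revisions in the "key" bucket up to the compaction horizon.
package sketch

import (
	"bytes"
	"hash/crc32"

	bolt "go.etcd.io/bbolt"
)

// hashUpToCompactRev walks the "key" bucket in revision order (bbolt keys are
// byte-ordered and etcd encodes revisions big-endian) and hashes every
// key/value pair whose revision is <= compactRev.
func hashUpToCompactRev(db *bolt.DB, compactRev []byte) (uint32, error) {
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	err := db.View(func(tx *bolt.Tx) error {
		c := tx.Bucket([]byte("key")).Cursor()
		for k, v := c.First(); k != nil && bytes.Compare(k, compactRev) <= 0; k, v = c.Next() {
			h.Write(k)
			h.Write(v)
		}
		return nil
	})
	return h.Sum32(), err
}
```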
The existing data corruption detection mechanism indeed needs improvement. It seems we can NOT achieve both " One proposal is to add a new API and sub-command (i.e.
You are right, what I meant is that compaction during the corruption check should not cause it to skip validating a peer. Still, the proposed design needs compaction to be run.
Scanning the whole bbolt was not my intention; I meant to calculate the hash on the compacted data, i.e. between the oldest revision and the compaction horizon.
This is why I want the leader to store and ask about multiple different revisions during one cycle. This way it asks about the same revision multiple times, giving followers a chance to compact (see the sketch after this comment). There are still 2 things I need to confirm:
I don't think we can handle this one unless hashes are stored in raft. Not sure how your proposed mitigation would work. Still, I think we are OK: if the cluster crashes every couple of minutes (the expected compaction frequency), it means there is already a problem that the admin should fix.
Not sure if we can treat clusters that have members with broken compaction as healthy. Their storage would grow infinitely and they would OOM. This assumes the more typical K8s use case, where clusters have a lot of activity and require periodic compaction to function.
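A minimal sketch of the leader-side cycle described above; the peer interface, HashByRev, and the stored-hash window are assumptions for illustration, not etcd's actual API:

```go
// Hypothetical leader-side check: keep hashes for the last few compaction
// revisions and, each cycle, ask peers for those same revisions, so a
// follower that has not compacted yet simply gets retried next cycle.
package sketch

type compactionHash struct {
	Revision int64
	Hash     uint32
}

type peer interface {
	// HashByRev is assumed to return the peer's hash at a compacted revision,
	// or an error if the peer has not compacted that far yet.
	HashByRev(rev int64) (uint32, error)
}

// checkPeers returns false only when a peer reports a different hash for a
// revision both sides have compacted; that is the corruption signal.
func checkPeers(stored []compactionHash, peers []peer) bool {
	for _, ch := range stored { // several recent compaction revisions
		for _, p := range peers {
			h, err := p.HashByRev(ch.Revision)
			if err != nil {
				continue // peer not compacted to this revision yet; retry next cycle
			}
			if h != ch.Hash {
				return false // mismatch on a shared compacted revision
			}
		}
	}
	return true
}
```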
Sent implementation of hash calculation in #14049
Backporting this feature depends on it having no impact on users; however, there is still potential impact if there is a bug in the implementation. So I propose that we introduce a prerelease with this feature. We will publish v3.5.5-rc.0, with a note that this is not a full release, and not provide installation instructions. This way we can do additional testing to prevent bugs, like running the K8s CI. Only when we are sure that everything is OK do we make a full release. WDYT @ptabor @ahrtr
Basically it looks good to me. Although usually it's not proper to backport a non-trivial feature to a stable branch, and it's also the first time for me to see an RC version for a patch release, it can improve the maintainability of 3.5, and we need to support & maintain 3.5 for a long time. So it's OK!
Implemented
P0 action item proposed in https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md
Problems:
Proposal
Develop a dedicated v3.5 improvement to the data consistency check that will be enabled by default and backported to the v3.5 release. The scope of this patch will be heavily limited to avoid introducing any breaking changes or performance regressions.
I think it's worth investing in this change now to make sure v3.5 is in an acceptable state, instead of trying to rush the v3.6 release in a half-baked state. Due to the heavy restrictions it might turn out that there is no reasonable improvement we can make; still, I think it's worth considering.
Current state of consistency check
Design
Goals
Requirements:
Implementation
Evaluate consistency on compacted revisions by calculating hashes during compaction. Compactions are negotiated via raft, so they are executed by all members at the same revision, meaning slow members don't matter. During compaction we already need to touch bbolt, so calculating a hash at the same time should add only a minimal performance cost. No API changes: the existing function that returns HashKV will just be extended to serve hashes from selected compacted revisions (a rough sketch follows below).
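A rough sketch of what serving hashes from selected compacted revisions could look like; the type and method names below are illustrative, not the actual etcd backend types:

```go
// Illustrative only: remember hashes computed during recent compactions so a
// HashKV-style lookup can serve them for matching compacted revisions.
package sketch

import "sync"

type compactHash struct {
	rev  int64
	hash uint32
}

type hashStore struct {
	mu     sync.Mutex
	hashes []compactHash // small window of recent compaction hashes
}

// store is called at the end of a compaction with the hash computed while
// the compaction scan was touching bbolt anyway.
func (s *hashStore) store(rev int64, hash uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.hashes = append(s.hashes, compactHash{rev: rev, hash: hash})
	if len(s.hashes) > 10 { // keep only the most recent compactions
		s.hashes = s.hashes[1:]
	}
}

// hashByRev returns the stored hash for a compacted revision, if available,
// so the existing HashKV path can answer without rescanning the keyspace.
func (s *hashStore) hashByRev(rev int64) (uint32, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.hashes {
		if ch.rev == rev {
			return ch.hash, true
		}
	}
	return 0, false
}
```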
The only issue is that compactions are not run automatically by default; however, Kubernetes already runs them every 5 minutes, which should be enough to provide a significant improvement for the majority of etcd users.
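For clusters not managed by Kubernetes, enabling etcd's built-in periodic auto-compaction would give the check compacted revisions to compare; the flags below exist in current etcd, but the 5m value is just an example:

```sh
# Example only: run periodic compaction every 5 minutes so compaction-based
# hash checks regularly get fresh compacted revisions to compare.
etcd --auto-compaction-mode=periodic --auto-compaction-retention=5m
```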
Algorithm:
cc @ptabor @ahrtr @spzala