Offline migration of expiring TTL content in a cluster causes v3 store to be inconsistent across cluster members #8305
Comments
We did this twice in a row (upgraded, observed the staleness, restored from backup, ran through the whole procedure again) in our staging environment, and it occurred both times. We noticed that the apiserver on the same node as the etcd leader was the one affected both times (going to run it a third time to see if that is consistent).
I am not sure if this is an etcd issue, an etcd client issue, or a k8s apiserver issue. It needs some investigation first. @smarterclayton can you share the migration data somehow, so someone from the etcd team can easily reproduce the problem?
Unfortunately no, it's private internal data. Some of the things we have ruled out:
We do have a reproducible env, so we'll try to get additional data from the servers in order to debug.
so issuing reads from etcdctl returns new values, but calling etcd.read inside the apiserver returns stale data?
have you tried restarting the etcd server that the apiserver connects to? if restarting etcd does not solve the problem, then the issue is probably inside the apiserver or client code.
Yes
Yes, that will be the next attempt.
@smarterclayton also, can you try with a 1-node etcd cluster + 1 kube-apiserver?
So we don't have a setup where we can downgrade this particular cluster to a single instance of each. I will say that we have never observed this in v2 mode, and have not observed it yet in any cluster (this is the standard config for OpenShift, 3+3) that was started with etcd v3 mode on. We also don't believe we've observed it in non-migration 1+1 setups where the cluster was created from scratch. We did confirm that this occurred three times in a row, exactly the same, so we can consistently recreate the scenario with this particular cluster config. Will try to have more info by tomorrow.
We reproduced this again. This time we performed the migration, thought that the environment was sane, and some 30 minutes later discovered problems with stale data. We then restarted all api servers and the problem persisted. Then I restarted etcd on the leader and the problem went away without restarting the api server.
Interesting... If you use etcdctl to query the etcd leader, will you see stale data?
after further debugging, one of the etcd nodes actually has different data in its store, despite claiming to be up to date with the raft index. Our migration process:
After a period of time that varies, we observe the etcd nodes reporting different data for the same read queries. This persists across restarts of all etcd members, with all clients shut down, no writes occurring, and all etcd members reporting the same raft index.
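For context, the per-member conversion step in the offline procedure is an etcdctl migrate run against a stopped member's data directory; a minimal sketch, assuming a systemd-managed member and a default data directory (the unit name and path are placeholders):

```sh
# Sketch of the per-member offline v2 -> v3 conversion.
# Unit name and data dir are placeholders for this environment's actual values.
systemctl stop etcd
ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd
systemctl start etcd
```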
So this is NOT the case? We need to get a clear answer on this.
correct, using etcdctl directly against etcd returns different results depending on which member you query. https://bugzilla.redhat.com/show_bug.cgi?id=1475351#c8 has more details as well.
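For anyone trying to reproduce this, the divergence can be seen by issuing the same read against each member individually; a rough sketch, where the endpoints and the key are placeholders:

```sh
# Read the same key from each member separately; differing values or mod revisions
# indicate the members' v3 stores have diverged.
for ep in https://etcd-1:2379 https://etcd-2:2379 https://etcd-3:2379; do
  echo "== $ep"
  ETCDCTL_API=3 etcdctl --endpoints="$ep" get /registry/some/key -w json
done
```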
Can you actually summarize the problem and post it here? Thanks.
#8305 (comment) is a good summary
after this, can you check whether the hashes of the members match? (either via the gRPC hash API or by manually checking the contents)
the member data?
the data you can get from the kv store: a range over all keys.
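In case it's useful, one way to do that from the command line is to hash the full key range as seen by each member; a sketch, assuming writes are stopped, with placeholder endpoints:

```sh
# Hash the entire keyspace (keys and values) returned by each member;
# differing digests mean the kv stores have diverged.
for ep in https://etcd-1:2379 https://etcd-2:2379 https://etcd-3:2379; do
  echo -n "$ep  "
  ETCDCTL_API=3 etcdctl --endpoints="$ep" get "" --from-key | sha256sum
done
```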
we tracked this down to an issue migrating v2->v3 stores when the v2 store contains actively expiring TTL keys. Reproducer script: https://gist.github.com/liggitt/0feedecf5d6d1b51113bf58d10a22b4c. We followed the offline migration guide at https://coreos.com/etcd/docs/latest/op-guide/v2-migration.html#offline-migration, which did not mention issues migrating data stores containing actively expiring TTL keys. This text is not correct in the presence of TTL keys:
If differing content is migrated, it puts the mvcc stores in an inconsistent state that can affect future transactions on migrated data, or on new data. It seems like the following should be done:
It seems the problem is that the data on the members has not converged when you run the migration tool, due to expiring TTL keys. The doc assumes that the state of the members is consistent. Can you help improve the migration doc to make this clear?
The current doc recommends using a command that requires the master to be online. In the presence of TTLs there is no way for an online master to be sure it is consistent (expirations in the TTL store don't seem to increment the raft index?). Is there an offline command? In the presence of TTLs you'd have to continually stop and start the cluster until you were sure the stores were consistent, and only do the check while offline on each node.
If TTL expiration doesn't go through raft (is it correct that it does not?), how can you be sure the cluster members are consistent?
@smarterclayton TTL expiration goes through consensus. Related: https://github.com/coreos/etcd/blob/master/etcdserver/apply_v2.go#L122.
Ok. So we see the following:
RESULT: members are not consistent. If the expiration increments raft, how do we ensure that we don't race on shutdown with a TTL being applied? If you have any TTLs, then the currently recommended doc is wrong, because there is no way to prevent an expiration happening between steps 2 and 3. Even if we could check it offline, you'd basically just have to write a start / check / stop / check loop that runs until it gets lucky. That's pretty crazy.
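To make that concrete, the only workaround with the current tooling would be something like the following, which can still race between the check and the stop; a sketch, where the start/stop helpers and endpoints are placeholders:

```sh
# Start the cluster, compare per-member keyspace hashes, stop again, and repeat
# until every member happened to hash identically while it was up.
ENDPOINTS="https://etcd-1:2379 https://etcd-2:2379 https://etcd-3:2379"
while true; do
  start_all_etcd_members   # placeholder: start etcd on every host
  hashes=$(for ep in $ENDPOINTS; do
    ETCDCTL_API=3 etcdctl --endpoints="$ep" get "" --from-key | sha256sum | cut -d' ' -f1
  done)
  stop_all_etcd_members    # placeholder: stop etcd on every host
  [ "$(printf '%s\n' $hashes | sort -u | wc -l)" -eq 1 ] && break
done
```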
Also, the failure mode for a migration with inconsistent content is pretty bad. Is there a better process we are missing that the doc should recommend, to ensure the stores are consistent post-migration?
We are working on adding runtime hash checking. |
Closing in favor of #8348 so inconsistencies can be detected at boot via command line. |
Is there going to be a separate issue to track changing the doc? |
@smarterclayton I added a note on the etcdctl issue; it doesn't need to be tracked separately |
We have a 3-node etcd 3.1.9 cluster (with three Kubernetes 1.6 api servers contacting it) that we are upgrading from v2 mode to v3. Post upgrade, one of the api servers appears to be serving stale reads and watches from right about the time of the upgrade - a few of the API requests that call down into etcd retrieve current data, but a large number never see updates (and never see compaction either). I.e., if the cluster is at resource version 3,000,000 at upgrade, writes continue to the cluster, but while the other members report 3,012,000 after 20-30 minutes, the stale one is still returning GET/LIST resource versions at or near 3,000,000.
Scenario:
Outcome:
One of the three api servers responds to GET/LIST/WATCH with resource versions from before or right after the upgrade. It accepts writes, but never returns the results of writes to those key ranges. Other api servers serve reads and writes fine. All etcd instances report the same leader, have the same raft term, and report the same raft index. We verified that the api servers were calling down into etcd.
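For reference, the leader / raft term / raft index view that each member reports can be compared side by side from the command line; a sketch with placeholder endpoints:

```sh
# Print each member's ID, whether it is the leader, its raft term, and its raft index.
ETCDCTL_API=3 etcdctl endpoint status -w table \
  --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379
```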
After a restart of the affected apiserver, it begins serving up-to-date reads. We have not observed any subsequent stale reads.