Etcd Cluster with different revisions #6567

Closed
zbindenren opened this issue Oct 3, 2016 · 9 comments · Fixed by #6633
Comments

@zbindenren
Contributor

Cluster Version: v3.0.4
OS: RHEL7

I ran the defrag tool on our cluster and it unfortunately timed out. Afterwards I had an unhealthy cluster:

https://server007.pnet.ch:7002 is healthy: successfully committed proposal: took = 7.695945ms
https://server006.pnet.ch:7002 is unhealthy: failed to commit proposal: etcdserver: mvcc: required revision has been compacted
https://server008.pnet.ch:7002 is healthy: successfully committed proposal: took = 9.584828ms
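
For reference, this is roughly the kind of invocation involved; a minimal sketch, assuming the etcdctl v3 API (ETCDCTL_API=3) and the endpoints above, with TLS flags (--cacert/--cert/--key) omitted:

# defragment the backend of every member (this is the step that timed out)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://server006.pnet.ch:7002,https://server007.pnet.ch:7002,https://server008.pnet.ch:7002 \
  defrag

# check each endpoint afterwards
ETCDCTL_API=3 etcdctl \
  --endpoints=https://server006.pnet.ch:7002,https://server007.pnet.ch:7002,https://server008.pnet.ch:7002 \
  endpoint health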

If I check the revisions I see the following:

[{
    "Endpoint": "https://server006.pnet.ch:7002",
    "Status": {
        "header": {
            "cluster_id": 4645787881757527929,
            "member_id": 13952225244599230523,
            "revision": 12606424,
            "raft_term": 38
        },
        "version": "3.0.4",
        "dbSize": 1744896,
        "leader": 13952225244599230523,
        "raftIndex": 29770445,
        "raftTerm": 38
    }
}, {
    "Endpoint": "https://server007.pnet.ch:7002",
    "Status": {
        "header": {
            "cluster_id": 4645787881757527929,
            "member_id": 15253774746781001409,
            "revision": 12606428,
            "raft_term": 38
        },
        "version": "3.0.4",
        "dbSize": 1781760,
        "leader": 13952225244599230523,
        "raftIndex": 29770445,
        "raftTerm": 38
    }
}, {
    "Endpoint": "https://server008.pnet.ch:7002",
    "Status": {
        "header": {
            "cluster_id": 4645787881757527929,
            "member_id": 16318023537984784089,
            "revision": 12606428,
            "raft_term": 38
        },
        "version": "3.0.4",
        "dbSize": 1802240,
        "leader": 13952225244599230523,
        "raftIndex": 29770445,
        "raftTerm": 38
    }
}]
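
The status output above is what an endpoint status query returns; a minimal sketch of the command, assuming the etcdctl v3 API (-w json selects JSON output, TLS flags omitted):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://server006.pnet.ch:7002,https://server007.pnet.ch:7002,https://server008.pnet.ch:7002 \
  endpoint status -w json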

As you can see, the first (unhealthy) member has a different revision than the other two.

In the logs for that member I see a lot of:

Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>
Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>
Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>
Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>

On 006 I also see two panics:

journalctl -lu etcd |grep panic
Oct 03 14:56:42 server006 etcd[106313]: panic: unexpected error during txn
Oct 03 15:10:07 server006 etcd[68248]: panic: unexpected error during txn

Is it possible to fix that cluster or do I have to recreate it?

@xiang90
Contributor

xiang90 commented Oct 3, 2016

@zbindenren Can you reliably reproduce this issue? Did you use the lease feature?

To fix the cluster, you can remove 006 and add it back.

@zbindenren
Contributor Author

@xiang90 unfortunately not.

I tried as suggested (a rough command sketch follows the list):

  1. Remove failed member
  2. Stop failed member
  3. Remove data directory
  4. Add member again.
  5. Join Member
  6. Restart failed member
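
Roughly what those steps translate to on the command line; a minimal sketch assuming the etcdctl v3 API, with placeholder member ID, data directory and peer URL (the exact member add syntax differs between etcdctl versions; TLS flags omitted):

# 1./2. look up and remove the failed member, then stop etcd on server006
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 member list
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 member remove <member-id>
systemctl stop etcd        # on server006

# 3. remove the old data directory on server006 (<data-dir> is a placeholder)
rm -rf <data-dir>/member

# 4./5. add the member back with its peer URL so it can join the existing cluster
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 member add server006 --peer-urls=<peer-url>

# 6. restart etcd on server006 with --initial-cluster-state=existing so it joins rather than bootstraps
systemctl start etcd       # on server006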

Then I did a check:

https://server008.pnet.ch:7002 is healthy: successfully committed proposal: took = 7.067664ms 
https://server007.pnet.ch:7002 is healthy: successfully committed proposal: took = 7.765323ms
https://server006.pnet.ch:7002 is unhealthy: failed to commit proposal: etcdserver: mvcc: required revision has been compacted

And I again have a cluster with different revisions.

Did I do something wrong?

@xiang90
Contributor

xiang90 commented Oct 4, 2016

Does your cluster contain sensitive data? If not, can you send your data dir to me at [email protected]?

@xiang90
Contributor

xiang90 commented Oct 7, 2016

@zbindenren kindly ping.

@xiang90
Contributor

xiang90 commented Oct 11, 2016

@zbindenren If you cannot provide the data, can you somehow reproduce this?

@zbindenren
Contributor Author

@xiang90 sorry for the late response. I created a new cluster, so the old data is no longer available.

We ran into this state as follows:

  1. hit the quota limit with 3.5GB of data
  2. ran compaction
  3. ran into a timeout while running defrag

Now I have auto-compaction enabled, so we probably will not run into that situation again.
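
For completeness, a minimal sketch of the manual compaction/defrag sequence and of the auto-compaction setting, assuming the etcdctl v3 API and etcd v3.0 server flags; the revision number and timeout are example values only:

# compact the key-value history up to a given revision, then defragment;
# a --command-timeout longer than the default reduces the chance of the defrag timeout seen above
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 compaction 12606428
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 --command-timeout=60s defrag

# server side: compact automatically every hour instead of by hand
etcd --auto-compaction-retention=1 ...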

Thanks for your help.

@xiang90
Contributor

xiang90 commented Oct 11, 2016

  1. hit the quota limit with 3.5GB of data
  2. ran compaction
  3. ran into a timeout while running defrag

If you run those 3 steps, can you still reproduce the bad revision issue? We will give this a try.

@zbindenren
Contributor Author

I remember that the revision was pretty high when I ran compaction.

I didn't have the time to try to reproduce it.

@xiang90
Contributor

xiang90 commented Oct 12, 2016

@zbindenren Thanks for the hints. I think I know the root cause. Will fix soon.
