Etcd Cluster with different revisions #6567

Closed
zbindenren opened this issue Oct 3, 2016 · 9 comments · Fixed by #6633
Comments

@zbindenren
Contributor

Cluster Version: v3.0.4
OS: RHEL7

I ran the defrag tool on our cluster and it unfortunately timed out. Afterwards I had an unhealthy cluster:

https://server007.pnet.ch:7002 is healthy: successfully committed proposal: took = 7.695945ms
https://server006.pnet.ch:7002 is unhealthy: failed to commit proposal: etcdserver: mvcc: required revision has been compacted
https://server008.pnet.ch:7002 is healthy: successfully committed proposal: took = 9.584828ms
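
For reference, this is roughly the kind of invocation involved; a minimal sketch, assuming the etcdctl v3 API (ETCDCTL_API=3) and the endpoints above, with TLS flags (--cacert/--cert/--key) omitted:

# defragment the backend of every member (this is the step that timed out)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://server006.pnet.ch:7002,https://server007.pnet.ch:7002,https://server008.pnet.ch:7002 \
  defrag

# check each endpoint afterwards
ETCDCTL_API=3 etcdctl \
  --endpoints=https://server006.pnet.ch:7002,https://server007.pnet.ch:7002,https://server008.pnet.ch:7002 \
  endpoint health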

If I check the revisions I see the following:

[{
    "Endpoint": "https://server006.pnet.ch:7002",
    "Status": {
        "header": {
            "cluster_id": 4645787881757527929,
            "member_id": 13952225244599230523,
            "revision": 12606424,
            "raft_term": 38
        },
        "version": "3.0.4",
        "dbSize": 1744896,
        "leader": 13952225244599230523,
        "raftIndex": 29770445,
        "raftTerm": 38
    }
}, {
    "Endpoint": "https://server007.pnet.ch:7002",
    "Status": {
        "header": {
            "cluster_id": 4645787881757527929,
            "member_id": 15253774746781001409,
            "revision": 12606428,
            "raft_term": 38
        },
        "version": "3.0.4",
        "dbSize": 1781760,
        "leader": 13952225244599230523,
        "raftIndex": 29770445,
        "raftTerm": 38
    }
}, {
    "Endpoint": "https://server008.pnet.ch:7002",
    "Status": {
        "header": {
            "cluster_id": 4645787881757527929,
            "member_id": 16318023537984784089,
            "revision": 12606428,
            "raft_term": 38
        },
        "version": "3.0.4",
        "dbSize": 1802240,
        "leader": 13952225244599230523,
        "raftIndex": 29770445,
        "raftTerm": 38
    }
}]
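
The status output above is what an endpoint status query returns; a minimal sketch of the command, assuming the etcdctl v3 API (-w json selects JSON output, TLS flags omitted):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://server006.pnet.ch:7002,https://server007.pnet.ch:7002,https://server008.pnet.ch:7002 \
  endpoint status -w json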

As you can see, the first (unhealthy) member has a different revision than the other two.

In the logs for that member I see a lot of:

Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>
Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>
Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>
Oct 03 15:33:25 server006 etcd[68451]: invalid auth token: <token>

On 006 I also see two panics:

journalctl -lu etcd |grep panic
Oct 03 14:56:42 server006 etcd[106313]: panic: unexpected error during txn
Oct 03 15:10:07 server006 etcd[68248]: panic: unexpected error during txn

Is it possible to fix that cluster or do I have to recreate it?

@xiang90
Contributor

xiang90 commented Oct 3, 2016

@zbindenren Can you reliably reproduce this issue? Did you use the lease feature?

To fix the cluster, you can remove 006 and add it back.

@zbindenren
Contributor Author

@xiang90 unfortunately not.

I tried as suggested (a rough command sketch follows the list):

  1. Remove failed member
  2. Stop failed member
  3. Remove data directory
  4. Add member again.
  5. Join Member
  6. Restart failed member
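
Roughly what those steps translate to on the command line; a minimal sketch assuming the etcdctl v3 API, with placeholder member ID, data directory and peer URL (the exact member add syntax differs between etcdctl versions; TLS flags omitted):

# 1./2. look up and remove the failed member, then stop etcd on server006
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 member list
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 member remove <member-id>
systemctl stop etcd        # on server006

# 3. remove the old data directory on server006 (<data-dir> is a placeholder)
rm -rf <data-dir>/member

# 4./5. add the member back with its peer URL so it can join the existing cluster
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 member add server006 --peer-urls=<peer-url>

# 6. restart etcd on server006 with --initial-cluster-state=existing so it joins rather than bootstraps
systemctl start etcd       # on server006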

Then I did a check:

https://server008.pnet.ch:7002 is healthy: successfully committed proposal: took = 7.067664ms 
https://server007.pnet.ch:7002 is healthy: successfully committed proposal: took = 7.765323ms
https://server006.pnet.ch:7002 is unhealthy: failed to commit proposal: etcdserver: mvcc: required revision has been compacted

And I again have a cluster with different revisions.

Did I do something wrong?

@xiang90
Contributor

xiang90 commented Oct 4, 2016

Does your cluster contain sensitive data? If not, can you send your data dir to me at [email protected]?

@xiang90
Contributor

xiang90 commented Oct 7, 2016

@zbindenren kindly ping.

@xiang90
Contributor

xiang90 commented Oct 11, 2016

@zbindenren If you cannot provide the data, can you somehow reproduce this?

@zbindenren
Contributor Author

@xiang90 sorry for the late response. I created a new cluster, so the old data is no longer available.

We ran into this state as follows:

  1. hit the quota limit with 3.5GB of data
  2. ran compaction
  3. ran into a timeout while running defrag

Now I have auto-compaction enabled, so we probably will not run into that situation again.
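
For completeness, a minimal sketch of the manual compaction/defrag sequence and of the auto-compaction setting, assuming the etcdctl v3 API and etcd v3.0 server flags; the revision number and timeout are example values only:

# compact the key-value history up to a given revision, then defragment;
# a --command-timeout longer than the default reduces the chance of the defrag timeout seen above
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 compaction 12606428
ETCDCTL_API=3 etcdctl --endpoints=https://server007.pnet.ch:7002 --command-timeout=60s defrag

# server side: compact automatically every hour instead of by hand
etcd --auto-compaction-retention=1 ...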

Thanks for your help.

@xiang90
Contributor

xiang90 commented Oct 11, 2016

  1. hit the quota limit with 3.5GB of data
  2. ran compaction
  3. ran into a timeout while running defrag

If you run those 3 steps, can you still reproduce the bad revision issue? We will give this a try.

@zbindenren
Contributor Author

I remember that the revision was pretty high when I ran compaction.

I didn't have the time to try to reproduce it.

@xiang90
Contributor

xiang90 commented Oct 12, 2016

@zbindenren Thanks for the hints. I think I know the root cause. Will fix soon.
