
Better handling for >1 matching EBS volumes #281

Open
mtolan opened this issue Dec 9, 2019 · 0 comments

We've been using kops for a few years, and prior to the introduction of etcd-manager we relied on our own EBS backup strategy. As a result, our AWS account contained a number of older etcd volumes that matched the tags etcd-manager uses to select and mount its storage.
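
(For reference, one way to see how many volumes are in this state is to reproduce the lookup that etcd-manager reports at startup: a DescribeVolumes call filtered on the same tags. The Go sketch below talks to the EC2 API directly with aws-sdk-go and hard-codes our cluster tag; it is not etcd-manager's own code in volumes.go, just an illustration of the query.)

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/ec2"
    )

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        svc := ec2.New(sess)

        // The same tags etcd-manager reports when mounting the main etcd volume;
        // the cluster tag value is specific to our environment.
        out, err := svc.DescribeVolumes(&ec2.DescribeVolumesInput{
            Filters: []*ec2.Filter{
                {Name: aws.String("tag-key"), Values: []*string{aws.String("k8s.io/etcd/main")}},
                {Name: aws.String("tag:k8s.io/role/master"), Values: []*string{aws.String("1")}},
                {Name: aws.String("tag:kubernetes.io/cluster/kube.us-east-1.dev.deploys.brightcove.com"), Values: []*string{aws.String("owned")}},
            },
        })
        if err != nil {
            log.Fatal(err)
        }
        // More than one line of output here is the situation this issue describes.
        for _, v := range out.Volumes {
            fmt.Printf("%s  created=%s  state=%s\n",
                aws.StringValue(v.VolumeId), v.CreateTime.Format(time.RFC3339), aws.StringValue(v.State))
        }
    }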

The first host that came up after the rolling-update that installed etcd-manager had the following in its logs:

I1209 16:26:18.958984   18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.959912   18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.960468   18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.961016   18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.961520   18726 main.go:254] Mounting available etcd volumes matching tags [k8s.io/etcd/main k8s.io/role/master=1 kubernetes.io/cluster/kube.us-east-1.dev.deploys.brightcove.com=owned]; nameTag=k8s.io/etcd/main
I1209 16:26:18.962655   18726 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I1209 16:26:19.152540   18726 mounter.go:302] Currently attached volumes: [0xc00025af00]
I1209 16:26:19.152574   18726 mounter.go:64] Master volume "vol-0a5a75bec90179bd8" is attached at "/dev/xvdu"
I1209 16:26:19.152590   18726 mounter.go:78] Doing safe-format-and-mount of /dev/xvdu to /mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.152604   18726 volumes.go:233] volume vol-0a5a75bec90179bd8 not mounted at /rootfs/dev/xvdu
I1209 16:26:19.152639   18726 volumes.go:247] found nvme volume "nvme-Amazon_Elastic_Block_Store_vol0a5a75bec90179bd8" at "/dev/nvme1n1"
I1209 16:26:19.152652   18726 mounter.go:116] Found volume "vol-0a5a75bec90179bd8" mounted at device "/dev/nvme1n1"
I1209 16:26:19.153151   18726 mounter.go:173] Device already mounted on "/mnt/master-vol-0a5a75bec90179bd8", verifying it is our device
I1209 16:26:19.153167   18726 mounter.go:185] Found existing mount of "/dev/nvme1n1" at "/mnt/master-vol-0a5a75bec90179bd8"
I1209 16:26:19.153241   18726 mount_linux.go:164] Detected OS without systemd
I1209 16:26:19.153789   18726 mounter.go:226] matched device "/dev/nvme1n1" and "/dev/nvme1n1" via '\x00'
I1209 16:26:19.153803   18726 mounter.go:86] mounted master volume "vol-0a5a75bec90179bd8" on /mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.153816   18726 main.go:269] discovered IP address: 10.250.16.215
I1209 16:26:19.153823   18726 main.go:274] Setting data dir to /rootfs/mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.154260   18726 server.go:71] starting GRPC server using TLS, ServerName="etcd-manager-server-etcd-b"
I1209 16:26:19.154403   18726 s3context.go:331] product_uuid is "ec2004e4-d619-9524-bf5b-e56ce28c2bd6", assuming running on EC2
I1209 16:26:19.155152   18726 s3context.go:164] got region from metadata: "us-east-1"
I1209 16:26:19.212772   18726 s3context.go:210] found bucket in region "us-east-1"
I1209 16:26:19.212798   18726 s3fs.go:128] Writing file "s3://com.brightcove.deploys.dev.kube.dev-us-east-1/kube.us-east-1.dev.deploys.brightcove.com/backups/etcd/main/control/etcd-cluster-created"
I1209 16:26:19.212816   18726 s3context.go:238] Checking default bucket encryption for "com.brightcove.deploys.dev.kube.dev-us-east-1"
W1209 16:26:19.272282   18726 controller.go:135] not enabling TLS for etcd, this is insecure
I1209 16:26:19.272306   18726 server.go:89] GRPC server listening on "10.250.16.215:3996"
I1209 16:26:19.272403   18726 etcdserver.go:534] starting etcd with state cluster:<cluster_token:"ckDjqRPhIBJGj0dtx6qVlw" nodes:<name:"etcd-a" peer_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > nodes:<name:"etcd-b" peer_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > nodes:<name:"etcd-c" peer_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > > etcd_version:"2.2.1" 
I1209 16:26:19.272549   18726 etcdserver.go:543] starting etcd with datadir /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
I1209 16:26:19.272548   18726 volumes.go:85] AWS API Request: ec2/DescribeVolumes
W1209 16:26:19.272599   18726 pki.go:46] not generating peer keypair as peers-ca not set
W1209 16:26:19.272626   18726 pki.go:84] not generating client keypair as clients-ca not set
I1209 16:26:19.272703   18726 etcdprocess.go:180] executing command /opt/etcd-v2.2.1-linux-amd64/etcd [/opt/etcd-v2.2.1-linux-amd64/etcd]
W1209 16:26:19.272749   18726 etcdprocess.go:234] using insecure configuration for etcd peers
W1209 16:26:19.272774   18726 etcdprocess.go:243] using insecure configuration for etcd clients
2019-12-09 16:26:19.277754 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001
2019-12-09 16:26:19.277784 I | flags: recognized and used environment variable ETCD_DATA_DIR=/rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.277799 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380
2019-12-09 16:26:19.277814 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-a=http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:2380,etcd-b=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380,etcd-c=http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:2380
2019-12-09 16:26:19.277820 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2019-12-09 16:26:19.277830 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.277838 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:4001
2019-12-09 16:26:19.277848 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2019-12-09 16:26:19.277859 I | flags: recognized and used environment variable ETCD_NAME=etcd-b
2019-12-09 16:26:19.277889 W | flags: unrecognized environment variable ETCD_LISTEN_METRICS_URLS=
2019-12-09 16:26:19.277934 I | etcdmain: etcd Version: 2.2.1
2019-12-09 16:26:19.277938 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2019-12-09 16:26:19.277941 I | etcdmain: Go Version: go1.12.5
2019-12-09 16:26:19.277945 I | etcdmain: Go OS/Arch: linux/amd64
2019-12-09 16:26:19.277949 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2019-12-09 16:26:19.277992 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-12-09 16:26:19.278095 I | etcdmain: listening for peers on http://0.0.0.0:2380
2019-12-09 16:26:19.278118 I | etcdmain: listening for client requests on http://0.0.0.0:4001
2019-12-09 16:26:19.371091 I | etcdserver: recovered store from snapshot at index 380038
2019-12-09 16:26:19.371126 I | etcdserver: name = etcd-b
2019-12-09 16:26:19.371130 I | etcdserver: data dir = /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.371134 I | etcdserver: member dir = /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw/member
2019-12-09 16:26:19.371138 I | etcdserver: heartbeat = 100ms
2019-12-09 16:26:19.371140 I | etcdserver: election = 1000ms
2019-12-09 16:26:19.371144 I | etcdserver: snapshot count = 10000
2019-12-09 16:26:19.371155 I | etcdserver: advertise client URLs = http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001
2019-12-09 16:26:19.371185 I | etcdserver: loaded cluster information from store: <nil>
I1209 16:26:19.373963   18726 volumes.go:85] AWS API Request: ec2/DescribeInstances
2019-12-09 16:26:19.412180 I | etcdserver: restarting member a8bc606d954cb360 in cluster 362b3eb57d5b3247 at commit index 386849
2019-12-09 16:26:19.412557 I | raft: a8bc606d954cb360 became follower at term 826
2019-12-09 16:26:19.412578 I | raft: newRaft a8bc606d954cb360 [peers: [21c1cba54be22c9a,85558b08fd6377a2,a8bc606d954cb360], term: 826, commit: 386849, applied: 380038, lastindex: 386849, lastterm: 20]
2019-12-09 16:26:19.419037 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.419054 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.419064 E | rafthttp: failed to dial 21c1cba54be22c9a on stream Message (cluster ID mismatch)
2019-12-09 16:26:19.419073 E | rafthttp: failed to dial 21c1cba54be22c9a on stream MsgApp v2 (cluster ID mismatch)
2019-12-09 16:26:19.419903 I | etcdserver: starting server... [version: 2.2.1, cluster version: 2.2]
2019-12-09 16:26:19.422255 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[85558b08fd6377a2]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.422279 E | rafthttp: failed to dial 85558b08fd6377a2 on stream Message (cluster ID mismatch)
2019-12-09 16:26:19.422552 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[85558b08fd6377a2]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.422565 E | rafthttp: failed to dial 85558b08fd6377a2 on stream MsgApp v2 (cluster ID mismatch)
I1209 16:26:19.439578   18726 peers.go:101] found new candidate peer from discovery: etcd-a [{10.250.17.141 0} {10.250.17.141 0}]
I1209 16:26:19.439616   18726 peers.go:101] found new candidate peer from discovery: etcd-b [{10.250.16.215 0} {10.250.16.215 0}]
I1209 16:26:19.439629   18726 peers.go:101] found new candidate peer from discovery: etcd-c [{10.250.18.173 0} {10.250.18.173 0}]
I1209 16:26:19.439703   18726 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I1209 16:26:19.439733   18726 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.17.141 10.250.17.141] etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.16.215 10.250.16.215] etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.18.173 10.250.18.173]], final=map[10.250.16.215:[etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com] 10.250.17.141:[etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com] 10.250.18.173:[etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com]]
I1209 16:26:19.439903   18726 peers.go:281] connecting to peer "etcd-c" with TLS policy, servername="etcd-manager-server-etcd-c"
I1209 16:26:19.439982   18726 peers.go:281] connecting to peer "etcd-b" with TLS policy, servername="etcd-manager-server-etcd-b"
W1209 16:26:19.440686   18726 peers.go:325] unable to grpc-ping discovered peer 10.250.18.173:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.250.18.173:3996: connect: connection refused"
I1209 16:26:19.440719   18726 peers.go:347] was not able to connect to peer etcd-c: map[10.250.18.173:3996:true]
W1209 16:26:19.440745   18726 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-c
W1209 16:26:19.441043   18726 peers.go:325] unable to grpc-ping discovered peer 10.250.17.141:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.250.17.141:3996: connect: connection refused"
I1209 16:26:19.441077   18726 peers.go:347] was not able to connect to peer etcd-a: map[10.250.17.141:3996:true]
W1209 16:26:19.441096   18726 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
2019-12-09 16:26:19.520566 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)

Since this was the first of the three hosts that would eventually run etcd-manager, the gossip/peer-discovery warnings are expected. The cluster ID mismatch errors are more significant: they are the consequence of etcd-manager mounting a volume that was several months old, whose data came from an earlier etcd cluster (hence the differing cluster IDs).

Some of the approaches that occur to me to address this are:

  1. Order matching etcd volumes by creation date and select the most recent by default (a sketch of this follows after the next paragraph)
  2. Log a warning when multiple volumes match etcd-manager's search for what to mount, but continue to start etcd as we do now
  3. Log a warning when multiple volumes match, and prevent etcd from starting

I think the first option would suffice in most cases, but the other two would leave less room for unexpected behavior in the future.
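
A minimal sketch of what option 1 could look like, written against aws-sdk-go's ec2.Volume type rather than etcd-manager's actual mounter code; the package name, function name, and log wording are just illustrative:

    package volumepick

    import (
        "log"
        "sort"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/service/ec2"
    )

    // pickNewestVolume is option 1: if more than one volume matched the etcd
    // tag filter, warn and prefer the most recently created one. A variant for
    // option 3 would return an error here instead of picking a volume.
    func pickNewestVolume(vols []*ec2.Volume) *ec2.Volume {
        if len(vols) == 0 {
            return nil
        }
        // Sort newest CreateTime first.
        sort.Slice(vols, func(i, j int) bool {
            return vols[i].CreateTime.After(*vols[j].CreateTime)
        })
        if len(vols) > 1 {
            log.Printf("warning: %d EBS volumes match the etcd volume tags; preferring newest %s (created %s); the others are likely stale",
                len(vols), aws.StringValue(vols[0].VolumeId), vols[0].CreateTime.Format(time.RFC3339))
        }
        return vols[0]
    }

Option 2 would keep today's selection behavior and only emit the warning.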
