
Conversation

@abhinavdahiya
Contributor

openshift/origin#21274 was merged, which allows the kubelet to run static pods even while it is still bootstrapping its certificates with the apiserver.
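
For context, the kubelet reads static pod manifests from a local directory and starts them without talking to the apiserver, which is why they can come up while certificate bootstrapping is still in flight. A minimal sketch of the relevant upstream KubeletConfiguration field, assuming the conventional manifest path (the path is an assumption here, not taken from this PR):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# directory the kubelet scans for static pod manifests (path assumed)
staticPodPath: /etc/kubernetes/manifests

Once the kubelet can reach the apiserver, it creates a read-only mirror pod for each static pod, which is what oc get pods reports.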

/hold

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Oct 29, 2018
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 29, 2018
@abhinavdahiya
Contributor Author

abhinavdahiya commented Oct 30, 2018

Testing locally, using the latest version of RHCOS from https://rhcos-release-browser-coreos.int.open.paas.redhat.com/html/:

[core@adahiya-0-master-0 ~]$ cat /etc/os-release
NAME="Red Hat CoreOS"
VERSION="4.0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.0"
PRETTY_NAME="Red Hat CoreOS 4.0"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat 7"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.0"
REDHAT_SUPPORT_PRODUCT="Red Hat"
REDHAT_SUPPORT_PRODUCT_VERSION="4.0"
OSTREE_VERSION=47.29

etcd is running as a static pod:

oc -n kube-system get pods etcd-member-adahiya-0-master-0
NAME                             READY     STATUS    RESTARTS   AGE
etcd-member-adahiya-0-master-0   1/1       Running   0          8m

The full pod spec, as mirrored to the apiserver:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/config.hash: c26521395dbc1da7d5ef74aca0694d11
    kubernetes.io/config.mirror: c26521395dbc1da7d5ef74aca0694d11
    kubernetes.io/config.seen: 2018-10-29T23:56:49.011541332Z
    kubernetes.io/config.source: file
  creationTimestamp: 2018-10-29T23:59:58Z
  labels:
    k8s-app: etcd
  name: etcd-member-adahiya-0-master-0
  namespace: kube-system
  resourceVersion: "524"
  selfLink: /api/v1/namespaces/kube-system/pods/etcd-member-adahiya-0-master-0
  uid: bd5393bf-dbd6-11e8-8085-62c04d92e4a1
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - |
      #!/bin/sh
      set -euo pipefail

      source /run/etcd/environment

      /usr/local/bin/etcd \
        --discovery-srv tt.testing \
        --initial-advertise-peer-urls=https://${ETCD_IPV4_ADDRESS}:2380 \
        --cert-file=/etc/ssl/etcd/system:etcd-server:${ETCD_DNS_NAME}.crt \
        --key-file=/etc/ssl/etcd/system:etcd-server:${ETCD_DNS_NAME}.key \
        --trusted-ca-file=/etc/ssl/etcd/ca.crt \
        --client-cert-auth=true \
        --peer-cert-file=/etc/ssl/etcd/system:etcd-peer:${ETCD_DNS_NAME}.crt \
        --peer-key-file=/etc/ssl/etcd/system:etcd-peer:${ETCD_DNS_NAME}.key \
        --peer-trusted-ca-file=/etc/ssl/etcd/ca.crt \
        --peer-client-cert-auth=true \
        --advertise-client-urls=https://${ETCD_IPV4_ADDRESS}:2379 \
        --listen-client-urls=https://0.0.0.0:2379 \
        --listen-peer-urls=https://0.0.0.0:2380 \
    env:
    - name: ETCD_DATA_DIR
      value: /var/lib/etcd
    - name: ETCD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: quay.io/coreos/etcd:v3.2.14
    imagePullPolicy: IfNotPresent
    name: etcd-member
    ports:
    - containerPort: 2380
      hostPort: 2380
      name: peer
      protocol: TCP
    - containerPort: 2379
      hostPort: 2379
      name: server
      protocol: TCP
    resources: {}
    securityContext: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run/etcd/
      name: discovery
    - mountPath: /etc/ssl/etcd
      name: certs
    - mountPath: /var/lib/etcd
      name: data-dir
  dnsPolicy: ClusterFirst
  hostNetwork: true
  initContainers:
  - args:
    - --discovery-srv=tt.testing
    - --output-file=/run/etcd/environment
    - --v=4
    image: docker.io/abhinavdahiya/origin-setup-etcd-environment
    imagePullPolicy: Always
    name: discovery
    resources: {}
    securityContext: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run/etcd/
      name: discovery
  - command:
    - /bin/sh
    - -c
    - |
      #!/bin/sh
      set -euo pipefail

      source /run/etcd/environment

      [ -e /etc/ssl/etcd/system:etcd-server:${ETCD_DNS_NAME}.crt -a \
        -e /etc/ssl/etcd/system:etcd-server:${ETCD_DNS_NAME}.key ] || \
        /usr/local/bin/kube-client-agent \
          request \
            --orgname=system:etcd-servers \
            --cacrt=/etc/ssl/etcd/root-ca.crt \
            --assetsdir=/etc/ssl/etcd \
            --address=https://adahiya-0-api.tt.testing:6443 \
            --dnsnames=localhost,etcd.kube-system.svc,etcd.kube-system.svc.cluster.local,${ETCD_DNS_NAME} \
            --commonname=system:etcd-server:${ETCD_DNS_NAME} \
            --ipaddrs=${ETCD_IPV4_ADDRESS},127.0.0.1 \

      [ -e /etc/ssl/etcd/system:etcd-peer:${ETCD_DNS_NAME}.crt -a \
        -e /etc/ssl/etcd/system:etcd-peer:${ETCD_DNS_NAME}.key ] || \
        /usr/local/bin/kube-client-agent \
          request \
            --orgname=system:etcd-peers \
            --cacrt=/etc/ssl/etcd/root-ca.crt \
            --assetsdir=/etc/ssl/etcd \
            --address=https://adahiya-0-api.tt.testing:6443 \
            --dnsnames=${ETCD_DNS_NAME},tt.testing \
            --commonname=system:etcd-peer:${ETCD_DNS_NAME} \
            --ipaddrs=${ETCD_IPV4_ADDRESS} \
    image: quay.io/coreos/kube-client-agent:678cc8e6841e2121ebfdb6e2db568fce290b67d6
    imagePullPolicy: IfNotPresent
    name: certs
    resources: {}
    securityContext: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /run/etcd/
      name: discovery
    - mountPath: /etc/ssl/etcd
      name: certs
  nodeName: adahiya-0-master-0
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    operator: Exists
  volumes:
  - emptyDir: {}
    name: discovery
  - hostPath:
      path: /etc/ssl/etcd
      type: ""
    name: certs
  - hostPath:
      path: /var/lib/etcd
      type: ""
    name: data-dir
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-10-29T23:57:17Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-10-29T23:57:26Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2018-10-29T23:56:49Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://93397f616d48322fbd1130d0921543f1971fa88f42e1ec25f078cd9c78937deb
    image: quay.io/coreos/etcd:v3.2.14
    imageID: quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1
    lastState: {}
    name: etcd-member
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2018-10-29T23:57:25Z
  hostIP: 192.168.126.11
  initContainerStatuses:
  - containerID: cri-o://d562d128a0c6f1f7933358d8cf8428f97497959aeffda587de3725e0bb0ef15e
    image: docker.io/abhinavdahiya/origin-setup-etcd-environment:latest
    imageID: docker.io/abhinavdahiya/origin-setup-etcd-environment@sha256:1535b9373527ded931114a497bcd61c3ebe5385e4eda05c3abcbc638366023cd
    lastState: {}
    name: discovery
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: cri-o://d562d128a0c6f1f7933358d8cf8428f97497959aeffda587de3725e0bb0ef15e
        exitCode: 0
        finishedAt: 2018-10-29T23:57:09Z
        reason: Completed
        startedAt: 2018-10-29T23:57:09Z
  - containerID: cri-o://c1788ae91c7ebc348e9fd3905091b2b853bfe8544efd5aee66c0bbd007ed9a34
    image: quay.io/coreos/kube-client-agent:678cc8e6841e2121ebfdb6e2db568fce290b67d6
    imageID: quay.io/coreos/kube-client-agent@sha256:8564ab65bcb1064006d2fc9c6e32a5ca3f4326cdd2da9a2efc4fb7cc0e0b6041
    lastState: {}
    name: certs
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: cri-o://c1788ae91c7ebc348e9fd3905091b2b853bfe8544efd5aee66c0bbd007ed9a34
        exitCode: 0
        finishedAt: 2018-10-29T23:57:17Z
        reason: Completed
        startedAt: 2018-10-29T23:57:16Z
  phase: Running
  podIP: 192.168.126.11
  qosClass: BestEffort
  startTime: 2018-10-29T23:56:49Z
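
The kubernetes.io/config.source: file and kubernetes.io/config.mirror annotations above confirm this is the mirror of a file-based static pod rather than an apiserver-scheduled one. A quick node-side sanity check, assuming the conventional static pod directory and CRI-O as the runtime (both assumptions, not verified in this PR):

# list static pod manifests on the node (path assumed)
ls /etc/kubernetes/manifests
# confirm the etcd container is running under CRI-O
sudo crictl ps --name etcd-member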

@abhinavdahiya abhinavdahiya changed the title from "WIP: deploy etcd as static pod" to "deploy etcd as static pod" Oct 30, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 30, 2018
@abhinavdahiya
Contributor Author

/hold cancel

/cc @crawford @aaronlevy @deads2k

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2018
@abhinavdahiya
Contributor Author

From one of the masters installed by CI:

[core@ip-10-0-11-127 ~]$ cat /etc/os-release
NAME="Red Hat CoreOS"
VERSION="4.0.6953"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.0"
PRETTY_NAME="Red Hat CoreOS 4.0.6953"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat 7"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.0"
REDHAT_SUPPORT_PRODUCT="Red Hat"
REDHAT_SUPPORT_PRODUCT_VERSION="4.0"
OSTREE_VERSION=4.0.6953

This build does not yet have the updated kubelet from openshift/origin#21274.
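
A quick way to check which kubelet a node carries is the standard version flag; the reported build can then be matched against that merge:

kubelet --version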

@ashcrow
Member

ashcrow commented Oct 30, 2018

/test e2e-aws

@ashcrow
Member

ashcrow commented Oct 30, 2018

@abhinavdahiya I think this was brought up in chat, but RHCOS pipeline v1 is in a frozen state (though still producing images); RHCOS pipeline v2 is picking up the latest kubelet changes a bit faster.

@abhinavdahiya
Contributor Author

Requires the installer moving to the new RHCOS pipeline :( openshift/installer#554

@abhinavdahiya
Contributor Author

From the bootstrap node in CI:

[core@ip-10-0-9-10 ~]$ sudo oc --config /opt/tectonic/auth/kubeconfig get pods -n kube-system
NAME                                                 READY     STATUS    RESTARTS   AGE
etcd-member-ip-10-0-1-182.ec2.internal               1/1       Running   0          5m
etcd-member-ip-10-0-16-168.ec2.internal              1/1       Running   0          5m
etcd-member-ip-10-0-35-16.ec2.internal               1/1       Running   0          5m
kube-controller-manager-8mmzl                        1/1       Running   0          6m
kube-controller-manager-qxft8                        1/1       Running   0          6m
kube-controller-manager-vvxr5                        1/1       Running   0          6m
kube-dns-787c975867-lrbxm                            3/3       Running   0          6m
kube-flannel-9m6v7                                   2/2       Running   0          4m
kube-flannel-cxcxn                                   2/2       Running   0          4m
kube-flannel-lvlcb                                   2/2       Running   0          4m
kube-proxy-8kfhr                                     1/1       Running   0          6m
kube-proxy-c6m62                                     1/1       Running   0          6m
kube-proxy-fddw4                                     1/1       Running   0          6m
kube-scheduler-7msvs                                 1/1       Running   0          6m
kube-scheduler-8lrkr                                 1/1       Running   0          6m
kube-scheduler-zdpxr                                 1/1       Running   0          6m
metrics-server-5767bfc576-7c845                      0/2       Pending   0          1m
pod-checkpointer-7wp2b                               1/1       Running   0          6m
pod-checkpointer-7wp2b-ip-10-0-16-168.ec2.internal   1/1       Running   0          6m
pod-checkpointer-cdg5w                               1/1       Running   0          6m
pod-checkpointer-cdg5w-ip-10-0-1-182.ec2.internal    1/1       Running   0          6m
pod-checkpointer-hwgwr                               1/1       Running   0          6m
pod-checkpointer-hwgwr-ip-10-0-35-16.ec2.internal    1/1       Running   0          6m
tectonic-network-operator-k8qx6                      1/1       Running   0          6m
tectonic-network-operator-s46sd                      1/1       Running   0          6m
tectonic-network-operator-vhj7s                      1/1       Running   0          6m
[core@ip-10-0-9-10 ~]$ cat /etc/os-release
NAME="Red Hat CoreOS"
VERSION="4.0"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.0"
PRETTY_NAME="Red Hat CoreOS 4.0"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat 7"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.0"
REDHAT_SUPPORT_PRODUCT="Red Hat"
REDHAT_SUPPORT_PRODUCT_VERSION="4.0"
OSTREE_VERSION=47.38

etcd is running as static pods.
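
For anyone reproducing this, membership can also be verified with etcdctl using the CA and certs mounted at /etc/ssl/etcd in the manifest above. The certificate file names embed the member's ETCD_DNS_NAME, so the paths below are illustrative placeholders, and this assumes the server certificate also permits client auth (any client cert signed by the etcd CA would do otherwise):

# illustrative sketch; substitute the member's actual cert/key file names
export ETCDCTL_API=3
sudo etcdctl \
  --endpoints=https://localhost:2379 \
  --cacert=/etc/ssl/etcd/ca.crt \
  --cert='/etc/ssl/etcd/system:etcd-server:<ETCD_DNS_NAME>.crt' \
  --key='/etc/ssl/etcd/system:etcd-server:<ETCD_DNS_NAME>.key' \
  member list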

@abhinavdahiya
Contributor Author

The workers cannot reach the Ignition server due to an AWS load balancer error.
cc @crawford

@ashcrow
Member

ashcrow commented Oct 31, 2018

Looks like the same issue we keep hitting with #126

@ashcrow
Member

ashcrow commented Oct 31, 2018

Yeah, we're hitting the same thing here with e2e. Mine was able to get through after ~7 retries.

@ashcrow
Member

ashcrow commented Oct 31, 2018

/test e2e-aws

@aaronlevy left a comment

Couple questions and some minor nits about follow-up tasks.


nit: Is there a location where this can be reviewed? It's somewhat opaque here. I feel like this should be in its final location and reviewed there as well before landing. If we merge as-is, please open issues tracking that we need to fix this.

Contributor

Sorry for the late nit, but can we change the name to something that indicates it's scoped to bootstrap, installer, or machine? "installer-bootstrap-etcd" or "machine-master-etcd"? I would just prefer a bit of scoping.

Contributor Author

@smarterclayton I opened an issue to track the rename: #152

I will fix this 😇


nit: we can't release with this, so it should be tracked as an open issue.


Does this need to be privileged?


Does this need to be privileged?


nit: open issue to fix this.


Do we ever want ETCD_DNS_NAME to differ from the node name? The node name already has to resolve to a routable address.

Contributor Author


ETCD_DNS_NAME is the name in the SRV record that resolves to this machine's IP.
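
Concretely, etcd's DNS discovery resolves the _etcd-server-ssl._tcp (and _etcd-server._tcp) SRV records under the --discovery-srv domain, and each SRV target is a member's ETCD_DNS_NAME. A quick lookup against the domain used in this test:

dig +short SRV _etcd-server-ssl._tcp.tt.testing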


ack

@aaronlevy

I'll give my unofficial lgtm (I'm not in OWNERS). It would be nice if we could reduce the need for privileged containers, but we don't need to block on that.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2018
@abhinavdahiya
Contributor Author

/hold

need to wait for approval from @crawford and/or @smarterclayton

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2018
@aaronlevy

I was not aware that a rando (me) could lgtm and it would merge without another approver...

@abhinavdahiya
Contributor Author

/retest

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 1, 2018
@crawford
Contributor

crawford commented Nov 2, 2018

/lgtm
/hold cancel

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 2, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aaronlevy, abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya
Contributor Author

/refresh

@abhinavdahiya abhinavdahiya added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Nov 2, 2018
@openshift-merge-robot openshift-merge-robot merged commit 79a629c into openshift:master Nov 2, 2018
osherdp pushed a commit to osherdp/machine-config-operator that referenced this pull request Apr 13, 2021
…anup

Bug 1705753: some godoc changes on the config CRD type for better oc explain behavior