
Cluster not reachable #932

Open
mjudeikis opened this issue Aug 11, 2020 · 3 comments

mjudeikis (Contributor) commented Aug 11, 2020

I have a suspicion that if we run a pod on each node (in this example, in the form of a DaemonSet), mount the /tmp filesystem, and produce big files, it might crash the whole control plane.

This should not happen as individual pods should be limited in how much space they can consume from tmpfs.
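For context on that expectation, Kubernetes can cap a pod's local scratch usage via ephemeral-storage requests and limits. A hypothetical pod fragment sketching this (the name and values are illustrative, not from this issue; note that these limits only account for the container's writable layer and emptyDir volumes, not hostPath mounts, which matters later in this thread):

```yaml
# Illustrative only: the kubelet evicts the pod if its writable layer
# plus emptyDir usage exceeds the ephemeral-storage limit.
# NOTE: hostPath volume writes are NOT counted against this limit.
apiVersion: v1
kind: Pod
metadata:
  name: tmp-writer-limited   # hypothetical name
spec:
  containers:
  - name: writer
    image: registry.access.redhat.com/ubi8/ubi
    command: ["/bin/bash", "-c", "dd if=/dev/zero of=/scratch/big bs=4k count=1000000; sleep 3600"]
    resources:
      limits:
        ephemeral-storage: 1Gi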

If this code/DaemonSet (http://git.bytheb.org/cgit/stap.git/) is running on the cluster, the cluster is not usable after 1-2 days.

We need to do some debugging to understand why this is happening. Potentially reproduce it in a cluster with local-RP, persist=true, and a similar pod that produces a lot of data.

@m1kola m1kola self-assigned this Aug 11, 2020
m1kola (Contributor) commented Aug 21, 2020

In the example DaemonSet I see that the pods get deployed to the masters and mount a hostPath from the host machine.

I think the control plane becomes unreachable because the master nodes run out of disk space. Once there is no disk space left on a node, etcd stops running on that node. We run three etcd replicas, one on each master node, so when we run out of disk on two master nodes etcd loses quorum. When we lose etcd quorum we lose the API server too.

I prepared a reproducer for this:

  1. Create a cluster. Make sure that you can access master nodes via SSH.

  2. Apply the following manifest:

    master-tmp-writer.yaml:

```yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: master-tmp-writer
  namespace: default
  annotations:
    kubernetes.io/description: This DaemonSet launches containers that write a lot of data into a hostPath on master hosts.
spec:
  selector:
    matchLabels:
      app: master-tmp-writer
  template:
    metadata:
      labels:
        app: master-tmp-writer
    spec:
      priorityClassName: "system-cluster-critical"
      containers:
      - name: master-tmp-writer
        image: registry.access.redhat.com/ubi8/ubi
        command:
        - /bin/bash
        - -c
        - |
          i=0
          while :
          do
              dd if=/dev/zero of=/tmp/fake-file-$i bs=4k iflag=fullblock,count_bytes count=10G
              sleep 30
              ((i++))
          done
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /tmp
          name: host-tmp
        terminationMessagePolicy: FallbackToLogsOnError
      nodeSelector:
        node-role.kubernetes.io/master: ""
        beta.kubernetes.io/os: "linux"
        master-to-target: ""
      volumes:
      - name: host-tmp
        hostPath:
          path: /tmp
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
      - key: "node.kubernetes.io/network-unavailable"
        operator: "Exists"
```

    This manifest creates a DaemonSet which deploys a pod on each master node carrying the label master-to-target. The pod writes 10G files into /tmp on the master host, with 30 seconds between writes.
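    At a small scale, the dd invocation used above behaves like this (a local sketch assuming GNU coreutils; iflag=count_bytes makes count a byte total rather than a block count, which is why count=10G produces 10 GiB files):

```shell
# Write a 1 MiB file the same way the DaemonSet writes 10G files,
# then verify its size: count_bytes makes `count` a byte total.
dd if=/dev/zero of=/tmp/demo-fake-file bs=4k iflag=fullblock,count_bytes count=1M 2>/dev/null
stat -c %s /tmp/demo-fake-file   # prints 1048576
rm -f /tmp/demo-fake-file
```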

  3. Add the master-to-target label to master nodes:

    oc label node $MASTERNODENAME master-to-target=""
  4. SSH into the node and watch df -h to see the node run out of disk space.

  5. In a setup with 3 master nodes, the API server will stop responding once we run out of disk on 2 out of 3 nodes.
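The 2-out-of-3 threshold in the last step is just etcd's quorum rule (a strict majority of members must be healthy); a tiny sketch of the arithmetic:

```shell
# etcd needs floor(n/2)+1 healthy members to maintain quorum,
# so a 3-member cluster tolerates only 1 failed member.
members=3
quorum=$(( members / 2 + 1 ))       # 2
tolerated=$(( members - quorum ))   # 1
echo "quorum=$quorum tolerated_failures=$tolerated"
# prints: quorum=2 tolerated_failures=1
```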

ehashman (Contributor) commented:
@m1kola is that the right reproducer for this? Typically intensive customer workloads will never be scheduled on the masters. If they run out of disk space, etcd is going to have issues, because etcd needs to write to disk to reach consensus. That's not specific to /tmp; it would be true of any host-mounted directory.

From what I can tell, nodes do not have /tmp on a tmpfs; it appears to be mounted on the host disk:

```
ehashman@red-dot:~$ oc debug node/aro-v4-shared-gxqb4-worker-eastus1-xh9ml
sh-4.2# df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
overlay         128G   17G  112G  13% /
```
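A quick way to check the same thing on any Linux host (a local sketch assuming GNU coreutils): print the filesystem type backing /tmp. "tmpfs" would mean RAM-backed; "overlayfs", "xfs", or "ext2/ext3" means /tmp shares the host disk, as in the output above.

```shell
# stat -f reports the filesystem type of the mount backing /tmp.
stat -f -c %T /tmp
```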

@mjudeikis can you clarify what we are looking to reproduce?

m1kola (Contributor) commented Aug 24, 2020

@ehashman correct: it is not related to tmpfs, and this can happen with any host-mounted dir. I think there is not much we (or upstream) can do to protect against this. The only more or less reasonable option is to have /tmp on a filesystem separate from the root fs: that would make it harder to break the cluster for people who use /tmp as a staging/trash directory. But I'm not sure it is worth the effort.
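One way to realize that separation on systemd hosts would be the stock tmp.mount unit, which puts /tmp on a size-capped tmpfs so a runaway writer hits the cap instead of exhausting the root filesystem. A sketch of the standard unit (illustrative, not something ARO ships):

```ini
# Illustrative /etc/systemd/system/tmp.mount; enable with
# `systemctl enable --now tmp.mount`. size= caps /tmp at half of RAM.
[Unit]
Description=Temporary Directory (/tmp)
ConditionPathIsSymbolicLink=!/tmp

[Mount]
What=tmpfs
Where=/tmp
Type=tmpfs
Options=mode=1777,strictatime,nosuid,nodev,size=50%

[Install]
WantedBy=local-fs.target
```

The trade-off is that a tmpfs-backed /tmp consumes RAM instead of disk, which is its own resource-pressure concern on masters.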

@m1kola m1kola removed their assignment May 24, 2023