
Cluster not reachable #932

Open
mjudeikis opened this issue Aug 11, 2020 · 3 comments

mjudeikis (Contributor) commented Aug 11, 2020

I have a suspicion that if we run a pod on each node (in this example, in the form of a DaemonSet), mount the /tmp filesystem, and produce big files, it might crash the whole control plane.

This should not happen as individual pods should be limited in how much space they can consume from tmpfs.
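For context on that expectation, Kubernetes can cap a pod's local scratch usage via ephemeral-storage requests and limits. A hypothetical pod fragment sketching this (the name and values are illustrative, not from this issue; note that these limits only account for the container's writable layer and emptyDir volumes, not hostPath mounts, which matters later in this thread):

```yaml
# Illustrative only: the kubelet evicts the pod if its writable layer
# plus emptyDir usage exceeds the ephemeral-storage limit.
# NOTE: hostPath volume writes are NOT counted against this limit.
apiVersion: v1
kind: Pod
metadata:
  name: tmp-writer-limited   # hypothetical name
spec:
  containers:
  - name: writer
    image: registry.access.redhat.com/ubi8/ubi
    command: ["/bin/bash", "-c", "dd if=/dev/zero of=/scratch/big bs=4k count=1000000; sleep 3600"]
    resources:
      limits:
        ephemeral-storage: 1Gi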

If this code/DaemonSet (http://git.bytheb.org/cgit/stap.git/) is running on the cluster, the cluster is not usable after 1-2 days.

We need to do some debugging to understand why this is happening. Potentially reproduce it in a cluster with local-RP, persist=true, and a similar pod that produces a lot of data.

@m1kola m1kola self-assigned this Aug 11, 2020
m1kola (Contributor) commented Aug 21, 2020

In the example DaemonSet I see that the pods get deployed to the masters and mount a hostPath from the host machine.

I think the control plane becomes unreachable because the master nodes run out of disk space. Once there is no disk space left on a node, etcd stops running on that node. We run three etcd replicas, one on each master node, so when we run out of disk on two master nodes etcd loses quorum. When we lose etcd quorum we lose the API server too.

I prepared a reproducer for this:

  1. Create a cluster. Make sure that you can access master nodes via SSH.

  2. Apply the following manifest:

    master-tmp-writer.yaml:

```yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: master-tmp-writer
  namespace: default
  annotations:
    kubernetes.io/description: This DaemonSet launches containers that write a lot of data into a hostPath on master hosts.
spec:
  selector:
    matchLabels:
      app: master-tmp-writer
  template:
    metadata:
      labels:
        app: master-tmp-writer
    spec:
      priorityClassName: "system-cluster-critical"
      containers:
      - name: master-tmp-writer
        image: registry.access.redhat.com/ubi8/ubi
        command:
        - /bin/bash
        - -c
        - |
          i=0
          while :
          do
              dd if=/dev/zero of=/tmp/fake-file-$i bs=4k iflag=fullblock,count_bytes count=10G
              sleep 30
              ((i++))
          done
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /tmp
          name: host-tmp
        terminationMessagePolicy: FallbackToLogsOnError
      nodeSelector:
        node-role.kubernetes.io/master: ""
        beta.kubernetes.io/os: "linux"
        master-to-target: ""
      volumes:
      - name: host-tmp
        hostPath:
          path: /tmp
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
      - key: "node.kubernetes.io/network-unavailable"
        operator: "Exists"
```

    This manifest creates a DaemonSet which deploys a pod on each master node carrying the label master-to-target. The pod writes 10G files into /tmp on the master host, with 30 seconds between writes.
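    At a small scale, the dd invocation used above behaves like this (a local sketch assuming GNU coreutils; iflag=count_bytes makes count a byte total rather than a block count, which is why count=10G produces 10 GiB files):

```shell
# Write a 1 MiB file the same way the DaemonSet writes 10G files,
# then verify its size: count_bytes makes `count` a byte total.
dd if=/dev/zero of=/tmp/demo-fake-file bs=4k iflag=fullblock,count_bytes count=1M 2>/dev/null
stat -c %s /tmp/demo-fake-file   # prints 1048576
rm -f /tmp/demo-fake-file
```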

  3. Add the master-to-target label to master nodes:

    oc label node $MASTERNODENAME master-to-target=""
  4. SSH into the node and watch df -h to see the node run out of disk space.

  5. In a setup with 3 master nodes, the API server will stop responding once we run out of disk on 2 out of 3 nodes.
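The 2-out-of-3 threshold in the last step is just etcd's quorum rule (a strict majority of members must be healthy); a tiny sketch of the arithmetic:

```shell
# etcd needs floor(n/2)+1 healthy members to maintain quorum,
# so a 3-member cluster tolerates only 1 failed member.
members=3
quorum=$(( members / 2 + 1 ))       # 2
tolerated=$(( members - quorum ))   # 1
echo "quorum=$quorum tolerated_failures=$tolerated"
# prints: quorum=2 tolerated_failures=1
```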

ehashman (Contributor) commented:
@m1kola is that the right reproducer for this? Typically intensive customer workloads will never be scheduled on the masters. If they run out of disk space, etcd is going to have issues, because etcd needs to write to disk to reach consensus. That's not specific to /tmp; it would be true of any host-mounted directory.

From what I can tell, nodes do not have /tmp on a tmpfs; it appears to be mounted on the host disk:

```
ehashman@red-dot:~$ oc debug node/aro-v4-shared-gxqb4-worker-eastus1-xh9ml
sh-4.2# df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
overlay         128G   17G  112G  13% /
```
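A quick way to check the same thing on any Linux host (a local sketch assuming GNU coreutils): print the filesystem type backing /tmp. "tmpfs" would mean RAM-backed; "overlayfs", "xfs", or "ext2/ext3" means /tmp shares the host disk, as in the output above.

```shell
# stat -f reports the filesystem type of the mount backing /tmp.
stat -f -c %T /tmp
```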

@mjudeikis can you clarify what we are looking to reproduce?

m1kola (Contributor) commented Aug 24, 2020

@ehashman correct: it is not related to tmpfs, and this can happen with any host-mounted dir. I think there is not much we (or upstream) can do to protect against this. The only more or less reasonable option is to have /tmp on a filesystem separate from the root fs: that would make it harder to break the cluster for people who use /tmp as a staging/trash directory. But I'm not sure it is worth the effort.
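One way to realize that separation on systemd hosts would be the stock tmp.mount unit, which puts /tmp on a size-capped tmpfs so a runaway writer hits the cap instead of exhausting the root filesystem. A sketch of the standard unit (illustrative, not something ARO ships):

```ini
# Illustrative /etc/systemd/system/tmp.mount; enable with
# `systemctl enable --now tmp.mount`. size= caps /tmp at half of RAM.
[Unit]
Description=Temporary Directory (/tmp)
ConditionPathIsSymbolicLink=!/tmp

[Mount]
What=tmpfs
Where=/tmp
Type=tmpfs
Options=mode=1777,strictatime,nosuid,nodev,size=50%

[Install]
WantedBy=local-fs.target
```

The trade-off is that a tmpfs-backed /tmp consumes RAM instead of disk, which is its own resource-pressure concern on masters.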

@m1kola m1kola removed their assignment May 24, 2023