
agent is not starting with error /sys/fs/cgroup/freezer/kubepods/guaranteed: no such file or directory #26

Closed
soanni86 opened this issue Feb 4, 2025 · 12 comments

soanni86 commented Feb 4, 2025

Hello, I'm trying to run Perforator in Kubernetes using the Helm chart, but most of the agent pods are crashlooping with the following error:

tskv ts=2025-02-04T15:14:49.326760602Z level=error logger=profiler worker=pods cgroup tracker msg=Worker failed error=open /sys/fs/cgroup/freezer/kubepods/guaranteed: no such file or directory
tskv ts=2025-02-04T15:14:49.326874872Z level=error logger=profiler worker=process poller msg=Worker failed error=context canceled
The control plane version is 1.30.

Kernel Version: 6.6.56+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.24
Kubelet Version: v1.31.4-gke.1256000
Kube-Proxy Version: v1.31.4-gke.1256000

What is odd: I always have one agent running without any errors, on a node with the same characteristics as above.

Please advise.
Thanks

@soanni86 soanni86 changed the title agent is not starting with error agent is not starting with error /sys/fs/cgroup/freezer/kubepods/guaranteed: no such file or directory Feb 4, 2025
@MikailBag (Contributor)

Hello.
Please run the following commands on any node (outside of containers) where you see this error:

ls /sys/fs/cgroup
ls /sys/fs/cgroup/freezer
ls /sys/fs/cgroup/freezer/kubepods

What do they say?


soanni86 commented Feb 4, 2025

Here is the thing: the cluster has 10 nodes, all with the same OS (COS) and kernel.

k describe no | grep "Kernel Version"
  Kernel Version:             6.6.56+
  Kernel Version:             6.6.56+
  Kernel Version:              6.6.56+
  Kernel Version:             6.6.56+
  Kernel Version:             6.6.56+
  Kernel Version:             6.6.56+
  Kernel Version:             6.6.56+
  Kernel Version:             6.6.56+
  Kernel Version:             6.6.56+
  Kernel Version:              6.6.56+

Only one (random) Perforator agent works well and sends profiles, etc.; all others are crashlooping with the error above:

perforator-agent-bwgtz                          1/1     Running            0             4m29s
perforator-agent-ghzds                          0/1     CrashLoopBackOff   5 (67s ago)   4m33s
perforator-agent-kxrcq                          0/1     CrashLoopBackOff   5 (78s ago)   4m33s
perforator-agent-m4k2r                          0/1     CrashLoopBackOff   5 (75s ago)   4m33s
perforator-agent-p4wsj                          0/1     CrashLoopBackOff   5 (84s ago)   4m32s
perforator-agent-pdtlt                          0/1     CrashLoopBackOff   5 (66s ago)   4m33s
perforator-agent-rgjgz                          0/1     CrashLoopBackOff   5 (71s ago)   4m33s
perforator-agent-rhr5f                          0/1     CrashLoopBackOff   5 (89s ago)   4m33s
perforator-agent-vqtgk                          0/1     CrashLoopBackOff   5 (63s ago)   4m34s
perforator-agent-xjcmv                          0/1     CrashLoopBackOff   5 (79s ago)   4m33s

I have somewhat limited SSH access to these nodes; once I can get in, I will provide the cgroup info. Thanks.


soanni86 commented Feb 6, 2025

Hello @MikailBag,

# ls /sys/fs/cgroup
blkio  cpu  cpu,cpuacct  cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified

# ls /sys/fs/cgroup/freezer
cgroup.clone_children  cgroup.procs  cgroup.sane_behavior  kubepods  notify_on_release  release_agent  tasks

# ls /sys/fs/cgroup/freezer/kubepods
besteffort  cgroup.clone_children  freezer.parent_freezing  freezer.state      pod00c336ea-6a1a-4e56-bf97-8fcdb6e491dd  pod6074e345-504d-484a-9eea-ba8eec042601  tasks
burstable   cgroup.procs           freezer.self_freezing    notify_on_release  pod2f491c63-b829-4101-aceb-1d36927e72b5  podd2af3764-fec5-460d-bda0-23274b092b32


soanni86 commented Feb 6, 2025

The agent that is not crashlooping runs on a node where there is also no guaranteed directory:

gke-cluster-al-d-mon-nap-e2-standard--e86cd0b9-vowz ~ # ls /sys/fs/cgroup/freezer/kubepods/
besteffort  burstable  cgroup.clone_children  cgroup.procs  freezer.parent_freezing  freezer.self_freezing  freezer.state  notify_on_release  tasks

@MikailBag (Contributor)

Thank you, I think this clarifies things a lot.

It seems that your nodes have cgroupsPerQOS: false in their kubelet config, and the agent is not prepared for this.

> the agent that is not crashlooping runs on the node where there is also no guaranteed

This is strange :) Maybe for some reason it has a different configuration (i.e. it profiles the whole system instead of contacting the kubelet).

Anyway, for now we need to add support for cgroupsPerQOS being disabled; that should help.


soanni86 commented Feb 6, 2025

Interesting, I don't see it set to false:

cat /home/kubernetes/kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
  x509:
    clientCAFile: /etc/srv/kubernetes/pki/ca-certificates.crt
authorization:
  mode: Webhook
cgroupRoot: /
clusterDNS:
- 240.7.0.10
clusterDomain: cluster.local
enableDebuggingHandlers: true
evictionHard:
  memory.available: 100Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
  pid.available: 10%
featureGates:
  DisableKubeletCloudCredentialProviders: true
  ExecProbeTimeout: false
  RotateKubeletServerCertificate: true
kernelMemcgNotification: true
kind: KubeletConfiguration
kubeReserved:
  cpu: 90m
  ephemeral-storage: 41Gi
  memory: 3606Mi
maxParallelImagePulls: 3
readOnlyPort: 10255
serializeImagePulls: false
serverTLSBootstrap: true
staticPodPath: /etc/kubernetes/manifests

The kubelet command line:

/home/kubernetes/bin/kubelet \
--v=2 \
--cloud-provider=external \
--experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter \
--cert-dir=/var/lib/kubelet/pki/ \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--image-credential-provider-config=/etc/srv/kubernetes/cri_auth_config.yaml \
--image-credential-provider-bin-dir=/home/kubernetes/bin \
--max-pods=110 \
--node-labels=addon.gke.io/node-local-dns-ds-ready=true,cloud.google.com/gke-boot-disk=pd-ssd,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-cpu-scaling-level=8,cloud.google.com/gke-logging-variant=DEFAULT,cloud.google.com/gke-max-pods-per-node=110,cloud.google.com/gke-memory-gb-scaling-level=32,cloud.google.com/gke-netd-ready=true,cloud.google.com/gke-nodepool=nap-e2-standard-8-18ikblck,cloud.google.com/gke-os-distribution=cos,cloud.google.com/gke-provisioning=standard,cloud.google.com/gke-stack-type=IPV4,cloud.google.com/machine-family=e2,cloud.google.com/private-node=false,iam.gke.io/gke-metadata-server-enabled=true,node.kubernetes.io/masq-agent-ds-ready=true \
--volume-plugin-dir=/home/kubernetes/flexvolume \
--node-status-max-images=25 \
--container-runtime-endpoint=unix:///run/containerd/containerd.sock \
--runtime-cgroups=/system.slice/containerd.service \
--registry-qps=10 \
--registry-burst=20 \
--config /home/kubernetes/kubelet-config.yaml \
--pod-sysctls=net.core.optmem_max=20480,net.core.somaxconn=1024,net.ipv4.conf.all.accept_redirects=0,net.ipv4.conf.all.forwarding=1,net.ipv4.conf.all.route_localnet=1,net.ipv4.conf.default.forwarding=1,net.ipv4.ip_forward=1,net.ipv4.tcp_fin_timeout=60,net.ipv4.tcp_keepalive_intvl=60,net.ipv4.tcp_keepalive_probes=5,net.ipv4.tcp_keepalive_time=300,net.ipv4.tcp_rmem=4096 87380 6291456,net.ipv4.tcp_syn_retries=6,net.ipv4.tcp_tw_reuse=0,net.ipv4.tcp_wmem=4096 16384 4194304,net.ipv4.udp_rmem_min=4096,net.ipv4.udp_wmem_min=4096,net.ipv6.conf.all.disable_ipv6=1,net.ipv6.conf.default.accept_ra=0,net.ipv6.conf.default.disable_ipv6=1,net.netfilter.nf_conntrack_generic_timeout=600,net.netfilter.nf_conntrack_tcp_be_liberal=1,net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600,net.netfilter.nf_conntrack_tcp_timeout_established=86400 \
--pod-infra-container-image=gke.gcr.io/pause:3.8@sha256:880e63f94b145e46f1b1082bb71b85e21f16b99b180b9996407d61240ceb9830 \
--version=v1.31.4-gke.1256000


soanni86 commented Feb 6, 2025

k get --raw /api/v1/nodes/gke-cluster-al-d-mon-nap-e2-standard--e86cd0b9-vowz/proxy/configz | jq . | grep cgroup
    "cgroupRoot": "/",
    "cgroupsPerQOS": true,
    "cgroupDriver": "cgroupfs",

Full config:

{
  "kubeletconfig": {
    "enableServer": true,
    "staticPodPath": "/etc/kubernetes/manifests",
    "podLogsDir": "/var/log/pods",
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "readOnlyPort": 10255,
    "serverTLSBootstrap": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/etc/srv/kubernetes/pki/ca-certificates.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 10,
    "registryBurst": 20,
    "eventRecordQPS": 50,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "240.7.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageMaximumGCAge": "0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupRoot": "/",
    "cgroupsPerQOS": true,
    "cgroupDriver": "cgroupfs",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 110,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 25,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": false,
    "maxParallelImagePulls": 3,
    "evictionHard": {
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%",
      "pid.available": "10%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "featureGates": {
      "DisableKubeletCloudCredentialProviders": true,
      "ExecProbeTimeout": false,
      "RotateKubeletServerCertificate": true
    },
    "failSwapOn": true,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "containerLogMaxWorkers": 1,
    "containerLogMonitorInterval": "10s",
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "kubeReserved": {
      "cpu": "90m",
      "ephemeral-storage": "41Gi",
      "memory": "3606Mi"
    },
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/home/kubernetes/flexvolume",
    "kernelMemcgNotification": true,
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 2,
      "options": {
        "text": {
          "infoBufferSize": "0"
        },
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/containerd/containerd.sock",
    "failCgroupV1": false
  }
}

@MikailBag (Contributor)

I think 5fcb948 should resolve this issue once we make a new release.


soanni86 commented Feb 7, 2025

Thanks @MikailBag, I'm looking forward to it.

@MikailBag (Contributor)

Update: release v0.0.2 is now available.

@soanni86 (Author)

Thanks @MikailBag, the agent works now.

@MikailBag (Contributor)

Thank you for the report, it was very helpful!
