Periodic tests on long-running cluster are failing #532

Closed
mszostok opened this issue Oct 11, 2021 · 1 comment

@mszostok (Member)

Description

Periodic tests on the long-running cluster are failing, e.g.: https://github.com/capactio/capact/actions/runs/13269448
[Screenshot: failing periodic test runs, taken 2021-10-11 12:14]

@mszostok added the bug ("Something isn't working") label on Oct 11, 2021
@mszostok added this to the 0.6.0 milestone on Oct 11, 2021
@mszostok (Member, Author) commented Oct 11, 2021

  1. Kubed pod is OOMKilled (a memory-limit sketch follows this list):

    $ kubectl describe po kubed-d568d8d7f-7mlc2
    Name:         kubed-d568d8d7f-7mlc2
    Namespace:    capact-system
    Priority:     0
    Node:         gke-capact-stage-node-pool-stage-4f8e4c43-knhl/172.16.0.8
    Start Time:   Mon, 11 Oct 2021 01:10:49 +0200
    Labels:       app.kubernetes.io/instance=kubed
                  app.kubernetes.io/name=kubed
                  pod-template-hash=d568d8d7f
    Annotations:  checksum/apiregistration.yaml: 89aae8312d7859954e75d721c986bd04afb8ccd45cdb666eadabc69ae0feed56
    Status:       Running
    IP:           10.0.1.3
    IPs:
      IP:           10.0.1.3
    Controlled By:  ReplicaSet/kubed-d568d8d7f
    Containers:
      kubed:
        Container ID:  docker://dd91e63dac3071dbc1d4d329326642480417ec8073c81830ef84dafc5bc232de
        Image:         appscode/kubed:v0.12.0
        Image ID:      docker-pullable://appscode/kubed@sha256:d694910be47b07f941e44cf60514aa5284944b1271a456e51c45208fae296ee9
        Port:          8443/TCP
        Host Port:     0/TCP
        Args:
          run
          --v=3
          --secure-port=8443
          --audit-log-path=-
          --tls-cert-file=/var/serving-cert/tls.crt
          --tls-private-key-file=/var/serving-cert/tls.key
          --use-kubeapiserver-fqdn-for-aks=true
          --enable-analytics=false
          --cluster-name=stage
          --config-source-namespace=capact-system
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Mon, 11 Oct 2021 10:28:07 +0200
          Finished:     Mon, 11 Oct 2021 10:28:34 +0200
        Ready:          False
        Restart Count:  104
        Limits:
          cpu:     100m
          memory:  100Mi
        Requests:
          cpu:        20m
          memory:     50Mi
        Environment:  <none>
        Mounts:
          /srv/kubed from config (rw)
          /tmp from scratch (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kubed-token-zcknj (ro)
          /var/serving-cert from serving-cert (rw)
    Conditions:
      Type              Status
      Initialized       True
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      config:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  kubed
        Optional:    false
      scratch:
        Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
        SizeLimit:  <unset>
      serving-cert:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  kubed-apiserver-cert
        Optional:    false
      kubed-token-zcknj:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  kubed-token-zcknj
        Optional:    false
    QoS Class:       Burstable
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                     node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type     Reason   Age                 From     Message
      ----     ------   ----                ----     -------
      Warning  BackOff  4s (x2352 over 9h)  kubelet  Back-off restarting failed container
    
  2. A lot of pods are in the Shutdown state, but based on the docs this should be harmless: “We do still stop and remove all containers, clean up cgroups, unmount volumes, etc to ensure that we reclaim all resources that were in use by the pod.” https://github.com/kubernetes/kubernetes/issues/54525#issuecomment-340035375

  3. I removed the Shutdown pods via: `kubectl get pods | grep Shutdown | awk '{print $1}' | xargs kubectl delete pod`

  4. Kubed is responsible for synchronizing ConfigMaps and Secrets across namespaces.

  5. On the cluster we have 60 ConfigMaps and 317 Secrets. Helm creates a new Secret for each release revision. This may cause the OOMKill on kubed when it tries to list all Secrets and ConfigMaps to determine which should be synchronized (see the Secret-count sketch after this list).

  6. Kubed watches the capact-system Namespace, and the Helm release Secrets are also created in capact-system.

  7. Unfortunately, there is no way to restrict or filter the fetched data, see: https://github.com/kubeops/kubed/blob/5ffd89761d6419be4f462565d462891a1dec241a/pkg/syncer/syncer.go#L117-L143

  8. We also had 3 orphan TypeInstances; `capact ti get` showed (see the cleanup sketch after this list):

    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      5eac1c2a-ca57-4dd3-a965-765a36be65a8   cap.type.capactio.capact.validation.download:0.1.0      ——                                      ——                                           1   false
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      36bfa0c1-12e5-4f3d-a944-c0f7d3d8f620   cap.type.capactio.capact.validation.update:0.1.0        ——                                      ——                                           2   false
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      de374c26-d008-4be9-9415-374e329d22b7   cap.type.capactio.capact.validation.single-key:0.1.0    ——                                      ——                                           1   false
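For point 1, a minimal sketch for unblocking the cluster, assuming the 100Mi limit is simply too low for the current object count. The 256Mi value is an assumption, not a measured requirement, and the permanent fix should land in the kubed Helm chart values rather than a live edit:

```
# Bump kubed's memory limit in place so it can list all ConfigMaps/Secrets
# without being OOMKilled. 256Mi is an assumed value, not a measured one.
kubectl -n capact-system set resources deployment/kubed \
  --requests=memory=50Mi --limits=memory=256Mi
```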
    
    
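For points 5-7, a sketch assuming Helm 3's Secret storage backend, which labels each release-revision Secret with `owner=helm` and a `status`. Counting them confirms whether release history dominates the 317 Secrets, and pruning plus `--history-max` bounds future growth (`<release>` and `<chart>` are placeholders):

```
# Count Helm release Secrets in the watched namespace.
kubectl -n capact-system get secrets -l owner=helm --no-headers | wc -l

# Optionally drop superseded revisions (note: this removes Helm rollback history).
kubectl -n capact-system delete secrets -l owner=helm,status=superseded

# Cap retained release history on future upgrades so the Secret count stays bounded.
helm upgrade <release> <chart> -n capact-system --history-max 3
```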
    
    
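For point 8, a cleanup sketch assuming the CLI exposes `capact typeinstance delete` with `ti` as the alias used above; the loop keeps it working even if the command accepts only one ID at a time:

```
# Delete the three orphaned validation TypeInstances by ID.
for id in 5eac1c2a-ca57-4dd3-a965-765a36be65a8 \
          36bfa0c1-12e5-4f3d-a944-c0f7d3d8f620 \
          de374c26-d008-4be9-9415-374e329d22b7; do
  capact ti delete "$id"
done
```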
