This repository has been archived by the owner on Oct 20, 2022. It is now read-only.

the status.phase is always Pending after deleting a pod #96

Closed
mackwong opened this issue Aug 21, 2019 · 5 comments
Labels
bug Something isn't working

Comments

@mackwong
Contributor

I deployed a Cassandra cluster, and the CassandraCluster resource is:

$ kubectl get cassandracluster -n test my-cluster -o yaml

apiVersion: db.orange.com/v1alpha1
kind: CassandraCluster
metadata:
  finalizers:
  - kubernetes.io/pvc-to-delete
  name: my-cluster
  namespace: test
spec:
  baseImage: xxx.com/diamond/service-providers/cassandra
  dataCapacity: 20Gi
  dataStorageClass: ceph-rbd
  deletePVC: true
  gcStdout: true
  imageJolokiaSecret: {}
  imagePullSecret: {}
  imagepullpolicy: IfNotPresent
  maxPodUnavailable: 1
  nodesPerRacks: 1
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: "1"
      memory: 2Gi
  runAsUser: 1000
  topology:
    dc:
    - name: dc1
      nodesPerRacks: 2
      rack:
      - name: rack1
  version: 3.11.4-8u212-0.3.2-release-cqlsh
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: Initializing
        endTime: "2019-08-21T10:21:48Z"
        status: Done
      phase: Running
      podLastOperation: {}
  lastClusterAction: Initializing
  lastClusterActionStatus: Done
  phase: Running
  seedlist:
  - my-cluster-dc1-rack1-0.my-cluster.test
  - my-cluster-dc1-rack1-1.my-cluster.test

I delete a pod:

kubectl delete pod my-cluster-dc1-rack1-0 -n test

After a while, the CassandraCluster becomes the following and doesn't change anymore:

apiVersion: db.orange.com/v1alpha1
kind: CassandraCluster
metadata:
  finalizers:
  - kubernetes.io/pvc-to-delete
  generation: 6
  name: my-cluster
  namespace: test
spec:
  baseImage: xxx.com/diamond/service-providers/cassandra
  dataCapacity: 20Gi
  dataStorageClass: ceph-rbd
  deletePVC: true
  gcStdout: true
  imageJolokiaSecret: {}
  imagePullSecret: {}
  imagepullpolicy: IfNotPresent
  maxPodUnavailable: 1
  nodesPerRacks: 1
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: "1"
      memory: 2Gi
  runAsUser: 1000
  topology:
    dc:
    - name: dc1
      nodesPerRacks: 2
      rack:
      - name: rack1
  version: 3.11.4-8u212-0.3.2-release-cqlsh
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: Initializing
        endTime: "2019-08-21T10:21:48Z"
        status: Done
      phase: Running
      podLastOperation: {}
  lastClusterAction: Initializing
  lastClusterActionStatus: Done
  phase: Pending
  seedlist:
  - my-cluster-dc1-rack1-0.my-cluster.test
  - my-cluster-dc1-rack1-1.my-cluster.test

The status.phase is stuck at Pending.

@allamand

Hi @mackwong, thanks for reporting.

Do you have access to the operator logs and to the Kubernetes events from when this happened?
Was the deleted pod recreated?

thanks

@mackwong
Contributor Author

mackwong commented Aug 28, 2019

$ kubectl logs -f cassandra-operator-f97949c48-5cglh -n cassandra-demo

time="2019-08-28T06:05:36Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:05:41Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:05:41Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:05:46Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:05:46Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:05:51Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:05:51Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:05:51Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:05:56Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:05:56Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:05:56Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:01Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:01Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:01Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:06Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:06Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:06Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:11Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:11Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:11Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:16Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:16Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:16Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="Update Rack Status: Pending" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:17Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:17Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:17Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:22Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:22Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:22Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:22Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:22Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:27Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:27Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:27Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:27Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:27Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:32Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:32Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:32Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:32Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:32Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:37Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:37Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:37Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:37Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:37Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:42Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:42Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:42Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:42Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:42Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:47Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:47Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:47Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:47Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:47Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:52Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:52Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:52Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:52Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:52Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:57Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:06:57Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:57Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:06:57Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:06:57Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:02Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:07:02Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:02Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:02Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:02Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:07Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:07:07Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:07Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:07Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:07Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:12Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:07:12Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:12Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:12Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:12Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:17Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:07:17Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:17Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:17Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:17Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:22Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:07:22Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:22Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:22Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:22Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:27Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing) Replicas Number Not OK: 2 on 2, ready[1]"
time="2019-08-28T06:07:27Z" level=info msg="We don't check for new action before the cluster become stable again" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:27Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:27Z" level=info msg="Cluster has Disruption on Pods, we wait before applying any change to statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:27Z" level=info msg="Waiting Rack to be running before continuing, we break ReconcileRack after updated statefulset" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:32Z" level=info msg="[my-cluster][dc1-rack1]: StatefulSet(Initializing): Replicas Number OK: ready[2]"
time="2019-08-28T06:07:32Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:32Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:32Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:32Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:32Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:32Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:37Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:37Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
time="2019-08-28T06:07:42Z" level=debug msg="Display information about pods" cluster=my-cluster podLastOperation.OperatorName= podLastOperation.Pods="[]" rack=dc1-rack1
time="2019-08-28T06:07:42Z" level=debug msg="Statefulsets Are Equal: No Update" cluster=my-cluster dc-rack=dc1-rack1
$ kc get event -n test
LAST SEEN   TYPE      REASON                    OBJECT                                              MESSAGE
7m10s       Normal    ExternalProvisioning      persistentvolumeclaim/data-my-cluster-dc1-rack1-0   waiting for a volume to be created, either by external provisioner "csi-rbdplugin" or manually created by system administrator
6m22s       Normal    Provisioning              persistentvolumeclaim/data-my-cluster-dc1-rack1-0   External provisioner is provisioning volume for claim "test/data-my-cluster-dc1-rack1-0"
6m21s       Normal    ProvisioningSucceeded     persistentvolumeclaim/data-my-cluster-dc1-rack1-0   Successfully provisioned volume pvc-5889b689-c112-43b0-a534-13a448ee834f
5m56s       Normal    ExternalProvisioning      persistentvolumeclaim/data-my-cluster-dc1-rack1-1   waiting for a volume to be created, either by external provisioner "csi-rbdplugin" or manually created by system administrator
5m9s        Normal    Provisioning              persistentvolumeclaim/data-my-cluster-dc1-rack1-1   External provisioner is provisioning volume for claim "test/data-my-cluster-dc1-rack1-1"
5m8s        Normal    ProvisioningSucceeded     persistentvolumeclaim/data-my-cluster-dc1-rack1-1   Successfully provisioned volume pvc-4f6ccb23-15ab-46b3-b3d0-178024fef2a6
6m22s       Normal    ProvisionedSuccessfully   serviceinstance/demo                                The instance was provisioned successfully
7m10s       Warning   FailedScheduling          pod/my-cluster-dc1-rack1-0                          pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
7m8s        Normal    Scheduled                 pod/my-cluster-dc1-rack1-0                          Successfully assigned test/my-cluster-dc1-rack1-0 to bj-idc1-10-10-40-57
7m8s        Normal    SuccessfulAttachVolume    pod/my-cluster-dc1-rack1-0                          AttachVolume.Attach succeeded for volume "pvc-5889b689-c112-43b0-a534-13a448ee834f"
7m2s        Normal    Pulled                    pod/my-cluster-dc1-rack1-0                          Container image "registry.test.com/diamond/service-providers/cassandra:3.11.4-8u212-0.3.2-release-cqlsh" already present on machine
7m1s        Normal    Created                   pod/my-cluster-dc1-rack1-0                          Created container cassandra
7m1s        Normal    Started                   pod/my-cluster-dc1-rack1-0                          Started container cassandra
2m57s       Normal    Killing                   pod/my-cluster-dc1-rack1-0                          Stopping container cassandra
2m29s       Warning   Unhealthy                 pod/my-cluster-dc1-rack1-0                          Liveness probe failed:
2m29s       Warning   FailedPreStopHook         pod/my-cluster-dc1-rack1-0                          Exec lifecycle hook ([/bin/bash -c /etc/cassandra/pre_stop.sh]) for Container "cassandra" in Pod "my-cluster-dc1-rack1-0_test(34fbeb0c-902e-4b84-a8a6-b73126d4ba05)" failed - error: command '/bin/bash -c /etc/cassandra/pre_stop.sh' exited with 137: , message: "INFO  [main] 2019-08-28 06:06:13,282 YamlConfigurationLoader.java:89 - Configuration location: file:/etc/cassandra/cassandra.yaml\nINFO  [main] 2019-08-28 06:06:14,001 Config.java:496 - Node configuration:[allocate_tokens_for_keyspace=null; authenticator=PasswordAuthenticator; authorizer=CassandraAuthorizer; auto_bootstrap=true; auto_snapshot=true; back_pressure_enabled=false; back_pressure_strategy=null; batch_size_fail_threshold_in_kb=50; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; broadcast_address=10.244.0.64; broadcast_rpc_address=10.244.0.64; buffer_pool_use_heap_if_exhausted=true; cas_contention_timeout_in_ms=1000; cdc_enabled=false; cdc_free_space_check_interval_ms=250; cdc_raw_directory=null; cdc_total_space_in_mb=0; client_encryption_options=<REDACTED>; cluster_name=my-cluster; column_index_cache_size_in_kb=2; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_compression=null; commitlog_directory=/var/lib/cassandra/commitlog; commitlog_max_compression_buffers_in_pool=3; commitlog_periodic_queue_size=-1; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_batch_window_in_ms=NaN; commitlog_sync_period_in_ms=10000; commitlog_total_space_in_mb=null; compaction_large_partition_warning_threshold_mb=100; compaction_throughput_mb_per_sec=16; concurrent_compactors=null; concurrent_counter_writes=32; concurrent_materialized_view_writes=32; concurrent_reads=32; concurrent_replicates=null; concurrent_writes=32; counter_cache_keys_to_save=2147483647; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; credentials_cache_max_entries=1000; credentials_update_interval_in_ms=-1; credentials_validity_in_ms=2000; cross_node_timeout=false; data_file_directories=[Ljava.lang.String;@66d3eec0; disk_access_mode=auto; disk_failure_policy=stop; disk_optimization_estimate_percentile=0.95; disk_optimization_page_cross_chance=0.1; disk_optimization_strategy=ssd; dynamic_snitch=false; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; enable_materialized_views=true; enable_scripted_user_defined_functions=false; enable_user_defined_functions=false; enable_user_defined_functions_threads=true; encryption_options=null; endpoint_snitch=GossipingPropertyFileSnitch; file_cache_round_up=null; file_cache_size_in_mb=null; gc_log_threshold_in_ms=200; gc_warn_threshold_in_ms=1000; hinted_handoff_disabled_datacenters=[]; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; hints_compression=null; hints_directory=/var/lib/cassandra/hints; hints_flush_period_in_ms=10000; incremental_backups=false; index_interval=null; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; initial_token=null; inter_dc_stream_throughput_outbound_megabits_per_sec=200; inter_dc_tcp_nodelay=false; internode_authenticator=null; internode_compression=all; internode_recv_buff_size_in_bytes=0; internode_send_buff_size_in_bytes=0; key_cache_keys_to_save=2147483647; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=10.244.0.64; listen_interface=null; 
listen_interface_prefer_ipv6=false; listen_on_broadcast_address=false; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; max_hints_file_size_in_mb=128; max_mutation_size_in_kb=null; max_streaming_retries=3; max_value_size_in_mb=256; memtable_allocation_type=offheap_objects; memtable_cleanup_threshold=null; memtable_flush_writers=2; memtable_heap_space_in_mb=null; memtable_offheap_space_in_mb=null; min_free_space_per_drive_in_mb=50; native_transport_flush_in_batches_legacy=true; native_transport_max_concurrent_connections=-1; native_transport_max_concurrent_connections_per_ip=-1; native_transport_max_frame_size_in_mb=256; native_transport_max_threads=128; native_transport_port=9042; native_transport_port_ssl=null; num_tokens=256; otc_backlog_expiration_interval_ms=200; otc_coalescing_enough_coalesced_messages=8; otc_coalescing_strategy=DISABLED; otc_coalescing_window_us=200; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_cache_max_entries=1000; permissions_update_interval_in_ms=-1; permissions_validity_in_ms=2000; phi_convict_threshold=8.0; prepared_statements_cache_size_mb=null; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_scheduler_id=null; request_scheduler_options=null; request_timeout_in_ms=10000; role_manager=CassandraRoleManager; roles_cache_max_entries=1000; roles_update_interval_in_ms=-1; roles_validity_in_ms=2000; row_cache_class_name=org.apache.cassandra.cache.OHCProvider; row_cache_keys_to_save=2147483647; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=0.0.0.0; rpc_interface=null; rpc_interface_prefer_ipv6=false; rpc_keepalive=true; rpc_listen_backlog=50; rpc_max_threads=2147483647; rpc_min_threads=16; rpc_port=9160; rpc_recv_buff_size_in_bytes=null; rpc_send_buff_size_in_bytes=null; rpc_server_type=sync; saved_caches_directory=/var/lib/cassandra/saved_caches; seed_provider=org.apache.cassandra.locator.SimpleSeedProvider{seeds=my-cluster-dc1-rack1-0.my-cluster.test.svc.cluster.local}; server_encryption_options=<REDACTED>; slow_query_log_timeout_in_ms=500; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=false; storage_port=7000; stream_throughput_outbound_megabits_per_sec=200; streaming_keep_alive_period_in_secs=300; streaming_socket_timeout_in_ms=86400000; thrift_framed_transport_size_in_mb=15; thrift_max_message_length_in_mb=16; thrift_prepared_statements_cache_size_mb=null; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; tracetype_query_ttl=86400; tracetype_repair_ttl=604800; transparent_data_encryption_options=org.apache.cassandra.config.TransparentDataEncryptionOptions@1e04fa0a; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; unlogged_batch_across_partitions_warn_threshold=10; user_defined_function_fail_timeout=1500; user_defined_function_warn_timeout=500; user_function_timeout_policy=die; windows_timer_interval=1; write_request_timeout_in_ms=5000]\nINFO  [main] 2019-08-28 06:06:14,002 DatabaseDescriptor.java:373 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap\nINFO  [main] 2019-08-28 06:06:14,002 DatabaseDescriptor.java:431 - Global memtable on-heap threshold is enabled at 30MB\nINFO  [main] 2019-08-28 06:06:14,002 DatabaseDescriptor.java:435 - Global memtable off-heap threshold is enabled at 30MB\nWARN  [main] 2019-08-28 06:06:14,006 
DatabaseDescriptor.java:480 - Small commitlog volume detected at /var/lib/cassandra/commitlog; setting commitlog_total_space_in_mb to 5007.  You can override this in cassandra.yaml\nWARN  [main] 2019-08-28 06:06:14,008 DatabaseDescriptor.java:556 - Only 19.517GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots\nINFO  [main] 2019-08-28 06:06:14,115 RateBasedBackPressure.java:123 - Initialized back-pressure with high ratio: 0.9, factor: 5, flow: FAST, window size: 5000.\nINFO  [main] 2019-08-28 06:06:14,115 DatabaseDescriptor.java:735 - Back-pressure is disabled with strategy null.\nINFO  [main] 2019-08-28 06:06:14,189 GossipingPropertyFileSnitch.java:68 - Unable to load cassandra-topology.properties; compatibility mode disabled\n"
2m29s       Warning   Unhealthy                 pod/my-cluster-dc1-rack1-0                          Readiness probe failed:
2m27s       Normal    Scheduled                 pod/my-cluster-dc1-rack1-0                          Successfully assigned test/my-cluster-dc1-rack1-0 to bj-idc1-10-10-40-57
2m23s       Normal    Pulled                    pod/my-cluster-dc1-rack1-0                          Container image "registry.test.com/diamond/service-providers/cassandra:3.11.4-8u212-0.3.2-release-cqlsh" already present on machine
2m23s       Normal    Created                   pod/my-cluster-dc1-rack1-0                          Created container cassandra
2m23s       Normal    Started                   pod/my-cluster-dc1-rack1-0                          Started container cassandra
5m56s       Warning   FailedScheduling          pod/my-cluster-dc1-rack1-1                          pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
5m55s       Normal    Scheduled                 pod/my-cluster-dc1-rack1-1                          Successfully assigned test/my-cluster-dc1-rack1-1 to bj-idc1-10-10-40-73
5m55s       Normal    SuccessfulAttachVolume    pod/my-cluster-dc1-rack1-1                          AttachVolume.Attach succeeded for volume "pvc-4f6ccb23-15ab-46b3-b3d0-178024fef2a6"
5m4s        Normal    Pulled                    pod/my-cluster-dc1-rack1-1                          Container image "registry.test.com/diamond/service-providers/cassandra:3.11.4-8u212-0.3.2-release-cqlsh" already present on machine
5m4s        Normal    Created                   pod/my-cluster-dc1-rack1-1                          Created container cassandra
5m3s        Normal    Started                   pod/my-cluster-dc1-rack1-1                          Started container cassandra
7m10s       Normal    SuccessfulCreate          statefulset/my-cluster-dc1-rack1                    create Claim data-my-cluster-dc1-rack1-0 Pod my-cluster-dc1-rack1-0 in StatefulSet my-cluster-dc1-rack1 success
2m27s       Normal    SuccessfulCreate          statefulset/my-cluster-dc1-rack1                    create Pod my-cluster-dc1-rack1-0 in StatefulSet my-cluster-dc1-rack1 successful
5m56s       Normal    SuccessfulCreate          statefulset/my-cluster-dc1-rack1                    create Claim data-my-cluster-dc1-rack1-1 Pod my-cluster-dc1-rack1-1 in StatefulSet my-cluster-dc1-rack1 success
5m56s       Normal    SuccessfulCreate          statefulset/my-cluster-dc1-rack1                    create Pod my-cluster-dc1-rack1-1 in StatefulSet my-cluster-dc1-rack1 successful
7m10s       Normal    NoPods                    poddisruptionbudget/my-cluster                      No matching pods found
$ kubectl get po -n test
NAME                     READY   STATUS    RESTARTS   AGE
my-cluster-dc1-rack1-0   1/1     Running   0          80s
my-cluster-dc1-rack1-1   1/1     Running   0          4m49s
$ kc get cassandracluster -n test my-cluster -o yaml
apiVersion: db.orange.com/v1alpha1
kind: CassandraCluster
metadata:
  annotations:
    cassandraclusters.db.orange.com/last-applied-configuration: '{"kind":"CassandraCluster","apiVersion":"db.orange.com/v1alpha1","metadata":{"name":"my-cluster","namespace":"test","selfLink":"/apis/db.orange.com/v1alpha1/namespaces/test/cassandraclusters/my-cluster","uid":"907c913f-7bcf-49ee-835e-235ae68d4992","generation":6,"creationTimestamp":"2019-08-28T06:01:34Z","labels":{"instance_name":"666d211f-c959-11e9-b739-a2502451f700","plan_id":"2db0b31d-6912-4d24-8704-cfdf9b98af81","service_id":"137a3ded-59ab-4ece-bbda-9cfff850a1f3"},"finalizers":["kubernetes.io/pvc-to-delete"]},"spec":{"nodesPerRacks":1,"baseImage":"registry.test.com/diamond/service-providers/cassandra","version":"3.11.4-8u212-0.3.2-release-cqlsh","imagepullpolicy":"IfNotPresent","runAsUser":1000,"resources":{"requests":{"cpu":"1","memory":"2Gi"},"limits":{"cpu":"1","memory":"2Gi"}},"deletePVC":true,"gcStdout":true,"maxPodUnavailable":1,"dataCapacity":"20Gi","dataStorageClass":"ceph-rbd","imagePullSecret":{},"imageJolokiaSecret":{},"topology":{"dc":[{"name":"dc1","rack":[{"name":"rack1"}],"nodesPerRacks":2}]}},"status":{}}'
  creationTimestamp: "2019-08-28T06:01:34Z"
  finalizers:
  - kubernetes.io/pvc-to-delete
  generation: 6
  labels:
    instance_name: 666d211f-c959-11e9-b739-a2502451f700
    plan_id: 2db0b31d-6912-4d24-8704-cfdf9b98af81
    service_id: 137a3ded-59ab-4ece-bbda-9cfff850a1f3
  name: my-cluster
  namespace: test
  resourceVersion: "4751173"
  selfLink: /apis/db.orange.com/v1alpha1/namespaces/test/cassandraclusters/my-cluster
  uid: 907c913f-7bcf-49ee-835e-235ae68d4992
spec:
  baseImage: registry.test.com/diamond/service-providers/cassandra
  dataCapacity: 20Gi
  dataStorageClass: ceph-rbd
  deletePVC: true
  gcStdout: true
  imageJolokiaSecret: {}
  imagePullSecret: {}
  imagepullpolicy: IfNotPresent
  maxPodUnavailable: 1
  nodesPerRacks: 1
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: "1"
      memory: 2Gi
  runAsUser: 1000
  topology:
    dc:
    - name: dc1
      nodesPerRacks: 2
      rack:
      - name: rack1
  version: 3.11.4-8u212-0.3.2-release-cqlsh
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: Initializing
        endTime: "2019-08-28T06:04:01Z"
        status: Done
      phase: Running
      podLastOperation: {}
  lastClusterAction: Initializing
  lastClusterActionStatus: Done
  phase: Pending
  seedlist:
  - my-cluster-dc1-rack1-0.my-cluster.test
  - my-cluster-dc1-rack1-1.my-cluster.test

The deleted pod was recreated and is running, but the status.phase is still Pending.

@mackwong
Contributor Author

I think the reason is that lastClusterActionStatus stays Done while the pod is being recreated, so status.Phase cannot be updated because of this check:
https://github.com/Orange-OpenSource/cassandra-k8s-operator/blob/73fd24e4d80d853a640defa42b91345cb29847ba/pkg/controller/cassandracluster/reconcile.go#L527
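
To make the suspected mechanism concrete, here is a minimal, purely hypothetical Go sketch (simplified stand-in names, not the operator's actual types or the real code behind that link): if the reconcile loop bails out whenever the last cluster action is already Done, a phase that was set to Pending during the disruption is never recomputed once the pod is back.

package main

import "fmt"

// Simplified stand-ins for the operator's status types; the names are
// illustrative only, not the real API.
type ClusterStatus struct {
	LastClusterActionStatus string
	Phase                   string
}

// updatePhase mimics the suspected guard: when the last cluster action is
// already "Done" it returns before the phase is recomputed, so a Pending
// phase left over from the pod disruption is never reset.
func updatePhase(status *ClusterStatus, allReplicasReady bool) {
	if status.LastClusterActionStatus == "Done" {
		return // phase stays whatever it was, e.g. Pending
	}
	if allReplicasReady {
		status.Phase = "Running"
	}
}

func main() {
	status := &ClusterStatus{LastClusterActionStatus: "Done", Phase: "Pending"}
	updatePhase(status, true)  // the pod came back and is ready
	fmt.Println(status.Phase)  // prints "Pending", matching the reported behavior
}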

@allamand added the bug label on Oct 8, 2019
@allamand

Thanks @mackwong, you're right; I'm going to fix this.
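
One possible direction, sketched with the same hypothetical ClusterStatus type as in the snippet above and not the change that actually landed (see #128 below), would be to derive the phase from pod readiness even when the last action is already Done:

// Sketch only: recompute the phase from readiness first, so a stale Pending
// is cleared even when LastClusterActionStatus is already "Done".
func updatePhaseFixed(status *ClusterStatus, allReplicasReady bool) {
	if allReplicasReady {
		status.Phase = "Running"
	} else {
		status.Phase = "Pending"
	}
}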

@allamand

Closed in #128.
