You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
While updating a Bigbang deployment we noticed the monitoring release was failing to reconcile on repeated attempts with a timeout error.
The underlying cause of the issue was that prometheus-node-exporter pods were unable to deploy to all nodes due to insufficient resources. This was indicated in the failed helm release:
DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 0 out of 10 expected pods have been scheduled
DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 1 out of 10 expected pods have been scheduled
DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 2 out of 10 expected pods have been scheduled
warning: Upgrade "monitoring-monitoring" failed: timed out waiting for the condition
reason: UpgradeFailed
As well as the pod and events:
Warning FailedScheduling 39s (x6 over 2m18s) default-scheduler 0/10 nodes are available: 1 Insufficient cpu, 9 node(s) didn't match Pod's node affinity/selector
Temporary resolution
The cluster we were deploying into happened to have a pre-defined PolicyClass. We manually added this to the daemonset's pod spec so that the pods were deployed and the release was able to reconcile as expected.
Notional feature request
There are several daemonsets deployed by Bigbang which I would imagine could run into similar scenarios. Promtail, twistlock defenders, velero/restic all come to mind. Bigbang could define a PolicyClass (perhaps as part of /base) and add it to daemonsets and other critical components as appropriate to ensure that reconciliation of BB managed helm releases is able to complete without hanging on resource scheduling constraints.
The text was updated successfully, but these errors were encountered:
Summary
While updating a Bigbang deployment we noticed the monitoring release was failing to reconcile on repeated attempts with a timeout error.
The underlying cause of the issue was that
prometheus-node-exporter
pods were unable to deploy to all nodes due to insufficient resources. This was indicated in the failed helm release:As well as the pod and events:
Temporary resolution
The cluster we were deploying into happened to have a pre-defined PolicyClass. We manually added this to the daemonset's pod spec so that the pods were deployed and the release was able to reconcile as expected.
Notional feature request
There are several daemonsets deployed by Bigbang which I would imagine could run into similar scenarios. Promtail, twistlock defenders, velero/restic all come to mind. Bigbang could define a PolicyClass (perhaps as part of
/base
) and add it to daemonsets and other critical components as appropriate to ensure that reconciliation of BB managed helm releases is able to complete without hanging on resource scheduling constraints.The text was updated successfully, but these errors were encountered: