Define and apply a priority class to critical components #92

Open
p1-bot-repo-sync bot opened this issue Feb 3, 2025 · 0 comments

Summary

While updating a Bigbang deployment, we noticed that the monitoring release repeatedly failed to reconcile, each time with a timeout error.

The underlying cause was that the prometheus-node-exporter pods could not be scheduled on all nodes due to insufficient resources. The failed Helm release reported this:

      DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 0 out of 10 expected pods have been scheduled
      DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 1 out of 10 expected pods have been scheduled
      DaemonSet is not ready: monitoring/monitoring-monitoring-prometheus-node-exporter. 2 out of 10 expected pods have been scheduled
      warning: Upgrade "monitoring-monitoring" failed: timed out waiting for the condition
    reason: UpgradeFailed

The pod's events showed the same problem:

Warning  FailedScheduling   39s (x6 over 2m18s)  default-scheduler   0/10 nodes are available: 1 Insufficient cpu, 9 node(s) didn't match Pod's node affinity/selector

Temporary resolution

The cluster we were deploying into happened to have a pre-defined PriorityClass. We manually added it to the DaemonSet's pod spec, after which the pods were scheduled and the release reconciled as expected.
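
For reference, a minimal sketch of what that manual change might look like, assuming the pre-defined class is a Kubernetes PriorityClass. The DaemonSet name and namespace are taken from the Helm error above; the class name `existing-priority-class` is a placeholder, since the actual name used in our cluster is not recorded here. Only the relevant fields are shown.

```yaml
# Partial DaemonSet manifest: only the fields touched by the workaround.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: monitoring-monitoring-prometheus-node-exporter
  namespace: monitoring
spec:
  template:
    spec:
      # Placeholder name (assumption): substitute the PriorityClass
      # that already exists in the cluster.
      priorityClassName: existing-priority-class
```

Note that a manual edit like this would likely be overwritten the next time the Helm release is upgraded, which is why it was only a temporary resolution.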

Notional feature request

Bigbang deploys several DaemonSets that I would imagine could run into similar scenarios: Promtail, Twistlock defenders, and Velero/restic all come to mind. Bigbang could define a PriorityClass (perhaps as part of /base) and add it to DaemonSets and other critical components as appropriate, so that reconciliation of BB-managed Helm releases can complete without hanging on resource scheduling constraints.
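
As a rough illustration of the request, a PriorityClass that Bigbang could ship might look like the following. This is only a sketch: the name `bigbang-critical`, the priority value, and the description are all assumptions for discussion, not an existing Bigbang resource.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  # Hypothetical name, chosen for illustration only.
  name: bigbang-critical
# Assumed value. User-defined classes may use values up to 1000000000;
# higher values mean higher scheduling priority.
value: 1000000
# Do not make this the default for pods that set no priorityClassName.
globalDefault: false
# Kubernetes default, stated explicitly: pods in this class may preempt
# lower-priority pods when a node lacks resources.
preemptionPolicy: PreemptLowerPriority
description: "Hypothetical class for Bigbang DaemonSets and other critical components."
```

Each critical workload would then reference the class through `priorityClassName` in its pod spec. The actual value would need to be chosen relative to whatever priority classes a given cluster already defines.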
