Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(helm): Add health probes to long-lived containers #15359

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

lindhe
Copy link
Contributor

@lindhe lindhe commented Dec 11, 2024

What this PR does / why we need it:

This PR attempts to add configurable health probes to all long-lived containers where they are currently missing, to improve Kubernetes' lifecycle management of Loki's pods.

Which issue(s) this PR fixes:

Fixes #13967
Partially addresses #13678

Special notes for your reviewer:

  • This PR grew significantly from what I imagined initially. Tell me if this should be split into multiple PRs instead!
  • There are more components with missing health probes than I can test. Some I can test, because they are used in my current setup. Others I can only visually inspected that the rendered templates looks reasonable.
  • I think we should create a follow-up PR that handles the remaining parts of Add support for liveness and startup probes in loki helm chart #13678 and generally aims to optimize all health probes.
  • This PR does not add probes to init containers. While sometimes a bit tricker to write good probes for them, it's every bit as important as for long-lived containers. But I think that belongs in a separate PR. Same goes for containers in Jobs.
  • I am trying to keep the code style consistent within each file. But I do notice that there are many places where the same thing (e.g. inserting a readinessProbe) is done in slightly different ways. I think it might be a good thing to eventually streamline the code style, but this too belongs in another PR.
  • I've noticed that there are many unused values in values.yaml (e.g. .Values.bloomGateway.livenessProbe before this PR). It's probably worth auditing the values and attempt to catch any such cases (in another PR).

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

This comment has been minimized.

@lindhe lindhe changed the title feat(helm): Add missing livenessProbes and startupProbes feat(helm): Add missing health probes Dec 12, 2024
@lindhe lindhe force-pushed the lindhe/add-health-probes branch from d281b9d to 3c99bd4 Compare December 12, 2024 13:36
@pull-request-size pull-request-size bot added size/L and removed size/S labels Dec 12, 2024

This comment has been minimized.

@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Dec 12, 2024

This comment has been minimized.

@lindhe lindhe changed the title feat(helm): Add missing health probes feat(helm): Add health probes to long-lived containers Dec 12, 2024
@lindhe lindhe force-pushed the lindhe/add-health-probes branch from 1d4bc12 to bb82bc9 Compare December 12, 2024 13:57
@lindhe lindhe marked this pull request as ready for review December 12, 2024 13:57
@lindhe lindhe requested a review from a team as a code owner December 12, 2024 13:57

This comment has been minimized.

@lindhe
Copy link
Contributor Author

lindhe commented Dec 12, 2024

The failing test worked before I rebased. I'll see if I can understand why it's failing, but I'd guess it's not my fault.

@AndreZiviani
Copy link
Contributor

visually it looks ok (did not actually test it) but you are changing some behaviors that I'm not sure you should be doing, for example: readiness and liveness only run after startup (not in parallel) so now instead of loki-gateway (or any other you configured) taking 15 seconds to be available (readinessProbe initial delay) it will take 20 seconds (startupProbe initial delay + readinessProbe initial delay). Maybe leave them commented out on values.yaml in order to not change current behavior

I'm not a maintainer, just some feedback from our discussion on Slack

@lindhe
Copy link
Contributor Author

lindhe commented Dec 12, 2024

Thank you, that is great input. I agree, we should set the new probes to {} so the behavior is unchanged! But if the user chooses to, or if a future PR wants to make improvements on the default values, it's now possible to configure!

I'll fix that.

@lindhe lindhe force-pushed the lindhe/add-health-probes branch from bb82bc9 to 9aba722 Compare December 12, 2024 16:08

This comment has been minimized.

@lindhe
Copy link
Contributor Author

lindhe commented Dec 12, 2024

I have now updated the default values so that I introduce no new probes via the values set in values.yaml. I think there may be some changes where probes used to not exist but now they use the shared .loki.* probes, for example. But this variant should be a lot more careful than the last revision I pushed.

Looking forward to more input!

@lindhe
Copy link
Contributor Author

lindhe commented Dec 13, 2024

Apparently, empty probes for Deployments and DaemonSets are no good. I'll fix it.

@lindhe lindhe force-pushed the lindhe/add-health-probes branch from 9aba722 to a770361 Compare December 13, 2024 10:04

This comment has been minimized.

@lindhe
Copy link
Contributor Author

lindhe commented Dec 13, 2024

Pretty sure this error is not my fault:

image

@lindhe lindhe force-pushed the lindhe/add-health-probes branch from a770361 to 59b7323 Compare December 13, 2024 11:07

This comment has been minimized.

The helper template definitions really got me scratching my head:

- I had never seen a `with` statement with an `else` clause before and
  I cannot find any reference indicating that it is valid. Helm's
  documentation never mentions it.
- There's an extra `{{- end -}}` with not apparent start, yet it did not
  render an error in the templates (which is very surprising).

While it kind of seemed to work, I'm convinced it was more by chance
than by design.
The old  logic should not work, I suspect that the errors got masked by
the overly inclusive `with` statement.

This change rewrites the template definitions so it's easier to follow
and uses valid Helm template flow control.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possibe to configure `livenessProbe` and
`startupProbe` to the nginx container in the gateway pods.
Also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possible to configure `livenessProbe` and
`startupProbe` to the admin-api container in the admin-api pods.
Also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possible to configure `startupProbe` for the
loki-sc-rules sidecar container in the backend pods.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possible to configure `livenessProbe` and
`startupProbe` to the loki container in the backend pods.
Also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the bloom-builder
container in the bloom-builder pods.
Also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the bloom-gateway
container in the bloom-gateway pods.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the bloom-planner
container in the bloom-planner pods.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possible to configure `livenessProbe`,
`readinessProbe` and `startupProbe` to the exporter container in the
memcached pods.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possible to configure `livenessProbe`,
`readinessProbe` and `startupProbe` to the memcached container in the
memcached pods.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the compactor container
in the compactor pods.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the distributor container
in the distributor pods.
Adds guards for when the `readinessProbe` or `livenessProbe` are unset.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possible to configure `livenessProbe` and
`startupProbe` to the gateway container in the enterprise-gateway pods.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the index-gateway container in the
index-gateway pods.
It also adds guards for the `readinessProbe` and `livenessProbe`.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the ingester container in the
ingester pods.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes the previously hard-coded `readinessProbe` in
loki-canary configurable via Helm values.

Signed-off-by: Andreas Lindhé <[email protected]>
This change makes it possible to configure `livenessProbe` and
`startupProbe` to the loki-canary container in the loki-canary pods.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the
pattern-ingester container in the pattern-ingester pods.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the querier container in the querier
pods.
It also adds guards for the `readinessProbe` and `livenessProbe`.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the
query-frontend container in the query-frontend pods.
It also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the query-scheduler container in the
query-scheduler pods.
Also adds guards for when `livenessProbe` or `readinessProbe` are unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the
loki container in the read pods.
Also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the ruler
container in the ruler pods.
Also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the loki-sc-rules sidecar container
in the single-binary pods.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `startupProbe` to the table-manager container in the
table-manager pods.
Also adds guards for when `livenessProbe` or `readinessProbe` are unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Adds configurable `livenessProbe` and `startupProbe` to the loki
container in the write pods.
Also adds a guard for when the `readinessProbe` is unset.

Signed-off-by: Andreas Lindhé <[email protected]>
Done by running this command:

```
make -C docs sources/setup/install/helm/reference.md
```

Signed-off-by: Andreas Lindhé <[email protected]>
@lindhe lindhe force-pushed the lindhe/add-health-probes branch from 59b7323 to d6b22d2 Compare December 14, 2024 15:43
Copy link
Contributor

Kubernetes Manifest Diff Summary

Scenario: default-single-binary-values (Added: 0, Modified: 8, Removed: 0)

Summary:

  • Added: 0

  • Modified: 8

  • Removed: 0

Added Files

No added files

Modified Files

loki/templates/compactor/statefulset-compactor.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/compactor/statefulset-compactor.yaml	2024-12-14 15:44:10.919877504 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/compactor/statefulset-compactor.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 74,79 ****
--- 74,80 ----
 initialDelaySeconds: 30
 timeoutSeconds: 1
 
+ 
 volumeMounts:
 - name: temp
 mountPath: /tmp
loki/templates/distributor/deployment-distributor.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/distributor/deployment-distributor.yaml	2024-12-14 15:44:10.918877499 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/distributor/deployment-distributor.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 72,79 ****
 port: http-metrics
 initialDelaySeconds: 30
 timeoutSeconds: 1
- livenessProbe:
- null
 volumeMounts:
 - name: config
 mountPath: /etc/loki/config
--- 72,77 ----
loki/templates/querier/deployment-querier.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/querier/deployment-querier.yaml	2024-12-14 15:44:10.918877499 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/querier/deployment-querier.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 79,86 ****
 port: http-metrics
 initialDelaySeconds: 30
 timeoutSeconds: 1
- livenessProbe:
- null
 volumeMounts:
 - name: config
 mountPath: /etc/loki/config
--- 79,84 ----
loki/templates/index-gateway/statefulset-index-gateway.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/index-gateway/statefulset-index-gateway.yaml	2024-12-14 15:44:10.919877504 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/index-gateway/statefulset-index-gateway.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 69,76 ****
 port: http-metrics
 initialDelaySeconds: 30
 timeoutSeconds: 1
- livenessProbe:
- null
 volumeMounts:
 - name: config
 mountPath: /etc/loki/config
--- 69,74 ----
loki/templates/ingester/statefulset-ingester-zone-b.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/ingester/statefulset-ingester-zone-b.yaml	2024-12-14 15:44:10.919877504 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/ingester/statefulset-ingester-zone-b.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 91,96 ****
--- 91,97 ----
 initialDelaySeconds: 30
 timeoutSeconds: 1
 
+ 
 volumeMounts:
 - name: config
 mountPath: /etc/loki/config
loki/templates/ingester/statefulset-ingester-zone-c.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/ingester/statefulset-ingester-zone-c.yaml	2024-12-14 15:44:10.919877504 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/ingester/statefulset-ingester-zone-c.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 91,96 ****
--- 91,97 ----
 initialDelaySeconds: 30
 timeoutSeconds: 1
 
+ 
 volumeMounts:
 - name: config
 mountPath: /etc/loki/config
loki/templates/ingester/statefulset-ingester-zone-a.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/ingester/statefulset-ingester-zone-a.yaml	2024-12-14 15:44:10.919877504 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/ingester/statefulset-ingester-zone-a.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 91,96 ****
--- 91,97 ----
 initialDelaySeconds: 30
 timeoutSeconds: 1
 
+ 
 volumeMounts:
 - name: config
 mountPath: /etc/loki/config
loki/templates/query-scheduler/deployment-query-scheduler.yaml
*** /home/runner/work/loki/loki/output/base/default-single-binary-values/loki/templates/query-scheduler/deployment-query-scheduler.yaml	2024-12-14 15:44:10.919877504 +0000
--- /home/runner/work/loki/loki/output/pr/default-single-binary-values/loki/templates/query-scheduler/deployment-query-scheduler.yaml	2024-12-14 15:44:13.991894339 +0000
***************
*** 70,77 ****
 port: http-metrics
 initialDelaySeconds: 30
 timeoutSeconds: 1
- livenessProbe:
- null
 volumeMounts:
 - name: config
 mountPath: /etc/loki/config
--- 70,75 ----

Removed Files

No removed files

Scenario: default-values (Added: 0, Modified: 0, Removed: 0)

Summary:

  • Added: 0

  • Modified: 0

  • Removed: 0

Added Files

No added files

Modified Files

No modified files

Removed Files

No removed files

Scenario: ingress-values (Added: 0, Modified: 0, Removed: 0)

Summary:

  • Added: 0

  • Modified: 0

  • Removed: 0

Added Files

No added files

Modified Files

No modified files

Removed Files

No removed files

Scenario: legacy-monitoring-values (Added: 0, Modified: 0, Removed: 0)

Summary:

  • Added: 0

  • Modified: 0

  • Removed: 0

Added Files

No added files

Modified Files

No modified files

Removed Files

No removed files

Scenario: simple-scalable-aws-kube-irsa-values (Added: 0, Modified: 1, Removed: 0)

Summary:

  • Added: 0

  • Modified: 1

  • Removed: 0

Added Files

No added files

Modified Files

loki/templates/gateway/deployment-gateway-enterprise.yaml
*** /home/runner/work/loki/loki/output/base/simple-scalable-aws-kube-irsa-values/loki/templates/gateway/deployment-gateway-enterprise.yaml	2024-12-14 15:44:11.325879722 +0000
--- /home/runner/work/loki/loki/output/pr/simple-scalable-aws-kube-irsa-values/loki/templates/gateway/deployment-gateway-enterprise.yaml	2024-12-14 15:44:14.403896602 +0000
***************
*** 28,34 ****
 app.kubernetes.io/component: gateway
 annotations:
 checksum/config: 8cfa6bbdb9a4dbb7b195fc7162e2fddac7836db36a03aa494881c1fc01f7c3ac
! spec: 
 serviceAccountName: enterprise-logs
 securityContext:
 fsGroup: 10001
--- 28,34 ----
 app.kubernetes.io/component: gateway
 annotations:
 checksum/config: 8cfa6bbdb9a4dbb7b195fc7162e2fddac7836db36a03aa494881c1fc01f7c3ac
! spec:
 serviceAccountName: enterprise-logs
 securityContext:
 fsGroup: 10001

Removed Files

No removed files

@lindhe
Copy link
Contributor Author

lindhe commented Dec 14, 2024

I've rebased on main a couple of times now, still getting this error:

image

I cannot image that is caused by my changes. So I have to wait until someone fixes main, I guess. 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/helm size/XL type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Removal of Liveness Probe in Loki 3.x for Query Frontend
2 participants