
Feature: Add lifecycle hooks to pods from jobs automatically #8006

Closed
jack1902 opened this issue Mar 4, 2022 · 10 comments

@jack1902

jack1902 commented Mar 4, 2022

What problem are you trying to solve?

When using linkerd to inject everything inside a cluster, pods spawned from jobs get stuck in a NotReady state: the main container inside the pod has completed its task, but the proxy keeps running indefinitely.

Additionally, it is impossible to use defaultAllowPolicy: "cluster-authenticated" without injecting jobs, because un-injected jobs cannot communicate with the meshed workloads they need to reach.

Slack Threads:

How should the problem be solved?

When a pod belonging to a Job / CronJob is spawned, the pod should have a lifecycle hook automatically injected that runs curl -X POST http://localhost:4191/shutdown (or equivalent) to ensure the container running the work terminates the proxy when it finishes.

Additionally, it could be beneficial to have an annotation that could configure the lifecycleHook, for example:

annotations:
  config.linkerd.io/lifecycle-hook-enabled: "true"
  config.linkerd.io/lifecycle-hook-binary: "wget" # could also be curl or others
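
For illustration, the kind of preStop hook the injector might add to the workload container could look roughly like this (container and image names are placeholders; 4191 is the proxy's admin port, which serves the /shutdown endpoint mentioned above):

  spec:
    containers:
      - name: job-worker                    # placeholder workload container, not the proxy
        image: example/job-worker:latest    # placeholder image
        lifecycle:
          preStop:
            exec:
              command: ["curl", "-X", "POST", "http://localhost:4191/shutdown"]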

Any alternatives you've considered?

Configuring a bunch of cluster-wide policies so that jobs can work while the other 99% of traffic is authenticated and goes through the mesh. Ideally, onboarding fresh clusters should be quick and painless for as many users as possible.

Additionally, I've considered adding the hook myself to my objects, but some of them are spawned via third-party charts that don't provide a clean interface for adding the relevant hooks. I would have to resort to kustomize to add the lifecycle hook for each job in the cluster that needs to communicate with workloads on the mesh.
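
To give a sense of the toil involved, each such job would need a kustomize patch along these lines (file and job names are placeholders):

  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  resources:
    - jobs.yaml                     # placeholder manifest file
  patches:
    - target:
        kind: Job
        name: my-job                # placeholder; repeated for every job that talks to the mesh
      patch: |-
        - op: add
          path: /spec/template/spec/containers/0/lifecycle
          value:
            preStop:
              exec:
                command: ["curl", "-X", "POST", "http://localhost:4191/shutdown"]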

How would users interact with this feature?

They could configure it via annotations read by the injection webhook, which would vary the output slightly (curl vs. wget vs. other). Users would be able to enable or disable the hook injection as well as the injection of the proxy.

Would you like to work on this feature?

No response

@olix0r
Member

olix0r commented Mar 8, 2022

If I understand correctly, lifecycle hooks can't actually do this. From https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-execution:

PreStop

This hook is called immediately before a container is terminated due to an API request or management event such as a liveness/startup probe failure, preemption, resource contention and others.

That is, lifecycle hooks don't apply when a container exits gracefully. They only apply when the kubelet decides to terminate a container; and if the kubelet is deciding to terminate the Job, the proxy will shut down gracefully.

I think the only real approach to solving this problem would be to write a controller that deletes jobs when the linkerd proxy is the only running container.

@jack1902
Author

So I did look at this: https://itnext.io/three-ways-to-use-linkerd-with-kubernetes-jobs-c12ccc6d4c7c

I thought it was kind of neat, but the cleanest and easiest way would be to have a controller like you say, since configuring numerous jobs in numerous places just becomes tedious to manage. Having a controller constantly checking and auto-cleaning up would be useful.

@mladedav

Another option would be to make the linkerd proxy container (either the binary directly or another process therein) aware of the state of the pod by polling the Kubernetes API.

Instead of having a central controller poll the state of all pods, each pod would poll its own state and terminate its own proxy.

The advantage would be that no controller installation is needed. There could also be less resource usage when there are no jobs running. It should also perform better when there are significantly fewer job pods than other pods, which is probably the more common case.

I'm not sure, though, whether the default service account has permission for that or whether linkerd could inject it, but I believe it could.
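
For reference, the default service account cannot read pods out of the box; a pod watching its own state would need something like the following Role and RoleBinding in its namespace (names and namespace below are placeholders):

  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: read-own-pod              # placeholder name
    namespace: my-namespace         # placeholder namespace
  rules:
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "watch"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: read-own-pod
    namespace: my-namespace
  subjects:
    - kind: ServiceAccount
      name: default                 # or the job's own service account
      namespace: my-namespace
  roleRef:
    kind: Role
    name: read-own-pod
    apiGroup: rbac.authorization.k8s.io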

@adleong adleong added this to the stable-2.13.0 milestone Sep 22, 2022
@adleong adleong removed this from the stable-2.13.0 milestone Jan 19, 2023
@mateiidavid
Member

Update:

I came up with a proof of concept that I've been developing as an extension for Linkerd (and a personal side project). The POC works as-is, but it has two issues:

  1. It is undocumented
  2. Progress has been slow. The idea is a bit too complicated and it introduces some security implications that I am not 100% sure I want to tackle.

The basic idea for the POC is to have an admission controller that will modify the entrypoint of pods to call into linkerd-await:

  1. Based on a label we figure out if a pod should be mutated.
  2. A pod needs to contain linkerd-proxy and at least one container with the command field exposed in the template spec.
  3. The controller will add an emptyDir volume and an init container with linkerd-await, and it will copy the binary from the init container into the emptyDir volume. The volume will be attached to any container that exposes the command field.
  4. The controller will then modify the command entrypoint to call into the await binary. If this is done for more than one container, then whichever finishes first will shut down the proxy.

e.g

# if I'd write it in pseudo-bash-code:
before: ./my-process
after: ./linkerd-await --shutdown -- '$@'  where '$@' is "./my-process" followed by whatever args were originally passed in.
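
A rough sketch of what the mutated pod spec could end up looking like (image, volume, and container names are illustrative, not necessarily what the extension actually emits):

  spec:
    volumes:
      - name: await-bin
        emptyDir: {}                              # shared scratch volume for the binary
    initContainers:
      - name: linkerd-await-init
        image: example/linkerd-await:latest       # hypothetical image carrying the binary
        command: ["cp", "/linkerd-await", "/await/linkerd-await"]
        volumeMounts:
          - name: await-bin
            mountPath: /await
    containers:
      - name: my-app
        image: example/my-app:latest              # placeholder workload image
        command: ["/await/linkerd-await", "--shutdown", "--", "./my-process"]
        volumeMounts:
          - name: await-bin
            mountPath: /await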

While the idea works in practice, it exposes us to a few questions around security implications. It also makes testing a bit of a nightmare, and it still involves changing manifests to expose commands and args.
This is the "safest" way I found to do it though. If you have feedback on it, or a better way to do it, I'd be interested in resuming the work. But as it stands, I'm not sure how useful it would be in its current form.

I also considered the approach suggested above (in the previous comment). This is not something I think we'd be open to doing in the proxy (it shouldn't have a Kubernetes client, or even the notion of running in Kubernetes), but I did think of having an "init system" that will fork() the proxy process and shut it down whenever the other container has finished. The problems are:

  1. If we want to do this at the container level, we need to set shareProcessNamespace, which I'm reluctant to suggest (again, based on security implications and configuration nightmares).
  2. If we were to do this at the pod level, we'd need a Kubernetes client for each pod, which is not necessarily a bad thing, but Linkerd (or the extension) would need to handle RBAC, which is not something I want to get into.

A different approach would be to have a controller that mutates pods to run this init system and then watches the state of all pods. When a pod should be terminated, the controller signals the init system (network call), which will kill the proxy's process. I haven't looked much into this alternative; it's probably the simplest, since:

  • RBAC can be a bit more centralised (what I mean by it is that only one deployment will be able to watch the resources)
  • If the proxy's image contains the binary, we don't need to mutate a bunch of pods, and mutations will also be a bit more deterministic.

Will wait for some feedback but this is what I've come up with so far.

@howardjohn
Contributor

FYI, you may run into kubernetes/kubernetes#106896 if you are watching the API server. Although I recall there are some cases where it works, just not all; I don't think I tested jobs.

@salvalcantara

Hey, what is the current state of this issue?

@jack1902
Author

In light of sidecars (actually) being added to k8s (read: https://buoyant.io/blog/kubernetes-1-28-revenge-of-the-sidecars), I believe this can be addressed once the feature reaches stable. Hopefully, once that lands, the answer here is simply to ensure that any job has linkerd injected as a true sidecar, which will ensure the proxy shuts down accordingly when the main container finishes.
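
For context, the underlying Kubernetes pattern (independent of how Linkerd ends up wiring it in) is roughly a Job whose proxy runs as an init container with restartPolicy: Always, which the kubelet terminates once the regular containers complete; the names and images below are placeholders:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: my-job                        # placeholder
  spec:
    template:
      spec:
        restartPolicy: Never
        initContainers:
          - name: proxy-sidecar         # stand-in for the injected linkerd-proxy
            image: example/proxy:latest
            restartPolicy: Always       # makes this a native sidecar (k8s 1.28+, SidecarContainers feature)
        containers:
          - name: worker
            image: example/worker:latest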

@adleong adleong assigned alpeb and unassigned mateiidavid Oct 5, 2023
@alpeb
Member

alpeb commented Oct 5, 2023

Agreed with @jack1902's statement; sidecar containers in k8s are only in alpha for now, waiting in particular for proper termination ordering to be implemented (see kubernetes/kubernetes#120620). When that implementation matures we'll prioritize its integration with linkerd to solve this issue.

@kflynn
Member

kflynn commented Oct 7, 2023

We're using #11461 to track the work of implementing KEP-753.

@olix0r
Member

olix0r commented Sep 10, 2024

Linkerd now supports native sidecars.

@olix0r olix0r closed this as completed Sep 10, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2024