[Proposal] deferContainers - Pod Termination Semantics #483

Status: Closed. Wants to merge 3 commits.
309 changes: 309 additions & 0 deletions contributors/design-proposals/defer-containers.md
# Defer Containers

[email protected]

March 2017

## Prerequisite
Understanding of [initContainers](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md)

## Motivation
Since the introduction of StatefulSet and DaemonSet, Pods that consist of stateful containers are now fully supported by
Kubernetes. Such stateful Pods benefit from constructor and destructor semantics. InitContainers provide constructor
semantics; the proposed deferContainers will provide destructor semantics for such stateful Pods.

## Abstract
This proposal introduces the concept of ‘deferContainers’, inspired by Go’s ‘defer’ keyword. It allows one to define a
set of containers to be executed at the time of Pod termination. The defined containers will be executed sequentially,
in the specified order. It brings destructor capability to Pods.

## Use Cases
[5-minute use-case presentation from a recent SIG-Apps meeting](https://youtu.be/s1oZ00_JA00?t=2692)

### Cleanup:
**Contributor:**

How are we going to handle this situation:

* you want to delete a pod because it's broken or whatever
* it has a deferContainer that will hang because of "some reasons"

How am I supposed to not wait for the deferContainer? Are we going to add `kubectl delete .... --ignore-defer-containers`?

**Contributor Author:**

I was thinking we should probably force the delete; in that case we won't need to run deferContainers:

`kubectl delete pod foo --grace-period=0 --force`

But this new option is cool too, it separates out the functionality.

**Member:**

Doesn't backwards compatibility mean that if someone previously ran `delete pod foo --grace-period=0` to immediately terminate without waiting, that should skip deferContainers as well?

**Contributor Author:**

Then it might force `--grace-period` onto all the other deferContainers, and predicting the running time of a certain cleanup/termination activity might be difficult. Or maybe we could have `--grace-period=-1` for unlimited termination time for a Pod?

**Contributor:**

How are we going to distinguish "force shutdown" from "force shutdown + ignore defer"? Please don't get me wrong, but it seems to be a bit inconsistent.

**Contributor Author:**

Agreed. I now see the inconsistency; to keep it more predictable we should stick with

`kubectl delete pod foo --grace-period=0 --force`

as originally proposed. Even now this is the way to avoid running preStop hooks.

**Contributor:**

We'd have to come up with some other signal. --grace-period=1 means "delete fast" and --grace-period=0 means "give up strong consistency and risk split brain". You don't want to do the latter.

I'm not sure there is a backward compatible way to allow defer containers to run without being bounded in time by the grace-period. I'm also not sure we want defer containers to run unbounded, since that was a key part of the graceful deletion design.

**Contributor Author:**

@smarterclayton I have added some design thoughts on how we handle time-boundedness for deferContainers; this was something I wanted to bring up in the SIG-Apps meeting yesterday. PTAL.

Also, this particular discussion is about deleting the Pod without running deferContainers or preStop hooks, the equivalent of `kill -9 PID`. Could you please elaborate on why we should come up with a different signal for this?

Most stateful workloads require certain cleanup activity before or after the appContainers are terminated, such as:
* Sync/flush the disk or logs before the Pod is evicted off a node.
* Delete or update global configuration / release an application-level lock.
* Some legacy applications may not respond to the SIGTERM signal sent by the Docker daemon; they might require a special command/procedure
to initiate graceful shutdown.

### Shard rebalancing.
In a sharded workload configured as a StatefulSet, if the application is scaled down, then the exiting Pod should trigger
a re-balance of keys from the current shard to the rest of them. This needs to be done before the shard goes down.
‘deferContainers’ could handle such scenarios with ease.

### Statefulset Upgrading
If, in the future, StatefulSets support rolling updates, an update request might attempt to replace Pods one after the other.
With deferContainers, each Pod could be removed gracefully without having to implement complicated sidecar logic to prevent data loss.

### Master-Slave or Leader-Follower Statefulset down-size
Consider a StatefulSet consisting of a master-slave or leader-follower type application running N replicas, with ordinal
indices ranging from 0 to N-1. If you scale it down (reduce the number of replicas), the StatefulSet controller would attempt
to bring down the last replica (the N-1th Pod in this case). If that Pod is the elected leader or master, then this would
disrupt the service intermittently. With deferContainers, this handover or re-election could be programmed gracefully.

### SHUTDOWN Sequence
* Traditional relational databases typically support a reasonable shutdown sequence; for instance, Oracle has four shutdown
modes: NORMAL, IMMEDIATE, TRANSACTIONAL and ABORT. ‘deferContainers’ will allow us to program and wait for such
complex shutdown scenarios.
* In the future, when Kubernetes supports a virtual machine runtime (e.g. HyperContainer) for better isolation, we should shut down
the VMs instead of killing them abruptly. ‘deferContainers’ could help us run such shutdown commands.

## Limitations of the current system
The container preStop lifecycle hook is not sufficient for all termination cases:
* It is container-specific and not Pod-specific.
* It cannot easily coordinate complex termination conditions across containers in multi-container Pods.
* It can only run code in the image or code in a shared volume, which would have to be statically linked (not a common pattern in wide use).
**Member:**

When would I want to pull down new code to turn down a storage system? If the primary use cases are those expressed above, isn't all the code in my container already?

**Contributor Author:**

The objective is not always to pull a new image; the objective is to separate out/decouple the termination sequence. Most often we won't have to pull a new image at all. I can also imagine simple cleanups like `rm -rf /tmp/dir1` wrapped in bash or busybox images, which are lighter.

If your Pod is part of a CI engine and running thousands of test cases, it should be easy to configure a deferContainer (from a different image/source) to simply upload (curl POST) the currently processed test results to a different server when the Pod is getting evicted from that node.
If curl is not available in your primary container, you don't have to re-package and create a new image just to run curl at termination, which would be the case for preStop hooks.

* Does not work across a kubelet restart.
* The Pod waits for the entire grace period even if the preStop hook finishes earlier.
* Won't restart on failed termination steps.
* Cannot easily host complex termination scripts, as there is no logging support.

## Design Requirements
* deferContainers should be able to:
  * Use the same volume (PV) as the appContainers, such that they can:
    * Perform cleanup of the shared volume, such as deleting several temp directories.
    * Delete unwanted files before a final sync is initiated.
    * Update configuration files about changes in the distributed system so that the next Pod attached to this PV will benefit from them (like a new leader/master, etc.).
    * Delete secrets or security-related files before the Pod decouples from the PV.
  * Delay the termination of application containers until operations are complete.
  * De-register the Pod with other components of the system.
  * Program a termination sequence for cases where terminationGracePeriod is hard to predict beforehand.

* Reduce coupling:
  * Between application images, eliminating the need to customize those images for Kubernetes generally or for specific roles.
  * Inside of images, by specializing which containers perform which tasks (e.g. install git in an init container, use the filesystem
contents in the web container).
  * Between termination steps, by supporting multiple sequential cleanup containers.
* Pre-exit:
  * Should act as a pre-exit trigger, called when the application is about to be deleted.
* Restart on failure:
  * If a certain deferContainer fails during execution, it should be automatically restarted.
* GracePeriod behaviour:
  * It should be possible to specify an overall terminationGracePeriod for deferContainers; if the termination sequence completes before the overall grace period, the Pod should be deleted without waiting further.
* Reduce complexity:
  * It should be possible to use a generic container as a deferContainer.
  * A deferContainer should be independently invokable, i.e. it should not require code in the same image as the appContainers.
  * deferContainer images that are not already on the node will be pre-pulled while the application is being executed.

## Design
The proposed Pod spec would look like the following:
```yaml
pod:
  spec:
    initContainers: ...
    containers: ...
    deferContainers:
    - name: defer-container1
      image: ...
      ...
    - name: defer-container2
      ...
  status:
    initContainerStatuses: ...
    containerStatuses: ...
    deferContainerStatuses:
    - name: defer-container1
      ...
    - name: defer-container2
      ...
```
The API will look like the following:
```go
// PodSpec is a description of a pod.
type PodSpec struct {
	...

	InitContainers []Container `json:"initContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,20,rep,name=initContainers"`
	// List of containers belonging to the pod.
	// Containers cannot currently be added or removed.
	// There must be at least one container in a Pod.
	// Cannot be updated.
	// +patchMergeKey=name
	// +patchStrategy=merge
	Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
	// List of termination containers; these will be executed during the TerminationGracePeriod of the pod.
	// +patchMergeKey=name
	// +patchStrategy=merge
	DeferContainers []Container `json:"deferContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,26,rep,name=deferContainers"`

	...
}
```
* A Pod will have 0...N deferContainers, executed in sequence (the specified order).
* The restart policy for deferContainers is ‘OnFailure’.
* A new phase, ‘Terminating’, is added to the Pod lifecycle (a sketch of the constant is shown below).
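
A minimal sketch of the corresponding API change, shown alongside the existing `PodPhase` constants in the core API for context (the constant name `PodTerminating` is an assumption of this sketch, not part of the proposal text):

```go
// The first five constants already exist in the core API; this sketch only adds the last one.
const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
	PodUnknown   PodPhase = "Unknown"

	// PodTerminating is the proposed new phase: the pod is about to be removed
	// from the cluster and its deferContainers (if any) are being executed.
	PodTerminating PodPhase = "Terminating"
)
```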

### Terminating Phase / Defer Phase
* A Pod reaches the Terminating phase when it is about to be removed from the Kubernetes cluster.
* During this phase, the appContainers are not restarted if they get terminated/killed.
* deferContainers will be executed one after the other, in the same sequence as they were specified in the Pod spec. From the
Pod spec example above, the execution sequence will be defer-container1, then defer-container2, etc.
* If a particular deferContainer fails, it will be restarted until it succeeds.
* If the user specifies `kubectl delete pod foo --grace-period=0 --force` to delete a Pod, deferContainers will not be executed.
```
Example status output when a pod is being terminated:
NAME      READY     STATUS      RESTARTS   AGE
foo-0     2/2       Defer:0/4   0          7m
```
* Failure of one or all deferContainers will not trigger a Pod restart.
* If deferContainers are configured, preStop hooks will not be executed.

### TerminationGracePeriod
* It takes the default value (30 seconds as of today); explicitly specifying this field overrides the default.
* The deferContainers then start executing one after the other, in the specified order.
* If a particular deferContainer fails, it will be restarted until it succeeds or the grace period is exhausted.
* When the configured grace period expires, all the containers (appContainers), including the currently executing deferContainer, will be terminated, and no further deferContainers will be executed (if there are any).
* deferContainers are therefore time-bound by terminationGracePeriod; a sketch of this bounding follows this list.
* If all the deferContainers complete execution well ahead of terminationGracePeriod, the Pod should be deleted immediately without waiting out the remainder of the grace period.
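
A minimal sketch of how this time bounding could work in the kubelet, assuming a caller-supplied `runOne` helper that starts a single deferContainer and blocks until it exits (all names here are illustrative, not existing kubelet APIs):

```go
import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
)

// runDeferContainersBounded runs the deferContainers sequentially, restarting a
// failed container ("OnFailure" semantics), until they all succeed or the overall
// grace period expires.
func runDeferContainersBounded(parent context.Context, deferContainers []v1.Container,
	gracePeriod time.Duration, runOne func(context.Context, v1.Container) error) error {

	ctx, cancel := context.WithTimeout(parent, gracePeriod)
	defer cancel()

	for _, c := range deferContainers {
		for {
			if err := runOne(ctx, c); err == nil {
				break // this deferContainer succeeded; move on to the next one
			}
			select {
			case <-ctx.Done():
				// Grace period exhausted: this and any remaining deferContainers are abandoned.
				return fmt.Errorf("grace period expired while running defer container %q", c.Name)
			default:
				time.Sleep(time.Second) // brief back-off before retrying the same container
			}
		}
	}
	return nil // all deferContainers completed within the grace period
}
```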

### Pre-populating deferContainer Images
By default, all the deferContainer images will be pulled (if not already present on the node) when the Pod reaches the ‘Running’ phase.

## Implementation Plan
The development and release lifecycle of this feature will follow that of other Kubernetes experimental features. It will initially be
rolled out in alpha as annotations; based on its usefulness and community feedback, it will graduate to the PodSpec.

## Examples
### Cleanup
```yaml
pod:
  spec:
    terminationGracePeriod: 60
    initContainers: ...
    containers: ...
    deferContainers:
    # pre-exit operations
    - name: fsync
      image: my-utils
      # Something like this https://docs.mongodb.com/manual/reference/command/fsync/
      command: ["/bin/sh", "-c", "./syncDisk.sh"]
    - name: killApp
      image: my-utils
      command: ["/bin/sh", "-c", "./killAndWait.sh", "--name=${POD_NAME}"]
    # post-exit operations
    - name: rm-tmpdir
      image: my-utils
      command: ["/bin/sh", "-c", "./disk_cleanup.sh"]
    # contact and inform a third-party system about this pod's termination, such as decrementing a reference counter (if one is maintained)
    - name: ref-counter
      image: my-utils
      command: ["/bin/sh", "-c", "./decrementRefCount.sh"]
    ...
```

### Master-slave / leader-follower StatefulSet down-size (scale down the replicas)
The scripts below, 'selectAMaster.sh' and 'reconfSlaves.sh', should be designed in such a way that even if the terminating Pod is a slave, it
does not affect the cluster. This fits controllers such as StatefulSet because they guarantee that only one Pod goes down at a time.
```yaml
pod:
  spec:
    terminationGracePeriod: 60
    initContainers: ...
    containers: ...
    deferContainers:
    - name: electMaster
      image: my-utils
      # Select a master among the available slaves
      command: ["/bin/sh", "-c", "./selectAMaster.sh"]
    - name: reconfigureSlaves
      image: my-utils
      # Re-configure all the slaves to the new master
      command: ["/bin/sh", "-c", "./reconfSlaves.sh", "--MasterSlave=${MasterEP}"]
    - name: killApp
      image: my-utils
      command: ["/bin/sh", "-c", "./killAndWait.sh", "--name=${POD_NAME}"]
    ...
```
### Shutdown Sequence
If we attempt to shut down a running Oracle instance, which has four shutdown modes (https://docs.oracle.com/cd/B28359_01/backup.111/b28273/rcmsynta045.htm#RCMRF155):
```yaml
pod:
  spec:
    initContainers: ...
    containers: ...
    deferContainers:
    # Pre-exit
    - name: shutdown-db
      image: db-utils
      # Select the shutdown sequence
      command: ["/bin/sh", "-c", "./shutDown.sh", "--shutdowntype=${SH_TYPE}"]
    - name: waitforDB
      image: db-utils
      # Run a script that will wait until the DB is down.
      command: ["/bin/sh", "-c", "./waitForDB.sh"]
    ...
```

## Kubelet Changes
* The images are pre-pulled in SyncPod() when the Pod phase is ‘Running’, as a new step 7.
* killPodWithSyncResult() is blocking; deferContainer execution is implemented inside this function.
* killPod and killPodWithSyncResult need access to pullSecrets and podStatus, so these two have been propagated down.
* A new method, `WaitForContainer(containerID string) error`, is added to the ContainerManager interface so that we can start a container and block on it during termination (see the sketch below).
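
A minimal sketch of what that interface addition might look like; the surrounding methods are elided and everything except `WaitForContainer` is an assumption for illustration:

```go
type ContainerManager interface {
	// ... existing lifecycle methods elided ...

	// WaitForContainer blocks until the container identified by containerID exits
	// (or the surrounding grace-period deadline is reached) and returns a non-nil
	// error if the container terminated unsuccessfully.
	WaitForContainer(containerID string) error
}
```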

A simple pseudo-code implementation:
```go
func killContainersWithSyncResult() {
	// Run the deferContainers to completion before killing the app containers.
	runDeferContainers()

	for _, container := range runningPod.Containers {
		// If deferContainers are configured, skip the preStop hook.
		go killContainer(pod, container.ID, container.Name)
	}
	// Wait for all the containers to be killed.
}
```
And runDeferContainers will be implemented as
```go
func runDeferContainers() {
	for _, container := range pod.Spec.DeferContainers {
		// Start the deferContainer and block until it finishes,
		// or until time.After(GracePeriod) fires.
		m.startContainerAndWait(podSandboxID, podSandboxConfig, container)
	}
}
```
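
A minimal sketch of how `startContainerAndWait` could be built on top of the proposed `WaitForContainer` method; `deferRuntime` and `StartContainer` are stand-ins for the kubelet's real primitives, not existing APIs:

```go
import (
	"fmt"
	"time"
)

// deferRuntime abstracts the two primitives this sketch needs: starting a
// container in an existing pod sandbox and blocking until a container exits.
type deferRuntime interface {
	StartContainer(podSandboxID, containerName string) (containerID string, err error)
	WaitForContainer(containerID string) error
}

// startContainerAndWait starts one deferContainer and blocks until it exits or
// the grace-period deadline passes.
func startContainerAndWait(rt deferRuntime, podSandboxID, containerName string, gracePeriod time.Duration) error {
	containerID, err := rt.StartContainer(podSandboxID, containerName)
	if err != nil {
		return err
	}

	// Wait for the container to finish, but never longer than the grace period.
	done := make(chan error, 1)
	go func() { done <- rt.WaitForContainer(containerID) }()

	select {
	case err := <-done:
		return err // the container exited (possibly with a failure)
	case <-time.After(gracePeriod):
		return fmt.Errorf("defer container %q did not finish within the grace period", containerName)
	}
}
```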
SyncPod gets a new step 7 to pre-pull deferContainer images if they are not already available:
```go
func SyncPod() {
	// Step 1: Compute sandbox and container changes.

	// Step 2: Kill the pod if the sandbox has changed.

	// Step 3: Kill any running containers in this pod which are not to keep.

	// Step 4: Create a sandbox for the pod if necessary.

	// Step 5: Start init containers.

	// Step 6: Start containers in podContainerChanges.ContainersToStart.

	// Step 7: If the pod is in the Running phase, pre-pull the deferContainer images.
	if pod.Status.Phase == v1.PodRunning {
		prePullDeferContainerImages()
	}
}
```
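
A minimal sketch of the pre-pull step itself, assuming the `DeferContainers` field added to `PodSpec` above and a hypothetical `imagePuller` interface standing in for the kubelet's image-pulling machinery:

```go
import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// imagePuller is an assumed stand-in for the kubelet's image manager.
type imagePuller interface {
	EnsureImageExists(pod *v1.Pod, container *v1.Container, pullSecrets []v1.Secret) error
}

// prePullDeferContainerImages pre-pulls the deferContainer images once the Pod is
// Running, so that termination does not later have to wait on image pulls.
// Failures are collected rather than fatal; a missing image can still be pulled
// at termination time.
func prePullDeferContainerImages(pod *v1.Pod, pullSecrets []v1.Secret, puller imagePuller) []error {
	var errs []error
	for i := range pod.Spec.DeferContainers {
		container := &pod.Spec.DeferContainers[i]
		if err := puller.EnsureImageExists(pod, container, pullSecrets); err != nil {
			errs = append(errs, fmt.Errorf("pre-pull of image %q failed: %v", container.Image, err))
		}
	}
	return errs
}
```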

## Caveat
This design primarily focuses on handling graceful termination cases. If the node running a deferContainer-configured Pod
crashes abruptly, this design does not guarantee that cleanup was performed gracefully. This still requires community
feedback on how such scenarios are handled today and how important it is for deferContainers to handle that situation.
**Contributor:**

I think this is quite crucial. In the case of initContainers you have a guarantee: if the container is running, the init container was executed and it executed successfully. In the case of deferContainers we have no such guarantee. Not to mention fetching logs from deleted pods (no pod => no pod name in the case of controllers => no way to get deferContainer logs).

Probably all of that has already been mentioned somewhere else, but I wanted to raise it here as well.

**Contributor Author:**

Honestly, we could not think of an elegant way to solve this problem; we really need to get the community's feedback on how this is done today, but we believe that even without this, deferContainers would be useful in many cases.

**Contributor:**

I think adding this without any idea of how to solve that problem is a bad idea; it may create the illusion that some functionality is provided, while in production it may not work as "expected" (compared with initContainers).

**Contributor Author:**

Termination constructs are important, but providing an absolute guarantee like their initialization counterparts is difficult to achieve, especially when there are ungraceful terminations like a node crash.

Consider the analogous constructs in programming languages: C++ destructors, Go's defer, atexit in C. None of these is guaranteed to run if the process crashes (ungraceful termination). That does not mean they are not useful; their initialization counterparts simply offer a higher level of guarantee.

We think the current design is a good starting point for providing better support for graceful Pod termination. We might be able to improve it along the way as users try it and provide more feedback. The initial implementation will have alpha status, so we should caution the community against trying this in production without enough testing.

What do you think?

**@kfox1111 (Mar 2, 2018):**

What about renaming it to an eventContainer, and then passing in the event kind as an env var?
Something like POD_EVENT=stop, POD_EVENT=recover, etc. We only need to implement the first, stop, for now, but we can reuse the machinery later for things like informing a StatefulSet pod that it will be shutting down but will be replaced. So, in Ceph's case, the StatefulSet might want to set noout during an upgrade, since the pod will come right back, but not set noout during a delete, since it might not come back and the cluster should recover. Kind of like how RPM spec install hooks work during remove/upgrade.


## Reference
* [Community Request](https://github.com/kubernetes/kubernetes/issues/35183)
* [WIP PR](https://github.com/kubernetes/kubernetes/pull/47422)
* [Use Case Slides](https://docs.google.com/presentation/d/12WEEWQh8ffiLyqh8F60PgRvQn3mfdC2rx3E8biZm3oM/edit?usp=sharing)