checkpointer: should GC itself if installer no longer scheduled #253
Comments
Today I had an idea: use the checkpointer to checkpoint itself, so that we can get rid of the checkpointer installer. Here is how it looks in several scenarios:

Scenario A: checkpoint pod gets scheduled on node A as a daemonset
t0: Checkpointer (call it C1) starts running

Scenario B: node A reboots, but cannot reach the API server
t0: Kubelet starts

Scenario C: node A reaches the API server
t0: Daemonset version of checkpoint (C1) gets scheduled and started on node A

During [1], if the checkpoint's spec is changed on the API server, then C2 will checkpoint the new spec and get restarted by the kubelet. So the running and on-disk checkpoint specs are always the latest. This is actually not very different from the

The imperfect part is that users will see 2 checkpointer pods running on each master node, with one being active and one being stand-by.
/cc @aaronlevy @pbx0 @derekparker ^^
I think this is a really interesting idea! As for the 2 checkpointer pods, we already have this problem: there is a "checkpoint-installer" and a "pod-checkpointer" on every node. Another option to get rid of the 2 checkpointer pods might be to implement something like "exit on lock-contention" for the checkpointed copy. This is what we did for the self-hosted kubelet: if it saw anything attempting to acquire the lock, it would exit, allowing the copy sourced from the api-server (when available) to take over. This more or less covers #206, but we would still need to add (regardless of this change) some logic to the checkpointer so that, if it is to be GC'd, it cleans up all other checkpoints before removing itself.
I gave this a little more thought: this implies the static checkpointer will have a slightly different config than the daemonset checkpointer, because it needs to exit on contention. That means the checkpointer needs to treat itself as a special case when checkpointing itself (e.g. modify the command in the spec before writing to disk).
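To make the "modify the command in the spec before writing to disk" step concrete, here is a minimal sketch. It is not taken from the checkpointer source: `PodSpec` and `Container` are simplified stand-ins for the Kubernetes API types, and the flag name `--exit-on-contention`, the container name `checkpoint`, and the function `staticCopy` are all hypothetical.

```go
package main

import "fmt"

// Container and PodSpec are simplified stand-ins for the Kubernetes API
// types; only the fields needed for this sketch are modelled.
type Container struct {
	Name    string
	Command []string
}

type PodSpec struct {
	Containers []Container
}

// exitOnContentionFlag is a hypothetical flag name used for illustration.
const exitOnContentionFlag = "--exit-on-contention"

// contains reports whether s is present in ss.
func contains(ss []string, s string) bool {
	for _, v := range ss {
		if v == s {
			return true
		}
	}
	return false
}

// staticCopy returns a copy of spec in which the named container's command
// has exitOnContentionFlag appended (once). The input spec is not modified,
// so the daemonset version keeps its original command.
func staticCopy(spec PodSpec, containerName string) PodSpec {
	out := PodSpec{Containers: make([]Container, len(spec.Containers))}
	for i, c := range spec.Containers {
		cmd := make([]string, len(c.Command), len(c.Command)+1)
		copy(cmd, c.Command)
		if c.Name == containerName && !contains(cmd, exitOnContentionFlag) {
			cmd = append(cmd, exitOnContentionFlag)
		}
		out.Containers[i] = Container{Name: c.Name, Command: cmd}
	}
	return out
}

func main() {
	daemonset := PodSpec{Containers: []Container{
		{Name: "checkpoint", Command: []string{"/checkpoint"}},
	}}
	static := staticCopy(daemonset, "checkpoint")
	fmt.Println("daemonset command:", daemonset.Containers[0].Command)
	fmt.Println("static command:   ", static.Containers[0].Command)
}
```

Copying the command slice rather than appending in place keeps the two specs independent, which matters if the same in-memory spec is later compared against the API server's version.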
Yeah, and I don't actually think exiting on lock contention is a valid solution - the checkpointer is special in that it always needs to be running a static copy (so it will come up on reboot without an api-server). So ignore my previous suggestion.
Yeah it does (I think it needs to have "fixes" for GitHub to pick up). Closed in: #366
The checkpointer is deployed via a daemonset which "installs" a static manifest to the host.
If the checkpoint-installer is no longer scheduled to a node, the checkpointer should know how to GC itself (and all checkpoints).
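As a rough illustration of the self-GC behaviour, the sketch below removes every checkpoint manifest in a directory and deletes the checkpointer's own manifest last, following the ordering raised in the comments (clean up all other checkpoints before removing itself). This is an assumption-laden sketch, not the checkpointer's actual code: the function names `removalOrder` and `gcAll`, the file names, and the flat one-file-per-checkpoint layout are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// removalOrder returns the manifests to delete: all other checkpoints first
// (sorted for determinism), then the checkpointer's own manifest last, if it
// is present in the input.
func removalOrder(manifests []string, self string) []string {
	order := make([]string, 0, len(manifests))
	hasSelf := false
	for _, m := range manifests {
		if m == self {
			hasSelf = true
			continue
		}
		order = append(order, m)
	}
	sort.Strings(order)
	if hasSelf {
		order = append(order, self)
	}
	return order
}

// gcAll deletes every regular file in dir in removalOrder. It stops at the
// first error, so the checkpointer's own manifest survives a partial failure
// and the cleanup can be retried.
func gcAll(dir, self string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		if !e.IsDir() {
			names = append(names, e.Name())
		}
	}
	for _, name := range removalOrder(names, self) {
		if err := os.Remove(filepath.Join(dir, name)); err != nil {
			return fmt.Errorf("remove %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	// Use a throwaway directory in place of the real manifest path.
	dir, err := os.MkdirTemp("", "checkpoints")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	for _, name := range []string{"pod-checkpointer.json", "kube-apiserver.json", "etcd.json"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("{}"), 0o600); err != nil {
			panic(err)
		}
	}
	if err := gcAll(dir, "pod-checkpointer.json"); err != nil {
		panic(err)
	}
	left, _ := os.ReadDir(dir)
	fmt.Println("manifests remaining:", len(left))
}
```

Deleting its own manifest last means that if the checkpointer is interrupted mid-cleanup, the kubelet will still restart it from the static manifest and the remaining checkpoints can be collected on the next pass.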